This post lists the latest papers fetched from Arxiv.org on 2025-08-21, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Contents

Overview (2025-08-21)

A total of 449 papers were updated today, including:

  • Natural Language Processing: 55 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 127 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 97 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 124 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

[Quick Read]: This paper tackles the high resource cost of deploying diffusion language models (dLLMs) on edge devices, in particular the challenges posed by their massive parameter counts and by low-bit quantization. The key contribution is the first systematic study of post-training quantization (PTQ) for dLLMs. The authors identify activation outliers (abnormally large activation values that dominate the dynamic range) as a phenomenon that severely hinders precision for the bulk of activation values under low-bit quantization. By implementing state-of-the-art PTQ methods and evaluating them across multiple task types and model variants, the paper characterizes dLLM quantization behavior along four dimensions (bit-width, quantization method, task category, and model type), providing an empirical foundation and practical guidance for efficient dLLM deployment.

Link: https://arxiv.org/abs/2508.14896
Authors: Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Affiliations: NLPR & MAIS, Institute of Automation, CAS; Tsinghua University; City University of Hong Kong; Harvard University; The Chinese University of Hong Kong; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report, Work in Progress

Abstract:Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
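
To make the outlier problem concrete, here is a minimal sketch (not the paper's method; the clipping threshold and tensor sizes are illustrative assumptions) of how a single large activation inflates the quantization scale and degrades precision for all other values under symmetric low-bit PTQ:

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    # Symmetric per-tensor quantization: the scale is set by the max magnitude,
    # so a single outlier stretches the grid for every other value.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 4096)
acts[0] = 80.0  # one activation outlier dominating the dynamic range

mse_plain = np.mean((acts[1:] - quantize_dequantize(acts)[1:]) ** 2)
mse_clipped = np.mean((acts[1:] - quantize_dequantize(np.clip(acts, -6, 6))[1:]) ** 2)
print(f"INT4 MSE on non-outlier values: {mse_plain:.4f} plain vs {mse_clipped:.4f} clipped")
```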

[NLP-1] Virtual Community: An Open World for Humans Robots and Society

[Quick Read]: This paper addresses the complex social-intelligence questions raised by humans and robots coexisting in open-world environments: how robots cooperate or compete, how humans form social relations and build community, and how intelligent robots and humans can coexist. The key contribution is Virtual Community, an open-world platform built on a universal physics engine and grounded in real-world 3D scenes that supports simulating interacting human and robot agents. Its core components are an open-source multi-agent physics simulator and a large-scale, real-world-aligned community generation pipeline producing rich scenes and characters. On top of the platform, two new challenges, the Community Planning Challenge and the Community Robot Challenge, evaluate high-level task planning and low-level cooperative control, enabling systematic study of embodied social intelligence in shared human-robot environments.

Link: https://arxiv.org/abs/2508.14893
Authors: Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen, Wenjun Liu, Zunzhe Zhang, Sunli Chen, Lixing Fang, Qiushi Lyu, Xinyu Sun, Jincheng Yang, Zeyuan Wang, Bao Chi Dang, Zhehuan Chen, Daksha Ladia, Jiageng Liu, Chuang Gan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: website this https URL

Abstract:The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community-an open-world platform for humans, robots, and society-built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete; 2) How humans develop social relations and build community; 3) More importantly, how intelligent robots and humans can co-exist in an open world. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.

[NLP-2] MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

[Quick Read]: This paper addresses the limitations of general-purpose LLM-based deep research agents in the medical domain, namely insufficient dense medical knowledge for clinical reasoning and the lack of retrieval tools tailored to medical scenarios. The solution rests on two core innovations: (1) a data synthesis framework built on medical knowledge graphs, which extracts the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs and strengthen the model's grasp of medical reasoning chains; and (2) a custom-built private medical retrieval engine integrated alongside general-purpose tools to improve the accuracy of medical information gathering. Combined with a two-stage training recipe (supervised fine-tuning followed by online reinforcement learning), the resulting MedResearcher-R1-32B model sets new state-of-the-art results on several medical benchmarks while remaining competitive on general deep research tasks, showing that domain-specific architecture, tool design, and training data construction can let smaller open-source models outperform much larger proprietary systems.

Link: https://arxiv.org/abs/2508.14880
Authors: Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Jinjie Gu
Affiliations: Ant Group; Harbin Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures

Abstract:Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical domains. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool calls. Using a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.
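
The "longest chain from a subgraph around a rare entity" step can be sketched as below; the toy graph, entity names, and the use of networkx simple paths are illustrative assumptions, not the authors' implementation:

```python
import networkx as nx

# Toy knowledge graph around a rare disease; a real medical KG would be far larger.
kg = nx.Graph([
    ("Erdheim-Chester disease", "BRAF V600E mutation"),
    ("BRAF V600E mutation", "vemurafenib"),
    ("vemurafenib", "QT prolongation"),
])
seed = "Erdheim-Chester disease"

# Longest simple path from the rare seed entity = backbone of a multi-hop question.
paths = (path for target in kg.nodes if target != seed
         for path in nx.all_simple_paths(kg, seed, target))
print(max(paths, key=len))
```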

[NLP-3] Long Chain-of-Thought Reasoning Across Languages

[Quick Read]: This paper addresses the heavily English-centric nature of long chain-of-thought (CoT) reasoning in large language models (LLMs), i.e., weak cross-lingual reasoning, especially in non-English languages. The key findings: the authors build translated CoT datasets and fine-tune models per language, showing that the benefit of using English as a pivot language varies by language; large-scale multilingual pretraining narrows but does not eliminate the cross-lingual gap; and the quality-versus-scale trade-off is language dependent, with small, carefully curated datasets sufficing for English and French while Swahili and Latvian benefit from larger but noisier corpora. The study offers empirical evidence and reusable translated datasets for equitable, effective multilingual reasoning research.

Link: https://arxiv.org/abs/2508.14828
Authors: Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr
Affiliations: University of California, Berkeley
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to SCALR @ COLM 2025

Abstract:Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.

[NLP-4] Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs

[Quick Read]: This paper addresses the challenges that long, noisy, and redundant electronic health records (EHRs) pose for clinicians extracting information and making decisions. Although large language models (LLMs) show promise for unstructured medical text, their limited context windows make it hard to use complete EHRs. The paper therefore evaluates retrieval-augmented generation (RAG), which retrieves task-relevant passages from across the entire EHR instead of feeding in all of the text, reducing the required input tokens. The key contribution is a set of three clinical tasks that are easy to replicate across health systems (extracting imaging procedures, generating antibiotic-use timelines, and identifying key diagnoses); on records from real hospitalized patients, RAG matches or exceeds using only the most recent clinical notes and approaches full-context performance while using drastically fewer input tokens, showing that RAG remains an efficient and competitive approach for long clinical text.

Link: https://arxiv.org/abs/2508.14817
Authors: Skatje Myers, Dmitriy Dligach, Timothy A. Miller, Samantha Barr, Yanjun Gao, Matthew Churpek, Anoop Mayampurath, Majid Afshar
Affiliations: University of Wisconsin-Madison; Loyola University Chicago; Boston Children’s Hospital; Harvard Medical School; University of Colorado-Anschutz
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models’ extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models’ full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.
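
A minimal sketch of the retrieval step such a RAG pipeline relies on (the embedding model name and the toy notes are assumptions; the paper's actual retriever may differ):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Embed the query and every note chunk, then return the k most similar chunks.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([query] + chunks, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]  # cosine similarity via normalized dot products
    return [chunks[i] for i in np.argsort(-sims)[:k]]

notes = [
    "CT chest with contrast ordered for suspected PE.",
    "Started vancomycin 1g IV q12h for cellulitis.",
    "Patient ambulating in hallway without assistance.",
]
print(retrieve("Which imaging procedures were performed?", notes, k=1))
```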

[NLP-5] Privileged Self-Access Matters for Introspection in AI

[Quick Read]: This paper asks how introspection in AI models, especially large language models (LLMs), should be defined; the lack of a consensus definition has made related studies hard to compare or advance. The key contribution is a "thicker" definition: introspection in AI is any process that yields information about the model's internal states through a process more reliable than one of equal or lower computational cost available to a third party. The definition stresses differences in reliability and computational cost of access, separating genuinely introspective mechanisms from behavior that is merely superficially consistent (e.g., lightweight introspection). Experiments show that although LLMs can appear to introspect (for example, reasoning about their temperature parameter), such behavior does not meet the reliability bar of the proposed definition and so does not count as true introspection.

Link: https://arxiv.org/abs/2508.14802
Authors: Siyuan Song, Harvey Lederman, Jennifer Hu, Kyle Mahowald
Affiliations: The University of Texas at Austin; Johns Hopkins University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection is to be defined. Beginning from a recently proposed "lightweight" definition, we argue instead for a thicker one. According to our proposal, introspection in AI is any process which yields information about internal states through a process more reliable than one with equal or lower computational cost available to a third party. Using experiments where LLMs reason about their internal temperature parameters, we show they can appear to have lightweight introspection while failing to meaningfully introspect per our proposed definition.

[NLP-6] TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting

[Quick Read]: This paper targets multi-task modeling in urban transportation systems, including traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches have two limitations: small task-specific deep learning models are data-hungry and generalize poorly across scenarios, while large language models (LLMs), despite flexible natural-language interfaces, struggle with structured spatiotemporal data and numerical reasoning. The proposed TransLLM is a unified foundation framework whose core innovation is coupling spatiotemporal modeling with LLMs via learnable prompt composition: a lightweight spatiotemporal encoder captures complex dependencies with dilated temporal convolutions and dual-adjacency graph attention networks and interfaces with the LLM through structured embeddings, while an instance-level prompt routing mechanism trained with reinforcement learning dynamically personalizes prompts to input characteristics, moving beyond fixed task-specific templates to guide the LLM toward task-specific predictions.

Link: https://arxiv.org/abs/2508.14782
Authors: Jiaming Leng, Yunying Bi, Chuan Qin, Bing Yin, Yanyong Zhang, Chao Wang
Affiliations: University of Science and Technology of China; Computer Network Information Center, Chinese Academy of Sciences; iFLYTEK
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at this https URL.

[NLP-7] Evaluating Multilingual and Code-Switched Alignment in LLM s via Synthetic Natural Language Inference

[Quick Read]: This paper addresses the underexplored question of whether current large language models (LLMs) maintain logically consistent alignment across languages, i.e., whether reasoning stays semantically stable and coherent from one language to another. The key contribution is a controlled multilingual natural language inference (NLI) evaluation framework: synthetic, logic-based premise-hypothesis pairs are generated and translated into a typologically diverse set of languages, giving precise control over semantic relations, with testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade and can even improve performance, suggesting translation-induced lexical variation acts as a regularization signal. Translation fidelity is validated through embedding-based similarity analyses and cross-lingual alignment visualizations. The findings expose both the potential and the brittleness of LLM cross-lingual reasoning and identify code-switching as a promising lever for multilingual robustness.

Link: https://arxiv.org/abs/2508.14735
Authors: Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: this https URL
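
The embedding-based fidelity check described in the abstract might look like the following sketch; the multilingual model name and the example sentence pair are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
premise_en = "All birds can fly."
premise_de = "Alle Vögel können fliegen."  # translated premise

emb = model.encode([premise_en, premise_de], normalize_embeddings=True)
# A high cosine similarity suggests the translation preserved the semantics.
print(float(util.cos_sim(emb[0], emb[1])))
```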

[NLP-8] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation EMNLP2025

[Quick Read]: This paper addresses the limits of traditional text augmentation, which tends to produce semantically equivalent but low-diversity variants with little control over structure and style. Back-translation-style methods mostly rephrase at the lexical level, and although large language models (LLMs) bring "knowledge emergence", their output style and structure are hard to control and require elaborate prompt engineering. The key idea of the proposed LMTransplant paradigm is transplant-then-regenerate: embed the original seed text into a context expanded by an LLM, then ask the LLM to regenerate a variant based on that expanded context. This strategy fully exploits the knowledge embedded in LLMs while preserving the core attributes of the original text, yielding more diverse and creative content-level variants.

Link: https://arxiv.org/abs/2508.14723
Authors: Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu
Affiliations: Shanghai Jiao Tong University; Chongqing University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025

Abstract:Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their “knowledge emergence” capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.

[NLP-9] he Digital Sous Chef – A Comparative Study on Fine-Tuning Language Models for Recipe Generation

[Quick Read]: This paper addresses the loss of structural information and numerical precision caused by generic tokenization in text-based recipe generation. The core solution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers, preserving essential recipe structure and precise quantities and thereby improving domain specificity. This markedly improves the semantic relevance (a 20% relative BERTScore F1 gain) and fluency (a 69.8% perplexity reduction) of generated recipes, laying groundwork for future recipe generation research that integrates real-world constraints and multimodal inputs.

Link: https://arxiv.org/abs/2508.14718
Authors: Shubham Pundhir, Ganesh Bagler
Affiliations: Indraprastha Institute of Information Technology, Delhi, India
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 4 figures. Code is available at: this https URL

Abstract:We established a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a 20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
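
The targeted tokenization strategy can be sketched with the Hugging Face API; the exact token inventory below is a guess at what the paper's 23 fraction tokens and structural markers look like:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical inventory standing in for the paper's fraction tokens
# and custom structural markers.
fractions = ["1/2", "1/4", "3/4", "1/3", "2/3", "1/8"]
markers = ["<TITLE>", "<INGR>", "<INSTR>", "<END>"]
tokenizer.add_tokens(fractions)
tokenizer.add_tokens(markers, special_tokens=True)

# Grow the embedding matrix to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.tokenize("<INGR> 1/2 cup sugar"))
```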

[NLP-10] ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

[Quick Read]: This paper addresses two key obstacles to applying large language models (LLMs) to Traditional Chinese Medicine (TCM): the scarcity of high-quality TCM data, and the inherently multimodal nature of TCM diagnosis (looking, listening, smelling, inquiring, and pulse-taking), which exceeds what conventional LLMs can perceive. The solution is ShizhenGPT, the first multimodal LLM tailored for TCM. Its key contributions are curating the largest TCM dataset to date (100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals) and pretraining plus instruction tuning to achieve deep TCM knowledge and multimodal reasoning. Experiments show ShizhenGPT outperforms comparable-scale LLMs on TCM visual recognition and diagnosis, competes with larger proprietary models, and leads existing multimodal LLMs in unified perception across sound, pulse, smell, and vision, charting a path toward holistic multimodal perception and diagnosis in TCM.

Link: https://arxiv.org/abs/2508.14706
Authors: Junying Chen, Zhenyang Cai, Zhiheng Liu, Yunjin Yang, Rongsheng Wang, Qingying Xiao, Xiangyi Feng, Zhan Su, Jing Guo, Xiang Wan, Guangjun Yu, Haizhou Li, Benyou Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.

[NLP-11] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

[Quick Read]: This paper addresses the fact that existing benchmarks for LLMs interacting with external data sources and tools are overly simplistic and fail to reflect real-world challenges, notably long-horizon reasoning and large, unfamiliar tool spaces. The key contribution is MCP-Universe, the first comprehensive benchmark designed to evaluate LLMs through interaction with real-world MCP servers, covering 11 real MCP servers across 6 core domains, with execution-based evaluators (format evaluators, static matchers, and dynamic evaluators that retrieve real-time ground truth for temporally sensitive tasks) to keep evaluation rigorous and practical. The framework is also extensible and ships with UI support, fostering continued innovation in the MCP ecosystem.

Link: https://arxiv.org/abs/2508.14704
Authors: Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
Affiliations: Salesforce AI Research
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Website: this https URL

Abstract:The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.

[NLP-12] Improving in-context learning with a better scoring function

[Quick Read]: This paper examines limitations of in-context learning (ICL) in large language models (LLMs) on tasks involving first-order quantifiers such as "all" and "some", and on ICL with linear functions, identifying Softmax, the scoring function in the attention mechanism, as a contributing factor. The key innovation is a new alternative, scaled signed averaging (SSA), which aggregates attention scores in a way that better models these logical structures and linear relations. Experiments show SSA dramatically improves performance on the target tasks, and both encoder-only and decoder-only transformers with SSA match or exceed their Softmax-based counterparts on a variety of linguistic probing tasks.

Link: https://arxiv.org/abs/2508.14685
Authors: Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher
Affiliations: IRIT France; Université de Toulouse; Toulouse School of Economics; Université Toulouse Capitole; CNRS
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) exhibit a remarkable capacity to learn by analogy, known as in-context learning (ICL). However, recent studies have revealed limitations in this ability. In this paper, we examine these limitations on tasks involving first-order quantifiers such as "all" and "some", as well as on ICL with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these constraints. To address this, we propose scaled signed averaging (SSA), a novel alternative to Softmax. Empirical results show that SSA dramatically improves performance on our target tasks. Furthermore, we evaluate both encoder-only and decoder-only transformer models with SSA, demonstrating that they match or exceed their Softmax-based counterparts across a variety of linguistic probing tasks.
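
The abstract does not spell out SSA's exact form; a minimal reading consistent with the name (signed scores kept, length-scaled averaging instead of Softmax normalization) might look like this sketch, which is an assumption rather than the paper's formulation:

```python
import torch

def ssa_attention(q, k, v):
    # Keep the sign of each score and normalize by sequence length,
    # rather than exponentiating and normalizing with Softmax.
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
    weights = scores / scores.shape[-1]  # signed, length-scaled averaging
    return weights @ v

q = k = v = torch.randn(2, 5, 16)
print(ssa_attention(q, k, v).shape)  # torch.Size([2, 5, 16])
```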

[NLP-13] Continuous sentiment scores for literary and multilingual contexts

[Quick Read]: This paper addresses the distinctive challenges sentiment analysis faces on literary text, including figurative language, stylistic ambiguity, and sentiment-evocation strategies. Traditional dictionary-based tools underperform, especially for low-resource languages, and transformer models, while promising, typically output coarse categorical labels that preclude fine-grained analysis. The key contribution is a continuous sentiment scoring method based on concept vector projection, trained on multilingual literary data, which better captures nuanced sentiment across genres, languages, and historical periods. It outperforms existing tools on English and Danish texts, and its score distribution closely matches human ratings, enabling more accurate analysis and sentiment arc modeling in literature.

Link: https://arxiv.org/abs/2508.14620
Authors: Laurits Lyngbaek, Pascale Feldkamp, Yuri Bizzoni, Kristoffer Nielbo, Kenneth Enevoldsen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 16 pages after compiling, 3025 words, 6 figures, 5 tables and an algorithm

Abstract:Sentiment Analysis is widely used to quantify sentiment in text, but its application to literary texts poses unique challenges due to figurative language, stylistic ambiguity, as well as sentiment evocation strategies. Traditional dictionary-based tools often underperform, especially for low-resource languages, and transformer models, while promising, typically output coarse categorical labels that limit fine-grained analysis. We introduce a novel continuous sentiment scoring method based on concept vector projection, trained on multilingual literary data, which more effectively captures nuanced sentiment expressions across genres, languages, and historical periods. Our approach outperforms existing tools on English and Danish texts, producing sentiment scores whose distribution closely matches human ratings, enabling more accurate analysis and sentiment arc modeling in literature.
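
Concept-vector projection for continuous scores can be sketched as a signed projection onto an axis between centroids of positive and negative seed texts; the centroids below are random stand-ins, not the paper's trained vectors:

```python
import numpy as np

def sentiment_score(text_emb, pos_centroid, neg_centroid):
    # Signed scalar projection onto the positive-minus-negative concept axis;
    # the result is continuous rather than a categorical label.
    axis = pos_centroid - neg_centroid
    return float(text_emb @ axis / np.linalg.norm(axis))

rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, 384)   # stand-in centroid of positive passages
neg = rng.normal(-0.5, 1.0, 384)  # stand-in centroid of negative passages
print(sentiment_score(rng.normal(0.3, 1.0, 384), pos, neg))
```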

[NLP-14] Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

[Quick Read]: This paper addresses the severe lack of NLP resources for Southern Uzbek (uzs), particularly for machine translation (MT). The authors build a new multi-source parallel corpus (39,994 sentence pairs from dictionary, literary, and web sources), introduce a 997-sentence FLORES+ dev set for evaluation, and fine-tune NLLB-200 to obtain a dedicated translation model named lutfiy. Two elements are key: assembling high-quality parallel data from diverse sources, and a post-processing method that restores missing half-space characters in Arabic script, recovering morphological boundaries and helping the model handle Southern Uzbek's complex morphology. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.

Link: https://arxiv.org/abs/2508.14586
Authors: Mukhammadsaid Mamasaidov, Azizullah Aral, Abror Shopulatov, Mironshoh Inomjonov
Affiliations: Tilmoch; Academy of Sciences of Afghanistan; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
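
The "half space" here is the zero-width non-joiner (U+200C) used in Arabic-script orthographies; a toy post-processing rule might re-attach detached suffixes with a ZWNJ, though the suffix list below is hypothetical and not from the paper:

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the "half space" of Arabic script

def restore_half_space(text: str, suffixes=("لار", "لر")) -> str:
    # Hypothetical rule: rejoin plural-like suffixes to the preceding word with
    # a ZWNJ instead of a full space, restoring the morpheme boundary.
    pattern = re.compile(" (" + "|".join(suffixes) + r")(?=\s|$)")
    return pattern.sub(ZWNJ + r"\1", text)

print(restore_half_space("کتاب لار"))  # the space becomes a ZWNJ
```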

[NLP-15] Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning

[Quick Read]: This paper aims to make neural sign language production (SLP) robust to the high intra-class variability of signs caused by signer morphology and stylistic variety in the training data. The solution adds two components to the Progressive Transformers baseline: first, poses are encoded as bone rotations in quaternion space and trained with a geodesic loss, improving the accuracy and clarity of angular joint movements; second, a contrastive loss structures decoder embeddings by semantic similarity (using gloss overlap or SBERT sentence similarity), filtering out anatomical and stylistic features that carry no semantic information. On Phoenix14T, the contrastive loss alone improves Probability of Correct Keypoint by 16% over the baseline, and combined with quaternion-based pose encoding the model reduces Mean Bone Angle Error by 6%.

Link: https://arxiv.org/abs/2508.14574
Authors: Guilhem Fauré (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Sam Bigeard (MULTISPEECH), Slim Ouni (LORIA, MULTISPEECH)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
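
The geodesic distance between unit quaternions has a standard closed form, 2 * arccos(|<q1, q2>|); a sketch under the assumption that the paper uses this standard formulation:

```python
import torch

def geodesic_loss(q_pred, q_true, eps=1e-7):
    # Angle between unit quaternions: 2 * arccos(|<q1, q2>|).
    # The absolute value handles the q and -q double-cover ambiguity.
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
    q_true = q_true / q_true.norm(dim=-1, keepdim=True)
    dot = (q_pred * q_true).sum(-1).abs().clamp(max=1.0 - eps)
    return (2.0 * torch.acos(dot)).mean()

print(geodesic_loss(torch.randn(8, 4), torch.randn(8, 4)))
```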

[NLP-16] Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs

[Quick Read]: This paper investigates why current LLM-based agents fall short at active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows). The core of the approach is a structured example-processing pipeline: from transformed solution graphs produced by the Fast Downward planner, it derives three categories of structured examples (optimal goal paths, G-type; informative node paths, E-type; and contrastive step-by-step decision sequences, L-type), which are then converted into "thought-action" examples by prompting an LLM to articulate the reasoning behind each decision. Experiments show that while such structured examples somewhat reduce clarification requests and action steps, they alone are insufficient for robust perspective taking: agents succeed at basic attentional filtering but struggle to reason about occluded spaces or weigh the costs of epistemic actions, underscoring the need for explicit belief tracking, cost modeling, and richer environments to achieve socially grounded collaboration among LLM agents.

Link: https://arxiv.org/abs/2508.14564
Authors: Luca Annese, Sabrina Patania, Silvia Serino, Tom Foulsham, Silvia Rossi, Azzurra Ruggeri, Dimitri Ognibene
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at ICSR25

Abstract:Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective-taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into "thought-action" examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that required mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.

[NLP-17] EmoTale: An Enacted Speech-emotion Dataset in Danish

[Quick Read]: This paper addresses the lack of usable emotional speech datasets for smaller languages such as Danish; existing corpora concentrate on widely spoken languages, and Danish Emotional Speech (DES, 1997) has been the only other Danish option. The authors present EmoTale, a corpus of Danish and English recordings with enacted emotion annotations, and validate it by building speech emotion recognition (SER) models using self-supervised speech model (SSLM) embeddings and openSMILE features. The key finding is that SSLM embeddings outperform the hand-crafted features: the best model reaches 64.1% unweighted average recall (UAR) on EmoTale with leave-one-speaker-out cross-validation, comparable to performance on DES, confirming the validity and transferability of the approach.

Link: https://arxiv.org/abs/2508.14548
Authors: Maja J. Hjuler, Harald V. Skat-Rørdam, Line H. Clemmensen, Sneha Das
Affiliations: University Grenoble Alpes, CNRS, Grenoble INP, LIG; Technical University of Denmark
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: To appear in the proceedings of ASRU 2025

Abstract:While multiple emotional speech corpora exist for commonly spoken languages, there is a lack of functional datasets for smaller (spoken) languages, such as Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is the only other database of Danish emotional speech. We present EmoTale; a corpus comprising Danish and English speech recordings with their associated enacted emotion annotations. We demonstrate the validity of the dataset by investigating and presenting its predictive power using speech emotion recognition (SER) models. We develop SER models for EmoTale and the reference datasets using self-supervised speech model (SSLM) embeddings and the openSMILE feature extractor. We find the embeddings superior to the hand-crafted features. The best model achieves an unweighted average recall (UAR) of 64.1% on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable to the performance on DES.
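
UAR and leave-one-speaker-out cross-validation are standard protocol; a sketch with random stand-ins for the SSLM embeddings, emotion labels, and speaker IDs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import LeaveOneGroupOut

# Random stand-ins: X would be SSLM embeddings, y emotion labels, groups speakers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 16)), rng.integers(0, 5, 60)
speakers = np.repeat(np.arange(6), 10)

preds = np.empty_like(y)
for train, test in LeaveOneGroupOut().split(X, y, speakers):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    preds[test] = clf.predict(X[test])

# UAR is macro-averaged recall over the emotion classes.
print(recall_score(y, preds, average="macro"))
```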

[NLP-18] Reasoning is about giving reasons

[Quick Read]: This paper addresses the difficulty current transformer-based models have in identifying and articulating the logical structure of natural language arguments. Although existing approaches can chain simple rules, they lack interpretability and cannot accommodate richer forms of reasoning such as abduction or contradiction detection. The key idea is an intermediate representation, the Representation of the Logical Structure (RLS) of an argument, which explicitly captures the argument's "logical atoms" and the rules combining them; given the logical structure, reasoning becomes deterministic and easy to compute. The authors extract logical structures from three popular reasoning datasets with high accuracy, making reasoning interpretable and extending what the model can support, including arbitrary depths of reasoning, on-the-fly mistake rectification, and interactive discussion of an argument.

Link: https://arxiv.org/abs/2508.14488
Authors: Krunal Shah, Dan Roth
Affiliations: Yutori Inc.; University of Pennsylvania
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Convincing someone of the truth value of a premise requires understanding and articulating the core logical structure of the argument which proves or disproves the premise. Understanding the logical structure of an argument refers to understanding the underlying “reasons” which make up the proof or disproof of the premise - as a function of the “logical atoms” in the argument. While it has been shown that transformers can “chain” rules to derive simple arguments, the challenge of articulating the “reasons” remains. Not only do current approaches to chaining rules suffer in terms of their interpretability, they are also quite constrained in their ability to accommodate extensions to theoretically equivalent reasoning tasks - a model trained to chain rules cannot support abduction or identify contradictions. In this work we suggest addressing these shortcomings by identifying an intermediate representation (which we call the Representation of the Logical Structure (RLS) of the argument) that possesses an understanding of the logical structure of a natural language argument - the logical atoms in the argument and the rules incorporating them. Given the logical structure, reasoning is deterministic and easy to compute. Therefore, our approach supports all forms of reasoning that depend on the logical structure of the natural language argument, including arbitrary depths of reasoning, on-the-fly mistake rectification and interactive discussion with respect to an argument. We show that we can identify and extract the logical structure of natural language arguments in three popular reasoning datasets with high accuracies, thus supporting explanation generation and extending the reasoning capabilities significantly.

[NLP-19] In2x at WMT25 Translation Task

[Quick Read]: This paper focuses on how to effectively extend large language models (LLMs) to low-resource or less commonly spoken languages, concentrating on Japanese-related translation tasks. The key contribution is a generalizable open-system paradigm covering data construction methods and reward model design, with the goal of achieving exceptional translation performance in the target languages.

Link: https://arxiv.org/abs/2508.14472
Authors: Lei Pang, Hanyi Mao, Quanjia Xiao, HaiXiao Liu, Xiangyi Li
Affiliations: Duxiaoman; University of Chicago; Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents the open-system submission by the In2x research team for the WMT25 General Machine Translation Shared Task. Our submission focuses on Japanese-related translation tasks, aiming to explore a generalizable paradigm for extending large language models (LLMs) to other languages. This paradigm encompasses aspects such as data construction methods and reward model design. The ultimate goal is to enable large language model systems to achieve exceptional performance in low-resource or less commonly spoken languages.

[NLP-20] DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

[Quick Read]: This paper tackles reinforcement learning's reliance on costly labels and its restriction to verifiable tasks, as well as traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). The key idea of DuPO is to decompose a primal task's input into known and unknown components and construct a dual task that reconstructs the unknown part from the primal output and the known information, yielding an annotation-free self-supervised reward. This extends dual learning to non-invertible tasks and, leveraging an LLM's ability to instantiate both tasks in a single model, delivers substantial gains across machine translation, mathematical reasoning, and inference-time reranking.

Link: https://arxiv.org/abs/2508.14460
Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
Affiliations: ByteDance; Nanjing University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 18 pages, 4 figures

Abstract:We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)’s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning’s restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task’s input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs’ ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
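
The known/unknown decomposition can be illustrated with a deliberately simplified toy, echoing the abstract's "reversing math solutions to recover hidden variables"; everything below is a stand-in, not the paper's reward:

```python
def dual_reward(known_x: int, hidden_y: int, primal_answer: int) -> float:
    # Toy instantiation of the duality: the primal task answers s = x + y with
    # y hidden; the dual task recovers y from the answer and the known x, and
    # agreement with the true hidden y is the label-free reward.
    reconstructed_y = primal_answer - known_x
    return float(reconstructed_y == hidden_y)

print(dual_reward(known_x=3, hidden_y=4, primal_answer=7))  # 1.0, consistent
print(dual_reward(known_x=3, hidden_y=4, primal_answer=8))  # 0.0, inconsistent
```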

[NLP-21] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

[Quick Read]: This paper targets the high latency and low throughput of large language models on reasoning workloads while preserving or improving accuracy, with the particular challenge of handling long contexts (up to 128k tokens) on constrained hardware such as a single NVIDIA A10G GPU. The key solution is a hybrid architecture: Nemotron-Nano-9B-v2 builds on the Nemotron-H design, replacing most of the Transformer's self-attention layers with Mamba-2 layers to speed up generation of the long thinking traces needed for reasoning. A 12-billion-parameter base model pretrained on 20 trillion tokens is compressed and distilled with the Minitron strategy, so the resulting 9-billion-parameter model still supports very long sequences in bfloat16 precision while achieving accuracy on par with or better than comparable models (e.g., Qwen3-8B) on reasoning benchmarks, with up to 6x higher inference throughput.

Link: https://arxiv.org/abs/2508.14444
Authors: NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee
Affiliations: NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

[NLP-22] Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models

[Quick Read]: This paper addresses two core problems large language models face on tasks requiring structured knowledge: missing reasoning chains and insufficient entity-level semantic understanding. It proposes a fine-tuning framework based on knowledge graph injection: structured graph information is introduced as auxiliary learning on top of a pretrained language model, a graph neural network (GNN) encodes entities and relations into graph-based semantic representations, and a fusion mechanism jointly models knowledge graph embeddings with the language model's contextual representations. A gating mechanism dynamically balances the contributions of linguistic semantics and structural knowledge, mitigating conflicts between the two representation spaces and markedly improving semantic consistency and structured reasoning in entity recognition, question answering, and language generation.

Link: https://arxiv.org/abs/2508.14427
Authors: Wuyang Zhang, Yexin Tian, Xiandong Meng, Mengjie Wang, Junliang Du
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper addresses the problems of missing reasoning chains and insufficient entity-level semantic understanding in large language models when dealing with tasks that require structured knowledge. It proposes a fine-tuning algorithm framework based on knowledge graph injection. The method builds on pretrained language models and introduces structured graph information for auxiliary learning. A graph neural network is used to encode entities and their relations, constructing a graph-based semantic representation. A fusion mechanism is then designed to jointly model the knowledge graph embeddings with the contextual representations from the language model. To enhance the robustness of knowledge integration, a gating mechanism is introduced to dynamically balance the contributions of linguistic semantics and structural knowledge. This effectively mitigates conflicts between different representational spaces. During training, a joint loss function is constructed to account for both task performance and structural alignment objectives. This helps improve the accuracy of entity prediction and semantic reasoning. The study also includes a series of systematic sensitivity experiments. It evaluates the effects of learning rate, graph coverage, and structural perturbations on model performance. The results further validate the effectiveness and stability of the proposed method across tasks such as entity recognition, question answering, and language generation. Experimental findings show that the proposed structure-aware fine-tuning framework significantly enhances the model’s ability to represent complex semantic units. It demonstrates better semantic consistency and contextual logic modeling in scenarios involving structural reasoning and entity extraction.
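
The gating described here (balancing the LM's contextual state against the KG embedding) is commonly implemented as a sigmoid gate over the concatenation; a sketch, with the dimensions as assumptions:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # A sigmoid gate balances the LM's contextual state h against the KG
    # entity embedding e, so structural knowledge is mixed in adaptively.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([h, e], dim=-1)))
        return g * h + (1 - g) * e

fusion = GatedFusion(64)
print(fusion(torch.randn(2, 64), torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```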

[NLP-23] Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLM s

[Quick Read]: This paper addresses large language models' (LLMs) inability to recognize their own generated text under the Individual Presentation Paradigm (IPP), where the model judges a single text in isolation, even though they do well under the Pair Presentation Paradigm (PPP). The authors attribute this failure to what they call Implicit Territorial Awareness (ITA): the model can already separate self- and other-generated text in representational space, but this latent ability is not expressed in its output behavior. They propose Cognitive Surgery (CoSur), a framework whose four modules (representation extraction, territory construction, authorship discrimination, and cognitive editing) awaken and surface the model's ITA, substantially improving self-recognition accuracy under IPP; experiments on three different LLMs report average accuracies of 83.25%, 66.19%, and 88.01%.

Link: https://arxiv.org/abs/2508.14408
Authors: Yinghan Zhou, Weifeng Zhu, Juan Wen, Wanli Peng, Zhengxian Wu, Yiming Xue
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have been shown to possess a degree of self-recognition capability-the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA)-the model’s latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.

[NLP-24] DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement

[Quick Read]: This paper targets hallucinated predictions in LLM-based relation extraction: models struggle to reliably determine whether a relation actually exists between an entity pair, especially under complex syntax or intricate semantics, introducing noisy edges into knowledge graphs and compromising structured knowledge and downstream reliability. The key contribution is DEPTH, which integrates dependency-aware sentence simplification and two-tiered hierarchical refinement: (1) a Grounding module extracts relations for each pair using the shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) a Refinement module aggregates the local predictions and revises omissions and inconsistencies based on a holistic reading of the sentence. A causality-driven reward model further disentangles spurious correlations to mitigate reward hacking, enabling robust fine-tuning via reinforcement learning with human feedback.

Link: https://arxiv.org/abs/2508.14391
Authors: Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines.
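
The shortest-dependency-path idea behind the Grounding module can be sketched with spaCy and networkx (requires the en_core_web_sm model; matching entities by surface string is a simplification of real entity linking):

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def shortest_dep_path(sentence: str, e1: str, e2: str) -> list[str]:
    # Treat the dependency tree as an undirected graph and return the shortest
    # token path connecting the two entity mentions.
    doc = nlp(sentence)
    graph = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)
    i1 = next(tok.i for tok in doc if tok.text == e1)
    i2 = next(tok.i for tok in doc if tok.text == e2)
    return [doc[i].text for i in nx.shortest_path(graph, i1, i2)]

print(shortest_dep_path("Apple, which Steve Jobs founded, acquired Beats.",
                        "Apple", "Beats"))
```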

[NLP-25] Credence Calibration Game? Calibrating Large Language Models through Structured Play

[Quick Read]: This paper addresses the miscalibration of large language models (LLMs) deployed in decision-critical domains, where confidence estimates fail to match actual correctness. Existing methods rely on post-hoc adjustment or auxiliary model training and often require extra supervision or parameter updates, limiting practicality. The key contribution is a prompt-based calibration framework inspired by the Credence Calibration Game: a structured interaction loop gives the LLM feedback on how well its stated confidence aligns with correctness, combined with natural-language summaries of prior performance, so calibration improves dynamically without any additional training. Experiments across models and game configurations show consistent gains on calibration metrics.

Link: https://arxiv.org/abs/2508.14390
Authors: Ke Fang, Tianyi Zhao, Lu Cheng
Affiliations: University of Pennsylvania; University of Southern California; University of Illinois Chicago
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at this https URL.
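
The alignment between stated confidence and correctness that drives the game's feedback can be measured with expected calibration error (ECE); a standard sketch, not the paper's exact scoring:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    # Average gap between stated confidence and empirical accuracy per bin,
    # weighted by how many predictions fall in each bin.
    conf, correct = np.asarray(conf), np.asarray(correct, dtype=float)
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        mask = (conf >= lo) & (conf < lo + 1.0 / n_bins)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

confidences = [0.9, 0.8, 0.95, 0.6]
correctness = [1, 0, 1, 1]
print(f"ECE = {expected_calibration_error(confidences, correctness):.3f}")
```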

[NLP-26] ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students' Cognitive Abilities

[Quick Read]: This paper examines whether large language models (LLMs) can judge how well the cognitive difficulty of reading materials matches students' developmental stages, focusing on the alignment between Chinese reading-comprehension difficulty and Students' Cognitive Abilities (SCA) across age groups. The key contribution is ZPD-SCA, a new benchmark grounded in Vygotsky's Zone of Proximal Development (ZPD), annotated by 60 Special Grade teachers (roughly the top 0.15% of in-service teachers nationwide), enabling assessment of stage-level Chinese reading comprehension difficulty. Experiments show LLMs perform poorly zero-shot (Qwen-max and GLM even fall below random guessing) but improve substantially with in-context examples, revealing emerging but limited abilities: even the best models show systematic directional biases, and performance varies markedly across genres, providing a quantifiable basis for evaluating and improving LLMs in cognitively aligned educational applications.

Link: https://arxiv.org/abs/2508.14377
Authors: Wenhan Dong, Zhen Sun, Yuemeng Zhao, Zifan Peng, Jun Wu, Jingyi Zheng, Yule Liu, Xinlei He, Yu Wang, Ruiming Wang, Xinyi Huang, Lei Mo
Affiliations: School of Psychology, South China Normal University; Information Hub, Hong Kong University of Science and Technology (Guangzhou); School of AI, Guangzhou University; College of Cyber Security, Jinan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students’ developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students’ Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs’ ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLMs performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.
zh

[NLP-27] ISCA: A Framework for Interview-Style Conversational Agents

【速读】: 该论文旨在解决传统访谈式对话代理在数据收集过程中缺乏控制性与标准化的问题,尤其是在需要追踪态度形成或行为变化等场景中,传统方法难以实现高质量的定量分析。其解决方案的关键在于提出一种低计算资源消耗的非生成式(non-generative)对话系统,通过结构化的交互流程和在线管理面板实现无需编程即可快速构建和调整访谈内容,从而兼顾定性数据采集的灵活性与定量分析的规范性。

链接: https://arxiv.org/abs/2508.14344
作者: Charles Welch,Allison Lahnala,Vasudha Varadarajan,Lucie Flek,Rada Mihalcea,J. Lomax Boyd,João Sedoc
机构: McMaster University (麦克马斯特大学); Stony Brook University (石溪大学); University of Bonn (波恩大学); University of Michigan (密歇根大学); Johns Hopkins University (约翰霍普金斯大学); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a low-compute non-generative system for implementing interview-style conversational agents which can be used to facilitate qualitative data collection through controlled interactions and quantitative analysis. Use cases include applications to tracking attitude formation or behavior change, where control or standardization over the conversational flow is desired. We show how our system can be easily adjusted through an online administrative panel to create new interviews, making the tool accessible without coding. Two case studies are presented as example applications, one regarding the Expressive Interviewing system for COVID-19 and the other a semi-structured interview to survey public opinion on emerging neurotechnology. Our code is open-source, allowing others to build off of our work and develop extensions for additional functionality.
zh

[NLP-28] Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever

【速读】: 该论文旨在解决工具增强型大语言模型(Tool-augmented Large Language Models, TALLMs)在调用外部函数时因不准确的函数调用导致的效率低下和性能下降问题。现有方法如微调或基于示例的提示策略,往往存在训练开销高且无法处理演示样本不一致性的问题,从而误导模型的工具调用行为。解决方案的关键在于提出一个行为对齐检索器(Behavior-Aligned Retriever, BAR),通过构建包含不同函数调用行为的语料库,并采用对比学习框架设计定制化的正负样本对与双负对比损失函数,实现对行为一致性的有效建模,从而提升LLM在工具调用决策上的准确性,同时保持任务性能并降低整体成本。

链接: https://arxiv.org/abs/2508.14323
作者: Yixin Chen,Ying Xiong,Shangyu Wu,Yufei Cui,Xue Liu,Nan Guan,Chun Jason Xue
机构: City University of Hong Kong (香港城市大学); Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or not. We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent demonstrations. Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
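
为便于理解上文的"双负例对比损失",下面给出一个示意性的 PyTorch 草图(非论文官方实现,张量形状、温度系数与负例构造均为假设):

```python
import torch
import torch.nn.functional as F

def dual_negative_contrastive_loss(query, positive, behavior_neg, in_batch_neg, tau=0.07):
    # query/positive: 行为一致的示例对嵌入, 形状 (B, D)
    # behavior_neg:   语义相似但函数调用行为相反的负例嵌入(假设性构造)
    # in_batch_neg:   常规批内随机负例嵌入
    q, pos = F.normalize(query, dim=-1), F.normalize(positive, dim=-1)
    neg_b, neg_i = F.normalize(behavior_neg, dim=-1), F.normalize(in_batch_neg, dim=-1)
    logits = torch.stack([
        (q * pos).sum(-1),    # 正例相似度
        (q * neg_b).sum(-1),  # 行为不一致负例
        (q * neg_i).sum(-1),  # 语义无关负例
    ], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # 正例位于第 0 列
    return F.cross_entropy(logits, labels)

loss = dual_negative_contrastive_loss(*[torch.randn(8, 256) for _ in range(4)])
```

其中两类负例共同进入同一 softmax 分母,促使检索器同时区分"调用行为"与"表面语义"。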
zh

[NLP-29] SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动综述生成方法在长篇多章节综述中难以保持内容连贯性以及引用覆盖不全的问题。其核心解决方案是提出SurveyGen-I框架,该框架采用“粗粒度到细粒度”的检索策略、自适应规划机制与记忆引导生成技术:首先通过survey-level检索构建初始大纲和写作计划,随后在生成过程中利用记忆机制动态存储已写内容与术语,确保各子章节间的语义一致性;当检测到上下文不足时,触发子章节级别的细粒度检索,从而显著提升综述的内容质量、逻辑一致性和引用完整性。

链接: https://arxiv.org/abs/2508.14317
作者: Jing Chen,Zhiheng Yang,Yixian Shen,Jie Liu,Adam Belloum,Chrysa Papagainni,Paola Grosso
机构: University of Amsterdam (阿姆斯特丹大学); Vrije Universiteit Amsterdam (自由大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: The code is available at this https URL , 20 pages, 16 figures

点击查看摘要

Abstract:Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
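
下面用一个极简的 Python 草图说明这种"粗到细检索 + 记忆引导生成"的控制流(search/generate/summarize 均为虚构接口签名,仅示意论文描述的流程,并非官方实现):

```python
def generate_survey(topic, search, generate, summarize, min_context=3):
    # search(query, level) -> list[str]: 检索接口; generate(prompt, ctx, mem) -> str:
    # LLM 生成接口; summarize(text) -> str: 记忆压缩接口 —— 均为虚构签名
    papers = search(topic, "survey")                      # survey 级粗粒度检索
    outline = generate(f"outline for: {topic}", papers, []).splitlines()
    memory, sections = [], []
    for heading in outline:
        ctx = papers
        if len(ctx) < min_context:                        # 上下文不足 → 细粒度检索
            ctx = ctx + search(heading, "subsection")
        text = generate(heading, ctx, memory)             # 记忆约束小节间一致性
        memory.append(summarize(text))                    # 存储已写内容与术语
        sections.append(text)
    return "\n\n".join(sections)

# 用桩函数演示整个流程
draft = generate_survey(
    "diffusion language models",
    search=lambda q, level: [f"[{level}] paper about {q}"],
    generate=lambda p, ctx, mem: f"{p}: grounded in {len(ctx)} docs, {len(mem)} memory items",
    summarize=lambda t: t[:40],
)
print(draft)
```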
zh

[NLP-30] Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成内容时易产生幻觉(hallucinations)的问题,即模型输出看似合理但包含事实性错误的内容。解决方案的关键在于提出了一种名为Finch-Zk的黑盒框架,其核心创新是:1)基于细粒度跨模型一致性检查策略,通过对比不同模型对语义等价提示(semantically-equivalent prompts)的响应,识别出细微的事实错误;2)采用针对性的修正技术,仅对存在错误的片段进行精准修正,同时保留正确内容不变。该方法无需外部知识源即可有效提升LLM输出的事实可靠性,在多个基准数据集上显著优于现有方法。

链接: https://arxiv.org/abs/2508.14314
作者: Aman Goel,Daniel Schwartz,Yanjun Qi
机构: Amazon Web Services (亚马逊网络服务)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations–generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39% compared to existing approaches. For mitigation, Finch-Zk achieves 7-8 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation across multiple models demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
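
论文的跨模型一致性思想可以用如下草图说明:对语义等价提示下不同模型的回答做两两比对,以一致性得分作为幻觉信号(此处用字符级相似度近似,论文实际采用细粒度片段级比较,阈值为假设值):

```python
from difflib import SequenceMatcher

def consistency_score(answers):
    # 对多个模型(或等价提示)的回答做两两对齐, 取平均相似度
    scores = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            scores.append(SequenceMatcher(None, answers[i], answers[j]).ratio())
    return sum(scores) / len(scores) if scores else 1.0

def flag_hallucination(answers, threshold=0.7):
    return consistency_score(answers) < threshold   # 一致性低 → 疑似幻觉

# 假设来自两个不同模型对语义等价提示的回答
print(flag_hallucination(["Paris is the capital of France.",
                          "The capital of France is Lyon."]))
```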
zh

[NLP-31] A Joint Multitask Model for Morpho-Syntactic Parsing

【速读】: 该论文旨在解决统一树库(Universal Dependencies, UD)2025形态句法解析共享任务中同时预测形态学和句法分析的问题,其挑战在于如何在新型UD标注体系下高效建模多任务联合学习。解决方案的关键在于设计一个基于XLM-RoBERTa的共享编码器架构,并搭配三个专用解码器分别完成实词识别(content word identification)、依存句法分析(dependency parsing)和形态句法特征预测(morphosyntactic feature prediction),从而实现跨语言的端到端多任务优化。实验表明,该模型在九种类型多样语言上取得最优性能,平均MSLAS为78.7%,LAS为80.1%,Feats F1达90.3%,且消融研究验证了匹配任务标注的金标准分词与实词识别对模型效果至关重要。

链接: https://arxiv.org/abs/2508.14307
作者: Demian Inostroza,Mel Mistica,Ekaterina Vylomova,Chris Guest,Kemal Kurniawan
机构: University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, SyntaxFest, UniDive 2025 Morpho-Syntactic Parsing shared task

点击查看摘要

Abstract:We present a joint multitask model for the UniDive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following a novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task's leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7 percent, LAS of 80.1 percent, and Feats F1 of 90.3 percent. Our ablation studies show that matching the task's gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom-Acc) and nominal features across languages.
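
共享编码器搭配三个任务头的结构可以用如下 PyTorch 草图说明(隐层维度、特征数均为假设值,依存弧打分用简化的 Bilinear 代替完整的 biaffine,非官方实现):

```python
import torch
import torch.nn as nn

class JointMorphoSyntacticModel(nn.Module):
    # 共享编码器 + 三个任务头的极简草图
    def __init__(self, encoder, hidden=768, n_feats=200):
        super().__init__()
        self.encoder = encoder                          # 例如 transformers 的 XLM-RoBERTa
        self.content_head = nn.Linear(hidden, 2)        # 实词识别(二分类)
        self.feats_head = nn.Linear(hidden, n_feats)    # 形态句法特征(多标签)
        self.arc_head = nn.Bilinear(hidden, hidden, 1)  # 依存弧打分(biaffine 的简化)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        B, T, H = h.shape
        heads = h.unsqueeze(2).expand(B, T, T, H).reshape(-1, H)
        deps = h.unsqueeze(1).expand(B, T, T, H).reshape(-1, H)
        arc_scores = self.arc_head(heads, deps).view(B, T, T)  # 每个词对的弧得分
        return self.content_head(h), self.feats_head(h), arc_scores

# 用法示意: encoder = AutoModel.from_pretrained("xlm-roberta-base")
# model = JointMorphoSyntacticModel(encoder)
```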
zh

[NLP-32] GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation

【速读】: 该论文旨在解决在边缘硬件上部署大语言模型(Large Language Models, LLMs)时,如何通过激进的、针对提示(prompt)动态剪枝来降低计算量而不损害生成质量的问题。现有静态或基于预测器的方法要么固定稀疏模式,要么引入额外运行时开销;而近期零样本方法依赖单一提示的统计信息,在短提示和/或长文本生成场景下表现不佳。解决方案的关键在于提出A/I-GLASS:一种基于激活(Activation)和影响(Impact)的全局-局部神经重要性聚合方法,用于前馈网络(Feed-Forward Network, FFN)的稀疏化。该方法无需训练,通过排序聚合机制整合提示局部与模型内在全局的神经单元重要性指标,从而实现高效且精准的动态剪枝,在长文本生成等挑战场景中显著优于现有无训练方法,且不依赖辅助预测器或增加推理开销。

链接: https://arxiv.org/abs/2508.14302
作者: Amirmohsen Sattarifard,Sepehr Lavasani,Ehsan Imani,Kunlin Zhang,Hanlin Xu,Fengyu Sun,Negar Hassanpour,Chao Gao
机构: University of California, Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank aggregation of prompt-local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
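
其核心的"全局-局部秩聚合"选择 FFN 单元的思路,可以用如下 NumPy 草图示意(聚合方式与保留比例均为假设,仅说明机制):

```python
import numpy as np

def glass_select_units(local_stats, global_stats, keep_ratio=0.5):
    # local_stats / global_stats: 形状 (N,) 的每个 FFN 单元重要性分数
    # (前者来自当前提示的激活统计, 后者来自模型内在的全局统计)
    local_rank = np.argsort(np.argsort(-local_stats))    # 分数越大 → 秩越小
    global_rank = np.argsort(np.argsort(-global_stats))
    agg = local_rank + global_rank                       # 简单求和式秩聚合(假设)
    k = int(len(agg) * keep_ratio)
    return np.argsort(agg)[:k]                           # 保留聚合秩最小的单元

kept = glass_select_units(np.random.rand(3072), np.random.rand(3072), keep_ratio=0.25)
print(len(kept), "FFN units kept")
```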
zh

[NLP-33] MultiFuzz: A Dense Retrieval-based Multi-Agent System for Network Protocol Fuzzing

【速读】: 该论文旨在解决传统协议模糊测试(Protocol Fuzzing)技术在处理复杂协议语法时语义理解不足以及种子变异策略僵化的问题,同时针对现有基于大语言模型(Large Language Models, LLMs)的方法如ChatAFL中存在的输出不可靠、LLM幻觉及对协议规范知识假设过强等局限性。其解决方案的关键在于提出一种基于密集检索的多智能体系统(MultiFuzz),通过构建RFC文档的代理块(agentic chunks)嵌入向量数据库实现检索增强生成(Retrieval-Augmented Generation, RAG)机制,结合专业化智能体与结构化工具辅助推理,使智能体能够生成更可靠且结构化的变异指令,从而提升协议消息变异的语法规则遵循性和状态空间覆盖度;此外,框架将模糊测试过程模块化为协作式智能体组,并借助思维链(Chain-of-Thought)推理动态调整策略,显著优于当前最优(SOTA)方法如NSFuzz、AFLNet和ChatAFL。

链接: https://arxiv.org/abs/2508.14300
作者: Youssef Maklad,Fares Wael,Ali Hamdi,Wael Elsersy,Khaled Shaban
机构: MSA University (MSA大学); Qatar University (卡塔尔大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Traditional protocol fuzzing techniques, such as those employed by AFL-based systems, often lack effectiveness due to a limited semantic understanding of complex protocol grammars and rigid seed mutation strategies. Recent works, such as ChatAFL, have integrated Large Language Models (LLMs) to guide protocol fuzzing and address these limitations, pushing protocol fuzzers to wider exploration of the protocol state space. But ChatAFL still faces issues like unreliable output, LLM hallucinations, and assumptions of LLM knowledge about protocol specifications. This paper introduces MultiFuzz, a novel dense retrieval-based multi-agent system designed to overcome these limitations by integrating semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning. MultiFuzz utilizes agentic chunks of protocol documentation (RFC Documents) to build embeddings in a vector database for a retrieval-augmented generation (RAG) pipeline, enabling agents to generate more reliable and structured outputs, enhancing the fuzzer in mutating protocol messages with enhanced state coverage and adherence to syntactic constraints. The framework decomposes the fuzzing process into modular groups of agents that collaborate through chain-of-thought reasoning to dynamically adapt fuzzing strategies based on the retrieved contextual knowledge. Experimental evaluations on the Real-Time Streaming Protocol (RTSP) demonstrate that MultiFuzz significantly improves branch coverage and explores deeper protocol states and transitions over state-of-the-art (SOTA) fuzzers such as NSFuzz, AFLNet, and ChatAFL. By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz establishes a new paradigm in autonomous protocol fuzzing, offering a scalable and extensible foundation for future research in intelligent agentic-based fuzzing systems.
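
其中 RAG 检索环节可以用如下草图示意:将 RFC 文档分块后嵌入向量库,按余弦相似度取 top-k 作为变异智能体的上下文(嵌入函数此处以随机桩代替,仅示意流程,非论文实现):

```python
import numpy as np

def build_index(chunks, embed):
    # embed: 文本 -> 向量 的任意嵌入函数(此处为虚构接口)
    return np.stack([embed(c) for c in chunks]), chunks

def retrieve(query, index, embed, top_k=3):
    vecs, chunks = index
    q = embed(query)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    return [chunks[i] for i in np.argsort(-sims)[:top_k]]

# 用随机嵌入演示流程; 实际系统应替换为真实嵌入模型与 RFC 文档分块
fake_embed = lambda text: np.random.RandomState(abs(hash(text)) % 2**31).rand(64)
index = build_index(["RTSP SETUP carries the Transport header ...",
                     "PLAY requests must follow a successful SETUP ..."], fake_embed)
context = retrieve("how to mutate the Transport header", index, fake_embed)
# context 随后拼入变异智能体的提示词, 约束其生成符合语法的报文变体
print(context[0])
```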
zh

[NLP-34] Tokens with Meaning: A Hybrid Tokenization Approach for NLP

【速读】: 该论文旨在解决现有子词分词方法(如Byte Pair Encoding, BPE 和 WordPiece)在处理形态学丰富且黏着性强的语言(如土耳其语)时,因依赖词频而非语言结构而导致的分词不准确问题。其解决方案的关键在于提出一种混合分词框架,融合基于规则的形态分析与统计子词分割:通过音位归一化、词根-词缀词典以及一种新算法,在保持词素完整性的同时优化词汇效率;同时引入共享标识符以减少词形变化带来的冗余(如 -ler 与 -lar),并添加特殊标记(如大写标记)避免因大小写导致的词汇膨胀。该方法在 TR-MMLU 基准测试中显著提升了土耳其语的 Token Percentage(90.29%)和 Pure Token Percentage(85.8%),优于 LLaMA、Gemma 和 GPT 等主流分词器,具有语言无关性与可扩展性,为构建更具解释性和多语言适应性的自然语言处理系统提供了有效路径。

链接: https://arxiv.org/abs/2508.14292
作者: M. Ali Bayram,Ali Arda Fincan,Ahmet Semih Gümüş,Sercan Karakaş,Banu Diri,Savaş Yıldırım,Demircan Çelik
机构: Yıldız Technical University (伊斯坦布尔技术大学); Yeditepe University (耶迪特佩大学); University of Chicago (芝加哥大学); Istanbul Bilgi University (伊斯坦布尔比尔吉大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
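
下面的 Python 草图示意"词典优先、BPE 兜底"的混合分词思路,其中音位变体词缀(-ler/-lar 等)共享同一 ID(词典与 ID 均为虚构的极简示例,非论文词表):

```python
# 极简词根-词缀词典; 音位变体 -ler/-lar、-de/-da 各共享同一 ID(示意)
AFFIX_IDS = {"ler": 101, "lar": 101, "de": 102, "da": 102}
ROOT_IDS = {"kitap": 1, "kitab": 1, "ev": 2}   # kitabı 的变体词根与 kitap 同 ID

def tokenize(word, bpe_fallback):
    for root, rid in ROOT_IDS.items():
        if word.startswith(root):
            ids, rest = [rid], word[len(root):]
            for affix, aid in AFFIX_IDS.items():   # 逐个剥离可识别词缀
                if rest.startswith(affix):
                    ids.append(aid)
                    rest = rest[len(affix):]
            if not rest:
                return ids
    return bpe_fallback(word)   # 词典未覆盖 → 退回 BPE 保证覆盖率

print(tokenize("kitaplar", bpe_fallback=lambda w: [999]))  # -> [1, 101]
print(tokenize("evlerde",  bpe_fallback=lambda w: [999]))  # -> [2, 101, 102]
```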
zh

[NLP-35] Measuring LLM Code Generation Stability via Structural Entropy

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成任务中稳定性评估的难题,即如何客观衡量模型输出代码结构的一致性与可靠性,从而判断其在真实开发场景中的可用性。传统指标如pass@k、BLEU或CodeBLEU存在依赖参考答案、语言特定或执行依赖等局限,难以全面刻画生成代码的结构性稳定性。解决方案的关键在于将信息论中的熵概念引入程序分析领域,通过抽象语法树(Abstract Syntax Tree, AST)的深度受限子树频次构建概率分布,并设计两类参考无关(reference-free)、语言无关(language-agnostic)且无需运行测试的度量:一是基于Jensen-Shannon散度的结构重叠度量,用于量化不同生成结果间的结构相似性;二是结构交叉熵比(Structural Cross-Entropy ratio),用于识别高频模式缺失情况。该方法在O(n·d)时间内完成计算,可区分控制流形状和标识符层面的变异性,显著提升了对LLM代码生成稳定性的细粒度分析能力。

链接: https://arxiv.org/abs/2508.14288
作者: Yewei Song,Tiezhu Sun,Xunzhu Tang,Prateek Rajput,Tegawende F. Bissyande,Jacques Klein
机构: The Interdisciplinary Centre for Security, Reliability and Trust (跨学科安全、可靠性与信任中心); University of Luxembourg (卢森堡大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: ASE-NIER

点击查看摘要

Abstract:Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded subtrees of AST in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views on control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n·d) time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
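
基于深度受限 AST 子树分布的 JSD 稳定性度量可直接用 Python 标准库实现,以下是一个可运行的小型草图(子树"形状"的序列化方式为假设,仅示意计算流程):

```python
import ast
from collections import Counter
from math import log2

def subtree_distribution(code, depth=2):
    # 提取深度受限的 AST 子树形状并归一化为概率分布
    def shape(node, d):
        children = list(ast.iter_child_nodes(node))
        if d == 0 or not children:
            return type(node).__name__
        return f"{type(node).__name__}({','.join(shape(c, d - 1) for c in children)})"
    counts = Counter(shape(n, depth) for n in ast.walk(ast.parse(code)))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def jsd(p, q):
    # Jensen-Shannon 散度, 衡量两次生成的结构重叠度
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * log2(a.get(k, 0) / m[k]) for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

print(jsd(subtree_distribution("def f(x):\n    return x + 1"),
          subtree_distribution("def g(y):\n    return y * 2")))
```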
zh

[NLP-36] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源语言(如罗马尼亚语)教育场景中的可信赖性和教学价值不明确的问题。其解决方案的关键在于构建并公开首个针对罗马尼亚语的多选题基准测试集GRILE,包含1,151道来自高利害考试(国家评估、高考及大学入学考试)的题目,并系统评估七种先进多语言与罗马尼亚语专用LLMs在两项核心能力上的表现:准确选择答案和生成语法正确、具有教学意义的解释。研究发现,尽管Gemini 2.5 Pro达到83%准确率,多数开源模型低于65%,且48%的解释存在事实或教学错误;进一步分析揭示了形态学处理和最新DOOM3正字法规范应用方面的系统性缺陷。这一工作为可信教育自然语言处理提供了新基准,并推动可控解释生成与评估的研究发展。

链接: https://arxiv.org/abs/2508.14279
作者: Adrian-Marius Dumitran,Alexandra-Mihaela Danila,Angela-Liliana Dumitran
机构: University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted as long paper @RANLP2025

点击查看摘要

Abstract:LLMs (Large language models) have revolutionized NLP (Natural Language Processing), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations), the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.
zh

[NLP-37] Disentangling concept semantics via multilingual averaging in Sparse Autoencoders

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识表示与推理方面存在的语义不清晰问题,特别是由于嵌入(embeddings)和稀疏自编码器(sparse autoencoders)所提取的表征中,概念语义与句法及语言特异性信息纠缠不清的问题。解决方案的关键在于利用稀疏自编码器从不同语言版本的本体类(OWL ontology classes)文本中提取概念激活值,并通过平均这些跨语言的概念激活来分离出纯粹的概念语义,从而构建一个与本体类之间真实关系高度对齐的“概念平均”表示,实现对模型内部状态更准确的机制性解释。

链接: https://arxiv.org/abs/2508.14275
作者: Cliff O’Reilly,Ernesto Jimenez-Ruiz,Tillman Weyde
机构: City St George’s, University of London (伦敦城市圣乔治大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns with the true relationship between classes when compared with a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.
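
多语言激活平均本身非常直接,下面的 NumPy 草图示意其计算方式(SAE 激活此处以随机向量代替,实际应取自 Gemma Scope 等稀疏自编码器):

```python
import numpy as np

def concept_average(acts_by_lang):
    # 对同一 OWL 类在不同语言提示下得到的 SAE 特征激活取平均,
    # 以平均化掉句法与语言特异成分(示意)
    return np.mean(np.stack(list(acts_by_lang.values())), axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
class_a = {lang: rng.random(16384) for lang in ("en", "fr", "zh")}  # 假想激活
class_b = {lang: rng.random(16384) for lang in ("en", "fr", "zh")}
# 概念平均之间的相似度可进一步与本体类的真值映射做相关分析
print(cosine(concept_average(class_a), concept_average(class_b)))
```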
zh

[NLP-38] Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper

【速读】: 该论文试图解决的问题是:如何利用大语言模型(Large Language Models, LLMs)生成高质量的科研论文引言(Introduction),以辅助研究人员提升写作效率与质量。当前尽管LLMs在文本生成方面取得显著进展,但其在学术语境下生成结构严谨、内容忠实且逻辑连贯的引言仍面临挑战。解决方案的关键在于提出并定义了一个新的评估任务——科学引言生成(Scientific Introduction Generation, SciIG),该任务基于论文标题、摘要和相关工作(Related Works)自动生成引言,并构建了来自NAACL 2025和ICLR 2025的全新数据集,通过多维指标(包括词汇重叠、语义相似度、内容覆盖度、忠实性、一致性、引用正确性和叙事质量)对五种前沿LLM(含开源与闭源模型)进行全面评估。实验表明,LLaMA-4 Maverick在多数指标上表现最优,尤其在语义相似度和忠实性方面突出,且三样本提示(three-shot prompting)优于少样本策略,为开发高效、可靠的科研写作辅助工具提供了实证依据与实践指导。

链接: https://arxiv.org/abs/2508.14273
作者: Krishna Garg,Firoz Shaikh,Sambaran Bandyopadhyay,Cornelia Caragea
机构: University of Minnesota (明尼苏达大学); Rutgers University (罗格斯大学)
类目: Computation and Language (cs.CL)
备注: 20 pages, 15 figures

点击查看摘要

Abstract:As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs’ ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick’s superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
zh

[NLP-39] Two Birds with One Stone: Multi-Task Detection and Attribution of LLM-Generated Text

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)生成文本在安全性和完整性方面带来的挑战,特别是针对AI生成内容检测与作者归属识别(authorship attribution)这两个关键问题。现有方法多集中于区分AI与人类撰写的英文文本,且缺乏对多语言和多模型来源的泛化能力。论文提出的DA-MTL框架是一种多任务学习(multi-task learning)方法,其核心创新在于同时建模文本检测与作者归属任务,通过共享不同任务间的特征表示并捕捉各自独特性,从而提升两个任务的整体性能。该方案在九个数据集和四种骨干模型上验证有效,展现出跨语言、跨模型的鲁棒性,并能抵抗对抗性混淆技术,为LLM行为分析和内容溯源提供了新的研究范式。

链接: https://arxiv.org/abs/2508.14190
作者: Zixin Rao,Youssef Mohamed,Shang Liu,Zeyan Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Securecomm 2025

点击查看摘要

Abstract:Large Language Models (LLMs), such as GPT-4 and Llama, have demonstrated remarkable abilities in generating natural language. However, they also pose security and integrity challenges. Existing countermeasures primarily focus on distinguishing AI-generated content from human-written text, with most solutions tailored for English. Meanwhile, authorship attribution–determining which specific LLM produced a given text–has received comparatively little attention despite its importance in forensic analysis. In this paper, we present DA-MTL, a multi-task learning framework that simultaneously addresses both text detection and authorship attribution. We evaluate DA-MTL on nine datasets and four backbone models, demonstrating its strong performance across multiple languages and LLM sources. Our framework captures each task’s unique characteristics and shares insights between them, which boosts performance in both tasks. Additionally, we conduct a thorough analysis of cross-modal and cross-lingual patterns and assess the framework’s robustness against adversarial obfuscation techniques. Our findings offer valuable insights into LLM behavior and the generalization of both detection and authorship attribution.
zh

[NLP-40] Comparing energy consumption and accuracy in text classification inference

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在自然语言处理(Natural Language Processing, NLP)任务中推理阶段能源消耗与性能之间的权衡问题。以往研究多关注训练阶段的能耗,而忽视了推理阶段的实际能效表现。其解决方案的关键在于通过系统性实证分析,揭示不同模型架构和硬件配置下文本分类推理过程中的准确率与能耗关系:发现最优准确率模型可能同时具备高能效,而更大规模的LLM往往伴随显著更高的能耗和更低的分类准确性;并进一步指出推理能耗与运行时间存在强相关性,从而提出以执行时间为代理指标来估算能耗,为可持续人工智能发展提供可操作的优化路径。

链接: https://arxiv.org/abs/2508.14170
作者: Johannes Zschache,Tilman Hartwig
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Key results in Figure 1, submitted to Nature Communications, 25 pages

点击查看摘要

Abstract:The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that the best-performing model in terms of accuracy can also be energy-efficient, while larger LLMs tend to consume significantly more energy with lower classification accuracy. We observe substantial variability in inference energy consumption (mWh to kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. These findings have implications for sustainable AI development, providing actionable insights for researchers, industry practitioners, and policymakers seeking to balance performance and resource efficiency in NLP applications.
zh

[NLP-41] DPad: Efficient Diffusion Language Models with Suffix Dropout

【速读】: 该论文旨在解决基于扩散模型的大型语言模型(Diffusion-based Large Language Models, dLLMs)在文本生成过程中存在的高计算开销问题,即每一步预测所有未来后缀标记(suffix tokens),但仅保留极小部分有效信息,造成严重冗余。解决方案的关键在于提出一种无需训练的“扩散草稿”(Diffusion Scratchpad, DPad)机制,通过两个核心策略实现高效推理:(i) 滑动窗口(sliding window)限制注意力范围为固定长度的局部后缀窗口;(ii) 距离衰减丢弃(distance-decay dropout)在注意力计算前确定性地移除远距离后缀标记,从而显著减少冗余计算。该设计兼容现有优化技术如前缀缓存(prefix caching),且仅需少量代码即可实现,在多个基准测试中实现了最高达61.4倍的速度提升,同时保持与原始dLLMs相当的生成精度。

链接: https://arxiv.org/abs/2508.14148
作者: Xinhua Chen,Sitao Huang,Cong Guo,Chiyue Wei,Yintao He,Jianyi Zhang,Hai “Hellen” Li,Yiran Chen
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at this https URL.
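
其"滑动窗口 + 距离衰减丢弃"的注意力掩码构造可以用如下草图示意(窗口大小与稀疏化规则均为假设值,非官方实现):

```python
import torch

def dpad_keep_mask(cur_pos, seq_len, window=32, decay=0.05):
    # 前缀与近邻后缀全部保留, 远处后缀按距离确定性稀疏化(规则为假设)
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[: cur_pos + 1] = True                       # 前缀
    end = min(cur_pos + 1 + window, seq_len)
    keep[cur_pos + 1 : end] = True                   # 滑动窗口内的近邻后缀
    for i in range(end, seq_len):
        dist = i - cur_pos
        stride = max(1, int(decay * dist))           # 距离越远, 保留越稀疏
        keep[i] = dist % stride == 0
    return keep

mask = dpad_keep_mask(cur_pos=10, seq_len=256)
print(int(mask.sum()), "/ 256 positions attended")
```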
zh

[NLP-42] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的学术同行评审自动化系统缺乏统一评估基准的问题,尤其在处理包含图表等多模态内容时,现有方法难以全面、准确且符合人类偏好地生成评审意见。其解决方案的关键在于提出一个名为MMReview的综合性评测基准,涵盖人工智能、自然科学、工程科学和社会科学四大类共17个研究领域中的240篇论文,每篇均包含专家撰写的评审意见和多模态内容(如图表)。该基准设计了13项任务,分为四类核心维度:分步评审生成能力、评审结果表述质量、与人类偏好的一致性以及对抗性输入扰动下的鲁棒性,从而系统性评估LLMs和多模态大语言模型(Multimodal LLMs, MLLMs)在复杂评审场景中的表现,为构建标准化的自动同行评审系统奠定基础。

链接: https://arxiv.org/abs/2508.14146
作者: Xian Gao,Jiacheng Ruan,Zongyun Zhang,Jingsheng Gao,Ting Liu,Yuzhuo Fu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
zh

[NLP-43] DLLMQuant: Quantizing Diffusion-based Large Language Models

【速读】: 该论文旨在解决扩散式大语言模型(Diffusion-based Large Language Models, DLLMs)在后训练量化(Post-Training Quantization, PTQ)过程中面临的严重性能下降问题,尤其是当直接应用传统PTQ方法(如AWQ)时,在LLADA数据集上W4A4量化条件下准确率下降高达16%。核心挑战源于DLLMs的三大机制——动态掩码(dynamic masking)、迭代生成(iterative generation)和双向注意力(bidirectional attention)与现有PTQ方法存在本质冲突:1)不同解码步的token分布差异未被校准方法捕捉;2)量化误差随迭代累积并放大;3)掩码与未掩码token的特征分布不兼容。为应对这些问题,论文提出DLLMQuant框架,其关键创新在于三项技术:1)时间-掩码自适应采样(Temporal-Mask Adaptive Sampling, TMAS),联合建模时间步与掩码状态以捕获跨步分布;2)交互感知激活量化(Interaction-Aware Activation Quantization, IA-AQ),利用双向注意力中的交互信号动态分配量化资源;3)确定性引导量化(Certainty-Guided Quantization, CGQ),融合掩码状态与token得分作为权重因子进行误差补偿,从而提升权重量化对DLLMs的适配性。实验表明,该方案在显著提升效率的同时大幅改善了模型性能。

链接: https://arxiv.org/abs/2508.14090
作者: Chen Xu,Dawei Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs’ key mechanisms - dynamic masking, iterative generation, bidirectional attention - clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors are accumulated and amplified progressively during iteration in DLLMs, causing quantized models to perform worse as decoding steps progress; 3) Unmasked tokens stabilize while masked remain probabilistic, making overall feature distribution incompatible with existing PTQ methods. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps. 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention’s interaction signals to dynamically allocate quantization resources. 3) Certainty-Guided Quantization (CGQ), which integrates mask status and token scores as key weighting criteria into error compensation, making weight quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency.
zh

[NLP-44] Punctuation and Predicates in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中信息收集与传播机制的内在逻辑问题,特别是关注标点符号 token 的作用及其在不同层中的必要性与充分性,以及模型是否对输入的不同语义组件(如主语、形容词、条件句等)形成早期静态摘要或保持跨层敏感性。其解决方案的关键在于采用基于干预(intervention-based)的技术手段,包括交换干预(interchange intervention)和层交换实验(layer-swapping experiments),系统评估了 GPT-2、DeepSeek 和 Gemma 等模型中 punctuation token 的功能角色及不同类型推理规则(如条件语句 if-then 和全称量化 for all)的处理差异。研究发现,punctuation 在不同模型中表现出显著的层特异性作用,且推理规则的处理方式存在明显分化,揭示了 LLM 内部信息处理机制的异质性和可解释性潜力。

链接: https://arxiv.org/abs/2508.14067
作者: Sonakshi Chauhan,Maheep Chaudhary,Koby Choy,Samuel Nellessen,Nandi Schoots
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. We also investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
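
文中的交换干预(interchange intervention)可以借助 PyTorch 的 forward hook 实现,下面是一个示意草图(层索引与标点位置均为假设,仅说明机制):

```python
import torch

def make_swap_hook(source_hidden, positions):
    # 交换干预示意: 在目标前向中, 将指定位置(如标点 token)的隐状态
    # 替换为另一次前向缓存的同位置激活
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h.clone()
        h[:, positions] = source_hidden[:, positions]
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return hook

# 用法示意(假设 model 为 HuggingFace 的 GPT-2, 层索引 6 与位置均为假设):
# with torch.no_grad():
#     source_hidden = model(src_ids, output_hidden_states=True).hidden_states[7]
#     handle = model.transformer.h[6].register_forward_hook(
#         make_swap_hook(source_hidden, positions=[5, 11]))
#     patched_logits = model(tgt_ids).logits
#     handle.remove()
```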
zh

[NLP-45] Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在微调过程中因记忆训练数据而导致的隐私泄露问题。研究发现,使用重复的敏感数据进行微调会使隐私泄露率从基线水平的0%–5%显著上升至60%–75%,平均增幅达64.2%。解决方案的关键在于提出并验证了一种多层隐私保护框架,包含四种互补技术:语义去重(semantic data deduplication)、生成阶段差分隐私(differential privacy during generation)、基于熵的过滤(entropy-based filtering)以及基于模式的内容过滤(pattern-based content filtering),这些方法可在将隐私泄露降至0%的同时,保留原始模型94.7%的性能表现。

链接: https://arxiv.org/abs/2508.14062
作者: Badrinath Ramakrishnan,Akshaya Balaji
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures. Code and experimental framework available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
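
四种防护手段中,语义去重最易落地,下面的草图示意基于嵌入余弦相似度的去重流程(阈值与嵌入函数均为假设,后者以随机桩代替):

```python
import numpy as np

def semantic_dedup(texts, embed, threshold=0.95):
    # 与已保留样本过于相似的训练样本被丢弃, 以降低微调阶段的记忆风险
    kept, kept_vecs = [], []
    for t in texts:
        v = embed(t)
        v = v / (np.linalg.norm(v) + 1e-8)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(t)
            kept_vecs.append(v)
    return kept

# 随机桩嵌入: 相同文本 → 相同向量, 因此完全重复样本会被过滤
fake_embed = lambda t: np.random.RandomState(abs(hash(t)) % 2**31).rand(128)
data = ["SSN: 123-45-6789", "SSN: 123-45-6789", "the weather is nice"]
print(semantic_dedup(data, fake_embed))
```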
zh

[NLP-46] Confidence Estimation for Text-to-SQL in Large Language Models

【速读】: 该论文旨在解决文本到SQL(text-to-SQL)任务中模型生成查询语句时的置信度估计问题,即在无法获取标准答案(gold answers)的情况下,评估模型生成的SQL语句的可靠性。其核心挑战在于如何在限制访问模型权重和梯度的大语言模型(LLMs)场景下实现有效的置信度判断。解决方案的关键在于区分黑盒与白盒两种策略:黑盒场景下采用基于一致性的方法(consistency-based methods),通过多个推理路径的一致性来衡量置信度;白盒场景下则利用SQL语法感知(SQL-syntax-aware)的方式解析模型logits,从而更精准地捕捉语义合理性;此外,论文还发现执行结果(execution-based grounding)作为补充信号可显著提升两类方法的性能表现。

链接: https://arxiv.org/abs/2508.14056
作者: Sepideh Entezari Maleki,Mohammadreza Pourreza,Davood Rafiei
机构: 未知
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Confidence estimation for text-to-SQL aims to assess the reliability of model-generated SQL queries without having access to gold answers. We study this problem in the context of large language models (LLMs), where access to model weights and gradients is often constrained. We explore both black-box and white-box confidence estimation strategies, evaluating their effectiveness on cross-domain text-to-SQL benchmarks. Our evaluation highlights the superior performance of consistency-based methods among black-box models and the advantage of SQL-syntax-aware approaches for interpreting LLM logits in white-box settings. Furthermore, we show that execution-based grounding of queries provides a valuable supplementary signal, improving the effectiveness of both approaches.
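
黑盒场景下的一致性置信度可以用如下草图示意:对同一问题采样多条 SQL,以规范化后众数的占比作为置信度(sample_fn 为虚构的 LLM 采样接口;论文中还可用执行结果等价性替代字符串规范化):

```python
import random
from collections import Counter

def consistency_confidence(sample_fn, question, n=10):
    # 采样 n 条 SQL, 规范化后取众数占比作为置信度(简化示意)
    normalized = [" ".join(sample_fn(question).lower().split()) for _ in range(n)]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / n

# 用随机桩模拟一个不完全稳定的 text-to-SQL 模型
fake_llm = lambda q: random.choice(
    ["SELECT name FROM users", "SELECT name FROM users", "SELECT * FROM users"])
sql, conf = consistency_confidence(fake_llm, "list user names")
print(sql, conf)
```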
zh

[NLP-47] T-REX: Table – Refute or Entail eXplainer

【速读】: 该论文旨在解决基于结构化表格数据的文本陈述验证问题,这一任务在自然语言处理中具有重要现实意义但极具挑战性。当前基于大语言模型(Large Language Models, LLMs)的表格事实核查方法虽取得进展,但仍难以被非专家用户使用。解决方案的关键在于提出T-REX(Table – Refute or Entail eXplainer),这是一个首个面向多模态、多语言表格的实时交互式验证工具,其核心创新在于集成经过指令微调的推理型LLM,从而实现高准确率与可解释性的结合,使非专业人士也能便捷地进行事实核查。

链接: https://arxiv.org/abs/2508.14055
作者: Tim Luka Horstmann,Baptiste Geisenberger,Mehwish Alam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Verifying textual claims against structured tabular data is a critical yet challenging task in Natural Language Processing with broad real-world impact. While recent advances in Large Language Models (LLMs) have enabled significant progress in table fact-checking, current solutions remain inaccessible to non-experts. We introduce T-REX (Table – Refute or Entail eXplainer), the first live, interactive tool for claim verification over multimodal, multilingual tables using state-of-the-art instruction-tuned reasoning LLMs. Designed for accuracy and transparency, T-REX empowers non-experts by providing access to advanced fact-checking technology. The system is openly available online.
zh

[NLP-48] Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach

【速读】: 该论文试图解决英文与中文新闻语篇中功能块(functional chunks)在句法位置分布上的差异问题,特别是其在信息结构和语用驱动下的构式偏好。解决方案的关键在于利用大型语言模型(Large Language Model, LLM)标注的可比英汉新闻语料库,从功能块的状语角色出发,系统分析两类语言在主谓宾(SVO)结构中功能块的位置倾向及其共现时的顺序调整机制,从而揭示词序既具有系统性偏好(如英语多后置、汉语多前置),又具备动态适应性(由信息焦点和语用目的驱动)。

链接: https://arxiv.org/abs/2508.14054
作者: Yiran Rex Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Based on comparable English-Chinese news corpora annotated by Large Language Model (LLM), this paper attempts to explore the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles, and analyze their typical positional preferences and distribution patterns. It is found that: (1) English news prefers linear narrative of core information first, and functional chunks are mostly post-positioned, while Chinese news prefers overall presentation mode of background first, and functional chunks are often pre-positioned; (2) In SVO structure, both English and Chinese news show differences in the distribution of functional chunks, but the tendency of Chinese pre-positioning is more significant, while that of English post-positioning is relatively mild; (3) When function blocks are co-occurring, both English and Chinese news show high flexibility, and the order adjustment is driven by information and pragmatic purposes. The study reveals that word order has both systematic preference and dynamic adaptability, providing new empirical support for contrastive study of English-Chinese information structure.
zh

[NLP-49] FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering

【速读】: 该论文旨在解决金融领域信息检索(Information Retrieval, IR)中传统方法在语义相似性捕捉和细粒度推理能力上的不足,尤其是在面对文档结构与领域专业知识时的局限性。其核心问题是缺乏一个能够评估大语言模型(Large Language Models, LLMs)在金融场景下进行多步推理(multi-step reasoning)能力的基准测试工具。解决方案的关键在于提出FinAgentBench——首个面向金融领域的、大规模的“代理式检索”(agentic retrieval)基准,该基准包含3,429个专家标注样本,明确区分两个推理步骤:首先识别最相关的文档类型,其次精确定位所选文档中的关键段落,并通过量化指标评估LLM在复杂金融任务中的检索行为。此设计有效缓解了上下文限制问题,为理解并提升LLM在特定领域内的检索性能提供了可复现的基础。

链接: https://arxiv.org/abs/2508.14052
作者: Chanyeol Choi,Jihoon Kwon,Alejandro Lopez-Lira,Chaewoon Kim,Minjae Kim,Juneha Hwang,Jaeseon Ha,Hojun Choi,Suyeol Yun,Yongjin Kim,Yongjae Lee
机构: LinqAlpha (Korea); University of Florida (佛罗里达大学); UNIST (Korea)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages

点击查看摘要

Abstract:Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods-whether sparse or dense-often fall short in retrieval accuracy, as it requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance – a setting we term agentic retrieval. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design provides a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance. We will release the dataset publicly upon acceptance of the paper and plan to expand and share the dataset for the full S&P 500 and beyond.
zh

[NLP-50] Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach

【速读】: 该论文旨在解决斯瓦希里语自然语言处理(Natural Language Processing, NLP)领域中缺乏基于社会语言学多样性(sociolinguistic diversity)的评估体系的问题。其解决方案的关键在于构建一个结构化的分类体系(taxonomy),并以此为视角系统分析预训练与指令微调语言模型在真实语料中的预测误差,从而揭示社会语言变体(如部落影响、城市俚语、代码混用和借词)对模型性能的影响,推动更具文化根基的评估框架发展。

链接: https://arxiv.org/abs/2508.14051
作者: Kezia Oketch,John P. Lalor,Ahmed Abbasi
机构: University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
zh

[NLP-51] From Image Captioning to Visual Storytelling

【速读】: 该论文旨在解决视觉故事生成(Visual Storytelling)任务中如何平衡图像序列的语义 grounding 与叙事连贯性的问题。其核心挑战在于生成的故事既要忠实于输入图像序列,又要具备自然语言的逻辑性和流畅性。解决方案的关键在于将视觉故事生成视为图像描述(Image Captioning)的超集,采用两阶段框架:首先利用视觉到语言模型为每张图像生成描述,再通过语言到语言的方法将这些描述整合为连贯的叙事文本。这种分步处理策略不仅提升了故事质量,还显著缩短了训练时间,并增强了模型的可复用性和可重复性。

链接: https://arxiv.org/abs/2508.14045
作者: Admitos Passadakis,Yingjin Song,Albert Gatt
机构: Delft University of Technology (代尔夫特理工大学); Utrecht University (乌得勒支大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages (including references), 5 figures and 6 tables

点击查看摘要

Abstract:Visual Storytelling is a challenging multimodal task between Vision Language, where the purpose is to generate a story for a stream of images. Its difficulty lies on the fact that the story should be both grounded to the image sequence but also narrative and coherent. The aim of this work is to balance between these aspects, by treating Visual Storytelling as a superset of Image Captioning, an approach quite different compared to most of prior relevant studies. This means that we firstly employ a vision-to-language model for obtaining captions of the input images, and then, these captions are transformed into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework, has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training time and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.
zh

[NLP-52] The Prompting Brain: Neurocognitive Markers of Expertise in Guiding Large Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)交互中人类提示工程(Prompt Engineering)专家认知能力的神经机制尚不明确的问题。其解决方案的关键在于通过跨横断面功能磁共振成像(fMRI)研究,识别专家与中级提示工程师在大脑功能连接性和关键认知网络功率频谱动态上的差异,发现左中颞回和左额极区域的功能连接增强以及认知网络活动模式的改变,从而揭示提示工程熟练度相关的神经标记。这一神经认知指标为理解人机协同中的认知适应机制提供了实证基础,并为设计更符合人类认知流程的智能接口提供理论支撑。

链接: https://arxiv.org/abs/2508.14869
作者: Hend Al-Khalifa,Raneem Almansour,Layan Abdulrahman Alhuasini,Alanood Alsaleh,Mohamad-Hani Temsah,Ashwag Rafea S Alruwaili
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt engineering has rapidly emerged as a critical skill for effective interaction with large language models (LLMs). However, the cognitive and neural underpinnings of this expertise remain largely unexplored. This paper presents findings from a cross-sectional pilot fMRI study investigating differences in brain functional connectivity and network activity between experts and intermediate prompt engineers. Our results reveal distinct neural signatures associated with higher prompt engineering literacy, including increased functional connectivity in brain regions such as the left middle temporal gyrus and the left frontal pole, as well as altered power-frequency dynamics in key cognitive networks. These findings offer initial insights into the neurobiological basis of prompt engineering proficiency. We discuss the implications of these neurocognitive markers in Natural Language Processing (NLP). Understanding the neural basis of human expertise in interacting with LLMs can inform the design of more intuitive human-AI interfaces, contribute to cognitive models of LLM interaction, and potentially guide the development of AI systems that better align with human cognitive workflows. This interdisciplinary approach aims to bridge the gap between human cognition and machine intelligence, fostering a deeper understanding of how humans learn and adapt to complex AI systems.
zh

[NLP-53] MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis

【速读】: 该论文旨在解决当前文本到语音(Text-to-Speech, TTS)模型在多语言场景下的局限性,尤其是对印地语等印度语言(Indic languages)支持不足的问题,从而提升这些语言群体获取信息的能力。解决方案的关键在于构建了一个名为MahaTTS-v2的多语言、多说话人TTS系统,其核心创新包括:利用Wav2Vec2.0提取语义特征以实现跨语言的语义表示,结合语言模型(Language Model, LM)进行文本到语义映射,并采用条件流模型(Conditional Flow Model, CFM)将语义信息转换为梅尔频谱图(melspectrogram),从而生成高质量语音。该方法在约20K小时的印度语言数据上训练,显著提升了对印地语等低资源语言的语音合成效果。

链接: https://arxiv.org/abs/2508.14049
作者: Jaskaran Singh,Amartya Roy Chowdhury,Raghav Prabhakar,Varshul C. W
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current Text-to-Speech models pose a multilingual challenge, where most of the models traditionally focus on English and European languages, thereby hurting the potential to provide access to information to many more people. To address this gap, we introduce MahaTTS-v2, a Multilingual Multi-speaker Text-To-Speech (TTS) system that has excellent multilingual expressive capabilities in Indic languages. The model has been trained on around 20K hours of data specifically focused on Indian languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction, and a Language Model (LM) for text-to-semantic modeling. Additionally, we have used a Conditional Flow Model (CFM) for semantics-to-mel-spectrogram generation. The experimental results indicate the effectiveness of the proposed approach over other frameworks. Our code is available at this https URL
zh

[NLP-54] RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition INTERSPEECH2025

【速读】: 该论文旨在解决基于大语言模型(Large Language Model, LLM)的自动语音识别(Automatic Speech Recognition, ASR)系统在实际应用中因词汇偏差或领域术语识别错误而导致的性能瓶颈问题。解决方案的关键在于引入一个实时检索增强生成(Retrieval-Augmented Generation, RAG)模块,该模块通过查询包含音频-文本对和领域术语的向量存储(vector store),动态获取与当前部分ASR假设(partial ASR hypothesis)最相关的上下文信息,并将这些检索结果与实时ASR输出进行融合,从而修正识别错误并提升最终LLM生成结果的准确性。

链接: https://arxiv.org/abs/2508.14048
作者: Pengcheng Wang,Sheng Li,Takahiro Shinozaki
机构: Tokyo Institute of Technology (东京工业大学); Keio University (庆应义塾大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: accepted at Interspeech2025 MLC-SLM Challenge workshop (task I system description)

点击查看摘要

Abstract:In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented generation (RAG) module on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are passed to the LLM, yielding improved responses.
zh

计算机视觉

[CV-0] Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds

【速读】:该论文旨在解决从极稀疏视角(仅前后视图)重建完整三维人体的问题,以降低用户创建3D数字人的门槛。其核心挑战在于如何在输入图像重叠区域极少的情况下保持几何一致性并恢复缺失信息。解决方案的关键在于:1)基于基础重建模型重新设计几何重建模块,使其能在有限视角下仍能预测一致的点云;2)引入增强算法补充缺失的颜色信息,并将完整带颜色的点云直接转换为3D高斯表示,从而提升渲染质量。该方法在THuman2.0和跨域数据集上实现了当前最优性能,且可在单张NVIDIA RTX 4090显卡上实现190ms的重建速度,同时支持低分辨率移动设备采集的图像。

链接: https://arxiv.org/abs/2508.14892
作者: Jia Lu,Taoran Yi,Jiemin Fang,Chen Yang,Chuiyun Wu,Wei Shen,Wenyu Liu,Qi Tian,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Huawei Inc. (华为公司); Shanghai Jiaotong University (上海交通大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Reconstructing 3D human bodies from sparse views has been an appealing topic, which is crucial for broadening the related applications. In this paper, we propose a quite challenging but valuable task to reconstruct the human body from only two images, i.e., the front and back view, which can largely lower the barrier for users to create their own 3D digital humans. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. We redesign a geometry reconstruction model based on foundation reconstruction models, trained on extensive human data, to predict consistent point clouds even when input images have scarce overlaps. Furthermore, an enhancement algorithm is applied to supplement the missing color information, and then the complete human point clouds with colors can be obtained, which are directly transformed into 3D Gaussians for better rendering quality. Experiments show that our method can reconstruct the entire human in 190 ms on a single NVIDIA RTX 4090, with two images at a resolution of 1024x1024, demonstrating state-of-the-art performance on the THuman2.0 and cross-domain datasets. Additionally, our method can complete human reconstruction even with images captured by low-cost mobile devices, reducing the requirements for data collection. Demos and code are available at this https URL.
zh

[CV-1] GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects

【速读】:该论文旨在解决交互式环境中刚性物体的数字孪生重建问题,特别是针对传统方法将几何与运动分离导致的重建流程复杂、扩展性差的问题。其关键解决方案是提出一种统一的表示方法,利用关节3D高斯(articulated 3D Gaussians)联合建模物体的几何结构与运动关系,从而提升运动分解的鲁棒性,并支持多达20个部件的复杂多段关节结构,显著优于以往在2–3个部件时即出现初始化脆弱性的方法。

链接: https://arxiv.org/abs/2508.14891
作者: Licheng Shen,Saining Zhang,Honghan Li,Peilin Yang,Zihao Huang,Zongzheng Zhang,Hao Zhao
机构: AIR(人工智能研究院); THU(清华大学); NTU(南洋理工大学); BIT(北京信息科技大学); HUST(华中科技大学); BAAI(北京智源研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2–3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.
zh

[CV-2] MS-CLR: Multi-Skeleton Contrastive Learning for Human Action Recognition

【速读】:该论文旨在解决基于骨架的动作识别中,现有对比学习方法依赖单一骨架表示(skeleton convention)而导致跨数据集泛化能力受限的问题。其核心解决方案是提出多骨架对比学习(Multi-Skeleton Contrastive Learning, MS-CLR),通过在同一条动作序列中提取多种骨架结构并对其姿态表示进行对齐,促使模型学习到结构不变性并捕捉多样化的解剖学线索,从而获得更具表达力和泛化性的特征表示。关键创新在于引入统一的表示方案以适配不同关节布局和尺度的骨架,并利用多骨架对比机制提升模型鲁棒性与性能。
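
【代码示意】:MS-CLR的核心是把"同一动作序列在不同骨架约定下的表示"当作正样本对做对比对齐。以下PyTorch草图仅为示意(非论文官方实现),编码器输出维度、批大小与温度系数均为假设。

```python
import torch
import torch.nn.functional as F

def multi_skeleton_info_nce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (N, D),同一批动作序列在两种骨架约定下的嵌入;
    (z_a[i], z_b[i]) 为正样本对,批内其余为负样本(对称 InfoNCE)。"""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature              # (N, N) 相似度矩阵
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# 用法示意:两个编码器(如适配不同关节布局的 ST-GCN)分别处理两种骨架
N, D = 32, 128
z_ntu  = torch.randn(N, D)   # 假设:25 关节约定下的序列嵌入
z_coco = torch.randn(N, D)   # 假设:17 关节约定下的序列嵌入
loss = multi_skeleton_info_nce(z_ntu, z_coco)
```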

链接: https://arxiv.org/abs/2508.14889
作者: Mert Kiray,Alvaro Ritter,Nassir Navab,Benjamin Busam
机构: Technical University of Munich (慕尼黑工业大学); 3Dwe.ai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive learning has gained significant attention in skeleton-based action recognition for its ability to learn robust representations from unlabeled data. However, existing methods rely on a single skeleton convention, which limits their ability to generalize across datasets with diverse joint structures and anatomical coverage. We propose Multi-Skeleton Contrastive Learning (MS-CLR), a general self-supervised framework that aligns pose representations across multiple skeleton conventions extracted from the same sequence. This encourages the model to learn structural invariances and capture diverse anatomical cues, resulting in more expressive and generalizable features. To support this, we adapt the ST-GCN architecture to handle skeletons with varying joint layouts and scales through a unified representation scheme. Experiments on the NTU RGB+D 60 and 120 datasets demonstrate that MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets.
zh

[CV-3] MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds

【速读】:该论文旨在解决现有3D物体重建方法在建模复杂几何结构时受限于领域特定语言(DSL)表达能力不足和小规模数据集的问题,从而难以实现高保真度的可编辑性与泛化能力。其解决方案的关键在于提出MeshCoder框架,通过构建一套丰富的Blender Python API来合成复杂几何体,并基于此构建大规模配对的“对象-代码”数据集,其中每个对象的代码按语义部分进行分解;随后训练一个跨模态大语言模型(LLM),将点云直接映射为可执行的Blender Python脚本,从而实现从3D形状到程序代码的高精度重构,并支持直观的几何与拓扑编辑及增强LLM在3D理解任务中的推理能力。

链接: https://arxiv.org/abs/2508.14879
作者: Bingquan Dai,Li Ray Luo,Qihong Tang,Jie Wang,Xinyu Lian,Hao Xu,Minghan Qin,Xudong Xu,Bo Dai,Haoqian Wang,Zhaoyang Lyu,Jiangmiao Pang
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tsinghua University (清华大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Beijing Institute of Technology (北京理工大学); HKUST(GZ) (香港科技大学(广州))
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point cloud into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.
zh

[CV-4] Lifespan Pancreas Morphology for Control vs Type 2 Diabetes using AI on Largescale Clinical Imaging

【速读】:该论文旨在解决胰腺形态学在生命周期中的变化规律及其在2型糖尿病(Type 2 Diabetes, T2D)中的异常特征识别问题,以期为早期检测T2D及相关胰腺疾病提供影像学依据。其解决方案的关键在于:首先,通过分析2533例临床腹部CT与MRI扫描数据,利用人工智能(AI)方法对胰腺进行自动分割并提取13项形态学特征;其次,比较CT与MRI两种成像模态在寿命范围内测量一致性,确定最优影像技术;最后,采用广义加性模型用于位置、尺度和形状的回归(GAMLSS)建模,在675名T2D患者与675名匹配对照组之间识别出10项显著偏离正常衰老趋势的形态学指标(经多重比较校正后p < 0.05),从而证实T2D患者胰腺体积更小且形态变化具有年龄和性别特异性。

链接: https://arxiv.org/abs/2508.14878
作者: Lucas W. Remedios,Chloe Cho,Trent M. Schwartz,Dingjie Su,Gaurav Rudravaram,Chenyu Gao,Aravind R. Krishnan,Adam M. Saunders,Michael E. Kim,Shunxing Bao,Thomas A. Lasko,Alvin C. Powers,Bennett A. Landman,John Virostko
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Understanding how the pancreas changes is critical for detecting deviations in type 2 diabetes and other pancreatic disease. We measure pancreas size and shape using morphological measurements from ages 0 to 90. Our goals are to 1) identify reliable clinical imaging modalities for AI-based pancreas measurement, 2) establish normative morphological aging trends, and 3) detect potential deviations in type 2 diabetes. Approach: We analyzed a clinically acquired dataset of 2533 patients imaged with abdominal CT or MRI. We resampled the scans to 3mm isotropic resolution, segmented the pancreas using automated methods, and extracted 13 morphological pancreas features across the lifespan. First, we assessed CT and MRI measurements to determine which modalities provide consistent lifespan trends. Second, we characterized distributions of normative morphological patterns stratified by age group and sex. Third, we used GAMLSS regression to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status to identify any deviations from normative aging associated with type 2 diabetes. Results: When adjusting for confounders, the aging trends for 10 of 13 morphological features were significantly different between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons corrections). Additionally, MRI appeared to yield different pancreas measurements than CT using our AI-based method. Conclusions: We provide lifespan trends demonstrating that the size and shape of the pancreas is altered in type 2 diabetes using 675 control patients and 675 diabetes patients. Moreover, our findings reinforce that the pancreas is smaller in type 2 diabetes. Additionally, we contribute a reference of lifespan pancreas morphology from a large cohort of non-diabetic control patients in a clinical setting.
zh

[CV-5] Squeezed Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)在噪声注入过程中忽略数据结构的问题,即传统方法通常采用各向同性的高斯噪声,未能充分利用数据分布的主成分方向信息。其解决方案的关键在于引入压缩扩散模型(Squeezed Diffusion Models, SDM),通过沿训练数据分布主成分方向各向异性地缩放噪声,从而更有效地增强信号-噪声比(SNR)。具体而言,SDM在主轴方向上调整噪声尺度,其中一种配置(Heisenberg扩散模型)在正交方向进行反向缩放以保持不确定性平衡,而另一种标准变体仅对主轴进行缩放。实验表明,适度的“反压缩”(antisqueezing)——即在主轴方向增加方差——能显著提升生成质量(FID降低最高达15%),并改善精度-召回前沿,证明了简单且数据感知的噪声调制策略可在不改变网络架构的前提下实现稳定性能提升。
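
【代码示意】:SDM的关键操作可以理解为"沿训练数据第一主成分方向对高斯噪声做各向异性缩放"。以下为一个极简PyTorch草图(假设数据已展平、仅缩放单一主轴,scale>1 对应文中的"反压缩"),并非论文官方实现。

```python
import torch

def squeezed_noise(x_flat, principal_axis, scale=1.15):
    """对各向同性高斯噪声沿主轴方向缩放标准差(scale>1 即"反压缩")。
    x_flat: (N, D) 张量;principal_axis: (D,) 单位向量。"""
    eps = torch.randn_like(x_flat)
    coef = eps @ principal_axis                    # 各样本噪声在主轴上的分量
    parallel = coef[:, None] * principal_axis      # (N, D)
    return eps + (scale - 1.0) * parallel          # 主轴方向方差变为 scale² 倍

# 主轴由训练集 PCA 第一主成分估计(示意)
data = torch.randn(1024, 3 * 32 * 32)              # 假设:展平的图像批
_, _, v = torch.pca_lowrank(data, q=1)
axis = v[:, 0] / v[:, 0].norm()
eps_squeezed = squeezed_noise(data, axis, scale=1.15)
# Heisenberg 变体可在正交方向再乘以 1/scale 以补偿主轴缩放(文中配置之一)
```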

链接: https://arxiv.org/abs/2508.14871
作者: Jyotirmai Singh,Samar Khanna,James Burgess
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Diffusion models typically inject isotropic Gaussian noise, disregarding structure in the data. Motivated by the way quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, we introduce Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component of the training distribution. As squeezing enhances the signal-to-noise ratio in physics, we hypothesize that scaling noise in a data-dependent manner can better assist diffusion models in learning important data features. We study two configurations: (i) a Heisenberg diffusion model that compensates the scaling on the principal axis with inverse scaling on orthogonal directions and (ii) a standard SDM variant that scales only the principal axis. Counterintuitively, on CIFAR-10/100 and CelebA-64, mild antisqueezing - i.e. increasing variance on the principal axis - consistently improves FID by up to 15% and shifts the precision-recall frontier toward higher recall. Our results demonstrate that simple, data-aware noise shaping can deliver robust generative gains without architectural changes.
zh

[CV-6] EventSSEG: Event-driven Self-Supervised Segmentation with Probabilistic Attention

【速读】:该论文旨在解决基于帧的摄像头在自动驾驶中实现低延迟和低计算开销的路面临界分割(road segmentation)难题,同时利用事件相机(event camera)低功耗感知的优势。其关键解决方案是提出EventSSEG方法,该方法完全依赖事件数据进行计算(event-only computing),并引入概率注意力机制(probabilistic attention mechanism),从而避免了传统预训练权重迁移所需的大量标注数据;此外,通过事件自监督学习(event-based self-supervised learning)有效缓解了标注事件数据稀缺的问题,显著提升了模型性能与实用性。

链接: https://arxiv.org/abs/2508.14856
作者: Lakshmi Annamalai,Chetan Singh Thakur
机构: Defence Research and Development Organization (国防研究与发展组织); Indian Institute of Science (印度科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Road segmentation is pivotal for autonomous vehicles, yet achieving low latency and low compute solutions using frame based cameras remains a challenge. Event cameras offer a promising alternative. To leverage their low power sensing, we introduce EventSSEG, a method for road segmentation that uses event only computing and a probabilistic attention mechanism. Event only computing poses a challenge in transferring pretrained weights from the conventional camera domain, requiring abundant labeled data, which is scarce. To overcome this, EventSSEG employs event-based self supervised learning, eliminating the need for extensive labeled data. Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state of the art performance with minimal labeled events. This approach maximizes event cameras capabilities and addresses the lack of labeled events.
zh

[CV-7] TransLight: Image-Guided Customized Lighting Control with Generative Decoupling

【速读】:该论文旨在解决现有光照编辑方法难以同时实现定制化光效控制与保持内容完整性的难题,尤其在将复杂光效从参考图像迁移至用户指定目标图像这一挑战性任务中表现不足。解决方案的关键在于提出一种名为TransLight的新框架,其核心创新是通过“生成解耦”(Generative Decoupling)策略,利用两个微调后的扩散模型精确分离图像内容与光效,从而构建了一个百万级的图像-内容-光效三元组数据集;在此基础上,采用IC-Light作为生成模型,并引入参考光照图像作为额外条件信号进行训练,实现了高保真、高自由度的光效迁移,显著提升了光照控制的灵活性与自然性。

链接: https://arxiv.org/abs/2508.14814
作者: Zongming Li,Lianghui Zhu,Haocheng Shen,Longjin Ran,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 9 figures

点击查看摘要

Abstract:Most existing illumination-editing approaches fail to simultaneously provide customized control of light effects and preserve content integrity. This makes them less effective for practical lighting stylization requirements, especially in the challenging task of transferring complex light effects from a reference image to a user-specified target image. To address this problem, we propose TransLight, a novel framework that enables high-fidelity and high-freedom transfer of light effects. Extracting the light effect from the reference image is the most critical and challenging step in our method. The difficulty lies in the complex geometric structure features embedded in light effects that are highly coupled with content in real-world scenarios. To achieve this, we first present Generative Decoupling, where two fine-tuned diffusion models are used to accurately separate image content and light effects, generating a newly curated, million-scale dataset of image-content-light triplets. Then, we employ IC-Light as the generative model and train our model with our triplets, injecting the reference lighting image as an additional conditioning signal. The resulting TransLight model enables customized and natural transfer of diverse light effects. Notably, by thoroughly disentangling light effects from reference images, our generative decoupling strategy endows TransLight with highly flexible illumination control. Experimental results establish TransLight as the first method to successfully transfer light effects across disparate images, delivering more customized illumination control than existing techniques and charting new directions for research in illumination harmonization and editing.
zh

[CV-8] Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

【速读】:该论文旨在解决视频-语言检索中高精度与低训练成本之间的矛盾问题,以及现有方法对视频与文本细粒度信息挖掘不足的局限性。其关键解决方案包括:(1)提出一种粗到细的目标函数,结合对比学习(contrastive learning)与匹配学习(matching learning),以增强视频与文本语义对齐;(2)设计基于帧-词相似性分析的粒度感知表示模块(Granularity-Aware Representation module),用于提取细粒度特征;(3)发现原始字幕中的关键词重复现象(称为“Repetition”)有助于提升检索性能,并据此提出一种无需额外预训练的推理阶段投票机制与新的匹配熵(Matching Entropy)指标,显著改善最终检索效果。
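
【代码示意】:下面用PyTorch给出"关键词重复 + 匹配熵加权投票"推理流程的一个粗略草图。匹配熵在此被实现为检索相似度softmax分布的香农熵(示意性定义,未必与论文原式一致),温度与权重形式均为假设。

```python
import torch
import torch.nn.functional as F

def matching_entropy(sims, tau=0.05):
    """sims: (M,),某个查询文本与 M 个候选视频的相似度;熵越低表示匹配越确定。"""
    p = F.softmax(sims / tau, dim=0)
    return -(p * p.log()).sum()

def vote_retrieval(sims_per_variant):
    """sims_per_variant: (V, M),V 个文本变体(原始caption及其关键词重复版本)。
    以匹配熵的倒数为权重对各变体的相似度加权,再产生最终检索排序。"""
    weights = torch.stack([1.0 / (matching_entropy(s) + 1e-6)
                           for s in sims_per_variant])
    weights = weights / weights.sum()
    fused = (weights[:, None] * sims_per_variant).sum(0)   # (M,)
    return fused.argsort(descending=True)

variants = torch.randn(3, 100)   # 假设:1 条原始查询 + 2 条"重复关键词"变体
ranking = vote_retrieval(variants)
```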

链接: https://arxiv.org/abs/2508.14812
作者: Haoyu Zhao,Jiaxi Gu,Shicong Wang,Xing Zhang,Hang Xu,Zuxuan Wu,Yu-Gang Jiang
机构: Fudan University (复旦大学); Tencent (腾讯); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as “Repetition”, can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
zh

[CV-9] nker: Diffusions Gift to 3D–Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

【速读】:该论文旨在解决现有3D编辑方法依赖大量场景特定微调(per-scene finetuning)以实现多视角一致性的问题,从而限制了其通用性和可扩展性。解决方案的关键在于利用预训练扩散模型(pretrained diffusion models)中隐含的三维感知能力(latent 3D awareness),无需任何场景级微调即可实现高质量、多视角一致的3D编辑。具体而言,Tinker框架包含两个核心组件:一是参考驱动的多视角编辑器(Referring multi-view editor),支持精确且跨视角一致的编辑操作;二是任意视角到视频的合成器(Any-view-to-video synthesizer),借助视频扩散模型中的时空先验完成稀疏输入下的高保真场景补全与新视角生成。这一设计显著降低了3D内容创作的门槛,并实现了零样本(zero-shot)下的通用3D编辑性能。

链接: https://arxiv.org/abs/2508.14811
作者: Canyu Zhao,Xiaoman Li,Tianjian Feng,Zhiyue Zhao,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL

点击查看摘要

Abstract:We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: this https URL
zh

[CV-10] DINOv3 with Test-Time Training for Medical Image Registration

【速读】:该论文旨在解决医学图像配准(Medical Image Registration)中学习型方法对大规模训练数据依赖性强的问题,从而限制了其在临床场景中的实际应用。解决方案的关键在于提出一种无需训练的配准流程,该流程利用冻结的DINOv3编码器提取特征,并在测试阶段通过优化变形场(deformation field)在特征空间中实现精确配准,从而在不依赖额外训练的情况下,实现高精度且形变规则的图像对齐。
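
【代码示意】:该方法的"测试时优化"可以用一个很小的PyTorch循环说明:冻结编码器产生固定/移动图像的特征图后,直接对位移场做梯度下降,使变形后的移动特征逼近固定特征。以下为2D简化草图(实际任务为3D体数据),损失与正则形式均为假设。

```python
import torch
import torch.nn.functional as F

def register_features(f_fix, f_mov, iters=200, lr=1e-2, lam=1.0):
    """f_fix, f_mov: (1, C, H, W),来自冻结编码器(如 DINOv3)的特征图。
    测试时直接优化 2D 位移场,使变形后的移动特征逼近固定特征。"""
    B, C, H, W = f_fix.shape
    disp = torch.zeros(B, H, W, 2, requires_grad=True)       # 待优化的位移场
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    identity = torch.stack([xs, ys], dim=-1)[None]           # grid_sample 的 (x, y) 约定
    opt = torch.optim.Adam([disp], lr=lr)
    for _ in range(iters):
        warped = F.grid_sample(f_mov, identity + disp,
                               align_corners=True, padding_mode="border")
        sim = F.mse_loss(warped, f_fix)                      # 特征空间相似性项
        smooth = (disp[:, 1:] - disp[:, :-1]).pow(2).mean() + \
                 (disp[:, :, 1:] - disp[:, :, :-1]).pow(2).mean()
        loss = sim + lam * smooth                            # 平滑正则保证形变规则
        opt.zero_grad()
        loss.backward()
        opt.step()
    return disp.detach()

# 用法示意:对冻结编码器输出的特征图(假设 C=768)做配准
f_fix, f_mov = torch.randn(1, 768, 32, 32), torch.randn(1, 768, 32, 32)
field = register_features(f_fix, f_mov, iters=50)
```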

链接: https://arxiv.org/abs/2508.14809
作者: Shansong Wang,Mojtaba Safari,Mingzhe Hu,Qiang Li,Chih-Wei Chang,Richard LJ Qiu,Xiaofeng Yang
机构: Emory University (埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prior medical image registration approaches, particularly learning-based methods, often require large amounts of training data, which constrains clinical adoption. To overcome this limitation, we propose a training-free pipeline that relies on a frozen DINOv3 encoder and test-time optimization of the deformation field in feature space. Across two representative benchmarks, the method is accurate and yields regular deformations. On Abdomen MR-CT, it attained the best mean Dice score (DSC) of 0.790 together with the lowest 95th percentile Hausdorff Distance (HD95) of 4.9±5.0 and the lowest standard deviation of Log-Jacobian (SDLogJ) of 0.08±0.02. On ACDC cardiac MRI, it improves mean DSC to 0.769 and reduces SDLogJ to 0.11 and HD95 to 4.8, a marked gain over the initial alignment. The results indicate that operating in a compact foundation feature space at test time offers a practical and general solution for clinical registration without additional training.
zh

[CV-11] MF-LPR²: Multi-Frame License Plate Image Restoration and Recognition using Optical Flow

【速读】:该论文旨在解决行车记录仪图像中低质量车牌(license plate)区域因分辨率低、运动模糊和眩光等问题导致的识别困难问题。现有依赖预训练先验知识的生成式模型在恢复此类图像时易引入严重伪影和失真,难以保证识别准确性。解决方案的关键在于提出一种多帧车牌复原与识别框架MF-LPR²,通过相邻帧间的对齐与聚合来消除图像歧义,而非依赖预训练知识;其核心创新包括:利用先进的光流估计算法实现高精度帧对齐,并设计了基于车牌图像序列时空一致性的错误光流检测与校正机制,从而显著提升图像质量和识别准确率。
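
【代码示意】:多帧对齐-聚合的骨架可以用OpenCV直观表达:估计参考帧到相邻帧的光流,反向采样把相邻帧扭正到参考帧坐标系,再做聚合。以下草图用Farneback稠密光流与均值聚合代替论文中更强的光流估计器与时空一致性校正,仅为示意。

```python
import cv2
import numpy as np

def align_and_aggregate(frames, ref_idx=None):
    """frames: 灰度车牌图像序列 list[np.uint8, HxW]。
    将相邻帧经光流对齐到参考帧后取均值,以多帧聚合消除模糊/眩光歧义。"""
    ref_idx = len(frames) // 2 if ref_idx is None else ref_idx
    ref = frames[ref_idx]
    h, w = ref.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    acc = ref.astype(np.float32)
    for i, frame in enumerate(frames):
        if i == ref_idx:
            continue
        flow = cv2.calcOpticalFlowFarneback(ref, frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # 按 ref->frame 的光流对 frame 反向采样,将其对齐到参考帧坐标系
        warped = cv2.remap(frame, grid_x + flow[..., 0], grid_y + flow[..., 1],
                           interpolation=cv2.INTER_LINEAR)
        acc += warped.astype(np.float32)
    return (acc / len(frames)).astype(np.uint8)
```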

链接: https://arxiv.org/abs/2508.14797
作者: Kihyun Na,Junseok Oh,Youngkwan Cho,Bumjin Kim,Sungmin Cho,Jinyoung Choi,Injung Kim
机构: Handong Global University (韩东国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication in Computer Vision and Image Understanding (CVIU), 2025

点击查看摘要

Abstract:License plate recognition (LPR) is important for traffic law enforcement, crime investigation, and surveillance. However, license plate areas in dash cam images often suffer from low resolution, motion blur, and glare, which make accurate recognition challenging. Existing generative models that rely on pretrained priors cannot reliably restore such poor-quality images, frequently introducing severe artifacts and distortions. To address this issue, we propose a novel multi-frame license plate restoration and recognition framework, MF-LPR², which addresses ambiguities in poor-quality images by aligning and aggregating neighboring frames instead of relying on pretrained knowledge. To achieve accurate frame alignment, we employ a state-of-the-art optical flow estimator in conjunction with carefully designed algorithms that detect and correct erroneous optical flow estimations by leveraging the spatio-temporal consistency inherent in license plate image sequences. Our approach enhances both image quality and recognition accuracy while preserving the evidential content of the input images. In addition, we constructed a novel Realistic LPR (RLPR) dataset to evaluate MF-LPR². The RLPR dataset contains 200 pairs of low-quality license plate image sequences and high-quality pseudo ground-truth images, reflecting the complexities of real-world scenarios. In experiments, MF-LPR² outperformed eight recent restoration models in terms of PSNR, SSIM, and LPIPS by significant margins. In recognition, MF-LPR² achieved an accuracy of 86.44%, outperforming both the best single-frame LPR (14.04%) and the multi-frame LPR (82.55%) among the eleven baseline models. The results of ablation studies confirm that our filtering and refinement algorithms significantly contribute to these improvements.
zh

[CV-12] Adversarial Hospital-Invariant Feature Learning for WSI Patch Classification

【速读】:该论文旨在解决病理基础模型(Pathology Foundation Models, PFMs)在多中心临床部署中因医院来源特征差异导致的域偏差(domain bias)问题,即模型可能无意中学习到与特定医院相关的图像特征,从而影响其在新医院数据上的泛化能力。解决方案的关键在于提出一种轻量级对抗性框架,通过引入一个可训练适配器(adapter)和一个域分类器,并利用梯度反转层(Gradient Reversal Layer, GRL)进行连接,使冻结编码器生成的任务判别性但域不变的表示,从而有效移除潜在的医院特异性特征,同时保持或提升疾病分类性能,尤其在跨医院(未见医院)场景下表现显著改善。
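
【代码示意】:论文框架的骨架(冻结编码器 → 可训练adapter → 疾病分类头 + 经GRL连接的医院判别器)可以用几十行PyTorch还原。以下实现中特征维度、医院数与头部结构均为假设;GRL本身为标准写法。

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """梯度反转层:前向恒等,反向把梯度乘以 -lambda。"""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

def grl(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

feat_dim, n_hosp = 768, 5                        # 假设的特征维度与医院数
adapter = nn.Linear(feat_dim, feat_dim)          # 可训练适配器
disease_head = nn.Linear(feat_dim, 2)            # 任务判别头
hosp_head = nn.Linear(feat_dim, n_hosp)          # 医院判别器

frozen_feat = torch.randn(16, feat_dim)          # 来自冻结 PFM 的 patch 特征
z = adapter(frozen_feat)
loss = nn.functional.cross_entropy(disease_head(z), torch.randint(0, 2, (16,))) \
     + nn.functional.cross_entropy(hosp_head(grl(z)), torch.randint(0, n_hosp, (16,)))
loss.backward()   # adapter 同时收到"判别疾病"与"混淆医院来源"两股梯度
```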

链接: https://arxiv.org/abs/2508.14779
作者: Mengliang Zhang,Jacob M. Luber
机构: The University of Texas at Arlington (德克萨斯大学阿灵顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 8 pages,6 figures

点击查看摘要

Abstract:Pathology foundation models (PFMs) have demonstrated remarkable potential in whole-slide image (WSI) diagnosis. However, pathology images from different hospitals often vary due to differences in scanning hardware and preprocessing styles, which may lead PFMs to inadvertently learn hospital-specific features, posing risks for clinical deployment. In this work, we present the first systematic study of domain bias in PFMs arising from hospital source characteristics. Specifically, we (1) construct a pipeline for quantifying domain bias in PFMs, (2) evaluate and compare the performance of multiple models, and (3) propose a lightweight adversarial framework that removes latent hospital-specific features from frozen representations without modifying the encoder itself. By introducing a trainable adapter and a domain classifier connected through a gradient reversal layer (GRL), our method learns task-discriminative yet domain-invariant representations. Experiments on multi-center histopathology datasets demonstrate that our approach substantially reduces domain predictability while maintaining or even improving disease classification performance, particularly in out-of-domain (unseen hospital) scenarios. Further analyses, including hospital detection and feature space visualization, confirm the effectiveness of our method in mitigating hospital bias. We will provide our code based on acceptance.
zh

[CV-13] 6-DoF Object Tracking with Event-based Optical Flow and Frames

【速读】:该论文旨在解决高速运动物体在6自由度(6-DoF)空间中实时位姿跟踪的问题,尤其针对传统RGB相机因帧率限制和运动模糊导致的性能下降。解决方案的关键在于融合事件相机(event camera)与RGB相机的优势:利用事件相机高时间分辨率和低延迟特性,设计一种基于事件的光流算法以实现物体6-DoF速度追踪;同时结合RGB图像驱动的全局位姿估计算法,在低频下提供精确位姿估计;通过将高速运动的速度信息与低频位姿估计进行融合,从而实现对高速移动物体的稳定、准确的6-DoF位姿跟踪。
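
【代码示意】:"高频速度积分 + 低频全局位姿校正"的融合可以写成十几行的递推。以下scipy/numpy草图用互补滤波思路示意,校正系数与姿态平均方式均为假设,并非论文融合器的精确形式。

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def fuse_step(p, q, v, w, dt, pose_obs=None, alpha=0.1):
    """单步位姿递推。p: (3,) 位置;q: (4,) 四元数;v, w: 事件光流估计的
    线速度/角速度;pose_obs: 可选的低频 RGB 全局位姿 (p_obs, q_obs)。"""
    p = p + v * dt                                   # 高频速度积分
    q = (R.from_rotvec(w * dt) * R.from_quat(q)).as_quat()
    if pose_obs is not None:                         # 低频全局位姿到达时做小步校正
        p_obs, q_obs = pose_obs
        p = (1 - alpha) * p + alpha * p_obs
        q = R.from_quat(np.stack([q, q_obs])).mean(
            weights=[1 - alpha, alpha]).as_quat()
    return p, q

p, q = np.zeros(3), np.array([0., 0., 0., 1.])
for k in range(100):                                 # 假设 1 kHz 的事件速度流
    obs = (np.array([0.1, 0., 0.]), q) if k % 33 == 0 else None   # ~30 Hz 全局位姿
    p, q = fuse_step(p, q, v=np.array([1., 0., 0.]),
                     w=np.array([0., 0., 0.5]), dt=1e-3, pose_obs=obs)
```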

链接: https://arxiv.org/abs/2508.14776
作者: Zhichao Li,Arren Glover,Chiara Bartolozzi,Lorenzo Natale
机构: Istituto Italiano di Tecnologia (意大利技术研究院); University of Genoa (热那亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high-speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency and high dynamic range, that can potentially overcome the impacts of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB based global object pose estimator for 6-DoF pose tracking of objects at high-speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. By integrating the tracked object 6-DoF velocity with low frequency estimated pose from the global pose estimator, the method can track pose when objects move at high-speed. The proposed algorithm is tested and validated on both synthetic and real world data, demonstrating its effectiveness, especially in high-speed motion scenarios.
zh

[CV-14] Fusing Monocular RGB Images with AIS Data to Create a 6D Pose Estimation Dataset for Marine Vessels

【速读】:该论文旨在解决海洋船舶6D姿态估计数据集构建中依赖单一自动识别系统(AIS)信息所带来的局限性问题,如设备可靠性差、数据被篡改及传输延迟等。其关键解决方案是通过融合单目RGB图像与AIS数据,利用YOLOX-X目标检测模型在图像空间中定位船舶,并采用透视n点(Perspective-n-Point, PnP)变换方法将AIS坐标精准映射到图像坐标系,从而生成带有3D边界框标注的6D姿态数据。该方法显著降低了投影误差,且无需人工标注即可构建大规模数据集,最终形成了公开可用的BONK-pose数据集,为6D姿态估计网络的训练与评估提供了支持。
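
【代码示意】:AIS坐标到图像坐标的映射核心是一次PnP求解。以下OpenCV草图假设已有相机内参,以及若干"世界坐标(由AIS经纬度换算的局部平面坐标)↔ 图像像素"对应点;所有数值均为虚构示例。

```python
import cv2
import numpy as np

K = np.array([[1200., 0., 960.],
              [0., 1200., 540.],
              [0., 0., 1.]])                       # 假设的相机内参
dist = np.zeros(5)                                 # 假设无镜头畸变
world_pts = np.array([[0., 0., 0.], [50., 0., 0.],
                      [50., 30., 0.], [0., 30., 0.]])   # 局部平面坐标(米)
image_pts = np.array([[400., 700.], [900., 680.],
                      [950., 500.], [420., 520.]])      # 对应像素坐标

ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, dist)
# 之后即可把任意 AIS 报文位置投影到图像平面,与 YOLOX 检测框匹配
proj, _ = cv2.projectPoints(np.array([[25., 15., 0.]]), rvec, tvec, K, dist)
print(proj.squeeze())   # 该船 AIS 位置对应的像素坐标
```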

链接: https://arxiv.org/abs/2508.14767
作者: Fabian Holst,Emre Gülsoylu,Simone Frintrop
机构: University of Hamburg(汉堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Author version of the submission to the IEEE Journal of Oceanic Engineering

点击查看摘要

Abstract:The paper presents a novel technique for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with Automatic Identification System (AIS) data. The proposed technique addresses the limitations of relying purely on AIS for location information, caused by issues like equipment reliability, data manipulation, and transmission delays. By combining vessel detections from monocular RGB images, obtained using an object detection network (YOLOX-X), with AIS messages, the technique generates 3D bounding boxes that represent the vessels’ 6D poses, i.e. spatial and rotational dimensions. The paper evaluates different object detection models to locate vessels in image space. We also compare two transformation methods (homography and Perspective-n-Point) for aligning AIS data with image coordinates. The results of our work demonstrate that the Perspective-n-Point (PnP) method achieves a significantly lower projection error compared to homography-based approaches used before, and the YOLOX-X model achieves a mean Average Precision (mAP) of 0.80 at an Intersection over Union (IoU) threshold of 0.5 for relevant vessel classes. We show indication that our approach allows the creation of a 6D pose estimation dataset without needing manual annotation. Additionally, we introduce the Boats on Nordelbe Kehrwieder (BONK-pose), a publicly available dataset comprising 3753 images with 3D bounding box annotations for pose estimation, created by our data fusion approach. This dataset can be used for training and evaluating 6D pose estimation networks. In addition we introduce a set of 1000 images with 2D bounding box annotations for ship detection from the same scene.
zh

[CV-15] Improved Mapping Between Illuminations and Sensors for RAW Images

【速读】:该论文旨在解决RAW图像在深度学习应用中因传感器特性和光照条件差异而导致的数据采集困难问题。由于RAW图像具有强颜色偏差(color cast),且其特性依赖于特定传感器和照明环境,构建适用于深度学习的RAW数据集需在不同传感器下对大量场景进行重复拍摄,成本高昂且难以覆盖多样光照条件。解决方案的关键在于引入首个包含390种照明条件、4种相机和18个场景的高质量RAW图像数据集,并提出一种轻量级神经网络方法,实现跨传感器和跨光照的映射(illumination and sensor mapping),从而显著降低数据采集负担,并提升下游任务(如神经ISP训练)的性能表现。

链接: https://arxiv.org/abs/2508.14730
作者: Abhijith Punnappurath,Luxi Zhao,Hoang Le,Abdelrahman Abdelhamed,SaiKiran Kumar Tedla,Michael S. Brown
机构: Samsung Electronics(三星电子); York University (约克大学); Google Research(谷歌研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RAW images are unprocessed camera sensor output with sensor-specific RGB values based on the sensor’s color filter spectral sensitivities. RAW images also incur strong color casts due to the sensor’s response to the spectral properties of scene illumination. The sensor- and illumination-specific nature of RAW images makes it challenging to capture RAW datasets for deep learning methods, as scenes need to be captured for each sensor and under a wide range of illumination. Methods for illumination augmentation for a given sensor and the ability to map RAW images between sensors are important for reducing the burden of data capture. To explore this problem, we introduce the first-of-its-kind dataset comprising carefully captured scenes under a wide range of illumination. Specifically, we use a customized lightbox with tunable illumination spectra to capture several scenes with different cameras. Our illumination and sensor mapping dataset has 390 illuminations, four cameras, and 18 scenes. Using this dataset, we introduce a lightweight neural network approach for illumination and sensor mapping that outperforms competing methods. We demonstrate the utility of our approach on the downstream task of training a neural ISP. Link to project page: this https URL.
zh

[CV-16] Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving

【速读】:该论文旨在解决自动驾驶中的安全性问题,特别是针对训练阶段未见的未知物体(unknown objects)和突发驾驶场景的检测难题。传统视频语义与全景分割方法依赖已知类别,难以识别新类;而基于大语言模型的视觉定位方法计算开销大,不适合实时应用。其解决方案的关键在于提出一种高效的多尺度视频变换器(multiscale video transformer),通过端到端训练实现无需光学流(optical flow)的类别无关分割(class-agnostic segmentation)。该方法采用多阶段多尺度查询-记忆解码机制(multi-stage multiscale query-memory decoding)和尺度特定的随机丢弃令牌策略(scale-specific random drop-token),在保持高分辨率时空特征的同时提升效率,借助共享可学习的记忆模块(shared, learnable memory module)避免特征压缩,从而在DAVIS’16、KITTI和Cityscapes数据集上显著优于多尺度基线模型,并展现出在安全关键型机器人系统中实时稠密预测的潜力。
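
【代码示意】:"尺度特定的随机丢弃令牌"可以写成一个很短的采样函数:按每个尺度设定的比例随机保留部分token。以下PyTorch草图中的丢弃比例调度纯属假设。

```python
import torch

def random_drop_tokens(tokens, drop_ratio):
    """tokens: (B, N, C)。按比例随机保留 token,训练时节省显存与计算。"""
    B, N, C = tokens.shape
    keep = max(1, int(N * (1.0 - drop_ratio)))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :keep]
    return tokens.gather(1, idx[..., None].expand(B, keep, C))

# 假设的尺度调度:分辨率越高(token 越多)的尺度丢弃越多
feats_by_scale = [torch.randn(2, n, 256) for n in (4096, 1024, 256)]
drop_by_scale = [0.5, 0.25, 0.0]
kept = [random_drop_tokens(f, r) for f, r in zip(feats_by_scale, drop_by_scale)]
```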

链接: https://arxiv.org/abs/2508.14729
作者: Leila Cheshmi,Mennatullah Siam
机构: Ontario Tech University (安大略理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures, 1 table

点击查看摘要

Abstract:Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS’16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
zh

[CV-17] GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting

【速读】:该论文旨在解决3D Gaussian Splatting在极端新视角或部分观测区域生成高质量渲染图像时的局限性,以及扩散模型因依赖文本提示且缺乏场景特定信息感知能力而导致3D重建精度不足的问题。其解决方案的关键在于提出GSFix3D框架,核心组件GSFixer是一种通过定制微调协议训练得到的潜在扩散模型,能够融合网格和3D高斯表示,将预训练生成模型适配至不同重建方法所生成的环境与伪影类型中,从而实现对未见相机位姿下新视角的鲁棒修复;同时引入随机掩码增强策略以提升缺失区域的合理补全能力。实验表明,该方法在挑战性基准上达到当前最优性能,且仅需少量场景特定微调即可实现高质量重建。

链接: https://arxiv.org/abs/2508.14717
作者: Jiaxin Wei,Stefan Leutenegger,Simon Schaefer
机构: Technical University of Munich (慕尼黑工业大学); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent developments in 3D Gaussian Splatting have significantly enhanced novel view synthesis, yet generating high-quality renderings from extreme novel viewpoints or partially observed regions remains challenging. Meanwhile, diffusion models exhibit strong generative capabilities, but their reliance on text prompts and lack of awareness of specific scene information hinder accurate 3D reconstruction tasks. To address these limitations, we introduce GSFix3D, a novel framework that improves the visual fidelity in under-constrained regions by distilling prior knowledge from diffusion models into 3D representations, while preserving consistency with observed scene details. At its core is GSFixer, a latent diffusion model obtained via our customized fine-tuning protocol that can leverage both mesh and 3D Gaussians to adapt pretrained generative models to a variety of environments and artifact types from different reconstruction methods, enabling robust novel view repair for unseen camera poses. Moreover, we propose a random mask augmentation strategy that empowers GSFixer to plausibly inpaint missing regions. Experiments on challenging benchmarks demonstrate that our GSFix3D and GSFixer achieve state-of-the-art performance, requiring only minimal scene-specific fine-tuning on captured data. Real-world test further confirms its resilience to potential pose errors. Our code and data will be made publicly available. Project page: this https URL.
zh

[CV-18] Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models

【速读】:该论文旨在解决当前视觉基础模型(Vision Foundation Models, VFMs)依赖大规模高质量标注数据进行训练所带来的瓶颈问题,尤其是在资源受限机构中难以获取海量数据和高端GPU设备的现实困境。其核心挑战在于如何高效利用已有的开源预训练模型(尤其是领域特定模型)来构建具备通用性的视觉基础模型,同时避免因模型分布差异导致的知识迁移不平衡问题。解决方案的关键在于提出一种基于模型驱动的方法,通过联合知识迁移与保留机制实现多教师模型在共享潜在空间中的统一,从而缓解因分布差异引发的“不平衡迁移”问题;此外,引入一种知识保留策略,以一个通用教师模型作为知识库,借助适配模块(adapter module)整合其余特定任务教师模型的知识,最终在不依赖大量标注数据的情况下构建出性能优越、支持多下游任务的视觉基础模型。
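
【代码示意】:"通过adapter把多个教师映射到共享潜在空间再做对齐"这一思路可用如下PyTorch草图表达;adapter结构、余弦蒸馏损失与各维度均为假设,仅示意"多教师对齐",并未覆盖论文完整的知识保留策略。

```python
import torch
from torch import nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """把各教师特征映射到共享潜在空间的轻量模块(假设实现)。"""
    def __init__(self, dim_in, dim_shared):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_in, dim_shared),
                                  nn.GELU(), nn.Linear(dim_shared, dim_shared))
    def forward(self, x):
        return self.proj(x)

def inheritance_loss(student_feat, teacher_feats, adapters):
    """在共享空间以余弦距离对齐学生与多个教师(示意性蒸馏目标)。"""
    loss = 0.0
    s = F.normalize(student_feat, dim=-1)
    for f, ad in zip(teacher_feats, adapters):
        t = F.normalize(ad(f), dim=-1)
        loss = loss + (1 - (s * t).sum(-1)).mean()
    return loss / len(teacher_feats)

student = torch.randn(8, 512)                            # 学生特征
teachers = [torch.randn(8, 768), torch.randn(8, 1024)]   # 不同教师的特征
adapters = [Adapter(768, 512), Adapter(1024, 512)]
loss = inheritance_loss(student, teachers, adapters)
```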

链接: https://arxiv.org/abs/2508.14707
作者: Jiabo Huang,Chen Chen,Lingjuan Lyu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy to take a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.
zh

[CV-19] GeMS: Efficient Gaussian Splatting for Extreme Motion Blur

【速读】:该论文旨在解决现有三维高斯溅射(3D Gaussian Splatting, 3DGS)方法在极端运动模糊(extreme motion blur)图像下无法直接重建场景的问题。传统方法如ExBluRF和Deblur-GS依赖清晰图像进行相机位姿估计与点云生成,而基于COLMAP初始化的方法(如BAD-Gaussians)则因严重模糊导致特征匹配不可靠而失效。其解决方案的关键在于提出GeMS框架:首先通过VGGSfM(基于深度学习的Structure-from-Motion)直接从模糊输入中估计位姿并生成点云;其次采用3DGS-MCMC方法将高斯分布视为概率采样,避免启发式稀疏化与修剪步骤;最后联合优化相机轨迹与高斯参数以实现稳定重建。为进一步提升性能,还引入GeMS-E版本,利用事件相机的双积分(Event-based Double Integral, EDI)去模糊技术逐步重构更清晰图像,从而改善初始位姿与点云质量,最终实现从极端模糊图像出发的高质量3DGS重建。
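
【代码示意】:GeMS-E中用于渐进细化的EDI去模糊有简洁的离散形式:模糊帧等于潜在清晰帧乘以事件指数积分的平均。以下numpy草图按该公式实现,对比度阈值c与输入归一化方式为假设。

```python
import numpy as np

def edi_deblur(blurred, event_polarity_sums, c=0.2):
    """事件双重积分(EDI)的离散化草图。
    blurred: (H, W) 归一化模糊帧;event_polarity_sums: (T, H, W),
    第 t 项为从参考时刻到时刻 t 的事件极性累积 E(t)。
    由 B = (L/T) * Σ_t exp(c·E(t)) 反解出参考时刻的潜在清晰帧 L。"""
    T = event_polarity_sums.shape[0]
    denom = np.exp(c * event_polarity_sums).sum(axis=0)
    return blurred * T / np.clip(denom, 1e-6, None)

# 用法示意:64 个事件累积切片 + 一帧模糊图
E = np.random.randn(64, 128, 128) * 0.1
sharp = edi_deblur(np.random.rand(128, 128), E)
```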

链接: https://arxiv.org/abs/2508.14682
作者: Gopi Raju Matta,Trisha Reddypalli,Vemunuri Divya Madhuri,Kaushik Mitra
机构: IIT Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce GeMS, a framework for 3D Gaussian Splatting (3DGS) designed to handle severely motion-blurred images. State-of-the-art deblurring methods for extreme blur, such as ExBluRF, as well as Gaussian Splatting-based approaches like Deblur-GS, typically assume access to sharp images for camera pose estimation and point cloud generation, an unrealistic assumption. Methods relying on COLMAP initialization, such as BAD-Gaussians, also fail due to unreliable feature correspondences under severe blur. To address these challenges, we propose GeMS, a 3DGS framework that reconstructs scenes directly from extremely blurred images. GeMS integrates: (1) VGGSfM, a deep learning-based Structure-from-Motion pipeline that estimates poses and generates point clouds directly from blurred inputs; (2) 3DGS-MCMC, which enables robust scene initialization by treating Gaussians as samples from a probability distribution, eliminating heuristic densification and pruning; and (3) joint optimization of camera trajectories and Gaussian parameters for stable reconstruction. While this pipeline produces strong results, inaccuracies may remain when all inputs are severely blurred. To mitigate this, we propose GeMS-E, which integrates a progressive refinement step using events: (4) Event-based Double Integral (EDI) deblurring restores sharper images that are then fed into GeMS, improving pose estimation, point cloud generation, and overall reconstruction. Both GeMS and GeMS-E achieve state-of-the-art performance on synthetic and real-world datasets. To our knowledge, this is the first framework to address extreme motion blur within 3DGS directly from severely blurred inputs.
zh

[CV-20] Towards PerSense: Advancing Training-Free Personalized Instance Segmentation in Dense Images

【速读】:该论文旨在解决密集视觉场景中实例分割(Instance Segmentation)的难题,尤其针对遮挡、背景杂乱和尺度变化等挑战。其核心解决方案是提出PerSense框架,该框架无需训练且模型无关,通过两个关键模块实现:一是实例检测模块(Instance Detection Module, IDM),利用密度图(Density Maps, DMs)生成实例级候选点提示;二是点提示选择模块(Point Prompt Selection Module, PPSM),结合自适应阈值和空间门控机制过滤误检。此外,通过反馈机制自动选择有效样本以优化密度图质量,从而提升分割精度。进一步提出的PerSense++版本引入多样性感知的样本选择策略、混合轮廓与峰值提示生成机制以及无关掩码拒绝模块(Irrelevant Mask Rejection Module, IMRM),显著增强了在复杂密集场景下的鲁棒性。
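
【代码示意】:IDM"从密度图产生实例级点提示"加上PPSM式的阈值过滤,最小可以写成一次局部极大值检测。以下scipy/numpy草图中的窗口大小与相对阈值均为假设。

```python
import numpy as np
from scipy.ndimage import maximum_filter

def density_to_point_prompts(dm, win=7, rel_thresh=0.3):
    """dm: (H, W) 密度图。局部极大值作为实例级点提示(IDM 思路),
    相对阈值起到 PPSM 式自适应门控的作用(参数为假设)。"""
    peaks = (dm == maximum_filter(dm, size=win)) & (dm > rel_thresh * dm.max())
    ys, xs = np.nonzero(peaks)
    return np.stack([xs, ys], axis=1)          # 每行一个 (x, y) 点提示

# 合成密度图:三个高斯斑点近似三个实例
H = W = 64
yy, xx = np.mgrid[0:H, 0:W]
dm = np.zeros((H, W))
for cy, cx in [(16, 16), (40, 20), (30, 50)]:
    dm += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 20.0)
print(density_to_point_prompts(dm))            # 约返回三个峰值中心
```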

链接: https://arxiv.org/abs/2508.14660
作者: Muhammad Ibraheem Siddiqui,Muhammad Umer Sheikh,Hassan Abid,Kevin Henry,Muhammad Haris Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2405.13518

点击查看摘要

Abstract:Segmentation in dense visual scenes poses significant challenges due to occlusions, background clutter, and scale variations. To address this, we introduce PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for Personalized instance Segmentation in dense images. PerSense employs a novel Instance Detection Module (IDM) that leverages density maps (DMs) to generate instance-level candidate point prompts, followed by a Point Prompt Selection Module (PPSM) that filters false positives via adaptive thresholding and spatial gating. A feedback mechanism further enhances segmentation by automatically selecting effective exemplars to improve DM quality. We additionally present PerSense++, an enhanced variant that incorporates three additional components to improve robustness in cluttered scenes: (i) a diversity-aware exemplar selection strategy that leverages feature and scale diversity for better DM generation; (ii) a hybrid IDM combining contour and peak-based prompt generation for improved instance separation within complex density patterns; and (iii) an Irrelevant Mask Rejection Module (IMRM) that discards spatially inconsistent masks using outlier analysis. Finally, to support this underexplored task, we introduce PerSense-D, a dedicated benchmark for personalized segmentation in dense images. Extensive experiments across multiple benchmarks demonstrate that PerSense++ outperforms existing methods in dense settings.
zh

[CV-21] Understanding Data Influence with Differential Approximation

【速读】:该论文旨在解决现有样本影响估计方法在准确性上的不足,尤其是在神经网络非凸优化场景下难以有效应用的问题。传统方法常假设损失函数为凸函数,这限制了其在真实复杂模型中的适用性。论文提出了一种新的样本影响估计方法——Diff-In,其核心在于通过累积相邻训练步骤间的影响差异来近似单个样本的全局影响,从而避免对模型凸性的依赖。关键创新在于利用二阶近似技术精确估算这些影响差值,并通过有限差分法高效计算Hessian与梯度的乘积,使算法在保持与一阶方法相当的计算复杂度的同时,显著提升估计精度。理论分析和大量实验表明,Diff-In在数据清洗、数据删除和核心集选择等任务中均优于现有基线方法,且可扩展至百万级数据点,在视觉-语言预训练场景下的数据剪枝中展现出卓越性能。
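
【代码示意】:Diff-In效率的关键一环是"用一阶梯度的有限差分近似Hessian-向量积"。下面的PyTorch玩具例子在一个二次损失上验证该近似(其Hessian恰为矩阵A);接口与步长均为示意。

```python
import torch

def hvp_finite_diff(loss_fn, params, v, eps=1e-3):
    """用一阶梯度的有限差分近似 Hessian-向量积:
    H v ≈ (∇L(θ + εv) − ∇L(θ − εv)) / (2ε)。params, v: 展平的一维张量。"""
    def grad_at(theta):
        theta = theta.detach().requires_grad_(True)
        g, = torch.autograd.grad(loss_fn(theta), theta)
        return g
    return (grad_at(params + eps * v) - grad_at(params - eps * v)) / (2 * eps)

# 玩具示例:二次损失 L(θ) = 0.5 θᵀAθ,其 Hessian 恰为 A
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
loss_fn = lambda th: 0.5 * th @ A @ th
theta = torch.tensor([1.0, -1.0])
v = torch.tensor([0.5, 2.0])
print(hvp_finite_diff(loss_fn, theta, v))   # ≈ A @ v = [3.5, 4.5]
```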

链接: https://arxiv.org/abs/2508.14648
作者: Haoru Tan,Sitong Wu,Xiuzhe Wu,Wang Wang,Bo Zhao,Zeke Xie,Gui-Song Xia,Xiaojuan Qi
机构: University of Hong Kong (香港大学); Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学); Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); Wuhan University (武汉大学); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample’s influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
zh

[CV-22] AnchorSync: Global Consistency Optimization for Long Video Editing ACM-MM2025

【速读】:该论文旨在解决长视频编辑中难以同时保持全局结构一致性和时间连贯性的问题,尤其在分钟级视频序列中易出现结构漂移(structural drift)和时间伪影(temporal artifacts)。其解决方案的关键在于提出了一种基于扩散模型的框架AnchorSync,通过将编辑任务解耦为稀疏锚点帧编辑与平滑的中间帧插值两阶段处理:第一阶段利用渐进式去噪机制强制结构一致性,第二阶段借助多模态引导策略保留时间动态特性,从而实现高质量、长时间的视频编辑效果。

链接: https://arxiv.org/abs/2508.14609
作者: Zichi Liu,Yinggui Wang,Tao Wei,Chao Ma
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM MM 2025; Code is released at this https URL

点击查看摘要

Abstract:Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.
zh

[CV-23] SMTrack: End-to-End Trained Spiking Neural Networks for Multi-Object Tracking in RGB Videos

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在复杂时序视觉任务中应用受限的问题,特别是针对标准RGB视频流下的多目标跟踪(Multi-Object Tracking, MOT)任务缺乏高效直接训练框架的挑战。解决方案的关键在于提出SMTrack——首个端到端直接训练的深度SNN框架,其核心创新包括:引入自适应且尺度感知的归一化Wasserstein距离损失(Adaptive and Scale-aware Normalized Wasserstein Distance Loss, Asa-NWDLoss),通过动态调整批次内平均目标尺寸作为归一化因子,提升对小目标的敏感性;同时集成TrackTrack身份模块以增强关联阶段的目标轨迹一致性与鲁棒性,从而在BEE24、MOT17、MOT20和DanceTrack等多个数据集上实现媲美先进ANN方法的跟踪性能。
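
【代码示意】:Asa-NWDLoss建立在框间归一化Wasserstein距离(NWD)之上,并把归一化常数换成batch内平均目标尺寸。以下PyTorch草图按公开的NWD定义实现;自适应常数的具体形式是按摘要思路给出的假设。

```python
import torch

def asa_nwd_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) 的 (cx, cy, w, h) 框。
    将框视为 2D 高斯,其间的 Wasserstein 距离有闭式解;
    归一化常数 C 用 batch 内真实框平均尺寸自适应替代(示意)。"""
    w2_sq = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2 \
          + ((pred[:, 2] - target[:, 2]) / 2) ** 2 \
          + ((pred[:, 3] - target[:, 3]) / 2) ** 2
    C = torch.sqrt(target[:, 2] * target[:, 3]).mean().clamp_min(eps)
    nwd = torch.exp(-torch.sqrt(w2_sq + eps) / C)   # 尺度不敏感的相似度
    return (1.0 - nwd).mean()

pred = torch.tensor([[10., 10., 4., 6.]])
gt = torch.tensor([[11., 10., 4., 5.]])
print(asa_nwd_loss(pred, gt))
```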

链接: https://arxiv.org/abs/2508.14607
作者: Pengzhi Zhong,Xinzhe Wang,Dan Zeng,Qihua Zhou,Feixiang He,Shuiwang Li
机构: Guilin University of Technology (桂林理工大学); Guangxi Key Laboratory of Embedded Technology and Intelligent System (广西嵌入式技术与智能系统重点实验室); Shenzhen University (深圳大学); Sun Yat-sen University (中山大学); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain-inspired Spiking Neural Networks (SNNs) exhibit significant potential for low-power computation, yet their application in visual tasks remains largely confined to image classification, object detection, and event-based tracking. In contrast, real-world vision systems still widely use conventional RGB video streams, where the potential of directly-trained SNNs for complex temporal tasks such as multi-object tracking (MOT) remains underexplored. To address this challenge, we propose SMTrack-the first directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos. SMTrack introduces an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) to improve detection and localization performance under varying object scales and densities. Specifically, the method computes the average object size within each training batch and dynamically adjusts the normalization factor, thereby enhancing sensitivity to small objects. For the association stage, we incorporate the TrackTrack identity module to maintain robust and consistent object trajectories. Extensive evaluations on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack achieves performance on par with leading ANN-based MOT methods, advancing robust and accurate SNN-based tracking in complex scenarios.
zh

[CV-24] UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling ICCV2025

【速读】:该论文旨在解决点云视频(point cloud video)在时序建模中因空间-时间无序性导致的难以有效利用远距离时空特征的问题。现有基于选择性状态空间模型(Selective State Space Model, SSM)的方法虽具备线性复杂度优势,但直接将点云视频按时间顺序展开为一维序列时,无法处理其固有的非结构化特性,从而限制了对语义相关但时空分布分散的点的有效建模。解决方案的关键在于提出统一的时空状态空间模型(Unified Spatio-Temporal State Space Model, UST-SSM),通过三个核心模块实现:1)时空选择扫描(Spatial-Temporal Selection Scanning, STSS),利用提示引导聚类重构语义感知序列;2)时空结构聚合(Spatio-Temporal Structure Aggregation, STSA),补偿缺失的4D几何与运动细节;3)时间交互采样(Temporal Interaction Sampling, TIS),通过非锚帧利用和扩展感受野增强细粒度时序依赖关系。

链接: https://arxiv.org/abs/2508.14604
作者: Peiming Li,Ziyi Wang,Yulin Yuan,Hong Liu,Xiangming Meng,Junsong Yuan,Mengyuan Liu
机构: Peking University (北京大学); Zhejiang University (浙江大学); State University of New York at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, Accepted to ICCV2025

点击查看摘要

Abstract:Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at this https URL.
zh

[CV-25] Incremental Object Detection with Prompt-based Methods ICCV

【速读】:该论文旨在解决视觉提示(visual prompt)方法在增量目标检测(incremental object detection, IOD)任务中表现不佳的问题,特别是针对复杂领域增量学习(domain-incremental learning)场景下,现有基于提示的方法缺乏有效性和泛化能力的局限。其关键解决方案是将视觉提示与少量历史数据回放(replay)相结合:通过引入少量先前类别的样本进行记忆回放,显著提升了模型在不遗忘旧知识的前提下学习新类别的性能,从而实现了当前prompt-based方法在IOD任务中的最优效果。

链接: https://arxiv.org/abs/2508.14599
作者: Matthias Neuwirth-Trapp,Maarten Bieshaar,Danda Pani Paudel,Luc Van Gool
机构: Bosch Research; INSAIT, Sofia University “St. Kliment Ohridski”
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV Workshops 2025

点击查看摘要

Abstract:Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.
zh

[CV-26] Reliable Smoke Detection via Optical Flow-Guided Feature Fusion and Transformer-Based Uncertainty Modeling

【速读】:该论文旨在解决烟雾检测中因光照变化、流体动力学特性及环境噪声导致的传统探测器可靠性不足的问题,尤其在单目图像条件下实现高保真早期火灾预警。其解决方案的关键在于提出了一种基于信息融合的框架:首先利用受四色定理启发的双阶段水平集分数阶变分模型进行光学流估计,以保留运动不连续性;随后通过高斯混合模型将颜色编码的光学流图与外观特征融合,生成烟雾区域的二值分割掩膜;最终将这些融合表征输入改进的带多尺度不确定性估计头的两阶段不确定感知移窗Transformer(Two-Phase Uncertainty-Aware Shifted Windows Transformer),其中第一阶段优化检测精度,第二阶段联合建模认知不确定性(epistemic uncertainty)与随机不确定性(aleatoric uncertainty)以输出预测置信度,从而显著提升模型在复杂场景下的泛化能力和鲁棒性。
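
【代码示意】:"颜色编码光流图与外观特征经GMM融合得到烟雾二值掩膜"这一步可用sklearn粗略还原。以下草图将逐像素的光流颜色与RGB外观拼接后做两分量GMM聚类,并用分量在光流通道上的均值范数挑选前景;属于大幅简化的假设,而非论文完整流程。

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def smoke_mask_from_flow_and_color(flow_rgb, frame_rgb, k=2):
    """flow_rgb: 颜色编码的光流图 (H, W, 3);frame_rgb: 原始帧 (H, W, 3)。
    拼接两类线索为逐像素特征,GMM 聚成 k 类,取"运动能量更高"的一类为前景。"""
    h, w, _ = flow_rgb.shape
    feat = np.concatenate([flow_rgb.reshape(-1, 3),
                           frame_rgb.reshape(-1, 3)], axis=1).astype(np.float64)
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(feat)
    labels = gmm.predict(feat)
    motion = np.linalg.norm(gmm.means_[:, :3], axis=1)   # 各分量的"运动强度"
    fg = labels == motion.argmax()
    return fg.reshape(h, w).astype(np.uint8)

mask = smoke_mask_from_flow_and_color(np.random.rand(64, 64, 3),
                                      np.random.rand(64, 64, 3))
```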

链接: https://arxiv.org/abs/2508.14597
作者: Nitish Kumar Mahala,Muzammil Khan,Pushpendra Kumar
机构: Maulana Azad National Institute of Technology Bhopal (莫拉纳·阿扎德国家技术学院博帕尔); University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fire outbreaks pose critical threats to human life and infrastructure, necessitating high-fidelity early-warning systems that detect combustion precursors such as smoke. However, smoke plumes exhibit complex spatiotemporal dynamics influenced by illumination variability, flow kinematics, and environmental noise, undermining the reliability of traditional detectors. To address these challenges without the logistical complexity of multi-sensor arrays, we propose an information-fusion framework by integrating smoke feature representations extracted from monocular imagery. Specifically, a Two-Phase Uncertainty-Aware Shifted Windows Transformer for robust and reliable smoke detection, leveraging a novel smoke segmentation dataset, constructed via optical flow-based motion encoding, is proposed. The optical flow estimation is performed with a four-color-theorem-inspired dual-phase level-set fractional-order variational model, which preserves motion discontinuities. The resulting color-encoded optical flow maps are fused with appearance cues via a Gaussian Mixture Model to generate binary segmentation masks of the smoke regions. These fused representations are fed into the novel Shifted-Windows Transformer, which is augmented with a multi-scale uncertainty estimation head and trained under a two-phase learning regimen. First learning phase optimizes smoke detection accuracy, while during the second phase, the model learns to estimate plausibility confidence in its predictions by jointly modeling aleatoric and epistemic uncertainties. Extensive experiments using multiple evaluation metrics and comparative analysis with state-of-the-art approaches demonstrate superior generalization and robustness, offering a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications.
zh

[CV-27] Controllable Latent Space Augmentation for Digital Pathology ICCV2025

【速读】:该论文旨在解决数字病理学中全切片图像(Whole Slide Image, WSI)分析面临的两大挑战:一是WSI具有吉像素级分辨率,导致计算成本高;二是标注数据稀疏,难以训练鲁棒的模型。针对这些问题,作者提出HistAug——一种基于潜在空间的可控增强方法,其关键在于通过条件生成机制在嵌入空间中实现可控的、语义明确的增强操作(如色调变化、形态学腐蚀等),从而在单次前向传播中高效处理大量patch,并保持初始语义信息不变。相比传统基于噪声的扰动或特征层面的增强,HistAug能显著提升多实例学习(Multiple Instance Learning, MIL)模型性能,尤其在小样本场景下优势明显。

链接: https://arxiv.org/abs/2508.14588
作者: Sofiène Boutaj,Marin Scalbert,Pierre Marza,Florent Couzinie-Devy,Maria Vakalopoulou,Stergios Christodoulidis
机构: MICS, CentraleSupélec – Université Paris-Saclay; Bioptimus, Inc.; VitaDX International
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025

点击查看摘要

Abstract:Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at this https URL.
zh

[CV-28] Safety-Critical Learning for Long-Tail Events: The TUM Traffic Accident Dataset ICRA40

【速读】:该论文旨在解决高速公路交通事故检测难题,尤其是针对高车速场景下事故识别的准确性与鲁棒性问题。现有交通网络安全性研究虽已取得进展,但事故仍难以完全避免,亟需通过数据驱动的方法实现对真实世界中复杂事故场景的有效建模与自动识别。解决方案的关键在于构建了TUM Traffic Accident (TUMTraf-A)数据集,包含10个类别的294,924个2D标注框和93,012个3D标注框,覆盖48,144帧来自四路路边摄像头与激光雷达(LiDAR)的多模态数据(采样频率为10 Hz),并提出Accid3nD模型——一种融合规则驱动(rule-based)与学习驱动(learning-based)机制的事故检测方法。实验及消融分析表明,该方法在复杂真实场景中具有良好的鲁棒性和检测性能。

链接: https://arxiv.org/abs/2508.14567
作者: Walter Zimmer,Ross Greer,Xingcheng Zhou,Rui Song,Marc Pavel,Daniel Lehmberg,Ahmed Ghita,Akshay Gopalkrishnan,Mohan Trivedi,Alois Knoll
机构: Technical University of Munich (慕尼黑工业大学); Fraunhofer Institute for Transportation and Infrastructure Systems (IVI) (弗劳恩霍夫交通与基础设施系统研究所); University of California San Diego (加州大学圣地亚哥分校); University of California Merced (加州大学默塞德分校); SETLabs Research GmbH (SETLabs 研究有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for ICRA 40 Year Anniversary (ICRA40)

点击查看摘要

Abstract:Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as an unavoidable and sporadic outcome of traffic networks. We present the TUM Traffic Accident (TUMTraf-A) dataset, a collection of real-world highway accidents. It contains ten sequences of vehicle crashes at high-speed driving with 294,924 labeled 2D and 93,012 labeled 3D boxes and track IDs within 48,144 labeled frames recorded from four roadside cameras and LiDARs at 10 Hz. The dataset contains ten object classes and is provided in the OpenLABEL format. We propose Accid3nD, an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our project website: this https URL.
zh

[CV-29] GOGS: High-Fidelity Geometry and Relighting for Glossy Objects via Gaussian Surfels

【速读】:该论文旨在解决光泽物体(glossy objects)从RGB图像中进行逆向渲染(inverse rendering)时存在的根本性模糊问题,特别是现有方法在几何重建精度、材质分解能力以及新光照下的光真实重光照(photorealistic relighting)效果方面的局限。其解决方案的关键在于提出一种基于2D高斯surfels的两阶段框架GOGS:第一阶段通过物理基础渲染(physics-based rendering)结合split-sum近似与来自基础模型的几何先验,实现鲁棒的表面重建;第二阶段利用蒙特卡洛重要性采样完整渲染方程,通过可微分的2D高斯射线追踪建模间接光照,并借助球面mipmap方向编码捕捉各向异性高光细节,从而有效提升高频镜面反射区域的重建质量与材质分离准确性。

链接: https://arxiv.org/abs/2508.14563
作者: Xingyuan Yang,Min Wei
机构: Chengdu University of Information Technology (成都信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 13 figures

点击查看摘要

Abstract:Inverse rendering of glossy objects from RGB imagery remains fundamentally limited by inherent ambiguity. Although NeRF-based methods achieve high-fidelity reconstruction via dense-ray sampling, their computational cost is prohibitive. Recent 3D Gaussian Splatting achieves high reconstruction efficiency but exhibits limitations under specular reflections. Multi-view inconsistencies introduce high-frequency surface noise and structural artifacts, while simplified rendering equations obscure material properties, leading to implausible relighting results. To address these issues, we propose GOGS, a novel two-stage framework based on 2D Gaussian surfels. First, we establish robust surface reconstruction through physics-based rendering with split-sum approximation, enhanced by geometric priors from foundation models. Second, we perform material decomposition by leveraging Monte Carlo importance sampling of the full rendering equation, modeling indirect illumination via differentiable 2D Gaussian ray tracing and refining high-frequency specular details through spherical mipmap-based directional encoding that captures anisotropic highlights. Extensive experiments demonstrate state-of-the-art performance in geometry reconstruction, material separation, and photorealistic relighting under novel illuminations, outperforming existing inverse rendering approaches.
zh

[CV-30] Locality-aware Concept Bottleneck Model

【速读】:该论文旨在解决无监督概念瓶颈模型(Concept Bottleneck Models, CBMs)在图像中无法准确定位概念区域的问题,即模型在预测概念存在时常常关注与概念无关的视觉区域,导致空间定位不准确。解决方案的关键在于提出一种局部感知的概念瓶颈模型(Locality-aware Concept Bottleneck Model, LCBM),其核心机制是引入原型学习(prototype learning),为每个概念分配一个原型,该原型被训练为表征该概念的典型局部特征,并通过基础模型(foundation models)确保原型与其对应概念的相关性;随后利用这些原型引导模型识别出每个概念应从图像中的正确局部区域进行预测,从而实现概念的高精度空间定位,同时保持与传统方法相当的分类性能。
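
下面用一段 PyTorch 代码示意"概念原型与局部特征做相似度"如何同时给出概念定位图与概念存在性分数;特征与原型均为随机占位张量,维度为演示假设,仅说明思路,并非论文实现。

```python
import torch
import torch.nn.functional as F

def concept_maps(local_feats, prototypes):
    """用概念原型与局部特征的余弦相似度生成概念定位图(最小示意)。
    local_feats: (B, d, H, W) 局部特征;prototypes: (K, d),每个概念一个原型。"""
    f = F.normalize(local_feats, dim=1)
    p = F.normalize(prototypes, dim=1)
    maps = torch.einsum("bdhw,kd->bkhw", f, p)   # (B, K, H, W) 逐位置相似度图
    presence = maps.flatten(2).max(-1).values    # 概念存在性取空间最大响应
    return maps, presence

feats = torch.randn(2, 256, 14, 14)   # 占位:实际应来自基础模型的局部特征
protos = torch.randn(20, 256)         # 假设 20 个概念、每个概念一个可学习原型
maps, presence = concept_maps(feats, protos)
print(maps.shape, presence.shape)     # (2, 20, 14, 14) 与 (2, 20)
```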

链接: https://arxiv.org/abs/2508.14562
作者: Sujin Jeon,Hyundo Lee,Eungseo Kim,Sanghack Lee,Byoung-Tak Zhang,Inwoo Hwang
机构: Seoul National University (首尔国立大学); Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 25 figures

点击查看摘要

Abstract:Concept bottleneck models (CBMs) are inherently interpretable models that make predictions based on human-understandable visual cues, referred to as concepts. As obtaining dense concept annotations with human labeling is demanding and costly, recent approaches utilize foundation models to determine the concepts existing in the images. However, such label-free CBMs often fail to localize concepts in relevant regions, attending to visually unrelated regions when predicting concept presence. To this end, we propose a framework, coined Locality-aware Concept Bottleneck Model (LCBM), which utilizes rich information from foundation models and adopts prototype learning to ensure accurate spatial localization of the concepts. Specifically, we assign one prototype to each concept, promoted to represent a prototypical image feature of that concept. These prototypes are learned by encouraging them to encode similar local regions, leveraging foundation models to assure the relevance of each prototype to its associated concept. Then we use the prototypes to facilitate the learning process of identifying the proper local region from which each concept should be predicted. Experimental results demonstrate that LCBM effectively identifies present concepts in the images and exhibits improved localization while maintaining comparable classification performance.
zh

[CV-31] Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization

【速读】:该论文旨在解决基于姿态码(pose code)的可控运动生成方法在表达细微运动特征时能力不足的问题,即离散姿态码难以捕捉高频率等精细运动细节,从而限制了生成运动的丰富性和准确性。解决方案的关键在于引入残差向量量化(residual vector quantization, RVQ)机制,将连续运动特征作为补充嵌入到原有的姿态码潜在表示中,从而在保持姿态码可解释性和可编辑性的基础上,有效捕获如高频运动细节等细微特征,提升了生成运动的质量与可控性。
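
下面给出残差向量量化(RVQ)核心思想的最小 NumPy 示意:多级码本逐层量化上一级的残差,量化结果为各级码字之和。码本规模、层数与初始化均为演示用的假设值,并非论文实现。

```python
import numpy as np

def rvq_quantize(x, codebooks):
    """多级残差向量量化:每一级码本量化上一级遗留的残差。
    x: (d,) 输入特征;codebooks: 每级一个 (K, d) 码本。"""
    residual = x.copy()
    quantized = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # 找最近码字
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]                            # 残差交给下一级
    return quantized, indices

rng = np.random.default_rng(0)
d, K, levels = 8, 16, 3
codebooks = [rng.normal(size=(K, d)) * (0.5 ** l) for l in range(levels)]
x = rng.normal(size=d)
q, ids = rvq_quantize(x, codebooks)
print("重构误差:", np.linalg.norm(x - q), "各级码字索引:", ids)
```

逐级量化使得离散姿态码之外的细微连续特征(如高频细节)可以由后续层级的残差码字补充,这正是摘要所述设计的动机。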

链接: https://arxiv.org/abs/2508.14561
作者: Sukhyun Jeong,Hong-Gi Shin,Yong-Hoon Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent progress in text-to-motion has advanced both 3D human motion generation and text-based motion control. Controllable motion generation (CoMo), which enables intuitive control, typically relies on pose code representations, but discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness. To overcome this, we propose a method that augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ). This design preserves the interpretability and manipulability of pose codes while effectively capturing subtle motion characteristics such as high-frequency details. Experiments on the HumanML3D dataset show that our model reduces Frechet inception distance (FID) from 0.041 to 0.015 and improves Top-1 R-Precision from 0.508 to 0.510. Qualitative analysis of pairwise direction similarity between pose codes further confirms the model’s controllability for motion editing.
zh

[CV-32] A Comprehensive Review of Agricultural Parcel and Boundary Delineation from Remote Sensing Images: Recent Progress and Future Perspectives

【速读】:该论文旨在解决农业地块边界识别与划分(Agricultural Parcel and Boundary Delineation, APBD)在遥感影像中自动化、高精度检测的难题。其解决方案的关键在于系统梳理和分类当前APBD研究方法,涵盖传统图像处理、传统机器学习及深度学习三大类,并重点分析以语义分割、目标检测和Transformer为基础的深度学习方法。通过构建元数据统计分析,论文进一步揭示了不同算法、传感器类型、作物种类及评估方式对APBD性能的影响,为未来研究提供清晰的知识图谱与发展方向。

链接: https://arxiv.org/abs/2508.14558
作者: Juepeng Zheng,Zi Ye,Yibin Wen,Jianxi Huang,Zhiwei Zhang,Qingmei Li,Qiong Hu,Baodong Xu,Lingyuan Zhao,Haohuan Fu
机构: Sun Yat-Sen University (中山大学); Southwest Jiaotong University (西南交通大学); China Agricultural University (中国农业大学); Tsinghua University (清华大学); Central China Normal University (华中师范大学); Huazhong Agricultural University (华中农业大学); HuanTian Wisdom Technology Co., Ltd. (环天智慧科技有限公司); National Supercomputing Center in Shenzhen (深圳国家超级计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost-efficient and high-accuracy agricultural inventory and analysis in an automated way. Many studies aiming to provide an inventory at the level of each agricultural parcel have generated methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD-related research applied to remote sensing images. With the goal of providing a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta-data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel-based, edge-based and region-based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning-based methods. With deep learning-oriented approaches contributing a majority, we further discuss deep learning-based methods like semantic segmentation-based, object detection-based and Transformer-based methods. In addition, we discuss five APBD-related issues to further comprehend the APBD domain using remote sensing data, such as multi-sensor data in the APBD task, comparisons between single-task learning and multi-task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD-related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and trends.
zh

[CV-33] Improving OCR using internal document redundancy

【速读】:该论文旨在解决当前光学字符识别(OCR)系统在处理低质量文档时性能下降的问题,尤其是在印刷文档中,尽管域内数据变异性较低,但域间变异性较高,导致现有OCR方法难以充分利用文档内部的冗余信息。解决方案的关键在于提出一种无监督方法,通过挖掘文档内字符形状的冗余性来修正OCR输出结果并优化聚类效果;具体而言,该方法引入了一种扩展的高斯混合模型(GMM),结合期望最大化(EM)算法与簇内重对齐过程及正态性统计检验,从而提升对退化文档(如乌拉圭军方档案和17世纪至20世纪中期欧洲报纸)的识别准确率。
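
下面给出标准高斯混合模型(GMM)最简 EM 迭代的 NumPy 实现,对应摘要中扩展 GMM 的基础部分;论文中的簇内重对齐与正态性统计检验步骤此处省略,数据为合成二维点,仅用于说明字符形状聚类的迭代框架。

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """最简 EM:对字符形状特征 X (n, d) 拟合 k 个球形高斯分量。"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]   # 均值用随机样本初始化
    var = np.full(k, X.var())                 # 球形方差
    pi = np.full(k, 1.0 / k)                  # 混合权重
    for _ in range(n_iter):
        # E 步:在对数域计算责任度,防止下溢
        logp = (-0.5 * ((X[:, None] - mu[None]) ** 2).sum(-1) / var
                - 0.5 * d * np.log(2 * np.pi * var) + np.log(pi))
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)
        # M 步:更新均值、方差与权重
        nk = r.sum(0)
        mu = (r.T @ X) / nk[:, None]
        var = (r * ((X[:, None] - mu[None]) ** 2).sum(-1)).sum(0) / (nk * d)
        pi = nk / n
    return mu, var, pi, r

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(100, 2))
               for m in ([0, 0], [3, 3], [0, 3])])
mu, var, pi, r = em_gmm(X, k=3)
print("各簇均值:\n", mu)
```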

链接: https://arxiv.org/abs/2508.14557
作者: Diego Belzarena,Seginus Mowlavi,Aitor Artola,Camilo Mariño,Marina Gardella,Ignacio Ramírez,Antoine Tadros,Roy He,Natalia Bottaioli,Boshra Rajaei,Gregory Randall,Jean-Michel Morel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 28 pages, 10 figures, including supplementary material. Code: this https URL . Dataset: this https URL

点击查看摘要

Abstract:Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document’s redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
zh

[CV-34] WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion

【速读】:该论文旨在解决计算病理学(Computational Pathology, CPath)中全切片图像(Whole Slide Images, WSIs)因像素规模庞大而导致的编码效率瓶颈问题,即传统方法需处理数十万至百万级高分辨率补丁(patch),导致预处理与训练时间长达数天甚至数周。解决方案的关键在于提出一种自适应WSI编码框架WISE-FUSE,其核心是利用病理领域视觉-语言模型(vision-language models)和大语言模型(Large Language Models, LLMs)对诊断相关区域进行选择性处理:首先通过知识蒸馏机制计算低分辨率补丁与类别特定文本描述之间的相似度得分,筛选出信息量高的小样本区域;随后仅对这些关键区域的高分辨率补丁进行编码,并融合文本嵌入以增强诊断上下文,从而在显著降低编码时间(超过三倍)的同时保持或提升诊断性能。
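
以下 PyTorch 片段示意"低分辨率 patch 嵌入与类别文本嵌入计算余弦相似度、取 top-k 进入高分辨率编码"这一粗筛步骤;嵌入用随机张量占位(实际应来自病理视觉-语言模型),patch 数量与 k 均为假设值。

```python
import torch
import torch.nn.functional as F

def select_patches(patch_emb, text_emb, k):
    """按低分辨率 patch 嵌入与类别文本嵌入的相似度,筛选 top-k 诊断相关区域。
    patch_emb: (N, d) 低分辨率 patch 嵌入;text_emb: (C, d) 类别描述嵌入。"""
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = p @ t.T                      # (N, C) 余弦相似度
    score = sim.max(dim=1).values      # 每个 patch 取其最相关类别的得分
    topk = score.topk(k).indices       # 仅这些 patch 进入高分辨率编码
    return topk, score

N, C, d, k = 10000, 4, 512, 256       # 假设值:1 万个粗分辨率 patch、4 个类别
patch_emb = torch.randn(N, d)         # 占位:实际来自知识蒸馏后的视觉编码器
text_emb = torch.randn(C, d)          # 占位:实际来自类别文本描述编码
idx, score = select_patches(patch_emb, text_emb, k)
print(idx.shape, score[idx].min().item())
```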

链接: https://arxiv.org/abs/2508.14537
作者: Yonghan Shin,SeungKyu Kim,Won-Ki Jeong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks-making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
zh

[CV-35] Adversarial Generation and Collaborative Evolution of Safety-Critical Scenarios for Autonomous Vehicles

【速读】:该论文旨在解决当前自动驾驶安全评估中生成安全关键场景(safety-critical scenarios)的局限性问题,即现有方法主要依赖预定义威胁模式或规则驱动策略,难以暴露多样化且未预见的失效模式。其解决方案的关键在于提出ScenGE框架,该框架通过两个核心步骤实现:首先基于结构化驾驶知识的大型语言模型进行元场景生成(Meta-Scenario Generation),推理出行为合理且具有挑战性的对抗代理(adversarial agent);其次利用背景车辆构建对抗协作图(adversarial collaborator graph),优化关键目标轨迹以同时压缩自车(ego vehicle)的可操作空间并制造关键遮挡,从而放大初始威胁。此方法显著提升了安全关键场景的多样性与严重性,在多个强化学习驱动的自动驾驶模型上平均使碰撞案例增加31.96%,并通过真实车辆测试和人工评估验证了生成场景的合理性与临界性。

链接: https://arxiv.org/abs/2508.14527
作者: Jiangfan Liu,Yongkang Guo,Fangzhi Zhong,Tianyuan Zhang,Zonglei Jing,Siyuan Liang,Jiakai Wang,Mingchuan Zhang,Aishan Liu,Xianglong Liu
机构: Beihang University (北京航空航天大学); Nanyang Technological University (南洋理工大学); Zhongguancun Laboratory (中关村实验室); Henan University of Science and Technology (河南科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The generation of safety-critical scenarios in simulation has become increasingly crucial for safety evaluation in autonomous vehicles prior to road deployment in society. However, current approaches largely rely on predefined threat patterns or rule-based strategies, which limit their ability to expose diverse and unforeseen failure modes. To overcome these, we propose ScenGE, a framework that can generate plentiful safety-critical scenarios by reasoning novel adversarial cases and then amplifying them with complex traffic flows. Given a simple prompt of a benign scene, it first performs Meta-Scenario Generation, where a large language model, grounded in structured driving knowledge, infers an adversarial agent whose behavior poses a threat that is both plausible and deliberately challenging. This meta-scenario is then specified in executable code for precise in-simulator control. Subsequently, Complex Scenario Evolution uses background vehicles to amplify the core threat introduced by Meta-Scenario. It builds an adversarial collaborator graph to identify key agent trajectories for optimization. These perturbations are designed to simultaneously reduce the ego vehicle’s maneuvering space and create critical occlusions. Extensive experiments conducted on multiple reinforcement learning based AV models show that ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. Additionally, our ScenGE can be applied to large model based AV systems and deployed on different simulators; we further observe that adversarial training on our scenarios improves the model robustness. Finally, we validate our framework through real-world vehicle tests and human evaluation, confirming that the generated scenarios are both plausible and critical. We hope our paper can build up a critical step towards building public trust and ensuring their safe deployment.
zh

[CV-36] PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments

【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection, IAD)中因数据稀疏性、动态生产环境适应性差以及领域用户参与度低所带来的挑战。传统统计与数据驱动方法受限于对大量标注数据的依赖及在复杂工况下的灵活性不足,难以满足现代制造场景的需求。其解决方案的关键在于提出一种基于提示(Prompt-based)的工业异常检测框架 PB-IAD,该框架充分利用基础模型(Foundation Models)的多模态感知与推理能力,通过设计专门用于迭代嵌入领域知识的提示模板和预处理模块,将领域专家输入转化为有效系统指令,从而实现无需数据科学背景的用户自定义配置。实验表明,在数据稀缺和少样本场景下,仅依靠语义指令即可显著提升检测性能,优于 PatchCore 等前沿方法。
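
下面以纯 Python 勾勒"把领域用户输入转写为系统提示"这一预处理思路;模板措辞、字段名与示例内容均为演示假设,并非论文使用的实际提示模板。

```python
def build_system_prompt(process_name, normal_desc, defect_hints):
    """将领域专家输入转写为异常检测系统提示(模板措辞为示意,非论文原文)。"""
    hints = "\n".join(f"- {h}" for h in defect_hints)
    return (
        f"You are a quality inspector for the process: {process_name}.\n"
        f"A normal part looks like: {normal_desc}\n"
        f"Known deviation patterns to watch for:\n{hints}\n"
        "Inspect the attached image and answer with 'normal' or 'anomaly', "
        "then briefly justify your decision."
    )

prompt = build_system_prompt(
    process_name="PCB solder joint inspection",   # 假设的工艺场景
    normal_desc="shiny, concave solder fillets covering each pad",
    defect_hints=["dull or cracked solder surface", "bridging between pads"],
)
print(prompt)
```

这种设计把领域知识的迭代嵌入变成编辑模板字段,领域用户无需数据科学背景即可调整系统行为。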

链接: https://arxiv.org/abs/2508.14504
作者: Bernd Hofmann,Albert Scheck,Joerg Franke,Patrick Bruendl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked to state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.
zh

[CV-37] SATURN: Autoregressive Image Generation Guided by Scene Graphs

【速读】:该论文旨在解决当前文本到图像生成模型在处理复杂提示时难以准确捕捉场景布局和物体间关系的问题。现有基于场景图(scene graph)的方法通常依赖于计算密集的GAN或扩散模型,效率与保真度均落后于现代自回归架构。解决方案的关键在于提出SATURN(Structured Arrangement of Triplets for Unified Rendering Networks),其通过将场景图转化为一种基于显著性排序的标记序列,使冻结的CLIP-VQ-VAE主干网络能够理解结构信息,同时仅微调VAR Transformer模块。这一轻量级设计在Visual Genome数据集上显著提升了生成质量(FID从56.45降至21.62,Inception Score从16.03升至24.78),并在对象数量和空间关系准确性方面优于SG2IM和SGDiff等方法,实现了结构感知与自回归高保真生成的有效结合。
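
以下片段示意如何把场景图三元组按显著性降序展开为 token 序列;显著性分数与词表均为演示假设,真实系统中这些 token 还需与 CLIP-VQ-VAE 的码本及 VAR Transformer 对接。

```python
def triplets_to_tokens(triplets, salience, vocab):
    """将场景图三元组按显著性降序展开为 token 序列(显著性与词表为示意假设)。"""
    order = sorted(range(len(triplets)), key=lambda i: -salience[i])
    tokens = []
    for i in order:
        s, p, o = triplets[i]
        tokens += [vocab[s], vocab[p], vocab[o]]   # 主语-谓词-宾语依次编码
    return tokens

triplets = [("tree", "behind", "car"), ("person", "riding", "bike")]
salience = [0.3, 0.9]   # 假设:person-riding-bike 更显著,应先被模型看到
vocab = {w: i for i, w in enumerate(
    ["person", "riding", "bike", "tree", "behind", "car"])}
print(triplets_to_tokens(triplets, salience, vocab))
# -> [0, 1, 2, 3, 4, 5]:先编码高显著性三元组,再接低显著性三元组
```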

链接: https://arxiv.org/abs/2508.14502
作者: Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran
机构: University of Science, VNU-HCM (胡志明市国家大学所属科学大学); Vietnam National University, Ho Chi Minh City, Vietnam (越南胡志明市国家大学); University of Dayton (代顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MAPR 2025

点击查看摘要

Abstract:State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45% to 21.62% and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. Qualitative results further confirm improvements in object count fidelity and spatial relation accuracy, showing that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
zh

[CV-38] WeedSense: Multi-Task Learning for Weed Segmentation Height Estimation and Growth Stage Classification ICCV

【速读】:该论文旨在解决农业中杂草管理的挑战,特别是如何通过高效、精准的杂草分析实现可持续和精准农业实践。传统方法在杂草监测与分类上存在效率低、多任务协同困难等问题,难以满足实时决策需求。解决方案的关键在于提出 WeedSense——一种基于多任务学习(Multi-task Learning)的新型架构,其核心创新包括:1)采用双路径编码器融合通用逆瓶颈模块(Universal Inverted Bottleneck blocks),以提取多层次特征;2)设计基于Transformer的多任务分叉解码器(Multi-Task Bifurcated Decoder),实现语义分割、株高估计与生长阶段分类的联合优化;3)构建包含16种杂草物种、11周生长周期的像素级标注数据集,支持多任务训练与评估。该方案在保证高精度(如分割mIoU达89.78%、株高估计MAE为1.67cm)的同时,实现160 FPS实时推理,且相比串行单任务模型提速3倍、参数减少32.4%,显著提升了杂草智能识别系统的实用性与部署效率。
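
下面用 PyTorch 搭一个"共享编码器 + 三个任务头"的最小多任务模型,示意分割、株高回归与生长阶段分类如何共享一次前向计算;网络结构与通道数均为演示假设,并非论文的双路编码器与分叉解码器。

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """共享编码器 + 三个任务头的最小示意(非论文的 UIB 双路编码器)。"""
    def __init__(self, n_classes=17, n_stages=11):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(64, n_classes, 1)   # 语义分割头
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.height_head = nn.Linear(64, 1)           # 株高回归头
        self.stage_head = nn.Linear(64, n_stages)     # 生长阶段分类头
    def forward(self, x):
        f = self.encoder(x)
        g = self.pool(f).flatten(1)
        return self.seg_head(f), self.height_head(g), self.stage_head(g)

model = TinyMultiTask()
x = torch.randn(2, 3, 128, 128)
seg, height, stage = model(x)
# 联合损失 = 分割交叉熵 + 株高 L1 + 阶段交叉熵(权重为假设)
print(seg.shape, height.shape, stage.shape)
```

共享一次编码即可得到三项输出,这也是摘要中"比串行单任务执行快 3 倍、参数更少"的直观来源。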

链接: https://arxiv.org/abs/2508.14486
作者: Toqi Tahamid Sarker,Khaled R Ahmed,Taminul Islam,Cristiana Bernardi Rankrape,Karla Gage
机构: Southern Illinois University Carbondale (南伊利诺伊大学卡本代尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been submitted and accepted for publication at ICCVW 2025

点击查看摘要

Abstract:Weed management represents a critical challenge in agriculture, significantly impacting crop yields and requiring substantial resources for control. Effective weed monitoring and analysis strategies are crucial for implementing sustainable agricultural practices and site-specific management approaches. We introduce WeedSense, a novel multi-task learning architecture for comprehensive weed analysis that jointly performs semantic segmentation, height estimation, and growth stage classification. We present a unique dataset capturing 16 weed species over an 11-week growth cycle with pixel-level annotations, height measurements, and temporal labels. WeedSense leverages a dual-path encoder incorporating Universal Inverted Bottleneck blocks and a Multi-Task Bifurcated Decoder with transformer-based feature fusion to generate multi-scale features and enable simultaneous prediction across multiple tasks. WeedSense outperforms other state-of-the-art models on our comprehensive evaluation. On our multi-task dataset, WeedSense achieves mIoU of 89.78% for segmentation, 1.67cm MAE for height estimation, and 99.99% accuracy for growth stage classification while maintaining real-time inference at 160 FPS. Our multitask approach achieves 3 \times faster inference than sequential single-task execution and uses 32.4% fewer parameters. Please see our project page at this http URL.
zh

[CV-39] Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

【速读】:该论文旨在解决基于DiT(Diffusion Transformer)架构的生成式视频修复(generative video restoration)方法在使用ControlNet进行可控生成时,因多模态对齐不完善导致的分布偏移(distribution drift)问题,进而引发纹理真实感下降和时间一致性受损。其核心解决方案是提出一种概念蒸馏训练策略(concept distillation training strategy),利用预训练的文本到视频(text-to-video, T2V)模型生成包含嵌入文本概念的训练样本,从而将T2V模型的概念理解能力蒸馏至目标模型中,以维持纹理质量和时序一致性;同时,通过重构控制架构——引入控制特征投影器(control feature projector)以过滤输入视频潜在表示中的退化伪影,并设计双分支结构的ControlNet连接器(ControlNet connector),融合MLP特征映射与交叉注意力机制实现动态控制特征检索,显著提升生成可控性与内容保真度。

链接: https://arxiv.org/abs/2508.14483
作者: Haoran Bai,Xiaoxu Chen,Canqian Yang,Zongyao He,Sibin Deng,Ying Chen
机构: Alibaba Group - Taobao & Tmall Group (阿里巴巴集团-淘宝与天猫团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at this https URL.
zh

[CV-40] LookOut: Real-World Humanoid Egocentric Navigation

【速读】:该论文旨在解决从第一人称视角视频中预测未来6D头部姿态序列的问题,以支持人形机器人、虚拟现实/增强现实(VR/AR)及辅助导航等应用中的无碰撞轨迹规划。其核心挑战在于建模静态与动态环境的几何与语义约束,并学习通过头部转动表达的主动信息获取行为。解决方案的关键是提出一个基于时序聚合3D潜在特征的推理框架,该框架能够捕捉环境结构和动态变化对头部运动的影响;同时,作者构建了Aria Navigation Dataset(AND),包含4小时真实场景下用户导航的多样的行为数据,为训练和验证模型提供了高质量的数据基础。实验表明,该方法能有效学习类人导航策略,如等待减速、重新规划路径和观察交通情况,并在未见环境中实现良好泛化能力。

链接: https://arxiv.org/abs/2508.14466
作者: Boxiao Pan,Adam W. Harley,C. Karen Liu,Leonidas J. Guibas
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at this https URL.
zh

[CV-41] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing

【速读】:该论文旨在解决视频生成中个性化视频编辑的关键挑战——主体替换(subject swapping)问题,现有方法或局限于特定领域(如人体动画或手物交互),或依赖间接编辑范式与模糊文本提示,导致最终结果保真度不足。解决方案的核心在于提出一个掩码引导、主体无关的端到端框架DreamSwapV,通过引入多条件输入与专用条件融合模块实现细粒度控制,并设计自适应掩码策略以适配不同尺度和属性的主体,从而增强被替换主体与其周围场景的交互一致性。

链接: https://arxiv.org/abs/2508.14465
作者: Weitao Wang,Zichen Wang,Hongdeng Shen,Yulei Lu,Xirui Fan,Suhui Wu,Jun Zhang,Haoqian Wang,Hao Zhang
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains–such as human-body animation or hand-object interaction–or rely on some indirect editing paradigm or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our elaborate two-phase dataset construction and training scheme, our DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our first introduced DreamSwapV-Benchmark.
zh

[CV-42] Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering ICCV2025

【速读】:该论文旨在解决多步扩散模型在正向渲染(forward rendering)与逆向渲染(inverse rendering)任务中因独立处理而导致的循环不一致性(cycle inconsistency)以及推理速度慢的问题。其解决方案的关键在于提出Ouroboros框架,该框架由两个单步扩散模型组成,分别负责正向和逆向渲染,并通过相互强化机制实现两者之间的协同优化;同时引入循环一致性约束,确保正向与逆向渲染输出的一致性,从而在保持高质量重建的同时显著提升推理效率,并可无需训练直接迁移至视频分解任务,有效降低视频序列中的时间不一致性。
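
以下示意用两个占位卷积网络表达"正向渲染 ∘ 逆向渲染 ≈ 恒等"的循环一致性约束;网络结构与本征分量的通道数均为演示假设,仅说明该损失的形式,并非论文的单步扩散模型。

```python
import torch
import torch.nn as nn

# 用两个小卷积网代替单步扩散模型,仅演示正向/逆向渲染的循环一致性约束
inverse_net = nn.Conv2d(3, 6, 3, padding=1)   # 图像 -> 本征分量(假设 6 通道)
forward_net = nn.Conv2d(6, 3, 3, padding=1)   # 本征分量 -> 图像

img = torch.randn(1, 3, 64, 64)
intrinsics = inverse_net(img)                   # 逆向渲染:分解本征属性
recon = forward_net(intrinsics)                 # 正向渲染:重新合成图像
cycle_loss = nn.functional.l1_loss(recon, img)  # 循环一致性:R(G(I)) ≈ I
cycle_loss.backward()                           # 两个模型在该约束下相互强化
print(float(cycle_loss))
```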

链接: https://arxiv.org/abs/2508.14461
作者: Shanlin Sun,Yifan Wang,Hanwen Zhang,Yifeng Xiong,Qin Ren,Ruogu Fang,Xiaohui Xie,Chenyu You
机构: University of California, Irvine (加州大学欧文分校); Stony Brook University (石溪大学); Huazhong University of Science and Technology (华中科技大学); University of Florida (佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
zh

[CV-43] D3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis

【速读】:该论文旨在解决3D说话头合成中因依赖长时视频训练而导致的个性化建模效率低下问题,尤其是在仅有少量帧数据时难以实现精准唇部同步与高质量图像生成的问题。解决方案的关键在于提出D³-Talker框架,其核心创新是构建一个静态的3D高斯属性场,并利用音频和面部运动信号分别控制两个独立的高斯属性变形场,从而有效解耦通用特征与个性化变形;同时设计了一种新颖的相似性对比损失函数用于预训练阶段以增强解耦效果,并引入粗到精模块提升图像清晰度,缓解头部运动引起的模糊问题。

链接: https://arxiv.org/abs/2508.14449
作者: Yuhang Guo,Kaijun Deng,Siyang Song,Jindong Xie,Wenhui Ma,Linlin Shen
机构: Shenzhen University (深圳大学); University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model for each target identity from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. However, since audio contains information irrelevant to lip motion, existing approaches typically struggle to map the given audio to realistic lip behaviors in the target face when trained on only a few frames, causing poor lip synchronization and talking head image quality. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and Facial Motion signals to independently control two distinct Gaussian attribute deformation fields, effectively decoupling the predictions of general and personalized deformations. We design a novel similarity contrastive loss function during pre-training to achieve more thorough decoupling. Furthermore, we integrate a Coarse-to-Fine module to refine the rendered images, alleviating blurriness caused by head movements and enhancing overall image quality. Extensive experiments demonstrate that D^3-Talker outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data. Our code will be provided upon acceptance.
zh

[CV-44] Generalizable Engagement Estimation in Conversation via Domain Prompting and Parallel Attention

【速读】:该论文旨在解决自适应人机交互系统中对话参与度(Conversational Engagement)估计的泛化能力不足问题,尤其是跨领域和跨语言场景下模型性能下降的挑战。其核心解决方案是提出DAPA(Domain-Adaptive Parallel Attention)框架,关键创新在于引入域提示机制(Domain Prompting)——通过在输入前添加可学习的域特定向量,显式地将模型条件化于数据来源,从而实现域感知的适应性同时保留通用的参与度表征;此外,还设计了并行交叉注意力模块(Parallel Cross-Attention),以显式对齐前向BiLSTM(反应状态)与后向BiLSTM(预期状态)之间的交互同步性,增强对对话动态性的建模能力。实验表明,该方法在多个跨文化、跨语言基准上达到新SOTA,尤其在NoXi-J测试集上相比强基线提升0.45的CCC指标。
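
下面的 PyTorch 草图示意两处关键设计:把可学习的域向量拼接到输入序列之前,以及用交叉注意力对齐前向/后向 LSTM 状态;隐藏维度、注意力头数与回归头均为演示假设,并非论文原始配置。

```python
import torch
import torch.nn as nn

class DAPASketch(nn.Module):
    """域提示 + 前/后向 LSTM 状态的并行交叉注意力(最小示意,维度为假设)。"""
    def __init__(self, d=64, n_domains=3):
        super().__init__()
        self.domain_prompt = nn.Parameter(torch.randn(n_domains, 1, d))
        self.fwd = nn.LSTM(d, d, batch_first=True)    # 反应状态(前向)
        self.bwd = nn.LSTM(d, d, batch_first=True)    # 预期状态(后向)
        self.cross = nn.MultiheadAttention(d, 4, batch_first=True)
        self.head = nn.Linear(d, 1)                   # 参与度回归头
    def forward(self, x, domain_id):
        p = self.domain_prompt[domain_id].expand(x.size(0), -1, -1)
        x = torch.cat([p, x], dim=1)                  # 序列前拼接域提示向量
        hf, _ = self.fwd(x)
        hb, _ = self.bwd(torch.flip(x, dims=[1]))
        hb = torch.flip(hb, dims=[1])
        h, _ = self.cross(hf, hb, hb)                 # 对齐反应/预期两路状态
        return self.head(h.mean(dim=1))

m = DAPASketch()
y = m(torch.randn(2, 30, 64), domain_id=1)
print(y.shape)   # (2, 1):每段对话一个参与度分数
```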

链接: https://arxiv.org/abs/2508.14448
作者: Yangche Yu,Yin Chen,Jia Li,Peng Jia,Yu Zhang,Li Dai,Zhenzhen Hu,Meng Wang,Richang Hong
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1st Place in the Engagement Estimation Task held by MultiMediate 25

点击查看摘要

Abstract:Accurate engagement estimation is essential for adaptive human-computer interaction systems, yet robust deployment is hindered by poor generalizability across diverse domains and challenges in modeling complex interaction dynamics. To tackle these issues, we propose DAPA (Domain-Adaptive Parallel Attention), a novel framework for generalizable conversational engagement modeling. DAPA introduces a Domain Prompting mechanism by prepending learnable domain-specific vectors to the input, explicitly conditioning the model on the data’s origin to facilitate domain-aware adaptation while preserving generalizable engagement representations. To capture interactional synchrony, the framework also incorporates a Parallel Cross-Attention module that explicitly aligns reactive (forward BiLSTM) and anticipatory (backward BiLSTM) states between interlocutors. Extensive experiments demonstrate that DAPA establishes a new state-of-the-art performance on several cross-cultural and cross-linguistic benchmarks, notably achieving an absolute improvement of 0.45 in Concordance Correlation Coefficient (CCC) over a strong baseline on the NoXi-J test set. The superiority of our method was also confirmed by winning the first place in the Multi-Domain Engagement Estimation Challenge at MultiMediate’25.
zh

[CV-45] Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在农业场景中应用受限的问题,主要挑战包括光照不均、遮挡以及视场有限等。其解决方案的关键在于提出一种名为 NIRSplat 的多模态高斯点绘图架构,并配套构建 NIRPlant 数据集:后者融合近红外(Near-Infrared, NIR)图像、RGB 图像、文本元数据(如植被指数 NDVI、NDWI 和叶绿素指数)、深度图与 LiDAR 数据;前者引入交叉注意力机制和基于3D点的位置编码,从而增强几何先验并提升对复杂农业环境的建模能力。该方法显著优于 3DGS、CoR-GS 和 InstantSplat 等现有代表性方法,在农业场景下展现出更强的鲁棒性和语义丰富性。

链接: https://arxiv.org/abs/2508.14443
作者: Gyusam Chang,Tuan-Anh Vu,Vivek Alumootil,Harris Song,Deanna Pham,Sangpil Kim,M. Khalid Jawed
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: this https URL
zh

[CV-46] MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion ICCV2025

【速读】:该论文旨在解决布局可控的多主体合成(Layout-controllable Multi-subject Synthesis, LMS)问题,即在统一图像中精确控制多个参考主体的空间位置并保持其身份一致性。现有方法虽分别提升了布局控制与主体生成能力,但在同时满足空间精度和身份保真度方面仍存在显著挑战。解决方案的关键在于提出MUSE框架,其核心创新是采用拼接交叉注意力(Concatenated Cross-Attention, CCA)机制,通过显式语义空间扩展将布局约束与文本提示无缝融合,实现跨模态双向对齐且无干扰;此外,设计渐进式两阶段训练策略,将LMS任务分解为可学习的子目标以提升优化效率。实验表明,MUSE在零样本端到端生成中展现出更优的空间准确性和身份一致性。
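
以下片段示意拼接交叉注意力(CCA)的形式:把布局信息(区域框投影 + 主体嵌入)得到的 token 与文本 token 拼接后,共同作为交叉注意力的 key/value;各嵌入为随机占位,维度与 token 数均为演示假设。

```python
import torch
import torch.nn as nn

d = 64
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
box_proj = nn.Linear(4, d)    # 将 (x1,y1,x2,y2) 布局框投影到语义空间

img_tokens = torch.randn(1, 256, d)    # 查询:图像潜变量 token
text_tokens = torch.randn(1, 77, d)    # 文本提示 token(占位)
subj_tokens = torch.randn(1, 2, d)     # 两个参考主体的嵌入(占位)
boxes = torch.tensor([[[0.0, 0.0, 0.5, 1.0], [0.5, 0.0, 1.0, 1.0]]])
layout_tokens = box_proj(boxes) + subj_tokens   # 主体嵌入与其指定区域绑定

# 拼接交叉注意力:文本与布局约束拼接后共同作为 key/value,互不干扰
kv = torch.cat([text_tokens, layout_tokens], dim=1)
out, _ = cross_attn(img_tokens, kv, kv)
print(out.shape)   # (1, 256, 64)
```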

链接: https://arxiv.org/abs/2508.14440
作者: Fei Peng,Junqiang Wu,Yan Li,Tingting Gao,Di Zhang,Huiyuan Fu
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by ICCV 2025

点击查看摘要

Abstract:Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at this https URL.
zh

[CV-47] FOCUS: Frequency-Optimized Conditioning of DiffUSion Models for mitigating catastrophic forgetting during Test-Time Adaptation

【速读】:该论文旨在解决模型在测试时适应(test-time adaptation)过程中因域变化(domain shift)导致的灾难性遗忘问题,即在适应新域的同时如何有效保留任务相关的语义知识。其解决方案的关键在于提出一种基于频率条件控制的输入自适应方法FOCUS,该方法构建于扩散驱动的去噪框架中,利用轻量级Y形频率预测网络(Y-FPN)从噪声图像中解耦高低频信息,并通过学习的空间自适应频率先验,在扩散逆向过程中对去噪步骤进行条件约束,从而保护密集预测任务中的关键语义信息。此外,FOCUS还可生成伪标签用于补充监督信号,显著缓解了现有自适应方法在有限监督下的性能退化问题。
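
下面用 FFT 径向掩码演示高低频分解这一底层操作;论文中的 Y-FPN 是可学习的频率预测网络,此处的固定截止频率仅为示意假设。

```python
import torch

def split_frequencies(img, cutoff=0.1):
    """用 FFT 径向掩码分离低/高频(论文用可学习的 Y-FPN,此处仅演示分解思想)。
    img: (C, H, W);cutoff 为归一化截止频率(假设值)。"""
    C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    fy = torch.linspace(-0.5, 0.5, H).view(-1, 1)
    fx = torch.linspace(-0.5, 0.5, W).view(1, -1)
    low_mask = ((fx ** 2 + fy ** 2).sqrt() <= cutoff).to(spec.dtype)
    low = torch.fft.ifft2(
        torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = img - low            # 高频 = 原图减去低频重构
    return low, high

img = torch.randn(3, 64, 64)
low, high = split_frequencies(img)
print(low.shape, high.shape)
print(low.pow(2).mean().item(), high.pow(2).mean().item())  # 低/高频能量
```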

链接: https://arxiv.org/abs/2508.14437
作者: Gabriel Tjio,Jie Zhang,Xulei Yang,Yun Xing,Nhat Chung,Xiaofeng Cao,Ivor W. Tsang,Chee Keong Kwoh,Qing Guo
机构: Nanyang Technological University (南洋理工大学); Institute for Infocomm Research, A*STAR (新加坡资讯通信研究院); University of Alberta (阿尔伯塔大学); Ho Chi Minh City International University (胡志明市国际大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Test-time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task-relevant knowledge. To address this problem, we propose FOCUS, a novel frequency-based conditioning approach within a diffusion-driven input-adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion-driven denoising to preserve task-relevant semantic information for dense prediction. FOCUS leverages a trained, lightweight, Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion-driven framework. We train Y-FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions. We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS-denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.
zh

[CV-48] HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation

【速读】:该论文旨在解决单目3D人体姿态估计(Monocular 3D Human Pose Estimation, HPE)中常见的深度模糊性和遮挡问题,以及传统方法在利用骨骼结构信息时忽略多尺度特征导致的精度下降问题。其解决方案的关键在于提出一种融合扩散模型(Diffusion Model)与超图卷积网络(HyperGCN)的新架构——HyperDiff:其中扩散模型有效建模数据不确定性以缓解深度模糊和遮挡影响,而HyperGCN作为去噪器,通过多粒度结构精准捕捉关节间的高阶关联性,显著提升复杂姿态下的去噪能力。该设计在Human3.6M和MPI-INF-3DHP数据集上实现了当前最优性能,并具备灵活适配不同计算资源的能力。

链接: https://arxiv.org/abs/2508.14431
作者: Bing Han,Yuhua Huang,Pan Gao
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular 3D human pose estimation (HPE) often encounters challenges such as depth ambiguity and occlusion during the 2D-to-3D lifting process. Additionally, traditional methods may overlook multi-scale skeleton features when utilizing skeleton structure information, which can negatively impact the accuracy of pose estimation. To address these challenges, this paper introduces a novel 3D pose estimation method, HyperDiff, which integrates diffusion models with HyperGCN. The diffusion model effectively captures data uncertainty, alleviating depth ambiguity and occlusion. Meanwhile, HyperGCN, serving as a denoiser, employs multi-granularity structures to accurately model high-order correlations between joints. This improves the model’s denoising capability especially for complex poses. Experimental results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources to balance performance and efficiency.
zh

[CV-49] MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing

【速读】:该论文旨在解决相机拍摄显示屏时因色彩滤光阵列(Color Filter Array, CFA)与显示器子像素频率不匹配而导致的摩尔纹(Moiré pattern)问题,该现象在视频捕捉中尤为显著,且具有空间变化性强、大尺度结构扩散、通道统计依赖性以及帧间快速波动等复杂特性。解决方案的核心是提出MoCHA-former模型,其关键创新在于两个模块:一是解耦式自适应去摩尔纹(Decoupled Moiré Adaptive Demoiréing, DMAD),通过摩尔纹解耦块(MDB)和细节解耦块(DDB)分离摩尔纹与内容,并利用摩尔纹条件块(MCB)生成针对性恢复特征;二是时空自适应去摩尔纹(Spatio-Temporal Adaptive Demoiréing, STAD),引入空间融合块(SFB)结合窗口注意力机制以捕获大尺度结构,并采用特征通道注意力(FCA)建模RAW图像中的通道依赖关系,同时无需显式对齐即可实现时间一致性。

链接: https://arxiv.org/abs/2508.14423
作者: Jeahun Sung,Changhyun Roh,Chanho Eom,Jihyong Oh
机构: Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page at this https URL

点击查看摘要

Abstract:Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera’s color filter array (CFA) and the display’s sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.
zh

[CV-50] Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-states

【速读】:该论文旨在解决扩散模型(Diffusion Models)中一个长期存在的假设问题,即认为训练时需要大量潜变量状态或时间步数(T ≈ 1,000)才能使逆向生成过程接近高斯分布,从而保证生成质量。其核心解决方案在于:通过精心设计的噪声调度(noise schedule),可在极小数量的潜变量状态下(如 T ≈ 32)实现与传统大 T 模型相当的性能;进一步地,将这一极限推进至仅用一个潜变量状态(即 T = 1),提出“T 空间中的完全解耦”(complete disentanglement in T-space)的概念,并通过组合多个独立训练的单状态模型生成高质量样本。实验表明,该方法在两个不同数据集上均实现了 4–6 倍于传统扩散模型的收敛速度提升。
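
以下以常用的余弦噪声调度为例,计算 T=32 与 T=1000 两种设置下的 alpha-bar 序列,直观展示小 T 下相邻潜状态间信噪比跨度更大、因而需要精心选择调度;调度形式为通用示例,并非论文最终选用的具体调度。

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """余弦噪声调度(Nichol & Dhariwal 形式)在 T 个时间步上的 alpha_bar 取值。"""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]          # 归一化使 alpha_bar(0) = 1

for T in (32, 1000):
    ab = cosine_alpha_bar(T)
    # 小 T 下相邻潜状态间 alpha_bar 的单步跨度明显更大
    print(T, ab[1], ab[-1], np.abs(np.diff(ab)).max().round(4))
```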

链接: https://arxiv.org/abs/2508.14413
作者: Samarth Gupta,Raghudeep Gadde,Rui Chen,Aleix M. Martinez
机构: Amazon(亚马逊)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We challenge a fundamental assumption of diffusion models, namely, that a large number of latent-states or time-steps is required for training so that the reverse generative process is close to a Gaussian. We first show that with careful selection of a noise schedule, diffusion models trained over a small number of latent states (i.e. T \sim 32 ) match the performance of models trained over a much large number of latent states ( T \sim 1,000 ). Second, we push this limit (on the minimum number of latent states required) to a single latent-state, which we refer to as complete disentanglement in T-space. We show that high quality samples can be easily generated by the disentangled model obtained by combining several independently trained single latent-state models. We provide extensive experiments to show that the proposed disentangled model provides 4-6 \times faster convergence measured across a variety of metrics on two different datasets.
zh

[CV-51] A Real-world Display Inverse Rendering Dataset

【速读】:该论文旨在解决当前缺乏基于显示-相机(display-camera)成像系统的实世界逆渲染(inverse rendering)数据集的问题,这限制了相关方法的开发与评估。其关键解决方案是构建并校准一套由LCD显示器和双极化相机组成的成像系统,采集多样物体在单光源逐次照射(one-light-at-a-time, OLAT)模式下的图像,并提供高质量的真值几何信息。该数据集支持在任意显示图案和不同噪声水平下合成图像,从而为逆渲染算法提供了可靠的基准测试平台。

链接: https://arxiv.org/abs/2508.14411
作者: Seokjun Choi,Hoon-Gyu Chung,Yujin Jeon,Giljoo Nam,Seung-Hwan Baek
机构: POSTECH; Meta
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a diverse set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple, yet effective baseline for display inverse rendering, outperforming state-of-the-art inverse rendering methods. Code and dataset are available on our project page at this https URL
zh

[CV-52] CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities

【速读】:该论文旨在解决当前以英语为中心的文本到图像(Text-to-Image, TTI)生成模型(如Flux)在处理中文提示词时表现不佳的问题,尤其是由于训练数据的语言和文化偏见导致的语义失真与图像真实性下降。现有方法如翻译成英文或微调双语映射关系,难以保留中文特有的文化语义信息。解决方案的关键在于提出Chinese Text Adapter-Flux (CTA-Flux),通过引入多模态扩散Transformer(MultiModal Diffusion Transformer, MMDiT)直接控制Flux主干网络,而非依赖ControlNet类架构,从而显著减少参数量并增强对中文语义的理解能力,同时保持与LoRA、IP-Adapter和ControlNet等现有插件的兼容性,实现高质量且文化忠实的中文图文生成。

链接: https://arxiv.org/abs/2508.14405
作者: Yue Gong,Shanyuan Liu,Liuzhuozheng Li,Jian Zhu,Bo Cheng,Liebucha Wu,Xiaoyu Wu,Yuhang Ma,Dawei Leng,Yuhui Yin
机构: 360(三六零)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We proposed the Chinese Text Adapter-Flux (CTA-Flux). An adaptation method fits the Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model initially trained on the English corpus. Despite the notable image generation ability conditioned on English text inputs, Flux performs poorly when processing non-English prompts, particularly due to linguistic and cultural biases inherent in predominantly English-centric training datasets. Existing approaches, such as translating non-English prompts into English or finetuning models for bilingual mappings, inadequately address culturally specific semantics, compromising image authenticity and quality. To address this issue, we introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities. Existing approaches relying on ControlNet-like architectures typically require a massive parameter scale and lack direct control over Chinese semantics. In comparison, CTA-flux leverages MultiModal Diffusion Transformer (MMDiT) to control the Flux backbone directly, significantly reducing the number of parameters while enhancing the model’s understanding of Chinese semantics. This integration significantly improves the generation quality and cultural authenticity without extensive retraining of the entire model, thus maintaining compatibility with existing text-to-image plugins such as LoRA, IP-Adapter, and ControlNet. Empirical evaluations demonstrate that CTA-flux supports Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.
zh

[CV-53] Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning

【速读】:该论文旨在解决高分辨率空间转录组(Spatial Transcriptomics, ST)数据生成中的计算效率低、模型不稳定以及预测与评估困难等问题,尤其针对Visium HD等平台实现8 μm或更精细分辨率时所面临的极端稀疏性和低表达水平挑战。解决方案的关键在于提出Img2ST-Net框架,其采用全卷积神经网络架构,将高分辨率ST数据建模为超像素(super-pixel)表示,从而将原本逐斑点(spot-by-spot)的回归任务转化为具有数百至数千通道的超内容图像生成问题,实现了并行化高效预测;同时引入专为高分辨率ST设计的结构相似性评价指标SSIM-ST,以提升在稀疏表达模式下的评估鲁棒性。这一方法不仅显著提升了计算效率,还更好地保留了空间组学数据固有的空间组织特性。
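
下面用一个极简全卷积网络示意"一次前向并行输出整张超像素表达图"的建模方式:输出通道数即基因数,每个输出像素对应一个高分辨率空间 bin;网络结构与基因数均为演示假设,并非论文架构。

```python
import torch
import torch.nn as nn

# 最小示意:把组织学图像一次性映射为"超像素"基因表达图(通道数 = 基因数,均为假设)
n_genes = 460
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, n_genes, 1))        # 每个输出像素对应一个细粒度空间 bin

he_patch = torch.randn(1, 3, 256, 256)  # H&E 图像块(随机占位)
expr_map = net(he_patch)                # (1, n_genes, 64, 64):并行预测整张表达图
print(expr_map.shape)
```

相比逐斑点回归,这种图像到图像的公式化一次前向即可覆盖全部空间位置,也更好地保留空间组织结构。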

链接: https://arxiv.org/abs/2508.14393
作者: Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Juming Xiong,Chongyu Qu,Mengmeng Yin,Yu Wang,Shilin Zhao,Haichun Yang,Daguang Xu,Yucheng Tang,Yuankai Huo
机构: Vanderbilt University (范德比尔特大学); Weill Cornell Medicine (威尔康奈尔医学院); Vanderbilt University Medical Center (范德比尔特大学医学中心); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and time-intensive nature of ST data acquisition. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8um or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at this https URL.
zh

[CV-54] QuadINR: Hardware-Efficient Implicit Neural Representations Through Quadratic Activation

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)中因激活函数(Activation Functions, AFs)导致的频谱偏差(spectral bias)问题,同时克服传统复杂激活函数带来的显著硬件开销。其解决方案的关键在于提出QuadINR,一种采用分段二次激活函数(piecewise quadratic AFs)的硬件高效INR架构:此类函数在傅里叶级数中包含丰富的谐波成分,通过神经切线核(Neural Tangent Kernel, NTK)分析验证了其对高频信号更强的表达能力;此外,作者设计了一个统一的N-stage流水线框架,支持多种激活函数在硬件上的高效实现,最终在FPGA和ASIC平台上实现了高达97%的资源与功耗降低、93%的延迟改善,并在图像和视频重建任务中取得最高2.06dB的PSNR提升。
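
以下用一种周期化的分段二次激活(抛物线波)搭建最小 INR 拟合一维高频信号,说明"二次激活富含谐波、利于高频表达"的思路;激活的具体形式、输入缩放与网络规模均为示意假设,并非论文的精确定义。

```python
import torch
import torch.nn as nn

class QuadWave(nn.Module):
    """周期化的分段二次激活(抛物线波;具体形式为示意假设,非论文定义)。"""
    def forward(self, x):
        x = torch.remainder(x + 2.0, 4.0) - 2.0   # 折叠到一个周期 [-2, 2)
        return x * (2.0 - x.abs())                # 每段都是二次多项式

class TinyINR(nn.Module):
    def __init__(self, hidden=64, omega=10.0):
        super().__init__()
        self.omega = omega                         # 输入缩放,便于覆盖多个分段
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)
        self.act = QuadWave()
    def forward(self, x):
        h = self.act(self.l1(self.omega * x))
        h = self.act(self.l2(h))
        return self.out(h)

coords = torch.linspace(-1, 1, 512).unsqueeze(1)
target = torch.sin(16 * torch.pi * coords)         # 含高频成分的一维目标信号
model = TinyINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(1000):
    loss = nn.functional.mse_loss(model(coords), target)
    opt.zero_grad(); loss.backward(); opt.step()
print("拟合 MSE:", float(loss))
```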

链接: https://arxiv.org/abs/2508.14374
作者: Wenyong Zhou,Boyu Li,Jiachen Ren,Taiqiang Wu,Zhilin Ai,Zhengwu Liu,Ngai Wong
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Implicit Neural Representations (INRs) encode discrete signals continuously while addressing spectral bias through activation functions (AFs). Previous approaches mitigate this bias by employing complex AFs, which often incur significant hardware overhead. To tackle this challenge, we introduce QuadINR, a hardware-efficient INR that utilizes piecewise quadratic AFs to achieve superior performance with dramatic reductions in hardware consumption. The quadratic functions encompass rich harmonic content in their Fourier series, delivering enhanced expressivity for high-frequency signals, as verified through Neural Tangent Kernel (NTK) analysis. We develop a unified N -stage pipeline framework that facilitates efficient hardware implementation of various AFs in INRs. We demonstrate FPGA implementations on the VCU128 platform and an ASIC implementation in a 28nm process. Experiments across images and videos show that QuadINR achieves up to 2.06dB PSNR improvement over prior work, with an area of only 1914 \mu m ^2 and a dynamic power of 6.14mW, reducing resource and power consumption by up to 97% and improving latency by up to 93% vs existing baselines.
zh

[CV-55] TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network

【速读】:该论文旨在解决正颌外科手术规划中面部骨骼点云形变模拟的准确性与效率问题,传统生物力学仿真方法存在计算耗时长、数据处理繁琐及精度不足的局限,而现有基于深度学习的方法则受限于大规模点云处理能力弱、感受野有限导致噪声点增多,以及依赖复杂预处理和后处理操作的问题。解决方案的关键在于提出一种基于Transformer的粗粒度到细粒度点移动网络(TCFNet),其核心创新包括:第一阶段采用Transformer网络捕捉全局特征以学习点间复杂对应关系,第二阶段引入局部信息聚合网络(LIA-Net)建模局部几何结构(如边缘、方向和相对位置特征),从而弥补Transformer在局部细节上的精度损失;同时利用门控循环单元(GRU)将全局特征引导至局部位移估计,提升预测精度。此外,受可变形医学图像配准启发,设计了一种辅助损失函数,融合专家知识优化关键区域重建效果,最终在多个数据集上实现了优于现有最先进(SOTA)方法的评估指标与可视化结果。

链接: https://arxiv.org/abs/2508.14373
作者: Runshi Zhang,Bimeng Jie,Yang He,Junchen Wang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures

点击查看摘要

Abstract:Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. The traditional biomechanical simulation methods are limited by their computational time consumption levels, labor-intensive data processing strategies and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical regions. Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at this https URL.

[CV-56] FastTracker: Real-Time and Accurate Visual Tracking

【Quick Read】: This paper targets the limited generalization of conventional multi-object tracking (MOT) systems to non-pedestrian targets, especially vehicles. Existing methods are designed mainly for pedestrians and struggle to maintain trajectory continuity and identity consistency in complex traffic scenes, particularly under heavy occlusion or rich structural context. The key to the solution is a generalized tracking framework with two core components: an occlusion-aware re-identification mechanism that improves robustness by preserving the identities of heavily occluded targets, and a road-structure-aware tracklet refinement strategy that exploits semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. The framework achieves strong results on a newly built vehicle-annotated dataset and on several public benchmarks, confirming its effectiveness and generality for multi-class tracking.

Link: https://arxiv.org/abs/2508.14370
Authors: Hamidreza Hashempoor, Yu Dong Hwang
Institutions: Pintel Co. Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: this http URL, this http URL.

[CV-57] Taming Transformer for Emotion-Controllable Talking Face Generation

【Quick Read】: This paper tackles emotion-controllable talking face generation: synthesizing a realistic, identity-preserving video with a target emotion conditioned on a given audio. The core challenges are how to effectively model the multimodal relationships associated with a specific emotion and how to exploit these relationships to synthesize identity-preserving emotional videos. The key to the solution is a discrete approach: audio is first disentangled into independent components via pre-training strategies, and videos are quantized into combinations of visual tokens; an emotion-anchor (EA) representation then injects emotional information into the visual tokens; finally, an autoregressive Transformer models the global distribution of visual tokens under the given conditions and predicts the index sequence used to synthesize the video, achieving high-fidelity, emotion-controllable talking face generation.

Link: https://arxiv.org/abs/2508.14359
Authors: Ziqi Zhang, Cheng Deng
Institutions: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: One is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. We conduct experiments on the MEAD dataset that controls the emotion of videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiorities of our method both qualitatively and quantitatively.

[CV-58] Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation ICCV2025

【Quick Read】: This paper addresses a core problem in category-level object pose estimation: existing methods rely solely on 6D poses as supervisory signals and fail to model the intrinsic continuity of poses, leading to inconsistent predictions and poor generalization to unseen poses. The key to the solution is HRC-Pose, a depth-only end-to-end framework that uses contrastive learning to build point cloud representations preserving 6D pose continuity. Specifically, HRC-Pose decouples pose into rotation and translation components that are encoded separately, and introduces a 6D pose-aware hierarchical ranking scheme for contrastive learning in multi-task, multi-category scenarios that accounts for rotational differences, translational differences, and category information, thereby explicitly modeling the continuity of the pose space. The network further includes pose estimation modules that separately process the rotation-aware and translation-aware embeddings; experiments show it consistently outperforms existing depth-only methods on the REAL275 and CAMERA25 benchmarks while running in real time.

Link: https://arxiv.org/abs/2508.14358
Authors: Zhujun Li, Shuo Zhang, Ioannis Stamos
Institutions: Graduate Center, CUNY; Hunter College, CUNY; Weill Cornell Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: Accepted by ICCV 2025 Workshop on Recovering 6D Object Pose (R6D)

Click to view abstract

Abstract:Category-level object pose estimation aims to predict the 6D pose and 3D size of objects within given categories. Existing approaches for this task rely solely on 6D poses as supervisory signals without explicitly capturing the intrinsic continuity of poses, leading to inconsistencies in predictions and reduced generalization to unseen poses. To address this limitation, we propose HRC-Pose, a novel depth-only framework for category-level object pose estimation, which leverages contrastive learning to learn point cloud representations that preserve the continuity of 6D poses. HRC-Pose decouples object pose into rotation and translation components, which are separately encoded and leveraged throughout the network. Specifically, we introduce a contrastive learning strategy for multi-task, multi-category scenarios based on our 6D pose-aware hierarchical ranking scheme, which contrasts point clouds from multiple categories by considering rotational and translational differences as well as categorical information. We further design pose estimation modules that separately process the learned rotation-aware and translation-aware embeddings. Our experiments demonstrate that HRC-Pose successfully learns continuous feature spaces. Results on REAL275 and CAMERA25 benchmarks show that our method consistently outperforms existing depth-only state-of-the-art methods and runs in real-time, demonstrating its effectiveness and potential for real-world applications. Our code is at this https URL.
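
The hierarchical ranking idea can be sketched as a margin loss over triplets ordered by pose distance: if sample i is rotationally closer to an anchor than sample j is, the anchor's embedding should be more similar to i than to j. The brute-force PyTorch snippet below is a deliberately simplified illustration of that principle only; the paper's actual scheme also ranks by translation and category, which is omitted here.

```python
import torch
import torch.nn.functional as F

def pose_ranked_contrastive(emb, rot_dist, margin=0.1):
    """Brute-force O(N^3) pose-aware ranking loss (illustrative only).

    emb:      (N, D) L2-normalized embeddings
    rot_dist: (N, N) rotation distances between samples
    """
    sim = emb @ emb.t()                      # cosine similarities
    N = emb.shape[0]
    loss = torch.zeros(())
    count = 0
    for a in range(N):
        for i in range(N):
            for j in range(N):
                if rot_dist[a, i] + 1e-6 < rot_dist[a, j]:
                    # anchor a should be closer to i than to j in embedding space
                    loss = loss + F.relu(margin - (sim[a, i] - sim[a, j]))
                    count += 1
    return loss / max(count, 1)

emb = F.normalize(torch.randn(8, 32), dim=1)
rot = torch.rand(8, 8)
rot = (rot + rot.t()) / 2
rot.fill_diagonal_(0)
print(pose_ranked_contrastive(emb, rot).item())
```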

[CV-59] Organ-Agents: Virtual Human Physiology Simulator via LLMs

【Quick Read】: This paper addresses the lack of high-fidelity, multi-system simulation tools for complex physiology, particularly the difficulty of accurately reproducing the dynamic interactions of multi-organ dysfunction in critical care. The key to the solution is Organ-Agents, a multi-agent framework driven by large language models (LLMs) in which each agent (Simulator) models a specific physiological system (e.g., cardiovascular, renal, immune). Agents are trained with supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction for cross-system consistency. Validated on data from 7,134 sepsis patients, the approach achieves high accuracy and robustness, supports clinically plausible counterfactual treatment trajectories, and provides an interpretable, generalizable digital-twin platform for precision diagnosis, treatment simulation, and hypothesis testing.

Link: https://arxiv.org/abs/2508.14357
Authors: Rihao Chang, He Jiao, Weizhi Nie, Honglin Guo, Keliang Xie, Zhenhua Wu, Lina Zhao, Yunpeng Bai, Yongtao Ma, Lanjun Wang, Yuting Su, Xi Gao, Weijie Wang, Nicu Sebe, Bruno Lepri, Bingwei Sun
Institutions: Tianjin University; University of Trento; Nanjing Medical University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ-Agents, a multi-agent framework that simulates human physiology via LLM-driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables. Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs < 0.16 and robustness across SOFA-based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ-Agents faithfully reproduces critical multi-system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ-Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real-world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (< 0.04), indicating preserved decision-relevant patterns. These results position Organ-Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.

[CV-60] Deep Learning for Taxol Exposure Analysis: A New Cell Image Dataset and Attention-Based Baseline Model

【Quick Read】: This paper addresses the limitations of existing assays for assessing the cellular effects of the chemotherapeutic agent Taxol, which rely on specialized equipment, skilled personnel, and complex sample preparation, making them expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. The key to the solution is a publicly released microscopy image dataset for automated analysis of morphological changes in C6 glioma cells treated with varying Taxol concentrations, together with a baseline model named ResAttention-KNN that combines a ResNet-50 with Convolutional Block Attention Modules (CBAM) and applies a k-Nearest Neighbors classifier in the learned embedding space, coupling attention-based feature refinement with non-parametric classification to improve robustness and interpretability.

Link: https://arxiv.org/abs/2508.14349
Authors: Sean Fletcher, Gabby Scott, Douglas Currie, Xin Zhang, Yuqi Song, Bruce MacLeod
Institutions: University of Southern Maine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to the 2025 IEEE International Workshop on Foundations of Machine Learning for Drug Safety (FMLDS), to appear in November 2025

Click to view abstract

Abstract:Monitoring the effects of the chemotherapeutic agent Taxol at the cellular level is critical for both clinical evaluation and biomedical research. However, existing detection methods require specialized equipment, skilled personnel, and extensive sample preparation, making them expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. Deep learning approaches have shown great promise in medical and biological image analysis, enabling automated, high-throughput assessment of cellular morphology. Yet, no publicly available dataset currently exists for automated morphological analysis of cellular responses to Taxol exposure. To address this gap, we introduce a new microscopy image dataset capturing C6 glioma cells treated with varying concentrations of Taxol. To provide an effective solution for Taxol concentration classification and establish a benchmark for future studies on this dataset, we propose a baseline model named ResAttention-KNN, which combines a ResNet-50 with Convolutional Block Attention Modules and uses a k-Nearest Neighbors classifier in the learned embedding space. This model integrates attention-based refinement and non-parametric classification to enhance robustness and interpretability. Both the dataset and implementation are publicly released to support reproducibility and facilitate future research in vision-based biomedical analysis.
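
The classification stage, k-NN in a learned embedding space, can be sketched as follows. Here `backbone`, `train_loader`, and `test_loader` are hypothetical placeholders for the CBAM-augmented ResNet-50 and the dataset loaders, and k=5 is an arbitrary illustrative choice rather than the paper's setting.

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

def extract_embeddings(backbone, loader, device="cpu"):
    """Run the frozen backbone over a loader and collect (features, labels)."""
    backbone.eval()
    feats, labels = [], []
    with torch.no_grad():
        for images, y in loader:
            z = backbone(images.to(device))   # (B, D) embedding per image
            feats.append(z.cpu())
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Illustrative usage (backbone and loaders are assumed, not defined here):
# train_z, train_y = extract_embeddings(backbone, train_loader)
# test_z, test_y = extract_embeddings(backbone, test_loader)
# knn = KNeighborsClassifier(n_neighbors=5).fit(train_z, train_y)
# print("accuracy:", knn.score(test_z, test_y))
```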

[CV-61] HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation

【Quick Read】: This paper addresses the performance limitations of Sign Language Recognition (SLR) models caused by insufficient training data. The key to the solution is a lightweight sign generation model based on CMLPe, combined with a synthetic-data pretraining strategy, which consistently improves recognition accuracy and sets new state-of-the-art results on the LSFB and DiSPLaY datasets with the Mamba-SL and Transformer-SL classifiers. The study also shows that synthetic-data pretraining can outperform traditional data augmentation in some cases and provides complementary benefits when combined with it, offering a computationally efficient and generalizable paradigm for sign generation and data augmentation in SLR.

Link: https://arxiv.org/abs/2508.14345
Authors: Gaston Gustavo Rios
Institutions: Universidad Nacional del Sur; Universidad Nacional de La Plata
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 26 pages, 4 figures, 9 tables, code available at this https URL

Click to view abstract

Abstract:Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.

[CV-62] Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates

【Quick Read】: This paper addresses the inefficient learning of small objects in one-stage multi-object detection, where intersection-over-union (IoU)-based losses produce extremely flat gradients for small targets and thus insufficient gradient updates. The key to the solution is an inter-class relational loss (ICR loss) that exploits spatial dependencies between objects of different classes (e.g., a license plate is attached to a vehicle at a predictable position) to strengthen the gradient signal for small objects: when a predicted small object (e.g., a license plate) does not fall within its larger host object (e.g., a car), a penalty inversely proportional to the overlap of the two bounding boxes is added, guiding the localization of the small object with information from the large one. The proposed penalty can be seamlessly added to existing IoU-based losses without extra hyperparameter tuning and yields gains of 10.3% and 1.6% in mAP@50 for YOLOv12-T and UAV-DETR, respectively.

Link: https://arxiv.org/abs/2508.14343
Authors: Dian Ning, Dong Seog Han
Institutions: Kyungpook National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:In one-stage multi-object detection tasks, various intersection over union (IoU)-based solutions aim at smooth and stable convergence near the targets during training. However, IoU-based losses fail to correctly update the gradient of small objects due to an extremely flat gradient. During the update of multiple objects, the learning of small objects' gradients suffers more because of insufficient gradient updates. Therefore, we propose an inter-class relational loss to efficiently update the gradient of small objects while not sacrificing the learning efficiency of other objects, based on the simple fact that an object has a spatial relationship to another object (e.g., a car plate is attached to a car in a similar position). When the predicted car plate's bounding box is not within its car, a loss punishment is added to guide the learning, which is inversely proportional to the overlapped area of the car's and predicted car plate's bounding box. By leveraging the spatial relationship at the inter-class level, the loss guides small object predictions using larger objects and enhances latent information in deeper feature maps. In this paper, we present twofold contributions using license plate detection as a case study: (1) a new small vehicle multi-license plate dataset (SVMLP), featuring diverse real-world scenarios with high-quality annotations; and (2) a novel inter-class relational loss function designed to promote effective detection performance. We highlight that the proposed ICR loss penalty can be easily added to existing IoU-based losses and enhance the performance. These contributions improve the standard mean Average Precision (mAP) metric, achieving gains of 10.3% and 1.6% in test-set mAP@50 for YOLOv12-T and UAV-DETR, respectively, without any additional hyperparameter tuning. Code and dataset will be available soon.
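
A minimal version of the described penalty, a term that grows as the predicted plate box leaves its host car box, could look like the following. This is a sketch of the stated idea ("inversely proportional to the overlapped area"), not the authors' exact formulation; the linear 1 - containment form is an assumption.

```python
import torch

def icr_penalty(plate_boxes: torch.Tensor, car_boxes: torch.Tensor) -> torch.Tensor:
    """Illustrative inter-class relational penalty.

    Boxes are (N, 4) tensors in (x1, y1, x2, y2) format; row i of
    `plate_boxes` is matched to row i of `car_boxes`.
    """
    x1 = torch.max(plate_boxes[:, 0], car_boxes[:, 0])
    y1 = torch.max(plate_boxes[:, 1], car_boxes[:, 1])
    x2 = torch.min(plate_boxes[:, 2], car_boxes[:, 2])
    y2 = torch.min(plate_boxes[:, 3], car_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    plate_area = ((plate_boxes[:, 2] - plate_boxes[:, 0]) *
                  (plate_boxes[:, 3] - plate_boxes[:, 1])).clamp(min=1e-6)
    containment = inter / plate_area       # 1.0 when the plate is inside the car
    return (1.0 - containment).mean()      # grows as the overlap shrinks

plates = torch.tensor([[10., 10., 20., 15.], [50., 50., 60., 55.]])
cars = torch.tensor([[0., 0., 40., 40.], [0., 0., 40., 40.]])
print(icr_penalty(plates, cars))  # second plate lies outside its car, so it is penalized
```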

[CV-63] MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation

【Quick Read】: This paper addresses the limitation that current video generation methods for autonomous driving only produce RGB videos and lack multi-modal outputs (such as depth maps and semantic maps), which constrains holistic urban scene understanding. The key to the solution is a unified multi-modal multi-view video generation framework: a diffusion transformer composed of modal-shared and modal-specific components, with diverse conditioning inputs encoding controllable scene structure and content cues, enabling high-fidelity, controllable multi-modal multi-view driving scene video generation within a single framework.

Link: https://arxiv.org/abs/2508.14327
Authors: Guile Wu, David Huang, Dongfeng Bai, Bingbing Liu
Institutions: Huawei Noah's Ark Lab; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Technical Report

Click to view abstract

Abstract:Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing the state-of-the-art methods.

[CV-64] Pixels to Play: A Foundation Model for 3D Gameplay

【Quick Read】: This paper addresses the limited generalization and deployment efficiency of current game-playing AI agents: how can a foundation model, given only the pixels visible to a player, control a wide range of 3D video games with human-like behavior and minimal game-specific engineering? The key to the solution is Pixels2Play-0.1 (P2P0.1), trained end-to-end with behavior cloning: human-labeled gameplay demonstrations are complemented by unlabeled public videos whose actions are imputed with an inverse-dynamics model, and a decoder-only Transformer with autoregressive action output handles the large action space while remaining latency-friendly on a single consumer GPU.

Link: https://arxiv.org/abs/2508.14295
Authors: Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi, Jonathan J Hunt
Institutions: Player2
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.

[CV-65] OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA MICCAI2025

【Quick Read】: This paper addresses the difficulty of accurately detecting vascular occlusions from digital subtraction angiography (DSA) sequences during endovascular thrombectomy (EVT) for acute ischemic stroke (AIS), where anatomical complexity and clinical time pressure make manual interpretation subjective and inefficient. The key to the solution is OccluNet, which combines the YOLOX single-stage object detector with transformer-based temporal attention to jointly model spatio-temporal features in DSA sequences; both the pure temporal attention and divided space-time attention variants capture temporally consistent features and clearly outperform YOLOv11 baselines trained on individual frames or minimum intensity projections, achieving 89.02% precision and 74.87% recall on the MR CLEAN Registry data.

Link: https://arxiv.org/abs/2508.14286
Authors: Anushka A. Kore, Frank G. te Nijenhuis, Matthijs van der Sluijs, Wim van Zwam, Charles Majoie, Geert Lycklama à Nijeholt, Danny Ruijters, Frans Vos, Sandra Cornelissen, Ruisheng Su, Theo van Walsum
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: To be published in Proceedings of the SWITCH Workshop at MICCAI 2025, Lecture Notes in Computer Science (LNCS), Springer

Click to view abstract

Abstract:Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model’s capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at this https URL

[CV-66] Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference

【Quick Read】: This paper addresses two core problems in explainable object recognition with vision-language models such as CLIP: existing methods rely on prompt-based conditioning, which is limited by CLIP's text encoder and conditions only weakly on explanatory structure; and prior datasets typically provide a single, often noisy rationale that fails to capture the diversity of discriminative image features. The key to the solution is a contrastive conditional inference (CCI) framework that explicitly models the probabilistic relationships among image embeddings, category labels, and multiple rationales, enabling more effective conditioning on rationales without any training and improving both classification accuracy and explanation quality. The method achieves state-of-the-art results, including strong zero-shot performance, on a new multi-rationale explainable object recognition benchmark, which also provides a more comprehensive standard for evaluating future models.

Link: https://arxiv.org/abs/2508.14280
Authors: Ali Rasekh, Sepehr Kazemi Ranjbar, Simon Gottschalk
Institutions: Leibniz University Hannover, Germany; L3S Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Explainable object recognition using vision-language models such as CLIP involves predicting accurate category labels supported by rationales that justify the decision-making process. Existing methods typically rely on prompt-based conditioning, which suffers from limitations in CLIP’s text encoder and provides weak conditioning on explanatory structures. Additionally, prior datasets are often restricted to single, and frequently noisy, rationales that fail to capture the full diversity of discriminative image features. In this work, we introduce a multi-rationale explainable object recognition benchmark comprising datasets in which each image is annotated with multiple ground-truth rationales, along with evaluation metrics designed to offer a more comprehensive representation of the task. To overcome the limitations of previous approaches, we propose a contrastive conditional inference (CCI) framework that explicitly models the probabilistic relationships among image embeddings, category labels, and rationales. Without requiring any training, our framework enables more effective conditioning on rationales to predict accurate object categories. Our approach achieves state-of-the-art results on the multi-rationale explainable object recognition benchmark, including strong zero-shot performance, and sets a new standard for both classification accuracy and rationale quality. Together with the benchmark, this work provides a more complete framework for evaluating future models in explainable object recognition. The code will be made available online.
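
One simplified reading of the idea, scoring each class by combining its direct image similarity with evidence aggregated over that class's rationales, is sketched below. The factorization, temperature, and log-sum-exp aggregation are assumptions made for illustration, not the paper's exact probabilistic model.

```python
import torch
import torch.nn.functional as F

def cci_scores(image_emb, class_embs, rationale_embs, temperature=0.01):
    """Score classes by image-class similarity plus rationale evidence.

    image_emb:      (D,)      normalized CLIP image embedding
    class_embs:     (K, D)    normalized text embeddings of class names
    rationale_embs: (K, R, D) normalized embeddings of R rationales per class
    """
    cls_logits = class_embs @ image_emb / temperature                 # (K,)
    rat_logits = torch.einsum("krd,d->kr", rationale_embs, image_emb) / temperature
    evidence = torch.logsumexp(rat_logits, dim=1)                     # marginalize rationales
    return torch.softmax(cls_logits + evidence, dim=0)

img = F.normalize(torch.randn(512), dim=0)
cls = F.normalize(torch.randn(10, 512), dim=1)
rat = F.normalize(torch.randn(10, 5, 512), dim=2)
print(cci_scores(img, cls, rat).sum())  # probabilities sum to 1
```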

[CV-67] GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

【Quick Read】: This paper addresses the limitations of existing methods in building fine-grained, language-aware 3D scene representations from 2D images, in particular open-vocabulary 3D semantic understanding. The key to the solution is GALA, a framework built on 3D Gaussian Splatting (3DGS) that distills a scene-specific 3D instance feature field via self-supervised contrastive learning and introduces its core component: a cross-attention module with two learnable codebooks encoding view-independent semantic embeddings. This design ensures intra-instance feature consistency, supports seamless 2D and 3D open-vocabulary queries, and avoids learning high-dimensional features per Gaussian, substantially reducing memory consumption.

Link: https://arxiv.org/abs/2508.14278
Authors: Elena Alegret Regalado, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
Institutions: Technical University of Munich; Universitat Politècnica de Catalunya; Google; Munich Center for Machine Learning; University of Tubingen; ETH Zurich; Visualais
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA’s remarkable open-vocabulary performance on both 2D and 3D.

[CV-68] Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning MICCAI2025

【Quick Read】: This paper addresses the difficulty of generating anatomically realistic dental cone-beam CT (CBCT) volumes with fine-grained control, especially precise control over tooth presence and configuration. The key to the solution is a conditional diffusion framework for 3D dental volume generation guided by tooth-level binary attributes, integrating wavelet-based denoising diffusion, FiLM (Feature-wise Linear Modulation) conditioning, and masked loss functions to focus learning on the relevant anatomical structures. This enables precise dentition edits (tooth addition and removal) and full dentition synthesis, with strong fidelity and generalization: SSIM above 0.91 even on unseen scans and low Fréchet Inception Distance (FID) scores.

Link: https://arxiv.org/abs/2508.14276
Authors: Said Djafar Said, Torkan Gholamalizadeh, Mostafa Mehdipour Ghazi
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: MICCAI 2025 Workshop on Oral and Dental Image Analysis (ODIN)

Click to view abstract

Abstract:Despite the growing importance of dental CBCT scans for diagnosis and treatment planning, generating anatomically realistic scans with fine-grained control remains a challenge in medical image synthesis. In this work, we propose a novel conditional diffusion framework for 3D dental volume generation, guided by tooth-level binary attributes that allow precise control over tooth presence and configuration. Our approach integrates wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions to focus learning on relevant anatomical structures. We evaluate the model across diverse tasks, such as tooth addition, removal, and full dentition synthesis, using both paired and distributional similarity metrics. Results show strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans. By enabling realistic, localized modification of dentition without rescanning, this work opens opportunities for surgical planning, patient communication, and targeted data augmentation in dental AI workflows. The codes are available at: this https URL.
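
FiLM conditioning itself is a standard, well-documented mechanism: a conditioning vector (here, tooth-level binary attributes) predicts per-channel scale and shift parameters applied to feature maps. A generic 3D version is sketched below; the channel counts and where such a block sits inside the paper's diffusion network are assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation for 3D volume features."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, D, H, W) volume features; cond: (B, cond_dim)
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=1)
        gamma = gamma.view(*gamma.shape, 1, 1, 1)   # broadcast over D, H, W
        beta = beta.view(*beta.shape, 1, 1, 1)
        return gamma * feats + beta

film = FiLM(cond_dim=32, num_channels=16)           # e.g., 32 tooth flags
feats = torch.randn(2, 16, 8, 8, 8)
cond = torch.randint(0, 2, (2, 32)).float()         # binary tooth attributes
print(film(feats, cond).shape)                      # torch.Size([2, 16, 8, 8, 8])
```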

[CV-69] Effect of Data Augmentation on Conformal Prediction for Diabetic Retinopathy MICCAI-2025

【Quick Read】: This paper addresses the lack of reliable uncertainty quantification in deep learning models for high-stakes clinical tasks such as diabetic retinopathy (DR) grading: despite high accuracy, the inability to assess predictive uncertainty limits clinical deployment. The key to the solution is the distribution-free conformal prediction (CP) framework, combined with a systematic analysis of how different data augmentation strategies affect CP performance. The study finds that sample-mixing augmentations (Mixup and CutMix) not only improve accuracy but also yield more reliable and efficient uncertainty estimates, whereas some augmentations (e.g., CLAHE) can degrade model certainty, highlighting the need to co-design augmentation strategies with downstream uncertainty quantification to build genuinely trustworthy medical imaging AI.

Link: https://arxiv.org/abs/2508.14266
Authors: Rizwan Ahamed, Annahita Amireskandari, Joel Palko, Carol Laxson, Binod Bhattarai, Prashnna Gyawali
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 3rd Workshop in Data Engineering in Medical Imaging (DEMI), MICCAI-2025 Workshop

Click to view abstract

Abstract:The clinical deployment of deep learning models for high-stakes tasks such as diabetic retinopathy (DR) grading requires demonstrable reliability. While models achieve high accuracy, their clinical utility is limited by a lack of robust uncertainty quantification. Conformal prediction (CP) offers a distribution-free framework to generate prediction sets with statistical guarantees of coverage. However, the interaction between standard training practices like data augmentation and the validity of these guarantees is not well understood. In this study, we systematically investigate how different data augmentation strategies affect the performance of conformal predictors for DR grading. Using the DDR dataset, we evaluate two backbone architectures – ResNet-50 and a Co-Scale Conv-Attentional Transformer (CoaT) – trained under five augmentation regimes: no augmentation, standard geometric transforms, CLAHE, Mixup, and CutMix. We analyze the downstream effects on conformal metrics, including empirical coverage, average prediction set size, and correct efficiency. Our results demonstrate that sample-mixing strategies like Mixup and CutMix not only improve predictive accuracy but also yield more reliable and efficient uncertainty estimates. Conversely, methods like CLAHE can negatively impact model certainty. These findings highlight the need to co-design augmentation strategies with downstream uncertainty quantification in mind to build genuinely trustworthy AI systems for medical imaging.
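
The split conformal recipe the paper builds on is standard and compact enough to show in code: calibrate a nonconformity-score quantile on held-out data, then include in each prediction set every class whose score clears it. The sketch below uses the common 1 - p(true class) score; the paper may use a different score variant.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with the 1 - p(true class) score; gives
    roughly (1 - alpha) marginal coverage under exchangeability.

    cal_probs:  (n, K) softmax outputs on a held-out calibration set
    cal_labels: (n,)   true labels for the calibration set
    test_probs: (m, K) softmax outputs on test images
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample correction
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    return test_probs >= 1.0 - qhat                       # boolean prediction sets

rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(5), size=200)               # 5 DR grades
cal_y = rng.integers(0, 5, size=200)
test_p = rng.dirichlet(np.ones(5), size=3)
print(conformal_prediction_sets(cal_p, cal_y, test_p, alpha=0.1))
```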

[CV-70] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

【Quick Read】: This paper addresses the insufficient robustness and generalization of large multimodal models (LMMs) in aligning visual and textual modalities, which stems from alignments and correlations that may rely on surface-level associations rather than deep semantic understanding. The key to the solution is two novel pre-training and fine-tuning tasks, reconstructing the image order and reconstructing the text order, which strengthen the model's understanding of cross-modal structure; a directed-token approach to capture visual and textual knowledge, enabling reconstruction of the correct order of visual inputs; and an Image-to-Response Guided loss that further improves the visual understanding reflected in the model's responses. The approach consistently achieves state-of-the-art (SoTA) performance on academic task-oriented and instruction-following LMM benchmarks.

Link: https://arxiv.org/abs/2508.14264
Authors: Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu
Institutions: CVIU Lab, University of Arkansas, USA; Vietnam National University, Ho Chi Minh City University of Science, Vietnam; Carnegie Mellon University, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM’s pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.

[CV-71] OmniSense: Towards Edge-Assisted Online Analytics for 360-Degree Videos

【Quick Read】: This paper addresses the high latency and low accuracy of real-time immersive analytics on 360° videos in edge computing settings with constrained compute and network bandwidth. The central challenge is processing the massive panoramic video data efficiently while preserving accuracy and timeliness. The key to the solution is the OmniSense framework, which introduces a lightweight spherical region of interest (SRoI) prediction algorithm to prune redundant information and, incorporating video content and network dynamics, smartly scales vision models to analyze the predicted SRoIs with optimized resource utilization. Experiments show that, compared with resource-agnostic baselines, OmniSense improves accuracy by 19.8%-114.6% at similar end-to-end latencies, and achieves 2.0x-2.4x speedups at comparable accuracy.

Link: https://arxiv.org/abs/2508.14237
Authors: Miao Zhang, Yifei Zhu, Linfeng Shen, Fangxin Wang, Jiangchuan Liu
Institutions: Simon Fraser University; Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Peng Cheng Laboratory
Categories: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Notes: 10 pages; Accepted by INFOCOM'23

Click to view abstract

Abstract:With the reduced hardware costs of omnidirectional cameras and the proliferation of various extended reality applications, more and more 360° videos are being captured. To fully unleash their potential, advanced video analytics is expected to extract actionable insights and situational knowledge without blind spots from the videos. In this paper, we present OmniSense, a novel edge-assisted framework for online immersive video analytics. OmniSense achieves both low latency and high accuracy, combating the significant computation and network resource challenges of analyzing 360° videos. Motivated by our measurement insights into 360° videos, OmniSense introduces a lightweight spherical region of interest (SRoI) prediction algorithm to prune redundant information in 360° frames. Incorporating the video content and network dynamics, it then smartly scales vision models to analyze the predicted SRoIs with optimized resource utilization. We implement a prototype of OmniSense with commodity devices and evaluate it on diverse real-world collected 360° videos. Extensive evaluation results show that compared to resource-agnostic baselines, it improves the accuracy by 19.8%–114.6% with similar end-to-end latencies. Meanwhile, it hits 2.0×–2.4× speedups while keeping the accuracy on par with the highest accuracy of baselines.

[CV-72] Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams

【Quick Read】: This paper addresses the limitations of conventional convolutional neural networks (CNNs) on complex scenes and fine-grained classification, particularly their weak modeling of relationships among pixels or regions within an image. The key to the solution is a new framework combining Graph Convolutional Networks (GCNs) with Voronoi diagrams: pixels or regions are treated as vertices of a graph that is simplified via Delaunay triangulation, capturing spatial relationships within images more effectively. In particular, the authors propose an improved GCN, the normalized Voronoi Graph Convolution Network (NVGCN), which is faster than a regular GCN, improves pre-processing time and classification accuracy, and surpasses state-of-the-art methods on several benchmark datasets.

Link: https://arxiv.org/abs/2508.14218
Authors: Mustafa Mohammadi Gharasuie, Luis Rueda
Institutions: University of Windsor
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 14 pages, 13 figures

Click to view abstract

Abstract:Recent advances in image classification have been significantly propelled by the integration of Graph Convolutional Networks (GCNs), offering a novel paradigm for handling complex data structures. This study introduces an innovative framework that employs GCNs in conjunction with Voronoi diagrams to perform image classification, leveraging their exceptional capability to model relational data. Unlike conventional convolutional neural networks, our approach utilizes a graph-based representation of images, where pixels or regions are treated as vertices of a graph, which are then simplified in the form of the corresponding Delaunay triangulations. Our model yields significant improvement in pre-processing time and classification accuracy on several benchmark datasets, surpassing existing state-of-the-art models, especially in scenarios that involve complex scenes and fine-grained categories. The experimental results, validated via cross-validation, underscore the potential of integrating GCNs with Voronoi diagrams in advancing image classification tasks. This research contributes to the field by introducing a novel approach to image classification, while opening new avenues for developing graph-based learning paradigms in other domains of computer vision and non-structured data. In particular, we have proposed a new version of the GCN in this paper, namely the normalized Voronoi Graph Convolution Network (NVGCN), which is faster than the regular GCN.
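
The pipeline's two building blocks, turning points (e.g., region centroids) into a Delaunay graph and running a normalized graph-convolution step, can be sketched with SciPy and NumPy as follows. The normalization shown is the standard symmetric GCN rule, not necessarily NVGCN's exact variant.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_graph(points: np.ndarray) -> np.ndarray:
    """Adjacency matrix from the Delaunay triangulation of 2D points,
    the dual of their Voronoi diagram."""
    tri = Delaunay(points)
    n = len(points)
    A = np.zeros((n, n))
    for simplex in tri.simplices:            # each triangle contributes 3 edges
        for i in range(3):
            a, b = simplex[i], simplex[(i + 1) % 3]
            A[a, b] = A[b, a] = 1.0
    return A

def gcn_layer(A, X, W):
    """One symmetric-normalized propagation: relu(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

pts = np.random.default_rng(0).random((30, 2))   # 30 region centroids
A = delaunay_graph(pts)
X = np.random.default_rng(1).random((30, 8))     # per-region features
W = np.random.default_rng(2).random((8, 4))
print(gcn_layer(A, X, W).shape)                  # (30, 4)
```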

[CV-73] A Survey on Video Anomaly Detection via Deep Learning: Human Vehicle and Environment

【Quick Read】: This survey addresses the fragmentation of research in video anomaly detection (VAD): current methods lack systematic integration across supervision levels and adaptive learning paradigms (online, active, and continual learning), and human-centric, vehicle-centric, and environment-centric applications pose distinct challenges and design requirements. The key contribution is a structured organization of the literature along these dimensions, supervision level, learning mechanism, and application category, together with an analysis of the core contributions and limitations of existing methods, giving the community a clear theoretical and practical framework for advancing VAD toward more unified, scalable, and deployable systems.

Link: https://arxiv.org/abs/2508.14203
Authors: Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Institutions: UNC Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.

[CV-74] CLIPSym: Delving into Symmetry Detection with CLIP

【Quick Read】: This paper addresses the challenge of detecting rotation and reflection symmetries in computer vision, focusing on how to exploit the semantic information in pre-trained vision-language models such as CLIP. The key to the solution is CLIPSym, which combines CLIP's image and language encoders with a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution to better capture geometric symmetry, and introduces a novel Semantic-Aware Prompt Grouping (SAPG) technique that aggregates a diverse set of frequent object-based prompts to better integrate semantic cues, significantly improving symmetry detection accuracy.

Link: https://arxiv.org/abs/2508.14197
Authors: Tinghan Yang, Md Ashiqur Rahman, Raymond A. Yeh
Institutions: Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at this https URL.
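
In spirit, SAPG pools many object-based prompts into an aggregated text embedding that supplies the semantic cue. The snippet below shows the basic aggregation step with the open-source OpenAI CLIP package; the prompt list, the grouping, and the simple mean pooling are illustrative assumptions rather than the paper's actual procedure.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# A hypothetical group of frequent object-based prompts
object_prompts = [
    "a photo of a symmetric butterfly",
    "a photo of a symmetric building facade",
    "a photo of a symmetric flower",
]
with torch.no_grad():
    tokens = clip.tokenize(object_prompts).to(device)
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    group_embedding = text_feats.mean(dim=0)     # aggregated semantic cue
print(group_embedding.shape)                      # e.g. torch.Size([512])
```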

[CV-75] Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

【Quick Read】: This paper addresses the challenge of scale variation in computer vision: objects of the same class appear at different sizes depending on distance, and scale can vary locally within a single image. The key to the solution is a deep equilibrium canonicalizer (DEC) that improves a model's local scale equivariance and thereby its robustness to scale changes. DEC can be seamlessly incorporated into existing architectures and adapted to pre-trained models; on the competitive ImageNet benchmark it improves both performance and local scale consistency for four popular pre-trained deep networks (ViT, DeiT, Swin, and BEiT).

Link: https://arxiv.org/abs/2508.14187
Authors: Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at this https URL.

[CV-76] RynnEC: Bringing MLLMs into Embodied World

【Quick Read】: This paper addresses the challenges embodied agents face in fine-grained perception of and interaction with the physical world, especially fine-grained understanding and reasoning over specific regions in video. Existing methods often lack precise region-level modeling, limiting performance on object attribute understanding, segmentation, and spatial reasoning. The key to the solution is RynnEC, a video multimodal large language model built on a general vision-language foundation model that adds a region encoder and a mask decoder to support flexible region-level video interaction. To mitigate the scarcity of annotated 3D data, the authors design an egocentric-video-based data generation pipeline and construct RynnEC-Bench, a region-centric benchmark for systematically evaluating embodied cognitive capabilities.

Link: https://arxiv.org/abs/2508.14160
Authors: Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: The technical report of RynnEC, an embodied cognition MLLM

Click to view abstract

Abstract:We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: this https URL

[CV-77] LENS: Learning to Segment Anything with Unified Reinforced Reasoning

【Quick Read】: This paper addresses the poor generalization of text-prompted image segmentation models to unseen prompts and unseen domains, which stems from existing supervised fine-tuning methods ignoring explicit chain-of-thought (CoT) reasoning at test time. The key to the solution is LENS, a scalable reinforcement learning (RL) framework that jointly optimizes the reasoning process and segmentation end-to-end, with unified rewards spanning sentence-, box-, and segment-level cues that encourage informative CoT rationales while refining mask quality. Built on the publicly available 3-billion-parameter Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned baseline GLaMM by up to 5.6%, showing that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and a practical path toward more generalizable Segment Anything models.

Link: https://arxiv.org/abs/2508.14153
Authors: Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang
Institutions: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Code is released at this https URL

Click to view abstract

Abstract:Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at this https URL.

[CV-78] STAS: Spatio-Temporal Adaptive Computation Time for Spiking Transformers

【Quick Read】: This paper addresses the high latency and computational overhead of spiking neural networks (SNNs) in vision Transformer (ViT) architectures caused by their multi-timestep operation, as well as the fragmentation of existing dynamic computation methods across spatial, temporal, and architecture-specific redundancies. The key to the solution is STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), which co-designs the static architecture and the dynamic computation policy: an integrated spike patch splitting (I-SPS) module establishes temporal stability through a unified input representation, resolving the original architecture's violation of the temporal-similarity prerequisite, and an adaptive spiking self-attention (A-SSA) module prunes tokens along both spatial and temporal axes, reducing energy consumption by up to 45.9% while improving accuracy.

Link: https://arxiv.org/abs/2508.14138
Authors: Donghwa Kang, Doohyun Kim, Sang-Ki Ko, Jinkyu Lee, Brent ByungHoon Kang, Hyeongboo Baek
Institutions: Sungkyunkwan University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Notes: 8 pages

Click to view abstract

Abstract:Spiking neural networks (SNNs) offer energy efficiency over artificial neural networks (ANNs) but suffer from high latency and computational overhead due to their multi-timestep operational nature. While various dynamic computation methods have been developed to mitigate this by targeting spatial, temporal, or architecture-specific redundancies, they remain fragmented. While the principles of adaptive computation time (ACT) offer a robust foundation for a unified approach, its application to SNN-based vision Transformers (ViTs) is hindered by two core issues: the violation of its temporal similarity prerequisite and a static architecture fundamentally unsuited for its principles. To address these challenges, we propose STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), a framework that co-designs the static architecture and dynamic computation policy. STAS introduces an integrated spike patch splitting (I-SPS) module to establish temporal stability by creating a unified input representation, thereby solving the architectural problem of temporal dissimilarity. This stability, in turn, allows our adaptive spiking self-attention (A-SSA) module to perform two-dimensional token pruning across both spatial and temporal axes. Implemented on spiking Transformer architectures and validated on CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while simultaneously improving accuracy over SOTA models.

[CV-79] Federated Action Recognition for Smart Worker Assistance Using FastPose

【Quick Read】: This paper addresses privacy protection and cross-user generalization in worker action recognition (HAR) for smart manufacturing. Existing skeleton-based HAR methods mostly rely on centralized datasets, which is impractical in privacy-sensitive industrial settings, and they transfer poorly across users. The key to the solution is a federated learning (FL) framework that extracts upper-body skeletal features with a modified FastPose model and compares LSTM and Transformer temporal backbones under four training paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). Experiments show that FL not only preserves privacy but also markedly improves recognition on the global test set and on an unseen external client, with FedEnsemble gaining up to +58.3 percentage points over centralized training, demonstrating scalable, privacy-aware pose-based HAR in heterogeneous industrial environments.

Link: https://arxiv.org/abs/2508.14113
Authors: Vinit Hegiste, Vidit Goyal, Tatjana Legler, Martin Ruskowski
Institutions: RPTU Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC)
Notes: 8 pages and submitted to FLTA2025 conference

Click to view abstract

Abstract:In smart manufacturing environments, accurate and real-time recognition of worker actions is essential for productivity, safety, and human-machine collaboration. While skeleton-based human activity recognition (HAR) offers robustness to lighting, viewpoint, and background variations, most existing approaches rely on centralized datasets, which are impractical in privacy-sensitive industrial scenarios. This paper presents a federated learning (FL) framework for pose-based HAR using a custom skeletal dataset of eight industrially relevant upper-body gestures, captured from five participants and processed using a modified FastPose model. Two temporal backbones, an LSTM and a Transformer encoder, are trained and evaluated under four paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). On the global test set, the FL Transformer improves over centralized training by +12.4 percentage points, with FedEnsemble delivering a +16.3 percentage points gain. On an unseen external client, FL and FedEnsemble exceed centralized accuracy by +52.6 and +58.3 percentage points, respectively. These results demonstrate that FL not only preserves privacy but also substantially enhances cross-user generalization, establishing it as a practical solution for scalable, privacy-aware HAR in heterogeneous industrial settings.
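
Weighted federated averaging (FedAvg), one of the compared paradigms, is compact enough to show directly: after each round of local training, client weights are averaged in proportion to local dataset size. The helper and the tiny stand-in models below are illustrative, not the paper's code.

```python
import copy
import torch

def fedavg(client_states, client_sizes):
    """Weighted FedAvg: average client state_dicts in proportion to each
    client's local dataset size. (.float() keeps the sketch simple; integer
    buffers such as BatchNorm counters would need special handling.)"""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# One illustrative round with tiny models standing in for the HAR backbones:
clients = [torch.nn.Linear(4, 2) for _ in range(3)]
sizes = [120, 80, 200]                       # local sample counts per client
new_state = fedavg([m.state_dict() for m in clients], sizes)
global_model = torch.nn.Linear(4, 2)
global_model.load_state_dict(new_state)
```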

[CV-80] A comparative study of some wavelet and sampling operators on various features of an image

【Quick Read】: This paper studies the local and global approximation properties of positive sampling Kantorovich (SK) operators in image processing, with particular attention to how different SK-type operators (Gaussian, bilateral, and thresholding wavelet-based operators) behave for image denoising and feature preservation. The key to the approach is a systematic evaluation of the operators with a set of quantitative measures, the mean square error (MSE), speckle index (SI), speckle suppression index (SSI), speckle mean preservation index (SMPI), and equivalent number of looks (ENL), combined with a region-of-interest (ROI) analysis of a 2D slice of the Shepp-Logan phantom to verify the fundamental theorem of approximation (FTA), revealing which operators are better suited to which image features under non-ideal conditions.

Link: https://arxiv.org/abs/2508.14043
Authors: Digvijay Singh, Rahul Shukla, Karunesh Kumar Singh
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
Notes: 15 pages

Click to view abstract

Abstract:This research includes the study of some positive sampling Kantorovich operators (SK operators) and their convergence properties. A comprehensive analysis of both local and global approximation properties is presented using sampling Kantorovich (SK), Gaussian, bilateral and thresholding wavelet-based operators in the framework of SK operators. Explicitly, we start the article by introducing the basic terminology and state the fundamental theorem of approximation (FTA) by imposing the various required conditions corresponding to the various defined operators. We measure the error and study other mathematical parameters such as the mean square error (MSE), the speckle index (SI), the speckle suppression index (SSI), the speckle mean preservation index (SMPI), and the equivalent number of looks (ENL) at various levels of resolution parameters. The nature of these operators is demonstrated via an example under ideal conditions in tabulated form at a certain level of samples. Eventually, another numerical example is illustrated to discuss the region of interest (ROI) via SI, SSI and SMPI of a 2D slice of the Shepp-Logan phantom taken from the 3D image, which justifies the fundamental theorem of approximation (FTA). At the end of the derivations and illustrations we observe that the various operators have their own significance for studying different features of the image because of its uneven nature (non-ideal conditions). Therefore, to some extent, some operators work well and some do not for specific features of the image.
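
For readers unfamiliar with the operators under study, the sketch below numerically evaluates a 1D sampling Kantorovich operator with the classical Fejér kernel, approximating the inner cell integrals by sampled means. The kernel choice, truncation support, and discretization are illustrative assumptions, and the paper itself works with 2D images.

```python
import numpy as np

def sk_operator_1d(f, w, x_grid, kernel, support=4.0):
    """Numerically evaluate (S_w f)(x) = sum_k kernel(w*x - k) * m_k, where
    m_k is w times the integral of f over [k/w, (k+1)/w] (approximated here
    by the mean of f on the cell). The kernel must satisfy the usual
    partition-of-unity-type assumptions of the SK theory."""
    out = np.zeros_like(x_grid, dtype=float)
    for i, x in enumerate(x_grid):
        k_min = int(np.floor(w * x - support))
        k_max = int(np.ceil(w * x + support))
        for k in range(k_min, k_max + 1):
            u = np.linspace(k / w, (k + 1) / w, 16)
            cell_mean = f(u).mean()            # ~ w * integral of f over the cell
            out[i] += kernel(w * x - k) * cell_mean
    return out

def fejer_kernel(t):
    # Fejer's kernel, a classical admissible choice in the SK literature
    return 0.5 * np.sinc(t / 2.0) ** 2

f = lambda u: np.sin(2 * np.pi * u)
x = np.linspace(0.0, 1.0, 64)
approx = sk_operator_1d(f, w=32, x_grid=x, kernel=fejer_kernel)
print(float(np.max(np.abs(approx - f(x)))))    # shrinks as w grows
```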

[CV-81] Rule-based Key-Point Extraction for MR-Guided Biomechanical Digital Twins of the Spine

【Quick Read】: This paper addresses the limited simulation accuracy of digital twins in clinical applications caused by the lack of individualized, high-precision anatomical modeling; the core challenge is extracting subpixel-accurate key points from MRI to build biomechanical models that closely match individual anatomy. The key to the solution is a rule-based image processing method, adapted from prior CT-based techniques, that adds robust image registration and vertebra-specific orientation estimation to precisely locate anatomically meaningful landmarks such as muscle and ligament insertion points. These landmarks provide reliable boundary conditions and force application points for biomechanical simulation, supporting personalized spinal mechanics simulation and tailored clinical diagnostics and treatment planning.

Link: https://arxiv.org/abs/2508.14708
Authors: Robert Graf, Tanja Lerchl, Kati Nispel, Hendrik Möller, Matan Atad, Julian McGinnis, Julius Maria Watrinet, Johannes Paetzold, Daniel Rueckert, Jan S. Kirschke
Institutions: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Digital twins offer a powerful framework for subject-specific simulation and clinical decision support, yet their development often hinges on accurate, individualized anatomical modeling. In this work, we present a rule-based approach for subpixel-accurate key-point extraction from MRI, adapted from prior CT-based methods. Our approach incorporates robust image alignment and vertebra-specific orientation estimation to generate anatomically meaningful landmarks that serve as boundary conditions and force application points, like muscle and ligament insertions in biomechanical models. These models enable the simulation of spinal mechanics considering the subject’s individual anatomy, and thus support the development of tailored approaches in clinical diagnostics and treatment planning. By leveraging MR imaging, our method is radiation-free and well-suited for large-scale studies and use in underrepresented populations. This work contributes to the digital twin ecosystem by bridging the gap between precise medical image analysis with biomechanical simulation, and aligns with key themes in personalized modeling for healthcare.

[CV-82] Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model

【Quick Read】: This paper addresses the limited adoption of multiplex imaging in pathology due to the complexity and cost of data acquisition, and the fact that most existing HE (hematoxylin and eosin) image repositories lack corresponding multiplex images, preventing multimodal analysis. The key to the solution is to exploit the strong priors of pretrained latent diffusion models (LDMs): a conditional diffusion framework takes HE images as input and conditions on each marker to generate multiplex images marker by marker while sharing a unified architecture across all markers. A single-step sampling fine-tuning strategy with pixel-level losses improves both color-contrast fidelity and inference efficiency, scaling generation to 18 marker types, a substantial increase over the 2-3 markers achieved by previous approaches.

Link: https://arxiv.org/abs/2508.14681
Authors: Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen, Peter K. Sorger, Hanspeter Pfister, Won-Ki Jeong
Institutions: 1. Korea Advanced Institute of Science and Technology; 2. Harvard University; 3. University of California, Berkeley; 4. Rice University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (HE) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of HE images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from HE images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between HE and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing HE image repositories.

[CV-83] From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound

【Quick Read】: This paper addresses the limited adoption of 3D ultrasound caused by its dependence on specialized hardware and restrictive acquisition protocols. The core challenge is reconstructing high-quality 3D anatomy from freehand 2D transvaginal ultrasound (TVS) sweeps without external tracking or learned pose estimators. The key to the solution is an unsupervised framework that adapts Gaussian Splatting to the ultrasound domain: it introduces a slice-aware, differentiable rasterizer tailored to the physics and geometry of ultrasound imaging, models anatomy as a collection of anisotropic 3D Gaussians optimized directly from image-level supervision, and leverages sensorless probe motion estimation and domain-specific geometric priors to obtain a compact, flexible, memory-efficient 3D representation with high spatial fidelity through purely computational means.

Link: https://arxiv.org/abs/2508.14552
Authors: Max Krähenmann, Sergio Tascon-Morales, Fabian Laumer, Julia E. Vogt, Ece Ozkan
Institutions: ETH Zurich; Scanvio Medical AG
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view abstract

Abstract:Volumetric ultrasound has the potential to significantly improve diagnostic accuracy and clinical decision-making, yet its widespread adoption remains limited by dependence on specialized hardware and restrictive acquisition protocols. In this work, we present a novel unsupervised framework for reconstructing 3D anatomical structures from freehand 2D transvaginal ultrasound (TVS) sweeps, without requiring external tracking or learned pose estimators. Our method adapts the principles of Gaussian Splatting to the domain of ultrasound, introducing a slice-aware, differentiable rasterizer tailored to the unique physics and geometry of ultrasound imaging. We model anatomy as a collection of anisotropic 3D Gaussians and optimize their parameters directly from image-level supervision, leveraging sensorless probe motion estimation and domain-specific geometric priors. The result is a compact, flexible, and memory-efficient volumetric representation that captures anatomical detail with high spatial fidelity. This work demonstrates that accurate 3D reconstruction from 2D ultrasound images can be achieved through purely computational means, offering a scalable alternative to conventional 3D systems and enabling new opportunities for AI-assisted analysis and diagnosis.
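
The core rasterization primitive, evaluating anisotropic 3D Gaussians at the 3D positions of a 2D slice's pixels, can be sketched as below. A real slice-aware rasterizer would add ultrasound-specific intensity modeling and differentiate through the pose and Gaussian parameters, which this NumPy toy omits.

```python
import numpy as np

def gaussian_on_slice(mu, Sigma, amplitude, plane_pts):
    """Evaluate one anisotropic 3D Gaussian at points of a 2D image plane.

    mu: (3,) center; Sigma: (3, 3) covariance; plane_pts: (N, 3) 3D positions
    of the slice's pixels, assumed to come from the estimated probe pose.
    """
    diff = plane_pts - mu
    inv = np.linalg.inv(Sigma)
    mahal = np.einsum("ni,ij,nj->n", diff, inv, diff)  # squared Mahalanobis
    return amplitude * np.exp(-0.5 * mahal)

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 3))
Sigma = L @ L.T + 0.1 * np.eye(3)           # a valid anisotropic covariance
pts = rng.uniform(-1, 1, size=(5, 3))        # 5 pixel positions on a slice
print(gaussian_on_slice(np.zeros(3), Sigma, 1.0, pts))
```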
zh
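
切片感知光栅化的一个数学直觉是:各向异性 3D 高斯被成像平面切开后,切面上的函数仍是(未归一化的)2D 高斯。下面用 NumPy 演示"过均值平面切片"的协方差计算,仅为原理示意,论文中的可微光栅化器细节要复杂得多:

```python
import numpy as np

def slice_gaussian_cov(sigma3: np.ndarray) -> np.ndarray:
    """取 z=0、过均值的平面切片:切片上的函数仍是(未归一化的)2D 高斯,
    其精度矩阵等于 3D 精度矩阵的左上 2x2 块。"""
    precision = np.linalg.inv(sigma3)         # 3D 精度矩阵
    return np.linalg.inv(precision[:2, :2])   # 2D 切片协方差

# 构造一个对称正定的各向异性 3D 协方差
A = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 0.5]])
sigma3 = A @ A.T
print(slice_gaussian_cov(sigma3))
```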

[CV-84] Deep Skin Lesion Segmentation with Transformer-CNN Fusion: Toward Intelligent Skin Cancer Analysis

【速读】:该论文旨在解决皮肤病变图像中复杂结构、边界模糊及显著尺度变化带来的分割精度不足问题。其解决方案的关键在于改进的TransUNet架构:通过在传统编码器-解码器框架中引入Transformer模块以建模全局语义信息,同时保留卷积分支以维持局部纹理与边缘特征,从而增强对细粒度结构的感知能力;此外,设计了边界引导注意力机制和多尺度上采样路径,有效提升病灶边界的定位精度与分割一致性,使模型在复杂场景下具备更强的边界重建与结构恢复能力。

链接: https://arxiv.org/abs/2508.14509
作者: Xin Wang,Xiaopei Zhang,Xingang Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes a high-precision semantic segmentation method based on an improved TransUNet architecture to address the challenges of complex lesion structures, blurred boundaries, and significant scale variations in skin lesion images. The method integrates a transformer module into the traditional encoder-decoder framework to model global semantic information, while retaining a convolutional branch to preserve local texture and edge features. This enhances the model’s ability to perceive fine-grained structures. A boundary-guided attention mechanism and multi-scale upsampling path are also designed to improve lesion boundary localization and segmentation consistency. To verify the effectiveness of the approach, a series of experiments were conducted, including comparative studies, hyperparameter sensitivity analysis, data augmentation effects, input resolution variation, and training data split ratio tests. Experimental results show that the proposed model outperforms existing representative methods in mIoU, mDice, and mAcc, demonstrating stronger lesion recognition accuracy and robustness. In particular, the model achieves better boundary reconstruction and structural recovery in complex scenarios, making it well-suited for the key demands of automated segmentation tasks in skin lesion analysis.
zh
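
边界引导注意力的一种常见做法,是用一个边界预测分支产生空间注意力图,再对解码特征加权。以下 PyTorch 片段为假设性示意,模块名与结构均非论文原始定义:

```python
import torch
import torch.nn as nn

class BoundaryGuidedAttention(nn.Module):
    """用边界预测分支生成空间注意力,放大病灶边缘附近的特征(假设性示意)。"""
    def __init__(self, channels: int):
        super().__init__()
        self.boundary_head = nn.Conv2d(channels, 1, kernel_size=1)  # 边界预测分支

    def forward(self, feat: torch.Tensor):
        boundary = torch.sigmoid(self.boundary_head(feat))  # [B,1,H,W] 边界概率图
        attended = feat * (1.0 + boundary)                  # 边界附近的响应被增强
        return attended, boundary                           # 边界图可另接辅助监督损失

x = torch.randn(2, 64, 32, 32)
attended, boundary = BoundaryGuidedAttention(64)(x)
print(attended.shape, boundary.shape)   # [2, 64, 32, 32] 与 [2, 1, 32, 32]
```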

[CV-85] Fine-grained Image Quality Assessment for Perceptual Image Restoration

【速读】:该论文旨在解决现有图像质量评估(Image Quality Assessment, IQA)指标在图像恢复(Image Restoration, IR)任务中难以区分恢复图像细微质量差异的问题。传统IQA方法主要依赖标量评分,但在面对IR任务时表现出显著不足,尤其在细粒度质量比较上与人类偏好存在不一致。解决方案的关键在于构建首个面向图像恢复的细粒度质量评估数据集FGRestore,包含18,408张恢复图像和30,886条成对偏好标注,并基于此提出FGResQ模型,该模型同时具备粗粒度评分回归与细粒度排序能力,从而更准确地匹配人类感知与恢复质量之间的关系。

链接: https://arxiv.org/abs/2508.14475
作者: Xiangfei Sheng,Xiaofeng Pan,Zhichao Yang,Pengfei Chen,Leida Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 9 pages,6 figures

点击查看摘要

Abstract:Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weakness for IR task, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which reveal significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released in this https URL
zh
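
"粗粒度评分回归 + 细粒度排序"的训练目标通常可写成 MSE 与 margin ranking 损失的加权和。以下示意中的权重与 margin 均为假设值,并非 FGResQ 的原始配置:

```python
import torch
import torch.nn.functional as F

def joint_iqa_loss(score_a, score_b, mos_a, mos_b, pref_ab, margin=0.1, w=0.5):
    """score_*: 模型对两张恢复图像的预测分数;mos_*: 标注分数;
    pref_ab: +1 表示 A 优于 B,-1 反之(成对偏好标注)。"""
    reg = F.mse_loss(score_a, mos_a) + F.mse_loss(score_b, mos_b)            # 粗粒度回归
    rank = F.margin_ranking_loss(score_a, score_b, pref_ab, margin=margin)   # 细粒度排序
    return reg + w * rank

sa = torch.randn(8, requires_grad=True)
sb = torch.randn(8, requires_grad=True)
pref = torch.ones(8)   # 本例假设每一对中 A 均优于 B
loss = joint_iqa_loss(sa, sb, torch.rand(8), torch.rand(8), pref)
loss.backward()
```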

[CV-86] Physics-Constrained Diffusion Reconstruction with Posterior Correction for Quantitative and Fast PET Imaging

【速读】:该论文旨在解决深度学习在正电子发射断层成像(PET)重建中面临的定量准确性不足与伪影问题,这些问题源于模型可解释性差、数据驱动依赖性强以及过拟合等挑战。其解决方案的关键在于提出了一种带有后验物理修正的条件扩散模型(PET-DPC),通过创新的归一化流程生成几何时间飞行概率图像(GTP-image)作为输入,并在扩散采样过程中嵌入散射、衰减和随机事件的物理信息,实现后验校正,从而显著提升重建精度与鲁棒性。

链接: https://arxiv.org/abs/2508.14364
作者: Yucun Hou,Fenglin Zhan,Chenxi Li,Ziquan Yuan,Haoyu Lu,Yue Chen,Yihao Chen,Kexin Wang,Runze Liao,Haoqi Wen,Ganxi Du,Jiaru Ni,Taoran Chen,Jinyue Zhang,Jigang Yang,Jianyong Jiang
机构: 未知
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based reconstruction of positron emission tomography (PET) data has gained increasing attention in recent years. While these methods achieve fast reconstruction, concerns remain regarding quantitative accuracy and the presence of artifacts, stemming from limited model interpretability, data-driven dependence, and overfitting. These challenges have hindered clinical adoption. To address them, we propose a conditional diffusion model with posterior physical correction (PET-DPC) for PET image reconstruction. An innovative normalization procedure generates the input Geometric TOF Probabilistic Image (GTP-image), while physical information is incorporated during the diffusion sampling process to perform posterior scatter, attenuation, and random corrections. The model was trained and validated on 300 brain and 50 whole-body PET datasets, a physical phantom, and 20 simulated brain datasets. PET-DPC produced reconstructions closely aligned with fully corrected OSEM images, outperforming end-to-end deep learning models in quantitative metrics and, in some cases, surpassing traditional iterative methods. The model also generalized well to out-of-distribution (OOD) data. Compared to iterative methods, PET-DPC reduced reconstruction time by 50% for brain scans and 85% for whole-body scans. Ablation studies confirmed the critical role of posterior correction in implementing scatter and attenuation corrections, enhancing reconstruction accuracy. Experiments with physical phantoms further demonstrated PET-DPC's ability to preserve background uniformity and accurately reproduce tumor-to-background intensity ratios. Overall, these results highlight PET-DPC as a promising approach for rapid, quantitatively accurate PET reconstruction, with strong potential to improve clinical imaging workflows.
zh

[CV-87] A Systematic Study of Deep Learning Models and xAI Methods for Region-of-Interest Detection in MRI Scans

【速读】:该论文旨在解决膝关节磁共振成像(MRI)中人工解读耗时且存在观察者间差异的问题,提出基于深度学习与可解释人工智能(xAI)技术的自动化感兴趣区域(ROI)检测方法。其解决方案的关键在于系统性评估多种深度学习架构(包括ResNet50、InceptionV3、Vision Transformer及多类U-Net变体)结合xAI技术(如Grad-CAM和显著性图)在MRI图像分类与分割中的性能表现,发现卷积神经网络(CNN)迁移学习方法在当前MRNet数据集下最优,而Grad-CAM提供了最具临床意义的模型解释性。

链接: https://arxiv.org/abs/2508.14151
作者: Justin Yiu,Kushank Arora,Daniel Steinberg,Rohit Ghiya
机构: Georgia Institute of Technology (佐治亚理工学院); Imperial College London (帝国理工学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is an essential diagnostic tool for assessing knee injuries. However, manual interpretation of MRI slices remains time-consuming and prone to inter-observer variability. This study presents a systematic evaluation of various deep learning architectures combined with explainable AI (xAI) techniques for automated region of interest (ROI) detection in knee MRI scans. We investigate both supervised and self-supervised approaches, including ResNet50, InceptionV3, Vision Transformers (ViT), and multiple U-Net variants augmented with multi-layer perceptron (MLP) classifiers. To enhance interpretability and clinical relevance, we integrate xAI methods such as Grad-CAM and Saliency Maps. Model performance is assessed using AUC for classification and PSNR/SSIM for reconstruction quality, along with qualitative ROI visualizations. Our results demonstrate that ResNet50 consistently excels in classification and ROI identification, outperforming transformer-based models under the constraints of the MRNet dataset. While hybrid U-Net + MLP approaches show potential for leveraging spatial features in reconstruction and interpretability, their classification performance remains lower. Grad-CAM consistently provided the most clinically meaningful explanations across architectures. Overall, CNN-based transfer learning emerges as the most effective approach for this dataset, while future work with larger-scale pretraining may better unlock the potential of transformer models.
zh
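
文中评估的 Grad-CAM 可以用几行 PyTorch 自包含地实现:用目标类得分对末层卷积特征求梯度,对梯度做全局平均得到通道权重,再加权求和并过 ReLU。以下为玩具网络上的最小示意:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(8, 2)
    def forward(self, x):
        f = self.features(x)                     # [B,8,H,W] 末层卷积特征
        return self.fc(f.mean(dim=(2, 3))), f    # 全局平均池化后分类

model = TinyCNN().eval()
x = torch.randn(1, 1, 28, 28)
logits, fmap = model(x)
fmap.retain_grad()
logits[0, logits.argmax()].backward()            # 对目标类得分反向传播

weights = fmap.grad.mean(dim=(2, 3), keepdim=True)  # 通道权重 = 梯度的全局平均
cam = F.relu((weights * fmap).sum(dim=1))           # 加权求和 + ReLU
cam = cam / (cam.max() + 1e-8)                      # 归一化到 [0,1] 的热力图
print(cam.shape)                                    # torch.Size([1, 28, 28])
```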

[CV-88] Automated surgical planning with nnU-Net: delineation of the anatomy in hepatobiliary phase MRI

【速读】:该论文旨在解决肝胆相增强磁共振成像(gadoxetic acid-enhanced MRI)中肝脏解剖结构(包括肝实质、肿瘤、门静脉、肝静脉及胆道系统)自动分割的临床应用难题,以提升术前规划效率。其解决方案的关键在于基于nnU-Net v1架构开发了一种深度学习模型,通过在72例患者数据上训练并特别关注薄结构和拓扑保真度,实现了对多种复杂解剖结构的高精度自动化分割;测试集结果显示各结构Dice相似系数(DSC)均达到临床可用水平,且在真实临床场景中仅需少量人工修正即可用于三维手术规划,显著缩短了工作流程,并成功识别出放射科医生漏诊的额外肿瘤。

链接: https://arxiv.org/abs/2508.14133
作者: Karin A. Olthof,Matteo Fusagli,Bianca Güttner,Tiziano Natali,Bram Westerink,Stefanie Speidel,Theo J.M. Ruers,Koert F.D. Kuhlmann,Andrey Zhylka
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Background: The aim of this study was to develop and evaluate a deep learning-based automated segmentation method for hepatic anatomy (i.e., parenchyma, tumors, portal vein, hepatic vein and biliary tree) from the hepatobiliary phase of gadoxetic acid-enhanced MRI. This method should ease the clinical workflow of preoperative planning. Methods: Manual segmentation was performed on hepatobiliary phase MRI scans from 90 consecutive patients who underwent liver surgery between January 2020 and October 2023. A deep learning network (nnU-Net v1) was trained on 72 patients with an extra focus on thin structures and topography preservation. Performance was evaluated on an 18-patient test set by comparing automated and manual segmentations using Dice similarity coefficient (DSC). Following clinical integration, 10 segmentations (assessment dataset) were generated using the network and manually refined for clinical use to quantify required adjustments using DSC. Results: In the test set, DSCs were 0.97+/-0.01 for liver parenchyma, 0.80+/-0.04 for hepatic vein, 0.79+/-0.07 for biliary tree, 0.77+/-0.17 for tumors, and 0.74+/-0.06 for portal vein. Average tumor detection rate was 76.6+/-24.1%, with a median of one false-positive per patient. The assessment dataset showed minor adjustments were required for clinical use of the 3D models, with high DSCs for parenchyma (1.00+/-0.00), portal vein (0.98+/-0.01) and hepatic vein (0.95+/-0.07). Tumor segmentation exhibited greater variability (DSC 0.80+/-0.27). During prospective clinical use, the model detected three additional tumors initially missed by radiologists. Conclusions: The proposed nnU-Net-based segmentation method enables accurate and automated delineation of hepatic anatomy. This enables 3D planning to be applied efficiently as a standard-of-care for every patient undergoing liver surgery.
zh
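
评估所用的 Dice 相似系数(DSC)定义为 2|P∩G| / (|P|+|G|)。以下是一个最小实现示意:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """DSC = 2|P∩G| / (|P|+|G|),输入为 0/1 二值掩码。"""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

p = np.zeros((64, 64)); p[10:40, 10:40] = 1   # 自动分割结果(玩具)
g = np.zeros((64, 64)); g[15:45, 15:45] = 1   # 人工标注(玩具)
print(round(dice(p, g), 3))                   # 两个 30x30 方块部分重叠,约 0.694
```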

[CV-89] Fracture Detection and Localisation in Wrist and Hand Radiographs using Detection Transformer Variants

【速读】:该论文旨在解决急诊科中腕关节和手部骨折的放射影像(X光片)人工诊断效率低、易出错的问题。针对这一挑战,研究提出了一种基于对象检测Transformer模型(Co-DETR)的自动化诊断流程,其关键在于:首先利用在COCO数据集上预训练的Co-DETR模型对X光图像中的骨折区域进行定位与分类,再结合ResNet-50分类器对裁剪出的可疑区域进行精细化异常判断,并采用监督对比学习(supervised contrastive learning)提升特征嵌入质量,从而实现高精度、高召回率的骨折识别。最终系统在真实临床数据上达到83.1%准确率、85.1%精确率和96.4%召回率,展现出良好的泛化能力与临床实用性。

链接: https://arxiv.org/abs/2508.14129
作者: Aditya Bagri,Vasanthakumar Venugopal,Anandakumar D,Revathi Ezhumalai,Kalyan Sivasailam,Bargava Subramanian,VarshiniPriya,Meenakumari K S,Abi M,Renita S
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 21 figures

点击查看摘要

Abstract:Background: Accurate diagnosis of wrist and hand fractures using radiographs is essential in emergency care, but manual interpretation is slow and prone to errors. Transformer-based models show promise in improving medical image analysis, but their application to extremity fractures is limited. This study addresses this gap by applying object detection transformers to wrist and hand X-rays. Methods: We fine-tuned the RT-DETR and Co-DETR models, pre-trained on COCO, using over 26,000 annotated X-rays from a proprietary clinical dataset. Each image was labeled for fracture presence with bounding boxes. A ResNet-50 classifier was trained on cropped regions to refine abnormality classification. Supervised contrastive learning was used to enhance embedding quality. Performance was evaluated using AP@50, precision, and recall metrics, with additional testing on real-world X-rays. Results: RT-DETR showed moderate results (AP@50 = 0.39), while Co-DETR outperformed it with an AP@50 of 0.615 and faster convergence. The integrated pipeline achieved 83.1% accuracy, 85.1% precision, and 96.4% recall on real-world X-rays, demonstrating strong generalization across 13 fracture types. Visual inspection confirmed accurate localization. Conclusion: Our Co-DETR-based pipeline demonstrated high accuracy and clinical relevance in wrist and hand fracture detection, offering reliable localization and differentiation of fracture types. It is scalable, efficient, and suitable for real-time deployment in hospital workflows, improving diagnostic speed and reliability in musculoskeletal radiology.
zh

[CV-90] 3D Cardiac Anatomy Generation Using Mesh Latent Diffusion Models

【速读】:该论文旨在解决3D医学影像中心脏解剖结构生成的难题,特别是在心肌梗死患者中生成多样化且逼真的左心室三维网格(3D mesh)数据,以支持体外试验(in silico trials)、电机械仿真及机器学习模型的数据增强等应用。其解决方案的关键在于提出了一种新颖的潜空间扩散模型架构——MeshLDM,该模型能够有效捕捉心脏在舒张末期(end-diastolic)和收缩末期(end-systolic)两个生理阶段的形态特征,并在定量和定性评估中表现出优异性能,生成的网格与金标准相比平均差异仅为2.4%。

链接: https://arxiv.org/abs/2508.14122
作者: Jolanta Mozyrska,Marcel Beetz,Luke Melas-Kyriazi,Vicente Grau,Abhirup Banerjee,Alfonso Bueno-Orovio
机构: University of Oxford (牛津大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
备注:

点击查看摘要

Abstract:Diffusion models have recently gained immense interest for their generative capabilities, specifically the high quality and diversity of the synthesized data. However, examples of their applications in 3D medical imaging are still scarce, especially in cardiology. Generating diverse realistic cardiac anatomies is crucial for applications such as in silico trials, electromechanical computer simulations, or data augmentations for machine learning models. In this work, we investigate the application of Latent Diffusion Models (LDMs) for generating 3D meshes of human cardiac anatomies. To this end, we propose a novel LDM architecture – MeshLDM. We apply the proposed model on a dataset of 3D meshes of left ventricular cardiac anatomies from patients with acute myocardial infarction and evaluate its performance in terms of both qualitative and quantitative clinical and 3D mesh reconstruction metrics. The proposed MeshLDM successfully captures characteristics of the cardiac shapes at end-diastolic (relaxation) and end-systolic (contraction) cardiac phases, generating meshes with a 2.4% difference in population mean compared to the gold standard.
zh

[CV-91] Hallucinations in medical devices

【速读】:该论文旨在解决医疗设备中因深度学习和数据驱动方法导致的错误输出问题,特别是这些错误常被描述为“幻觉”(hallucination),但缺乏统一且实用的定义。其解决方案的关键在于提出一个普适性的定义:将幻觉界定为一种看似合理、可能对任务产生显著影响或无害的错误类型,从而为跨产品领域的医疗设备幻觉评估提供标准化框架。这一定义有助于推动针对幻觉现象的系统性评价与缓解策略研究。

链接: https://arxiv.org/abs/2508.14118
作者: Jason Granstedt,Prabhat Kc,Rucha Deshpande,Victor Garcia,Aldo Badano
机构: U. S. Food and Drug Administration (美国食品药品监督管理局)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 2 figures

点击查看摘要

Abstract:Computer methods in medical devices are frequently imperfect and are known to produce errors in clinical or diagnostic tasks. However, when deep learning and data-based approaches yield output that exhibit errors, the devices are frequently said to hallucinate. Drawing from theoretical developments and empirical studies in multiple medical device areas, we introduce a practical and universal definition that denotes hallucinations as a type of error that is plausible and can be either impactful or benign to the task at hand. The definition aims at facilitating the evaluation of medical devices that suffer from hallucinations across product areas. Using examples from imaging and non-imaging applications, we explore how the proposed definition relates to evaluation methodologies and discuss existing approaches for minimizing the prevalence of hallucinations.
zh

[CV-92] High-Throughput Low-Cost Segmentation of Brightfield Microscopy Live Cell Images

【速读】:该论文旨在解决亮场显微镜下未染色活细胞图像的分割难题,其核心挑战包括低对比度、噪声干扰、细胞运动引起的模糊以及时间动态表型变化等,这些问题在高通量成像场景中尤为突出。解决方案的关键在于构建一个基于卷积神经网络(CNN)的低资源消耗流水线,该流水线融合了统一U-Net架构中的冻结编码器对比分析、注意力机制、实例感知系统、自适应损失函数、硬实例重训练、动态学习率策略、渐进式训练以缓解过拟合,并引入集成技术提升性能。该方法在公开数据集上实现了93%的测试准确率和平均89%的F1分数(标准差0.07),且仅用10%的相位对比图像进行训练即可有效泛化至相位对比的LIVECell数据集,展现出跨模态鲁棒性和实际实验室部署潜力。

链接: https://arxiv.org/abs/2508.14106
作者: Surajit Das,Gourav Roy,Pavel Zun
机构: ITMO University (俄罗斯圣彼得堡国立研究技术大学); Jadavpur University (加达夫浦大学)
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Live cell culture is crucial in biomedical studies for analyzing cell properties and dynamics in vitro. This study focuses on segmenting unstained live cells imaged with bright-field microscopy. While many segmentation approaches exist for microscopic images, none consistently address the challenges of bright-field live-cell imaging with high throughput, where temporal phenotype changes, low contrast, noise, and motion-induced blur from cellular movement remain major obstacles. We developed a low-cost CNN-based pipeline incorporating comparative analysis of frozen encoders within a unified U-Net architecture enhanced with attention mechanisms, instance-aware systems, adaptive loss functions, hard instance retraining, dynamic learning rates, progressive mechanisms to mitigate overfitting, and an ensemble technique. The model was validated on a public dataset featuring diverse live cell variants, showing consistent competitiveness with state-of-the-art methods, achieving 93% test accuracy and an average F1-score of 89% (std. 0.07) on low-contrast, noisy, and blurry images. Notably, the model was trained primarily on bright-field images with limited exposure to phase-contrast microscopy (10%), yet it generalized effectively to the phase-contrast LIVECell dataset, demonstrating modality robustness and strong performance. This highlights its potential for real-world laboratory deployment across imaging conditions. The model requires minimal compute power and is adaptable using basic deep learning setups such as Google Colab, making it practical for training on other cell variants. Our pipeline outperforms existing methods in robustness and precision for bright-field microscopy segmentation. The code and dataset are available for reproducibility.
zh

[CV-93] Activity Coefficient-based Channel Selection for Electroencephalogram: A Task-Independent Approach

【速读】:该论文旨在解决高密度脑电图(EEG)信号在脑机接口(BCI)应用中因通道数量增加而导致的交叉干扰和计算开销问题,同时克服现有通道选择方法任务依赖性强、需为每个新应用场景重新优化的局限性。其解决方案的关键在于提出一种无任务特异性的通道选择方法——基于活动系数的通道选择(ACCS),该方法引入了一个新的指标“通道活动系数”(Channel Activity Coefficient, CAC),用于量化各通道的信息量,并通过选取CAC排名前16的通道实现高达34.97%的多类分类准确率提升;该方法识别出的通道集合与下游任务或模型无关,具有高度可迁移性和通用性,适用于多种EEG相关应用。

链接: https://arxiv.org/abs/2508.14060
作者: Kartik Pandey,Arun Balasubramanian,Debasis Samanta
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Electroencephalogram (EEG) signals have gained widespread adoption in brain-computer interface (BCI) applications due to their non-invasive, low-cost, and relatively simple acquisition process. The demand for higher spatial resolution, particularly in clinical settings, has led to the development of high-density electrode arrays. However, increasing the number of channels introduces challenges such as cross-channel interference and computational overhead. To address these issues, modern BCI systems often employ channel selection algorithms. Existing methods, however, are typically task-specific and require re-optimization for each new application. This work proposes a task-agnostic channel selection method, Activity Coefficient-based Channel Selection (ACCS), which uses a novel metric called the Channel Activity Coefficient (CAC) to quantify channel utility based on activity levels. By selecting the top 16 channels ranked by CAC, ACCS achieves up to 34.97% improvement in multi-class classification accuracy. Unlike traditional approaches, ACCS identifies a reusable set of informative channels independent of the downstream task or model, making it highly adaptable for diverse EEG-based applications.
zh
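
摘要未给出 CAC 的具体公式;下面以"各试次通道方差的平均"作为活动水平的假设性替代指标,演示"按系数排序、选取前 16 个通道"这一任务无关的选择流程:

```python
import numpy as np

def select_channels(eeg: np.ndarray, k: int = 16) -> np.ndarray:
    """eeg: [试次数, 通道数, 采样点数]。
    此处以通道方差的平均作为活动系数的假设性替代,并非论文的 CAC 原始公式。"""
    activity = eeg.var(axis=2).mean(axis=0)     # 每个通道一个系数
    return np.argsort(activity)[::-1][:k]       # 取系数最高的前 k 个通道

eeg = np.random.randn(100, 64, 256)             # 100 试次、64 通道、256 采样点
print(select_channels(eeg))                     # 所选通道集合与下游任务/模型无关,可复用
```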

人工智能

[AI-0] Graph Structure Learning with Temporal Graph Information Bottleneck for Inductive Representation Learning ECAI

【速读】:该论文旨在解决动态图中节点和边随时间演变时的归纳式表示学习问题,核心挑战在于如何有效表征未见过的节点以及缓解噪声或冗余的图信息。解决方案的关键在于提出一种名为GTGIB的通用框架,其创新性地融合了图结构学习(Graph Structure Learning, GSL)与时间图信息瓶颈(Temporal Graph Information Bottleneck, TGIB):首先设计了一个基于GSL的两步结构增强器,用于优化节点邻域以增强图结构信息;其次通过变分近似推导出可计算的时间图信息瓶颈目标函数,对边和特征进行正则化,从而实现稳定高效的优化。实验表明,基于GTGIB的模型在四个真实世界数据集上均优于现有方法,在归纳设置下表现显著提升,并在归纳和转导设置中均展现出一致性优势。

链接: https://arxiv.org/abs/2508.14859
作者: Jiafeng Xiong,Rizos Sakellariou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in the 28th European Conference on Artificial Intelligence (ECAI), 2025

点击查看摘要

Abstract:Temporal graph learning is crucial for dynamic networks where nodes and edges evolve over time and new nodes continuously join the system. Inductive representation learning in such settings faces two major challenges: effectively representing unseen nodes and mitigating noisy or redundant graph information. We propose GTGIB, a versatile framework that integrates Graph Structure Learning (GSL) with Temporal Graph Information Bottleneck (TGIB). We design a novel two-step GSL-based structural enhancer to enrich and optimize node neighborhoods and demonstrate its effectiveness and efficiency through theoretical proofs and experiments. The TGIB refines the optimized graph by extending the information bottleneck principle to temporal graphs, regularizing both edges and features based on our derived tractable TGIB objective function via variational approximation, enabling stable and efficient optimization. GTGIB-based models are evaluated to predict links on four real-world datasets; they outperform existing methods in all datasets under the inductive setting, with significant and consistent improvement in the transductive setting.
zh

[AI-1] TIME[t] ⊆ SPACE[O(√t)] via Tree Height Compression

【速读】:该论文旨在解决确定性多带图灵机的时间复杂度与空间复杂度之间的关系问题,具体证明了 \TIME[t] \subseteq \SPACE[O(\sqrt{t})] 的空间模拟结果。其核心贡献在于提出了一种“高度压缩定理”(Height Compression Theorem),通过统一且对数空间可实现的方式,将典型的左深紧凑计算树(left-deep succinct computation tree)重构为一种二叉树结构,使得任意深度优先遍历路径上的栈深度保持在 O(\log T)(其中 T = \lceil t/b \rceil),同时保证叶节点处的工作量为 O(b),内部节点为 O(1),边的合法性可在对数空间内验证;合并点间的语义一致性由一个精确的 O(b) 窗口重放机制保障。该方案的关键创新包括:中点递归(midpoint recursion)、路径相关势函数控制活跃接口数量、以及用平衡二叉合并器替代多路合并以限制入度,并结合代数重放引擎(Algebraic Replay Engine)和无指针深度优先搜索(pointerless DFS)等算法设计,最终获得空间消耗 S(b) = O(b + \log(t/b)) 的加法型权衡,当取最优块大小 b = \Theta(\sqrt{t}) 时达到 O(\sqrt{t}) 空间复杂度。

链接: https://arxiv.org/abs/2508.14831
作者: Logan Nye
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: 32 pages

点击查看摘要

Abstract:We prove a square-root space simulation for deterministic multitape Turing machines, showing \TIME[t] \subseteq \SPACE[O(\sqrt{t})]. The key step is a Height Compression Theorem that uniformly (and in logspace) reshapes the canonical left-deep succinct computation tree for a block-respecting run into a binary tree whose evaluation-stack depth along any DFS path is O(\log T) for T = \lceil t/b \rceil, while preserving O(b) work at leaves, O(1) at internal nodes, and edges that are logspace-checkable; semantic correctness across merges is witnessed by an exact O(b) window replay at the unique interface. The proof uses midpoint (balanced) recursion, a per-path potential that bounds simultaneously active interfaces by O(\log T), and an indegree-capping replacement of multiway merges by balanced binary combiners. Algorithmically, an Algebraic Replay Engine with constant-degree maps over a constant-size field, together with pointerless DFS and index-free streaming, ensures constant-size per-level tokens and eliminates wide counters, yielding the additive tradeoff S(b) = O(b + \log(t/b)) for block sizes b \ge b_0 with b_0 = \Theta(\log t), which at the canonical choice b = \Theta(\sqrt{t}) gives O(\sqrt{t}) space; the b_0 threshold rules out degenerate blocks where addressing scratch would dominate the window footprint. The construction is uniform, relativizes, and is robust to standard model choices. Consequences include branching-program upper bounds 2^{O(\sqrt{s})} for size-s bounded-fan-in circuits, tightened quadratic-time lower bounds for \SPACE[n]-complete problems via the standard hierarchy argument, and O(\sqrt{t})-space certifying interpreters; under explicit locality assumptions, the framework extends to geometric d-dimensional models.
zh

[AI-2] From Passive Tool to Socio-cognitive Teammate: A Conceptual Framework for Agentic AI in Human-AI Collaborative Learning

【速读】:该论文旨在解决当前教育领域中缺乏对生成式AI(Generative AI)作为主动学习参与者角色的系统性理解与设计框架的问题,尤其在从传统工具型AI向协同伙伴型AI转变的过程中。其解决方案的关键在于提出一个四层次的新型概念框架(APCP框架),明确划分AI在人机协同学习中的代理程度:(1)自适应工具,(2)主动助手,(3)共学伙伴,(4)同侪合作者;该框架基于社会文化学习理论和计算机支持协作学习(CSCL)理论,为分析人与AI之间角色与责任的动态变化提供了结构化语言,并强调通过功能设计实现高效协作,即便AI不具备真正意识或共享意图,亦可成为具有教育价值的功能性合作者。

链接: https://arxiv.org/abs/2508.14825
作者: Lixiang Yan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The role of Artificial Intelligence (AI) in education is undergoing a rapid transformation, moving beyond its historical function as an instructional tool towards a new potential as an active participant in the learning process. This shift is driven by the emergence of agentic AI, autonomous systems capable of proactive, goal-directed action. However, the field lacks a robust conceptual framework to understand, design, and evaluate this new paradigm of human-AI interaction in learning. This paper addresses this gap by proposing a novel conceptual framework (the APCP framework) that charts the transition from AI as a tool to AI as a collaborative partner. We present a four-level model of escalating AI agency within human-AI collaborative learning: (1) the AI as an Adaptive Instrument, (2) the AI as a Proactive Assistant, (3) the AI as a Co-Learner, and (4) the AI as a Peer Collaborator. Grounded in sociocultural theories of learning and Computer-Supported Collaborative Learning (CSCL), this framework provides a structured vocabulary for analysing the shifting roles and responsibilities between human and AI agents. The paper further engages in a critical discussion of the philosophical underpinnings of collaboration, examining whether an AI, lacking genuine consciousness or shared intentionality, can be considered a true collaborator. We conclude that while AI may not achieve authentic phenomenological partnership, it can be designed as a highly effective functional collaborator. This distinction has significant implications for pedagogy, instructional design, and the future research agenda for AI in education, urging a shift in focus towards creating learning environments that harness the complementary strengths of both human and AI.
zh

[AI-3] PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning

【速读】:该论文旨在解决治疗性肽设计中面临的三大挑战:序列空间庞大、实验数据有限以及现有生成模型可解释性差的问题。其解决方案的关键在于提出PepThink-R1框架,该框架将大语言模型(Large Language Models, LLMs)与链式思维(Chain-of-Thought, CoT)监督微调及强化学习(Reinforcement Learning, RL)相结合,通过在序列生成过程中显式推理单体水平的修改,实现对多种药理学性质的优化,同时保持设计决策的可解释性。该方法利用定制化的奖励函数平衡化学有效性与属性提升,使模型能够自主探索多样化的序列变体,在脂溶性、稳定性及暴露量等关键指标上显著优于通用LLM(如GPT-5)和领域特定基线模型。

链接: https://arxiv.org/abs/2508.14765
作者: Ruheng Wang,Hang Zhang,Trieu Nguyen,Shasha Feng,Hao-Wei Pang,Xiang Yu,Li Xiao,Peter Zhiping Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
zh
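
摘要提到奖励函数需平衡化学有效性与多属性提升。下面给出一种假设性的奖励形式示意,属性名与权重均为举例,并非论文原始定义:

```python
def peptide_reward(is_valid: bool, prop_gains: dict, weights: dict) -> float:
    """无效序列直接惩罚;有效序列按各属性提升量加权求和。
    属性名与权重均为示例假设,并非论文原始定义。"""
    if not is_valid:
        return -1.0
    return sum(weights[k] * prop_gains.get(k, 0.0) for k in weights)

gains = {"lipophilicity": 0.4, "stability": 0.2, "exposure": -0.1}
w = {"lipophilicity": 1.0, "stability": 0.5, "exposure": 0.5}
print(peptide_reward(True, gains, w))   # 0.45
```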

[AI-4] Cross-Modality Controlled Molecule Generation with Diffusion Language Model

【速读】:该论文旨在解决当前基于SMILES的扩散模型在分子生成中仅支持单模态约束的问题,即模型在训练时注入条件信号后无法灵活适应新的约束条件,且每次变更约束均需从头重新训练。为应对这一挑战,作者提出了一种跨模态可控分子生成方法(Cross-Modality Controlled Molecule Generation with Diffusion Language Model, CMCM-DLM),其关键在于构建两个可训练模块——结构控制模块(Structure Control Module, SCM)和性质控制模块(Property Control Module, PCM),并采用两阶段生成机制:第一阶段利用SCM在扩散早期注入结构约束以锚定分子骨架,第二阶段通过PCM在后期推理中引导分子优化其化学性质以匹配目标值,从而实现对多模态约束(如分子结构与化学性质)的联合控制,并支持无需重新训练即可动态添加新约束的能力。

链接: https://arxiv.org/abs/2508.14748
作者: Yunzhe Zhang,Yifei Wang,Khanh Vinh Nguyen,Pengyu Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current SMILES-based diffusion models for molecule generation typically support only unimodal constraint. They inject conditioning signals at the start of the training process and require retraining a new model from scratch whenever the constraint changes. However, real-world applications often involve multiple constraints across different modalities, and additional constraints may emerge over the course of a study. This raises a challenge: how to extend a pre-trained diffusion model not only to support cross-modality constraints but also to incorporate new ones without retraining. To tackle this problem, we propose the Cross-Modality Controlled Molecule Generation with Diffusion Language Model (CMCM-DLM), demonstrated by two distinct cross modalities: molecular structure and chemical properties. Our approach builds upon a pre-trained diffusion model, incorporating two trainable modules, the Structure Control Module (SCM) and the Property Control Module (PCM), and operates in two distinct phases during the generation process. In Phase I, we employs the SCM to inject structural constraints during the early diffusion steps, effectively anchoring the molecular backbone. Phase II builds on this by further introducing PCM to guide the later stages of inference to refine the generated molecules, ensuring their chemical properties match the specified targets. Experimental results on multiple datasets demonstrate the efficiency and adaptability of our approach, highlighting CMCM-DLM’s significant advancement in molecular generation for drug discovery applications.
zh

[AI-5] AFABench: A Generic Framework for Benchmarking Active Feature Acquisition

【速读】:该论文旨在解决主动特征获取(Active Feature Acquisition, AFA)领域缺乏标准化评估基准的问题。在许多实际场景中,获取数据实例的所有特征可能因成本、延迟或隐私限制而不可行,AFA通过动态选择每条数据的子集特征来权衡预测性能与获取代价。然而,现有方法(如贪心信息论策略和非贪婪强化学习方法)缺乏统一的公平比较平台,阻碍了该领域的系统性发展。本文的关键解决方案是提出了AFABench——首个面向AFA的基准框架,其包含多样化的合成与真实数据集、支持多种获取策略,并具备模块化设计以方便新方法集成。此外,作者还引入了一个名为AFAContext的新合成数据集,用于检验AFA策略的前瞻能力,从而揭示不同方法在权衡效率与准确性方面的关键差异,为未来研究提供可操作的洞见。

链接: https://arxiv.org/abs/2508.14734
作者: Valter Schütz,Han Wu,Reza Rezvan,Linus Aronsson,Morteza Haghir Chehreghani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from greedy information-theoretic strategies to non-myopic reinforcement learning approaches, fair and systematic evaluation of these methods has been hindered by the lack of standardized benchmarks. In this paper, we introduce AFABench, the first benchmark framework for AFA. Our benchmark includes a diverse set of synthetic and real-world datasets, supports a wide range of acquisition policies, and provides a modular design that enables easy integration of new methods and tasks. We implement and evaluate representative algorithms from all major categories, including static, greedy, and reinforcement learning-based approaches. To test the lookahead capabilities of AFA policies, we introduce a novel synthetic dataset, AFAContext, designed to expose the limitations of greedy selection. Our results highlight key trade-offs between different AFA strategies and provide actionable insights for future research. The benchmark code is available at: this https URL.
zh
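
贪心式 AFA 策略的通用骨架是:每一步在未获取的特征中选取"预期收益减获取成本"最大者,净收益不为正时停止。以下示意与具体的收益估计方式(如互信息)解耦,utility 函数由调用方给出:

```python
from typing import Callable, Dict, Set

def greedy_afa(all_features: Set[str],
               utility: Callable[[Set[str], str], float],
               cost: Dict[str, float]) -> Set[str]:
    """utility(acquired, f): 在已获取特征基础上再获取 f 的预期收益估计
    (如互信息或验证集精度增量,由调用方提供)。"""
    acquired: Set[str] = set()
    while True:
        candidates = all_features - acquired
        if not candidates:
            break
        best = max(candidates, key=lambda f: utility(acquired, f) - cost[f])
        if utility(acquired, best) - cost[best] <= 0:
            break                                # 净收益不为正则停止
        acquired.add(best)
    return acquired

# 玩具示例:收益随已获取特征数递减
u = lambda acq, f: {"x1": 1.0, "x2": 0.6, "x3": 0.1}[f] / (1 + len(acq))
print(greedy_afa({"x1", "x2", "x3"}, u, {"x1": 0.2, "x2": 0.2, "x3": 0.2}))
```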

[AI-6] Emerson-Lei and Manna-Pnueli Games for LTLf and PPLTL Synthesis

【速读】:该论文旨在解决在无限轨迹(infinite-trace)环境下实现反应式合成(reactive synthesis)的难题,特别是如何将有限轨迹时序逻辑(LTLf/PPLTL)的技术有效迁移至无限轨迹场景中,同时保持全时序逻辑(LTL)的表达能力。其解决方案的关键在于引入两类基于图博弈(games on graphs)的新方法:一是基于Emerson-Lei博弈的符号化求解器,通过将低阶性质(如保证、安全)归约到高阶性质(如重现、持久)来处理;二是提出Manna-Pnueli博弈,该方法直接将Manna-Pnueli目标嵌入博弈结构中,并通过组合一系列更简单的Emerson-Lei博弈的解构成一个有向无环图(DAG),从而实现理论上更高效的求解策略。实证结果表明,Manna-Pnueli博弈在多数情况下表现优越,但并非普适,提示未来可结合两者以进一步提升实际性能。

链接: https://arxiv.org/abs/2508.14725
作者: Daniel Hausmann,Shufang Zhu,Gianmarco Parretti,Christoph Weinhuber,Giuseppe De Giacomo,Nir Piterman
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:Recently, the Manna-Pnueli Hierarchy has been used to define the temporal logics LTLfp and PPLTLp, which allow to use finite-trace LTLf/PPLTL techniques in infinite-trace settings while achieving the expressiveness of full LTL. In this paper, we present the first actual solvers for reactive synthesis in these logics. These are based on games on graphs that leverage DFA-based techniques from LTLf/PPLTL to construct the game arena. We start with a symbolic solver based on Emerson-Lei games, which reduces lower-class properties (guarantee, safety) to higher ones (recurrence, persistence) before solving the game. We then introduce Manna-Pnueli games, which natively embed Manna-Pnueli objectives into the arena. These games are solved by composing solutions to a DAG of simpler Emerson-Lei games, resulting in a provably more efficient approach. We implemented the solvers and practically evaluated their performance on a range of representative formulas. The results show that Manna-Pnueli games often offer significant advantages, though not universally, indicating that combining both approaches could further enhance practical performance.
zh

[AI-7] Data-Driven Probabilistic Evaluation of Logic Properties with PAC-Confidence on Mealy Machines

【速读】:该论文旨在解决网络物理系统(Cyber-Physical Systems, CPS)在缺乏精确模型时,如何基于数据驱动方法准确评估其有限时间步长内的安全概率的问题。解决方案的关键在于提出一种基于Probably Approximately Correct (PAC) 学习范式的数据驱动方法,将离散逻辑与系统的概率可达性分析相连接,并通过主动学习机制(active learning paradigm)逐步引导采样新数据,从而在保证理论置信度的前提下高效估计系统安全性。

链接: https://arxiv.org/abs/2508.14710
作者: Swantje Plambeck,Ali Salamati,Eyke Huellermeier,Goerschwin Fey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cyber-Physical Systems (CPS) are complex systems that require powerful models for tasks like verification, diagnosis, or debugging. Often, suitable models are not available and manual extraction is difficult. Data-driven approaches then provide a solution to, e.g., diagnosis tasks and verification problems based on data collected from the system. In this paper, we consider CPS with a discrete abstraction in the form of a Mealy machine. We propose a data-driven approach to determine the safety probability of the system on a finite horizon of n time steps. The approach is based on the Probably Approximately Correct (PAC) learning paradigm. Thus, we elaborate a connection between discrete logic and probabilistic reachability analysis of systems, especially providing an additional confidence on the determined probability. The learning process follows an active learning paradigm, where new learning data is sampled in a guided way after an initial learning set is collected. We validate the approach with a case study on an automated lane-keeping system.
zh
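
PAC 式置信度可由 Hoeffding 不等式推出:要使经验安全频率与真实安全概率的偏差以 1-δ 的置信度不超过 ε,约需 N ≥ ln(2/δ)/(2ε²) 条独立轨迹。以下示意采样数计算与蒙特卡洛估计,其中的系统模型只是玩具假设:

```python
import math
import random

def pac_sample_size(eps: float, delta: float) -> int:
    """Hoeffding 界:N >= ln(2/delta) / (2 * eps^2)。"""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_safety(run_trace, horizon: int, n: int) -> float:
    """run_trace(horizon) 返回 True 表示该次 n 步运行全程安全(玩具接口)。"""
    return sum(run_trace(horizon) for _ in range(n)) / n

N = pac_sample_size(eps=0.02, delta=0.01)                        # 约 6624 条轨迹
toy = lambda h: all(random.random() > 0.01 for _ in range(h))    # 每步 1% 失败率的玩具系统
print(N, estimate_safety(toy, horizon=10, n=N))
```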

[AI-8] Learning in Repeated Multi-Objective Stackelberg Games with Payoff Manipulation ECAI2025

【速读】:该论文致力于解决重复多目标Stackelberg博弈中领导者对收益操纵(payoff manipulation)的问题,即在不知道跟随者(follower)多目标效用函数权重参数的情况下,如何通过策略性地提供自身收益份额来影响跟随者的确定性最优响应,从而实现长期利益最大化。其解决方案的关键在于提出基于期望效用(Expected Utility, EU)和长期期望效用(long-term Expected Utility, longEU)的操纵策略,这些策略指导领导者在每次交互中权衡短期收益与长期偏好信息获取,以优化累积效用;理论证明表明,在无限重复交互下,longEU策略收敛至最优操纵方案,且实验结果验证了该方法能够在无需显式协商或先验知识的前提下提升领导者累计效用并促进双方共赢。

链接: https://arxiv.org/abs/2508.14705
作者: Phurinut Srisawad,Juergen Branke,Long Tran-Thanh
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: Extended version of the paper accepted at the 28th European Conference on Artificial Intelligence (ECAI 2025); Paper ID: M2635

点击查看摘要

Abstract:We study payoff manipulation in repeated multi-objective Stackelberg games, where a leader may strategically influence a follower’s deterministic best response, e.g., by offering a share of their own payoff. We assume that the follower’s utility function, representing preferences over multiple objectives, is unknown but linear, and its weight parameter must be inferred through interaction. This introduces a sequential decision-making challenge for the leader, who must balance preference elicitation with immediate utility maximisation. We formalise this problem and propose manipulation policies based on expected utility (EU) and long-term expected utility (longEU), which guide the leader in selecting actions and offering incentives that trade off short-term gains with long-term impact. We prove that under infinite repeated interactions, longEU converges to the optimal manipulation. Empirical results across benchmark environments demonstrate that our approach improves cumulative leader utility while promoting mutually beneficial outcomes, all without requiring explicit negotiation or prior knowledge of the follower’s utility function.
zh

[AI-9] Foe for Fraud: Transferable Adversarial Attacks in Credit Card Fraud Detection

【速读】:该论文旨在解决信用卡欺诈检测(Credit Card Fraud Detection, CCFD)中机器学习(Machine Learning, ML)模型在面对对抗扰动时的鲁棒性不足问题,尤其是在表格型交易数据场景下,此类攻击尚未被充分研究。解决方案的关键在于构建一个全面的框架,系统评估CCFD模型在黑盒与白盒对抗攻击设置下对基于梯度的攻击方法的敏感性,并通过迁移攻击实验验证这些扰动对非基于梯度的模型依然有效,从而揭示表格数据同样存在隐蔽的脆弱性,强调了金融领域需加强ML模型的安全防护机制以保障高价值交易的稳定性与可信度。

链接: https://arxiv.org/abs/2508.14699
作者: Jan Lum Fok,Qingwen Zeng,Shiping Chen,Oscar Fawkes,Huaming Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Credit card fraud detection (CCFD) is a critical application of Machine Learning (ML) in the financial sector, where accurately identifying fraudulent transactions is essential for mitigating financial losses. ML models have demonstrated their effectiveness in the fraud detection task, in particular on tabular datasets. While adversarial attacks have been extensively studied in computer vision and deep learning, their impact on ML models, particularly those trained on CCFD tabular datasets, remains largely unexplored. These latent vulnerabilities pose significant threats to the security and stability of the financial industry, especially in high-value transactions where losses could be substantial. To address this gap, in this paper, we present a holistic framework that investigates the robustness of CCFD ML models against adversarial perturbations under different circumstances. Specifically, gradient-based attack methods are applied to tabular credit card transaction data in both black- and white-box adversarial attack settings. Our findings confirm that tabular data is also susceptible to subtle perturbations, highlighting the need for heightened awareness among financial technology practitioners regarding ML model security and trustworthiness. Furthermore, experiments transferring adversarial samples from gradient-based attack methods to non-gradient-based models also verify our findings. Our results demonstrate that such attacks remain effective, emphasizing the necessity of developing robust defenses for CCFD algorithms.
zh
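
文中梯度类攻击的典型代表是 FGSM:沿损失对输入特征的梯度符号方向施加幅度为 ε 的扰动。以下是在表格数据分类器上的最小 PyTorch 示意,模型与数据均为随机玩具:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10, requires_grad=True)   # 4 条交易记录、10 个特征(随机玩具数据)
y = torch.tensor([0, 1, 0, 1])               # 0 = 正常, 1 = 欺诈

loss = loss_fn(model(x), y)
loss.backward()

eps = 0.05
x_adv = (x + eps * x.grad.sign()).detach()   # FGSM:沿梯度符号方向扰动
flipped = (model(x).argmax(1) != model(x_adv).argmax(1)).sum().item()
print(flipped, "条预测被扰动翻转")
```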

[AI-10] ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal

【速读】:该论文旨在解决预训练基础模型在工业信号建模(如声学、振动等传感器数据)中应用不足的问题,特别是现有基于子带编码器的方法受限于固定输入长度且缺乏显式的频率位置编码,导致在不同采样配置下难以实现精确的频谱定位。其解决方案的关键在于提出一种新颖的基础模型,融合先进的带分割架构与相对频率位置嵌入机制,从而在无需填充或分段的情况下支持任意长度输入,并生成同时保留时序和频谱保真度的紧凑嵌入表示,显著提升了模型在异常检测与故障识别任务中的性能与泛化能力。

链接: https://arxiv.org/abs/2508.14689
作者: Yucong Zhang,Juan Liu,Ming Li
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pre-trained foundation models have demonstrated remarkable success in vision and language, yet their potential for general machine signal modeling (covering acoustic, vibration, and other industrial sensor data) remains under-explored. Existing approaches using sub-band-based encoders have achieved competitive results but are limited by fixed input lengths and the absence of explicit frequency positional encoding. In this work, we propose a novel foundation model that integrates an advanced band-split architecture with relative frequency positional embeddings, enabling precise spectral localization across arbitrary sampling configurations. The model supports inputs of arbitrary length without padding or segmentation, producing a concise embedding that retains both temporal and spectral fidelity. We evaluate our method on SIREN (this https URL), a newly introduced large-scale benchmark for machine signal encoding that unifies multiple datasets, including all DCASE task 2 challenges (2020-2025) and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in anomaly detection and fault identification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on this https URL.
zh
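
"频带分割 + 相对频率位置嵌入"的直觉是:把频谱按频带切块、逐带编码,并为每个频带注入其归一化中心频率,使模型在不同采样配置下仍能定位频谱位置。以下为假设性示意,模块结构并非论文原实现:

```python
import torch
import torch.nn as nn

class BandSplitEncoder(nn.Module):
    """把 [B, F, T] 频谱按频带切块、逐带线性编码,并加入相对频率位置嵌入(示意)。"""
    def __init__(self, band_size: int = 16, dim: int = 32):
        super().__init__()
        self.band_size = band_size
        self.proj = nn.Linear(band_size, dim)
        self.freq_mlp = nn.Sequential(nn.Linear(1, dim), nn.Tanh())  # 连续频率 -> 嵌入

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        b, f, t = spec.shape
        n = f // self.band_size
        x = spec[:, : n * self.band_size].reshape(b, n, self.band_size, t)
        x = self.proj(x.permute(0, 1, 3, 2))                     # [B, 带, T, dim]
        centers = (torch.arange(n) + 0.5) * self.band_size / f   # 各带中心的相对频率位置
        pos = self.freq_mlp(centers.view(-1, 1))                 # [带, dim]
        return x + pos.view(1, n, 1, -1)

enc = BandSplitEncoder()
print(enc(torch.randn(2, 128, 50)).shape)   # torch.Size([2, 8, 50, 32])
```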

[AI-11] ELATE: Evolutionary Language model for Automated Time-series Engineering

【速读】:该论文旨在解决时间序列预测中特征工程(feature engineering)依赖人工、耗时且缺乏领域知识引导的问题。现有自动化方法多采用穷举搜索,计算成本高且难以融合领域洞察。其解决方案的关键在于提出ELATE(Evolutionary Language model for Automated Time-series Engineering),该方法将语言模型嵌入进化框架,利用时间序列统计量和特征重要性指标对候选特征进行指导与剪枝,同时由语言模型生成上下文相关的新型特征变换,从而实现高效、智能化的自动特征工程。实验表明,ELATE在多个领域平均提升预测准确率8.4%。

链接: https://arxiv.org/abs/2508.14667
作者: Andrew Murray,Danial Dervovic,Michael Cashmore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 4 figures. Comments welcome

点击查看摘要

Abstract:Time-series prediction involves forecasting future values using machine learning models. Feature engineering, whereby existing features are transformed to make new ones, is critical for enhancing model performance, but is often manual and time-intensive. Existing automation attempts rely on exhaustive enumeration, which can be computationally costly and lacks domain-specific insights. We introduce ELATE (Evolutionary Language model for Automated Time-series Engineering), which leverages a language model within an evolutionary framework to automate feature engineering for time-series data. ELATE employs time-series statistical measures and feature importance metrics to guide and prune features, while the language model proposes new, contextually relevant feature transformations. Our experiments demonstrate that ELATE improves forecasting accuracy by an average of 8.4% across various domains.
zh

[AI-12] Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration

【速读】:该论文旨在解决极端城市降雨事件下应急调度系统面临的三大核心挑战:一是多目标(如交通流、任务完成度与风险缓解)之间的动态权衡缺乏情境感知策略;二是环境条件快速变化导致静态规则失效;三是大语言模型(Large Language Model, LLM)生成策略存在语义不稳定性与执行一致性差的问题。现有方法难以在统一框架内实现感知、全局优化与多智能体协调的协同。其解决方案的关键在于提出H-J框架——一个分层多智能体系统,融合知识引导提示(knowledge-guided prompting)、熵约束生成(entropy-constrained generation)和反馈驱动优化(feedback-driven optimization),构建从多源感知到战略执行及持续迭代的闭环流程,从而提升城市洪涝响应中的韧性与鲁棒性。

链接: https://arxiv.org/abs/2508.14654
作者: Peilin Ji,Xiao Xue,Simeng Wang,Wenhao Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages including appendix, 6 figures

点击查看摘要

Abstract:In recent years, the increasing frequency of extreme urban rainfall events has posed significant challenges to emergency scheduling systems. Urban flooding often leads to severe traffic congestion and service disruptions, threatening public safety and mobility. However, effective decision making remains hindered by three key challenges: (1) managing trade-offs among competing goals (e.g., traffic flow, task completion, and risk mitigation) requires dynamic, context-aware strategies; (2) rapidly evolving environmental conditions render static rules inadequate; and (3) LLM-generated strategies frequently suffer from semantic instability and execution inconsistency. Existing methods fail to align perception, global optimization, and multi-agent coordination within a unified framework. To tackle these challenges, we introduce H-J, a hierarchical multi-agent framework that integrates knowledge-guided prompting, entropy-constrained generation, and feedback-driven optimization. The framework establishes a closed-loop pipeline spanning from multi-source perception to strategic execution and continuous refinement. We evaluate H-J on real-world urban topology and rainfall data under three representative conditions: extreme rainfall, intermittent bursts, and daily light rain. Experiments show that H-J outperforms rule-based and reinforcement-learning baselines in traffic smoothness, task success rate, and system robustness. These findings highlight the promise of uncertainty-aware, knowledge-constrained LLM-based approaches for enhancing resilience in urban flood response.
zh

[AI-13] OneLoc: Geo-Aware Generative Recommender Systems for Local Life Service

【速读】:该论文旨在解决本地生活服务场景中视频推荐的复杂性问题,即如何同时兼顾用户兴趣与实时地理位置信息,以实现精准推荐。传统推荐方法难以有效融合地理信息并平衡多目标(如用户兴趣、用户与商家距离及业务指标等),因此提出了一种端到端生成式推荐模型OneLoc。其关键解决方案包括:(1)通过地理感知语义ID对视频和地理信息进行联合编码,实现跨模态token化;(2)在编码器中引入地理感知自注意力机制,融合视频位置相似性与用户实时位置;(3)设计邻域感知提示(neighbor-aware prompt)捕捉用户周边环境上下文用于生成;此外,采用强化学习框架并定义地理奖励与GMV奖励函数,以协同优化多目标。该方案已在快手App本地生活服务中落地,日均服务4亿活跃用户,GMV和订单数分别提升21.016%和17.891%。

链接: https://arxiv.org/abs/2508.14646
作者: Zhipeng Wei,Kuo Cai,Junda She,Jie Chen,Minghao Chen,Yang Zeng,Qiang Luo,Wencong Zeng,Ruiming Tang,Kun Gai,Guorui Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Local life service is a vital scenario in Kuaishou App, where video recommendation is intrinsically linked with store's location information. Thus, recommendation in our scenario is challenging because we should take into account user's interest and real-time location at the same time. In the face of such complex scenarios, end-to-end generative recommendation has emerged as a new paradigm, such as OneRec in the short video scenario, OneSug in the search scenario, and EGA in the advertising scenario. However, in local life service, an end-to-end generative recommendation model has not yet been developed as there are some key challenges to be solved. The first challenge is how to make full use of geographic information. The second challenge is how to balance multiple objectives, including user interests, the distance between user and stores, and some other business objectives. To address the challenges, we propose OneLoc. Specifically, we leverage geographic information from different perspectives: (1) geo-aware semantic ID incorporates both video and geographic information for tokenization, (2) geo-aware self-attention in the encoder leverages both video location similarity and user's real-time location, and (3) neighbor-aware prompt captures rich context information surrounding users for generation. To balance multiple objectives, we use reinforcement learning and propose two reward functions, i.e., geographic reward and GMV reward. With the above design, OneLoc achieves outstanding offline and online performance. In fact, OneLoc has been deployed in local life service of Kuaishou App. It serves 400 million active users daily, achieving 21.016% and 17.891% improvements in terms of gross merchandise value (GMV) and number of orders.
zh

[AI-14] LeanGeo: Formalizing Competitional Geometry problems in Lean

【速读】:该论文旨在解决当前几何问题求解系统难以在统一框架内表达问题、且与其它数学领域集成困难的问题,同时应对几何证明依赖直观图示而导致验证复杂性的挑战。解决方案的关键在于提出LeanGeo,一个基于Lean 4定理证明器的统一形式化系统,其核心包括一套完备的高等几何定理库(基于Lean的基础逻辑),支持严格的形式化证明验证,并能无缝集成Mathlib数学库;此外,作者还构建了LeanGeo-Bench这一形式化几何基准测试集,涵盖国际数学奥林匹克(IMO)等高级几何问题,用于评估大语言模型在几何推理上的能力边界。

链接: https://arxiv.org/abs/2508.14644
作者: Chendong Song,Zihan Wang,Frederick Pu,Haiming Wang,Xiaohan Lin,Junqi Liu,Jia Li,Zhengying Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages

点击查看摘要

Abstract:Geometry problems are a crucial testbed for AI reasoning capabilities. Most existing geometry solving systems cannot express problems within a unified framework, thus are difficult to integrate with other mathematical fields. Besides, since most geometric proofs rely on intuitive diagrams, verifying geometry problems is particularly challenging. To address these gaps, we introduce LeanGeo, a unified formal system for formalizing and solving competition-level geometry problems within the Lean 4 theorem prover. LeanGeo features a comprehensive library of high-level geometric theorems with Lean’s foundational logic, enabling rigorous proof verification and seamless integration with Mathlib. We also present LeanGeo-Bench, a formal geometry benchmark in LeanGeo, comprising problems from the International Mathematical Olympiad (IMO) and other advanced sources. Our evaluation demonstrates the capabilities and limitations of state-of-the-art Large Language Models on this benchmark, highlighting the need for further advancements in automated geometric reasoning. We open source the theorem library and the benchmark of LeanGeo at this https URL.
zh
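
LeanGeo 的定理库 API 未在摘要中展开;下面仅用不依赖 LeanGeo 或 Mathlib 的 Lean 4 片段,示意"形式化陈述 + 内核可检验证明"的基本形态:

```lean
-- 极简 Lean 4 示例:陈述命题并给出可由内核检验的证明。
-- 仅示意形式化验证的工作方式,不依赖 LeanGeo 或 Mathlib。
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

#check add_comm_example   -- add_comm_example : ∀ (a b : Nat), a + b = b + a
```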

[AI-15] Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination

【速读】:该论文旨在解决多智能体系统中如何有效协作以完成复杂现实任务的问题,特别是针对需要分工、优先级排序和协同规划的结构化救援任务。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)作为智能体,在已知的基于图的环境中进行资源分配与行动协调,并通过一系列协调敏感指标(如任务成功率、冗余动作、空间冲突和紧迫性加权效率)系统评估其性能,从而揭示LLMs在物理具身多智能体协作中的优势与失效模式。

链接: https://arxiv.org/abs/2508.14635
作者: João Vitor de Carvalho Silva,Douglas G. Macharet
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability to coordinate actions across multiple agents is critical for solving complex, real-world problems. Large Language Models (LLMs) have shown strong capabilities in communication, planning, and reasoning, raising the question of whether they can also support effective collaboration in multi-agent settings. In this work, we investigate the use of LLM agents to solve a structured victim rescue task that requires division of labor, prioritization, and cooperative planning. Agents operate in a fully known graph-based environment and must allocate resources to victims with varying needs and urgency levels. We systematically evaluate their performance using a suite of coordination-sensitive metrics, including task success rate, redundant actions, room conflicts, and urgency-weighted efficiency. This study offers new insights into the strengths and failure modes of LLMs in physically grounded multi-agent collaboration tasks, contributing to future benchmarks and architectural improvements.
zh

[AI-16] An Open-Source HW-SW Co-Development Framework Enabling Efficient Multi-Accelerator Systems

【速读】:该论文旨在解决异构加速器为中心的计算集群在集成过程中存在的数据传输效率低下与软硬件兼容性问题,这些问题阻碍了高性能与易用性兼备的统一解决方案的实现。其核心解决方案是提出SNAX——一个开源的软硬件一体化框架,通过一种新颖的混合耦合机制(hybrid-coupling scheme)实现松散耦合的异步控制与紧密耦合的数据访问,从而提升多加速器平台的效率和灵活性;关键创新在于可复用的硬件模块设计与基于MLIR(Multi-Level Intermediate Representation)的可定制编译器,二者协同自动化系统管理任务,显著简化多加速器集群的开发与部署流程,并在低功耗异构片上系统(SoC)中实现神经网络性能提升达10倍、加速器利用率保持90%以上。

链接: https://arxiv.org/abs/2508.14582
作者: Ryan Albert Antonio,Joren Dumoulin,Xiaoling Yi,Josse Van Delm,Yunhao Deng,Guilherme Paim,Marian Verhelst
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 7 pages, 10 figures, 1 table, to be published in ISLPED 2025

点击查看摘要

Abstract:Heterogeneous accelerator-centric compute clusters are emerging as efficient solutions for diverse AI workloads. However, current integration strategies often compromise data movement efficiency and encounter compatibility issues in hardware and software. This prevents a unified approach that balances performance and ease of use. To this end, we present SNAX, an open-source integrated HW-SW framework enabling efficient multi-accelerator platforms through a novel hybrid-coupling scheme, consisting of loosely coupled asynchronous control and tightly coupled data access. SNAX brings reusable hardware modules designed to enhance compute accelerator utilization, and its customizable MLIR-based compiler to automate key system management tasks, jointly enabling rapid development and deployment of customized multi-accelerator compute clusters. Through extensive experimentation, we demonstrate SNAX’s efficiency and flexibility in a low-power heterogeneous SoC. Accelerators can easily be integrated and programmed to achieve 10x improvement in neural network performance compared to other accelerator systems while maintaining accelerator utilization of 90% in full system operation.
zh

[AI-17] Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions

【速读】:该论文旨在解决音乐源分离中语音(vocal)隔离的准确性问题,尤其针对传统基于Transformer的模型在捕捉间歇性出现的 vocals 时表现不佳的问题。其解决方案的关键在于采用 Mamba2——一种先进的状态空间模型(state space model),以更有效地建模长程时间依赖关系,并结合频带分割(band-splitting)策略与双路径(dual-path)架构,从而高效处理长音频序列。实验表明,该方法在 cSDR(信号失真比指标)上达到 11.03 dB,为当前最优结果,且在不同输入长度和 vocal 出现模式下均表现出稳定性能,验证了基于 Mamba 的模型在高分辨率音频处理中的有效性。

链接: https://arxiv.org/abs/2508.14556
作者: Euiyeon Kim,Yong-Hoon Choi
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.

[AI-18] Towards LLM-generated explanations for Component-based Knowledge Graph Question Answering Systems

[Quick Read]: This paper addresses the explainability problem of component-based question answering (QA) systems, whose AI-driven decision processes are hard to interpret. The key is to use the components' input/output data flows as the basis for representing their behavior, with inputs expressed as SPARQL queries and outputs as RDF triples, from which understandable natural-language explanations are constructed. Comparing template-based generation with automatic generation by Large Language Models (LLMs), experiments show that LLM-generated explanations achieve high quality and are mostly rated higher by users, enabling automated explanations, grounded in RDF and SPARQL context, that make the behavior and decisions of QA components significantly more understandable to humans.

Link: https://arxiv.org/abs/2508.14553
Authors: Dennis Schiese, Aleksandr Perevalov, Andreas Both
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Presented at ICWI 2024, Zagreb. Released with ISBN: 978-989-8704-62-7. Data source: this https URL

Abstract:Over time, software systems have reached a level of complexity that makes it difficult for their developers and users to explain particular decisions made by them. In this paper, we focus on the explainability of component-based systems for Question Answering (QA). These components often conduct processes driven by AI methods, in which behavior and decisions cannot be clearly explained or justified, s.t., even for QA experts interpreting the executed process and its results is hard. To address this challenge, we present an approach that considers the components’ input and output data flows as a source for representing the behavior and provide explanations for the components, enabling users to comprehend what happened. In the QA framework used here, the data flows of the components are represented as SPARQL queries (inputs) and RDF triples (outputs). Hence, we are also providing valuable insights on verbalization regarding these data types. In our experiments, the approach generates explanations while following template-based settings (baseline) or via the use of Large Language Models (LLMs) with different configurations (automatic generation). Our evaluation shows that the explanations generated via LLMs achieve high quality and mostly outperform template-based approaches according to the users’ ratings. Therefore, it enables us to automatically explain the behavior and decisions of QA components to humans while using RDF and SPARQL as a context for explanations.

[AI-19] Adaptively Robust LLM Inference Optimization under Prediction Uncertainty

[Quick Read]: This paper studies Large Language Model (LLM) inference scheduling to minimize total latency while improving scheduling efficiency and reducing energy consumption in online multi-task serving. The core difficulty is that a request's prompt length is known on arrival, but its output token length, which directly determines memory usage and processing time, is not. The key is to predict an interval classification (min-max range) of the output length with machine learning and to design two schedulers: the conservative algorithm \mathcal{A}_{\max} schedules by the predicted upper bound to avoid memory overflow but degrades under overestimation, while the superior adaptive algorithm \mathcal{A}_{\min} initially treats the predicted lower bound as the output length and dynamically refines the estimate during inference, achieving a log-scale competitive ratio while relying only on the lower bound of the prediction interval, which is more practical since upper bounds are typically harder to predict accurately. Numerical simulations show that \mathcal{A}_{\min} performs nearly as well as the hindsight scheduler, demonstrating both efficiency and robustness.

Link: https://arxiv.org/abs/2508.14544
Authors: Zixi Chen, Yinyu Ye, Zijie Zhou
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online and multi-task service process and also heavily energy consuming by which a pre-trained LLM processes input requests and generates output tokens sequentially. Therefore, it is vital to improve its scheduling efficiency and reduce the power consumption while a great amount of prompt requests are arriving. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, \mathcal{A}_{\max}, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose \mathcal{A}_{\min}, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inferencing. We prove that \mathcal{A}_{\min} achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that \mathcal{A}_{\min} often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, \mathcal{A}_{\min} relies solely on the lower bound of the prediction interval, an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.
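
To make the adaptive idea concrete, here is a minimal, self-contained sketch of an \mathcal{A}_{\min}-style scheduler. It is not the paper's algorithm: the single memory budget, the one-round-at-a-time batching, and the doubling rule for refining underestimated lengths are simplifying assumptions, and all names (Request, mem_cap) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    lo: int        # predicted lower bound on output tokens
    true_len: int  # actual output length, revealed only while decoding

def schedule_a_min(requests, mem_cap):
    # queue holds [current length estimate, request]; the A_min idea is to
    # seed the estimate with the predicted lower bound, not the upper bound
    queue = [[r.lo, r] for r in requests]
    clock, total_latency = 0, 0
    while queue:
        batch, used, rest = [], 0, []
        for item in sorted(queue, key=lambda x: x[0]):
            if used + item[0] <= mem_cap:   # admit while estimates fit
                batch.append(item)
                used += item[0]
            else:
                rest.append(item)
        if not batch:
            raise RuntimeError("no request fits the memory budget")
        # one decoding round: each admitted request runs until it either
        # finishes or exhausts its current estimate
        clock += max(min(est, r.true_len) for est, r in batch)
        queue = rest
        for est, r in batch:
            if r.true_len <= est:
                total_latency += clock      # request completed this round
            else:
                queue.append([2 * est, r])  # underestimated: double and retry
    return total_latency

reqs = [Request(8, 20), Request(16, 16), Request(4, 5)]
print(schedule_a_min(reqs, mem_cap=32))
```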

[AI-20] Post-hoc LLM-Supported Debugging of Distributed Processes

[Quick Read]: This paper tackles the resource-intensive and in parts archaic practice of manual debugging, which lags behind today's increasingly complex and distributed software architectures. The key is to combine a system's runtime process data with generative AI to automatically produce natural-language explanations from the actual process data, interface information, and documentation, helping developers understand a process's behavior and potential errors more efficiently; the approach is language-agnostic and thus applicable to systems written in different programming languages.

Link: https://arxiv.org/abs/2508.14540
Authors: Dennis Schiese, Andreas Both
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Presented at ICWE 2025, Delft (30 June - 03 July 2025)

Abstract:In this paper, we address the problem of manual debugging, which nowadays remains resource-intensive and in some parts archaic. This problem is especially evident in increasingly complex and distributed software systems. Therefore, our objective of this work is to introduce an approach that can possibly be applied to any system, at both the macro- and micro-level, to ease this debugging process. This approach utilizes a system’s process data, in conjunction with generative AI, to generate natural-language explanations. These explanations are generated from the actual process data, interface information, and documentation to guide the developers more efficiently to understand the behavior and possible errors of a process and its sub-processes. Here, we present a demonstrator that employs this approach on a component-based Java system. However, our approach is language-agnostic. Ideally, the generated explanations will provide a good understanding of the process, even if developers are not familiar with all the details of the considered system. Our demonstrator is provided as an open-source web application that is freely accessible to all users.

[AI-21] Beyond ReLU: Chebyshev-DQN for Enhanced Deep Q-Networks

[Quick Read]: This paper addresses the limited performance of Deep Q-Networks (DQN) on complex reinforcement learning tasks, where the underlying neural network struggles to approximate the action-value function accurately. The key is a new architecture, the Chebyshev-DQN (Ch-DQN), which integrates a Chebyshev polynomial basis into the DQN framework; exploiting the strong function-approximation properties of Chebyshev polynomials yields more effective feature representations and improves learning efficiency and final performance. Experiments show that a moderate polynomial degree (N=4) clearly outperforms a standard DQN (by roughly 39%), while a high degree (N=8) can hurt learning, highlighting the trade-off between model complexity and performance.

Link: https://arxiv.org/abs/2508.14536
Authors: Saman Yazdannik, Morteza Tayefi, Shamim Sanisales
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The performance of Deep Q-Networks (DQN) is critically dependent on the ability of its underlying neural network to accurately approximate the action-value function. Standard function approximators, such as multi-layer perceptrons, may struggle to efficiently represent the complex value landscapes inherent in many reinforcement learning problems. This paper introduces a novel architecture, the Chebyshev-DQN (Ch-DQN), which integrates a Chebyshev polynomial basis into the DQN framework to create a more effective feature representation. By leveraging the powerful function approximation properties of Chebyshev polynomials, we hypothesize that the Ch-DQN can learn more efficiently and achieve higher performance. We evaluate our proposed model on the CartPole-v1 benchmark and compare it against a standard DQN with a comparable number of parameters. Our results demonstrate that the Ch-DQN with a moderate polynomial degree (N=4) achieves significantly better asymptotic performance, outperforming the baseline by approximately 39%. However, we also find that the choice of polynomial degree is a critical hyperparameter, as a high degree (N=8) can be detrimental to learning. This work validates the potential of using orthogonal polynomial bases in deep reinforcement learning while also highlighting the trade-offs involved in model complexity.
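
The core building block, a Chebyshev feature map computed with the recurrence T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), is easy to sketch. The snippet below is an illustrative NumPy version, not the authors' implementation; the tanh squashing into [-1, 1] and the linear Q-head are assumptions.

```python
import numpy as np

def chebyshev_features(x, degree):
    """Map inputs in [-1, 1] to Chebyshev basis values T_0..T_degree
    using the recurrence T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)."""
    x = np.atleast_2d(x)               # (batch, dims)
    feats = [np.ones_like(x), x]       # T_0 = 1, T_1 = x
    for _ in range(2, degree + 1):
        feats.append(2 * x * feats[-1] - feats[-2])
    return np.concatenate(feats, axis=1)   # (batch, dims * (degree + 1))

# toy Q-head: linear readout over Chebyshev features of a CartPole-like
# 4-dimensional state, one row of weights per action
state = np.tanh(np.array([[0.1, -0.4, 0.02, 0.3]]))  # squash into [-1, 1]
phi = chebyshev_features(state, degree=4)
w = np.zeros((2, phi.shape[1]))                      # 2 actions
q_values = phi @ w.T
print(phi.shape, q_values.shape)
```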

[AI-22] EffiFusion-GAN: Efficient Fusion Generative Adversarial Network for Speech Enhancement

[Quick Read]: This paper addresses the high complexity, heavy computational cost, and constrained performance of speech enhancement models in resource-limited environments. The key is EffiFusion-GAN, a lightweight yet powerful generative adversarial network (GAN) that applies depthwise separable convolutions within a multi-scale block to capture diverse acoustic features efficiently, combines an enhanced attention mechanism featuring dual normalization and residual refinement to improve training stability and convergence, and uses dynamic pruning to shrink the model without significant performance loss, making high-quality speech enhancement deployable on edge devices and other resource-constrained settings.

Link: https://arxiv.org/abs/2508.14525
Authors: Bin Wen, Tien-Ping Tan
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce EffiFusion-GAN (Efficient Fusion Generative Adversarial Network), a lightweight yet powerful model for speech enhancement. The model integrates depthwise separable convolutions within a multi-scale block to capture diverse acoustic features efficiently. An enhanced attention mechanism with dual normalization and residual refinement further improves training stability and convergence. Additionally, dynamic pruning is applied to reduce model size while maintaining performance, making the framework suitable for resource-constrained environments. Experimental evaluation on the public VoiceBank+DEMAND dataset shows that EffiFusion-GAN achieves a PESQ score of 3.45, outperforming existing models under the same parameter settings.
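
The depthwise separable convolution at the heart of the multi-scale block can be sketched generically in PyTorch. This is a textbook version of the operator, not the authors' code; the 3x3 kernel, PReLU activation, and spectrogram-shaped input are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv (one filter per channel) followed by a 1x1
    pointwise conv that mixes channels; far fewer parameters than a
    standard Conv2d with the same in/out channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

# spectrogram-shaped toy input: (batch, channels, freq, time)
x = torch.randn(1, 16, 64, 128)
block = DepthwiseSeparableConv(16, 32)
print(block(x).shape)  # torch.Size([1, 32, 64, 128])
```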

[AI-23] MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential Behavior for Retrieval Recommendation CIKM2025

[Quick Read]: This paper addresses two problems in the retrieval stage of large-scale industrial recommender systems: the difficulty of exploiting users' lifelong sequential behavior, and the reliance of existing methods on interaction information while neglecting valuable multi-modal information. The key innovation of the proposed MISS (Multi-modal Indexing and Searching with lifelong Sequence) framework is an index tree built on multi-modal embeddings for more precise retrieval, plus a collaborative general search unit (Co-GSU) and a multi-modal general search unit (MM-GSU) that capture users' diverse interests in their lifelong sequences from multiple perspectives, efficiently fusing multi-modal information with lifelong behavior modeling inside a tree-based retrieval structure.

Link: https://arxiv.org/abs/2508.14515
Authors: Chengcheng Guo, Junda She, Kuo Cai, Shiyao Wang, Qigen Hu, Qiang Luo, Kun Gai, Guorui Zhou
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: CIKM 2025

Abstract:Large-scale industrial recommendation systems typically employ a two-stage paradigm of retrieval and ranking to handle huge amounts of information. Recent research focuses on improving the performance of retrieval model. A promising way is to introduce extensive information about users and items. On one hand, lifelong sequential behavior is valuable. Existing lifelong behavior modeling methods in ranking stage focus on the interaction of lifelong behavior and candidate items from retrieval stage. In retrieval stage, it is difficult to utilize lifelong behavior because of a large corpus of candidate items. On the other hand, existing retrieval methods mostly rely on interaction information, potentially disregarding valuable multi-modal information. To solve these problems, we represent the pioneering exploration of leveraging multi-modal information and lifelong sequence model within the advanced tree-based retrieval model. We propose Multi-modal Indexing and Searching with lifelong Sequence (MISS), which contains a multi-modal index tree and a multi-modal lifelong sequence modeling module. Specifically, for better index structure, we propose multi-modal index tree, which is built using the multi-modal embedding to precisely represent item similarity. To precisely capture diverse user interests in user lifelong sequence, we propose collaborative general search unit (Co-GSU) and multi-modal general search unit (MM-GSU) for multi-perspective interests searching.

[AI-24] Exact Shapley Attributions in Quadratic-time for FANOVA Gaussian Processes

[Quick Read]: This paper tackles the computational complexity of exact Shapley values for probabilistic models such as Gaussian processes (GPs), where the standard exponential cost in the number of features makes the approach impractical. The key is an efficient solution for FANOVA Gaussian processes (FANOVA GP): leveraging a closed-form (stochastic) Möbius representation of the FANOVA decomposition and recursive algorithms inspired by Newton's identities, exact Shapley attributions for both local and global explanations can be computed in quadratic time. The method also models uncertainty, producing not only the expected contribution of each feature but also its variance, thereby delivering trustworthy, scalable, axiomatically sound, and uncertainty-aware explanations for structured probabilistic models.

Link: https://arxiv.org/abs/2508.14499
Authors: Majid Mohammadi, Krikamol Muandet, Ilaria Tiddi, Annette Ten Teije, Siu Lun Chau
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Shapley values are widely recognized as a principled method for attributing importance to input features in machine learning. However, the exact computation of Shapley values scales exponentially with the number of features, severely limiting the practical application of this powerful approach. The challenge is further compounded when the predictive model is probabilistic - as in Gaussian processes (GPs) - where the outputs are random variables rather than point estimates, necessitating additional computational effort in modeling higher-order moments. In this work, we demonstrate that for an important class of GPs known as FANOVA GP, which explicitly models all main effects and interactions, exact Shapley attributions for both local and global explanations can be computed in quadratic time. For local, instance-wise explanations, we define a stochastic cooperative game over function components and compute the exact stochastic Shapley value in quadratic time only, capturing both the expected contribution and uncertainty. For global explanations, we introduce a deterministic, variance-based value function and compute exact Shapley values that quantify each feature’s contribution to the model’s overall sensitivity. Our methods leverage a closed-form (stochastic) Möbius representation of the FANOVA decomposition and introduce recursive algorithms, inspired by Newton’s identities, to efficiently compute the mean and variance of Shapley values. Our work enhances the utility of explainable AI, as demonstrated by empirical studies, by providing more scalable, axiomatically sound, and uncertainty-aware explanations for predictions generated by structured probabilistic models.
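
The flavor of the Newton's-identities recursion the paper builds on can be shown in isolation: converting power sums into elementary symmetric polynomials in O(n^2). This is a generic primitive, not the paper's full Shapley algorithm.

```python
def elementary_from_power_sums(values):
    """Compute elementary symmetric polynomials e_0..e_n of `values`
    from power sums p_k = sum(v**k) via Newton's identities:
        k * e_k = sum_{i=1..k} (-1)^(i-1) * e_{k-i} * p_i
    Runs in O(n^2), the same kind of recursion that turns exact
    Shapley aggregation from exponential into quadratic time."""
    n = len(values)
    p = [sum(v ** k for v in values) for k in range(n + 1)]
    e = [1.0] + [0.0] * n
    for k in range(1, n + 1):
        acc = 0.0
        for i in range(1, k + 1):
            acc += (-1) ** (i - 1) * e[k - i] * p[i]
        e[k] = acc / k
    return e

vals = [1.0, 2.0, 3.0]
print(elementary_from_power_sums(vals))  # [1.0, 6.0, 11.0, 6.0]
```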

[AI-25] Detecting Reading-Induced Confusion Using EEG and Eye Tracking

[Quick Read]: This paper investigates confusion arising during natural reading when information conflicts with or exceeds a reader's comprehension, a phenomenon with important consequences for learning and human-computer interaction. The key is multimodal fusion: combining electroencephalography (EEG) and eye tracking, extracting the neural N400 event-related potential (ERP) and behavioral gaze features, and training machine-learning models that identify reading-induced confusion with high accuracy. The study finds that signals from temporal brain regions dominate the neural signature of confusion and that multimodal models raise classification accuracy by 4-22% over unimodal baselines, up to 89.6%, laying the groundwork for lightweight, low-electrode brain-computer interfaces (BCIs) and adaptive systems that monitor users' cognitive state in real time.

Link: https://arxiv.org/abs/2508.14442
Authors: Haojun Zhuang, Dünya Baradari, Nataliya Kosmyna, Arnav Balyan, Constanze Albrecht, Stephanie Chen, Pattie Maes
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans regularly navigate an overwhelming amount of information via text media, whether reading articles, browsing social media, or interacting with chatbots. Confusion naturally arises when new information conflicts with or exceeds a reader’s comprehension or prior knowledge, posing a challenge for learning. In this study, we present a multimodal investigation of reading-induced confusion using EEG and eye tracking. We collected neural and gaze data from 11 adult participants as they read short paragraphs sampled from diverse, real-world sources. By isolating the N400 event-related potential (ERP), a well-established neural marker of semantic incongruence, and integrating behavioral markers from eye tracking, we provide a detailed analysis of the neural and behavioral correlates of confusion during naturalistic reading. Using machine learning, we show that multimodal (EEG + eye tracking) models improve classification accuracy by 4-22% over unimodal baselines, reaching an average weighted participant accuracy of 77.3% and a best accuracy of 89.6%. Our results highlight the dominance of the brain’s temporal regions in these neural signatures of confusion, suggesting avenues for wearable, low-electrode brain-computer interfaces (BCI) for real-time monitoring. These findings lay the foundation for developing adaptive systems that dynamically detect and respond to user confusion, with potential applications in personalized learning, human-computer interaction, and accessibility.

[AI-26] The Agent Behavior: Model Governance and Challenges in the AI Digital Age

[Quick Read]: This paper addresses the trust, accountability, ethics, and security challenges raised as agents in networked environments increasingly mirror human behavior, in particular the data contamination and unclear accountability caused by the difficulty of supervising agent behavior. The key is a "Network Behavior Lifecycle" model that divides network behavior into six stages and systematically analyzes the behavioral differences between humans and agents at each stage; building on this, the paper introduces the "Agent for Agent (A4A)" paradigm and the "Human-Agent Behavioral Disparity (HABD)" model, which characterize the essential human-agent differences along five dimensions: decision mechanism, execution efficiency, intention-behavior consistency, behavioral inertia, and irrational patterns, providing a theoretical foundation and technical roadmap for secure and trustworthy human-agent collaboration.

Link: https://arxiv.org/abs/2508.14415
Authors: Qiang Zhang, Pei Yan, Yijia Xu, Chuanpo Fu, Yong Fang, Yang Liu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Advancements in AI have led to agents in networked environments increasingly mirroring human behavior, thereby blurring the boundary between artificial and human actors in specific contexts. This shift brings about significant challenges in trust, responsibility, ethics, security and etc. The difficulty in supervising of agent behaviors may lead to issues such as data contamination and unclear accountability. To address these challenges, this paper proposes the “Network Behavior Lifecycle” model, which divides network behavior into 6 stages and systematically analyzes the behavioral differences between humans and agents at each stage. Based on these insights, the paper further introduces the “Agent for Agent (A4A)” paradigm and the “Human-Agent Behavioral Disparity (HABD)” model, which examine the fundamental distinctions between human and agent behaviors across 5 dimensions: decision mechanism, execution efficiency, intention-behavior consistency, behavioral inertia, and irrational patterns. The effectiveness of the model is verified through real-world cases such as red team penetration and blue team defense. Finally, the paper discusses future research directions in dynamic cognitive governance architecture, behavioral disparity quantification, and meta-governance protocol stacks, aiming to provide a theoretical foundation and technical roadmap for secure and trustworthy human-agent collaboration.

[AI-27] Automated Optimization Modeling through Expert-Guided Large Language Model Reasoning

[Quick Read]: This paper aims to free optimization modeling (OM) from its dependence on domain experts and its time-consuming, error-prone workflow, while overcoming three limitations of existing Large Language Model (LLM) approaches: benchmark labeling error rates as high as 42%, narrow evaluation focused only on optimal values, and computational inefficiency from heavy reliance on multi-agent systems or model fine-tuning. The key is the ORThought framework, which applies expert-level optimization-modeling principles through chain-of-thought reasoning to automate the OM process, together with LogiOR, a new logistics-domain benchmark with systematically corrected, standardized annotations and more complex problems; extensive experiments show that ORThought outperforms existing approaches, including multi-agent frameworks, with especially large advantages on complex optimization problems.

Link: https://arxiv.org/abs/2508.14410
Authors: Beinuo Yang, Qishen Zhou, Junyi Li, Xingchen Su, Simon Hu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Optimization Modeling (OM) is essential for solving complex decision-making problems. However, the process remains time-consuming and error-prone, heavily relying on domain experts. While Large Language Models (LLMs) show promise in addressing these challenges through their natural language understanding and reasoning capabilities, current approaches face three critical limitations: high benchmark labeling error rates reaching up to 42%, narrow evaluation scope that only considers optimal values, and computational inefficiency due to heavy reliance on multi-agent systems or model fine-tuning. In this work, we first enhance existing datasets through systematic error correction and more comprehensive annotation. Additionally, we introduce LogiOR, a new optimization modeling benchmark from the logistics domain, containing more complex problems with standardized annotations. Furthermore, we present ORThought, a novel framework that leverages expert-level optimization modeling principles through chain-of-thought reasoning to automate the OM process. Through extensive empirical evaluation, we demonstrate that ORThought outperforms existing approaches, including multi-agent frameworks, with particularly significant advantages on complex optimization problems. Finally, we provide a systematic analysis of our method, identifying critical success factors and failure modes, providing valuable insights for future research on LLM-based optimization modeling.

[AI-28] NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding

[Quick Read]: This paper addresses the shortcomings of existing automatic note-generation tools, which neither preserve the information of instructional videos comprehensively nor satisfy users' expectations for diverse presentation formats and interactive features in digital notes. The key is the NoteIt system, whose novel pipeline faithfully extracts the hierarchical structure and multimodal key information from videos, and whose interactive interface lets users further customize the content and presentation of the notes according to their preferences, yielding a more efficient, flexible, and user-centered note-taking experience.

Link: https://arxiv.org/abs/2508.14395
Authors: Running Zhao, Zhihan Jiang, Xinchen Zhang, Chirui Chang, Handi Chen, Weipeng Deng, Luyao Jin, Xiaojuan Qi, Xun Qian, Edith C.H. Ngai
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to UIST 2025. Project website: this https URL

Abstract:Users often take notes for instructional videos to access key knowledge later without revisiting long videos. Automated note generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools fail to preserve the information conveyed in the original videos comprehensively, nor can they satisfy users’ expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system, which automatically converts instructional videos to interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt’s interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance in objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: this https URL

[AI-29] Online Incident Response Planning under Model Misspecification through Bayesian Learning and Belief Quantization CCS

[Quick Read]: This paper addresses model misspecification in cybersecurity incident response, where conventional decision-support frameworks require a detailed and accurate system model and therefore have limited practical utility under the uncertainty of real attacks. The key is MOBAL (Misspecified Online Bayesian Learning), an online method that iteratively refines a conjecture about the model through Bayesian learning as new information becomes available, and quantizes the conjectured model into a finite Markov model so that effective responses can be planned efficiently via dynamic programming. The authors prove that the Bayesian learning is asymptotically consistent with respect to the information feedback and establish bounds on misspecification and quantization errors; experiments on the CAGE-2 benchmark show better adaptability and robustness to model misspecification than the state of the art.

Link: https://arxiv.org/abs/2508.14385
Authors: Kim Hammar, Tao Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Comments: Accepted to ACM CCS AISec2025

Abstract:Effective responses to cyberattacks require fast decisions, even when information about the attack is incomplete or inaccurate. However, most decision-support frameworks for incident response rely on a detailed system model that describes the incident, which restricts their practical utility. In this paper, we address this limitation and present an online method for incident response planning under model misspecification, which we call MOBAL: Misspecified Online Bayesian Learning. MOBAL iteratively refines a conjecture about the model through Bayesian learning as new information becomes available, which facilitates model adaptation as the incident unfolds. To determine effective responses online, we quantize the conjectured model into a finite Markov model, which enables efficient response planning through dynamic programming. We prove that Bayesian learning is asymptotically consistent with respect to the information feedback. Additionally, we establish bounds on misspecification and quantization errors. Experiments on the CAGE-2 benchmark show that MOBAL outperforms the state of the art in terms of adaptability and robustness to model misspecification.
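
A minimal sketch of the two ingredients, Bayesian reweighting of model conjectures and planning on the belief-averaged finite Markov model, is below. It assumes a small discrete candidate set and toy rewards; all names are hypothetical, and the actual MOBAL quantization step is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2                      # small quantized state/action space
# three candidate dynamics conjectures (row-stochastic transition tensors)
models = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(3)]
belief = np.ones(3) / 3          # uniform prior over conjectures
reward = rng.uniform(size=(S, A))

def bayes_update(belief, s, a, s_next):
    """Reweight each conjecture by the likelihood of the observed step."""
    lik = np.array([m[s, a, s_next] for m in models])
    post = belief * lik
    return post / post.sum()

def plan(belief, gamma=0.9, iters=100):
    """Value iteration on the belief-averaged (finite Markov) model."""
    P = sum(b * m for b, m in zip(belief, models))   # shape (S, A, S)
    V = np.zeros(S)
    for _ in range(iters):
        V = np.max(reward + gamma * P @ V, axis=1)
    return np.argmax(reward + gamma * P @ V, axis=1)  # greedy policy

# observe a transition, refine the conjecture, replan online
belief = bayes_update(belief, s=0, a=1, s_next=2)
print(plan(belief))
```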

[AI-30] Computing-In-Memory Dataflow for Minimal Buffer Traffic

[Quick Read]: This paper addresses two challenges Computing-In-Memory (CIM) faces when accelerating depthwise convolution: underutilization of CIM memory resources and heavy buffer traffic, the latter long overlooked despite its strong impact on latency and energy. The key is a novel CIM dataflow that maximizes data reuse and improves memory utilization, thereby substantially reducing buffer traffic; the dataflow rests on a solid theoretical foundation, and on MobileNet and EfficientNet it cuts buffer traffic by 77.4-87.0%, reducing total data-traffic energy and latency by 10.1-17.9% and 15.6-27.8%, respectively, relative to the conventional weight-stationary baseline.

Link: https://arxiv.org/abs/2508.14375
Authors: Choongseok Song, Doo Seok Jeong
Institution: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: IEEE International Conference on Computer Design

Abstract:Computing-In-Memory (CIM) offers a potential solution to the memory wall issue and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models like MobileNet and EfficientNet, which utilize depthwise convolution for feature extraction, have been developed for these devices. However, CIM macros often face challenges in accelerating depthwise convolution, including underutilization of CIM memory and heavy buffer traffic. The latter, in particular, has been overlooked despite its significant impact on latency and energy consumption. To address this, we introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization during depthwise convolution. The proposed dataflow is grounded in solid theoretical principles, fully demonstrated in this paper. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4-87.0%, leading to a total reduction in data traffic energy and latency by 10.1-17.9% and 15.6-27.8%, respectively, compared to the baseline (conventional weight-stationary dataflow).

[AI-31] Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation

[Quick Read]: This paper tackles two obstacles to forecasting poaching behavior: data imperfection from incomplete detection of poaching events, and scarce real-world data that limits model performance. The key ideas are, first, to integrate flow matching with an occupancy-based detection model and train the flow in latent space so the model can infer the underlying distribution of unobserved poaching activity; and second, to replace the random-noise initialization standard in diffusion models with a composite flow initialized from a linear model's prediction, injecting prior knowledge and improving generalization from small samples. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.

Link: https://arxiv.org/abs/2508.14342
Authors: Lingkai Kong, Haichuan Wang, Charles A. Emogor, Vincent Börsch-Supan, Lily Xu, Milind Tambe
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Poaching poses significant threats to wildlife and biodiversity. A valuable step in reducing poaching is to forecast poacher behavior, which can inform patrol planning and other conservation interventions. Existing poaching prediction methods based on linear models or decision trees lack the expressivity to capture complex, nonlinear spatiotemporal patterns. Recent advances in generative modeling, particularly flow matching, offer a more flexible alternative. However, training such models on real-world poaching data faces two central obstacles: imperfect detection of poaching events and limited data. To address imperfect detection, we integrate flow matching with an occupancy-based detection model and train the flow in latent space to infer the underlying occupancy state. To mitigate data scarcity, we adopt a composite flow initialized from a linear-model prediction rather than random noise which is the standard in diffusion models, injecting prior knowledge and improving generalization. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.
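
The composite-flow idea, starting the probability path at a linear model's prediction instead of pure noise, can be sketched as a conditional flow-matching loss. This is an illustrative PyTorch sketch under simplified assumptions (straight-line path, fixed 0.1 noise scale, toy dimensions), not the paper's latent-space training setup.

```python
import torch
import torch.nn as nn

dim = 16
linear_prior = nn.Linear(dim, dim)   # stand-in for the fitted linear model
vfield = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(vfield.parameters(), lr=1e-3)

def composite_fm_loss(features, target):
    """Flow-matching regression where the path starts at the linear
    model's prediction (plus small noise) rather than at pure noise."""
    x0 = linear_prior(features).detach() + 0.1 * torch.randn_like(target)
    t = torch.rand(target.shape[0], 1)
    xt = (1 - t) * x0 + t * target          # straight-line probability path
    v_target = target - x0                  # constant velocity along the path
    v_pred = vfield(torch.cat([xt, t], dim=1))
    return ((v_pred - v_target) ** 2).mean()

features = torch.randn(32, dim)
target = torch.randn(32, dim)     # stands in for latent occupancy targets
loss = composite_fm_loss(features, target)
loss.backward(); opt.step()
print(float(loss))
```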

[AI-32] A Comparative Evaluation of Teacher-Guided Reinforcement Learning Techniques for Autonomous Cyber Operations

[Quick Read]: This paper addresses the slow convergence and poor early-stage performance of agents that must learn from scratch in Autonomous Cyber Operations (ACO). The key is to implement four distinct teacher-guided techniques and compare them in the simulated CybORG environment; the results show that teacher integration significantly improves training efficiency in terms of both early policy performance and convergence speed, offering a more efficient training paradigm for autonomous cybersecurity.

Link: https://arxiv.org/abs/2508.14340
Authors: Konur Tholl, Mariam El Mezouar, Ranwa Al Mallah
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous Cyber Operations (ACO) rely on Reinforcement Learning (RL) to train agents to make effective decisions in the cybersecurity domain. However, existing ACO applications require agents to learn from scratch, leading to slow convergence and poor early-stage performance. While teacher-guided techniques have demonstrated promise in other domains, they have not yet been applied to ACO. In this study, we implement four distinct teacher-guided techniques in the simulated CybORG environment and conduct a comparative evaluation. Our results demonstrate that teacher integration can significantly improve training efficiency in terms of early policy performance and convergence speed, highlighting its potential benefits for autonomous cybersecurity.

[AI-33] Power Stabilization for AI Training Datacenters

[Quick Read]: This paper addresses the power-management challenges of large-scale AI training jobs spanning tens of thousands of GPUs, where alternating compute-heavy and communication-heavy phases cause large power swings. These swings grow with training scale, and their frequency spectrum can resonate with critical utility frequencies, physically damaging power-grid infrastructure. The key is coordinated innovation across three layers, software, GPU hardware, and datacenter infrastructure, forming a multi-pronged approach to stabilizing the power of large training workloads; the proposed solutions are rigorously validated with real hardware and Microsoft's in-house cloud power simulator to assess the interventions under realistic conditions.

Link: https://arxiv.org/abs/2508.14318
Authors: Esha Choukse, Brijesh Warrier, Scot Heath, Luz Belmont, April Zhao, Hassan Ali Khan, Brian Harry, Matthew Kappel, Russell J. Hewett, Kushal Datta, Yu Pei, Caroline Lichtenberger, John Siegler, David Lukofsky, Zaid Kahn, Gurpreet Sahota, Andy Sullivan, Charles Frederick, Hien Thai, Rebecca Naughton, Daniel Jurnove, Justin Harp, Reid Carper, Nithish Mahalingam, Srini Varkala, Alok Gautam Kumbhare, Satyajit Desai, Venkatesh Ramamurthy, Praneeth Gottumukkala, Girish Bhatia, Kelsey Wildstone, Laurentiu Olariu, Mohammed Ayna, Mike Kendrick, Ricardo Bianchini, Aaron Hurst, Reza Zamani, Xin Li, Gene Oden, Rory Carmichael, Tom Li, Apoorv Gupta, Nilesh Dattani, Lawrence Marwong, Rob Nertney, Jeff Liott, Miro Enev, Divya Ramakrishnan, Ian Buck, Jonah Alben
Institution: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Large Artificial Intelligence (AI) training workloads spanning several tens of thousands of GPUs present unique power management challenges. These arise due to the high variability in power consumption during the training. Given the synchronous nature of these jobs, during every iteration there is a computation-heavy phase, where each GPU works on the local data, and a communication-heavy phase where all the GPUs synchronize on the data. Because compute-heavy phases require much more power than communication phases, large power swings occur. The amplitude of these power swings is ever increasing with the increase in the size of training jobs. An even bigger challenge arises from the frequency spectrum of these power swings which, if harmonized with critical frequencies of utilities, can cause physical damage to the power grid infrastructure. Therefore, to continue scaling AI training workloads safely, we need to stabilize the power of such workloads. This paper introduces the challenge with production data and explores innovative solutions across the stack: software, GPU hardware, and datacenter infrastructure. We present the pros and cons of each of these approaches and finally present a multi-pronged approach to solving the challenge. The proposed solutions are rigorously tested using a combination of real hardware and Microsoft’s in-house cloud power simulator, providing critical insights into the efficacy of these interventions under real-world conditions.

[AI-34] Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

[Quick Read]: This paper addresses the limitations of the two mainstream test-time scaling (TTS) paradigms for large language models (LLMs): reinforcement-learning (RL) methods are unstable and sample-inefficient under sparse outcome rewards, while search-based methods need expensive human- or LLM-labeled intermediate process data and degrade under distribution shift. The proposed AIRL-S is the first natural unification of the RL and search paradigms, built on the insight that the reward function learned during RL training is inherently the ideal process reward model (PRM) for guiding search. Combining adversarial inverse reinforcement learning (AIRL) with group relative policy optimization (GRPO), AIRL-S learns a dense, dynamic PRM directly from correct reasoning traces without any labeled intermediate-step data; at inference, this PRM serves simultaneously as the critic for RL rollouts and as a heuristic guiding search, enabling robust reasoning-chain extension, mitigating reward hacking, and improving cross-task generalization.

Link: https://arxiv.org/abs/2508.14313
Authors: Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.

[AI-35] Learning Time-Varying Convexifications of Multiple Fairness Measures

[Quick Read]: This paper considers the setting where the weights of multiple fairness measures are a priori unknown and possibly time-varying, and must be learned on the fly under the restriction of graph-structured feedback. The key is an online learning scheme for time-varying convexifications of the fairness regularisers that adaptively adjusts the weights of the different fairness measures from the limited graph-structured feedback, jointly handling multiple group and individual fairness notions while balancing model performance and fairness.

Link: https://arxiv.org/abs/2508.14311
Authors: Quan Zhou, Jakub Marecek, Robert Shorten
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:There is an increasing appreciation that one may need to consider multiple measures of fairness, e.g., considering multiple group and individual fairness notions. The relative weights of the fairness regularisers are a priori unknown, may be time varying, and need to be learned on the fly. We consider the learning of time-varying convexifications of multiple fairness measures with limited graph-structured feedback.
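
One standard way to learn such weights online is a multiplicative-weights (Hedge-style) update applied only to the regularisers whose losses are observed through the feedback graph. The sketch below is a generic illustration of that idea, not the paper's algorithm; the random 50% observation mask is an arbitrary stand-in for graph-structured feedback.

```python
import numpy as np

def hedge_step(weights, observed_losses, mask, eta=0.1):
    """Multiplicative-weights update over fairness regularisers.
    `mask` marks which regularisers' losses were observed this round
    (the graph-structured feedback); unobserved ones keep their weight."""
    w = weights.copy()
    w[mask] *= np.exp(-eta * observed_losses[mask])
    return w / w.sum()

rng = np.random.default_rng(1)
K = 3                                # e.g. two group + one individual measure
weights = np.ones(K) / K
for t in range(50):
    losses = rng.uniform(size=K)     # violation of each fairness measure
    mask = rng.random(K) < 0.5       # partial, feedback-graph-limited view
    weights = hedge_step(weights, losses, mask)
print(weights)
```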

[AI-36] Explaining Hitori Puzzles: Neurosymbolic Proof Staging for Sequential Decisions

[Quick Read]: This paper addresses the explainability of complex sequences of decisions, specifically how to combine symbolic reasoning with generative AI to produce clear, trustworthy explanations. The key is a neurosymbolic approach that pairs the exact logical reasoning of SAT solvers with the natural-language abilities of Large Language Models (LLMs): local constraints in Hitori puzzles (such as no repeated numbers) are explained formally via short resolution proofs, while the global connectivity constraint is explained with visually oriented natural language generated by an LLM. This flexible combination improves human understanding of and trust in automated decisions, demonstrated with a tool that assists humans in solving Hitori puzzles.

Link: https://arxiv.org/abs/2508.14294
Authors: Maria Leonor Pacheco, Fabio Somenzi, Dananjay Srinivas, Ashutosh Trivedi
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a neurosymbolic approach to the explanation of complex sequences of decisions that combines the strengths of decision procedures and Large Language Models (LLMs). We demonstrate this approach by producing explanations for the solutions of Hitori puzzles. The rules of Hitori include local constraints that are effectively explained by short resolution proofs. However, they also include a connectivity constraint that is more suitable for visual explanations. Hence, Hitori provides an excellent testing ground for a flexible combination of SAT solvers and LLMs. We have implemented a tool that assists humans in solving Hitori puzzles, and we present experimental evidence of its effectiveness.
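
The SAT side of the combination can be illustrated by encoding Hitori's two local rules in CNF; the global connectivity rule, which the paper delegates to LLM-generated visual explanation, is deliberately omitted here. This is a generic encoding sketch, not the authors' tool.

```python
from itertools import combinations

def hitori_local_clauses(grid):
    """CNF encoding of Hitori's local rules. Variable v(r, c) is true
    when cell (r, c) is shaded. Returns clauses as lists of ints,
    ready for any DIMACS-style SAT solver."""
    n = len(grid)
    v = lambda r, c: r * n + c + 1
    clauses = []
    # rule 1: equal numbers in a row/column cannot both stay unshaded
    lines = [[(r, c) for c in range(n)] for r in range(n)] + \
            [[(r, c) for r in range(n)] for c in range(n)]
    for line in lines:
        for a, b in combinations(line, 2):
            if grid[a[0]][a[1]] == grid[b[0]][b[1]]:
                clauses.append([v(*a), v(*b)])        # shade a or shade b
    # rule 2: no two orthogonally adjacent shaded cells
    for r in range(n):
        for c in range(n):
            if r + 1 < n:
                clauses.append([-v(r, c), -v(r + 1, c)])
            if c + 1 < n:
                clauses.append([-v(r, c), -v(r, c + 1)])
    return clauses

puzzle = [[1, 2, 2],
          [2, 3, 1],
          [3, 1, 3]]
print(len(hitori_local_clauses(puzzle)), "clauses")
```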

[AI-37] Amortized Bayesian Meta-Learning for Low-Rank Adaptation of Large Language Models

[Quick Read]: This paper addresses the weak generalization to unseen datasets of large language models (LLMs) fine-tuned with low-rank adaptation (LoRA). Existing fixes based on in-context prompt optimization or meta-learning can improve generalization but carry heavy computation and memory costs, such as long contexts or second-order gradient updates. The key is Amortized Bayesian Meta-Learning for LoRA (ABMLL), which reframes task-specific and global parameters in the LoRA setting and introduces new hyperparameters that balance reconstruction accuracy against the fidelity of task-specific parameters to the global ones, improving generalization substantially while remaining computationally efficient and scaling to models such as Llama3-8B; thanks to its Bayesian framework, ABMLL also provides better uncertainty quantification.

Link: https://arxiv.org/abs/2508.14285
Authors: Liyi Zhang, Jake Snell, Thomas L. Griffiths
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 6 pages, 2 figures

Abstract:Fine-tuning large language models (LLMs) with low-rank adaptaion (LoRA) is a cost-effective way to incorporate information from a specific dataset. However, it is often unclear how well the fine-tuned LLM will generalize, i.e., how well it will perform on unseen datasets. Methods have been proposed to improve generalization by optimizing with in-context prompts, or by using meta-learning to fine-tune LLMs. However, these methods are expensive in memory and computation, requiring either long-context prompts or saving copies of parameters and using second-order gradient updates. To address these challenges, we propose Amortized Bayesian Meta-Learning for LoRA (ABMLL). This method builds on amortized Bayesian meta-learning for smaller models, adapting this approach to LLMs while maintaining its computational efficiency. We reframe task-specific and global parameters in the context of LoRA and use a set of new hyperparameters to balance reconstruction accuracy and the fidelity of task-specific parameters to the global ones. ABMLL provides effective generalization and scales to large models such as Llama3-8B. Furthermore, as a result of using a Bayesian framework, ABMLL provides improved uncertainty quantification. We test ABMLL on Unified-QA and CrossFit datasets and find that it outperforms existing methods on these benchmarks in terms of both accuracy and expected calibration error.
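
The balance ABMLL strikes between reconstruction and fidelity of task-specific LoRA parameters to global ones can be sketched as a penalized objective. The snippet is a simplification: the quadratic penalty stands in for the Bayesian prior term, and beta, the dimensions, and all tensor names are hypothetical.

```python
import torch

d, r = 32, 4                       # hidden size, LoRA rank
W = torch.randn(d, d)              # frozen pretrained weight
global_A, global_B = torch.zeros(r, d), torch.zeros(d, r)
task_A = (0.01 * torch.randn(r, d)).requires_grad_()
task_B = torch.zeros(d, r, requires_grad=True)
beta = 0.1                         # fidelity-to-global hyperparameter

def abmll_style_loss(x, y):
    """Task loss under LoRA-adapted weights plus a Gaussian-prior-style
    penalty keeping task adapters close to the global ones."""
    W_adapted = W + task_B @ task_A            # low-rank update of W
    recon = ((x @ W_adapted.T - y) ** 2).mean()
    fidelity = ((task_A - global_A) ** 2).sum() + \
               ((task_B - global_B) ** 2).sum()
    return recon + beta * fidelity

x, y = torch.randn(8, d), torch.randn(8, d)
loss = abmll_style_loss(x, y)
loss.backward()                    # gradients flow only into task adapters
print(float(loss))
```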

[AI-38] Incident Analysis for AI Agents AAAI

[Quick Read]: This paper addresses the lack of systematic analysis and reporting mechanisms for incidents that occur during AI agent use. Existing incident-reporting processes are largely based on public data and cannot access sensitive but crucial information such as an agent's chain of thought or browser history, limiting understanding and prevention of root causes. The key is an incident-analysis framework, drawing on systems-safety approaches, that organizes causal factors into three types: system-related (e.g., CBRN training data), contextual (e.g., prompt injection), and cognitive (e.g., misunderstanding a user request), and identifies three kinds of information worth collecting: activity logs, system documentation and access, and information about the tools an agent uses. The framework offers structured guidance on what incident reports should include and what data developers and deployers should retain, strengthening risk management for AI agent incidents.

Link: https://arxiv.org/abs/2508.14231
Authors: Carson Ezell, Xavier Roberts-Gaal, Alan Chan
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages (10 pages main text), 4 figures, 3 tables. To be published in the Proceedings of the 2025 AAAI/ACM Conference on AI, Ethics, Society (AIES)

Abstract:As AI agents become more widely deployed, we are likely to see an increasing number of incidents: events involving AI agent use that directly or indirectly cause harm. For example, agents could be prompt-injected to exfiltrate private information or make unauthorized purchases. Structured information about such incidents (e.g., user prompts) can help us understand their causes and prevent future occurrences. However, existing incident reporting processes are not sufficient for understanding agent incidents. In particular, such processes are largely based on publicly available data, which excludes useful, but potentially sensitive, information such as an agent’s chain of thought or browser history. To inform the development of new, emerging incident reporting processes, we propose an incident analysis framework for agents. Drawing on systems safety approaches, our framework proposes three types of factors that can cause incidents: system-related (e.g., CBRN training data), contextual (e.g., prompt injections), and cognitive (e.g., misunderstanding a user request). We also identify specific information that could help clarify which factors are relevant to a given incident: activity logs, system documentation and access, and information about the tools an agent uses. We provide recommendations for 1) what information incident reports should include and 2) what information developers and deployers should retain and make available to incident investigators upon request. As we transition to a world with more agents, understanding agent incidents will become increasingly crucial for managing risks.

[AI-39] Large Language Models are Highly Aligned with Human Ratings of Emotional Stimuli

[Quick Read]: This paper examines how consistently large language models (LLMs) align with human behavior when evaluating emotionally loaded stimuli such as words and images, to inform their use in human interaction or proxy roles. The key is to compare ratings from several popular LLMs on datasets whose emotional content had previously been rated by humans: GPT-4o aligns closely with human participants across modalities, stimuli, and most rating scales (r = 0.9 or higher in many cases), with happiness the best-aligned dimension and arousal the least, and LLM ratings are substantially more homogeneous than human ones, revealing both similarities and differences between biological and artificial intelligence in emotional understanding.

Link: https://arxiv.org/abs/2508.14214
Authors: Mattson Ogg, Chace Ashcraft, Ritwik Bose, Raphael Norman-Tenazas, Michael Wolmetz
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Emotions exert an immense influence over human behavior and cognition in both commonplace and high-stress tasks. Discussions of whether or how to integrate large language models (LLMs) into everyday life (e.g., acting as proxies for, or interacting with, human agents), should be informed by an understanding of how these tools evaluate emotionally loaded stimuli or situations. A model’s alignment with human behavior in these cases can inform the effectiveness of LLMs for certain roles or interactions. To help build this understanding, we elicited ratings from multiple popular LLMs for datasets of words and images that were previously rated for their emotional content by humans. We found that when performing the same rating tasks, GPT-4o responded very similarly to human participants across modalities, stimuli and most rating scales (r = 0.9 or higher in many cases). However, arousal ratings were less well aligned between human and LLM raters, while happiness ratings were most highly aligned. Overall LLMs aligned better within a five-category (happiness, anger, sadness, fear, disgust) emotion framework than within a two-dimensional (arousal and valence) organization. Finally, LLM ratings were substantially more homogenous than human ratings. Together these results begin to describe how LLM agents interpret emotional stimuli and highlight similarities and differences among biological and artificial intelligence in key behavioral domains.

[AI-40] Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs

[Quick Read]: This paper addresses the rising computational cost, parameter redundancy, and limited generalization of deep artificial neural networks (ANNs) as they scale, asking how to design more efficient structures without sacrificing performance. The key is to use functional connectivity patterns observed in biological circuits as a structural prior: inspired by sparse, modular ensemble-to-ensemble communication in the mouse visual cortex, the proposed G2GNet imposes sparse, modular connectivity across feedforward layers; this static bias is complemented by a dynamic sparse training (DST) mechanism that prunes and regrows connections during training, together with a Hebbian-inspired rewiring rule based on activation correlations. G2GNet reaches up to 75% sparsity while improving accuracy by up to 4.3% over dense baselines on Fashion-MNIST, CIFAR-10, and CIFAR-100 with far fewer computations.

Link: https://arxiv.org/abs/2508.14140
Authors: Orestis Konstantaropoulos, Stelios Manolis Smirnakis, Maria Papadopouli
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The structure of biological neural circuits-modular, hierarchical, and sparsely interconnected-reflects an efficient trade-off between wiring cost, functional specialization, and robustness. These principles offer valuable insights for artificial neural network (ANN) design, especially as networks grow in depth and scale. Sparsity, in particular, has been widely explored for reducing memory and computation, improving speed, and enhancing generalization. Motivated by systems neuroscience findings, we explore how patterns of functional connectivity in the mouse visual cortex-specifically, ensemble-to-ensemble communication, can inform ANN design. We introduce G2GNet, a novel architecture that imposes sparse, modular connectivity across feedforward layers. Despite having significantly fewer parameters than fully connected models, G2GNet achieves superior accuracy on standard vision benchmarks. To our knowledge, this is the first architecture to incorporate biologically observed functional connectivity patterns as a structural bias in ANN design. We complement this static bias with a dynamic sparse training (DST) mechanism that prunes and regrows edges during training. We also propose a Hebbian-inspired rewiring rule based on activation correlations, drawing on principles of biological plasticity. G2GNet achieves up to 75% sparsity while improving accuracy by up to 4.3% on benchmarks, including Fashion-MNIST, CIFAR-10, and CIFAR-100, outperforming dense baselines with far fewer computations.

[AI-41] The Statistical Validation of Innovation Lens

[Quick Read]: This paper addresses the difficulty of evaluating research proposals and allocating resources amid information overload and the accelerating pace of science. The key is to look for structure in scientific discovery and to train a classifier that predicts high-citation papers, providing data-driven support for research decisions; the approach is validated on 2010-2024 papers in the Computer Science, Physics, and PubMed domains, giving statistical evidence that such predictive structure exists.

Link: https://arxiv.org/abs/2508.14139
Authors: Giacomo Radaelli, Jonah Lynch
Institution: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures

Abstract:Information overload and the rapid pace of scientific advancement make it increasingly difficult to evaluate and allocate resources to new research proposals. Is there a structure to scientific discovery that could inform such decisions? We present statistical evidence for such structure, by training a classifier that successfully predicts high-citation research papers between 2010-2024 in the Computer Science, Physics, and PubMed domains.

[AI-42] ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification

[Quick Read]: This paper addresses the unreliable performance of time series classification (TSC) models on out-of-distribution (OOD) data, which stems from models entangling domain-specific with label-relevant features and thereby learning spurious correlations. The key is ERIS (Energy-Regularized Information for Shift-Robustness), an end-to-end framework for guided, reliable feature disentanglement based on the insight that effective separation needs semantic guidance, not only mathematical constraints. ERIS combines three mechanisms: an energy-guided calibration mechanism that provides semantic guidance for the separation and enables self-calibration; a weight-level orthogonality strategy that enforces structural independence between domain-specific and label-relevant features to reduce interference; and an auxiliary adversarial training mechanism that injects structured perturbations for robustness. ERIS improves accuracy over state-of-the-art baselines by an average of 4.04% across four benchmarks.

Link: https://arxiv.org/abs/2508.14134
Authors: Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: conference

Abstract:An ideal time series classification (TSC) should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (\textbfERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial training mechanism enhances robustness by injecting structured perturbations. Experiments demonstrate that ERIS improves upon state-of-the-art baselines by an average of 4.04% accuracy across four benchmarks.

[AI-43] An Improved Multi-Agent Algorithm for Cooperative and Competitive Environments by Identifying and Encouraging Cooperation among Agents

[Quick Read]: This paper addresses insufficient cooperation in multi-agent reinforcement learning (MARL), where existing algorithms struggle to promote coordinated strategies that raise both team and individual rewards. The key is to build on MADDPG and introduce a new parameter that increases the reward an agent obtains when cooperative behavior among agents is identified, improving overall team performance and individual returns without changing the underlying framework; comparisons with MADDPG in PettingZoo environments confirm higher team and individual rewards.

Link: https://arxiv.org/abs/2508.14131
Authors: Junjie Qi, Siqi Mao, Tianyi Tan
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose an improved algorithm by identifying and encouraging cooperative behavior in multi-agent environments. First, we analyze the shortcomings of existing algorithms in addressing multi-agent reinforcement learning problems. Then, based on the existing algorithm MADDPG, we introduce a new parameter to increase the reward that an agent can obtain when cooperative behavior among agents is identified. Finally, we compare our improved algorithm with MADDPG in environments from PettingZoo. The results show that the new algorithm helps agents achieve both higher team rewards and individual rewards.
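
The modification amounts to reward shaping: add a bonus whenever cooperation between agents is identified. The sketch below uses a distance-based proxy for "cooperation identified", which is an illustrative assumption, not the paper's criterion, and the radius and bonus values are hypothetical.

```python
import numpy as np

def shaped_rewards(rewards, positions, coop_radius=1.0, coop_bonus=0.5):
    """Add a bonus to each agent's reward when cooperative behavior is
    identified; here, 'cooperation' is the illustrative proxy of two
    agents operating within coop_radius of each other."""
    n = len(rewards)
    shaped = np.array(rewards, dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(positions[i] - positions[j]) < coop_radius:
                shaped[i] += coop_bonus
                shaped[j] += coop_bonus
    return shaped

pos = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0]])
print(shaped_rewards([1.0, 0.5, 0.2], pos))  # first two agents get the bonus
```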

[AI-44] CCFC: Core Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection

[Quick Read]: This paper addresses the vulnerability of deployed large language models (LLMs) to jailbreak attacks, particularly prompt injection and structure-aware jailbreaks. The key is CCFC (Core Core-Full-Core), a dual-track, prompt-level defense framework: few-shot prompting first isolates the semantic core of the user query, and two complementary tracks are then run, a core-only track that ignores adversarial distractions (such as toxic suffixes or prefix injections) and a core-full-core (CFC) track that disrupts the structural patterns exploited by gradient- or edit-based attacks. The final response is selected via a safety-consistency check across both tracks, cutting attack success rates by 50-75% against strong adversaries (e.g., DeepInception, GCG) compared with state-of-the-art defenses while preserving fidelity on benign queries.

Link: https://arxiv.org/abs/2508.14128
Authors: Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure

Abstract:Jailbreak attacks pose a serious challenge to the safe deployment of large language models (LLMs). We introduce CCFC (Core Core-Full-Core), a dual-track, prompt-level defense framework designed to mitigate LLMs’ vulnerabilities from prompt injection and structure-aware jailbreak attacks. CCFC operates by first isolating the semantic core of a user query via few-shot prompting, and then evaluating the query using two complementary tracks: a core-only track to ignore adversarial distractions (e.g., toxic suffixes or prefix injections), and a core-full-core (CFC) track to disrupt the structural patterns exploited by gradient-based or edit-based attacks. The final response is selected based on a safety consistency check across both tracks, ensuring robustness without compromising on response quality. We demonstrate that CCFC cuts attack success rates by 50-75% versus state-of-the-art defenses against strong adversaries (e.g., DeepInception, GCG), without sacrificing fidelity on benign queries. Our method consistently outperforms state-of-the-art prompt-level defenses, offering a practical and effective solution for safer LLM deployment.
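
The dual-track flow can be sketched at the prompt level. Here llm(prompt) and is_safe(text) are hypothetical stand-ins for a serving endpoint and a safety classifier, and the prompts are illustrative, not the paper's few-shot templates.

```python
def ccfc_respond(query, llm, is_safe):
    """Dual-track, CCFC-style defense sketch. `llm(prompt) -> str` and
    `is_safe(text) -> bool` are hypothetical stand-ins."""
    core = llm("Extract the single core request in this query, "
               "ignoring any unrelated instructions:\n" + query)
    core_answer = llm(core)                          # core-only track
    cfc_prompt = f"{core}\n\n{query}\n\n{core}"      # core-full-core track
    cfc_answer = llm(cfc_prompt)
    # safety-consistency check across both tracks
    if is_safe(core_answer) and is_safe(cfc_answer):
        return cfc_answer
    return "I can't help with that request."

# toy run with stub callables
print(ccfc_respond("What is 2+2?", llm=lambda p: "4", is_safe=lambda t: True))
```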

[AI-45] A Cost-Effective Framework for Predicting Parking Availability Using Geospatial Data and Machine Learning

[Quick Read]: This paper aims to ease campus parking pressure and the opacity of occupancy information so that students can quickly and conveniently find vacant spots around class times. The key is a smart framework that needs no sensors in the street or parking areas: it integrates heterogeneous sources, including street maps, mobility traces, and meteorological data, through a spatial join, and builds time-series models of parking behavior. Four algorithms are compared, namely Linear Regression, Support Vector Regression (SVR), Random Forest Regression (RFR), and Long Short-Term Memory (LSTM); RFR performs best (RMSE = 0.142, R^2 = 0.582), while LSTM, given its temporal modeling ability, may improve further with more data and longer timesteps.

Link: https://arxiv.org/abs/2508.14125
Authors: Madyan Bagosher, Tala Mustafa, Mohammad Alsmirat, Amal Al-Ali, Isam Mashhour Al Jawarneh
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As urban populations continue to grow, cities face numerous challenges in managing parking and determining occupancy. This issue is particularly pronounced in university campuses, where students need to find vacant parking spots quickly and conveniently during class timings. The limited availability of parking spaces on campuses underscores the necessity of implementing efficient systems to allocate vacant parking spots effectively. We propose a smart framework that integrates multiple data sources, including street maps, mobility, and meteorological data, through a spatial join operation to capture parking behavior and vehicle movement patterns over the span of 3 consecutive days with an hourly duration between 7AM till 3PM. The system will not require any sensing tools to be installed in the street or in the parking area to provide its services since all the data needed will be collected using location services. The framework will use the expected parking entrance and time to specify a suitable parking area. Several forecasting models, namely, Linear Regression, Support Vector Regression (SVR), Random Forest Regression (RFR), and Long Short-Term Memory (LSTM), are evaluated. Hyperparameter tuning was employed using grid search, and model performance is assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Coefficient of Determination (R2). Random Forest Regression achieved the lowest RMSE of 0.142 and highest R2 of 0.582. However, given the time-series nature of the task, an LSTM model may perform better with additional data and longer timesteps.
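
Of the four models compared, the winning Random Forest pipeline with grid search and RMSE/MAE/R^2 reporting looks roughly like the following scikit-learn sketch; the synthetic features (hour, day, temperature, entrance) are stand-ins for the paper's spatially joined data, and the parameter grid is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(42)
# stand-in features: hour of day, day index, temperature, entrance id
X = rng.uniform(size=(300, 4))
y = 0.6 * X[:, 0] + 0.2 * X[:, 2] + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 8]},
    scoring="neg_root_mean_squared_error", cv=3)
grid.fit(X_tr, y_tr)
pred = grid.predict(X_te)
print("RMSE", mean_squared_error(y_te, pred) ** 0.5,
      "MAE", mean_absolute_error(y_te, pred),
      "R2", r2_score(y_te, pred))
```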

[AI-46] AI Agents for Photonic Integrated Circuit Design Automation

[Quick Read]: This paper addresses the heavy manual effort and low efficiency of photonic integrated circuit (PIC) design by proposing Photonics Intelligent Design and Optimization (PhIDO), a multi-agent framework that converts natural-language PIC design requests into layout mask files. The key is leveraging seven reasoning large language models to parse and reason over complex design instructions, validated on a testbench of 102 design descriptions: success rates reach up to 91% for single-device designs, and for queries with at most 15 components, o1, Gemini-2.5-pro, and Claude Opus 4 achieve end-to-end pass@5 success rates of about 57%, with Gemini-2.5-pro requiring the fewest output tokens and the lowest cost.

Link: https://arxiv.org/abs/2508.14123
Authors: Ankita Sharma, YuQi Fu, Vahid Ansari, Rishabh Iyer, Fiona Kuang, Kashish Mistry, Raisa Islam Aishy, Sara Ahmad, Joaquin Matres, Dirk R. Englund, Joyce K.S. Poon
Institution: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Optics (physics.optics)
Comments:

Abstract:We present Photonics Intelligent Design and Optimization (PhIDO), a multi-agent framework that converts natural-language photonic integrated circuit (PIC) design requests into layout mask files. We compare 7 reasoning large language models for PhIDO using a testbench of 102 design descriptions that ranged from single devices to 112-component PICs. The success rate for single-device designs was up to 91%. For design queries with less than or equal to 15 components, o1, Gemini-2.5-pro, and Claude Opus 4 achieved the highest end-to-end pass@5 success rates of approximately 57%, with Gemini-2.5-pro requiring the fewest output tokens and lowest cost. The next steps toward autonomous PIC development include standardized knowledge representations, expanded datasets, extended verification, and robotic automation.

[AI-47] SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning

[Quick Read]: This paper addresses the lack of physical realism in generated humanoid-object interaction (HOI), such as implausible contacts, penetration, and unnatural whole-body motion, which prevents successful execution in physical environments. The key is SimGenHOI, a unified framework with two core parts: a Diffusion Transformer (DiT)-based HOI generative model that predicts key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose, interpolating them into smooth long-horizon trajectories; and a contact-aware whole-body control policy trained with reinforcement learning to track the generated motions while correcting physical artifacts such as penetration and foot sliding. A mutual fine-tuning strategy lets the generative model and the control policy refine each other iteratively, markedly improving motion realism and tracking robustness.

Link: https://arxiv.org/abs/2508.14120
Authors: Yuhang Lin, Yijia Xie, Jiahong Xie, Yuehao Huang, Ruoyu Wang, Jiajun Lv, Yukai Ma, Xingxing Zuo
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating physically realistic humanoid-object interactions (HOI) is a fundamental challenge in robotics. Existing HOI generation approaches, such as diffusion-based models, often suffer from artifacts such as implausible contacts, penetrations, and unrealistic whole-body actions, which hinder successful execution in physical environments. To address these challenges, we introduce SimGenHOI, a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. These key actions capture essential interaction dynamics and are interpolated into smooth motion trajectories, naturally supporting long-horizon generation. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding. Furthermore, we introduce a mutual fine-tuning strategy, where the generative model and the control policy iteratively refine each other, improving both motion realism and tracking robustness. Extensive experiments demonstrate that SimGenHOI generates realistic, diverse, and physically plausible humanoid-object interactions, achieving significantly higher tracking success rates in simulation and enabling long-horizon manipulation tasks. Code will be released upon acceptance on our project page: this https URL.
zh

[AI-48] Documenting Deployment with Fabric: A Repository of Real-World AI Governance

【Quick Read】: This paper addresses the limited academic attention paid to how AI is actually deployed, in particular the lack of a systematic, empirical account of the governance mechanisms behind fielded AI systems. The key contribution is Fabric, a publicly available repository of 20 real-world AI use cases collected through practitioner interviews, together with workflow diagrams co-designed with practitioners that surface the human-oversight strategies and guardrails used in practice. Fabric is intended as an extendable, evolving tool for researchers to assess the effectiveness of AI governance and identify governance gaps.

Link: https://arxiv.org/abs/2508.14119
Authors: Mackenzie Jorgensen, Kendall Brogle, Katherine M. Collins, Lujain Ibrahim, Arina Shah, Petra Ivanovic, Noah Broestl, Gabriel Piles, Paul Dongha, Hatim Abdulhussein, Adrian Weller, Jillian Powers, Umang Bhatt
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: AIES 2025

Click to view abstract

Abstract:Artificial intelligence (AI) is increasingly integrated into society, from financial services and traffic management to creative writing. Academic literature on the deployment of AI has mostly focused on the risks and harms that result from the use of AI. We introduce Fabric, a publicly available repository of deployed AI use cases to outline their governance mechanisms. Through semi-structured interviews with practitioners, we collect an initial set of 20 AI use cases. In addition, we co-design diagrams of the AI workflow with the practitioners. We discuss the oversight mechanisms and guardrails used in practice to safeguard AI use. The Fabric repository includes visual diagrams of AI use cases and descriptions of the deployed systems. Using the repository, we surface gaps in governance and find common patterns in human oversight of deployed AI systems. We intend for Fabric to serve as an extendable, evolving tool for researchers to study the effectiveness of AI governance.
zh

[AI-49] Enriching Moral Perspectives on AI: Concepts of Trust amongst Africans

【Quick Read】: This paper asks how AI trustworthiness is conceptualized outside the Western, Educated, Industrialised, Rich and Democratic (WEIRD) societies whose perspectives dominate current research, and specifically how African developers, researchers, and users understand and enact trust in AI. The key contribution is a survey of 157 people with professional or educational interests in AI from 25 African countries, showing that respondents' values are shaped by the communities they grew up in, emphasize communal relations over individual freedoms, and weave distinctly African notions of Afro-relationalism into internationally used trust constructs such as reliability and reliance, providing a culturally grounded empirical basis for global AI governance.

Link: https://arxiv.org/abs/2508.14116
Authors: Lameck Mbangula Amugongo, Nicola J Bidwell, Joseph Mwatukange
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The trustworthiness of AI is considered essential to the adoption and application of AI systems. However, the meaning of trust varies across industry, research and policy spaces. Studies suggest that professionals who develop and use AI regard an AI system as trustworthy based on their personal experiences and social relations at work. Studies about trust in AI typically rely on constructs that aim to operationalise it (e.g., consistency, reliability, explainability and accountability). However, the majority of existing studies about trust in AI are situated in Western, Educated, Industrialised, Rich and Democratic (WEIRD) societies. The few studies about trust and AI in Africa do not include the views of people who develop, study or use AI in their work. In this study, we surveyed 157 people with professional and/or educational interests in AI from 25 African countries, to explore how they conceptualised trust in AI. Most respondents had links with workshops about trust and AI in Africa in Namibia and Ghana. Respondents’ educational background, transnational mobility, and country of origin influenced their concerns about AI systems. These factors also affected their levels of distrust in certain AI applications and their emphasis on specific principles designed to foster trust. Respondents often expressed that their values are guided by the communities in which they grew up and emphasised communal relations over individual freedoms. They described trust in many ways, including applying nuances of Afro-relationalism to constructs in international discourse, such as reliability and reliance. Thus, our exploratory study motivates more empirical research about the ways trust is practically enacted and experienced in African social realities of AI design, use and governance.
zh

[AI-50] Ambiguity Resolution with Human Feedback for Code Writing Tasks

【Quick Read】: This paper addresses the fact that natural-language specifications for code-writing tasks are often ambiguous, which requires programmers to recognize and resolve the unclear points. The key contribution is a novel technique (ARHF: Ambiguity Resolution with Human Feedback) that (1) identifies specific inputs on which a given task specification may be ambiguous, (2) gathers limited human feedback about the desired behavior on those inputs, and (3) uses that feedback to generate code that resolves the ambiguities.

Link: https://arxiv.org/abs/2508.14114
Authors: Aditey Nandan, Viraj Kumar
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at the Proceedings of the 33rd International Conference on Computers in Education (ICCE 2025), Asia-Pacific Society for Computers in Education (APSCE)

Click to view abstract

Abstract:Specifications for code writing tasks are usually expressed in natural language and may be ambiguous. Programmers must therefore develop the ability to recognize ambiguities in task specifications and resolve them by asking clarifying questions. We present and evaluate a prototype system, based on a novel technique (ARHF: Ambiguity Resolution with Human Feedback), that (1) suggests specific inputs on which a given task specification may be ambiguous, (2) seeks limited human feedback about the code’s desired behavior on those inputs, and (3) uses this feedback to generate code that resolves these ambiguities. We evaluate the efficacy of our prototype, and we discuss the implications of such assistive systems on Computer Science education.
zh

[AI-51] PAPPL: Personalized AI-Powered Progressive Learning Platform

【Quick Read】: This paper tackles a long-standing problem in engineering education: rigid, standardized teaching frameworks that fail to serve students' diverse learning needs, and traditional assessments (exams, homework) that cannot deliver personalized learning experiences. The key contribution is the Personalized AI-Powered Progressive Learning (PAPPL) platform, an Intelligent Tutoring System (ITS) built on the GPT-4o large language model (LLM) that integrates expert, student, tutor, and user-interface modules. It tracks and analyzes student interactions, detects recurring misconceptions, and delivers progressive, context-sensitive, pedagogically sound feedback, yielding a scalable, data-driven personalized learning environment for engineering-oriented STEM education.

Link: https://arxiv.org/abs/2508.14109
Authors: Shayan Bafandkar, Sungyong Chung, Homa Khosravian, Alireza Talebpour
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:Engineering education has historically been constrained by rigid, standardized frameworks, often neglecting students’ diverse learning needs and interests. While significant advancements have been made in online and personalized education within K-12 and foundational sciences, engineering education at both undergraduate and graduate levels continues to lag in adopting similar innovations. Traditional evaluation methods, such as exams and homework assignments, frequently overlook individual student requirements, impeding personalized educational experiences. To address these limitations, this paper introduces the Personalized AI-Powered Progressive Learning (PAPPL) platform, an advanced Intelligent Tutoring System (ITS) designed specifically for engineering education. It highlights the development of a scalable, data-driven tutoring environment leveraging cutting-edge AI technology to enhance personalized learning across diverse academic disciplines, particularly in STEM fields. PAPPL integrates core ITS components including the expert module, student module, tutor module, and user interface, and utilizes GPT-4o, a sophisticated large language model (LLM), to deliver context-sensitive and pedagogically sound hints based on students’ interactions. The system uniquely records student attempts, detects recurring misconceptions, and generates progressively targeted feedback, providing personalized assistance that adapts dynamically to each student’s learning profile. Additionally, PAPPL offers instructors detailed analytics, empowering evidence-based adjustments to teaching strategies. This study provides a fundamental framework for the progression of Generative ITSs scalable to all education levels, delivering important perspectives on personalized progressive learning and the wider possibilities of Generative AI in the field of education.
zh

[AI-52] You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

【Quick Read】: This paper addresses the shortcomings of current benchmarks for evaluating LLM-generated software: static checks and binary pass/fail scripts cannot capture the interactive behaviors and runtime dynamics of real applications, making it hard to judge production readiness. The proposed RealDevWorld framework has two components: RealDevBench, a diverse suite of 194 open-ended software engineering tasks across multiple domains that incorporates multimodal elements to reflect real-world complexity; and AppEvalPilot, an agent-as-a-judge evaluation system that simulates realistic GUI interactions to holistically assess functional correctness, visual fidelity, and runtime behavior. The framework delivers fine-grained, task-specific diagnostic feedback, reaching 0.92 accuracy and 0.85 correlation with expert human judgments while greatly reducing reliance on manual review.

Link: https://arxiv.org/abs/2508.14104
Authors: Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) and code agents in software development are rapidly evolving from generating isolated code snippets to producing full-fledged software applications with graphical interfaces, interactive logic, and dynamic behaviors. However, current benchmarks fall short in evaluating such production-ready software, as they often rely on static checks or binary pass/fail scripts, failing to capture the interactive behaviors and runtime dynamics that define real-world usability - qualities that only emerge when an application is actively used. This is the blind spot of current evaluation: you don’t know if an app works until you click through it, interact with it, and observe how it responds. To bridge this gap, we introduce RealDevWorld, a novel evaluation framework for automated end-to-end assessment of LLMs’ ability to generate production-ready repositories from scratch. It features two key components: (1) RealDevBench, a diverse collection of 194 open-ended software engineering tasks across multiple domains, incorporating multimodal elements to reflect real-world complexity; and (2) AppEvalPilot, a new agent-as-a-judge evaluation system that simulates realistic, GUI-based user interactions to automatically and holistically assess software functional correctness, visual fidelity, and runtime behavior. The framework delivers fine-grained, task-specific diagnostic feedback, supporting nuanced evaluation beyond simple success/failure judgments. Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations, achieving an accuracy of 0.92 and a correlation of 0.85 with expert human assessments, while significantly reducing the reliance on manual review. This enables scalable, human-aligned assessment of production-level software generated by LLMs. Our code is available on GitHub.
zh

[AI-53] Implicit Hypergraph Neural Network ICDM2025

【Quick Read】: This paper addresses the degraded performance of existing hypergraph neural networks (HNNs) when capturing long-range high-order dependencies. Conventional HNNs run only a few message-passing rounds, so node representations capture only local information; naively increasing the number of rounds instead degrades performance by over-aggregating noisy or redundant information. The proposed Implicit Hypergraph Neural Network (IHNN) jointly learns fixed-point representations for both nodes and hyperedges end-to-end via implicit differentiation, trained efficiently with a tractable projected gradient descent scheme. On real-world hypergraph node-classification benchmarks, IHNN outperforms the closest prior works in most settings, establishing a new state of the art.

Link: https://arxiv.org/abs/2508.14101
Authors: Akash Choudhuri, Yongjian Zhong, Bijaya Adhikari
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to ICDM 2025

Click to view abstract

Abstract:Hypergraphs offer a generalized framework for capturing high-order relationships between entities and have been widely applied in various domains, including healthcare, social networks, and bioinformatics. Hypergraph neural networks, which rely on message-passing between nodes over hyperedges to learn latent representations, have emerged as the method of choice for predictive tasks in many of these domains. These approaches typically perform only a small number of message-passing rounds to learn the representations, which they then utilize for predictions. The small number of message-passing rounds comes at a cost, as the representations only capture local information and forego long-range high-order dependencies. However, as we demonstrate, blindly increasing the message-passing rounds to capture long-range dependency also degrades the performance of hyper-graph neural networks. Recent works have demonstrated that implicit graph neural networks capture long-range dependencies in standard graphs while maintaining performance. Despite their popularity, prior work has not studied long-range dependency issues on hypergraph neural networks. Here, we first demonstrate that existing hypergraph neural networks lose predictive power when aggregating more information to capture long-range dependency. We then propose Implicit Hypergraph Neural Network (IHNN), a novel framework that jointly learns fixed-point representations for both nodes and hyperedges in an end-to-end manner to alleviate this issue. Leveraging implicit differentiation, we introduce a tractable projected gradient descent approach to train the model efficiently. Extensive experiments on real-world hypergraphs for node classification demonstrate that IHNN outperforms the closest prior works in most settings, establishing a new state-of-the-art in hypergraph learning.
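The fixed-point idea behind implicit (hyper)graph networks can be sketched in a few lines. This is a generic implicit-GNN iteration under a contraction assumption, with illustrative names; it is not IHNN's exact update rule:

```python
import numpy as np

def fixed_point_embeddings(A, X, W, U, tol=1e-5, max_iter=200):
    """Iterate H <- tanh(A @ H @ W + X @ U) to a fixed point, so the
    representation can reflect arbitrarily long-range structure.
    A: (n, n) normalized graph/hypergraph operator, X: (n, d) features."""
    H = np.zeros((A.shape[0], W.shape[1]))
    for _ in range(max_iter):
        H_new = np.tanh(A @ H @ W + X @ U)
        if np.linalg.norm(H_new - H) < tol:
            break
        H = H_new
    return H

n, d, h = 6, 4, 8
rng = np.random.default_rng(0)
A = rng.random((n, n)); A /= A.sum(1, keepdims=True)    # row-normalized operator
X = rng.standard_normal((n, d))
W = 0.5 * rng.standard_normal((h, h)) / np.sqrt(h)       # scaled to encourage contraction
U = rng.standard_normal((d, h))
print(fixed_point_embeddings(A, X, W, U).shape)          # (6, 8)
```

Training through such a fixed point is where implicit differentiation comes in: gradients are taken at the equilibrium rather than by backpropagating through every iteration.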
zh

[AI-54] Domain Translation of a Soft Robotic Arm using Conditional Cycle Generative Adversarial Network

【Quick Read】: This paper addresses the difficulty of transferring knowledge between physical domains for soft robots, especially when material degradation over time changes the system dynamics and precise analytical modeling becomes impractical. The key contribution is a domain-translation framework based on a conditional cycle generative adversarial network (CCGAN) that learns the mapping between input pressure signals and end-effector poses across source and target domains, combined with a dynamic learning strategy that adapts a pose controller trained in the source domain to a target domain with tenfold increased viscosity, improving the generalization and robustness of soft robotic controllers.

Link: https://arxiv.org/abs/2508.14100
Authors: Nilay Kushawaha, Carlo Alessi, Lorenzo Fruzzetti, Egidio Falotico
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE International Conference on Robotic Systems and Applications

Click to view abstract

Abstract:Deep learning provides a powerful method for modeling the dynamics of soft robots, offering advantages over traditional analytical approaches that require precise knowledge of the robot’s structure, material properties, and other physical characteristics. Given the inherent complexity and non-linearity of these systems, extracting such details can be challenging. The mappings learned in one domain cannot be directly transferred to another domain with different physical properties. This challenge is particularly relevant for soft robots, as their materials gradually degrade over time. In this paper, we introduce a domain translation framework based on a conditional cycle generative adversarial network (CCGAN) to enable knowledge transfer from a source domain to a target domain. Specifically, we employ a dynamic learning approach to adapt a pose controller trained in a standard simulation environment to a domain with tenfold increased viscosity. Our model learns from input pressure signals conditioned on corresponding end-effector positions and orientations in both domains. We evaluate our approach through trajectory-tracking experiments across five distinct shapes and further assess its robustness under noise perturbations and periodicity tests. The results demonstrate that CCGAN-GP effectively facilitates cross-domain skill transfer, paving the way for more adaptable and generalizable soft robotic controllers.
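One direction of a conditional cycle-consistency objective can be sketched as below. The LSGAN-style adversarial term, the L1 cycle term, and all function names are illustrative choices, not necessarily the paper's exact CCGAN-GP formulation:

```python
import torch
import torch.nn.functional as F

def cycle_losses(G_st, G_ts, D_t, x_s, cond, lam=10.0):
    """G_st maps source-domain pressure signals to the target domain,
    conditioned on end-effector pose `cond`; G_ts maps back. The cycle
    term forces G_ts(G_st(x)) to reconstruct the source signal."""
    x_t_fake = G_st(x_s, cond)
    pred = D_t(x_t_fake, cond)
    adv = F.mse_loss(pred, torch.ones_like(pred))        # fool target discriminator
    cycle = F.l1_loss(G_ts(x_t_fake, cond), x_s)         # cycle consistency
    return adv + lam * cycle

# toy stand-ins to exercise the function
G = lambda x, c: torch.tanh(x + 0.1 * c)
D = lambda x, c: torch.sigmoid((x * c).mean(dim=1, keepdim=True))
x_s, cond = torch.randn(8, 16), torch.randn(8, 16)
print(cycle_losses(G, G, D, x_s, cond).item())
```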
zh

[AI-55] No More Marching: Learning Humanoid Locomotion for Short-Range SE(2) Targets

【Quick Read】: This paper addresses the inefficiency of learning-based humanoid locomotion for short-range, task-driven movements to SE(2) target poses in real workspaces: existing methods optimize velocity tracking rather than direct pose reaching, producing wasteful, marching-style behavior on short-range tasks. The key contribution is a constellation-based reward function that induces natural, efficient target-oriented motion, together with a benchmarking framework measuring energy consumption, time-to-target, and footstep count over a distribution of SE(2) goals. Results show consistent gains over standard methods and successful sim-to-hardware transfer, underscoring the importance of targeted reward design for practical short-range humanoid locomotion.

Link: https://arxiv.org/abs/2508.14098
Authors: Pranay Dugar, Mohitvishnu S. Gadde, Jonah Siekmann, Yesh Godse, Aayam Shrestha, Alan Fern
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Humanoids operating in real-world workspaces must frequently execute task-driven, short-range movements to SE(2) target poses. To be practical, these transitions must be fast, robust, and energy efficient. While learning-based locomotion has made significant progress, most existing methods optimize for velocity-tracking rather than direct pose reaching, resulting in inefficient, marching-style behavior when applied to short-range tasks. In this work, we develop a reinforcement learning approach that directly optimizes humanoid locomotion for SE(2) targets. Central to this approach is a new constellation-based reward function that encourages natural and efficient target-oriented movement. To evaluate performance, we introduce a benchmarking framework that measures energy consumption, time-to-target, and footstep count on a distribution of SE(2) goals. Our results show that the proposed approach consistently outperforms standard methods and enables successful transfer from simulation to hardware, highlighting the importance of targeted reward design for practical short-range humanoid locomotion.
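One plausible reading of a "constellation-based" SE(2) reward, offered purely as an illustrative sketch rather than the paper's definition: attach a small constellation of points rigidly to the robot frame and to the target frame, and reward closeness of corresponding points, which couples position and heading errors in a single term:

```python
import numpy as np

def constellation_reward(pose, target, offsets, alpha=1.0):
    """pose/target: (x, y, theta) in SE(2); offsets: (k, 2) body-frame points."""
    def place(p):
        x, y, th = p
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        return offsets @ R.T + np.array([x, y])
    d = np.linalg.norm(place(pose) - place(target), axis=1).mean()
    return np.exp(-alpha * d)   # 1.0 exactly at the target pose

offsets = np.array([[0.3, 0.0], [-0.15, 0.2], [-0.15, -0.2]])
print(constellation_reward((0.1, 0.0, 0.2), (0.0, 0.0, 0.0), offsets))
```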
zh

[AI-56] Non-Dissipative Graph Propagation for Non-Local Community Detection IJCNN2025

【Quick Read】: This paper addresses community detection in heterophilic graphs, where similar nodes, or nodes belonging to the same community, are typically far apart, while conventional graph neural networks (GNNs) rely on inherently local message passing and struggle to capture such long-range dependencies. The proposed Unsupervised Antisymmetric Graph Neural Network (uAGNN) leverages non-dissipative dynamical systems to ensure stability and propagate long-range information effectively: antisymmetric weight matrices let it capture both local and global graph structure, markedly improving community detection in high- and medium-heterophily settings.

Link: https://arxiv.org/abs/2508.14097
Authors: William Leeney, Alessio Gravina, Davide Bacciu
Affiliation: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at IJCNN 2025

Click to view abstract

Abstract:Community detection in graphs aims to cluster nodes into meaningful groups, a task particularly challenging in heterophilic graphs, where nodes sharing similarities and membership to the same community are typically distantly connected. This is particularly evident when this task is tackled by graph neural networks, since they rely on an inherently local message passing scheme to learn the node representations that serve to cluster nodes into communities. In this work, we argue that the ability to propagate long-range information during message passing is key to effectively perform community detection in heterophilic graphs. To this end, we introduce the Unsupervised Antisymmetric Graph Neural Network (uAGNN), a novel unsupervised community detection approach leveraging non-dissipative dynamical systems to ensure stability and to propagate long-range information effectively. By employing antisymmetric weight matrices, uAGNN captures both local and global graph structures, overcoming the limitations posed by heterophilic scenarios. Extensive experiments across ten datasets demonstrate uAGNN’s superior performance in high and medium heterophilic settings, where traditional methods fail to exploit long-range dependencies. These results highlight uAGNN’s potential as a powerful tool for unsupervised community detection in diverse graph environments.
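A minimal sketch of a non-dissipative, antisymmetric message-passing step in the spirit of antisymmetric deep graph networks (parameter names and the exact update are illustrative, not uAGNN's precise parameterization): using W - Wᵀ - γI keeps the update's Jacobian eigenvalues near the imaginary axis, so information is neither amplified nor lost over many steps:

```python
import numpy as np

def antisymmetric_step(H, A, W, V, eps=0.1, gamma=0.1):
    W_anti = W - W.T - gamma * np.eye(W.shape[0])   # antisymmetric (minus damping)
    return H + eps * np.tanh(H @ W_anti.T + A @ H @ V.T)

rng = np.random.default_rng(0)
n, d = 10, 8
H = rng.standard_normal((n, d))
A = rng.random((n, n)); A /= A.sum(1, keepdims=True)
W, V = rng.standard_normal((d, d)), rng.standard_normal((d, d))
for _ in range(100):          # many stable steps = long-range propagation
    H = antisymmetric_step(H, A, W, V)
print(np.isfinite(H).all(), H.shape)
```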
zh

[AI-57] Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

【Quick Read】: This paper studies how to select training data under a fixed annotation budget for aligning language models: given limited funds, should practitioners acquire easy, medium, hard, or random-difficulty examples? Using Group Relative Policy Optimization (GRPO) and per-example difficulty estimates obtained from multi-sample evaluation of the base model, the key finding is that training on the hardest examples yields the largest gains (up to 47%), because hard examples provide more learnable opportunities during GRPO training, substantially improving reasoning performance.

Link: https://arxiv.org/abs/2508.14094
Authors: Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate a critical question for resource-constrained alignment: under a fixed acquisition budget, should practitioners prioritize examples that are easy, medium, hard, or of random difficulty? We study Group Relative Policy Optimization (GRPO) fine-tuning across different model sizes and families, comparing four subset selection policies chosen from the same unlabeled pool using base-model difficulty estimates obtained via multi-sample evaluation. Our experiments reveal that training on the hardest examples yields the largest performance gains, up to 47%, while training on easy examples yields the smallest gains. Analysis reveals that this effect arises from harder examples providing more learnable opportunities during GRPO training. These findings provide practical guidance for budget-constrained post-training: prioritizing hard examples yields substantial performance gains on reasoning tasks when using GRPO.
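The multi-sample difficulty estimate reduces to measuring the base model's empirical failure rate per example, then keeping the hardest ones under the budget. A sketch with hypothetical `model` and `verifier` callables (not any specific API):

```python
import random

def estimate_difficulty(model, prompts, verifier, n_samples=8):
    """Difficulty = empirical failure rate over n_samples attempts."""
    scored = []
    for p in prompts:
        passes = sum(verifier(p, model(p)) for _ in range(n_samples))
        scored.append((1.0 - passes / n_samples, p))
    scored.sort(reverse=True)          # hardest first
    return scored

# toy stand-ins: prompts are difficulty thresholds in [0, 1]
model = lambda p: random.random()
verifier = lambda p, out: out > p
pool = [round(random.random(), 2) for _ in range(20)]
budget = 5
hardest = [p for _, p in estimate_difficulty(model, pool, verifier)[:budget]]
print(hardest)   # the examples to annotate for GRPO training
```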
zh

[AI-58] Logical Expressivity and Explanations for Monotonic GNNs with Scoring Functions KR2025

【Quick Read】: This paper addresses the lack of explainability of graph neural networks (GNNs) used for link prediction over knowledge graphs (KGs). Prior work extracts Datalog rules from GNNs with provable correspondence guarantees, but only for a restricted, low-expressivity graph encoding/decoding scheme. This paper considers the more general and widely used setting where a scoring function decodes the GNN output into fact predictions: it shows how to make both the GNN and the scoring function monotonic, uses that monotonicity to extract sound rules that explain predictions, and leverages existing results on the rule classes scoring functions can capture. It further defines procedures for obtaining equivalent Datalog programs for certain classes of monotonic GNNs with scoring functions, while maintaining strong empirical link-prediction performance.

Link: https://arxiv.org/abs/2508.14091
Authors: Matthew Morris, David J. Tena Cucala, Bernardo Cuenca Grau
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Full version (with appendices) of paper accepted to KR 2025 (22nd International Conference on Principles of Knowledge Representation and Reasoning)

Click to view abstract

Abstract:Graph neural networks (GNNs) are often used for the task of link prediction: predicting missing binary facts in knowledge graphs (KGs). To address the lack of explainability of GNNs on KGs, recent works extract Datalog rules from GNNs with provable correspondence guarantees. The extracted rules can be used to explain the GNN’s predictions; furthermore, they can help characterise the expressive power of various GNN models. However, these works address only a form of link prediction based on a restricted, low-expressivity graph encoding/decoding method. In this paper, we consider a more general and popular approach for link prediction where a scoring function is used to decode the GNN output into fact predictions. We show how GNNs and scoring functions can be adapted to be monotonic, use the monotonicity to extract sound rules for explaining predictions, and leverage existing results about the kind of rules that scoring functions can capture. We also define procedures for obtaining equivalent Datalog programs for certain classes of monotonic GNNs with scoring functions. Our experiments show that, on link prediction benchmarks, monotonic GNNs and scoring functions perform well in practice and yield many sound rules.
zh

[AI-59] CoBAD: Modeling Collective Behaviors for Human Mobility Anomaly Detection

【Quick Read】: This paper addresses collective anomaly detection in human mobility, i.e., identifying spatiotemporally inconsistent group behaviors across multiple individuals, in contrast to traditional methods that model only individual movement patterns. The key contribution, CoBAD, formulates the problem as unsupervised learning over Collective Event Sequences (CES) with a co-occurrence event graph, and uses a two-stage attention mechanism to model both individual mobility patterns and cross-individual interactions. Pre-trained at scale via masked event and link reconstruction tasks, CoBAD detects two types of collective anomalies, unexpected co-occurrence anomalies and previously overlooked absence anomalies, outperforming existing baselines by 13%-18% in AUCROC and 19%-70% in AUCPR.

Link: https://arxiv.org/abs/2508.14088
Authors: Haomin Wen, Shurui Cao, Leman Akoglu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Detecting anomalies in human mobility is essential for applications such as public safety and urban planning. While traditional anomaly detection methods primarily focus on individual movement patterns (e.g., a child should stay at home at night), collective anomaly detection aims to identify irregularities in collective mobility behaviors across individuals (e.g., a child is at home alone while the parents are elsewhere) and remains an underexplored challenge. Unlike individual anomalies, collective anomalies require modeling spatiotemporal dependencies between individuals, introducing additional complexity. To address this gap, we propose CoBAD, a novel model designed to capture Collective Behaviors for human mobility Anomaly Detection. We first formulate the problem as unsupervised learning over Collective Event Sequences (CES) with a co-occurrence event graph, where CES represents the event sequences of related individuals. CoBAD then employs a two-stage attention mechanism to model both the individual mobility patterns and the interactions across multiple individuals. Pre-trained on large-scale collective behavior data through masked event and link reconstruction tasks, CoBAD is able to detect two types of collective anomalies: unexpected co-occurrence anomalies and absence anomalies, the latter of which has been largely overlooked in prior work. Extensive experiments on large-scale mobility datasets demonstrate that CoBAD significantly outperforms existing anomaly detection baselines, achieving an improvement of 13%-18% in AUCROC and 19%-70% in AUCPR. All source code is available at this https URL.
zh

[AI-60] FM4NPP: A Scaling Foundation Model for Nuclear and Particle Physics

【Quick Read】: This paper asks whether a scientific foundation model (FM) can scale and generalize in experimental particle physics, where detector data are sparse and spatially distributed, differing dramatically from natural language. The key contributions are a novel self-supervised training method for detector data, a new dataset of more than 11 million particle collision events with downstream tasks and labeled data for evaluation, and models scaling up to 188 million parameters. With frozen weights and task-specific adapters, the FM consistently outperforms baselines across all downstream tasks and adapts data-efficiently; further analysis shows the extracted representations are task-agnostic yet can be specialized for different downstream tasks via a single linear mapping.

Link: https://arxiv.org/abs/2508.14087
Authors: David Park, Shuhang Li, Yi Huang, Xihaier Luo, Haiwang Yu, Yeonju Go, Christopher Pinkenburg, Yuewei Lin, Shinjae Yoo, Joseph Osborn, Jin Huang, Yihui Ren
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
Comments:

Click to view abstract

Abstract:Large language models have revolutionized artificial intelligence by enabling large, generalizable models trained through self-supervision. This paradigm has inspired the development of scientific foundation models (FMs). However, applying this capability to experimental particle physics is challenging due to the sparse, spatially distributed nature of detector data, which differs dramatically from natural language. This work addresses whether an FM for particle physics can scale and generalize across diverse tasks. We introduce a new dataset with more than 11 million particle collision events and a suite of downstream tasks and labeled data for evaluation. We propose a novel self-supervised training method for detector data and demonstrate its neural scalability with models that feature up to 188 million parameters. With frozen weights and task-specific adapters, this FM consistently outperforms baseline models across all downstream tasks. The performance also exhibits robust data-efficient adaptation. Further analysis reveals that the representations extracted by the FM are task-agnostic but can be specialized via a single linear mapping for different downstream tasks.
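The "frozen weights plus a single linear mapping" protocol is essentially linear probing. A minimal sketch with made-up shapes and an sklearn probe standing in for the paper's task-specific adapters:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 256))                  # frozen FM representations
labels = (feats[:, :3].sum(axis=1) > 0).astype(int)       # toy downstream task

# FM weights never change; only this linear map is fit per task.
probe = LogisticRegression(max_iter=1000).fit(feats[:800], labels[:800])
print("probe accuracy:", probe.score(feats[800:], labels[800:]))
```

That a single linear layer suffices is what supports the claim that the FM's representations are task-agnostic.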
zh

[AI-61] GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values

【Quick Read】: This paper addresses accurate crowd-flow inference at urban Points of Interest (POIs), which is hampered by scarce labeled data, intricate spatio-temporal dependencies among POIs, and the many correlations between precise crowd flow and GPS reports. The key idea is to recast crowd-flow inference as self-supervised attributed-graph representation learning via a novel contrastive self-learning framework for spatio-temporal data. The method first builds a spatial adjacency graph from POIs and their pairwise distances, then mines large volumes of unlabeled spatio-temporal data with contrastive learning, using a swapped prediction approach to anticipate the representation of a target subgraph from similar instances; after pre-training, the model is fine-tuned with a small amount of accurate crowd-flow data, significantly improving generalization under low-quality data.

Link: https://arxiv.org/abs/2508.14083
Authors: Songyu Ke, Chenyu Wu, Yuxuan Liang, Xiuwen Yi, Yanping Sun, Junbo Zhang, Yu Zheng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, incomplete version

Click to view abstract

Abstract:Accurate acquisition of crowd flow at Points of Interest (POIs) is pivotal for effective traffic management, public service, and urban planning. Despite this importance, due to the limitations of urban sensing techniques, the data quality from most sources is inadequate for monitoring crowd flow at each POI. This renders the inference of accurate crowd flow from low-quality data a critical and challenging task. The complexity is heightened by three key factors: 1) the scarcity and rarity of labeled data, 2) the intricate spatio-temporal dependencies among POIs, and 3) the myriad correlations between precise crowd flow and GPS reports. To address these challenges, we recast the crowd flow inference problem as a self-supervised attributed graph representation learning task and introduce a novel Contrastive Self-learning framework for Spatio-Temporal data (GeoMAE). Our approach initiates with the construction of a spatial adjacency graph founded on the POIs and their respective distances. We then employ a contrastive learning technique to exploit large volumes of unlabeled spatio-temporal data. We adopt a swapped prediction approach to anticipate the representation of the target subgraph from similar instances. Following the pre-training phase, the model is fine-tuned with accurate crowd flow data. Our experiments, conducted on two real-world datasets, demonstrate that GeoMAE pre-trained on extensive noisy data consistently outperforms models trained from scratch.
zh

[AI-62] Label Smoothing is a Prag matic Information Bottleneck

【Quick Read】: This paper revisits the theoretical interpretation of label smoothing, specifically its connection to the information bottleneck (IB) principle. Label smoothing is conventionally understood as softening target distributions to improve generalization, but its underlying mechanism has lacked rigorous grounding. The key contribution: assuming sufficient model flexibility and no conflicting labels for the same input, the paper shows theoretically and experimentally that the model output obtained through label smoothing explores the optimal solution of the information bottleneck. Label smoothing can therefore be interpreted as a pragmatic IB method, simple to implement while making the model insensitive to factors that carry no information about the target (or none beyond what another variable already provides).

Link: https://arxiv.org/abs/2508.14077
Authors: Sota Kudo
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures, published in Transactions on Machine Learning Research (TMLR), 2025

Click to view abstract

Abstract:This study revisits label smoothing via a form of information bottleneck. Under the assumption of sufficient model flexibility and no conflicting labels for the same input, we theoretically and experimentally demonstrate that the model output obtained through label smoothing explores the optimal solution of the information bottleneck. Based on this, label smoothing can be interpreted as a practical approach to the information bottleneck, enabling simple implementation. As an information bottleneck method, we experimentally show that label smoothing also exhibits the property of being insensitive to factors that do not contain information about the target, or to factors that provide no additional information about it when conditioned on another variable.
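For reference, the standard label-smoothing construction the paper analyzes mixes the one-hot target with the uniform distribution, y_ls = (1 - α)·y + α/K. A minimal sketch:

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """y_ls = (1 - alpha) * y + alpha / K, for K classes."""
    K = y_onehot.shape[-1]
    return (1.0 - alpha) * y_onehot + alpha / K

y = np.eye(5)[2]                     # one-hot target for class 2 of 5
print(smooth_labels(y, alpha=0.1))   # [0.02 0.02 0.92 0.02 0.02]
```

Under the paper's reading, training the model toward y_ls (rather than the hard target y) is what steers its output toward an IB-optimal solution.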
zh

[AI-63] PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning

【Quick Read】: This paper addresses the inability of existing reward models (RMs) to capture nuanced, user-specific preferences in personalized alignment, especially under limited data and across diverse domains. The key contribution is PersRM-R1, a reasoning-based reward modeling framework that identifies and represents personal factors from only one or a few exemplars per user, combining synthetic data generation with a two-stage training pipeline (supervised fine-tuning followed by reinforcement fine-tuning). It outperforms similarly sized models and matches much larger ones in accuracy and generalizability, enabling more effective personalized alignment of large language models (LLMs).

Link: https://arxiv.org/abs/2508.14076
Authors: Mengdi Li, Guanqiao Chen, Xufeng Zhao, Haochen Wen, Shu Yang, Di Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. Thus, we introduce PersRM-R1, the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. To address challenges including limited data availability and the requirement for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
zh

[AI-64] Explainable Graph Spectral Clustering For Text Embeddings

【Quick Read】: This paper addresses the explainability of Graph Spectral Clustering results for text documents, building on earlier work in which document similarity was computed as cosine similarity in term vector space. The key contribution is generalizing that explainability approach to other document embeddings, in particular representations based on the GloVe (Global Vectors for Word Representation) idea, enabling more effective and more broadly applicable explanations of semantically informed clustering results.

Link: https://arxiv.org/abs/2508.14075
Authors: Mieczysław A. Kłopotek, Sławomir T. Wierzchoń, Bartłomiej Starosta, Piotr Borkowski, Dariusz Czerski, Eryk Laskowski
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 47 pages, 19 tables, 11 figures

Click to view abstract

Abstract:In a previous paper, we proposed an introduction to the explainability of Graph Spectral Clustering results for textual documents, given that document similarity is computed as cosine similarity in term vector space. In this paper, we generalize this idea by considering other embeddings of documents, in particular, based on the GloVe embedding idea.
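A common way to move from term vectors to GloVe-based document embeddings, sketched below with a toy lookup table standing in for real pretrained GloVe vectors, is to average the word vectors and compare documents by cosine similarity (the similarity that would feed a spectral-clustering affinity matrix):

```python
import numpy as np

def doc_embedding(tokens, glove, dim):
    """Average of GloVe word vectors; `glove` is a {word: vector} lookup."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

glove = {"graph": np.array([1.0, 0, 0, 0]),
         "cluster": np.array([0.8, 0.6, 0, 0]),
         "music": np.array([0, 0, 1.0, 0])}
d1 = doc_embedding("graph cluster".split(), glove, dim=4)
d2 = doc_embedding("music".split(), glove, dim=4)
print(cosine(d1, d2))
```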
zh

[AI-65] GEPD: GAN-Enhanced Generalizable Model for EEG-Based Detection of Parkinson's Disease

【Quick Read】: This paper addresses the poor cross-dataset generalization of electroencephalography (EEG)-based Parkinson's disease (PD) classification, caused by differing detection protocols across EEG datasets and the small size of each dataset. The proposed GAN-enhanced generalizable model (GEPD) has three key parts: 1) a generative network that fuses multi-source EEG data by controlling the distribution similarity between generated and real data; 2) an EEG signal quality assessment model that safeguards the quality of the generated data; and 3) a classification network combining multiple convolutional neural networks to capture the time-frequency characteristics of EEG signals while keeping a generalizable, easily extensible structure. In cross-dataset evaluation the model performs comparably to the state of the art, with 84.3% accuracy and an 84.0% F1-score, demonstrating its generalizability.

Link: https://arxiv.org/abs/2508.14074
Authors: Qian Zhang, Ruilin Zhang, Biaokai Zhu, Xun Han, Jun Xiao, Yifan Liu, Zhe Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by the International Conference on Intelligent Computing (ICIC 2025)

Click to view abstract

Abstract:Electroencephalography has been established as an effective method for detecting Parkinson's disease, typically diagnosed clinically. Existing Parkinson's disease detection methods have shown significant success within individual datasets; however, the variability in detection methods across different EEG datasets and the small size of each dataset pose challenges for training a generalizable model for cross-dataset scenarios. To address these issues, this paper proposes a GAN-enhanced generalizable model, named GEPD, specifically for EEG-based cross-dataset classification of Parkinson's disease. First, we design a generative network that creates fusion EEG data by controlling the distribution similarity between generated data and real data. In addition, an EEG signal quality assessment model is designed to ensure the quality of the generated data. Finally, we design a classification network that utilizes a combination of multiple convolutional neural networks to effectively capture the time-frequency characteristics of EEG signals, while maintaining a generalizable structure and ensuring easy extension. This work is dedicated to utilizing intelligent methods to study pathological manifestations, aiming to facilitate the diagnosis and monitoring of neurological disorders. The evaluation results demonstrate that our model performs comparably to state-of-the-art models in cross-dataset settings, achieving an accuracy of 84.3% and an F1-score of 84.0%, showcasing the generalizability of the proposed model.
zh

[AI-66] MCLPD: Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets ECAI2025

【Quick Read】: This paper addresses the scarcity of labeled EEG data for early Parkinson's disease (PD) detection and the large cross-dataset discrepancies (acquisition protocols, subject demographics) that undermine robustness and generalization in cross-dataset detection. The proposed semi-supervised framework, MCLPD, combines multi-view contrastive pre-training with lightweight supervised fine-tuning: during pre-training it applies dual augmentations in both time and frequency domains to the unlabeled UNM data to build contrastive pairs, enriching the data and naturally fusing time-frequency information; fine-tuning then uses only a small fraction of labeled data from two other datasets (UI and UC). MCLPD reaches F1 scores of 0.91 on UI and 0.81 on UC with just 1% labeled data, rising to 0.97 and 0.87 with 5%, sharply reducing label dependence while improving cross-dataset generalization.

Link: https://arxiv.org/abs/2508.14073
Authors: Qian Zhang, Ruilin Zhang, Jun Xiao, Yifan Liu, Zhe Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by the European Conference on Artificial Intelligence (ECAI 2025)

Click to view abstract

Abstract:Electroencephalography has been validated as an effective technique for detecting Parkinson's disease, particularly in its early stages. However, the high cost of EEG data annotation often results in limited dataset size, and considerable discrepancies across datasets, including differences in acquisition protocols and subject demographics, significantly hinder the robustness and generalizability of models in cross-dataset detection scenarios. To address such challenges, this paper proposes a semi-supervised learning framework named MCLPD, which integrates multi-view contrastive pre-training with lightweight supervised fine-tuning to enhance cross-dataset PD detection performance. During pre-training, MCLPD uses self-supervised learning on the unlabeled UNM dataset. To build contrastive pairs, it applies dual augmentations in both time and frequency domains, which enrich the data and naturally fuse time-frequency information. In the fine-tuning phase, only a small proportion of labeled data from another two datasets (UI and UC) is used for supervised fine-tuning. Experimental results show that MCLPD achieves F1 scores of 0.91 on UI and 0.81 on UC using only 1% of labeled data, which further improve to 0.97 and 0.87, respectively, when 5% of labeled data is used. Compared to existing methods, MCLPD substantially improves cross-dataset generalization while reducing the dependency on labeled data, demonstrating the effectiveness of the proposed framework.
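The dual time/frequency augmentation idea can be sketched as below. The specific augmentation families (jitter plus channel scaling in time; random band masking of FFT bins in frequency) are common illustrative choices, not necessarily MCLPD's exact set:

```python
import numpy as np

def time_aug(x, rng, sigma=0.05):
    """Time-domain view: additive jitter plus random per-channel scaling."""
    scale = rng.uniform(0.9, 1.1, size=(x.shape[0], 1))
    return scale * x + sigma * rng.standard_normal(x.shape)

def freq_aug(x, rng, mask_frac=0.1):
    """Frequency-domain view: zero out a random band of FFT bins."""
    X = np.fft.rfft(x, axis=-1)
    n = X.shape[-1]
    w = max(1, int(mask_frac * n))
    start = rng.integers(0, n - w)
    X[..., start:start + w] = 0
    return np.fft.irfft(X, n=x.shape[-1], axis=-1)

rng = np.random.default_rng(0)
eeg = rng.standard_normal((32, 512))            # channels x samples, one segment
view_t, view_f = time_aug(eeg, rng), freq_aug(eeg, rng)
print(view_t.shape, view_f.shape)               # a contrastive pair of views
```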
zh

[AI-67] Edge-Selector Model Applied for Local Search Neighborhood for Solving Vehicle Routing Problems

【Quick Read】: This paper tackles the Vehicle Routing Problem (VRP), in particular finding high-quality solutions efficiently on large instances. The key contribution is a hybrid of machine learning and metaheuristics centered on an edge solution selector model that classifies solution edges to identify prohibited moves during local search, thereby guiding the search within metaheuristic baselines. Two learning mechanisms realize the selector: a tabular binary classifier built on Gradient Boosting Trees and a feedforward neural network (with decision-threshold adjustment to handle class imbalance), and a Graph Neural Network (GNN) that exploits graph structure to predict solution edges directly. Plugged into state-of-the-art metaheuristic baselines, the approach demonstrates scalability and generalizability, improving performance across problem sizes and variants including the Capacitated VRP (CVRP) and CVRP with Time Windows (CVRPTW).

Link: https://arxiv.org/abs/2508.14071
Authors: Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, Daniele Vigo
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 12 figures

Click to view abstract

Abstract:This research proposes a hybrid Machine Learning and metaheuristic mechanism that is designed to solve Vehicle Routing Problems (VRPs). The main component of our method is an edge solution selector model, which classifies solution edges to identify prohibited moves during the local search, hence guiding the search process within metaheuristic baselines. Two learning-based mechanisms are used to develop the edge selector: a simple tabular binary classifier and a Graph Neural Network (GNN). The tabular classifier employs Gradient Boosting Trees and Feedforward Neural Network as the baseline algorithms. Adjustments to the decision threshold are also applied to handle the class imbalance in the problem instance. An alternative mechanism employs the GNN to utilize graph structure for direct solution edge prediction, with the objective of guiding local search by predicting prohibited moves. These hybrid mechanisms are then applied in state-of-the-art metaheuristic baselines. Our method demonstrates both scalability and generalizability, achieving performance improvements across different baseline metaheuristics, various problem sizes and variants, including the Capacitated Vehicle Routing Problem (CVRP) and CVRP with Time Windows (CVRPTW). Experimental evaluations on benchmark datasets up to 30,000 customer nodes, supported by pair-wise statistical analysis, verify the observed improvements.
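A minimal sketch of the tabular edge-selector variant, with made-up edge features and a toy label (the three features below are illustrative, not the paper's exact set): edges scored below a tuned threshold are treated as prohibited moves during local search:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((2000, 3))   # e.g., edge length, detour cost, demand ratio
y = (X[:, 0] + 0.5 * X[:, 1] < 0.8).astype(int)   # toy "edge kept in solution" label

clf = GradientBoostingClassifier().fit(X, y)
threshold = 0.3             # lowered from 0.5 to counter class imbalance
edge_feats = rng.random((5, 3))
prohibited = clf.predict_proba(edge_feats)[:, 1] < threshold
print(prohibited)           # moves touching these edges get skipped
```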
zh

[AI-68] Special-Character Adversarial Attacks on Open-Source Language Model

【Quick Read】: This paper addresses the security vulnerabilities that character-level adversarial manipulations pose to large language models (LLMs) in real-world deployments. The key idea is to identify and defend against these subtle but misleading input perturbations, improving the robustness and safety of LLMs in practical settings.

Link: https://arxiv.org/abs/2508.14070
Authors: Ephraiem Sarabamoun
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments.
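To make "character-level adversarial manipulations" concrete, here is a sketch of the kind of perturbation typically studied: visually inconspicuous special characters such as zero-width spaces and homoglyph substitutions. The specific characters are common examples, not an exhaustive or paper-exact list:

```python
ZERO_WIDTH = "\u200b"   # zero-width space, invisible when rendered
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes

def perturb(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        out.append(HOMOGLYPHS.get(ch, ch))   # swap in look-alike characters
        if i % 4 == 3:
            out.append(ZERO_WIDTH)           # sprinkle invisible separators
    return "".join(out)

original = "please summarize this document"
adv = perturb(original)
print(adv)
print(len(original), "->", len(adv), "characters (looks identical on screen)")
```

The perturbed string reads the same to a human but tokenizes very differently, which is what makes such inputs a deployment-time hazard.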
zh

[AI-69] Revisit Choice Network for Synthesis and Technology Mapping

【Quick Read】: This paper addresses structural-bias bottlenecks in Boolean optimization, equivalence checking, and technology mapping: existing lossless-synthesis methods build choice networks while neglecting the quality of the choices, limiting their value for technology mapping. The key contribution, Cristal, is a novel methodology and framework for constructing Boolean choice networks in three steps: 1) representative logic-cone search to locate critical subcircuits; 2) structural mutation via equality saturation to generate diverse choice structures; and 3) priority-ranked choice selection together with choice-network construction and validation, yielding fewer but higher-quality choice nodes. Experiments show Cristal outperforms the state-of-the-art Boolean choice-network construction implemented in ABC, with average area/delay reductions in both delay- and area-oriented modes and a 63.77% runtime reduction on large-scale cases across diverse combinational benchmark circuits.

Link: https://arxiv.org/abs/2508.14068
Authors: Chen Chen, Jiaqi Yin, Cunxi Yu
Affiliation: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCAD 2025

Click to view abstract

Abstract:Choice network construction is a critical technique for alleviating structural bias issues in Boolean optimization, equivalence checking, and technology mapping. Previous works on lossless synthesis utilize independent optimization to generate multiple snapshots, and use simulation and SAT solvers to identify functionally equivalent nodes. These nodes are then merged into a subject graph with choice nodes. However, such methods often neglect the quality of these choices, raising the question of whether they truly contribute to effective technology mapping. This paper introduces Cristal, a novel methodology and framework for constructing Boolean choice networks. Specifically, Cristal introduces a new flow of choice network-based synthesis and mapping, including representative logic cone search, structural mutation for generating diverse choice structures via equality saturation, and priority-ranking choice selection along with choice network construction and validation. Through these techniques, Cristal constructs fewer but higher-quality choices. Our experimental results demonstrate that Cristal outperforms the state-of-the-art Boolean choice network construction implemented in ABC in the post-mapping stage, achieving average reductions of 3.85%/8.35% (area/delay) in delay-oriented mode, 0.11%/2.74% in area-oriented mode, and a 63.77% runtime reduction on large-scale cases across a diverse set of combinational circuits from the IWLS 2005, ISCAS'89, and EPFL benchmark suites.
zh

[AI-70] Retrieval-Augmented Generation in Industry: An Interview Study on Use Cases, Requirements, Challenges, and Evaluation

【Quick Read】: This paper addresses the shortage of research on how Retrieval-Augmented Generation (RAG) is actually applied in industry, where real-world deployments, requirements, and challenges remain poorly understood. Through semi-structured interviews with 13 industry practitioners, it maps current use cases, consolidates system requirements, distills practical challenges and lessons learned, and reviews evaluation practice. The main findings: RAG adoption is still largely at the prototype stage and concentrated on domain-specific QA tasks; data protection, security, and quality dominate the requirements, while ethics, bias, and scalability receive less attention; data preprocessing remains a key challenge; and evaluation is mostly performed by humans rather than automated methods.

Link: https://arxiv.org/abs/2508.14066
Authors: Lorenz Brehme, Benedikt Dornauer, Thomas Ströhle, Maximilian Ehrhart, Ruth Breu
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: This preprint was accepted for presentation at the 17th International Conference on Knowledge Discovery and Information Retrieval (KDIR25)

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) is a well-established and rapidly evolving field within AI that enhances the outputs of large language models by integrating relevant information retrieved from external knowledge sources. While industry adoption of RAG is now beginning, there is a significant lack of research on its practical application in industrial contexts. To address this gap, we conducted a semistructured interview study with 13 industry practitioners to explore the current state of RAG adoption in real-world settings. Our study investigates how companies apply RAG in practice, providing (1) an overview of industry use cases, (2) a consolidated list of system requirements, (3) key challenges and lessons learned from practical experiences, and (4) an analysis of current industry evaluation methods. Our main findings show that current RAG applications are mostly limited to domain-specific QA tasks, with systems still in prototype stages; industry requirements focus primarily on data protection, security, and quality, while issues such as ethics, bias, and scalability receive less attention; data preprocessing remains a key challenge, and system evaluation is predominantly conducted by humans rather than automated methods.
zh

[AI-71] An automatic patent literature retrieval system based on LLM-RAG

【Quick Read】: This paper addresses the inefficiency and poor relevance of traditional patent retrieval when faced with complex query intents and semantic associations that span technical domains, which undermines intellectual-property management and enterprise R&D decisions. The key contribution is an automated patent retrieval framework integrating large language models (LLMs) with Retrieval-Augmented Generation (RAG), built from three modules: 1) standardized preprocessing of patent data; 2) a high-efficiency vector retrieval engine over LLM-generated embeddings; and 3) a RAG-enhanced query module combining external document retrieval with context-aware response generation. Evaluated on a Google Patents testbed of 102... rather, on the Google Patents dataset, the framework reaches 80.5% semantic matching accuracy and 92.1% recall, clearly outperforming baseline LLM methods and generalizing well to cross-domain classification and semantic clustering tasks.

Link: https://arxiv.org/abs/2508.14064
Authors: Yao Ding, Yuqing Wu, Ziyang Ding
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:With the acceleration of technological innovation, efficient retrieval and classification of patent literature have become essential for intellectual property management and enterprise R&D. Traditional keyword- and rule-based retrieval methods often fail to address complex query intents or capture semantic associations across technical domains, resulting in incomplete and low-relevance results. This study presents an automated patent retrieval framework integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) technology. The system comprises three components: 1) a preprocessing module for patent data standardization, 2) a high-efficiency vector retrieval engine leveraging LLM-generated embeddings, and 3) a RAG-enhanced query module that combines external document retrieval with context-aware response generation. Evaluations were conducted on the Google Patents dataset (2006-2024), containing millions of global patent records with metadata such as filing date, domain, and status. The proposed gpt-3.5-turbo-0125 RAG configuration achieved 80.5% semantic matching accuracy and 92.1% recall, surpassing baseline LLM methods by 28 percentage points. The framework also demonstrated strong generalization in cross-domain classification and semantic clustering tasks. These results validate the effectiveness of LLM-RAG integration for intelligent patent retrieval, providing a foundation for next-generation AI-driven intellectual property analysis platforms.
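The vector-retrieval core of such a system reduces to cosine similarity between a query embedding and precomputed patent embeddings. A sketch with random vectors standing in for LLM-generated embeddings:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Cosine-similarity retrieval: returns (index, score) pairs."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q
    idx = np.argsort(-sims)[:k]
    return list(zip(idx.tolist(), sims[idx].round(3).tolist()))

rng = np.random.default_rng(0)
patents = rng.standard_normal((10000, 384))   # 10k patent abstracts, 384-d embeddings
query = rng.standard_normal(384)              # embedded user query
print(top_k(query, patents))                  # retrieved context for the RAG step
```

The retrieved patents are then packed into the prompt so the LLM can generate a context-aware answer, which is the RAG-enhanced query step.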
zh

[AI-72] A Multi-Agent Approach to Neurological Clinical Reasoning

【Quick Read】: This paper addresses the limited and unstable complex-reasoning ability of large language models (LLMs) in neurology, where even retrieval-augmented generation (RAG) offers only modest help on tasks requiring multi-concept integration and deep clinical reasoning. The key contribution is a structured multi-agent framework that decomposes neurological reasoning into four specialized cognitive modules: question analysis, knowledge retrieval, answer synthesis, and validation, emulating an expert's step-by-step decision-making. On a 305-question benchmark built from Israeli neurology board exams, the agentic system lifts LLaMA 3.3-70B from 69.5% to 89.2% accuracy, with the largest gains on the most complex (Level 3) questions, turning uneven subspecialty performance into uniform excellence; results are further validated on 155 neurological cases from MedQA.

Link: https://arxiv.org/abs/2508.14063
Authors: Moran Sorka, Alon Gorenshtein, Dvir Aran, Shahar Shelly
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have shown promise in medical domains, but their ability to handle specialized neurological reasoning requires systematic evaluation. We developed a comprehensive benchmark using 305 questions from Israeli Board Certification Exams in Neurology, classified along three complexity dimensions: factual knowledge depth, clinical concept integration, and reasoning complexity. We evaluated ten LLMs using base models, retrieval-augmented generation (RAG), and a novel multi-agent system. Results showed significant performance variation. OpenAI-o1 achieved the highest base performance (90.9% accuracy), while specialized medical models performed poorly (52.9% for Meditron-70B). RAG provided modest benefits but limited effectiveness on complex reasoning questions. In contrast, our multi-agent framework, decomposing neurological reasoning into specialized cognitive functions including question analysis, knowledge retrieval, answer synthesis, and validation, achieved dramatic improvements, especially for mid-range models. The LLaMA 3.3-70B-based agentic system reached 89.2% accuracy versus 69.5% for its base model, with substantial gains on level 3 complexity questions. The multi-agent approach transformed inconsistent subspecialty performance into uniform excellence, addressing neurological reasoning challenges that persisted with RAG enhancement. We validated our approach using an independent dataset of 155 neurological cases from MedQA. Results confirm that structured multi-agent approaches designed to emulate specialized cognitive processes significantly enhance complex medical reasoning, offering promising directions for AI assistance in challenging clinical contexts.
zh

[AI-73] Dual-Phase Playtime-guided Recommendation: Interest Intensity Exploration and Multimodal Random Walks ACM-MM

【Quick Read】: This paper addresses the difficulty of jointly achieving accuracy and diversity in video-game recommendation as game catalogs expand rapidly: existing models underuse playtime, a behavioral signal unique to gaming platforms, and overlook multimodal information as a source of diversity. The proposed DP2Rec is a dual-phase playtime-guided model. Phase one, a playtime-guided interest-intensity exploration module, separates strong from weak preferences via dual-beta modeling, enabling fine-grained user profiling and more accurate recommendations. Phase two, a playtime-guided multimodal random-walks module, simulates player exploration with transitions guided by both playtime-derived interest similarity and multimodal semantic similarity, preserving core interests while promoting cross-category discovery through latent semantic associations and adaptive category balancing, thereby optimizing accuracy and diversity together.

Link: https://arxiv.org/abs/2508.14058
Authors: Jingmao Zhang, Zhiting Zhao, Yunqi Lin, Jianghong Ma, Tianjun Wei, Haijun Zhang, Xiaofeng Zhang
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at ACM Multimedia (ACM MM) 2025. 10 pages, 5 figures. Code and dataset: this https URL

Click to view abstract

Abstract:The explosive growth of the video game industry has created an urgent need for recommendation systems that can scale with expanding catalogs and maintain user engagement. While prior work has explored accuracy and diversity in recommendations, existing models underutilize playtime, a rich behavioral signal unique to gaming platforms, and overlook the potential of multimodal information to enhance diversity. In this paper, we propose DP2Rec, a novel Dual-Phase Playtime-guided Recommendation model designed to jointly optimize accuracy and diversity. First, we introduce a playtime-guided interest intensity exploration module that separates strong and weak preferences via dual-beta modeling, enabling fine-grained user profiling and more accurate recommendations. Second, we present a playtime-guided multimodal random walks module that simulates player exploration using transitions guided by both playtime-derived interest similarity and multimodal semantic similarity. This mechanism preserves core preferences while promoting cross-category discovery through latent semantic associations and adaptive category balancing. Extensive experiments on a real-world game dataset show that DP2Rec outperforms existing methods in both recommendation accuracy and diversity.
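One way to picture "dual-beta modeling" of playtime: treat normalized playtimes as a mixture of a weak-interest Beta (mass near 0) and a strong-interest Beta (mass near 1), assigning each game to whichever component gives its playtime higher likelihood. Fixed parameters below replace the fitting procedure for brevity and are illustrative, not DP2Rec's:

```python
import numpy as np
from scipy import stats

weak = stats.beta(1.2, 5.0)     # most mass near 0: casual/weak engagement
strong = stats.beta(5.0, 1.5)   # most mass near 1: sustained/strong engagement

playtime = np.array([0.05, 0.12, 0.65, 0.92])   # per-game playtime scaled to [0, 1]
is_strong = strong.pdf(playtime) > weak.pdf(playtime)
print(is_strong)                # [False False  True  True]
```

The strong/weak split then feeds the fine-grained user profile that the random-walk phase builds on.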
zh

[AI-74] MAHL: Multi-Agent LLM -Guided Hierarchical Chiplet Design with Adaptive Debugging

【Quick Read】: This paper addresses the challenges LLMs face in chiplet design, including flat design structure, high validation cost, and imprecise parameter optimization, which limit their applicability to advanced 2.5D integration. The key contribution, MAHL, is a hierarchical LLM-based chiplet design generation framework whose six collaborating agents enable AI algorithm-to-hardware mapping through hierarchical description generation, retrieval-augmented code generation, diverse-flow-based validation, and multi-granularity design-space exploration. MAHL raises the Pass@5 generation accuracy of real-world chiplet designs from 0 to 0.72 in the best case compared with conventional LLMs, while achieving comparable or better Power, Performance, and Area (PPA) under certain optimization objectives.

Link: https://arxiv.org/abs/2508.14053
Authors: Jinwei Tang, Jiayin Qin, Nuo Xu, Pragnya Sudershan Nalla, Yu Cao, Yang (Katie) Zhao, Caiwen Ding
Affiliation: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:As program workloads (e.g., AI) increase in size and algorithmic complexity, the primary challenge lies in their high dimensionality, encompassing computing cores, array sizes, and memory hierarchies. To overcome these obstacles, innovative approaches are required. Agile chip design has already benefited from machine learning integration at various stages, including logic synthesis, placement, and routing. With Large Language Models (LLMs) recently demonstrating impressive proficiency in Hardware Description Language (HDL) generation, it is promising to extend their abilities to 2.5D integration, an advanced technique that saves area overhead and development costs. However, LLM-driven chiplet design faces challenges such as flatten design, high validation cost and imprecise parameter optimization, which limit its chiplet design capability. To address this, we propose MAHL, a hierarchical LLM-based chiplet design generation framework that features six agents which collaboratively enable AI algorithm-hardware mapping, including hierarchical description generation, retrieval-augmented code generation, diverse-flow-based validation, and multi-granularity design space exploration. These components together enhance the efficient generation of chiplet design with optimized Power, Performance and Area (PPA). Experiments show that MAHL not only significantly improves the generation accuracy of simple RTL design, but also increases the generation accuracy of real-world chiplet design, evaluated by Pass@5, from 0 to 0.72 compared to conventional LLMs under the best-case scenario. Compared to state-of-the-art CLARIE (expert-based), MAHL achieves comparable or even superior PPA results under certain optimization objectives.
zh

[AI-75] The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget

【Quick Read】: This paper studies how code formatting elements (such as indentation and newlines) affect LLM efficiency in code-generation tasks. These visual aids help human readers, but for LLMs, which process code as a linear token sequence, they only add input tokens, raising compute cost and latency. The key findings: removing formatting cuts input tokens by 24.5% on average with negligible performance change, and prompting or fine-tuning can further shrink output code length by up to 36.1% without hurting correctness. The authors also build a bidirectional code-transformation tool that slots into existing LLM inference pipelines, preserving human readability while gaining LLM efficiency.

Link: https://arxiv.org/abs/2508.13666
Authors: Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, Xiaoning Du
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by ICSE'26 (First Cycle)

Click to view abstract

Abstract:Source code is usually formatted with elements like indentation and newlines to improve readability for human developers. However, these visual aids do not seem to be beneficial for large language models (LLMs) in the same way since the code is processed as a linear sequence of tokens. Furthermore, these additional tokens can lead to increased computational costs and longer response times for LLMs. If such formatting elements are non-essential to LLMs, we can reduce such costs by removing them from the code. To figure out the role played by formatting elements, we conduct a comprehensive empirical study to evaluate the impact of code formatting on LLM performance and efficiency. Through large-scale experiments on Fill-in-the-Middle Code Completion tasks across four programming languages (Java, Python, C++, C#) and ten LLMs-including both commercial and open-source models-we systematically analyze token count and performance when formatting elements are removed. Key findings indicate that LLMs can maintain performance across formatted code and unformatted code, achieving an average input token reduction of 24.5% with negligible output token reductions. This makes code format removal a practical optimization strategy for improving LLM efficiency. Further exploration reveals that both prompting and fine-tuning LLMs can lead to significant reductions (up to 36.1%) in output code length without compromising correctness. To facilitate practical applications, we develop a bidirectional code transformation tool for format processing, which can be seamlessly integrated into existing LLM inference workflows, ensuring both human readability and LLM efficiency.
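The format-removal idea can be sketched naively as below. Character count stands in as a crude proxy for token count (a real measurement would use the model's own tokenizer), and note that a production tool must be syntax-aware: this whitespace-collapsing version is only safe for brace-delimited languages, not indentation-sensitive ones like Python:

```python
def strip_formatting(code: str) -> str:
    """Drop indentation and collapse newlines, keeping the code content."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    return " ".join(lines)

java = """public int add(int a, int b) {
        return a + b;
    }"""
compact = strip_formatting(java)
print(compact)
print(len(java), "->", len(compact), "characters")   # proxy for token savings
```

The paper's bidirectional tool would also implement the reverse direction, re-formatting compact model output for human readers.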
zh

[AI-76] Reliable generation of isomorphic physics problems using ChatGPT with prompt-chaining and tool use

【Quick Read】: This paper addresses the unstable quality, weak structural control, and missing automatic validation of existing LLM-based methods for generating isomorphic physics problems. The key idea is to combine prompt chaining with tool use in ChatGPT: chained prompts guide the model step by step to produce isomorphic problems with precisely controlled structural variation (numeric values, spatial relations) and diverse contextual variation, while the Python code interpreter supports automatic solution validation and simple diagram generation. This yields markedly more consistent, higher-quality outputs than simpler prompt-based approaches, giving instructors an efficient, scalable problem-generation tool that opens the door to personalized adaptive testing and automated content development.

Link: https://arxiv.org/abs/2508.14755
Authors: Zhongzhou Chen
Affiliation: Unknown
Subjects: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present a method for generating large numbers of isomorphic physics problems using ChatGPT through prompt chaining and tool use. This approach enables precise control over structural variations-such as numeric values and spatial relations-while supporting diverse contextual variations in the problem body. By utilizing the Python code interpreter, the method supports automatic solution validation and simple diagram generation, addressing key limitations in existing LLM-based methods. We generated two example isomorphic problem banks and compared the outcome against simpler prompt-based approaches. Results show that prompt-chaining produces significantly higher quality and more consistent outputs than simpler, non-chaining prompts. This work demonstrates a promising method for efficient problem creation accessible to the average instructor, which opens new possibilities for personalized adaptive testing and automated content development.
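
As a rough illustration of the pipeline shape (not the paper's code), the sketch below chains three prompts, covering structural variation, contextual rewrite, and solver generation, and validates each variant by executing the generated solution code. `call_llm` is a hypothetical placeholder for any chat-completion API:

```python
# Sketch of prompt chaining with tool-based validation for isomorphic
# problem generation. All prompts and names are illustrative assumptions.
import random

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: plug in any chat-completion API."""
    raise NotImplementedError

def generate_isomorphic_problem(template: str) -> str:
    # Step 1: pick controlled structural variations (numbers, geometry).
    params = {"mass_kg": random.choice([2, 3, 5]),
              "angle_deg": random.choice([30, 37, 45])}
    # Step 2: chained prompt - rewrite the context, keep the structure.
    problem = call_llm(
        "Rewrite this physics problem with a new real-world context, "
        f"keeping its structure and using these values: {params}.\n{template}")
    # Step 3: chained prompt - produce solver code, then run it (tool use)
    # so every generated variant is validated automatically. In practice
    # this should execute in a sandboxed interpreter.
    solver = call_llm(
        f"Write Python that sets a variable `answer` solving:\n{problem}")
    scope: dict = {}
    exec(solver, scope)
    if "answer" not in scope:
        raise ValueError("validation failed; regenerate this variant")
    return problem
```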

[AI-77] A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References

[Quick Read]: This paper examines the problems that arise when the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) is used as both evaluation metric and training objective in supervised speech separation with noisy references (as in the WSJ0-2Mix dataset): the noise bounds the achievable SI-SDR and can introduce undesired noise components into the separated outputs. The key to the solution is to enhance the reference signals and augment the mixtures with the WHAM! dataset, training models that avoid learning the noisy references. This improves robustness by raising reference quality, although experiments show that processing the references can introduce artefacts that limit overall quality gains.

Link: https://arxiv.org/abs/2508.14623
Authors: Simon Dahl Jepsen,Mads Græsbøll Christensen,Jesper Rindom Jensen
Institution: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted for IEEE ASRU 2025, Workshop on Automatic Speech Recognition and Understanding. Copyright © 2025 IEEE. 8 pages, 6 figures, 2 tables

Abstract:This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
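
For reference, SI-SDR has a compact closed form; the numpy sketch below follows the standard definition (my own rendering, not code from the paper). The closing comment notes why a noisy reference caps the achievable score:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-Invariant SDR in dB (standard definition, zero-mean signals)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling: project the estimate onto the reference.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    return 10 * np.log10(np.sum(target**2) / np.sum((estimate - target) ** 2))

# With a noisy reference the metric is capped: even the clean source itself
# only scores si_sdr(clean, clean + noise), which is the limit the paper derives.
```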

[AI-78] Synaptic bundle theory for spike-driven sensor-motor system: More than eight independent synaptic bundles collapse reward-STDP learning

[Quick Read]: The problem this paper tackles is that in brain-inspired sensor-motor systems driven by neuronal spike control signals, learning tends to collapse, making effective training impossible. The key to the solution is a system that can vary the number of independent synaptic bundles in the sensor-to-motor connections, together with a quantitative analysis of counter-directional weight updates that reveals the mechanisms behind learning failure and success. The study shows that learning collapses once the number of motor neurons or independent synaptic bundles exceeds a critical threshold, and that fewer motor neurons increase the probability of failure yet speed up learning when it succeeds, a phenomenon quantitatively explained by the number of weight updates that move opposite to the optimal weight. The method provides the key parameter range for building stable, learnable spike-driven systems, opening up the study of spike functions that were previously inaccessible due to the difficulty of learning.

Link: https://arxiv.org/abs/2508.14492
Authors: Takeshi Kobayashi,Shogo Yonekura,Yasuo Kuniyoshi
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 5 pages, 4 figures

Click to view abstract

Abstract:Neuronal spikes directly drive muscles and endow animals with agile movements, but applying the spike-based control signals to actuators in artificial sensor-motor systems inevitably causes a collapse of learning. We developed a system that can vary the number of independent synaptic bundles in sensor-to-motor connections. This paper demonstrates the following four findings: (i) Learning collapses once the number of motor neurons or the number of independent synaptic bundles exceeds a critical limit. (ii) The probability of learning failure is increased by a smaller number of motor neurons, while (iii) if learning succeeds, a smaller number of motor neurons leads to faster learning. (iv) The number of weight updates that move in the opposite direction of the optimal weight can quantitatively explain these results. The functions of spikes remain largely unknown. Identifying the parameter range in which learning systems using spikes can be constructed will make it possible to study the functions of spikes that were previously inaccessible due to the difficulty of learning.

[AI-79] New Insights into Automatic Treatment Planning for Cancer Radiotherapy Using Explainable Artificial Intelligence

[Quick Read]: This paper aims to uncover the opaque decision-making process of an AI agent for automatic treatment planning, specifically how the agent adjusts treatment planning parameters (TPPs) from dose-volume histogram (DVH) inputs during inverse planning for prostate cancer intensity modulated radiotherapy (IMRT). The key to the solution is applying an explainable AI (EXAI) attribution analysis to Actor-Critic with Experience Replay (ACER) agents taken from different training stages, revealing the mapping from DVH inputs to TPP-tuning actions, combined with a systematic evaluation of planning efficacy, efficiency, and policy-space entropy along the learning trajectory. The study finds that high-performing ACER agents progressively learn to identify dose-violation regions in the DVH and apply a global TPP-tuning strategy, reaching organ-wise similarity of 0.25-0.5 between attributions and dose-violation reductions while needing far fewer tuning steps (~12-13 vs. 22), suggesting that this approach can improve the interpretability and clinical trustworthiness of such agents.

Link: https://arxiv.org/abs/2508.14229
Authors: Md Mainul Abrar,Xun Jia,Yujie Chi
Institution: Unknown
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures, 1 table, Oral presentation at the conference 'American Association of Physicists in Medicine 2025, 67th Annual Meeting and Exhibition'

Click to view abstract

Abstract:Objective: This study aims to uncover the opaque decision-making process of an artificial intelligence (AI) agent for automatic treatment planning. Approach: We examined a previously developed AI agent based on the Actor-Critic with Experience Replay (ACER) network, which automatically tunes treatment planning parameters (TPPs) for inverse planning in prostate cancer intensity modulated radiotherapy. We selected multiple checkpoint ACER agents from different stages of training and applied an explainable AI (EXAI) method to analyze the attribution from dose-volume histogram (DVH) inputs to TPP-tuning decisions. We then assessed each agent's planning efficacy and efficiency and evaluated their policy and final TPP tuning spaces. Combining these analyses, we systematically examined how ACER agents generated high-quality treatment plans in response to different DVH inputs. Results: Attribution analysis revealed that ACER agents progressively learned to identify dose-violation regions from DVH inputs and promote appropriate TPP-tuning actions to mitigate them. Organ-wise similarities between DVH attributions and dose-violation reductions ranged from 0.25 to 0.5 across tested agents. Agents with stronger attribution-violation similarity required fewer tuning steps (~12-13 vs. 22), exhibited a more concentrated TPP-tuning space with lower entropy (~0.3 vs. 0.6), converged on adjusting only a few TPPs, and showed smaller discrepancies between practical and theoretical tuning steps. Putting together, these findings indicate that high-performing ACER agents can effectively identify dose violations from DVH inputs and employ a global tuning strategy to achieve high-quality treatment planning, much like skilled human planners. Significance: Better interpretability of the agent's decision-making process may enhance clinician trust and inspire new strategies for automatic treatment planning.

[AI-80] Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings

[Quick Read]: This paper tackles the difficulty of extracting speaker embeddings from short temporal contexts and the negative impact of overlapping speech on identity reassignment in tracking systems. The core challenge is that common speaker embedding extractors struggle with short segments and multi-speaker overlap, while extending the temporal context improves identification but introduces more tracking errors and latency. The key to the solution is a Knowledge Distillation (KD) based training approach for short-context speaker embedding extraction from two-speaker mixtures, combined with beamforming that exploits spatial information to reduce overlap, and an exploration of blockwise identity reassignment over fixed-size blocks toward a low-latency embedding-based tracking system. Experiments show that the distilled models are more robust for short-context extraction, though handling simultaneous speech effectively still requires further work.

Link: https://arxiv.org/abs/2508.14115
Authors: Taous Iatariene,Alexandre Guérin,Romain Serizel(MULTISPEECH)
Institution: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Speaker embeddings are promising identity-related features that can enhance the identity assignment performance of a tracking system by leveraging its spatial predictions, i.e., by performing identity reassignment. Common speaker embedding extractors usually struggle with short temporal contexts and overlapping speech, which imposes long-term identity reassignment to exploit longer temporal contexts. However, this increases the probability of tracking system errors, which in turn impacts negatively on identity reassignment. To address this, we propose a Knowledge Distillation (KD) based training approach for short context speaker embedding extraction from two speaker mixtures. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. We study the feasibility of performing identity reassignment over blocks of fixed size, i.e., blockwise identity reassignment, to go towards a low-latency speaker embedding based tracking system. Results demonstrate that our distilled models are effective at short-context embedding extraction and more robust to overlap. However, blockwise reassignment results indicate that further work is needed to handle simultaneous speech more effectively.
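
A minimal PyTorch sketch of the distillation idea: a small student embeds a short (beamformed) segment and is pulled toward a frozen teacher's embedding of a longer context. The tiny architecture and the cosine objective are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEmbedder(nn.Module):
    """Illustrative stand-in for a speaker-embedding extractor."""
    def __init__(self, n_mels: int = 40, dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_mels, 64, 5, padding=2),
                                  nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.proj = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mels, frames) -> (batch, dim)
        return self.proj(self.conv(x).squeeze(-1))

teacher = TinyEmbedder().eval()   # frozen long-context teacher
student = TinyEmbedder()          # short-context student being trained

def kd_loss(short_beamformed: torch.Tensor, long_context: torch.Tensor):
    with torch.no_grad():
        t = teacher(long_context)          # target identity embedding
    s = student(short_beamformed)          # short-context prediction
    return (1 - F.cosine_similarity(s, t, dim=-1)).mean()

# Toy batch: 50 frames of beamformed audio vs. 300 frames of clean context.
loss = kd_loss(torch.randn(8, 40, 50), torch.randn(8, 40, 300))
loss.backward()
```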

[AI-81] Surya: Foundation Model for Heliophysics

[Quick Read]: This paper addresses the limited generalization of models in heliophysics: existing models are mostly task-specific and constrained by scarce labeled data, making transfer across solar phenomena difficult. The key to the solution is Surya, the first heliophysics foundation model trained on full-resolution SDO data with time advancement as the pretext task, using a spatiotemporal transformer architecture with spectral gating and long-short range attention, further optimized through autoregressive rollout tuning. The model learns general-purpose solar representations from multi-instrument observations, forecasts solar dynamics and flare events zero-shot, and after LoRA fine-tuning performs strongly on downstream tasks such as solar wind forecasting, active region segmentation, flare prediction, and EUV spectra, indicating that it effectively learns the physics underlying solar evolution.

Link: https://arxiv.org/abs/2508.14112
Authors: Sujit Roy,Johannes Schmude,Rohit Lal,Vishal Gaur,Marcus Freitag,Julian Kuehnert,Theodore van Kessel,Dinesha V. Hegde,Andrés Muñoz-Jaramillo,Johannes Jakubik,Etienne Vos,Kshitiz Mandal,Ata Akbari Asanjan,Joao Lucas de Sousa Almeida,Amy Lin,Talwinder Singh,Kang Yang,Chetraj Pandey,Jinsu Hong,Berkay Aydin,Thorsten Kurth,Ryan McGranaghan,Spiridon Kasapis,Vishal Upendran,Shah Bahauddin,Daniel da Silva,Nikolai V. Pogorelov,Campbell Watson,Manil Maskey,Madhulika Guhathakurta,Juan Bernabe-Moreno,Rahul Ramachandran
Institution: Unknown
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Heliophysics is central to understanding and forecasting space weather events and solar activity. Despite decades of high-resolution observations from the Solar Dynamics Observatory (SDO), most models remain task-specific and constrained by scarce labeled data, limiting their capacity to generalize across solar phenomena. We introduce Surya, a 366M parameter foundation model for heliophysics designed to learn general-purpose solar representations from multi-instrument SDO observations, including eight Atmospheric Imaging Assembly (AIA) channels and five Helioseismic and Magnetic Imager (HMI) products. Surya employs a spatiotemporal transformer architecture with spectral gating and long–short range attention, pretrained on high-resolution solar image forecasting tasks and further optimized through autoregressive rollout tuning. Zero-shot evaluations demonstrate its ability to forecast solar dynamics and flare events, while downstream fine-tuning with parameter-efficient Low-Rank Adaptation (LoRA) shows strong performance on solar wind forecasting, active region segmentation, solar flare forecasting, and EUV spectra. Surya is the first foundation model in heliophysics that uses time advancement as a pretext task on full-resolution SDO data. Its novel architecture and performance suggest that the model is able to learn the underlying physics behind solar evolution.

[AI-82] SuryaBench: Benchmark Dataset for Advancing Machine Learning in Heliophysics and Space Weather Prediction

[Quick Read]: This paper addresses the lack of high-quality, standardized, ML-ready datasets for solar physics and space weather prediction, which hampers the development and reproducibility of AI-driven models. The key to the solution is a high-resolution, preprocessed dataset of solar dynamics observations from NASA's Solar Dynamics Observatory (SDO) spanning a full solar cycle from May 2010 to July 2024, with systematic corrections to the AIA and HMI instrument data (spacecraft roll angle correction, orbital adjustment, exposure normalization, and degradation compensation), together with auxiliary benchmark datasets covering core space weather applications such as active region segmentation, active region emergence forecasting, coronal field extrapolation, solar flare prediction, EUV spectra prediction, and solar wind speed estimation. This standardizes the data, unifies the tasks, and improves model comparability, advancing the deployment of AI in space weather forecasting.

Link: https://arxiv.org/abs/2508.14107
Authors: Sujit Roy,Dinesha V. Hegde,Johannes Schmude,Amy Lin,Vishal Gaur,Rohit Lal,Kshitiz Mandal,Talwinder Singh,Andrés Muñoz-Jaramillo,Kang Yang,Chetraj Pandey,Jinsu Hong,Berkay Aydin,Ryan McGranaghan,Spiridon Kasapis,Vishal Upendran,Shah Bahauddin,Daniel da Silva,Marcus Freitag,Iksha Gurung,Nikolai Pogorelov,Campbell Watson,Manil Maskey,Juan Bernabe-Moreno,Rahul Ramachandran
Institution: Unknown
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper introduces a high resolution, machine learning-ready heliophysics dataset derived from NASA’s Solar Dynamics Observatory (SDO), specifically designed to advance machine learning (ML) applications in solar physics and space weather forecasting. The dataset includes processed imagery from the Atmospheric Imaging Assembly (AIA) and Helioseismic and Magnetic Imager (HMI), spanning a solar cycle from May 2010 to July 2024. To ensure suitability for ML tasks, the data has been preprocessed, including correction of spacecraft roll angles, orbital adjustments, exposure normalization, and degradation compensation. We also provide auxiliary application benchmark datasets complementing the core SDO dataset. These provide benchmark applications for central heliophysics and space weather tasks such as active region segmentation, active region emergence forecasting, coronal field extrapolation, solar flare prediction, solar EUV spectra prediction, and solar wind speed estimation. By establishing a unified, standardized data collection, this dataset aims to facilitate benchmarking, enhance reproducibility, and accelerate the development of AI-driven models for critical space weather prediction tasks, bridging gaps between solar physics, machine learning, and operational forecasting.

Machine Learning

[LG-0] Compute-Optimal Scaling for Value-Based Deep RL

Link: https://arxiv.org/abs/2508.14881
Authors: Preston Fu,Oleh Rybkin,Zhiyuan Zhou,Michal Nauman,Pieter Abbeel,Sergey Levine,Aviral Kumar
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.

[LG-1] Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

Link: https://arxiv.org/abs/2508.14853
Authors: Sajib Biswas,Mao Nishino,Samuel Jacob Chacko,Xiuwen Liu
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex. We provide theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs. Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods. In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of optimized suffixes to different LLMs.
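
The core update is compact: exponentiated gradient descent with a KL Bregman projection reduces to a multiplicative update followed by normalization, which keeps each relaxed one-hot token encoding on the probability simplex by construction. A one-step numpy sketch of this update (illustrative, not the paper's attack code):

```python
import numpy as np

def egd_step(p: np.ndarray, grad: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One exponentiated-gradient step on the probability simplex.

    p    : relaxed one-hot encoding of one suffix token, p >= 0, sum(p) = 1
    grad : gradient of the adversarial loss w.r.t. p
    """
    w = p * np.exp(-lr * grad)      # multiplicative update
    return w / w.sum()              # normalization = Bregman (KL) projection

vocab = 8
p = np.full(vocab, 1.0 / vocab)     # start from the uniform distribution
g = np.random.randn(vocab)          # stand-in for a backpropagated gradient
p = egd_step(p, g)
assert abs(p.sum() - 1) < 1e-9 and (p >= 0).all()
```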

[LG-2] Multimodal Quantum Vision Transformer for Enzyme Commission Classification from Biochemical Representations

Link: https://arxiv.org/abs/2508.14844
Authors: Murat Isik,Mandeep Kaur Saggi,Humaira Gowher,Sabre Kais
Subjects: Machine Learning (cs.LG)
*Comments: Accepted at IEEE International Conference on Quantum Artificial Intelligence (QAI) 2025

Click to view abstract

Abstract:Accurately predicting enzyme functionality remains one of the major challenges in computational biology, particularly for enzymes with limited structural annotations or sequence homology. We present a novel multimodal Quantum Machine Learning (QML) framework that enhances Enzyme Commission (EC) classification by integrating four complementary biochemical modalities: protein sequence embeddings, quantum-derived electronic descriptors, molecular graph structures, and 2D molecular image representations. The framework is built on a Quantum Vision Transformer (QVT) backbone equipped with modality-specific encoders and a unified cross-attention fusion module. By integrating graph features and spatial patterns, our method captures key stereoelectronic interactions behind enzyme function. Experimental results demonstrate that our multimodal QVT model achieves a top-1 accuracy of 85.1%, outperforming sequence-only baselines by a substantial margin and achieving better performance results compared to other QML models.

[LG-3] On Defining Neural Averaging

Link: https://arxiv.org/abs/2508.14832
Authors: Su Hyeong Lee,Richard Ngo
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:What does it even mean to average neural networks? We investigate the problem of synthesizing a single neural network from a collection of pretrained models, each trained on disjoint data shards, using only their final weights and no access to training data. In forming a definition of neural averaging, we take insight from model soup, which appears to aggregate multiple models into a singular model while enhancing generalization performance. In this work, we reinterpret model souping as a special case of a broader framework: Amortized Model Ensembling (AME) for neural averaging, a data-free meta-optimization approach that treats model differences as pseudogradients to guide neural weight updates. We show that this perspective not only recovers model soup but enables more expressive and adaptive ensembling strategies. Empirically, AME produces averaged neural solutions that outperform both individual experts and model soup baselines, especially in out-of-distribution settings. Our results suggest a principled and generalizable notion of data-free model weight aggregation and defines, in one sense, how to perform neural averaging.
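
A heavily simplified numpy sketch of the pseudogradient view: each expert-to-current difference acts as a gradient fed to an Adam-style meta-optimizer. The adaptive update and all hyperparameters are assumptions for illustration; with plain averaging and a single step this collapses back to a model soup, which is the special case the paper describes:

```python
import numpy as np

def ame(experts, steps=50, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Data-free ensembling sketch: expert differences as pseudogradients.

    experts: list of flattened weight vectors trained on disjoint shards.
    """
    theta = experts[0].copy()                 # start from any expert
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        # Pseudogradient: average pull of the current weights toward experts.
        g = np.mean([theta - w for w in experts], axis=0)
        m = beta1 * m + (1 - beta1) * g       # Adam-style moments
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

experts = [np.random.randn(10) for _ in range(4)]
merged = ame(experts)   # converges toward (a reweighted) expert average
```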

[LG-4] Successive Halving with Learning Curve Prediction via Latent Kronecker Gaussian Processes

Link: https://arxiv.org/abs/2508.14818
Authors: Jihao Andreas Lin,Nicolas Mayoraz,Steffen Rendle,Dima Kuzmin,Emil Praun,Berivan Isik
Subjects: Machine Learning (cs.LG)
*Comments: AutoML 2025 Non-Archival Track

Click to view abstract

Abstract:Successive Halving is a popular algorithm for hyperparameter optimization which allocates exponentially more resources to promising candidates. However, the algorithm typically relies on intermediate performance values to make resource allocation decisions, which can cause it to prematurely prune slow starters that would eventually become the best candidate. We investigate whether guiding Successive Halving with learning curve predictions based on Latent Kronecker Gaussian Processes can overcome this limitation. In a large-scale empirical study involving different neural network architectures and a click prediction dataset, we compare this predictive approach to the standard approach based on current performance values. Our experiments show that, although the predictive approach achieves competitive performance, it is not Pareto optimal compared to investing more resources into the standard approach, because it requires fully observed learning curves as training data. However, this downside could be mitigated by leveraging existing learning curve data.
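
For context, plain Successive Halving is only a few lines; the sketch below shows how intermediate scores prune slow starters, which is exactly where the paper's learning-curve predictions would substitute for `evaluate`:

```python
import math

def successive_halving(candidates, evaluate, min_budget=1, eta=2):
    """Plain Successive Halving: each rung multiplies the budget by eta and
    keeps the top 1/eta of candidates by their *intermediate* score, so a
    slow starter can be pruned before its curve overtakes the others.
    `evaluate(config, budget)` returns a score (higher is better)."""
    rungs = int(math.log(len(candidates), eta))
    budget = min_budget
    survivors = list(candidates)
    for _ in range(rungs):
        scores = [(evaluate(c, budget), c) for c in survivors]
        scores.sort(key=lambda sc: sc[0], reverse=True)
        survivors = [c for _, c in scores[: max(1, len(survivors) // eta)]]
        budget *= eta
    return survivors[0]

# Toy usage: "configs" are learning rates, budget plays the role of epochs.
best = successive_halving(
    [0.001, 0.01, 0.1, 1.0],
    evaluate=lambda lr, epochs: -abs(math.log10(lr) + 2) + 0.01 * epochs)
```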

[LG-5] Enhancing Contrastive Link Prediction With Edge Balancing Augmentation CIKM2025

Link: https://arxiv.org/abs/2508.14808
Authors: Chen-Hao Chang,Hui-Ju Hung,Chia-Hsun Lu,Chih-Ya Shen
Subjects: Machine Learning (cs.LG)
*Comments: Accepted by CIKM 2025

Click to view abstract

Abstract:Link prediction is one of the most fundamental tasks in graph mining, which motivates the recent studies of leveraging contrastive learning to enhance the performance. However, we observe two major weaknesses of these studies: i) the lack of theoretical analysis for contrastive learning on link prediction, and ii) inadequate consideration of node degrees in contrastive learning. To address the above weaknesses, we provide the first formal theoretical analysis for contrastive learning on link prediction, where our analysis results can generalize to the autoencoder-based link prediction models with contrastive learning. Motivated by our analysis results, we propose a new graph augmentation approach, Edge Balancing Augmentation (EBA), which adjusts the node degrees in the graph as the augmentation. We then propose a new approach, named Contrastive Link Prediction with Edge Balancing Augmentation (CoEBA), that integrates the proposed EBA and the proposed new contrastive losses to improve the model performance. We conduct experiments on 8 benchmark datasets. The results demonstrate that our proposed CoEBA significantly outperforms the other state-of-the-art link prediction models.

[LG-6] Source-Guided Flow Matching

Link: https://arxiv.org/abs/2508.14807
Authors: Zifan Wang,Alice Harting,Matthieu Barreau,Michael M. Zavlanos,Karl H. Johansson
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, image datasets, and physics-informed generative tasks demonstrate the effectiveness and flexibility of the proposed framework.

[LG-7] A Guide for Manual Annotation of Scientific Imagery: How to Prepare for Large Projects

Link: https://arxiv.org/abs/2508.14801
Authors: Azim Ahmadzadeh,Rohan Adhyapak,Armin Iraji,Kartik Chaurasiya,V Aparna,Petrus C. Martens
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Despite the high demand for manually annotated image data, managing complex and costly annotation projects remains under-discussed. This is partly due to the fact that leading such projects requires dealing with a set of diverse and interconnected challenges which often fall outside the expertise of specific domain experts, leaving practical guidelines scarce. These challenges range widely from data collection to resource allocation and recruitment, from mitigation of biases to effective training of the annotators. This paper provides a domain-agnostic preparation guide for annotation projects, with a focus on scientific imagery. Drawing from the authors’ extensive experience in managing a large manual annotation project, it addresses fundamental concepts including success measures, annotation subjects, project goals, data availability, and essential team roles. Additionally, it discusses various human biases and recommends tools and technologies to improve annotation quality and efficiency. The goal is to encourage further research and frameworks for creating a comprehensive knowledge base to reduce the costs of manual annotation projects across various fields.

[LG-8] Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method

Link: https://arxiv.org/abs/2508.14783
Authors: Suleyman Olcay Polat,Poli A. Nemkova,Mark V. Albert
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Model distillation enables the transfer of knowledge from large-scale models to compact student models, facilitating deployment in resource-constrained environments. However, conventional distillation approaches often suffer from computational overhead and limited generalization. We propose a novel adaptive distillation framework that dynamically augments training data in regions of high student model loss. Using UMAP-based dimensionality reduction and nearest neighbor sampling, our method identifies underperforming regions in the embedding space and generates targeted synthetic examples to guide student learning. To further improve efficiency, we introduce a lightweight teacher-student interface that bypasses the teacher’s input layer, enabling direct distillation on vectorized representations. Experiments across standard NLP benchmarks demonstrate that our 66M-parameter student model consistently matches or surpasses established baselines, achieving 91.2% on QNLI and 92.3% on SST-2, while training with fewer epochs. These results highlight the promise of loss-aware data augmentation and vectorized distillation for efficient and effective model compression.

[LG-9] Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features

Link: https://arxiv.org/abs/2508.14780
Authors: Guillermo Sarasa Durán,Ana Granados Fontecha,Francisco de Borja Rodríguez Ortíz
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
*Comments:

Click to view abstract

Abstract:Compression-based distances (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, as similarity features are derived from the data, rather than defined as an input, it often proves difficult to align with the task at hand, particularly in complex clustering or classification settings. To address this issue, we introduce “context steering,” a novel methodology that actively guides the feature-shaping process. Instead of passively accepting the emergent data structure (typically a hierarchy derived from clustering CDs), our approach “steers” the process by systematically analyzing how each object influences the relational context within a clustering framework. This process generates a custom-tailored embedding that isolates and amplifies class-distinctive information. We validate the capabilities of this strategy using Normalized Compression Distance (NCD) and Relative Compression Distance (NRC) with common hierarchical clustering, providing an effective alternative to common transductive methods. Experimental results across heterogeneous datasets-from text to real-world audio-validate the robustness and generality of context steering, marking a fundamental shift in their application: from merely discovering inherent data structures to actively shaping a feature space tailored to a specific objective.
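
As background, the Normalized Compression Distance (NCD) that the paper builds on has a one-line definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A zlib-based sketch:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: a parameter-free similarity measure.
    Values near 0 mean highly similar objects, near 1 highly dissimilar."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b_ = b"the quick brown fox leaps over the lazy cat " * 20
c = b"completely unrelated bytes: 9f8e7d6c5b4a " * 20
print(ncd(a, b_), ncd(a, c))   # the first distance should be much smaller
```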

[LG-10] Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data

Link: https://arxiv.org/abs/2508.14769
Authors: Ahmed Mujtaba,Gleb Radchenko,Radu Prodan,Marc Masana
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: This paper was accepted at the International Conference on Federated Learning Technologies and Applications, 2025. The final version is available at IEEE Xplore

Click to view abstract

Abstract:Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.

[LG-11] HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents

Link: https://arxiv.org/abs/2508.14751
Authors: Thomas Carta,Clément Romac,Loris Gaven,Pierre-Yves Oudeyer,Olivier Sigaud,Sylvain Lamprier
Subjects: Machine Learning (cs.LG)
*Comments: 42 pages

Click to view abstract

Abstract:Open-ended AI agents need to be able to efficiently learn goals of increasing complexity, abstraction and heterogeneity over their lifetime. Beyond efficiently sampling their own goals, autotelic agents specifically need to be able to keep the growing complexity of goals under control, limiting the associated growth in sample and computational complexity. To address this challenge, recent approaches have leveraged hierarchical reinforcement learning (HRL) and language, capitalizing on its compositional and combinatorial generalization capabilities to acquire temporally extended reusable behaviours. Existing approaches use expert defined spaces of subgoals over which they instantiate a hierarchy, and often assume pre-trained associated low-level policies. Such designs are inadequate in open-ended scenarios, where goal spaces naturally diversify across a broad spectrum of difficulties. We introduce HERAKLES, a framework that enables a two-level hierarchical autotelic agent to continuously compile mastered goals into the low-level policy, executed by a small, fast neural network, dynamically expanding the set of subgoals available to the high-level policy. We train a Large Language Model (LLM) to serve as the high-level controller, exploiting its strengths in goal decomposition and generalization to operate effectively over this evolving subgoal space. We evaluate HERAKLES in the open-ended Crafter environment and show that it scales effectively with goal complexity, improves sample efficiency through skill compilation, and enables the agent to adapt robustly to novel challenges over time.

[LG-12] MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding

Link: https://arxiv.org/abs/2508.14746
Authors: Sanggeon Yun,Raheeb Hassan,Ryozo Masukawa,Mohsen Imani
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.

[LG-13] CaTE Data Curation for Trustworthy AI

Link: https://arxiv.org/abs/2508.14741
Authors: Mary Versa Clemens-Sewall,Christopher Cervantes,Emma Rafkin,J. Neil Otte,Tom Magelinski,Libby Lewis,Michelle Liu,Dana Udwin,Monique Kirkman-Bey
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This report provides practical guidance to teams designing or developing AI-enabled systems for how to promote trustworthiness during the data curation phase of development. In this report, the authors first define data, the data curation phase, and trustworthiness. We then describe a series of steps that the development team, especially data scientists, can take to build a trustworthy AI-enabled system. We enumerate the sequence of core steps and trace parallel paths where alternatives exist. The descriptions of these steps include strengths, weaknesses, preconditions, outcomes, and relevant open-source software tool implementations. In total, this report is a synthesis of data curation tools and approaches from relevant academic literature, and our goal is to equip readers with a diverse yet coherent set of practices for improving AI trustworthiness.

[LG-14] Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis

Link: https://arxiv.org/abs/2508.14727
Authors: Abbas Sabra,Olivier Schmitt,Joseph Tyler
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This study presents a quantitative evaluation of the code quality and security of five prominent Large Language Models (LLMs): Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. While prior research has assessed the functional performance of LLM-generated code, this research tested LLM output from 4,442 Java coding assignments through comprehensive static analysis using SonarQube. The findings suggest that although LLMs can generate functional code, they also introduce a range of software defects, including bugs, security vulnerabilities, and code smells. These defects do not appear to be isolated; rather, they may represent shared weaknesses stemming from systemic limitations within current LLM code generation methods. In particular, critically severe issues, such as hard-coded passwords and path traversal vulnerabilities, were observed across multiple models. These results indicate that LLM-generated code requires verification in order to be considered production-ready. This study found no direct correlation between a model’s functional performance (measured by Pass@1 rate of unit tests) and the overall quality and security of its generated code, measured by the number of SonarQube issues in benchmark solutions that passed the functional tests. This suggests that functional benchmark performance score is not a good indicator of overall code quality and security. The goal of this study is not to rank LLM performance but to highlight that all evaluated models appear to share certain weaknesses. Consequently, these findings support the view that static analysis can be a valuable instrument for detecting latent defects and an important safeguard for organizations that deploy AI in software development.

[LG-15] Addressing Graph Anomaly Detection via Causal Edge Separation and Spectrum KDD

Link: https://arxiv.org/abs/2508.14684
Authors: Zengyi Wo,Wenjun Wang,Minglai Shao,Chang Liu,Yumeng Wang,Yueheng Sun
Subjects: Machine Learning (cs.LG)
*Comments: Proceedings of the 2024 KDD Workshop

Click to view abstract

Abstract:In the real world, anomalous entities often add more legitimate connections while hiding direct links with other anomalous entities, leading to heterophilic structures in anomalous networks that most GNN-based techniques fail to address. Several works have been proposed to tackle this issue in the spatial domain. However, these methods overlook the complex relationships between node structure encoding, node features, and their contextual environment, and lack principled guidance; research on solving spectral domain heterophilic problems remains limited. This study analyzes the spectral distribution of nodes with different heterophilic degrees and discovers that the heterophily of anomalous nodes causes the spectral energy to shift from low to high frequencies. To address the above challenges, we propose a spectral neural network CES2-GAD based on causal edge separation for anomaly detection on heterophilic graphs. Firstly, CES2-GAD will separate the original graph into homophilic and heterophilic edges using causal interventions. Subsequently, various hybrid-spectrum filters are used to capture signals from the segmented graphs. Finally, representations from multiple signals are concatenated and input into a classifier to predict anomalies. Extensive experiments with real-world datasets have proven the effectiveness of the method we proposed.

[LG-16] Improving Fairness in Graph Neural Networks via Counterfactual Debiasing KDD

Link: https://arxiv.org/abs/2508.14683
Authors: Zengyi Wo,Chang Liu,Yumeng Wang,Minglai Shao,Wenjun Wang
Subjects: Machine Learning (cs.LG)
*Comments: Proceedings of the 2024 KDD Workshop

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have been successful in modeling graph-structured data. However, similar to other machine learning models, GNNs can exhibit bias in predictions based on attributes like race and gender. Moreover, bias in GNNs can be exacerbated by the graph structure and message-passing mechanisms. Recent cutting-edge methods propose mitigating bias by filtering out sensitive information from input or representations, like edge dropping or feature masking. Yet, we argue that such strategies may unintentionally eliminate non-sensitive features, leading to a compromised balance between predictive accuracy and fairness. To tackle this challenge, we present a novel approach utilizing counterfactual data augmentation for bias mitigation. This method involves creating diverse neighborhoods using counterfactuals before message passing, facilitating unbiased node representations learning from the augmented graph. Subsequently, an adversarial discriminator is employed to diminish bias in predictions by conventional GNN classifiers. Our proposed technique, Fair-ICD, ensures the fairness of GNNs under moderate conditions. Experiments on standard datasets using three GNN backbones demonstrate that Fair-ICD notably enhances fairness metrics while preserving high predictive performance.

[LG-17] Clinical semantics for lung cancer prediction

Link: https://arxiv.org/abs/2508.14627
Authors: Luis H. John,Jan A. Kors,Jenna M. Reps,Peter R. Rijnbeek,Egill A. Fridgeirsson
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Background: Existing clinical prediction models often represent patient data using features that ignore the semantic relationships between clinical concepts. This study integrates domain-specific semantic information by mapping the SNOMED medical term hierarchy into a low-dimensional hyperbolic space using Poincaré embeddings, with the aim of improving lung cancer onset prediction. Methods: Using a retrospective cohort from the Optum EHR dataset, we derived a clinical knowledge graph from the SNOMED taxonomy and generated Poincaré embeddings via Riemannian stochastic gradient descent. These embeddings were then incorporated into two deep learning architectures, a ResNet and a Transformer model. Models were evaluated for discrimination (area under the receiver operating characteristic curve) and calibration (average absolute difference between observed and predicted probabilities) performance. Results: Incorporating pre-trained Poincaré embeddings resulted in modest and consistent improvements in discrimination performance compared to baseline models using randomly initialized Euclidean embeddings. ResNet models, particularly those using a 10-dimensional Poincaré embedding, showed enhanced calibration, whereas Transformer models maintained stable calibration across configurations. Discussion: Embedding clinical knowledge graphs into hyperbolic space and integrating these representations into deep learning models can improve lung cancer onset prediction by preserving the hierarchical structure of clinical terminologies used for prediction. This approach demonstrates a feasible method for combining data-driven feature extraction with established clinical knowledge.
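
Two standard geometric ingredients underlie Poincaré embeddings: the hyperbolic distance and the Riemannian rescaling of Euclidean gradients. A numpy sketch following the usual Nickel & Kiela (2017) formulas, not code from the paper:

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the open unit ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u**2)) * (1 - np.sum(v**2))
    return float(np.arccosh(1 + 2 * sq_diff / denom))

def riemannian_sgd_step(theta: np.ndarray, euclid_grad: np.ndarray,
                        lr: float = 0.01, eps: float = 1e-5) -> np.ndarray:
    """Rescale the Euclidean gradient by the inverse Poincare metric,
    then project back inside the ball if the step overshoots."""
    scale = ((1 - np.sum(theta**2)) ** 2) / 4   # inverse metric factor
    theta = theta - lr * scale * euclid_grad
    norm = np.linalg.norm(theta)
    if norm >= 1:
        theta = theta / norm * (1 - eps)
    return theta

# In such embeddings, child concepts sit deeper (closer to the boundary)
# than their parents, which is how a hierarchy like SNOMED fits in few dims.
```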

[LG-18] A Fuzzy-Enhanced Explainable AI Framework for Flight Continuous Descent Operations Classification

Link: https://arxiv.org/abs/2508.14618
Authors: Amin Noroozi,Sandaruwan K. Sethunge,Elham Norouzi,Phat T. Phan,Kavinda U. Waduge,Md. Arafatur Rahman
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Continuous Descent Operations (CDO) involve smooth, idle-thrust descents that avoid level-offs, reducing fuel burn, emissions, and noise while improving efficiency and passenger comfort. Despite its operational and environmental benefits, limited research has systematically examined the factors influencing CDO performance. Moreover, many existing methods in related areas, such as trajectory optimization, lack the transparency required in aviation, where explainability is critical for safety and stakeholder trust. This study addresses these gaps by proposing a Fuzzy-Enhanced Explainable AI (FEXAI) framework that integrates fuzzy logic with machine learning and SHapley Additive exPlanations (SHAP) analysis. For this purpose, a comprehensive dataset of 29 features, including 11 operational and 18 weather-related features, was collected from 1,094 flights using Automatic Dependent Surveillance-Broadcast (ADS-B) data. Machine learning models and SHAP were then applied to classify flights’ CDO adherence levels and rank features by importance. The three most influential features, as identified by SHAP scores, were then used to construct a fuzzy rule-based classifier, enabling the extraction of interpretable fuzzy rules. All models achieved classification accuracies above 90%, with FEXAI providing meaningful, human-readable rules for operational users. Results indicated that the average descent rate within the arrival route, the number of descent segments, and the average change in directional heading during descent were the strongest predictors of CDO performance. The FEXAI method proposed in this study presents a novel pathway for operational decision support and could be integrated into aviation tools to enable real-time advisories that maintain CDO adherence under varying operational conditions.

[LG-19] Measuring IIA Violations in Similarity Choices with Bayesian Models UAI 2025

Link: https://arxiv.org/abs/2508.14615
Authors: Hugo Sales Corrêa,Suryanarayana Sankagiri,Daniel Ratton Figueiredo,Matthias Grossglauser
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 26 pages and 34 figures, for associated code and data, see this https URL , poster session in UAI 2025

Click to view abstract

Abstract:Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, e.g., in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogenous, suggesting that the IIA violations are driven by context effects – specifically, interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.

[LG-20] DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning

Link: https://arxiv.org/abs/2508.14600
Authors: Xudong Wang,Guoming Tang,Junyu Xue,Srinivasan Keshav,Tongxin Li,Chris Ding
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: Preprint

Click to view abstract

Abstract:Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter energy sources, such as solar panels and battery storage, poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The injected energy from the behind-the-meter sources can obscure the power signatures of individual appliances, leading to a significant decline in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification in NILM. By integrating sequence-to-point and sequence-to-sequence strategies within a Transformer-based architecture, DualNILM can effectively capture multi-scale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. We conduct validation of DualNILM using both self-collected and synthesized open NILM datasets that include both appliance-level energy consumption and energy injection. Extensive experimental results demonstrate that DualNILM maintains an excellent performance for the dual tasks in NILM, much outperforming conventional methods.

[LG-21] A Comprehensive Evaluation of the Sensitivity of Density-Ratio Estimation Based Fairness Measurement in Regression

Link: https://arxiv.org/abs/2508.14576
Authors: Abdalwahab Almajed,Maryam Tabar,Peyman Najafirad
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The prevalence of algorithmic bias in Machine Learning (ML)-driven approaches has inspired growing research on measuring and mitigating bias in the ML domain. Accordingly, prior research studied how to measure fairness in regression which is a complex problem. In particular, recent research proposed to formulate it as a density-ratio estimation problem and relied on a Logistic Regression-driven probabilistic classifier-based approach to solve it. However, there are several other methods to estimate a density ratio, and to the best of our knowledge, prior work did not study the sensitivity of such fairness measurement methods to the choice of underlying density ratio estimation algorithm. To fill this gap, this paper develops a set of fairness measurement methods with various density-ratio estimation cores and thoroughly investigates how different cores would affect the achieved level of fairness. Our experimental results show that the choice of density-ratio estimation core could significantly affect the outcome of fairness measurement method, and even, generate inconsistent results with respect to the relative fairness of various algorithms. These observations suggest major issues with density-ratio estimation based fairness measurement in regression and a need for further research to enhance their reliability.
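
For concreteness, the Logistic Regression-driven probabilistic-classifier core mentioned above can be sketched in a few lines with scikit-learn; swapping `LogisticRegression` for another probabilistic classifier or a direct density-ratio method is exactly the sensitivity the paper studies:

```python
# Sketch of a classifier-based density-ratio estimator (one of several
# interchangeable "cores"). Illustrative only, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_estimator(x_numer: np.ndarray, x_denom: np.ndarray):
    """Estimate r(x) = p_numer(x) / p_denom(x) via a binary classifier."""
    X = np.vstack([x_numer, x_denom])
    y = np.concatenate([np.ones(len(x_numer)), np.zeros(len(x_denom))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    prior_ratio = len(x_denom) / len(x_numer)   # correct for class imbalance

    def r(x: np.ndarray) -> np.ndarray:
        p = clf.predict_proba(x)[:, 1]
        return prior_ratio * p / (1 - p)        # posterior-odds identity
    return r
```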

[LG-22] Cooperative SGD with Dynamic Mixing Matrices ECAI-2025

Link: https://arxiv.org/abs/2508.14565
Authors: Soumya Sarkar,Shweta Jain
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: Accepted at 28th European Conference on Artificial Intelligence (ECAI-2025) in main paper track

Click to view abstract

Abstract:One of the most common methods to train machine learning algorithms today is the stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A substantial number of works in the distributed SGD setting assume a fixed topology for the edge devices. These papers also assume that the contribution of nodes to the global model is uniform. However, experiments have shown that such assumptions are suboptimal and a non uniform aggregation strategy coupled with a dynamically shifting topology and client selection can significantly improve the performance of such models. This paper details a unified framework that covers several Local-Update SGD-based distributed algorithms with dynamic topologies and provides improved or matching theoretical guarantees on convergence compared to existing work.

[LG-23] FedEve: On Bridging the Client Drift and Period Drift for Cross-device Federated Learning

Link: https://arxiv.org/abs/2508.14539
Authors: Tao Shen,Zexi Li,Didi Zhu,Ziyu Zhao,Chao Wu,Fei Wu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:Federated learning (FL) is a machine learning paradigm that allows multiple clients to collaboratively train a shared model without exposing their private data. Data heterogeneity is a fundamental challenge in FL, which can result in poor convergence and performance degradation. Client drift has been recognized as one of the factors contributing to this issue resulting from the multiple local updates in FedAvg. However, in cross-device FL, a different form of drift arises due to the partial client participation, but it has not been studied well. This drift, which we refer to as period drift, occurs because the clients participating at each communication round may exhibit a data distribution that deviates from that of all clients. It could be more harmful than client drift since the optimization objective shifts with every round. In this paper, we investigate the interaction between period drift and client drift, finding that period drift can have a particularly detrimental effect on cross-device FL as the degree of data heterogeneity increases. To tackle these issues, we propose a predict-observe framework and present an instantiated method, FedEve, where these two types of drift can compensate each other to mitigate their overall impact. We provide theoretical evidence that our approach can reduce the variance of model updates. Extensive experiments demonstrate that our method outperforms alternatives on non-iid data in cross-device settings.

[LG-24] Great GATsBi: Hybrid Multimodal Trajectory Forecasting for Bicycles using Anticipation Mechanism

Link: https://arxiv.org/abs/2508.14523
Authors: Kevin Riehl,Shaimaa K. El-Baklish,Anastasios Kouvelas,Michail A. Makridis
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Accurate prediction of road user movement is increasingly required by many applications ranging from advanced driver assistance systems to autonomous driving, and is especially crucial for road safety. Even though bicycles account for a substantial share of traffic accident fatalities, they have received little attention, as previous work focused mainly on pedestrians and motorized vehicles. In this work, we present the Great GATsBi, a domain-knowledge-based, hybrid, multimodal trajectory prediction framework for bicycles. The model incorporates both physics-based modeling (inspired by motorized vehicles) and social-based modeling (inspired by pedestrian movements) to explicitly account for the dual nature of bicycle movement. The social interactions are modeled with a graph attention network, and include decayed historical, but also anticipated, future trajectory data of a bicycle's neighborhood, following recent insights from psychological and social studies. The results indicate that the proposed ensemble of physics models, which perform well in short-term predictions, and social models, which perform well in long-term predictions, exceeds state-of-the-art performance. We also conducted a controlled mass-cycling experiment to demonstrate the framework's performance when forecasting bicycle trajectories and modeling social interactions with road users.

[LG-25] Artificial Intelligence-Based Multiscale Temporal Modeling for Anomaly Detection in Cloud Services

Link: https://arxiv.org/abs/2508.14503
Authors: Lian Lian,Yilin Li,Song Han,Renzi Meng,Sibo Wang,Ming Wang
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This study proposes an anomaly detection method based on the Transformer architecture with integrated multiscale feature perception, aiming to address the limitations of temporal modeling and scale-aware feature representation in cloud service environments. The method first employs an improved Transformer module to perform temporal modeling on high-dimensional monitoring data, using a self-attention mechanism to capture long-range dependencies and contextual semantics. Then, a multiscale feature construction path is introduced to extract temporal features at different granularities through downsampling and parallel encoding. An attention-weighted fusion module is designed to dynamically adjust the contribution of each scale to the final decision, enhancing the model’s robustness in anomaly pattern modeling. In the input modeling stage, standardized multidimensional time series are constructed, covering core signals such as CPU utilization, memory usage, and task scheduling states, while positional encoding is used to strengthen the model’s temporal awareness. A systematic experimental setup is designed to evaluate performance, including comparative experiments and hyperparameter sensitivity analysis, focusing on the impact of optimizers, learning rates, anomaly ratios, and noise levels. Experimental results show that the proposed method outperforms mainstream baseline models in key metrics, including precision, recall, AUC, and F1-score, and maintains strong stability and detection performance under various perturbation conditions, demonstrating its superior capability in complex cloud environments.

[LG-26] Semantic Energy: Detecting LLM Hallucination Beyond Entropy

链接: https://arxiv.org/abs/2508.14496
作者: Huan Ma,Jiadong Pan,Jing Liu,Yan Chen,Joey Tianyi Zhou,Guangyu Wang,Qinghua Hu,Hua Wu,Changqing Zhang,Haifeng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations, which produce fluent yet incorrect responses and lead to erroneous decision-making. Uncertainty estimation is a feasible approach to detect such hallucinations. For example, semantic entropy estimates uncertainty by considering the semantic diversity across multiple sampled responses, thus identifying hallucinations. However, semantic entropy relies on post-softmax probabilities and fails to capture the model's inherent uncertainty, causing it to be ineffective in certain scenarios. To address this issue, we introduce Semantic Energy, a novel uncertainty estimation framework that leverages the inherent confidence of LLMs by operating directly on the logits of the penultimate layer. By combining semantic clustering with a Boltzmann-inspired energy distribution, our method better captures uncertainty in cases where semantic entropy fails. Experiments across multiple benchmarks show that Semantic Energy significantly improves hallucination detection and uncertainty estimation, offering more reliable signals for downstream applications such as hallucination detection.
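The energy here is the standard logit-space energy score; a minimal sketch of how one might compute it and pool it over semantic clusters is below. The exact aggregation in the paper may differ, and the clustering step (e.g. NLI-based) is assumed given.

```python
# Hedged sketch: per-response energy from raw logits (no softmax), then
# averaged within semantic clusters of sampled responses.
import numpy as np

def sequence_energy(logits, T=1.0):
    """Mean token energy E = -T * logsumexp(logits / T); lower = more confident."""
    z = logits / T
    m = z.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))
    return float(np.mean(-T * lse))

def cluster_energies(energies, cluster_ids):
    """Average energy per semantic cluster (cluster assignment assumed given)."""
    groups = {}
    for c, e in zip(cluster_ids, energies):
        groups.setdefault(c, []).append(e)
    return {c: float(np.mean(v)) for c, v in groups.items()}

rng = np.random.default_rng(0)
# Five sampled responses (12 tokens, vocab 32000) to one prompt, two meanings.
energies = [sequence_energy(rng.standard_normal((12, 32000))) for _ in range(5)]
print(cluster_energies(energies, cluster_ids=[0, 0, 0, 1, 1]))
```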

[LG-27] On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines

链接: https://arxiv.org/abs/2508.14482
作者: Alexander Geiger,Lars Wagner,Daniel Rueckert,Dirk Wilhelm,Alissa Jell
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are critical for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline input representing the absence of relevant features (“missingness”). Commonly used baselines, such as all-zero inputs, are often semantically meaningless, especially in medical contexts where missingness can itself be informative. While alternative baseline choices have been explored, existing methods lack a principled approach to dynamically select baselines tailored to each input. In this work, we examine the notion of missingness in the medical setting, analyze its implications for baseline selection, and introduce a counterfactual-guided approach to address the limitations of conventional baselines. We argue that a clinically normal but input-close counterfactual represents a more accurate representation of a meaningful absence of features in medical data. To implement this, we use a Variational Autoencoder to generate counterfactual baselines, though our concept is generative-model-agnostic and can be applied with any suitable counterfactual method. We evaluate the approach on three distinct medical data sets and empirically demonstrate that counterfactual baselines yield more faithful and medically relevant attributions compared to standard baseline choices.
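For readers unfamiliar with path attributions, the sketch below shows plain Integrated Gradients with a pluggable baseline; swapping the all-zero baseline for a counterfactual one is the paper's key move. The toy model and the counterfactual stand-in are our assumptions.

```python
# Integrated Gradients with an arbitrary baseline (Riemann approximation).
import torch

def integrated_gradients(model, x, baseline, steps=50):
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)

# In the paper's setting the baseline would be a VAE counterfactual, roughly
# baseline = vae.decode(shift_toward_normal(vae.encode(x))); here we use a
# random stand-in just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Linear(16, 1))
x = torch.randn(16)
baseline_cf = 0.1 * torch.randn(16)
print(integrated_gradients(model, x, baseline_cf).shape)  # torch.Size([16])
```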

[LG-28] Fast Symbolic Regression Benchmarking

链接: https://arxiv.org/abs/2508.14481
作者: Viktor Martinek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symbolic regression (SR) uncovers mathematical models from data. Several benchmarks have been proposed to compare the performance of SR algorithms. However, existing ground-truth rediscovery benchmarks overemphasize the recovery of "the one" expression form or rely solely on computer algebra systems (such as SymPy) to assess success. Furthermore, existing benchmarks continue the expression search even after its discovery. We address these issues by introducing curated lists of acceptable expressions and a callback mechanism for early termination. As a starting point, we use the symbolic regression for scientific discovery (SRSD) benchmark problems proposed by Yoshitomo et al., and benchmark the two SR packages this http URL and TiSR. The new benchmarking method increases the rediscovery rate of this http URL from 26.7%, as reported by Yoshitomo et al., to 44.7%, while requiring 41.2% less computational expense. TiSR's rediscovery rate is 69.4%, and performing the benchmark saves 63% of the time.
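The two mechanisms, curated acceptable forms and early termination, are easy to picture; here is a minimal SymPy sketch. The target expression and the hook name are illustrative, and the actual benchmark wiring differs per SR package.

```python
# Success = symbolic equivalence to ANY curated form; the callback lets the
# SR run stop as soon as that happens instead of continuing the search.
import sympy as sp

x = sp.symbols("x")
acceptable = [sp.exp(-x**2), 1 / sp.exp(x**2)]   # curated equivalent forms

def is_rediscovered(candidate_str):
    cand = sp.sympify(candidate_str)
    return any(sp.simplify(cand - ref) == 0 for ref in acceptable)

def early_stop(candidate_str):
    # Passed to the SR package's per-iteration hook (name varies by package).
    return is_rediscovered(candidate_str)

print(early_stop("exp(-x**2)"))   # True  -> terminate the run
print(early_stop("x**2 + 1"))     # False -> keep searching
```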

[LG-29] Personalized Counterfactual Framework: Generating Potential Outcomes from Wearable Data

链接: https://arxiv.org/abs/2508.14432
作者: Ajan Subramanian,Amir M. Rahmani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wearable sensor data offer opportunities for personalized health monitoring, yet deriving actionable insights from their complex, longitudinal data streams is challenging. This paper introduces a framework to learn personalized counterfactual models from multivariate wearable data. This enables exploring what-if scenarios to understand potential individual-specific outcomes of lifestyle choices. Our approach first augments individual datasets with data from similar patients via multi-modal similarity analysis. We then use a temporal PC (Peter-Clark) algorithm adaptation to discover predictive relationships, modeling how variables at time t-1 influence physiological changes at time t. Gradient Boosting Machines are trained on these discovered relationships to quantify individual-specific effects. These models drive a counterfactual engine projecting physiological trajectories under hypothetical interventions (e.g., activity or sleep changes). We evaluate the framework via one-step-ahead predictive validation and by assessing the plausibility and impact of interventions. Evaluation showed reasonable predictive accuracy (e.g., mean heart rate MAE 4.71 bpm) and high counterfactual plausibility (median 0.9643). Crucially, these interventions highlighted significant inter-individual variability in response to hypothetical lifestyle changes, showing the framework’s potential for personalized insights. This work provides a tool to explore personalized health dynamics and generate hypotheses on individual responses to lifestyle changes.

[LG-30] Offline Imitation Learning upon Arbitrary Demonstrations by Pre-Training Dynamics Representations

链接: https://arxiv.org/abs/2508.14383
作者: Haitong Ma,Bo Dai,Zhaolin Ren,Yebin Wang,Na Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Limited data has become a major bottleneck in scaling up offline imitation learning (IL). In this paper, we propose enhancing IL performance under limited expert data by introducing a pre-training stage that learns dynamics representations, derived from factorizations of the transition dynamics. We first theoretically justify that the optimal decision variable of offline IL lies in the representation space, significantly reducing the parameters to learn in the downstream IL. Moreover, the dynamics representations can be learned from arbitrary data collected with the same dynamics, allowing the reuse of massive non-expert data and mitigating the limited data issues. We present a tractable loss function inspired by noise contrastive estimation to learn the dynamics representations at the pre-training stage. Experiments on MuJoCo demonstrate that our proposed algorithm can mimic expert policies with as few as a single trajectory. Experiments on real quadrupeds show that we can leverage pre-trained dynamics representations from simulator data to learn to walk from a few real-world demonstrations.

[LG-31] Action-Constrained Imitation Learning ICML2025

链接: https://arxiv.org/abs/2508.14379
作者: Chia-Han Yeh,Tse-Sheng Nan,Risto Vuorio,Wei Hung,Hung-Yen Wu,Shao-Hua Sun,Ping-Chun Hsieh
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Published in ICML 2025

点击查看摘要

Abstract:Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with a larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency. Our code is publicly available at this https URL.
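The Dynamic Time Warping distance at the heart of the trajectory alignment is a standard construction; a minimal sketch with illustrative data (not the paper's MPC pipeline):

```python
# Classic O(T1*T2) dynamic-programming DTW between two state trajectories.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
expert = np.cumsum(rng.standard_normal((100, 3)), axis=0)      # expert states
surrogate = expert[::2] + 0.05 * rng.standard_normal((50, 3))  # constrained rollout
print(f"DTW(expert, surrogate) = {dtw_distance(expert, surrogate):.2f}")
```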

[LG-32] Hilbert geometry of the symmetric positive-definite bicone: Application to the geometry of the extended Gaussian family

链接: https://arxiv.org/abs/2508.14369
作者: Jacek Karwowski,Frank Nielsen
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG); Probability (math.PR)
*备注: 21 pages

点击查看摘要

Abstract:The extended Gaussian family is the closure of the Gaussian family obtained by completing the Gaussian family with the counterpart elements induced by degenerate covariance or degenerate precision matrices, or a mix of both degeneracies. The parameter space of the extended Gaussian family forms a symmetric positive semi-definite matrix bicone, i.e. two partial symmetric positive semi-definite matrix cones joined at their bases. In this paper, we study the Hilbert geometry of such an open bounded convex symmetric positive-definite bicone. We report the closed-form formula for the corresponding Hilbert metric distance and study exhaustively its invariance properties. We also touch upon potential applications of this geometry for dealing with extended Gaussian distributions.

[LG-33] SBGD: Improving Graph Diffusion Generative Model via Stochastic Block Diffusion

链接: https://arxiv.org/abs/2508.14352
作者: Junwei Su,Shan Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph diffusion generative models (GDGMs) have emerged as powerful tools for generating high-quality graphs. However, their broader adoption faces challenges in scalability and size generalization. GDGMs struggle to scale to large graphs due to their high memory requirements, as they typically operate in the full graph space, requiring the entire graph to be stored in memory during training and inference. This constraint limits their feasibility for large-scale real-world graphs. GDGMs also exhibit poor size generalization, with limited ability to generate graphs of sizes different from those in the training data, restricting their adaptability across diverse applications. To address these challenges, we propose the stochastic block graph diffusion (SBGD) model, which refines graph representations into a block graph space. This space incorporates structural priors based on real-world graph patterns, significantly reducing memory complexity and enabling scalability to large graphs. The block representation also improves size generalization by capturing fundamental graph structures. Empirical results show that SBGD achieves significant memory improvements (up to 6×) while maintaining comparable or even superior graph generation performance relative to state-of-the-art methods. Furthermore, experiments demonstrate that SBGD better generalizes to unseen graph sizes. The significance of SBGD extends beyond being a scalable and effective GDGM; it also exemplifies the principle of modularization in generative modeling, offering a new avenue for exploring generative models by decomposing complex tasks into more manageable components.

[LG-34] A Non-Asymptotic Convergent Analysis for Scored-Based Graph Generative Model via a System of Stochastic Differential Equations

链接: https://arxiv.org/abs/2508.14351
作者: Junwei Su,Chuan Wu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based graph generative models (SGGMs) have proven effective in critical applications such as drug discovery and protein synthesis. However, their theoretical behavior, particularly regarding convergence, remains underexplored. Unlike common score-based generative models (SGMs), which are governed by a single stochastic differential equation (SDE), SGGMs involve a system of coupled SDEs. In SGGMs, the graph structure and node features are governed by separate but interdependent SDEs. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first non-asymptotic convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using synthetic graph models, and the results align with our theoretical predictions. This work deepens the theoretical understanding of SGGMs, demonstrates their applicability in critical domains, and provides practical guidance for designing effective models.

[LG-35] On the Interplay between Graph Structure and Learning Algorithms in Graph Neural Networks

链接: https://arxiv.org/abs/2508.14338
作者: Junwei Su,Chuan Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excess risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, collectively showcasing a coupling relation among graph structure, GNNs, and learning algorithms, and providing insights on GNN algorithm design and selection in practice.

[LG-36] NeRC: Neural Ranging Correction through Differentiable Moving Horizon Location Estimation

链接: https://arxiv.org/abs/2508.14336
作者: Xu Weng,K.V. Ling,Haochen Liu,Bingheng Wang,Kun Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:GNSS localization using everyday mobile devices is challenging in urban environments, as ranging errors caused by the complex propagation of satellite signals and low-quality onboard GNSS hardware are blamed for undermining positioning accuracy. Researchers have pinned their hopes on data-driven methods to regress such ranging errors from raw measurements. However, the grueling annotation of ranging errors impedes their pace. This paper presents a robust end-to-end Neural Ranging Correction (NeRC) framework, where localization-related metrics serve as the task objective for training the neural modules. Instead of seeking impractical ranging error labels, we train the neural network using ground-truth locations that are relatively easy to obtain. This functionality is supported by differentiable moving horizon location estimation (MHE) that handles a horizon of measurements for positioning and backpropagates the gradients for training. Even better, as a blessing of end-to-end learning, we propose a new training paradigm using Euclidean Distance Field (EDF) cost maps, which alleviates the demands on labeled locations. We evaluate the proposed NeRC on public benchmarks and our collected datasets, demonstrating its distinguished improvement in positioning accuracy. We also deploy NeRC on the edge to verify its real-time performance for mobile devices.

[LG-37] Multi-view Graph Condensation via Tensor Decomposition

链接: https://arxiv.org/abs/2508.14330
作者: Nícolas Roque dos Santos,Dawon Ahn,Diego Minatel,Alneu de Andrade Lopes,Evangelos E. Papalexakis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable results in various real-world applications, including drug discovery, object detection, social media analysis, recommender systems, and text classification. Despite their vast potential, training them on large-scale graphs presents significant computational challenges due to the resources required for their storage and processing. Graph Condensation has emerged as a promising solution to reduce these demands by learning a synthetic compact graph that preserves the essential information of the original one while maintaining the GNN's predictive performance. Despite their efficacy, current graph condensation approaches frequently rely on a computationally intensive bi-level optimization. Moreover, they fail to maintain a mapping between synthetic and original nodes, limiting the interpretability of the model's decisions. In this sense, a wide range of decomposition techniques have been applied to learn linear or multi-linear functions from graph data, offering a more transparent and less resource-intensive alternative. However, their applicability to graph condensation remains unexplored. This paper addresses this gap and proposes a novel method called Multi-view Graph Condensation via Tensor Decomposition (GCTD) to investigate to what extent such techniques can synthesize an informative smaller graph and achieve comparable downstream task performance. Extensive experiments on six real-world datasets demonstrate that GCTD effectively reduces graph size while preserving GNN performance, achieving up to a 4.0% improvement in accuracy on three out of six datasets and competitive performance on large graphs compared to existing approaches. Our code is available at this https URL.

[LG-38] FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models

链接: https://arxiv.org/abs/2508.14315
作者: Pritthijit Nath,Sebastian Schemm,Henry Moss,Peter Haynes,Emily Shuckburgh,Mark Webb
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 21 pages, 6 figures

点击查看摘要

Abstract:Sub-grid parameterisations in climate models are traditionally static and tuned offline, limiting adaptability to evolving states. This work introduces FedRAIN-Lite, a federated reinforcement learning (FedRL) framework that mirrors the spatial decomposition used in general circulation models (GCMs) by assigning agents to latitude bands, enabling local parameter learning with periodic global aggregation. Using a hierarchy of simplified energy-balance climate models, from a single-agent baseline (ebm-v1) to multi-agent ensemble (ebm-v2) and GCM-like (ebm-v3) setups, we benchmark three RL algorithms under different FedRL configurations. Results show that Deep Deterministic Policy Gradient (DDPG) consistently outperforms both static and single-agent baselines, with faster convergence and lower area-weighted RMSE in tropical and mid-latitude zones across both ebm-v2 and ebm-v3 setups. DDPG’s ability to transfer across hyperparameters and low computational cost make it well-suited for geographically adaptive parameter learning. This capability offers a scalable pathway towards high-complexity GCMs and provides a prototype for physically aligned, online-learning climate models that can evolve with a changing climate. Code accessible at this https URL.

[LG-39] Graph Concept Bottleneck Models

链接: https://arxiv.org/abs/2508.14255
作者: Haotian Xu,Tsui-Wei Weng,Lam M. Nguyen,Tengfei Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) provide explicit interpretations for deep neural networks through concepts and allow intervention with concepts to adjust final predictions. Existing CBMs assume concepts are conditionally independent given labels and isolated from each other, ignoring the hidden relationships among concepts. However, the set of concepts in CBMs often has an intrinsic structure where concepts are generally correlated: changing one concept will inherently impact its related concepts. To mitigate this limitation, we propose Graph CBMs: a new variant of CBM that facilitates concept relationships by constructing latent concept graphs, which can be combined with CBMs to enhance model performance while retaining their interpretability. Our experimental results on real-world image classification tasks demonstrate that Graph CBMs offer the following benefits: (1) superior performance in image classification tasks while providing more concept structure information for interpretability; (2) the ability to utilize latent concept graphs for more effective interventions; and (3) robust performance across different training and architecture settings.

[LG-40] Optimal Subspace Embeddings: Resolving Nelson-Nguyen Conjecture Up to Sub-Polylogarithmic Factors

链接: https://arxiv.org/abs/2508.14234
作者: Shabarish Chenakkod,Michał Dereziński,Xiaoyu Dong
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We give a proof of the conjecture of Nelson and Nguyen [FOCS 2013] on the optimal dimension and sparsity of oblivious subspace embeddings, up to sub-polylogarithmic factors: For any n \geq d and \epsilon \geq d^{-O(1)}, there is a random \tilde{O}(d/\epsilon^2) \times n matrix \Pi with \tilde{O}(\log(d)/\epsilon) non-zeros per column such that for any A \in \mathbb{R}^{n \times d}, with high probability, (1-\epsilon)\|Ax\| \leq \|\Pi Ax\| \leq (1+\epsilon)\|Ax\| for all x \in \mathbb{R}^d, where \tilde{O}(\cdot) hides only sub-polylogarithmic factors in d. Our result in particular implies a new fastest sub-current-matrix-multiplication-time reduction of size \tilde{O}(d/\epsilon^2) for a broad class of n \times d linear regression tasks. A key novelty in our analysis is a matrix concentration technique we call iterative decoupling, which we use to fine-tune the higher-order trace moment bounds attainable via existing random matrix universality tools [Brailovskaya and van Handel, GAFA 2024].
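A toy numerical illustration of the object in question (the paper's construction and proof are far more delicate; the sampling below is only meant to show what "s nonzeros per column" and the subspace-embedding property look like):

```python
# Sparse embedding Pi with s random signed entries per column, then an
# empirical check that ||Pi A x|| is within (1 +/- eps) of ||A x||.
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 2000, 10, 0.25
m = 4 * int(d / eps**2)             # embedding rows ~ O(d / eps^2)
s = max(1, int(np.log(d) / eps))    # nonzeros per column ~ O(log(d) / eps)

Pi = np.zeros((m, n))
for col in range(n):
    rows = rng.choice(m, size=s, replace=False)
    Pi[rows, col] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

A = rng.standard_normal((n, d))
for _ in range(3):
    x = rng.standard_normal(d)
    ratio = np.linalg.norm(Pi @ (A @ x)) / np.linalg.norm(A @ x)
    print(f"||Pi A x|| / ||A x|| = {ratio:.3f}")   # stays within 1 +/- eps
```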

[LG-41] Reliability comparison of vessel trajectory prediction models via Probability of Detection

链接: https://arxiv.org/abs/2508.14198
作者: Zahra Rastin,Kathrin Donandt,Dirk Söffker
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
*备注: 2025 IEEE Intelligent Vehicles Symposium (IV)

点击查看摘要

Abstract:This contribution addresses vessel trajectory prediction (VTP), focusing on the evaluation of different deep learning-based approaches. The objective is to assess model performance in diverse traffic complexities and compare the reliability of the approaches. While previous VTP models overlook the specific traffic situation complexity and lack reliability assessments, this research uses a probability of detection analysis to quantify model reliability in varying traffic scenarios, thus going beyond common error distribution analyses. All models are evaluated on test samples categorized according to their traffic situation during the prediction horizon, with performance metrics and reliability estimates obtained for each category. The results of this comprehensive evaluation provide a deeper understanding of the strengths and weaknesses of the different prediction approaches, along with their reliability in terms of the prediction horizon lengths for which safe forecasts can be guaranteed. These findings can inform the development of more reliable vessel trajectory prediction approaches, enhancing safety and efficiency in future inland waterways navigation.

[LG-42] Noise Robust One-Class Intrusion Detection on Dynamic Graphs

链接: https://arxiv.org/abs/2508.14192
作者: Aleksei Liuliakov,Alexander Schulz,Luca Hermes,Barbara Hammer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In the domain of network intrusion detection, robustness against contaminated and noisy data inputs remains a critical challenge. This study introduces a probabilistic version of the Temporal Graph Network Support Vector Data Description (TGN-SVDD) model, designed to enhance detection accuracy in the presence of input noise. By predicting parameters of a Gaussian distribution for each network event, our model is able to naturally address noisy adversarials and improve robustness compared to a baseline model. Our experiments on a modified CIC-IDS2017 data set with synthetic noise demonstrate significant improvements in detection performance compared to the baseline TGN-SVDD model, especially as noise levels increase.
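One way to picture the probabilistic extension (a sketch under our own assumptions, not the authors' architecture): the event encoder outputs a Gaussian (mu, sigma) per event rather than a point embedding, and the SVDD-style score becomes the expected squared distance to the hypersphere center.

```python
# Gaussian event head + expected SVDD distance:
# E||z - c||^2 = ||mu - c||^2 + sum(sigma^2) for z ~ N(mu, diag(sigma^2)).
import torch
import torch.nn as nn

class ProbabilisticEventHead(nn.Module):
    def __init__(self, d_in=32, d_out=16):
        super().__init__()
        self.mu = nn.Linear(d_in, d_out)
        self.log_sigma = nn.Linear(d_in, d_out)

    def forward(self, h):
        return self.mu(h), self.log_sigma(h).exp()

head = ProbabilisticEventHead()
h = torch.randn(8, 32)          # embeddings of 8 network events (stand-ins)
mu, sigma = head(h)
c = torch.zeros(16)             # hypersphere center
score = ((mu - c) ** 2).sum(dim=1) + (sigma ** 2).sum(dim=1)
print(score.shape)              # torch.Size([8]); larger = more anomalous
```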

[LG-43] RewardRank: Optimizing True Learning-to-Rank Utility

链接: https://arxiv.org/abs/2508.14180
作者: Gaurav Bhatt,Kiran Koshy Thekumparampil,Tanmay Gangwani,Tesi Xiao,Leonid Sigal
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a rank list where items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses often misalign with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground truth for unseen permutations, we introduce two automated protocols: (i) KD-Eval, using a position-aware oracle for counterfactual reward estimation, and (ii) LLM-Eval, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: this https URL

[LG-44] Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation

链接: https://arxiv.org/abs/2508.14143
作者: Xin Li
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Intelligence is fundamentally non-ergodic: it emerges not from uniform sampling or optimization from scratch, but from the structured reuse of prior inference trajectories. We introduce Memory-Amortized Inference (MAI) as a formal framework in which cognition is modeled as inference over latent cycles in memory, rather than recomputation through gradient descent. MAI systems encode inductive biases via structural reuse, minimizing entropy and enabling context-aware, structure-preserving inference. This approach reframes cognitive systems not as ergodic samplers, but as navigators over constrained latent manifolds, guided by persistent topological memory. Through the lens of delta-homology, we show that MAI provides a principled foundation for Mountcastle’s Universal Cortical Algorithm, modeling each cortical column as a local inference operator over cycle-consistent memory states. Furthermore, we establish a time-reversal duality between MAI and reinforcement learning: whereas RL propagates value forward from reward, MAI reconstructs latent causes backward from memory. This inversion paves a path toward energy-efficient inference and addresses the computational bottlenecks facing modern AI. MAI thus offers a unified, biologically grounded theory of intelligence based on structure, reuse, and memory. We also briefly discuss the profound implications of MAI for achieving artificial general intelligence (AGI).

[LG-45] Learning to Learn the Macroscopic Fundamental Diagram using Physics-Informed and meta Machine Learning techniques

链接: https://arxiv.org/abs/2508.14137
作者: Amalie Roark,Serio Agriesti,Francisco Camara Pereira,Guido Cantelmo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Macroscopic Fundamental Diagram (MFD) is a popular tool used to describe traffic dynamics in an aggregated way, with applications ranging from traffic control to incident analysis. However, estimating the MFD for a given network requires large numbers of loop detectors, which are not always available in practice. This article proposes a framework harnessing meta-learning, a subcategory of machine learning that trains models to understand and adapt to new tasks on their own, to alleviate the data scarcity challenge. The developed model is trained and tested by leveraging data from multiple cities and exploiting it to model the MFD of other cities with different shares of detectors and topological structures. The proposed meta-learning framework is applied to an ad-hoc Multi-Task Physics-Informed Neural Network, specifically designed to estimate the MFD. Results show an average MSE improvement in flow prediction ranging between ~17500 and ~36000 (depending on the subset of loop detectors tested). The meta-learning framework thus successfully generalizes across diverse urban settings and improves performance on cities with limited data, demonstrating the potential of using meta-learning when a limited number of detectors is available. Finally, the proposed framework is validated against traditional transfer learning approaches and tested with FitFun, a non-parametric model from the literature, to prove its transferability.

[LG-46] Topological Data Analysis for Unsupervised Anomaly Detection and Customer Segmentation on Banking Data

链接: https://arxiv.org/abs/2508.14136
作者: Leonardo Aldo Alejandro Barberi,Linda Maria De Cave
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*备注:

点击查看摘要

Abstract:This paper introduces advanced techniques of Topological Data Analysis (TDA) for unsupervised anomaly detection and customer segmentation in banking data. Using the Mapper algorithm and persistent homology, we develop unsupervised procedures that uncover meaningful patterns in customers' banking data by exploiting topological information. The framework we present in this paper yields actionable insights that combine the abstract mathematical subject of topology with real-life industry use cases.

[LG-47] Towards Agent-based Test Support Systems: An Unsupervised Environment Design Approach

链接: https://arxiv.org/abs/2508.14135
作者: Collins O.Ogbodo,Timothy J. Rogers,Mattia Dal Borgo,David J. Wagg
类目: Machine Learning (cs.LG)
*备注: 17 pages, 11 figures; currently under peer review

点击查看摘要

Abstract:Modal testing plays a critical role in structural analysis by providing essential insights into dynamic behaviour across a wide range of engineering industries. In practice, designing an effective modal test campaign involves complex experimental planning, comprising a series of interdependent decisions that significantly influence the final test outcome. Traditional approaches to test design are typically static, focusing only on global tests without accounting for evolving test campaign parameters or the impact of such changes on previously established decisions, such as sensor configurations, which have been found to significantly influence test outcomes. These rigid methodologies often compromise test accuracy and adaptability. To address these limitations, this study introduces an agent-based decision support framework for adaptive sensor placement across dynamically changing modal test environments. The framework formulates the problem using an underspecified partially observable Markov decision process, enabling the training of a generalist reinforcement learning agent through a dual-curriculum learning strategy. A detailed case study on a steel cantilever structure demonstrates the efficacy of the proposed method in optimising sensor locations across frequency segments, validating its robustness and real-world applicability in experimental settings.

[LG-48] Comparison of derivative-free and gradient-based minimization for multi-objective compositional design of shape memory alloys

链接: https://arxiv.org/abs/2508.14127
作者: S. Josyula,Y. Noiman,E. J. Payton,T. Giovannelli
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Designing shape memory alloys (SMAs) that meet performance targets while remaining affordable and sustainable is a complex challenge. In this work, we focus on optimizing SMA compositions to achieve a desired martensitic start temperature (Ms) while minimizing cost. To do this, we use machine learning models as surrogate predictors and apply numerical optimization methods to search for suitable alloy combinations. We trained two types of machine learning models, a tree-based ensemble and a neural network, using a dataset of experimentally characterized alloys and physics-informed features. The tree-based model was used with a derivative-free optimizer (COBYLA), while the neural network, which provides gradient information, was paired with a gradient-based optimizer (TRUST-CONSTR). Our results show that while both models predict Ms with similar accuracy, the optimizer paired with the neural network finds better solutions more consistently. COBYLA often converged to suboptimal results, especially when the starting guess was far from the target. The TRUST-CONSTR method showed more stable behavior and was better at reaching alloy compositions that met both objectives. This study demonstrates a practical approach to exploring new SMA compositions by combining physics-informed data, machine learning models, and optimization algorithms. Although the scale of our dataset is smaller than simulation-based efforts, the use of experimental data improves the reliability of the predictions. The approach can be extended to other materials where design trade-offs must be made with limited data.

[LG-49] From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery

链接: https://arxiv.org/abs/2508.14111
作者: Jiaqi Wei,Yuejin Yang,Xiang Zhang,Yuhan Chen,Xiang Zhuang,Zhangyang Gao,Dongzhan Zhou,Guangshuai Wang,Zhiqiang Gao,Juntai Cao,Zijie Qiu,Xuming He,Qiang Zhang,Chenyu You,Shuangjia Zheng,Ning Ding,Wanli Ouyang,Nanqing Dong,Yu Cheng,Siqi Sun,Lei Bai,Bowen Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement – behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives – process-oriented, autonomy-oriented, and mechanism-oriented – through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.

[LG-50] Beyond Fixed Morphologies: Learning Graph Policies with Trust Region Compensation in Variable Action Spaces

链接: https://arxiv.org/abs/2508.14102
作者: Thomas Gallien
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Trust region-based optimization methods have become foundational reinforcement learning algorithms that offer stability and strong empirical performance in continuous control tasks. Growing interest in scalable and reusable control policies also translates into a demand for morphological generalization, the ability of control policies to cope with different kinematic structures. Graph-based policy architectures provide a natural and effective mechanism to encode such structural differences. However, while these architectures accommodate variable morphologies, the behavior of trust region methods under varying action space dimensionality remains poorly understood. To this end, we conduct a theoretical analysis of trust region-based policy optimization methods, focusing on both Trust Region Policy Optimization (TRPO) and its widely used first-order approximation, Proximal Policy Optimization (PPO). The goal is to demonstrate how varying action space dimensionality influences the optimization landscape, particularly under the constraints imposed by KL-divergence or policy clipping penalties. Complementing the theoretical insights, an empirical evaluation under morphological variation is carried out using the Gymnasium Swimmer environment. This benchmark offers a systematically controlled setting for varying the kinematic structure without altering the underlying task, making it particularly well-suited to study morphological generalization.

[LG-51] Physics-Informed Reward Machines

链接: https://arxiv.org/abs/2508.14093
作者: Daniel Ajeleye,Ashutosh Trivedi,Majid Zamani
类目: Machine Learning (cs.LG)
*备注: 20 pages, currently under review in a conference

点击查看摘要

Abstract:Reward machines (RMs) provide a structured way to specify non-Markovian rewards in reinforcement learning (RL), thereby improving both expressiveness and programmability. Viewed more broadly, they separate what is known about the environment, captured by the reward mechanism, from what remains unknown and must be discovered through sampling. This separation supports techniques such as counterfactual experience generation and reward shaping, which reduce sample complexity and speed up learning. We introduce physics-informed reward machines (pRMs), symbolic machines designed to express complex learning objectives and reward structures for RL agents, thereby enabling more programmable, expressive, and efficient learning. We present RL algorithms capable of exploiting pRMs via counterfactual experiences and reward shaping. Our experimental results show that these techniques accelerate reward acquisition during the training phases of RL. We demonstrate the expressiveness and effectiveness of pRMs through experiments in both finite and continuous physical environments, illustrating that incorporating pRMs significantly improves learning efficiency across several control tasks.
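For context, a plain (non-physics-informed) reward machine is just a finite automaton over high-level events that emits rewards; the toy below shows the mechanics, with events and rewards invented for illustration.

```python
# Minimal reward machine: (state, event) -> (next_state, reward). Tracking
# the RM state alongside the environment state makes a non-Markovian
# objective ("open the door after getting the key") Markovian again.
class RewardMachine:
    def __init__(self):
        self.delta = {
            ("u0", "got_key"):   ("u1", 0.0),
            ("u1", "door_open"): ("u2", 1.0),   # task complete
        }
        self.state = "u0"

    def step(self, event):
        self.state, reward = self.delta.get((self.state, event),
                                            (self.state, 0.0))
        return reward

rm = RewardMachine()
for e in ["door_open", "got_key", "door_open"]:
    print(e, "->", rm.step(e), "state:", rm.state)
# Opening the door before getting the key yields no reward; order matters.
```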

[LG-52] Systematic FAIRness Assessment of Open Voice Biomarker Datasets for Mental Health and Neurodegenerative Diseases

链接: https://arxiv.org/abs/2508.14089
作者: Ishaan Mahapatra,Nihar R. Mahapatra
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: To appear in the Proceedings of the 28th International Conference on Text, Speech and Dialogue (TSD 2025), Erlangen, Germany, August 25-28, 2025

点击查看摘要

Abstract:Voice biomarkers, i.e., human-generated acoustic signals such as speech, coughing, and breathing, are promising tools for scalable, non-invasive detection and monitoring of mental health and neurodegenerative diseases. Yet, their clinical adoption remains constrained by the inconsistent quality and limited usability of publicly available datasets. To address this gap, we present the first systematic FAIR (Findable, Accessible, Interoperable, Reusable) evaluation of 27 publicly available voice biomarker datasets focused on these disease areas. Using the FAIR Data Maturity Model and a structured, priority-weighted scoring method, we assessed FAIRness at subprinciple, principle, and composite levels. Our analysis revealed consistently high Findability but substantial variability and weaknesses in Accessibility, Interoperability, and Reusability. Mental health datasets exhibited greater variability in FAIR scores, while neurodegenerative datasets were slightly more consistent. Repository choice also significantly influenced FAIRness scores. To enhance dataset quality and clinical utility, we recommend adopting structured, domain-specific metadata standards, prioritizing FAIR-compliant repositories, and routinely applying structured FAIR evaluation frameworks. These findings provide actionable guidance to improve dataset interoperability and reuse, thereby accelerating the clinical translation of voice biomarker technologies.

[LG-53] EEGDM: EEG Representation Learning via Generative Diffusion Model

链接: https://arxiv.org/abs/2508.14086
作者: Jia Hong Puah,Sim Kuan Goh,Ziwei Zhang,Zixuan Ye,Chow Khuen Chan,Kheng Seang Lim,Si Lei Fong,Kok Sin Woon
类目: Machine Learning (cs.LG)
*备注: EEGDM Preprint

点击查看摘要

Abstract:While electroencephalogram (EEG) has been a crucial tool for monitoring the brain and diagnosing neurological disorders (e.g., epilepsy), learning meaningful representations from raw EEG signals remains challenging due to limited annotations and high signal variability. Recently, EEG foundation models (FMs) have shown promising potential by adopting transformer architectures and self-supervised pre-training methods from large language models (e.g., masked prediction) to learn representations from diverse EEG data, followed by fine-tuning on specific EEG tasks. Nonetheless, these large models often incur high computational costs during both training and inference, with only marginal performance improvements as model size increases. In this work, we proposed an EEG representation learning framework built upon a Generative Diffusion Model (EEGDM). Specifically, we developed a structured state-space model for diffusion pretraining (SSMDP) to better capture the temporal dynamics of EEG signals and trained the architecture using a Denoising Diffusion Probabilistic Model. The resulting latent EEG representations were then used for downstream classification tasks via our proposed latent fusion transformer (LFT). To evaluate our method, we used the multi-event Temple University EEG Event Corpus and compared EEGDM with current state-of-the-art approaches, including EEG FMs. Empirical results showed that our method outperformed existing methods while being approximately 19x more lightweight. These findings suggested that EEGDM offered a promising alternative to current FMs. Our code is available at: this https URL.

[LG-54] Parameter-Aware Ensemble SINDy for Interpretable Symbolic SGS Closure

链接: https://arxiv.org/abs/2508.14085
作者: Hanseul Kang,Shervin Karimkashi,Ville Vuorinen
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:We present a scalable, parameter-aware sparse regression framework for discovering interpretable partial differential equations and subgrid-scale closures from multi-parameter simulation data. Building on SINDy (Sparse Identification of Nonlinear Dynamics), our approach addresses key limitations through four innovations: symbolic parameterisation enabling physical parameters to vary within unified regression; a Dimensional Similarity Filter enforcing unit-consistency whilst reducing candidate libraries; memory-efficient Gram-matrix accumulation enabling batch processing; and ensemble consensus with coefficient stability analysis for robust model identification. Validation on canonical one-dimensional benchmarks demonstrates reliable recovery of governing equations across parameter ranges. Applied to filtered Burgers datasets, the framework discovers an SGS closure \tau_{\mathrm{SGS}} = 0.1603 \cdot \Delta^2 \left( \frac{\partial \bar{u}}{\partial x} \right)^2, corresponding to a Smagorinsky constant of approximately 0.4004. This represents autonomous discovery of a Smagorinsky-type closure structure from data without prior theoretical assumptions. The discovered model achieves R^2 = 0.886 across filter scales and demonstrates improved prediction accuracy compared to classical closures. The framework's ability to identify physically meaningful SGS forms and calibrate coefficients offers a complementary approach to existing turbulence modelling methods, contributing to the growing field of data-driven closure discovery.
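To show the selection mechanism on a toy scale (synthetic data and coefficients, not the paper's filtered Burgers setup), here is a sequential-thresholding sparse regression over a tiny candidate library that recovers a planted Smagorinsky-type term:

```python
# STLSQ-style sparse regression: only the Delta^2*(du/dx)^2 column survives.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 256)
u = np.sin(x) + 0.1 * rng.standard_normal(256)
dudx = np.gradient(u, x)
Delta = x[1] - x[0]

# Candidate library: [u, du/dx, Delta^2*(du/dx)^2, u*du/dx]
Theta = np.column_stack([u, dudx, Delta**2 * dudx**2, u * dudx])
tau = 0.16 * Delta**2 * dudx**2 + 1e-6 * rng.standard_normal(256)  # planted term

xi = np.linalg.lstsq(Theta, tau, rcond=None)[0]
for _ in range(10):                        # sequential thresholding
    small = np.abs(xi) < 1e-3
    xi[small] = 0.0
    keep = ~small
    if keep.any():
        xi[keep] = np.linalg.lstsq(Theta[:, keep], tau, rcond=None)[0]
print("recovered coefficients:", np.round(xi, 4))   # ~[0, 0, 0.16, 0]
```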

[LG-55] Toward Generalist Semi-supervised Regression via Decoupled Representation Distillation

链接: https://arxiv.org/abs/2508.14082
作者: Ye Su,Hezhe Qiao,Wei Huang,Lin Chen
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Semi-supervised regression (SSR), which aims to predict continuous scores of samples while reducing reliance on a large amount of labeled data, has recently received considerable attention across various applications, including computer vision, natural language processing, and audio and medical analysis. Existing semi-supervised methods typically apply consistency regularization on the general regression task by generating pseudo-labels. However, these methods heavily rely on the quality of pseudo-labels, and direct regression fails to learn the label distribution and can easily lead to overfitting. To address these challenges, we introduce an end-to-end Decoupled Representation distillation framework (DRILL) which is specially designed for the semi-supervised regression task where we transform the general regression task into a Discrete Distribution Estimation (DDE) task over multiple buckets to better capture the underlying label distribution and mitigate the risk of overfitting associated with direct regression. Then we employ the Decoupled Distribution Alignment (DDA) to align the target bucket and non-target bucket between teacher and student on the distribution of buckets, encouraging the student to learn more robust and generalized knowledge from the teacher. Extensive experiments conducted on datasets from diverse domains demonstrate that the proposed DRILL has strong generalization and outperforms the competing methods.

[LG-56] Toward Lifelong Learning in Equilibrium Propagation: Sleep-like and Awake Rehearsal for Enhanced Stability

链接: https://arxiv.org/abs/2508.14081
作者: Yoshimasa Kubo,Jean Erik Delanois,Maxim Bazhenov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) trained using Equilibrium Propagation (EP), a biologically plausible training algorithm, have demonstrated strong performance in various tasks such as image classification and reinforcement learning. However, these networks face a critical challenge in continuous learning: catastrophic forgetting, where previously acquired knowledge is overwritten when new tasks are learned. This limitation contrasts with the human brain’s ability to retain and integrate both old and new knowledge, aided by processes like memory consolidation during sleep through the replay of learned information. To address this challenge in RNNs, here we propose a sleep-like replay consolidation (SRC) algorithm for EP-trained RNNs. We found that SRC significantly improves RNN’s resilience to catastrophic forgetting in continuous learning scenarios. In class-incremental learning with SRC implemented after each new task training, the EP-trained multilayer RNN model (MRNN-EP) performed significantly better compared to feedforward networks incorporating several well-established regularization techniques. The MRNN-EP performed on par with MRNN trained using Backpropagation Through Time (BPTT) when both were equipped with SRC on MNIST data and surpassed BPTT-based models on the Fashion MNIST, Kuzushiji-MNIST, CIFAR10, and ImageNet datasets. Combining SRC with rehearsal, also known as “awake replay”, further boosted the network’s ability to retain long-term knowledge while continuing to learn new tasks. Our study reveals the applicability of sleep-like replay techniques to RNNs and highlights the potential for integrating human-like learning behaviors into artificial neural networks (ANNs).

[LG-57] KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge

链接: https://arxiv.org/abs/2508.14080
作者: Guanghao Jin,Jingpei Wu,Tianpei Guo,Yiyi Niu,Weidong Zhou,Guoyang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model’s robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation metrics to systematically explore the model’s internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks. Furthermore, we observe a decoupling between textual understanding and visual grounding in MLLMs, where many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on our benchmark and hinder genuine multimodal reasoning. We anticipate that the proposed benchmark will inspire future research towards developing more robust, interpretable, and knowledge-intensive visual grounding frameworks, driving the development of more reliable and robust multimodal systems for complex real-world scenarios.

[LG-58] A Guide to Robust Generalization: The Impact of Architecture Pre-training and Optimization Strategy

链接: https://arxiv.org/abs/2508.14079
作者: Maxime Heuillet,Rishika Bhagwatkar,Jonas Ngnawé,Yann Pequignot,Alexandre Larouche,Christian Gagné,Irina Rish,Ola Ahmad,Audrey Durand
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models operating in the image domain are vulnerable to small input perturbations. For years, robustness to such perturbations was pursued by training models from scratch (i.e., with random initializations) using specialized loss objectives. Recently, robust fine-tuning has emerged as a more efficient alternative: instead of training from scratch, pretrained models are adapted to maximize predictive performance and robustness. To conduct robust fine-tuning, practitioners design an optimization strategy that includes the model update protocol (e.g., full or partial) and the specialized loss objective. Additional design choices include the architecture type and size, and the pretrained representation. These design choices affect robust generalization, which is the model’s ability to maintain performance when exposed to new and unseen perturbations at test time. Understanding how these design choices influence generalization remains an open question with significant practical implications. In response, we present an empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation protocols, yielding 1,440 training configurations and 7,200 robustness measurements across five perturbation types. To our knowledge, this is the most diverse and comprehensive benchmark of robust fine-tuning to date. While attention-based architectures and robust pretrained representations are increasingly popular, we find that convolutional neural networks pretrained in a supervised manner on large datasets often perform best. Our analysis both confirms and challenges prior design assumptions, highlighting promising research directions and offering practical guidance.

[LG-59] Out-of-Sample Hydrocarbon Production Forecasting: Time Series Machine Learning using Productivity Index-Driven Features and Inductive Conformal Prediction

链接: https://arxiv.org/abs/2508.14078
作者: Mohamed Hassan Abdalla Idris,Jakub Marek Cebula,Jebraeel Gholinezhad,Shamsul Masum,Hongjie Ma
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:This research introduces a new ML framework designed to enhance the robustness of out-of-sample hydrocarbon production forecasting, specifically addressing multivariate time series analysis. The proposed methodology integrates Productivity Index (PI)-driven feature selection, a concept derived from reservoir engineering, with Inductive Conformal Prediction (ICP) for rigorous uncertainty quantification. Utilizing historical data from the Volve (wells PF14, PF12) and Norne (well E1H) oil fields, this study investigates the efficacy of various predictive algorithms, namely Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and eXtreme Gradient Boosting (XGBoost), in forecasting historical oil production rates (OPR_H). All models produced out-of-sample production forecasts for an upcoming future timeframe. Model performance was comprehensively evaluated using traditional error metrics (e.g., MAE) supplemented by Forecast Bias and Prediction Direction Accuracy (PDA) to assess bias and trend-capturing capabilities. The PI-based feature selection effectively reduced input dimensionality compared to conventional numerical simulation workflows. Uncertainty quantification was addressed using the ICP framework, a distribution-free approach that guarantees valid prediction intervals (e.g., 95% coverage) without reliance on distributional assumptions, offering a distinct advantage over traditional confidence intervals, particularly for complex, non-normal data. Results demonstrated the superior performance of the LSTM model, achieving the lowest MAE on test (19.468) and genuine out-of-sample forecast data (29.638) for well PF14, with subsequent validation on Norne well E1H. These findings highlight the significant potential of combining domain-specific knowledge with advanced ML techniques to improve the reliability of hydrocarbon production forecasts.

[LG-60] Multi-Objective Bayesian Optimization with Independent Tanimoto Kernel Gaussian Processes for Diverse Pareto Front Exploration

Link: https://arxiv.org/abs/2508.14072
Authors: Anabel Yong
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: Master of Science thesis


Abstract:We present GP-MOBO, a novel multi-objective Bayesian Optimization algorithm that advances the state-of-the-art in molecular optimization. Our approach integrates a fast minimal package for Exact Gaussian Processes (GPs) capable of efficiently handling the full dimensionality of sparse molecular fingerprints without the need for extensive computational resources. GP-MOBO consistently outperforms traditional methods like GP-BO by fully leveraging fingerprint dimensionality, leading to the identification of higher-quality and valid SMILES. Moreover, our model achieves a broader exploration of the chemical search space, as demonstrated by its superior proximity to the Pareto front in all tested scenarios. Empirical results from the DockSTRING dataset reveal that GP-MOBO yields higher geometric mean values across 20 Bayesian optimization iterations, underscoring its effectiveness and efficiency in addressing complex multi-objective optimization challenges with minimal computational overhead.
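A minimal sketch of the kernel at the heart of this setting, under the usual definition of the Tanimoto (Jaccard) kernel on binary fingerprints; a GP posterior mean is included to show how it plugs into exact GP regression. This is illustrative, not the author's package.

```python
import numpy as np

def tanimoto_kernel(X, Y):
    """k(x, y) = <x, y> / (<x, x> + <y, y> - <x, y>) on fingerprint rows."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    cross = X @ Y.T
    denom = (X * X).sum(1)[:, None] + (Y * Y).sum(1)[None, :] - cross
    return cross / np.maximum(denom, 1e-12)  # guard against all-zero rows

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-3):
    """Exact GP regression mean with the Tanimoto kernel."""
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    return tanimoto_kernel(X_test, X_train) @ np.linalg.solve(K, y_train)
```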

[LG-61] Load Forecasting on A Highly Sparse Electrical Load Dataset Using Gaussian Interpolation

Link: https://arxiv.org/abs/2508.14069
Authors: Chinmoy Biswas, Nafis Faisal, Vivek Chowdhury, Abrar Al-Shadid Abir, Sabir Mahmud, Mithon Rahman, Shaikh Anowarul Fattah, Hafiz Imtiaz
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Under review in Elsevier Electric Power Systems Research


Abstract:Sparsity, defined as the presence of missing or zero values in a dataset, often poses a major challenge when operating on real-life datasets. Sparsity in the features or targets of the training data can be handled with various interpolation methods, such as linear or polynomial interpolation, splines, or moving averages, or the missing values can simply be imputed. Interpolation methods usually perform well on Strict-Sense Stationary (SSS) data. In this study, we show that an approximately 62% sparse dataset of hourly load data from a power plant can be used for load forecasting, assuming the data is Wide-Sense Stationary (WSS), if it is augmented with Gaussian interpolation. More specifically, we perform statistical analysis on the data and train multiple machine learning and deep learning models on the dataset. By comparing the performance of these models, we empirically demonstrate that Gaussian interpolation is a suitable option for load forecasting problems, and that a Long Short-Term Memory (LSTM)-based neural network model offers the best performance among a diverse set of classical and neural network-based models.
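The abstract does not spell out the exact interpolation scheme, so the sketch below shows one plausible reading: a Gaussian-kernel (Nadaraya-Watson style) weighted average that fills missing hourly loads from nearby observed hours. The bandwidth and function names are our assumptions.

```python
import numpy as np

def gaussian_interpolate(t_obs, y_obs, t_query, bandwidth=3.0):
    """Impute missing hourly loads as a Gaussian-weighted average of the
    observed samples; nearby hours dominate for small bandwidths."""
    t_obs = np.asarray(t_obs, dtype=float)
    diff = np.asarray(t_query, dtype=float)[:, None] - t_obs
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)
    return (w @ np.asarray(y_obs, dtype=float)) / w.sum(axis=1)

# y_filled = gaussian_interpolate(observed_hours, observed_load, missing_hours)
```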

[LG-62] Personalized Contest Recommendation in Fantasy Sports

Link: https://arxiv.org/abs/2508.14065
Authors: Madiraju Srilakshmi, Kartavya Kothari, Kamlesh Marathe, Vedavyas Chigurupati, Hitesh Kapoor
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)


Abstract:In daily fantasy sports, players enter into “contests” where they compete against each other by building teams of athletes that score fantasy points based on what actually occurs in a real-life sports match. For any given sports match, there are a multitude of contests available to players, with substantial variation across 3 main dimensions: entry fee, number of spots, and the prize pool distribution. As player preferences are also quite heterogeneous, contest personalization is an important tool to match players with contests. This paper presents a scalable contest recommendation system, powered by a Wide and Deep Interaction Ranker (WiDIR) at its core. We productionized this system at our company, one of the large fantasy sports platforms with millions of daily contests and millions of players, where online experiments show a marked improvement over other candidate models in terms of recall and other critical business metrics.
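The paper does not publish WiDIR's architecture details, but the wide-and-deep idea it builds on is standard: a linear "wide" path over (cross) features plus an MLP "deep" path. A minimal PyTorch sketch with made-up dimensions:

```python
import torch
import torch.nn as nn

class WideAndDeepRanker(nn.Module):
    """Minimal wide-and-deep scorer: a linear 'wide' path over cross
    features plus an MLP 'deep' path over dense embeddings."""
    def __init__(self, n_wide, n_deep, hidden=64):
        super().__init__()
        self.wide = nn.Linear(n_wide, 1)
        self.deep = nn.Sequential(
            nn.Linear(n_deep, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x_wide, x_deep):
        return torch.sigmoid(self.wide(x_wide) + self.deep(x_deep)).squeeze(-1)

# scores = WideAndDeepRanker(20, 32)(torch.randn(8, 20), torch.randn(8, 32))
```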

[LG-63] Graph Neural Network for Product Recommendation on the Amazon Co-purchase Graph

Link: https://arxiv.org/abs/2508.14059
Authors: Mengyang Cao, Frank F. Yang, Yi Jin, Yijun Yan
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 15 pages, 5 figures, preprint


Abstract:Identifying relevant information among massive volumes of data is a challenge for modern recommendation systems. Graph Neural Networks (GNNs) have demonstrated significant potential by utilizing structural and semantic relationships through graph-based learning. This study assessed the abilities of four GNN architectures, LightGCN, GraphSAGE, GAT, and PinSAGE, on the Amazon Product Co-purchase Network under link prediction settings. We examined practical trade-offs across architectures in terms of model performance, scalability, training complexity, and generalization. The outcomes characterize each model's performance for deploying GNNs in real-world recommendation scenarios.
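To make the link-prediction setting concrete, here is a GraphSAGE-style layer with mean aggregation and a dot-product link scorer in plain PyTorch. A dense adjacency keeps the sketch short; real co-purchase graphs need sparse ops and neighbour sampling.

```python
import torch
import torch.nn as nn

class MeanSAGELayer(nn.Module):
    """One GraphSAGE convolution with mean aggregation, written without a
    graph library: h_v' = ReLU(W [h_v ; mean of neighbour h_u])."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)
    def forward(self, h, adj):
        # adj: dense 0/1 co-purchase adjacency, shape (N, N)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ h / deg
        return torch.relu(self.lin(torch.cat([h, neigh], dim=1)))

def link_score(h, i, j):
    """Dot-product link prediction score between product embeddings."""
    return (h[i] * h[j]).sum(-1)

# h1 = MeanSAGELayer(64, 32)(h0, adj); score = link_score(h1, i=0, j=5)
```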

[LG-64] Deep Learning for School Dropout Detection: A Comparison of Tabular and Graph-Based Models for Predicting At-Risk Students

Link: https://arxiv.org/abs/2508.14057
Authors: Pablo G. Almeida, Guilherme A. L. Silva, Valéria Santos, Gladston Moreira, Pedro Silva, Eduardo Luz
Subjects: Machine Learning (cs.LG)
Comments: 12 pages


Abstract:Student dropout is a significant challenge in educational systems worldwide, leading to substantial social and economic costs. Predicting students at risk of dropout allows for timely interventions. While traditional Machine Learning (ML) models operating on tabular data have shown promise, Graph Neural Networks (GNNs) offer a potential advantage by capturing complex relationships inherent in student data if structured as graphs. This paper investigates whether transforming tabular student data into graph structures, primarily using clustering techniques, enhances dropout prediction accuracy. We compare the performance of GNNs (a custom Graph Convolutional Network (GCN) and GraphSAGE) on these generated graphs against established tabular models (Random Forest (RF), XGBoost, and TabNet) using a real-world student dataset. Our experiments explore various graph construction strategies based on different clustering algorithms (K-Means, HDBSCAN) and dimensionality reduction techniques (Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP)). Our findings demonstrate that a specific GNN configuration, GraphSAGE on a graph derived from PCA-KMeans clustering, achieved superior performance, notably improving the macro F1-score by approximately 7 percentage points and accuracy by nearly 2 percentage points over the strongest tabular baseline (XGBoost). However, other GNN configurations and graph construction methods did not consistently surpass tabular models, emphasizing the critical role of the graph generation strategy and GNN architecture selection. This highlights both the potential of GNNs and the challenges in optimally transforming tabular data for graph-based learning in this domain.
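A sketch of the winning graph-construction recipe as we read it: reduce with PCA, cluster with K-Means, then add intra-cluster k-nearest-neighbour edges. The hyperparameters and the kNN step are our illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def tabular_to_graph(X, n_components=8, n_clusters=10, k=5):
    """Build edges from tabular student features: reduce with PCA, cluster
    with K-Means, then connect k nearest neighbours within each cluster."""
    Z = PCA(n_components=n_components).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
    edges = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) <= 1:
            continue
        A = kneighbors_graph(Z[idx], min(k, len(idx) - 1), mode="connectivity")
        for a, b in zip(*A.nonzero()):
            edges.append((idx[a], idx[b]))  # map back to global node ids
    return np.array(edges).T  # shape (2, num_edges), ready for a GNN library
```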

[LG-65] The C-index Multiverse

Link: https://arxiv.org/abs/2508.14821
Authors: Begoña B. Sierra, Colin McLean, Peter S. Hall, Catalina A. Vallejos
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 21 pages main text with 6 figures and 3 tables; 19 pages of supplementary material


Abstract:Quantifying out-of-sample discrimination performance for time-to-event outcomes is a fundamental step for model evaluation and selection in the context of predictive modelling. The concordance index, or C-index, is a widely used metric for this purpose, particularly with the growing development of machine learning methods. Beyond differences between proposed C-index estimators (e.g. Harrell's, Uno's and Antolini's), we demonstrate the existence of a C-index multiverse among available R and Python software, where seemingly equal implementations can yield different results. This can undermine reproducibility and complicate fair comparisons across models and studies. Key sources of variation include tie handling and the adjustment for censoring. Additionally, the absence of a standardised approach to summarising risk from survival distributions results in another source of variation that depends on the input types. We demonstrate the consequences of the C-index multiverse when quantifying predictive performance for several survival models (from Cox proportional hazards to recent deep learning approaches) on publicly available breast cancer data and semi-synthetic examples. Our work emphasises the need for better reporting to improve transparency and reproducibility. This article aims to be a useful guideline, helping analysts navigate the multiverse, providing unified documentation, and highlighting potential pitfalls of existing software. All code is publicly available at: this http URL.
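To see where the multiverse comes from, here is one concrete convention: Harrell's C with risk-score ties counted as 0.5 and pairs compared only when the earlier time is an observed event. Changing either convention (or the censoring adjustment) gives a different, equally defensible number.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's C: among comparable pairs (the earlier time is an observed
    event), count concordant risk orderings; ties in risk count 0.5.
    Higher risk is assumed to predict shorter survival."""
    time, event, risk = map(np.asarray, (time, event, risk))
    num, den = 0.0, 0.0
    for i in range(len(time)):
        if not event[i]:
            continue
        comparable = time > time[i]
        den += comparable.sum()
        num += (risk[i] > risk[comparable]).sum()
        num += 0.5 * (risk[i] == risk[comparable]).sum()
    return num / den
```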

[LG-66] Learning from users' behaviour of some well-known congested traffic networks

Link: https://arxiv.org/abs/2508.14804
Authors: Isolda Cardoso, Lucas Venturato, Jorgelina Walpen
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 3 tables


Abstract:We consider the problem of predicting users' behavior on a congested traffic network under an equilibrium condition, known as the traffic assignment problem. We propose a two-stage machine learning approach that couples a neural network with a fixed-point algorithm, and we evaluate its performance on several classical congested traffic networks.
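The second stage of such an approach amounts to solving for an equilibrium, which can be cast as a fixed point x = f(x); a generic sketch of that stage (the learned network would supply f):

```python
import numpy as np

def fixed_point(f, x0, tol=1e-8, max_iter=10_000):
    """Plain fixed-point iteration x <- f(x), stopping once the update is
    small; f would encode the (learned) equilibrium response map."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = f(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# e.g. fixed_point(lambda x: 0.5 * (x + 2 / x), x0=1.0) converges to sqrt(2)
```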

[LG-67] Distributional Adversarial Attacks and Training in Deep Hedging

Link: https://arxiv.org/abs/2508.14757
Authors: Guangyi He, Tobias Sutter, Lukas Gonon
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Preprint. Under review


Abstract:In this paper, we study the robustness of classical deep hedging strategies under distributional shifts by leveraging the concept of adversarial attacks. We first demonstrate that standard deep hedging models are highly vulnerable to small perturbations in the input distribution, resulting in significant performance degradation. Motivated by this, we propose an adversarial training framework tailored to increase the robustness of deep hedging strategies. Our approach extends pointwise adversarial attacks to the distributional setting and introduces a computationally tractable reformulation of the adversarial optimization problem over a Wasserstein ball. This enables the efficient training of hedging strategies that are resilient to distributional perturbations. Through extensive numerical experiments, we show that adversarially trained deep hedging strategies consistently outperform their classical counterparts in terms of out-of-sample performance and resilience to model misspecification. Our findings establish a practical and effective framework for robust deep hedging under realistic market uncertainties.
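A crude sketch of lifting a pointwise attack to a distributional one: perturb a whole batch of simulated market paths under an average-norm budget, a rough proxy for a Wasserstein ball around the empirical law. The paper's tractable reformulation differs in detail; the names and budget here are our assumptions.

```python
import torch

def distributional_attack(loss_fn, paths, eps=0.05, steps=10, lr=0.01):
    """Ascend the hedging loss over a batch perturbation whose *average*
    L2 norm is capped, so the empirical distribution moves, not just
    individual samples."""
    delta = torch.zeros_like(paths, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(paths + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad  # ascend: make the hedge perform worse
            avg_norm = delta.flatten(1).norm(dim=1).mean()
            if avg_norm > eps:
                delta *= eps / avg_norm  # project back onto the budget
    return (paths + delta).detach()
```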

[LG-68] Evaluation and Optimization of Leave-one-out Cross-validation for the Lasso

Link: https://arxiv.org/abs/2508.14368
Authors: Ryan Burn
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 18 pages, 3 figures, 7 tables


Abstract:I develop an algorithm to produce the piecewise quadratic that computes leave-one-out cross-validation for the lasso as a function of its hyperparameter. The algorithm can be used to find exact hyperparameters that optimize leave-one-out cross-validation either globally or locally, and its practicality is demonstrated on real-world data sets.
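For contrast with the exact piecewise-quadratic algorithm, here is the brute-force baseline it replaces: evaluate the leave-one-out error on a grid of hyperparameter values with scikit-learn. The exact method recovers this curve everywhere, not just at grid points.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_lasso_curve(X, y, lambdas):
    """Brute-force LOO mean squared error at each hyperparameter value."""
    errs = []
    for lam in lambdas:
        scores = cross_val_score(Lasso(alpha=lam), X, y,
                                 cv=LeaveOneOut(),
                                 scoring="neg_mean_squared_error")
        errs.append(-scores.mean())
    return np.array(errs)

# errs = loo_lasso_curve(X, y, np.logspace(-3, 1, 50))
```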

[LG-69] Comparing Model-agnostic Feature Selection Methods through Relative Efficiency

Link: https://arxiv.org/abs/2508.14268
Authors: Chenghui Zheng, Garvesh Raskutti
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)


Abstract:Feature selection and importance estimation in a model-agnostic setting is an ongoing challenge of significant interest. Wrapper methods are commonly used because they are typically model-agnostic, even though they are computationally intensive. In this paper, we focus on feature selection methods related to the Generalized Covariance Measure (GCM) and Leave-One-Covariate-Out (LOCO) estimation, and provide a comparison based on relative efficiency. In particular, we present a theoretical comparison under three model settings: linear models, non-linear additive models, and single index models that mimic a single-layer neural network. We complement this with extensive simulations and real data examples. Our theoretical results, along with empirical findings, demonstrate that GCM-related methods generally outperform LOCO under suitable regularity conditions. Furthermore, we quantify the asymptotic relative efficiency of these approaches. Our simulations and real data analysis include widely used machine learning methods such as neural networks and gradient boosting trees.
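A minimal sketch of the LOCO side of the comparison: refit the model without covariate j and report the increase in held-out error. Any scikit-learn regressor can be dropped in; this is the wrapper-method cost the paper's efficiency analysis quantifies.

```python
import numpy as np
from sklearn.base import clone

def loco_importance(model, X_tr, y_tr, X_te, y_te, j):
    """LOCO: refit without covariate j; importance is the test-error increase."""
    full = clone(model).fit(X_tr, y_tr)
    reduced = clone(model).fit(np.delete(X_tr, j, axis=1), y_tr)
    err_full = np.mean((y_te - full.predict(X_te)) ** 2)
    err_red = np.mean((y_te - reduced.predict(np.delete(X_te, j, axis=1))) ** 2)
    return err_red - err_full

# e.g. loco_importance(GradientBoostingRegressor(), X_tr, y_tr, X_te, y_te, j=0)
```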

[LG-70] EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition

Link: https://arxiv.org/abs/2508.14130
Authors: Hugo Thimonier, Antony Perzo, Renaud Seguier
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)


Abstract:Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues, with critical applications in human-computer interaction and mental health monitoring. Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks beyond the purely textual domain. In particular, recent approaches have investigated coupling LLMs with other data modalities by using pre-trained backbones and different fusion mechanisms. This work proposes a novel approach that fine-tunes an LLM on audio and text representations for emotion prediction. Our method first extracts audio features using an audio feature extractor, which are then mapped into the LLM's representation space via a learnable interfacing module. The LLM takes as input (1) the transformed audio features, (2) additional features in the form of natural language (e.g., the transcript), and (3) a textual prompt describing the emotion prediction task. To efficiently adapt the LLM to this multimodal task, we employ Low-Rank Adaptation (LoRA), enabling parameter-efficient fine-tuning. Experimental results on standard emotion recognition benchmarks demonstrate that our model outperforms all but one of the existing speech-text LLMs in the literature, while requiring less than half the parameters of competing approaches. This highlights our approach's effectiveness in integrating multimodal inputs for speech-based emotion understanding while maintaining significant computational efficiency.
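LoRA itself is easy to state: freeze the pretrained weight and learn a low-rank additive update. A self-contained sketch of a LoRA-wrapped linear layer (the rank and scaling values are typical defaults, not the paper's):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # start as an exact no-op
        self.scale = alpha / r
    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# layer = LoRALinear(nn.Linear(768, 768)); y = layer(torch.randn(4, 768))
```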

Information Retrieval

[IR-0] Benefiting from Negative yet Informative Feedback by Contrasting Opposing Sequential Patterns

Link: https://arxiv.org/abs/2508.14786
Authors: Veronika Ivanova, Evgeny Frolov, Alexey Vasilev
Subjects: Information Retrieval (cs.IR)


Abstract:We consider the task of learning from both positive and negative feedback in a sequential recommendation scenario, as both types of feedback are often present in user interactions. Conventional sequential learning models, by contrast, usually focus on modelling and predicting positive interactions, ignoring the fact that demoting items with negative feedback in recommendations improves user satisfaction with the service. Moreover, negative feedback can provide a useful signal for more accurate identification of true user interests. In this work, we propose to train two transformer encoders on separate positive and negative interaction sequences. We incorporate both types of feedback into the training objective of the sequential recommender using a composite loss function that includes positive and negative cross-entropy terms as well as a carefully crafted contrastive term that helps to better model the opposing patterns. We demonstrate the effectiveness of this approach in terms of increased true-positive metrics compared to state-of-the-art sequential recommendation methods, while reducing the number of wrongly promoted negative items.
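The abstract names the ingredients of the composite objective but not its exact form, so the following is a hypothetical shape: cross-entropy on each stream plus a softplus penalty on the similarity between the two encoders' sequence representations. Every weight and term here is an assumption for illustration.

```python
import torch.nn.functional as F

def composite_loss(pos_logits, pos_targets, neg_logits, neg_targets,
                   pos_repr, neg_repr, tau=0.1, w_contrast=0.5):
    """Hypothetical composite objective: next-item cross-entropy on the
    positive and negative streams, plus a term that pushes the two
    encoders' sequence representations apart (penalising similarity)."""
    ce_pos = F.cross_entropy(pos_logits, pos_targets)   # logits (B, V), targets (B,)
    ce_neg = F.cross_entropy(neg_logits, neg_targets)
    sim = F.cosine_similarity(pos_repr, neg_repr, dim=-1) / tau
    contrast = F.softplus(sim).mean()
    return ce_pos + ce_neg + w_contrast * contrast
```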

[IR-1] DGenCTR: Towards a Universal Generative Paradigm for Click-Through Rate Prediction via Discrete Diffusion

Link: https://arxiv.org/abs/2508.14500
Authors: Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 4 figures, 4 tables


Abstract:Recent advances in generative models have inspired the field of recommender systems to explore generative approaches, but most existing research focuses on sequence generation, a paradigm ill-suited for click-through rate (CTR) prediction. CTR models critically depend on a large number of cross-features between the target item and the user to estimate the probability of clicking on the item, and discarding these cross-features will significantly impair model performance. Therefore, to harness the ability of generative models to understand data distributions and thereby alleviate the constraints of traditional discriminative models in label-scarce space, diverging from the item-generation paradigm of sequence generation methods, we propose a novel sample-level generation paradigm specifically designed for the CTR task: a two-stage Discrete Diffusion-Based Generative CTR training framework (DGenCTR). This two-stage framework comprises a diffusion-based generative pre-training stage and a CTR-targeted supervised fine-tuning stage for CTR. Finally, extensive offline experiments and online A/B testing conclusively validate the effectiveness of our framework.

[IR-2] Global-Distribution Aware Scenario-Specific Variational Representation Learning Framework CIKM2025

Link: https://arxiv.org/abs/2508.14493
Authors: Moyu Zhang, Yujun Jin, Jinxin Hu, Yu Zhang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025; 6 pages, 1 figure, 5 tables


Abstract:With the emergence of e-commerce, the recommendations provided by commercial platforms must adapt to diverse scenarios to accommodate users’ varying shopping preferences. Current methods typically use a unified framework to offer personalized recommendations for different scenarios. However, they often employ shared bottom representations, which partially hinders the model’s capacity to capture scenario uniqueness. Ideally, users and items should exhibit specific characteristics in different scenarios, prompting the need to learn scenario-specific representations to differentiate scenarios. Yet, variations in user and item interactions across scenarios lead to data sparsity issues, impeding the acquisition of scenario-specific representations. To learn robust scenario-specific representations, we introduce a Global-Distribution Aware Scenario-Specific Variational Representation Learning Framework (GSVR) that can be directly applied to existing multi-scenario methods. Specifically, considering the uncertainty stemming from limited samples, our approach employs a probabilistic model to generate scenario-specific distributions for each user and item in each scenario, estimated through variational inference (VI). Additionally, we introduce the global knowledge-aware multinomial distributions as prior knowledge to regulate the learning of the posterior user and item distributions, ensuring similarities among distributions for users with akin interests and items with similar side information. This mitigates the risk of users or items with fewer records being overwhelmed in sparse scenarios. Extensive experimental results affirm the efficacy of GSVR in assisting existing multi-scenario recommendation methods in learning more robust representations.

[IR-3] Distribution-Guided Auto-Encoder for User Multimodal Interest Cross Fusion CIKM2025

Link: https://arxiv.org/abs/2508.14485
Authors: Moyu Zhang, Yongxiang Tang, Yujun Jin, Jinxin Hu, Yu Zhang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025; 11 pages, 4 figures, 4 tables


Abstract:Traditional recommendation methods rely on correlating the embedding vectors of item IDs to capture implicit collaborative filtering signals to model the user's interest in the target item. Consequently, traditional ID-based methods often encounter data sparsity problems stemming from the sparse nature of ID features. To alleviate the problem of item ID sparsity, recommendation models incorporate multimodal item information to enhance recommendation accuracy. However, existing multimodal recommendation methods typically employ early fusion approaches, which focus primarily on combining text and image features, while neglecting the contextual influence of user behavior sequences. This oversight prevents dynamic adaptation of multimodal interest representations based on behavioral patterns, consequently restricting the model's capacity to effectively capture user multimodal interests. Therefore, this paper proposes the Distribution-Guided Multimodal-Interest Auto-Encoder (DMAE), which achieves cross fusion of user multimodal interests at the behavioral level. Finally, extensive experiments demonstrate the superiority of DMAE.

[IR-4] Diverse Negative Sampling for Implicit Collaborative Filtering

Link: https://arxiv.org/abs/2508.14468
Authors: Yueqing Xuan, Kacper Sokol, Mark Sanderson, Jeffrey Chan
Subjects: Information Retrieval (cs.IR)


Abstract:Implicit collaborative filtering recommenders are usually trained to learn user positive preferences. Negative sampling, which selects informative negative items to form negative training data, plays a crucial role in this process. Since items are often clustered in the latent space, existing negative sampling strategies normally oversample negative items from the dense regions. This leads to homogeneous negative data and limited model expressiveness. In this paper, we propose Diverse Negative Sampling (DivNS), a novel approach that explicitly accounts for diversity in negative training data during the negative sampling process. DivNS first finds hard negative items with large preference scores and constructs user-specific caches that store unused but highly informative negative samples. Then, its diversity-augmented sampler selects a diverse subset of negative items from the cache while ensuring dissimilarity from the user’s hard negatives. Finally, a synthetic negatives generator combines the selected diverse negatives with hard negatives to form more effective training data. The resulting synthetic negatives are both informative and diverse, enabling recommenders to learn a broader item space and improve their generalisability. Extensive experiments on four public datasets demonstrate the effectiveness of DivNS in improving recommendation quality while maintaining computational efficiency.
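The diversity-augmented sampler can be approximated with classic farthest-point selection: greedily pick cached negatives that stay far from the user's hard negatives and from each other. A sketch under that reading (DivNS's actual sampler may differ):

```python
import numpy as np

def diverse_subset(cache_emb, hard_emb, m):
    """Greedy max-min selection: choose m cached negatives, each maximising
    its distance to the nearest already-covered point (hard negatives plus
    previously chosen items)."""
    # Distance of each cached item to its nearest hard negative.
    d = np.linalg.norm(cache_emb[:, None] - hard_emb[None], axis=-1).min(axis=1)
    chosen = []
    for _ in range(m):
        i = int(d.argmax())          # farthest from everything covered so far
        chosen.append(i)
        d_new = np.linalg.norm(cache_emb - cache_emb[i], axis=-1)
        d = np.minimum(d, d_new)     # the chosen item now also "covers" space
    return chosen
```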

[IR-5] You Only Evaluate Once: A Tree-based Rerank Method at Meituan CIKM2025

Link: https://arxiv.org/abs/2508.14420
Authors: Shuli Wang, Yinqiu Huang, Changhao Li, Yuan Zhou, Yonggang Liu, Yongqiang Zhang, Yinhua Zhu, Haitao Wang, Xingxing Wang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025


Abstract:Reranking plays a crucial role in modern recommender systems by capturing the mutual influences within the list. Due to the inherent challenges of combinatorial search spaces, most methods adopt a two-stage search paradigm: a simple General Search Unit (GSU) efficiently reduces the candidate space, and an Exact Search Unit (ESU) effectively selects the optimal sequence. These methods essentially trade off effectiveness against efficiency, and they suffer from a severe inconsistency problem: the GSU often filters out high-value lists before the ESU can evaluate them. To address this problem, we propose YOLOR, a one-stage reranking method that removes the GSU while retaining only the ESU. Specifically, YOLOR includes: (1) a Tree-based Context Extraction Module (TCEM) that hierarchically aggregates multi-scale contextual features to achieve "list-level effectiveness", and (2) a Context Cache Module (CCM) that enables efficient feature reuse across candidate permutations to achieve "permutation-level efficiency". Extensive experiments across public and industry datasets validate YOLOR's performance, and we have successfully deployed YOLOR on the Meituan food delivery platform.

[IR-6] GPT-2 as a Compression Preprocessor: Improving Gzip for Structured Text Domains

Link: https://arxiv.org/abs/2508.14061
Authors: Anurag Kumar Ojha
Subjects: Information Retrieval (cs.IR)


Abstract:In the modern era, large volumes of data are produced continuously, especially in domain-specific fields such as medical records and clinical files, defence logs, and HTML-based web traffic. Data of such volume and complexity needs to be compressed before it can be stored and transmitted efficiently. Data compression has gained significant attention from modern researchers, resulting in the development of fast and efficient compression algorithms such as gzip. However, since gzip works on the principle of repeated binary patterns, one of its limitations is that domain-specific formats like JSON, XML, HTML, and log files, while structured, may exhibit semantic repetition without syntactic repetition, which gzip finds difficult to compress. In this article, we propose a GPT-based preprocessor for such domain-specific files: a pipeline in which GPT-2 takes as input the domain-specific files that pattern-based compressors like gzip struggle with, and writes its results to a file designed for such compressors. At the other end of the pipeline, gzip then compresses the data as usual. We experimented with several types of both real-world and synthetically generated data, such as logs and HTML files. We found promising results, with compression improvements of 0.34 per cent on defence logs and 5.8 per cent on HTML files.
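The abstract leaves the preprocessing step itself unspecified; one plausible instantiation, in the spirit of LLM-based compression work, replaces each token by its rank under GPT-2's next-token distribution, so predictable structured text becomes runs of small numbers that gzip compresses well. The sketch below, using Hugging Face `transformers`, is our assumption, not necessarily the paper's pipeline.

```python
import gzip
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def rank_encode(text):
    """Replace each token by its rank under GPT-2's next-token distribution;
    highly predictable tokens map to small ranks."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    ids = tok(text, return_tensors="pt").input_ids[0]
    ranks = [ids[0].item()]  # first token stored verbatim (no context yet)
    with torch.no_grad():
        logits = model(ids[None]).logits[0]
    for t in range(1, len(ids)):
        order = logits[t - 1].argsort(descending=True)
        ranks.append((order == ids[t]).nonzero().item())
    return ranks

if __name__ == "__main__":
    ranks = rank_encode("<html><body><p>hello</p></body></html>")
    payload = bytes(min(r, 255) for r in ranks)  # lossy byte cap, brevity only
    print(len(gzip.compress(payload)))
```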

Attachment Download

Click to download the full list of today's papers