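This post lists the latest papers retrieved from arXiv.org on 2025-08-04. It is updated automatically and grouped into five major areas: NLP, CV, ML, AI, and IR.
Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.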
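Overview (2025-08-04)
432 papers were updated today, including:
- Natural language processing: 65 papers (Computation and Language (cs.CL))
- Artificial intelligence: 118 papers (Artificial Intelligence (cs.AI))
- Computer vision: 108 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 119 papers (Machine Learning (cs.LG))
Natural Language Processing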
[NLP-0] Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
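Quick read: This paper addresses the tension between performance and efficiency that a statically predefined generation length creates for Diffusion Large Language Models (DLLMs): a length that is too short limits performance on complex tasks, while one that is too long adds significant computational overhead and can even degrade performance. The key to the solution is DAEDAL (Dynamic Adaptive Length Expansion for Diffusion LLMs), a denoising strategy that requires no additional training. It works in two phases: first, before denoising, it starts from a short initial length and iteratively expands it to a coarse, task-appropriate length, guided by a sequence completion metric; second, during denoising, it dynamically identifies under-generated regions and expands them by inserting mask tokens, so that the final output is complete and efficient. The method relieves DLLMs of their dependence on a fixed length and improves both generation quality and computational efficiency.
Link: https://arxiv.org/abs/2508.00819
Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL)
Comments: Code is available at this https URL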
Abstract:Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
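The two phases described above can be pictured with a toy sketch of the control flow only. It is not the authors' implementation: `completion_score`, the thresholds, the step sizes, and the `<mask>` string are all hypothetical stand-ins for signals that would come from a real diffusion LLM.

```python
MASK = "<mask>"  # hypothetical mask token string


def expand_initial_length(completion_score, init_len=4, step=4,
                          threshold=0.5, max_len=32):
    """Phase 1 (sketch): grow an all-mask canvas until the score passes the threshold.

    `completion_score(length)` is a hypothetical callable standing in for the
    model-derived sequence completion metric mentioned in the abstract.
    """
    length = init_len
    while length < max_len and completion_score(length) < threshold:
        length = min(length + step, max_len)
    return [MASK] * length


def expand_region(tokens, position, extra=2):
    """Phase 2 (sketch): insert extra mask tokens before `position`."""
    return tokens[:position] + [MASK] * extra + tokens[position:]


# Toy score that rises linearly with length and reaches 0.5 at 10 tokens.
toy_score = lambda length: min(1.0, length / 20)

canvas = expand_initial_length(toy_score)
partially_denoised = ["The", "answer", MASK, "."]
expanded = expand_region(partially_denoised, position=2)
```

In a real system the decision of where to insert masks would come from the model's own confidence signals during denoising; here the position is simply passed in by hand.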
[NLP-1] Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models
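Quick read: This paper addresses the inadequate handling of gender-inclusive pronouns by Large Language Models (LLMs) in sensitive application settings, focusing on the accuracy of gender-neutral pronouns and neopronouns. The key to the solution is building and releasing MISGENDERED+, an updated and extended benchmark for systematically evaluating the pronoun fidelity of mainstream LLMs in zero-shot, few-shot, and gender identity inference tasks. An empirical analysis of five representative models (GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5) shows notable improvement on gender-neutral pronouns but continued inconsistency on neopronouns and reverse inference tasks, pointing to directions for future research on more inclusive generative AI.
Link: https://arxiv.org/abs/2508.00788
Authors: Xushuo Tang, Yi Ding, Zhengyi Yang, Yin Chen, Yongrui Gu, Wenke Yang, Mingchen Ju, Xin Cao, Yongfei Liu, Wenjie Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: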
Abstract:Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs’ handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs’ pronoun fidelity. We benchmark five representative LLMs, GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5, across zero-shot, few-shot, and gender identity inference. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.
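[NLP-2] ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation
Quick read: This paper addresses question answering over tabular data from diverse domains, specifically the two subtasks of SemEval-2025 Task 8: DataBench, namely DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). The core challenge is to understand a natural language question and generate executable Python code that extracts the answer from a table, without any labeled training data. The key to the solution is a zero-shot code generation framework based on Large Language Models (LLMs): state-of-the-art open-source LLMs generate Pandas code through optimized prompting strategies, which is then run to query the table and answer the question. Experiments show that Python code generation outperforms the alternative approaches tested, and the system placed eighth in Subtask I and sixth in Subtask II among the 30 systems that beat the baseline in the open-source models category.
Link: https://arxiv.org/abs/2508.00762
Authors: Atakan Site, Emre Hakan Erdemir, Gülşen Eryiğit
Affiliations: Istanbul Technical University
Categories: Computation and Language (cs.CL)
Comments: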
Abstract:This paper presents our system for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of this task is to perform question answering on given tabular datasets from diverse domains under two subtasks: DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). To tackle both subtasks, we developed a zero-shot solution with a particular emphasis on leveraging Large Language Model (LLM)-based code generation. Specifically, we propose a Python code generation framework utilizing state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Our experiments reveal that different LLMs exhibit varying levels of effectiveness in Python code generation. Additionally, results show that Python code generation achieves superior performance in tabular question answering compared to alternative approaches. Although our ranking among zero-shot systems is unknown at the time of this paper’s submission, our system achieved eighth place in Subtask I and sixth place in Subtask II among the 30 systems that outperformed the baseline in the open-source models category.
[NLP-3] MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations
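Quick read: This paper tackles hate speech detection on Chinese social media, in particular the cloaking techniques that users widely adopt to evade conventional text-based detection systems. The key to the solution is MMBERT, a BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture for efficient cross-modal cooperation. To improve stability, the authors design a progressive three-stage training strategy and introduce modality-specific experts, a shared self-attention mechanism, and router-based expert allocation, which strengthens robustness against adversarial perturbations. Empirical results show that MMBERT significantly outperforms fine-tuned BERT models, fine-tuned LLMs, and LLMs using in-context learning on several Chinese hate speech datasets.
Link: https://arxiv.org/abs/2508.00760
Authors: Qiyao Xue, Yuchen Dou, Ryan Shi, Xiang Lorraine Li, Wei Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: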
Abstract:Hate speech detection on Chinese social networks presents distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results in several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.
[NLP-4] GLiDRE: Generalist Lightweight model for Document-level Relation Extraction
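Quick read: This paper addresses the weak performance of Document-Level Relation Extraction (DocRE) in few-shot and zero-shot settings, a challenge that stems from the difficulty of modeling complex interactions between entities across sentences. Drawing on the success of GLiNER's compact Named Entity Recognition (NER) architecture, the authors propose GLiDRE, a new model whose core idea is to use a lightweight design and an efficient feature aggregation mechanism to keep the model simple while markedly improving generalization in data-scarce scenarios. Experiments show that GLiDRE achieves state-of-the-art few-shot performance on the Re-DocRED dataset.
Link: https://arxiv.org/abs/2508.00757
Authors: Robin Armingaud, Romaric Besançon
Affiliations: Université Paris-Saclay; CEA; List
Categories: Computation and Language (cs.CL)
Comments: Submitted to ARR July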
Abstract:Relation Extraction (RE) is a fundamental task in Natural Language Processing, and its document-level variant poses significant challenges, due to the need to model complex interactions between entities across sentences. Current approaches, largely based on the ATLOP architecture, are commonly evaluated on benchmarks like DocRED and Re-DocRED. However, their performance in zero-shot or few-shot settings remains largely underexplored due to the task’s complexity. Recently, the GLiNER model has shown that a compact NER model can outperform much larger Large Language Models. With a similar motivation, we introduce GLiDRE, a new model for document-level relation extraction that builds on the key ideas of GliNER. We benchmark GLiDRE against state-of-the-art models across various data settings on the Re-DocRED dataset. Our results demonstrate that GLiDRE achieves state-of-the-art performance in few-shot scenarios. Our code is publicly available.
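[NLP-5] Agentic large language models improve retrieval-based radiology question answering
Quick read: This paper addresses the limited capacity for complex clinical reasoning in radiology question answering (QA) that results from conventional Retrieval-Augmented Generation (RAG) systems relying on single-step retrieval. Its core solution is an agentic RAG framework that lets a large language model (LLM) autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based answers. The approach significantly raises diagnostic accuracy (73% on average), with the largest gains in mid-sized models (e.g., Mistral Large) and small models (e.g., Qwen 2.5-7B); it also reduces hallucinations (mean 9.4%) and improves factuality and clinical relevance, supporting the role of agentic mechanisms in AI-assisted radiology decision-making.
Link: https://arxiv.org/abs/2508.00743
Authors: Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: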
Abstract:Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to 670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P < 0.001) and conventional online RAG (73% vs. 68%; P < 0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (200B parameters) demonstrated minimal changes (2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
[NLP-6] Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents
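Quick read: This paper examines how valid generative AI agents are as substitutes for human participants in social science research, and in particular whether agents built on predefined personas can reliably reproduce the personality structure of human individuals. The approach has three parts: first, a population of 310 GPT-4-powered agents is surveyed with the HEXACO personality inventory and the responses are factor-analyzed to test whether the personality structure can be recovered; second, the study checks that the personality dimensions within GPT-4 are stable and consistent when the personas are sufficiently curated; finally, a cross-model comparison exposes the biases and limitations of different large language models in personality modeling, providing empirical evidence and practical guidance for designing more representative and consistent agent personas.
Link: https://arxiv.org/abs/2508.00742
Authors: Sarah Mercer, Daniel P. Martin, Phil Swatton
Affiliations: The Alan Turing Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 26 pages, 14 figures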
Abstract:Generative agents powered by Large Language Models demonstrate human-like characteristics through sophisticated natural language interactions. Their ability to assume roles and personalities based on predefined character biographies has positioned them as cost-effective substitutes for human participants in social science research. This paper explores the validity of such persona-based agents in representing human populations; we recreate the HEXACO personality inventory experiment by surveying 310 GPT-4 powered agents, conducting factor analysis on their responses, and comparing these results to the original findings presented by Ashton, Lee, Goldberg in 2004. Our results found 1) a coherent and reliable personality structure was recoverable from the agents’ responses demonstrating partial alignment to the HEXACO framework. 2) the derived personality dimensions were consistent and reliable within GPT-4, when coupled with a sufficiently curated population, and 3) cross-model analysis revealed variability in personality profiling, suggesting model-specific biases and limitations. We discuss the practical considerations and challenges encountered during the experiment. This study contributes to the ongoing discourse on the potential benefits and limitations of using generative agents in social science research and provides useful guidance on designing consistent and representative agent personas to maximise coverage and representation of human personality traits.
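[NLP-7] Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data
Quick read: This paper asks whether Large Language Models (LLMs) are capable of out-of-context abduction based on information implicit in their training data, that is, whether a model that has never seen example dialogues can infer a fictitious chatbot's name, or behave more like that chatbot, from descriptions of its behavior alone. The key is the experimental design: the LLM (e.g., GPT-4o) is first trained only on the names and behavior descriptions of fictitious chatbots, with no concrete dialogue samples; it is then tested on whether it can infer the corresponding name after observing characteristic responses, and on whether prior training on descriptions of a chatbot's behavior lets it display that chatbot's characteristics more readily when it is iteratively trained to do so. The results indicate that GPT-4o can correctly infer the name of at least one chatbot, showing that an LLM can draw plausible inferences from latent structure in its training data even without in-context information, which matters for situational awareness in models and for AI safety.
Link: https://arxiv.org/abs/2508.00741
Authors: Sohaib Imran, Rob Lamb, Peter M. Atkinson
Affiliations: Lancaster University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: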
Abstract:Large language models (LLMs) are trained on large corpora, yet it is unclear whether they can reason about the information present within their training data. We design experiments to study out-of-context abduction in LLMs, the ability to infer the most plausible explanations for observations using relevant facts present in training data. We train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots. We find that OpenAI’s GPT 4o LLM can correctly infer at least one chatbot’s name after observing example responses characteristic of that chatbot. We also find that previously training GPT 4o on descriptions of a chatbot’s behavior allows it to display behaviors more characteristic of the chatbot when iteratively trained to display such behaviors. Our results have implications for situational awareness in LLMs and, therefore, for AI safety.
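[NLP-8] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA
Quick read: This paper targets two core problems in existing Knowledge Graph Question Answering (KGQA) methods: the traditional retrieve-then-reason paradigm relies on static path extraction and lacks contextual adaptability, while dynamic path generation strategies based on Large Language Models (LLMs) are computationally expensive and evaluate paths inaccurately. The key to the solution is Dynamically Adaptive MCTS-based Reasoning (DAMR), a framework built on Monte Carlo Tree Search (MCTS), with three contributions: 1) LLM-guided MCTS provides efficient, scalable symbolic search, and top-k relation selection sharply reduces the search space; 2) a lightweight Transformer scorer jointly encodes the question and relation sequence through cross-attention, capturing fine-grained semantic shifts to improve the accuracy of path plausibility estimates; 3) a dynamic pseudo-path refinement mechanism automatically builds training signals during search, easing the scarcity of high-quality supervision and letting the scorer keep adapting to the changing distribution of reasoning trajectories.
Link: https://arxiv.org/abs/2508.00719
Authors: Yingxu Wang, Shiqi Fan, Mengzhu Wang, Siwei Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: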
Abstract:Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either the retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects top-k relevant relations at each step to reduce search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
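To make the combination of top-k pruning and tree search more concrete, here is a minimal sketch of one selection step. It assumes the textbook UCT rule for MCTS, which the abstract does not confirm DAMR uses; the relation names, planner scores, visit statistics, and exploration constant `c` are invented for illustration.

```python
import math


def top_k_relations(relation_scores, k):
    """Keep the k highest-scoring relations; the scores stand in for an LLM planner."""
    ranked = sorted(relation_scores, key=lambda r: (-relation_scores[r], r))
    return ranked[:k]


def uct_select(stats, parent_visits, c=1.4):
    """Pick the child with the best mean value plus exploration bonus (standard UCT).

    `stats` maps relation -> (visits, total_value). Unvisited children are tried first.
    """
    def uct(item):
        _, (visits, total_value) = item
        if visits == 0:
            return math.inf
        exploit = total_value / visits
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore

    return max(stats.items(), key=uct)[0]


# Hypothetical planner scores for relations leaving the current entity.
planner_scores = {"born_in": 0.9, "spouse": 0.2, "capital_of": 0.7, "member_of": 0.4}
candidates = top_k_relations(planner_scores, k=2)

# Hypothetical search statistics for the surviving candidates.
stats = {"born_in": (10, 6.0), "capital_of": (2, 1.5)}
chosen = uct_select(stats, parent_visits=12)
```

Pruning to the top k relations before the UCT step is what keeps the branching factor small; the rarely visited `capital_of` branch wins here because its exploration bonus outweighs the better-explored alternative.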
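[NLP-9] NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System
Quick read: This paper addresses a key gap in Legal Judgment Prediction (LJP) for the Indian legal setting: existing methods rely mainly on internal case text such as facts, issues, and reasoning, while overlooking two elements that are essential in a common law system, namely statutory provisions and judicial precedents. To address this limitation, the authors propose NyayaRAG, a framework whose core innovation is a Retrieval-Augmented Generation (RAG) mechanism that combines the three kinds of input a realistic courtroom scenario calls for (factual case descriptions, relevant legal statutes, and semantically retrieved prior cases) in a dedicated pipeline tailored to the Indian legal system. Experiments show that adding structured legal knowledge in this way significantly improves both judgment prediction accuracy and the quality of legal explanations.
Link: https://arxiv.org/abs/2508.00709
Authors: Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Affiliations: IIT Kanpur; SRM Institute of Science and Technology; IISER Kolkata; Symbiosis Law School Pune
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: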
Abstract:Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.
[NLP-10] Classification of Psychiatry Clinical Notes by Diagnosis: A Deep Learning and Machine Learning Approach
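Quick read: This paper addresses the automated classification of clinical notes into two mental health diagnoses, Anxiety and Adjustment Disorder, with the aim of making assisted diagnosis more efficient and accurate. The approach systematically compares traditional machine learning models (Random Forest, Support Vector Machine, K-nearest neighbors, Decision Tree, and eXtreme Gradient Boost) with deep learning models (DistilBERT and SciBERT), evaluates the effect of three oversampling strategies (no oversampling, random oversampling, and SMOTE), and applies hyperparameter tuning to optimize the models. The study finds that hyperparameter tuning significantly improves accuracy across models, whereas SMOTE helps only with BERT-based models, indicating that careful tuning of model parameters is the central factor in achieving high classification performance.
Link: https://arxiv.org/abs/2508.00695
Authors: Sergio Rubio-Martín, María Teresa García-Ordás, Antonio Serrano-García, Clara Margarita Franch-Pato, Arturo Crespo-Álvaro, José Alberto Benítez-Andrades
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: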
Abstract:The classification of clinical notes into specific diagnostic categories is critical in healthcare, especially for mental health conditions like Anxiety and Adjustment Disorder. In this study, we compare the performance of various Artificial Intelligence models, including both traditional Machine Learning approaches (Random Forest, Support Vector Machine, K-nearest neighbors, Decision Tree, and eXtreme Gradient Boost) and Deep Learning models (DistilBERT and SciBERT), to classify clinical notes into these two diagnoses. Additionally, we implemented three oversampling strategies: No Oversampling, Random Oversampling, and Synthetic Minority Oversampling Technique (SMOTE), to assess their impact on model performance. Hyperparameter tuning was also applied to optimize model accuracy. Our results indicate that oversampling techniques had minimal impact on model performance overall. The only exception was SMOTE, which showed a positive effect specifically with BERT-based models. However, hyperparameter optimization significantly improved accuracy across the models, enhancing their ability to generalize and perform on the dataset. The Decision Tree and eXtreme Gradient Boost models achieved the highest accuracy among machine learning approaches, both reaching 96%, while the DistilBERT and SciBERT models also attained 96% accuracy in the deep learning category. These findings underscore the importance of hyperparameter tuning in maximizing model performance. This study contributes to the ongoing research on AI-assisted diagnostic tools in mental health by providing insights into the efficacy of different model architectures and data balancing methods.
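Of the three oversampling strategies compared, random oversampling is the simplest to show in code. The sketch below is a plain-Python illustration under assumed inputs, not the authors' pipeline: the note strings, labels, and `seed` parameter are made up. SMOTE differs in that it synthesizes new minority points by interpolating in feature space rather than duplicating existing ones.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical, heavily imbalanced toy data.
notes = ["note 1", "note 2", "note 3", "note 4", "note 5"]
diagnoses = ["anxiety", "anxiety", "anxiety", "anxiety", "adjustment"]

balanced_notes, balanced_diagnoses = random_oversample(notes, diagnoses)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated notes cannot leak into the evaluation data.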
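[NLP-11] Better Call Claude: Can LLMs Detect Changes of Writing Style?
Quick read: This paper addresses sentence-level style change detection, one of the most challenging problems in authorship analysis. Its approach is to evaluate the zero-shot performance of recent Large Language Models (LLMs), benchmarking four of them on the official PAN 2024 and 2025 “Multi-Author Writing Style Analysis” datasets to measure how sensitive LLMs are to fine-grained differences in writing style. The key findings are that state-of-the-art generative models can detect style changes at the level of individual sentences, that their accuracy exceeds the baselines suggested by the PAN competition, and that these models may rely more on content-independent, purely stylistic signals than on semantics, which gives later work a new baseline and direction.
Link: https://arxiv.org/abs/2508.00680
Authors: Johannes Römisch, Svetlana Gorovaia, Mariia Halchynska, Gleb Schmidt, Ivan P. Yamshchikov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: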
Abstract:This article explores the zero-shot performance of state-of-the-art large language models (LLMs) on one of the most challenging tasks in authorship analysis: sentence-level style change detection. Benchmarking four LLMs on the official PAN 2024 and 2025 “Multi-Author Writing Style Analysis” datasets, we present several observations. First, state-of-the-art generative models are sensitive to variations in writing style - even at the granular level of individual sentences. Second, their accuracy establishes a challenging baseline for the task, outperforming suggested baselines of the PAN competition. Finally, we explore the influence of semantics on model predictions and present evidence suggesting that the latest generation of LLMs may be more sensitive to content-independent and purely stylistic signals than previously reported.
[NLP-12] Segment First Retrieve Better: Realistic Legal Search via Rhetorical Role-Based Queries
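Quick read: This paper addresses the challenge that the growing volume and complexity of legal documents pose for traditional methods of legal precedent retrieval in common law systems, especially how to match precedents efficiently and accurately when only partial case information is available. The key to the solution is TraceRetriever, a retrieval framework that mirrors real-world legal search by working only with rhetorically significant segments rather than complete documents. It integrates BM25, a vector database, and Cross-Encoder models, merges the initial results with Reciprocal Rank Fusion (RRF), and re-ranks them in a final stage, so that precedents can be retrieved from limited input; the authors present this as a reliable and scalable foundation for legal research in practice.
Link: https://arxiv.org/abs/2508.00679
Authors: Shubham Kumar Nigam, Tanmay Dubey, Noel Shallum, Arnab Bhattacharya
Affiliations: IIT Kanpur; IISER Kolkata; Symbiosis Law School Pune
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: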
Abstract:Legal precedent retrieval is a cornerstone of the common law system, governed by the principle of stare decisis, which demands consistency in judicial decisions. However, the growing complexity and volume of legal documents challenge traditional retrieval methods. TraceRetriever mirrors real-world legal search by operating with limited case information, extracting only rhetorically significant segments instead of requiring complete documents. Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion before final re-ranking. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments. Evaluated on IL-PCR and COLIEE 2025 datasets, TraceRetriever addresses growing document volume challenges while aligning with practical search constraints, reliable and scalable foundation for precedent retrieval enhancing legal research when only partial case knowledge is available.
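The fusion step named above can be sketched in a few lines. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in. The constant `k=60` is the value commonly used in the literature and is an assumption here, as are the case IDs and the two input rankings; the abstract does not state the settings TraceRetriever uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a lexical and a dense retriever.
bm25_hits = ["case_a", "case_b", "case_c"]
vector_hits = ["case_b", "case_d", "case_a"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and vector similarities; a cross-encoder would then re-rank the fused list.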
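[NLP-13] Team “better_call_claude”: Style Change Detection using a Sequential Sentence Pair Classifier
Quick read: This paper addresses style change detection, the task of finding the points in a document where the writing style shifts, at the fine-grained level of individual sentences; this is a key and challenging problem in computational authorship analysis. The core of the solution is a Sequential Sentence Pair Classifier (SSPC) that models a passage as a whole: a pre-trained language model (PLM) first produces a representation of each sentence, a bidirectional LSTM (BiLSTM) then contextualizes the sentences within the document, and the vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron that makes one prediction per adjacent pair. Although relatively conservative and lightweight, the method makes effective use of context and copes well with the “stylistically shallow” short sentences that are prevalent in the benchmark data, reaching macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD official PAN-2025 test sets, respectively.
Link: https://arxiv.org/abs/2508.00675
Authors: Gleb Schmidt, Johannes Römisch, Mariia Halchynska, Svetlana Gorovaia, Ivan P. Yamshchikov
Affiliations: Radboud University; Technical University of Applied Sciences Würzburg-Schweinfurt; HSE University
Categories: Computation and Language (cs.CL)
Comments: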
Abstract:Style change detection - identifying the points in a document where writing style shifts - remains one of the most important and challenging problems in computational authorship analysis. At PAN 2025, the shared task challenges participants to detect style switches at the most fine-grained level: individual sentences. The task spans three datasets, each designed with controlled and increasing thematic variety within documents. We propose to address this problem by modeling the content of each problem instance - that is, a series of sentences - as a whole, using a Sequential Sentence Pair Classifier (SSPC). The architecture leverages a pre-trained language model (PLM) to obtain representations of individual sentences, which are then fed into a bidirectional LSTM (BiLSTM) to contextualize them within the document. The BiLSTM-produced vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron for prediction per adjacency. Building on the work of previous PAN participants classical text segmentation, the approach is relatively conservative and lightweight. Nevertheless, it proves effective in leveraging contextual information and addressing what is arguably the most challenging aspect of this year’s shared task: the notorious problem of “stylistically shallow”, short sentences that are prevalent in the proposed benchmark data. Evaluated on the official PAN-2025 test datasets, the model achieves strong macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD data, respectively, outperforming not only the official random baselines but also a much more challenging one: claude-3.7-sonnet’s zero-shot performance.
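The reported scores are macro-F1 values over per-adjacency predictions. As a reminder of what that metric computes, here is a small self-contained sketch; the label encoding (1 = style change between two adjacent sentences, 0 = no change) and the toy predictions are assumptions for illustration, not data from the shared task.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# A 7-sentence document has 6 adjacent pairs, hence 6 labels.
gold = [0, 0, 1, 0, 1, 0]
predicted = [0, 0, 1, 1, 0, 0]

score = macro_f1(gold, predicted)
```

Macro averaging weights the rare "change" class as heavily as the frequent "no change" class, which is why it suits a task where most adjacent pairs share an author.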
[NLP-14] MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language
Quick read: This paper addresses the shortage of resources for evaluating Large Language Models (LLMs) in non-Western cultural contexts, in particular the lack of systematic evaluation data for the Persian language and Iranian culture. The key to the solution is 19 new evaluation datasets covering locally grounded topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams, which the authors use to benchmark 41 prominent LLMs and so help close the gap in linguistic and cultural evaluation.
Link: https://arxiv.org/abs/2508.00673
Authors: Farhan Farsi, Farnaz Aghababaloo, Shahriar Shariati Motlagh, Parsa Ghofrani, MohammadAli SadraeiJavaheri, Shayan Bali, Amirhossein Shabani, Farbod Bijary, Ghazal Zamaninejad, AmirMohammad Salehoof, Saeedeh Momtazi
Affiliations: Amirkabir University of Technology; Part AI Research Center; University of Mazandaran; King’s College London
Categories: Computation and Language (cs.CL)
Comments: Preprint. Under review
Abstract:As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.
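[NLP-15] Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications
Quick read: This paper addresses the lack of systematic, transparent, and verifiable reasoning in Large Language Models (LLMs) applied to medicine, a shortcoming that limits their reliability and trustworthiness in clinical practice. The review documents the shift from single-step answer generation to LLMs designed specifically for medical reasoning and proposes a taxonomy that divides reasoning enhancement techniques into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). It also examines how these techniques apply across data modalities (text, image, code) and core clinical applications (diagnosis, education, treatment planning), and how evaluation has moved from plain accuracy toward reasoning quality and visual interpretability, providing a foundation and direction for efficient, robust, and sociotechnically responsible medical AI.
Link: https://arxiv.org/abs/2508.00669
Authors: Wenxuan Wang, Zizhan Ma, Meidan Ding, Shiyi Zheng, Shengyuan Liu, Jie Liu, Jiaming Ji, Wenting Chen, Xiang Li, Linlin Shen, Yixuan Yuan
Affiliations: Renmin University of China; The Chinese University of Hong Kong; Shenzhen University; City University of Hong Kong; Peking University; Massachusetts General Hospital and Harvard Medical School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: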
Abstract:The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.
[NLP-16] Demo: TOSense – What Did You Just Agree to?
【速读】: 该论文旨在解决在线服务中用户需同意冗长且晦涩的《服务条款》(Terms of Service, ToS)所引发的信息不对称和法律风险问题。其解决方案的关键在于提出TOSense——一个基于Chrome扩展的实时问答系统,通过两个核心组件实现:一是自动爬取ToS内容的“tos-crawl”爬虫;二是轻量级大语言模型流水线,包括使用MiniLM进行语义检索和BART-encoder进行答案相关性验证。此外,为避免昂贵的人工标注,作者设计了新颖的问答评估流水线(Question Answering Evaluation Pipeline, QEP),利用聚类主题匹配生成合成问题并验证答案正确性,从而在苹果、谷歌、X(原Twitter)、微软和奈飞等五大平台验证了系统的有效性(最高准确率达44.5%)。
链接: https://arxiv.org/abs/2508.00659
作者: Xinzhang Chen,Hassan Ali,Arash Shaghaghi,Salil S. Kanhere,Sanjay Jha
机构: The University of New South Wales (新南威尔士大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Accepted as a demonstration paper at IEEE LCN 2025
Abstract:Online services often require users to agree to lengthy and obscure Terms of Service (ToS), leading to information asymmetry and legal risks. This paper proposes TOSense-a Chrome extension that allows users to ask questions about ToS in natural language and get concise answers in real time. The system combines (i) a crawler “tos-crawl” that automatically extracts ToS content, and (ii) a lightweight large language model pipeline: MiniLM for semantic retrieval and BART-encoder for answer relevance verification. To avoid expensive manual annotation, we present a novel Question Answering Evaluation Pipeline (QEP) that generates synthetic questions and verifies the correctness of answers using clustered topic matching. Experiments on five major platforms, Apple, Google, X (formerly Twitter), Microsoft, and Netflix, show the effectiveness of TOSense (with up to 44.5% accuracy) across varying number of topic clusters. During the demonstration, we will showcase TOSense in action. Attendees will be able to experience seamless extraction, interactive question answering, and instant indexing of new sites.
[NLP-17] DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models
【速读】: 该论文旨在解决现有生成式 AI (Generative AI) 文本检测器在真实场景中性能下降的问题,其核心挑战在于当前检测方法主要基于零样本(zero-shot)生成文本进行训练和评估,而对少样本(few-shot)或单样本(one-shot)生成文本以及领域特定持续预训练(Continued Pre-trained, CPT)语言模型生成的文本缺乏有效检测能力。解决方案的关键在于构建了一个新的挑战性数据集 DACTYL(Diverse Adversarial Corpus of Texts Yielded from Language models),专门针对 one-shot/few-shot 生成文本及 CPT 模型输出进行设计,并采用两种训练策略对比:标准二元交叉熵(BCE)优化与更新的深度 X-风险优化(DXO)。实验表明,尽管 BCE 方法在 DACTYL 测试集上略优,但 DXO 分类器在分布外(out-of-distribution, OOD)文本上表现显著更佳:在模拟学生作文检测任务中,在两者各自的最低误报率下,最佳 DXO 分类器的宏 F1 分数比最佳 BCE 分类器高出 50.56 点,表明 DXO 具有更好的泛化能力且不易过拟合测试集,从而揭示了提升 AIG 文本检测鲁棒性的新方向。
链接: https://arxiv.org/abs/2508.00619
作者: Shantanu Thorat,Andrew Caines
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: MPhil in Advanced Computer Science thesis for University of Cambridge
Abstract:Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. We rigorously examine the machine-learning procedure to build these detectors to address this. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on few-shot or one-shot generations, where LLMs are given human texts as an example. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focusing on one-shot/few-shot generations. We also include texts from domain-specific continued-pre-trained (CPT) language models, where we fully train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO classifiers on the DACTYL test set, the latter excels on out-of-distribution (OOD) texts. In our mock deployment scenario in student essay detection with an OOD student essay dataset, the best DXO classifier outscored the best BCE-trained classifier by 50.56 macro-F1 score points at the lowest false positive rates for both. Our results indicate that DXO classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.
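The macro-F1 metric quoted in the abstract is the unweighted mean of per-class F1 scores. The sketch below shows how it is computed for a binary detector; the labels are invented for illustration and are not taken from the DACTYL experiments.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): 1 = AI-generated, 0 = human-written.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally regardless of its size, macro-F1 penalizes a detector that performs well only on the majority class.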
[NLP-18] Prompting Science Report 3: I’ll pay you or I’ll kill you – but will you care?
【速读】: 该论文旨在解决关于大语言模型(Large Language Models, LLMs)提示工程(prompting)中两个广泛流传的信念——即“给予奖励”(如承诺给小费)和“施加威胁”是否能提升模型性能的问题。研究通过在GPQA(Graduate-Level Google-Proof Q&A)和MMLU-Pro(Massive Multitask Language Understanding 的增强版)基准上进行实证测试,发现:对模型进行奖励或威胁通常不会显著影响其整体性能;然而,提示方式的变化可能在单个问题层面产生显著差异,但这种影响具有高度不确定性,难以提前判断某一提示策略对特定问题是有利还是有害。因此,论文的关键结论是:简单提示技巧的有效性可能被高估,尤其是在处理困难问题时。
链接: https://arxiv.org/abs/2508.00614
作者: Lennart Meincke,Ethan Mollick,Lilach Mollick,Dan Shapiro
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This is the third in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate two commonly held prompting beliefs: a) offering to tip the AI model and b) threatening the AI model. Tipping was a commonly shared tactic for improving AI performance and threats have been endorsed by Google Founder Sergey Brin (All-In, May 2025, 8:20) who observed that ‘models tend to do better if you threaten them,’ a claim we subject to empirical testing here. We evaluate model performance on GPQA (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024). We demonstrate two things: - Threatening or tipping a model generally has no significant effect on benchmark performance. - Prompt variations can significantly affect performance on a per-question level. However, it is hard to know in advance whether a particular prompting approach will help or harm the LLM’s ability to answer any particular question. Taken together, this suggests that simple prompting variations might not be as effective as previously assumed, especially for difficult problems. However, as reported previously (Meincke et al. 2025a), prompting approaches can yield significantly different results for individual questions. 
[NLP-19] GHTM: A Graph based Hybrid Topic Modeling Approach in Low-Resource Bengali Language
【速读】: 该论文旨在解决 Bengali 语言中文本主题建模(Topic Modeling)研究不足的问题,主要挑战包括其形态复杂性、资源匮乏以及相关研究较少。解决方案的关键在于提出一种基于图卷积网络(Graph Convolutional Network, GCN)的混合主题模型——GHTM(Graph-Based Hybrid Topic Model),该模型将文档向量表示为图中的节点,利用 GCN 学习语义丰富的嵌入表示,并通过非负矩阵分解(Non-negative Matrix Factorization, NMF)提取主题特征。实验表明,该方法在主题一致性与多样性上优于传统方法(如 LDA、LSA、NMF)及现代框架(如 BERTopic 和 Top2Vec),并构建了首个源自孟加拉语教科书材料的新型语料库 NCTBText,有效弥补了现有数据集以新闻文本为主的局限性。
链接: https://arxiv.org/abs/2508.00605
作者: Farhana Haque,Md. Abdur Rahman,Sumon Ahmed
机构: IIT, University of Dhaka (达卡大学信息技术学院); CARS, University of Dhaka (达卡大学高级科学研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Topic modeling is a Natural Language Processing (NLP) technique that is used to identify latent themes and extract topics from text corpora by grouping similar documents based on their most significant keywords. Although widely researched in English, topic modeling remains understudied in Bengali due to its morphological complexity, lack of adequate resources and initiatives. In this contribution, a novel Graph Convolutional Network (GCN) based model called GHTM (Graph-Based Hybrid Topic Model) is proposed. This model represents input vectors of documents as nodes in the graph, which GCN uses to produce semantically rich embeddings. The embeddings are then decomposed using Non-negative Matrix Factorization (NMF) to get the topical representations of the underlying themes of the text corpus. This study compares the proposed model against a wide range of Bengali topic modeling techniques, from traditional methods such as LDA, LSA, and NMF to contemporary frameworks such as BERTopic and Top2Vec on three Bengali datasets. The experimental results demonstrate the effectiveness of the proposed model by outperforming other models in topic coherence and diversity. In addition, we introduce a novel Bengali dataset called “NCTBText” sourced from Bengali textbook materials to enrich and diversify the predominantly newspaper-centric Bengali corpora.
[NLP-20] A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在输出置信度估计中忽视响应与上下文相关性的关键问题,这一缺陷限制了模型在需要高可信度的安全敏感场景中的可靠部署。解决方案的核心在于提出CRUX框架,其创新性地引入两个新指标:一是上下文感知的熵减(Context-aware entropy reduction),通过对比有无上下文条件下的采样信息增益来量化数据不确定性;二是统一一致性检验(Unified consistency examination),通过分析生成答案在有无上下文时的全局一致性来捕捉模型不确定性。该方法首次将上下文忠实性和一致性联合用于置信度估计,在多个基准和领域特定数据集上显著优于现有基线方法。
链接: https://arxiv.org/abs/2508.00600
作者: Mingruo Yuan,Shuyi Zhang,Ben Kao
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers the user to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty with the information gain through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX’s effectiveness, achieving higher AUROC than existing baselines.
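The idea of contextual entropy reduction can be illustrated with a small sketch. The abstract does not give the exact formulation, so this is an assumption: answers are sampled with and without context, Shannon entropy is computed over the distinct answer strings, and the information gain is the difference. The function names and sample answers are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_reduction(samples_without_context, samples_with_context):
    """Information gain attributed to the context: H(no context) - H(with context)."""
    return answer_entropy(samples_without_context) - answer_entropy(samples_with_context)


# Hypothetical samples for one question, drawn with and without the passage.
without_ctx = ["Paris", "Lyon", "Paris", "Nice"]
with_ctx = ["Paris", "Paris", "Paris", "Paris"]
gain = entropy_reduction(without_ctx, with_ctx)
```

A large positive gain suggests the context sharply narrows the answer distribution; a value near zero suggests the model's answers barely depend on the provided context.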
[NLP-21] Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
【速读】: 该论文旨在解决自动驾驶系统在面对弱势道路使用者(Vulnerable Road Users, VRUs)异常或复杂行为等边缘场景时的可靠性评估难题,尤其是如何从大规模驾驶数据集中高效检索稀有的人类行为场景。解决方案的关键在于提出一种上下文感知的运动检索框架,通过将基于Skinned Multi-Person Linear (SMPL)模型生成的运动序列与对应视频帧联合编码至共享多模态嵌入空间,并使其与自然语言对齐,从而实现基于文本查询的可扩展人类行为及其情境的精准检索。该方法显著提升了运动-场景关联检索的准确性,在自建数据集WayMoCo上相较现有最优模型最高提升达27.5%。
链接: https://arxiv.org/abs/2508.00589
作者: Stefan Englmeier(1),Max A. Büttner(1),Katharina Winter(1),Fabian B. Flohr(1) ((1) Munich University of Applied Sciences, Intelligent Vehicles Lab (IVL), Munich, Germany)
机构: Munich University of Applied Sciences (慕尼黑应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Robotics (cs.RO)
备注: 9 pages, 10 figure, project page this https URL , submitted to IEEE Transactions on Intelligent Vehicles (T-IV), This work has been submitted to the IEEE for possible publication
Abstract:Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and their context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval, when evaluated on the WayMoCo dataset.
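Retrieval in a shared embedding space typically reduces to ranking stored items by similarity to the encoded text query. The sketch below assumes cosine similarity and uses hand-written three-dimensional vectors; the clip names and vectors are hypothetical, and the paper's actual encoders are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the names of the k indexed items most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical embeddings of motion-plus-scene clips.
index = {
    "clip_crossing": [0.9, 0.1, 0.0],
    "clip_waiting": [0.1, 0.9, 0.1],
    "clip_running": [0.7, 0.0, 0.7],
}
# Hypothetical embedding of a text query such as "pedestrian crossing the road".
query = [1.0, 0.0, 0.1]
top = retrieve(query, index, k=2)
```

In a real system the index would hold embeddings produced by the trained multimodal encoder, and an approximate nearest-neighbor structure would replace the full sort.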
[NLP-22] SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)推理中因生成离散CoT token(Discrete CoT, DCoT)而导致的时间开销过高问题,以及现有连续CoT(Continuous CoT, CCoT)方法在微调间接性、对齐能力有限和目标不一致方面的局限性。其解决方案的关键在于提出一种名为SynAdapt的高效推理框架:通过生成合成连续CoT(synthetic CCoT)作为精确且有效的对齐目标,直接引导大语言模型(LLM)学习连续推理并输出准确答案;同时引入一个难度分类器,结合问题上下文与CCoT特征识别难题,并自适应地触发LLM对难例进行再思考,从而在多个基准测试中实现最优的准确性-效率权衡。
链接: https://arxiv.org/abs/2508.00574
作者: Jianwei Wang,Ziming Wu,Fuming Lai,Shaobing Lian,Ziqian Zeng
机构: Tencent Inc.(腾讯公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While Chain-of-Thought (CoT) reasoning improves model performance, it incurs significant time costs due to the generation of discrete CoT tokens (DCoT). Continuous CoT (CCoT) offers a more efficient alternative, but existing CCoT methods are hampered by indirect fine-tuning, limited alignment, or inconsistent targets. To overcome these limitations, we propose SynAdapt, an innovative efficient reasoning framework. Specifically, SynAdapt generates the synthetic CCoT to serve as a precise and effective alignment target for LLMs. This synthetic CCoT explicitly guides the LLM to learn CCoT and derive accurate answers directly. Furthermore, relying solely on CCoT is insufficient for solving hard questions. To address this, SynAdapt integrates a difficulty classifier that leverages both question context and CCoT to identify hard questions. CCoT can effectively help identify hard questions after some brief reasoning. We then adaptively prompt the LLM to re-think these hard questions for improved performance. Extensive experimental results across various benchmarks from different difficulty levels strongly demonstrate the effectiveness of our method, achieving the best accuracy-efficiency trade-off.
[NLP-23] Activation-Guided Local Editing for Jailbreaking Attacks
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 模型中 jailbreak 攻击方法存在的两大核心问题:一是基于 token 级别的攻击易导致输入不连贯且迁移能力差,二是基于 prompt 级别的攻击难以规模化并高度依赖人工设计。解决方案的关键在于提出一种两阶段框架 AGILE,第一阶段通过场景化生成与重述来隐藏恶意意图,第二阶段利用模型隐层状态信息引导细粒度编辑,从而有效改变模型对输入的内部表征,使其从有害转向无害。该方法在攻击成功率上达到当前最优水平(较最强基线提升最高达 37.74%),并展现出优异的黑盒迁移能力及对主流防御机制的绕过效果。
链接: https://arxiv.org/abs/2508.00555
作者: Jiecong Wang,Haoran Li,Hao Peng,Ziqian Zeng,Zihao Wang,Haohua Du,Zhengtao Yu
机构: Beihang University (北京航空航天大学); The Hong Kong University of Science and Technology (香港科技大学); South China University of Technology (华南理工大学); Nanyang Technological University (南洋理工大学); Kunming University of Science and Technology (昆明理工大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Jailbreaking is an essential adversarial technique for red-teaming these models to uncover and patch security flaws. However, existing jailbreak methods face significant drawbacks. Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose a concise and effective two-stage framework that combines the advantages of these approaches. The first stage performs a scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent. The second stage then utilizes information from the model’s hidden states to guide fine-grained edits, effectively steering the model’s internal representation of the input from a malicious toward a benign one. Extensive experiments demonstrate that this method achieves state-of-the-art Attack Success Rate, with gains of up to 37.74% over the strongest baseline, and exhibits excellent transferability to black-box models. Our analysis further demonstrates that AGILE maintains substantial effectiveness against prominent defense mechanisms, highlighting the limitations of current safeguards and providing valuable insights for future defense development. Our code is available at this https URL.
[NLP-24] PaPaformer: Language Model from Pre-trained Paraller Paths
【速读】: 该论文旨在解决现代大语言模型(Large-Language Models, LLMs)训练所需计算资源和时间成本过高的问题,尤其针对小型语言模型(Small-Language Models, SLMs)仍需数天甚至数周训练周期的挑战。其解决方案的关键在于提出一种名为PaPaformer的解码器-only Transformer架构变体,该结构通过低维并行路径(parallel paths)实现模型参数的高效训练与组合:这些路径可独立使用不同类型的训练数据进行训练,随后融合为一个更大规模的模型。此方法不仅显著缩短训练时间(从数天/周降至数小时),还能在减少总参数量的同时提升性能,并为特定任务定制化路径提供了灵活性。
链接: https://arxiv.org/abs/2508.00544
作者: Joonas Tapaninaho,Mourad Oussala
机构: University of Oulu (奥卢大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The training of modern large-language models requires an increasing amount of computation power and time. Even smaller variants, such as small-language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days/weeks. We introduce PaPaformer, a decoder-only transformer architecture variant, whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method gives the option to reduce the total number of model parameters and the training time with increasing performance. Moreover, the use of parallel path structure opens interesting possibilities to customize paths to accommodate specific task requirements.
[NLP-25] The Prosody of Emojis
【速读】: 该论文旨在解决emoji在数字交际中如何影响语音韵律(prosody)的产生与感知问题,即emoji作为文本语境下缺失的韵律线索(如音高、时长和语调)的视觉替代品,是否能塑造说话者的语音表现并被听者准确解读。解决方案的关键在于通过结构化但开放式的生产与感知任务收集真实人类语音数据,直接关联emoji语义与语音韵律特征,从而提供实证证据:说话者会根据emoji调整其韵律表达,听者往往能仅凭韵律差异识别意图emoji,且emoji语义差异越大,韵律分化越显著。这一发现揭示了emoji在数字化交际中作为韵律意图载体的深层功能。
链接: https://arxiv.org/abs/2508.00537
作者: Giulio Zhou,Tsz Kin Lam,Alexandra Birch,Barry Haddow
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emoji by analysing actual human speech data, collected through structured but open-ended production and perception tasks. This provides empirical evidence of how emoji semantics shape spoken delivery and perception. Results show that speakers adapt their prosody based on emoji cues, listeners can often identify the intended emoji from prosodic variation alone, and greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis can act as meaningful carriers of prosodic intent, offering insight into their communicative role in digitally mediated contexts.
[NLP-26] Towards a unified framework for programming paradigms: A systematic review of classification formalisms and methodological foundations
【速读】: 该论文旨在解决多范式编程语言兴起背景下传统编程范式分类方法失效的问题,尤其是由此引发的互操作性缺陷等实际软件工程挑战。其解决方案的关键在于从静态分类转向形式化的重构方法:通过识别一组正交的、原子化的概念原语,并借助类型论(Type theory)、范畴论(Category theory)及统一编程理论(Unifying Theories of Programming, UTP)等数学框架,对这些原语进行组合性建模,从而在理论上保障混合语言中范式的可组合性质。这一转变标志着研究重心从描述性分类向具有形式保证的重建性框架演进。
链接: https://arxiv.org/abs/2508.00534
作者: Mikel Vandeloise
机构: University of Namur (那慕尔大学)
类目: Programming Languages (cs.PL); Computation and Language (cs.CL); ACM classes: D.3.2; F.3.2; D.3.1
备注: Preprint submitted to the Journal of Object Technology on July 29, 2025. Data available upon request until peer-review is completed
Abstract:The rise of multi-paradigm languages challenges traditional classification methods, leading to practical software engineering issues like interoperability defects. This systematic literature review (SLR) maps the formal foundations of programming paradigms. Our objective is twofold: (1) to assess the state of the art of classification formalisms and their limitations, and (2) to identify the conceptual primitives and mathematical frameworks for a more powerful, reconstructive approach. Based on a synthesis of 74 primary studies, we find that existing taxonomies lack conceptual granularity, a unified formal basis, and struggle with hybrid languages. In response, our analysis reveals a strong convergence toward a compositional reconstruction of paradigms. This approach identifies a minimal set of orthogonal, atomic primitives and leverages mathematical frameworks, predominantly Type theory, Category theory and Unifying Theories of Programming (UTP), to formally guarantee their compositional properties. We conclude that the literature reflects a significant intellectual shift away from classification towards these promising formal, reconstructive frameworks. This review provides a map of this evolution and proposes a research agenda for their unification.
[NLP-27] EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)中表达能力与泛化能力之间的关联性不明确的问题,尤其是缺乏有效手段在LoRA框架下寻找局部平坦的极小值(locally flat minima),从而提升模型泛化性能。其解决方案的关键在于提出Flat-LoRA及其高效版本EFlat-LoRA,通过理论证明全参数空间中的扰动可映射至低秩子空间,从而避免多矩阵扰动带来的干扰;该方法在保持LoRA计算效率的同时显著提升了模型在大语言模型和视觉语言模型上的泛化性能,验证了LoRA的泛化能力与其优化路径的尖锐度(sharpness)密切相关。
链接: https://arxiv.org/abs/2508.00522
作者: Jiaxin Deng,Qingcheng Zhu,Junbiao Pang,Linlin Yang,Zhongqian Fu,Baochang Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Little research explores the correlation between the expressive ability and generalization ability of the low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat-LoRA and its efficient version, i.e., EFlat-LoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFlat-LoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFlat-LoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, it shows performance improvements of 1.5% and 1.0% on SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is omitted by previous methods.
[NLP-28] Fine-grained Spatiotemporal Grounding on Egocentric Videos ICCV2025
【速读】: 该论文旨在解决自指视角视频(egocentric video)中的时空定位问题,即在第一人称视角视频中根据文本查询精确定位目标实体的时间和空间位置。与已有研究主要集中在第三人称视角(exocentric)视频不同,自指视角视频因物体持续时间更短、轨迹稀疏、目标尺寸更小及位置偏移更大等特性,导致现有模型性能显著下降。其解决方案的关键在于提出了首个像素级细粒度时空定位基准数据集EgoMask,通过自动标注流程生成涵盖短、中、长时段视频的引用表达和对象掩码,并构建了大规模训练数据集EgoMask-Train。实验证明,基于该数据集微调后的模型在EgoMask上表现大幅提升,同时保持在第三人称数据集上的性能,为自指视角视频理解提供了关键资源与方法论支持。
链接: https://arxiv.org/abs/2508.00518
作者: Shuo Liang,Yiwu Zhong,Zi-Yuan Hu,Yeyao Tao,Liwei Wang
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by ICCV 2025
Abstract:Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at this https URL .
[NLP-29] The Missing Parts: Augmenting Fact Verification with Half-Truth Detection
【速读】: 该论文旨在解决现有事实核查系统在处理“半真相”(half-truths)时的局限性问题,即这些系统通常仅评估陈述是否与检索到的证据一致,而忽略了关键信息缺失导致的误导性。针对这一挑战,作者提出了一项新的任务——半真相检测,并构建了PolitiFact-Hidden基准数据集,包含15k条政治声明及其句子级证据对齐和推断出的声明意图标注。解决方案的关键在于提出TRACER框架,这是一个模块化的重新评估机制,通过证据对齐、隐含意图推断和隐藏内容因果影响估计三个步骤识别基于信息遗漏的虚假信息。TRACER可无缝集成至现有事实核查流程中,并显著提升多个基线模型的性能,尤其在半真相分类F1指标上最高提升16点,验证了建模信息缺失对于可信事实核查的重要性。
链接: https://arxiv.org/abs/2508.00489
作者: Yixuan Tang,Jincheng Wang,Anthony K.H. Tung
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Fact verification systems typically assess whether a claim is supported by retrieved evidence, assuming that truthfulness depends solely on what is stated. However, many real-world claims are half-truths, factually correct yet misleading due to the omission of critical context. Existing models struggle with such cases, as they are not designed to reason about what is left unsaid. We introduce the task of half-truth detection, and propose PolitiFact-Hidden, a new benchmark with 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. To address this challenge, we present TRACER, a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. TRACER can be integrated into existing fact-checking pipelines and consistently improves performance across multiple strong baselines. Notably, it boosts Half-True classification F1 by up to 16 points, highlighting the importance of modeling omissions for trustworthy fact verification.
[NLP-30] GETALP@AutoMin 2025: Leveraging RAG to Answer Questions based on Meeting Transcripts
【速读】: 该论文旨在解决会议转录文本中的问答任务(question-answering based on meeting transcripts)问题,其核心挑战在于从非结构化对话内容中准确提取语义信息并生成高质量回答。解决方案的关键在于融合检索增强生成(Retrieval Augmented Generation, RAG)与抽象意义表示(Abstract Meaning Representation, AMR)技术,提出三种结合这两种方法的系统架构。实验表明,引入AMR可使约35%的问题获得高质量回答,并显著提升涉及区分不同参与者(如“谁”类问题)的问答准确性。
链接: https://arxiv.org/abs/2508.00476
作者: Jeongwoo Kang,Markarit Vartampetian,Felix Herron,Yongxin Zhou,Diandra Fabre,Gabriela Gonzalez-Saez
机构: Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学); CNRS (法国国家科学研究中心); Grenoble INP (格勒诺布尔综合理工学院); LIG (格勒诺布尔计算机科学实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper documents GETALP’s submission to the Third Run of the Automatic Minuting Shared Task at SIGDial 2025. We participated in Task B: question-answering based on meeting transcripts. Our method is based on a retrieval augmented generation (RAG) system and Abstract Meaning Representations (AMR). We propose three systems combining these two approaches. Our results show that incorporating AMR leads to high-quality responses for approximately 35% of the questions and provides notable improvements in answering questions that involve distinguishing between different participants (e.g., who questions).
[NLP-31] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges AAAI2026
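[Quick read]: This paper addresses bias in current large language model (LLM)-based dialogue evaluation, in particular the subjectivity and unreliability of single-model assessment under the "LLM-as-a-judge" paradigm. Existing multi-judge methods improve evaluation quality by aggregating the judgments of several LLMs, but at a large computational cost. The key to the solution is to aggregate the preference knowledge of multiple LLM judges into a single model that captures their collective wisdom. This preserves the benefits of diverse multi-judge feedback while sharply reducing inference-time cost, enabling fast and flexible dialogue quality assessment.
Link: https://arxiv.org/abs/2508.00454
Authors: Yuqi Tang,Kehua Feng,Yunfeng Wang,Zhiwen Chen,Chengfei Lv,Gang Yu,Qiang Zhang,Keyan Ding
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 2 pages, under review at AAAI 2026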
Abstract:Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
[NLP-32] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network
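[Quick read]: This paper addresses two core limitations of information propagation in Graph Neural Networks (GNNs). First, node informativeness is imbalanced: some nodes are rich in useful information while others are sparse. Second, predefined message passing relies mainly on local structural similarity and ignores distant but semantically related global relationships, which limits the model's ability to capture complex associations. The key to the solution is an agent-based framework, the Retrieval-augmented Graph Agentic Network (ReaGAN). Each node acts as an independent agent that plans its next action from its internal memory, giving node-level decision-making and adaptive message propagation, while retrieval-augmented generation (RAG) lets nodes access semantically relevant content and build global relationships. With a frozen LLM backbone and no fine-tuning, the method achieves competitive performance in few-shot in-context settings, supporting the value of agentic planning and local-global retrieval in graph learning.
Link: https://arxiv.org/abs/2508.00429
Authors: Minghao Guo,Xi Zhu,Jingyuan Huang,Kai Mei,Yongfeng Zhang
Affiliations: Rutgers University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 17 pages, work in progress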
Abstract:Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness – some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model’s ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.
[NLP-33] Combining Discrete Wavelet and Cosine Transforms for Efficient Sentence Embedding
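[Quick read]: This paper addresses the high dimensionality and information redundancy of word and sentence vectors in Natural Language Processing (NLP), and explores how to compress them efficiently while retaining semantic features. The key to the solution is to apply the Discrete Wavelet Transform (DWT) to word and sentence embeddings for dimensionality reduction and information consolidation, and then to combine it with the Discrete Cosine Transform (DCT) into a non-parameterized compression model. The model maps information-dense sentences to fixed-size vectors based on locally varying word features, and on downstream tasks it yields results comparable or even superior to the original embeddings.
Link: https://arxiv.org/abs/2508.00420
Authors: Rana Salama,Abdou Youssef,Mona Diab
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: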
Abstract:Wavelets have emerged as a cutting-edge technology in a number of fields. Concrete results of their application in image and signal processing suggest that wavelets can be effectively applied to Natural Language Processing (NLP) tasks that capture a variety of linguistic properties. In this paper, we leverage the power of applying Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We first evaluate, intrinsically and extrinsically, how wavelets can effectively be used to consolidate important information in a word vector while reducing its dimensionality. We further combine DWT with Discrete Cosine Transform (DCT) to propose a non-parameterized model that compresses a sentence with a dense amount of information in a fixed size vector based on locally varying word features. We show the efficacy of the proposed paradigm on downstream application models, yielding comparable and even superior (in some tasks) results to original embeddings.
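The DWT-plus-DCT idea can be illustrated with a small, parameter-free sketch. It assumes a single-level Haar wavelet that keeps only the approximation coefficients (halving each word vector) and a DCT-II along the word axis that keeps the first `k` coefficients. The abstract does not specify the wavelet family, the number of levels, or how the two transforms are combined, so these choices and the function names (`haar_approx`, `dct2_coeffs`, `sentence_vector`) are assumptions for illustration only.

```python
import math

def haar_approx(vec):
    """One level of the Haar DWT, keeping only the approximation (low-pass) half."""
    if len(vec) % 2:
        vec = vec + [0.0]  # zero-pad odd-length input
    s = math.sqrt(2.0)
    return [(vec[i] + vec[i + 1]) / s for i in range(0, len(vec), 2)]

def dct2_coeffs(signal, k):
    """First k unnormalized DCT-II coefficients of a 1-D signal."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * j / n) for i, x in enumerate(signal))
        for j in range(k)
    ]

def sentence_vector(word_vectors, k=2):
    """Map any number of word vectors to a fixed vector of size k * ceil(dim / 2)."""
    reduced = [haar_approx(list(w)) for w in word_vectors]
    dim = len(reduced[0])
    # DCT along the word axis, computed independently for each reduced dimension.
    per_dim = [dct2_coeffs([w[d] for w in reduced], k) for d in range(dim)]
    # Concatenate coefficient 0 for every dimension, then coefficient 1, and so on.
    return [per_dim[d][j] for j in range(k) for d in range(dim)]
```

The output length depends only on `k` and the word-vector dimension, not on sentence length, which is the fixed-size property the abstract describes.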
[NLP-34] Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
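[Quick read]: This paper addresses the accessibility and reproducibility problems of current AI agent systems: most existing agent frameworks are closed-source or depend on paid APIs and proprietary tools, which limits the research community's ability to develop and evaluate advanced AI agents. The key to the solution is Cognitive Kernel-Pro, a fully open-source and, to the maximum extent possible, free multi-module agent framework. The work systematically curates high-quality training data across four domains (web, file, code, and general reasoning) and introduces test-time reflection and voting to improve agent robustness and performance. On the GAIA benchmark it achieves state-of-the-art results among open-source and free agents; notably, the 8B-parameter model surpasses previous leading systems such as WebDancer and WebSailor, setting a new standard for accessible, high-capability AI agents.
Link: https://arxiv.org/abs/2508.00414
Authors: Tianqing Fang,Zhisong Zhang,Xiaoyang Wang,Rui Wang,Can Qin,Yuxuan Wan,Jun-Yu Ma,Ce Zhang,Jiaqi Chen,Xiyun Li,Hongming Zhang,Haitao Mi,Dong Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages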
Abstract:General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at this https URL
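Test-time voting, in its simplest form, can be sketched as a majority vote over the final answers of several independent agent runs. The abstract does not describe the paper's actual voting or reflection procedure, so the normalization rule and the function names (`normalize`, `majority_vote`) below are assumptions for illustration only.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.strip().lower().split())

def majority_vote(answers):
    """Return the most frequent normalized answer, or None if there are none."""
    counts = Counter(normalize(a) for a in answers if a and a.strip())
    if not counts:
        return None
    # On ties, Counter.most_common keeps the answer that was seen first.
    return counts.most_common(1)[0][0]
```

For example, three runs that return "Paris", " paris " and "Lyon" would vote for "paris".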
[NLP-35] Benchmarking LLMs for Unit Test Generation from Real-World Functions
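[Quick read]: This paper addresses two core problems in current benchmarks for function-level unit test generation with Large Language Models (LLMs): data contamination and structurally simple function code. As a result, existing benchmarks do not reflect how LLMs perform in real software engineering settings, and experimental results may fail to generalize because of memorization or overfitting to simple programs. The key to the solution is a new benchmark, ULT (UnLeakedTestbench), built through a multi-stage curation process that ensures high cyclomatic complexity and mitigates test case leakage, yielding 3,909 test generation tasks drawn from real-world Python functions. The authors also provide a paired benchmark, PLT (PreLeakedTestbench), for controlled analysis of how much LLMs rely on memorization versus reasoning in test generation, which makes the evaluation more rigorous and interpretable.
Link: https://arxiv.org/abs/2508.00408
Authors: Dong Huang,Jie M. Zhang,Mark Harman,Qianru Zhang,Mingzhe Du,See-Kiong Ng
Affiliations: National University of Singapore; King's College London; University College London; The University of Cambridge
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Under Review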
Abstract:Recently, large language models (LLMs) have shown great promise in automating unit test generation, significantly reducing the manual effort required by developers. To effectively evaluate the capabilities of LLMs in this domain, it is crucial to have a well-designed benchmark that accurately reflects real-world scenarios and mitigates common pitfalls. Existing LLM test generation benchmarks are limited by two critical drawbacks: data contamination and structurally simple function code. As a result, we often cannot rely on the validity of scientific conclusions drawn from empirical studies using these limited benchmarks. The empirical evidence presented may be biased due to contamination and may fail to generalize beyond toy programs due to structural simplicity. To address these problems, we introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level unit test generation from real-world Python functions. ULT is constructed through a multi-stage curation process that ensures high cyclomatic complexity and mitigates test case contamination. With 3,909 carefully selected function-level tasks, ULT provides a more realistic and challenging evaluation of LLMs’ test generation capabilities. We also provide PLT (PreLeakedTestbench), a pair benchmark of ULT with leaked tests designed to enable a controlled analysis of memorization versus reasoning in test generation. Our evaluation results demonstrate that ULT is significantly more challenging. For example, test cases generated by LLMs only achieve 41.32%, 45.10%, 30.22%, and 40.21% for accuracy, statement coverage, branch coverage, and mutation score on average for all LLMs, respectively. These results are substantially lower than the corresponding metrics on TestEval (91.79%, 92.18%, 82.04%, and 49.69%) and PLT (47.07%, 55.13%, 40.07%, and 50.80%). 
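A complexity filter of the kind mentioned in the curation process can be approximated with Python's standard `ast` module. This is a rough sketch and not the ULT pipeline: it counts one plus the number of common decision points, which approximates McCabe cyclomatic complexity, and the threshold and function names (`cyclomatic_complexity`, `keep_complex`) are hypothetical.

```python
import ast

# Node types treated as decision points in this approximation.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity

def keep_complex(functions, threshold=10):
    """Keep only source snippets whose approximate complexity meets the threshold."""
    return [src for src in functions if cyclomatic_complexity(src) >= threshold]
```

A real pipeline would more likely rely on a dedicated metrics tool; the point here is only that the filter is a static property of the function source.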
[NLP-36] SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation
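[Quick read]: This paper addresses the low training efficiency, slow convergence, and insufficient attention to sample difficulty in reinforcement learning (RL) for UAV Vision-Language Navigation (VLN). The key to the solution is a training framework called Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS), with two main components: 1) a Semantic-Aware Difficulty Estimator (SA-DE) that quantifies the complexity of training samples; and 2) a Gaussian Curriculum Scheduler (GCS) that dynamically adjusts the sampling distribution to progress from easy to hard samples. The approach improves training efficiency and model performance, and its effectiveness and scalability are validated on the CityNav benchmark.
Link: https://arxiv.org/abs/2508.00390
Authors: Hengxing Cai,Jinhan Dong,Yijie Rao,Jingcheng Deng,Jingjun Tan,Qien Chen,Haidong Wang,Zhen Wang,Shiyu Huang,Agachai Sumalee,Renxin Zhong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: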
Abstract:Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) aims to enable agents to accurately localize targets and plan flight paths in complex environments based on natural language instructions, with broad applications in intelligent inspection, disaster rescue, and urban monitoring. Recent progress in Vision-Language Models (VLMs) has provided strong semantic understanding for this task, while reinforcement learning (RL) has emerged as a promising post-training strategy to further improve generalization. However, existing RL methods often suffer from inefficient use of training data, slow convergence, and insufficient consideration of the difficulty variation among training samples, which limits further performance improvement. To address these challenges, we propose Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS), a novel training framework that systematically integrates Curriculum Learning (CL) into RL. SA-GCS employs a Semantic-Aware Difficulty Estimator (SA-DE) to quantify the complexity of training samples and a Gaussian Curriculum Scheduler (GCS) to dynamically adjust the sampling distribution, enabling a smooth progression from easy to challenging tasks. This design significantly improves training efficiency, accelerates convergence, and enhances overall model performance. Extensive experiments on the CityNav benchmark demonstrate that SA-GCS consistently outperforms strong baselines across all metrics, achieves faster and more stable convergence, and generalizes well across models of different scales, highlighting its robustness and scalability. The implementation of our approach is publicly available.
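A Gaussian curriculum scheduler can be sketched as a sampling distribution whose mean slides from easy to hard as training progresses. The sketch below assumes difficulty scores already normalized to [0, 1], a mean equal to the training progress, and a fixed `sigma`; the paper's actual schedule and difficulty estimator are not given in the abstract, so all names and defaults are hypothetical.

```python
import math
import random

def gaussian_weights(difficulties, progress, sigma=0.15):
    """Sampling probabilities peaked at the difficulty matching training progress."""
    mu = min(max(progress, 0.0), 1.0)  # progress in [0, 1] sets the target difficulty
    raw = [math.exp(-((d - mu) ** 2) / (2 * sigma ** 2)) for d in difficulties]
    total = sum(raw)
    return [r / total for r in raw]

def sample_batch(samples, difficulties, progress, batch_size, rng=random):
    """Draw a batch (with replacement) according to the Gaussian curriculum weights."""
    weights = gaussian_weights(difficulties, progress, sigma=0.15)
    return rng.choices(samples, weights=weights, k=batch_size)
```

Early in training (`progress` near 0) easy samples dominate each batch; late in training (`progress` near 1) hard samples dominate, with a smooth transition in between.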
[NLP-37] Multi-Layer Attention is the Amplifier of Demonstration Effectiveness
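[Quick read]: This paper addresses the problem of ineffective demonstrations in In-Context Learning (ICL): not every demonstration improves model performance, yet existing selection methods mostly rely on relevance to the user query and overlook what the model has already assimilated. The key to the solution is a gradient-flow analysis and a new method, GradS, which uses the magnitude of each demonstration's gradient flow with respect to a given query as the selection criterion, so that the chosen demonstrations are neither redundant (already learned by the model) nor irrelevant to the query. Experiments confirm that the disparity in effectiveness among demonstrations is amplified as the number of layers increases, and GradS achieves a relative improvement of 6.8% on average over the strongest baselines across several mainstream LLMs and datasets.
Link: https://arxiv.org/abs/2508.00385
Authors: Dingzirui Wang,Xuangliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: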
Abstract:Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes the effectiveness of the demonstrations provided within ICL, while much research indicates that not all demonstrations are effective, with some failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either been learned by the model or is irrelevant to the user query. Furthermore, we demonstrate that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on the relevance to the user query while overlooking the information that the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow of the demonstration with respect to a given user query as the criterion, thereby ensuring the effectiveness of the chosen ones. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations is magnified as the number of model layers increases, substantiating our derivations. Moreover, GradS achieves a relative improvement of 6.8% on average over the strongest baselines, demonstrating its effectiveness.
[NLP-38] EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
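[Quick read]: This paper addresses the inefficiency of deploying Transformer-based Large Language Models (LLMs) on resource-constrained edge devices for long-sequence tasks, which stems from the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. Existing KV cache optimizations improve memory efficiency but often fail to reduce time to first token (TTFT) and may degrade performance through token pruning, while alternative sequence modeling architectures typically require full retraining and lack infrastructure support. The key to the proposed EdgeInfinite-Instruct is a Segmented Supervised Fine-Tuning (S-SFT) strategy designed for long-sequence tasks to strengthen instruction following, combined with fine-grained post-training quantization (PTQ) and a fixed-shape computation graph. Together these reduce computational cost while maintaining accuracy and match the hardware characteristics of edge neural processing units (NPUs), enabling efficient, low-latency on-device deployment.
Link: https://arxiv.org/abs/2508.00370
Authors: Jiyu Chen,Poh Seng Lim,Shuang Peng,Daxiong Luo,JungHau Foo,Yap Deep,Timothy Lee Jun Jie,Kelvin Teh Kae Wen,Fan Yang,Danyu Feng,Hao-Yun Chen,Peng-Wen Chen,Fangyuan Li,Xiaoxin Chen,Wong Wai Mun
Affiliations: 1. National University of Singapore; 2. NUS Graduate School for Integrative Sciences and Engineering; 3. Institute of High Performance Computing
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages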
Abstract:Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.
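Fine-grained post-training quantization can be illustrated with symmetric per-group quantization, where each small group of weights gets its own scale. This is a generic sketch and not the EdgeInfinite-Instruct scheme: the abstract does not state the granularity, bit width, or calibration method, so the group size, the 8-bit default, and the function names are assumptions.

```python
def quantize_group(values, bits=8):
    """Symmetric quantization of one group: integers in [-qmax, qmax] plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def quantize_weights(weights, group_size=4, bits=8):
    """Split a flat weight list into groups and quantize each with its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]

def dequantize(groups):
    """Reconstruct approximate float weights from (quantized, scale) groups."""
    return [q * scale for quantized, scale in groups for q in quantized]
```

Giving each group its own scale keeps a few large weights from coarsening the resolution of small weights elsewhere, which is the usual motivation for finer granularity.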
[NLP-39] Lucy: edgerunning agentic web search on mobile with machine generated task vectors
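[Quick read]: This paper addresses the weak performance of Small Language Models (SLMs) on knowledge-intensive tasks, which results from their constrained capacity. Most test-time computation approaches treat reasoning as a fixed or heuristic process. The key to the solution is a new paradigm that views the model's internal reasoning, delimited by think tags, as a dynamic task vector machine, and optimizes this mechanism through RLVR so that the model constructs and refines its own task vectors on the fly. With this mechanism and MCP integration, the authors present Lucy, a 1.7B-parameter model that reaches 78.3% accuracy on the SimpleQA benchmark, on par with much larger models such as DeepSeek-V3, showing that structured, self-constructed task reasoning can substantially strengthen small models.
Link: https://arxiv.org/abs/2508.00360
Authors: Alan Dao(Gia Tuan Dao),Dinh Bach Vu,Alex Nguyen,Norapat Buppodom
Affiliations: Menlo Research
Subjects: Computation and Language (cs.CL)
Comments: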
Abstract:Small language models (SLMs) are inherently limited in knowledge-intensive tasks due to their constrained capacity. While test-time computation offers a path to enhanced performance, most approaches treat reasoning as a fixed or heuristic process. In this work, we propose a new paradigm: viewing the model’s internal reasoning, delimited by <think> and </think> tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, we interpret the generation process itself as a mechanism through which the model constructs and refines its own task vectors on the fly. We developed a method to optimize this dynamic task vector machine through RLVR and successfully trained an agentic web-search model. We present Lucy, a 1.7B-parameter SLM that leverages this dynamic reasoning mechanism with MCP integration to achieve 78.3% accuracy on the SimpleQA benchmark, performing on par with much larger models such as DeepSeek-V3. This demonstrates that small models can rival large ones when equipped with structured, self-constructed task reasoning.
[NLP-40] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
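[Quick read]: This paper addresses three challenges faced by Large Language Model (LLM) agents on complex tasks. First, the widely used ReAct paradigm couples single-step reasoning with immediate action execution, which makes long-term strategic planning difficult. Second, coordination between the planner and the executor is insufficient, which hurts overall decision-making. Third, mainstream methods rely on supervised fine-tuning, which leads models to memorize fixed task trajectories and limits generalization. The key to the solution is AdaPlan, an adaptive global plan-based agent paradigm, together with the PilotRL training framework, which uses progressive reinforcement learning to optimize, in stages, the model's ability to follow global plans, the quality of the generated plans, and the joint coordination of planning and execution. The result is effective long-horizon decision-making and better adaptation to new environments.
Link: https://arxiv.org/abs/2508.00344
Authors: Keer Lu,Chong Chen,Bin Cui,Huang Leng,Wentao Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: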
Abstract:Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
[NLP-41] Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment
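[Quick read]: This paper addresses the performance degradation of multimodal sentence embedding models caused by noise (such as redundant or irrelevant information) in the image-caption pairs used for training. The key to the solution is fine-grained object-phrase alignment: existing segmentation and object detection models are used to extract accurate object-phrase pairs, which then drive a contrastive learning objective tailored to object-phrase correspondence, improving the accuracy and robustness of multimodal representation learning.
Link: https://arxiv.org/abs/2508.00332
Authors: Kaiyan Zhao,Zhongtao Miao,Yoshimasa Tsuruoka
Affiliations: The University of Tokyo
Subjects: Computation and Language (cs.CL)
Comments: Work in progress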
Abstract:Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
[NLP-42] R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge
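[Quick read]: This paper addresses the safety risk that Large Reasoning Models (LRMs) frequently comply with harmful user instructions. The study finds that LRMs already possess sufficient safety knowledge but fail to activate it during reasoning. The key to the solution is R1-Act, a post-training method that explicitly triggers the model's safety knowledge through a structured reasoning process, substantially improving safety without harming reasoning ability. The method needs only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU, and shows good robustness, scalability, and practical efficiency.
Link: https://arxiv.org/abs/2508.00324
Authors: Yeonjun In,Wonjoong Kim,Sangwu Park,Chanyoung Park
Affiliations: KAIST
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review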
Abstract:Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.
[NLP-43] Systematic Evaluation of Optimization Techniques for Long-Context Language Models
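[Quick read]: This paper addresses the high resource demands and limited context windows of Large Language Models (LLMs) in long-text settings, and the fact that the effectiveness and system-level impact of existing optimizations (such as pruning, quantization, and token dropping) in long-context tasks remain unclear. The key to the solution is a systematic benchmark of these optimizations and their combinations that relates system-level metrics (memory usage, latency, and throughput) to text generation quality. The study shows that naively stacking optimizations can hurt larger models because approximation errors compound, and argues that system-level profiling must be combined with task-specific evaluation to balance efficiency, accuracy, and scalability.
Link: https://arxiv.org/abs/2508.00305
Authors: Ammar Ahmed,Sheng Di,Franck Cappello,Zirui Liu,Jingoo Han,Ali Anwar
Affiliations: University of Minnesota, Twin Cities, USA; Argonne National Laboratory, Lemont, USA; Samsung Semiconductor Inc.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Comments: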
Abstract:Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger 70-billion-parameter model variant. Our novel insights reveal that naively combining inference optimization algorithms can adversely affect larger models, more so than their smaller counterparts, due to compounded approximation errors. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
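The point that F1 alone can hide precision-recall trade-offs is easy to see with token-level scoring of the kind commonly used for extractive question answering. The sketch below assumes whitespace tokenization and lowercasing; the paper's exact metric implementation is not given in the abstract, and the function name is hypothetical.

```python
from collections import Counter

def precision_recall_f1(prediction: str, reference: str):
    """Token-overlap precision, recall, and F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Against the reference "signed in paris 1898", the terse answer "paris 1898" has precision 1.0 and recall 0.5, while the verbose answer "the treaty was signed in paris 1898 reportedly" has precision 0.5 and recall 1.0. Both get the same F1 of about 0.667, so a model that drifts from terse to verbose answers after optimization looks unchanged if only F1 is reported.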
[NLP-44] Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering
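[Quick read]: This paper addresses the limited diagnostic reliability of Large Language Models (LLMs) in complex clinical scenarios, especially their weakness in etiology reasoning. The key to the solution is an Etiology-Aware Attention Steering Framework. It constructs a structured Clinical Reasoning Scaffolding (CRS) from authoritative clinical guidelines, identifies the attention heads that matter most for etiology reasoning (Etiology-Aware Head Identification), and applies Reasoning-Guided Parameter-Efficient Fine-tuning, which embeds etiological reasoning cues into the input representations and uses a Reasoning-Guided Loss to steer the selected heads toward critical clinical information. This improves both diagnostic accuracy and the interpretability of clinical reasoning.
Link: https://arxiv.org/abs/2508.00285
Authors: Peixian Li,Yu Tian,Ruiqi Tu,Chengkai Wu,Jingjing Ren,Jingsong Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 8 figures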
Abstract:Objective: Large Language Models (LLMs) demonstrate significant capabilities in medical text understanding and generation. However, their diagnostic reliability in complex clinical scenarios remains limited. This study aims to enhance LLMs’ diagnostic accuracy and clinical reasoning ability. Method: We propose an Etiology-Aware Attention Steering Framework to integrate structured clinical reasoning into LLM-based diagnosis. Specifically, we first construct Clinical Reasoning Scaffolding (CRS) based on authoritative clinical guidelines for three representative acute abdominal emergencies: acute appendicitis, acute pancreatitis, and acute cholecystitis. Next, we develop the Etiology-Aware Head Identification algorithm to pinpoint attention heads crucial for the model’s etiology reasoning. To ensure reliable clinical reasoning alignment, we introduce the Reasoning-Guided Parameter-Efficient Fine-tuning that embeds etiological reasoning cues into input representations and steers the selected Etiology-Aware Heads toward critical information through a Reasoning-Guided Loss function. Result: On the Consistent Diagnosis Cohort, our framework improves average diagnostic accuracy by 15.65% and boosts the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort further confirms its effectiveness in enhancing diagnostic accuracy. Further assessments via Reasoning Attention Frequency indicate that our models exhibit enhanced reliability when faced with real-world complex scenarios. Conclusion: This study presents a practical and effective approach to enhance clinical reasoning in LLM-based diagnosis. By aligning model attention with structured CRS, the proposed framework offers a promising paradigm for building more interpretable and reliable AI diagnostic systems in complex clinical settings.
[NLP-45] Mind the Gap: The Divergence Between Human and LLM-Generated Tasks
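[Quick read]: This paper asks whether generative AI can simulate human task generation, which is driven by intrinsic motivation and embodied cognition. The study shows that human task generation is significantly influenced by psychological drivers such as personal values (e.g., Openness to Change) and cognitive style. Even when these drivers are given explicitly to a large language model (LLM), the tasks it generates are less social, less physical, and thematically biased toward abstraction, indicating that current LLMs rely mainly on statistical patterns and not on value-driven, embodied mechanisms. The key conclusion is that future agent designs need to incorporate intrinsic motivation and physical grounding to close the gap between LLMs and human cognition.
Link: https://arxiv.org/abs/2508.00282
Authors: Yi-Long Lu,Jiajun Song,Chunhui Zhang,Wei Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: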
Abstract:Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns. It produces tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, while the LLM’s tasks were perceived as more fun and novel, this highlights a disconnect between its linguistic proficiency and its capacity to generate human-like, embodied [...]. We conclude that there is a core gap between the value-driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human-aligned agents.
[NLP-46] MetaAgent: Toward Self-Evolving Agent via Tool Meta-Learning
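[Quick read]: This paper addresses the lack of continual self-evolution in current agents on complex knowledge discovery tasks: how can an agent keep improving its reasoning and tool-use strategies through practice, without parameter updates or further post-training? The key to the solution is the MetaAgent framework, whose core mechanisms are: 1) a minimal workflow built on the learning-by-doing principle, equipped only with basic reasoning and adaptive help-seeking abilities; 2) natural language help requests, generated when a knowledge gap is encountered and routed to external tools by a dedicated tool router; 3) continual self-reflection and answer verification, which distill experience into concise texts that are dynamically incorporated into future task contexts; and 4) autonomous construction of in-house tools and a persistent knowledge base. Together these form a data-driven process termed meta tool learning, which incrementally improves task performance without changing model parameters.
Link: https://arxiv.org/abs/2508.00271
Authors: Hongjin Qian,Zheng Liu
Affiliations: BAAI
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Technical Report, 14 pages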
Abstract:In this work, we propose MetaAgent, an agentic paradigm inspired by the principle of learning-by-doing, where expertise is developed through hands-on practice and continual self-improvement. MetaAgent starts with a minimal workflow, equipped only with basic reasoning and adaptive help-seeking abilities. When a knowledge gap is encountered, MetaAgent generates natural language help requests, which are routed to the most suitable external tool by a dedicated tool router. As MetaAgent solves tasks, it continually conducts self-reflection and answer verification, distilling actionable experience into concise texts that are dynamically incorporated into future task contexts. Besides, MetaAgent autonomously builds in-house tools and a persistent knowledge base by organizing its tool-use history, further enhancing its ability to retrieve and integrate relevant information. We term this continual, data-driven process meta tool learning, through which MetaAgent incrementally refines its reasoning and tool-use strategies, without changing model parameters or requiring further post-training. Evaluated on challenging knowledge discovery benchmarks, including GAIA, WebWalkerQA, and BrowseCamp, MetaAgent consistently outperforms workflow-based baselines and matches or exceeds end-to-end trained agents, demonstrating the promise of self-evolving agentic systems for robust, general-purpose knowledge discovery. Our source code is available at this https URL.
[NLP-47] Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English
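【Quick Read】: This paper asks whether the widespread use of Large Language Models (LLMs) has changed the human language system itself, rather than only producing surface-level lexical preferences through text-generation tools. The key to the approach is a 22.1-million-word corpus of unscripted spoken language (drawn from science and technology podcasts), used to compare usage trends of LLM-associated words before and after ChatGPT's release in 2022. The study finds that these words increased significantly after 2022 while baseline synonyms showed no clear change, indicating that human word choices are converging toward LLM lexical patterns and suggesting a possible shift in the language system driven by AI exposure.
Link: https://arxiv.org/abs/2508.00238
Authors: Bryce Anderson,Riley Galpin,Tom S. Juzek
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at AIES 2025. To appear in the AIES Proceedings. 14 pages, 2 figures, 2 tables. Licensed under CC BY-SA 4.0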
Abstract:In recent years, written language, particularly in science and education, has undergone remarkable shifts in word usage. These changes are widely attributed to the growing influence of Large Language Models (LLMs), which frequently rely on a distinct lexical style. Divergences between model output and target audience norms can be viewed as a form of misalignment. While these shifts are often linked to using Artificial Intelligence (AI) directly as a tool to generate text, it remains unclear whether the changes reflect broader changes in the human language system itself. To explore this question, we constructed a dataset of 22.1 million words from unscripted spoken language drawn from conversational science and technology podcasts. We analyzed lexical trends before and after ChatGPT’s release in 2022, focusing on commonly LLM-associated words. Our results show a moderate yet significant increase in the usage of these words post-2022, suggesting a convergence between human word choices and LLM-associated patterns. In contrast, baseline synonym words exhibit no significant directional shift. Given the short time frame and the number of words affected, this may indicate the onset of a remarkable shift in language use. Whether this represents natural language change or a novel shift driven by AI exposure remains an open question. Similarly, although the shifts may stem from broader adoption patterns, it may also be that upstream training misalignments ultimately contribute to changes in human language use. These findings parallel ethical concerns that misaligned models may shape social and moral beliefs.
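[NLP-48] Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product ICCV2025
【Quick Read】: This paper addresses limitations of Low-rank Adaptation (LoRA) when applied to multimodal and large language models, in particular its poor approximation of matrices with flat spectra or high-frequency components (that is, matrices with high effective rank). The key is KRAdapter, a new Parameter-efficient Fine-tuning (PEFT) algorithm that builds weight updates with the Khatri-Rao product, which by construction tends to yield matrix products with high effective rank. It keeps LoRA's memory and compute efficiency while significantly improving performance on vision-language models (up to 1B parameters) and large language models (up to 8B parameters), with stronger generalization on unseen common-sense reasoning tasks in particular.
Link: https://arxiv.org/abs/2508.00230
Authors: Paul Albert,Frederic Z. Zhang,Hemanth Saratchandran,Anton van den Hengel,Ehsan Abbasnejad
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in ICCV 2025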
Abstract:Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared against full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison amongst full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectrums or high frequency components – signs of high effective ranks. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tends to produce matrix product with a high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models.
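To make the rank argument concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a hypothetical square layer of width 64 and compares a LoRA-style update with a Khatri-Rao (column-wise Kronecker) update that uses the same number of trainable parameters. The helper name `khatri_rao`, the sizes, and the random initialization are illustrative assumptions; how KRAdapter actually parameterizes and scales its factors is not stated in the abstract.

```python
import numpy as np


def khatri_rao(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Column-wise Kronecker product of a (p x n) and b (q x n), giving (p*q x n)."""
    if a.shape[1] != b.shape[1]:
        raise ValueError("a and b must have the same number of columns")
    # Column k of the result is kron(a[:, k], b[:, k]).
    return np.einsum("ik,jk->ijk", a, b).reshape(a.shape[0] * b.shape[0], a.shape[1])


rng = np.random.default_rng(0)
d = 64        # hypothetical layer width (assumption)
p = q = 8     # factor heights chosen so that p * q == d

# LoRA-style update: the product of a (d x r) and an (r x d) matrix has rank at most r.
r = 8
lora_update = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

# Khatri-Rao update with the same budget: (p + q) * d == 2 * d * r == 1024 parameters.
kr_update = khatri_rao(rng.standard_normal((p, d)), rng.standard_normal((q, d)))

lora_rank = np.linalg.matrix_rank(lora_update)
kr_rank = np.linalg.matrix_rank(kr_update)
print("LoRA rank:", lora_rank, "Khatri-Rao rank:", kr_rank)
```

With random factors the Khatri-Rao update is typically far from low rank, while the LoRA update is capped at `r`; this only illustrates the construction the abstract refers to, not the reported results.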
[NLP-49] RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
【Quick Read】: This paper addresses two core problems that Reinforcement Learning with Verifiable Reward (RLVR) faces when improving the complex reasoning of Large Language Models (LLMs): first, because of RLVR's inherently on-policy strategy, the LLM's immense action space, and sparse rewards, it struggles to break through the base model's inherent capability boundary; second, RLVR can cause capability boundary collapse, narrowing the range of problems the model can solve. The key is RL-PLUS, which combines internal exploitation ("Thinking") with external data ("Learning") to achieve stronger reasoning and surpass the base model's capability boundary. Its core innovations are Multiple Importance Sampling, which mitigates the distributional mismatch introduced by external data, and an Exploration-Based Advantage Function, which guides the model toward high-value, unexplored reasoning paths, thereby expanding the reasoning capability boundary and improving generalization.
Link: https://arxiv.org/abs/2508.00222
Authors: Yihong Dong,Xue Jiang,Yongding Tao,Huanyu Liu,Kechi Zhang,Lili Mou,Rongyu Cao,Yingwei Ma,Jue Chen,Binhua Li,Zhi Jin,Fei Huang,Yongbin Li,Ge Li
Affiliations: Peking University; Tongyi Lab, Alibaba Group; University of Alberta; Canada CIFAR AI Chair
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its inherently on-policy strategy with LLM’s immense action space and sparse reward. Further, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling to address for distributional mismatch from external data, and an Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. The results show that RL-PLUS achieves state-of-the-art performance compared with existing RLVR methods on six math reasoning benchmarks and exhibits superior performance on six out-of-distribution reasoning tasks. It also achieves consistent and significant gains across diverse model families, with average relative improvements ranging from 21.1% to 69.2%. Moreover, Pass@k curves across multiple benchmarks indicate that RL-PLUS effectively resolves the capability boundary collapse problem.
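The abstract relies on Pass@k curves as evidence about the capability boundary. As a hedged illustration, the sketch below implements the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples per problem of which c are correct. Whether the paper computes Pass@k exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem, c: number of correct samples, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 16 samples per problem, 2 of them correct, evaluated at several budgets.
for k in (1, 4, 16):
    print(k, round(pass_at_k(16, 2, k), 4))
```

A model whose boundary has collapsed tends to show a Pass@k curve that flattens early as k grows, which is why the curve (rather than Pass@1 alone) is the relevant diagnostic here.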
[NLP-50] Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform
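【Quick Read】: This paper addresses the high dimensionality, computational cost, and redundancy of embedding representations in Natural Language Processing (NLP). The key is to apply the Discrete Wavelet Transform (DWT) to analyze and compress word and sentence embeddings at different resolution levels, reducing the dimensionality of high-dimensional embeddings while retaining key semantic information. Experiments show that DWT can reduce embedding dimensionality by 50-93% with almost no change in performance on semantic similarity tasks, and improves accuracy on most downstream tasks, offering an efficient and effective method for compressing and refining embeddings in NLP applications.
Link: https://arxiv.org/abs/2508.00220
Authors: Rana Aref Salama,Abdou Youssef,Mona Diab
Affiliations: George Washington University; Cairo University; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: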
Abstract:Wavelet transforms, a powerful mathematical tool, have been widely used in different domains, including Signal and Image processing, to unravel intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from their application suggest that Wavelet transforms can be applied to NLP capturing a variety of linguistic and semantic properties. In this paper, we empirically leverage the application of Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We aim to showcase the capabilities of DWT in analyzing embedding representations at different levels of resolution and compressing them while maintaining their overall quality. We assess the effectiveness of DWT embeddings on semantic similarity tasks to show how DWT can be used to consolidate important semantic information in an embedding vector. We show the efficacy of the proposed paradigm using different embedding models, including large language models, on downstream tasks. Our results show that DWT can reduce the dimensionality of embeddings by 50-93% with almost no change in performance for semantic similarity tasks, while achieving superior accuracy in most downstream tasks. Our findings pave the way for applying DWT to improve NLP applications.
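As a rough illustration of how a DWT can shrink an embedding, the sketch below applies a Haar transform repeatedly and keeps only the approximation coefficients, halving the dimension at each level. The choice of the Haar wavelet, the decision to discard detail coefficients, and the 768-dimensional random vector are all assumptions made for this example; the abstract does not specify which wavelet or coefficients the authors use.

```python
import numpy as np


def haar_dwt(x):
    """One level of the Haar DWT: returns (approximation, detail), each half as long."""
    x = np.asarray(x, dtype=float)
    if x.size % 2:
        raise ValueError("input length must be even")
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)


def compress(embedding, levels):
    """Keep only the approximation coefficients after `levels` Haar decompositions."""
    approx = np.asarray(embedding, dtype=float)
    for _ in range(levels):
        approx, _detail = haar_dwt(approx)
    return approx


rng = np.random.default_rng(0)
embedding = rng.standard_normal(768)  # stand-in for a real sentence embedding

for levels in (1, 2, 4):
    z = compress(embedding, levels)
    reduction = 1.0 - z.size / embedding.size
    print(f"levels={levels} dim={z.size} reduction={reduction:.2%}")
```

Under these assumptions, one level gives a 50% reduction and four levels give 93.75%, which lines up with the 50-93% range quoted in the abstract.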
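[NLP-51] Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges
【Quick Read】: This paper addresses the challenges that Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) face in table understanding: task diversity that has prevented unified methods, difficulty handling complex table structures, and weak generalization across tabular representations. The key is a systematic taxonomy of tabular input representations together with a clear characterization of the scope and properties of table understanding tasks, giving later research a clear framework. It also identifies three key research gaps: the predominance of retrieval-focused tasks, insufficient handling of complex table structures and long contexts, and limited cross-format generalization.
Link: https://arxiv.org/abs/2508.00217
Authors: Xiaofeng Wu,Alan Ritter,Wei Xu
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Comments: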
Abstract:Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction of table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, long context, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.
[NLP-52] Comparison of Large Language Models for Deployment Requirements
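【Quick Read】: This paper addresses the difficulty researchers and companies have, given the rapid development of Large Language Models (LLMs), in choosing the most suitable open-source foundational or domain-specific model with respect to licensing and hardware requirements. The key is a continuously updated comparative list that systematically records core features of foundational and domain-specific models, including release year, license type, and hardware requirements, helping users select and deploy a suitable LLM efficiently.
Link: https://arxiv.org/abs/2508.00185
Authors: Alper Yaman,Jannik Schwab,Christof Nitsche,Abhirup Sinha,Marco Huber
Affiliations: Fraunhofer Institute for Manufacturing Engineering and Automation IPA
Categories: Computation and Language (cs.CL)
Comments: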
Abstract:Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPTs) are revolutionizing the generation of human-like text, producing contextually relevant and syntactically correct content. Despite challenges like biases and hallucinations, these Artificial Intelligence (AI) models excel in tasks, such as content creation, translation, and code generation. Fine-tuning and novel architectures, such as Mixture of Experts (MoE), address these issues. Over the past two years, numerous open-source foundational and fine-tuned models have been introduced, complicating the selection of the optimal LLM for researchers and companies regarding licensing and hardware requirements. To navigate the rapidly evolving LLM landscape and facilitate LLM selection, we present a comparative list of foundational and domain-specific models, focusing on features, such as release year, licensing, and hardware requirements. This list is published on GitLab and will be continuously updated.
[NLP-53] On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI MICCAI2025
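【Quick Read】: This paper addresses the over-reliance of multimodal medical models on textual information in clinical decision-making, a bias that can lead models to overlook critical visual cues and harm diagnostic accuracy. The key is Selective Modality Shifting (SMS), a perturbation-based quantification method that systematically swaps images or text between samples with opposing labels to expose how much a model relies on each modality in binary classification tasks. Experiments show that several open-source Vision-Language Models (VLMs) remain markedly text-dependent even when complementary visual information is present, and attention analysis further confirms that image content is often overshadowed by text details. The method provides a practical tool for evaluating and improving how well multimodal medical AI models truly integrate modalities.
Link: https://arxiv.org/abs/2508.00171
Authors: David Restrepo,Ira Ktena,Maria Vakalopoulou,Stergios Christodoulidis,Enzo Ferrante
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to MICCAI 2025 1st Workshop on Multimodal Large Language Models (MLLMs) in Clinical Practice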
Abstract:Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model’s reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. We assess six open-source VLMs-four generalist models and two fine-tuned for medical data-on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By assessing model performance and the calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. We also perform a qualitative attention-based analysis which further confirms that image content is often overshadowed by text details. Our findings highlight the importance of designing and evaluating multimodal medical models that genuinely integrate visual and textual cues, rather than relying on single-modality signals.
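The swapping step described in the abstract can be sketched as follows. This is an illustrative reading of "swapping text between samples with opposing labels", not the authors' code: the `Sample` type, the field names, and the random choice of donor are assumptions, and the image-swapping variant would be symmetric.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    image_id: str   # hypothetical reference to an image file
    report: str     # associated clinical text
    label: int      # binary label, 0 or 1


def shift_text(samples, seed=0):
    """Give every sample the report of a randomly chosen sample with the opposite label."""
    rng = random.Random(seed)
    pools = {
        0: [s for s in samples if s.label == 0],
        1: [s for s in samples if s.label == 1],
    }
    if not pools[0] or not pools[1]:
        raise ValueError("need at least one sample of each label")
    # Image and label stay fixed; only the text modality is perturbed.
    return [replace(s, report=rng.choice(pools[1 - s.label]).report) for s in samples]


samples = [
    Sample("img_001", "No acute findings.", 0),
    Sample("img_002", "Findings consistent with disease.", 1),
    Sample("img_003", "Lungs are clear.", 0),
    Sample("img_004", "Abnormal opacity noted.", 1),
]
shifted = shift_text(samples)
for original, perturbed in zip(samples, shifted):
    print(original.image_id, original.label, "->", perturbed.report)
```

If a model's predictions on `shifted` follow the swapped text rather than the unchanged image, that is the kind of text dependency the abstract reports.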
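[NLP-54] Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
【Quick Read】: This paper addresses the difficulty of detecting and defending against novel potential threats (such as backdoor attacks or unlearning) in Large Language Models (LLMs) when the full training data is unavailable. Conventional interpretability methods rely on activation analysis and usually assume test data distributed similarly to the training data, which fails for out-of-distribution malicious behavior. The key is a weight-difference analysis: compute the top singular vectors of the difference between the fine-tuned and base model weights to identify newly acquired behaviors, then monitor the cosine similarity of activations along these directions to detect anomalous behavior with high precision. Experiments show that, without distributionally similar data, the method stops up to 100% of backdoor attacks (false positive rate below 1.2%), detects inference on unlearned topics with accuracy up to 95.42%, and can even steer the model to recover "unlearned" information.
Link: https://arxiv.org/abs/2508.00161
Authors: Ziqian Zhong,Aditi Raghunathan
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: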
Abstract:The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover “unlearned” information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at this https URL.
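The monitoring idea can be sketched on a single toy weight matrix. This is a minimal illustration under stated assumptions, not the released implementation: it uses the left singular vectors of the weight difference as monitored directions, a synthetic rank-one "fine-tuning" update, and a hypothetical flagging threshold of 0.5. Which layers, which side of the SVD, and which thresholds the authors use are not given in the abstract.

```python
import numpy as np


def top_directions(w_base, w_ft, k):
    """Top-k left singular vectors of the weight difference (columns are unit vectors)."""
    u, _s, _vt = np.linalg.svd(w_ft - w_base, full_matrices=False)
    return u[:, :k]


def cosine_scores(activation, directions):
    """Cosine similarity between one activation vector and each monitored direction."""
    return directions.T @ (activation / np.linalg.norm(activation))


def is_flagged(activation, directions, threshold=0.5):
    """Flag an activation that aligns strongly with any monitored direction."""
    return bool(np.max(np.abs(cosine_scores(activation, directions))) > threshold)


rng = np.random.default_rng(0)
d = 32
w_base = rng.standard_normal((d, d))

# Synthetic fine-tuning: a rank-one update that writes along the unit direction u.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
w_ft = w_base + 5.0 * np.outer(u, v)

directions = top_directions(w_base, w_ft, k=1)

triggered = 2.0 * u                    # activation dominated by the new behavior
benign = rng.standard_normal(d)
benign -= (benign @ u) * u             # activation with no component along u

print(is_flagged(triggered, directions), is_flagged(benign, directions))
```

The point of the construction is that the monitored directions come only from the two sets of weights, so no data resembling the unknown fine-tuning set is needed to build the detector.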
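[NLP-55] Is neural semantic parsing good at ellipsis resolution or isn't it?
【Quick Read】: This paper addresses the performance drop of neural semantic parsers on strongly context-sensitive phenomena such as English verb phrase ellipsis. These phenomena require the model to duplicate and integrate large pieces of semantic information to form a complete meaning representation; existing parsers perform well on standard test sets (semantic matching scores above 90%) but fail on instances involving ellipsis. The key is a challenge set of 120 ellipsis cases with fully resolved meaning representations, used as a benchmark to evaluate a range of neural semantic parsers and expose their limitations in complex semantic reconstruction.
Link: https://arxiv.org/abs/2508.00121
Authors: Xiao Zhang,Johan Bos
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL)
Comments: Accepted by 16th IWCS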
Abstract:Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are the otherwise known as powerful semantic parsers able to deal with ellipsis or aren’t they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representation and used this as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed in the instances with ellipsis. Data augmentation
[NLP-56] FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality
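【Quick Read】: This paper addresses the unreliable quality of current long-form factuality evaluation benchmarks caused by the lack of human verification. Existing benchmarks often rely on automatic metrics or samples without rigorous human review, and may not reflect the true factual accuracy of model outputs. The key is FACTORY, a large-scale, human-verified prompt set: initial prompts are generated with a model-in-the-loop approach and then filtered and refined by humans so that they are fact-seeking, answerable, and unambiguous. Experiments show that FACTORY is markedly harder: approximately 40% of the claims made by top language models on this benchmark are not factual, compared with only 10% on other datasets, underscoring its reliability and its demand that models reason across long-tailed facts.
Link: https://arxiv.org/abs/2508.00109
Authors: Mingda Chen,Yang Li,Xilun Chen,Adina Williams,Gargi Ghosh,Scott Yih
Affiliations: Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: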
Abstract:Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.
[NLP-57] Semiotic Complexity and Its Epistemological Implications for Modeling Culture
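【Quick Read】: This paper addresses the unclear interpretation and epistemic distortions caused by under-theorized modeling methods in the computational humanities, focusing on improving translation consistency and interpretability in model building. The key is the concept of semiotic complexity, which measures the degree to which a text's meaning may vary across interpretive lenses. The paper argues that dominant modeling practices (especially around evaluation) often treat semiotically complex data as semiotically simple, producing subtle yet consequential translation errors that undermine the credibility of results. The authors further recommend that researchers systematically identify and account for semiotic complexity to strengthen epistemological rigor and interpretive transparency.
Link: https://arxiv.org/abs/2508.00095
Authors: Zachary K. Stine,James E. Deitrick
Affiliations: University of Central Arkansas
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Preprint. Manuscript currently under review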
Abstract:Greater theorizing of methods in the computational humanities is needed for epistemological and interpretive clarity, and therefore the maturation of the field. In this paper, we frame such modeling work as engaging in translation work from a cultural, linguistic domain into a computational, mathematical domain, and back again. Translators benefit from articulating the theory of their translation process, and so do computational humanists in their work – to ensure internal consistency, avoid subtle yet consequential translation errors, and facilitate interpretive transparency. Our contribution in this paper is to lay out a particularly consequential dimension of the lack of theorizing and the sorts of translation errors that emerge in our modeling practices as a result. Along these lines we introduce the idea of semiotic complexity as the degree to which the meaning of some text may vary across interpretive lenses, and make the case that dominant modeling practices – especially around evaluation – commit a translation error by treating semiotically complex data as semiotically simple when it seems epistemologically convenient by conferring superficial clarity. We then lay out several recommendations for researchers to better account for these epistemological issues in their own work.
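[NLP-58] Do LLMs produce texts with “human-like” lexical diversity?
【Quick Read】: This paper asks to what extent text generated by Large Language Models (LLMs) is truly human-like, examined through lexical diversity. The key is a systematic measurement and comparison of texts generated by four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) against texts written by L1 and L2 English writers (n = 240) at different education levels, across six lexical diversity dimensions (volume, abundance, variety-repetition, evenness, disparity, and dispersion). Using multivariate analysis of variance (MANOVA), one-way analysis of variance (ANOVA), and Support Vector Machine (SVM) classification, the study finds that LLM-generated texts differ significantly from human texts on every measure, and that newer versions (such as -4.5) are less human-like than older ones, showing that current LLMs do not yet match human writing in lexical diversity.
Link: https://arxiv.org/abs/2508.00086
Authors: Kelly Kendro,Jeffrey Maloney,Scott Jarvis
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 35 pages; includes abstract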
Abstract:The degree to which LLMs produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAS, and Support Vector Machines revealed that the LLM-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and -4.5 differing the most. Within these two groups, ChatGPT-4.5 demonstrated higher levels of lexical diversity despite producing fewer tokens. The human writers’ lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that LLMs do not produce human-like texts in relation to lexical diversity, and the newer LLMs produce less human-like texts than older models. We discuss the implications of these results for language pedagogy and related applications.
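[NLP-59] A Survey on Code Generation with LLM-based Agents
【Quick Read】: This paper systematically surveys research on code generation agents based on Large Language Models (LLMs), addressing the field's lack of a comprehensive survey, unclear technical taxonomy, scattered applications, and inconsistent evaluation standards. The key contributions are: first, tracing the development of code generation agents from the perspective of technical evolution; second, building a taxonomy of core techniques covering single-agent and multi-agent architectures; third, covering applications across the full Software Development Life Cycle (SDLC) and summarizing mainstream evaluation benchmarks and representative tools; and finally, identifying the main current challenges and proposing foundational, long-term research directions to move the field toward practical and systematic engineering.
Link: https://arxiv.org/abs/2508.00083
Authors: Yihong Dong,Xue Jiang,Jiaru Qian,Tian Wang,Kechi Zhang,Zhi Jin,Ge Li
Affiliations: Peking University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress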
Abstract:Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology’s developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.
[NLP-60] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
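【Quick Read】: This paper addresses the performance bottleneck of Large Language Models (LLMs) on physics problem solving, in particular insufficient accuracy on both mathematical and descriptive problems. The key is a multi-agent framework in which several smaller LLM agents cumulatively verify and correct an initially generated solution, yielding significant improvements on problems the models originally handle poorly. The authors also build a new evaluation benchmark, PhysicsEval, containing 19,609 problems sourced from physics textbooks, with correct solutions collected from physics forums and educational websites, providing a high-quality and diverse test set for the field.
Link: https://arxiv.org/abs/2508.00079
Authors: Oshayer Siddique,J. M Areeb Uzair Alam,Md Jobayer Rahman Rafy,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan
Affiliations: Islamic University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review, 18 pages, 4 figures, 7 tables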
Abstract:The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at this https URL.
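[NLP-61] GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries
【Quick Read】: This paper addresses the limited ability of Large Language Models (LLMs) to understand and call unfamiliar Python APIs in scientific computing, focusing on the reliability of generating executable code under zero-shot conditions. The key is two benchmark tasks built on structured, zero-shot prompts (no in-context examples): conversational data analysis with the ParShift library, and synthetic data generation and clustering with pyclugen and scikit-learn. Model outputs are evaluated quantitatively for functional correctness and prompt compliance, combined with qualitative error analysis, to assess mainstream LLMs systematically. The results expose current limitations of LLMs for end-to-end scientific automation and point to directions for improvement: better prompt design, more complete third-party library documentation, and stronger models.
Link: https://arxiv.org/abs/2508.00033
Authors: Nuno Fachada,Daniel Fernandes,Carlos M. Fernandes,Bruno D. Ferreira-Saraiva,João P. Matos-Carvalho
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: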
Abstract:Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generate correct, executable code, with GPT-4.1 standing out as the only model to always succeed in both tasks. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.
[NLP-62] Scalable Spectrum Availability Prediction using a Markov Chain Framework and ITU-R Propagation Models
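【Quick Read】: This paper addresses low spectrum utilization in cognitive radio networks. The core challenge is accurately predicting when and where the Primary User (PU) is idle, so that the Secondary User (SU) can access spectrum proactively and without interference. The key is a scalable spectrum availability prediction framework that combines a two-state Markov chain with high-fidelity ITU-R propagation models (Recommendations P.528 and P.2108): the Markov chain models the temporal dependence of primary user activity, while the propagation models use path loss and clutter effects to compute the interference level at the secondary user's location and decide whether spectrum is available. The method predicts availability across both time and space with good computational efficiency and adaptability, applies to various frequency bands and scenarios, and offers a feasible approach to real-time spectrum management.
Link: https://arxiv.org/abs/2508.00028
Authors: Abir Ray
Affiliations: Cornell University
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA)
Comments: 12 pages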
Abstract:Spectrum resources are often underutilized across time and space, motivating dynamic spectrum access strategies that allow secondary users to exploit unused frequencies. A key challenge is predicting when and where spectrum will be available (i.e., unused by primary licensed users) in order to enable proactive and interference-free access. This paper proposes a scalable framework for spectrum availability prediction that combines a two-state Markov chain model of primary user activity with high-fidelity propagation models from the ITU-R (specifically Recommendations P.528 and P.2108). The Markov chain captures temporal occupancy patterns, while the propagation models incorporate path loss and clutter effects to determine if primary signals exceed interference thresholds at secondary user locations. By integrating these components, the proposed method can predict spectrum opportunities both in time and space with improved accuracy. We develop the system model and algorithm for the approach, analyze its scalability and computational efficiency, and discuss assumptions, limitations, and potential applications. The framework is flexible and can be adapted to various frequency bands and scenarios. The results and analysis show that the proposed approach can effectively identify available spectrum with low computational cost, making it suitable for real-time spectrum management in cognitive radio networks and other dynamic spectrum sharing systems.
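The temporal half of the framework, a two-state Markov chain over primary-user occupancy, can be sketched as below. The transition probabilities, the decision rule in `is_available`, and the received-power numbers are hypothetical; in the paper's setting the received power would come from an ITU-R P.528 / P.2108 path-loss computation, which is not reproduced here.

```python
import numpy as np

IDLE, BUSY = 0, 1


def transition_matrix(p_idle_to_busy, p_busy_to_idle):
    """Row-stochastic transition matrix over the states (IDLE, BUSY)."""
    return np.array([
        [1.0 - p_idle_to_busy, p_idle_to_busy],
        [p_busy_to_idle, 1.0 - p_busy_to_idle],
    ])


def idle_probability(p, current_state, steps):
    """Probability that the channel is idle `steps` slots from now."""
    return float(np.linalg.matrix_power(p, steps)[current_state, IDLE])


def stationary_idle(p_idle_to_busy, p_busy_to_idle):
    """Long-run fraction of time the primary user is idle."""
    return p_busy_to_idle / (p_idle_to_busy + p_busy_to_idle)


def is_available(idle_prob, received_dbm, threshold_dbm, min_idle_prob=0.9):
    """Assumed rule: usable if the primary signal at the secondary user's location
    is below the interference threshold, or the channel is very likely idle."""
    return received_dbm < threshold_dbm or idle_prob >= min_idle_prob


p = transition_matrix(p_idle_to_busy=0.2, p_busy_to_idle=0.3)
for steps in (1, 5, 50):
    print(steps, round(idle_probability(p, BUSY, steps), 4))
print("stationary idle:", stationary_idle(0.2, 0.3))

# Placeholder received power; a real value would come from a propagation model.
print(is_available(idle_probability(p, BUSY, 5), received_dbm=-110.0, threshold_dbm=-100.0))
```

Because the chain has only two states, the n-step prediction is a small matrix power per channel, which is consistent with the low computational cost the abstract emphasizes.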
[NLP-63] NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
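[Quick read]: This paper addresses the long-standing neglect of Indonesia's many languages and writing systems in natural language processing (NLP): existing NLP technology is built mostly on romanized text and struggles with indigenous scripts. The key to the solution is NusaAksara, a public benchmark covering 8 scripts across 7 Indonesian languages (including low-resource ones), with text and image tasks such as image segmentation, optical character recognition (OCR), transliteration, translation, and language identification. The data is constructed by human experts through rigorous steps and includes the Lampung script, which Unicode does not support. Evaluation across a range of models (LLMs, VLMs, and task-specific systems) shows that mainstream NLP technology handles local scripts poorly, which underlines the need for benchmarks dedicated to indigenous writing systems.
Link: https://arxiv.org/abs/2502.18148
Authors: Muhammad Farid Adilazuarda,Musa Izzanardi Wijanarko,Lucky Susanto,Khumaisa Nur’aini,Derry Wijaya,Alham Fikri Aji
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: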
Abstract:Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia’s local scripts, with many achieving near-zero performance.
[NLP-64] ContestTrade: A Multi-Agent Trading System Based on Internal Contest Mechanism
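[Quick read]: This paper addresses the performance degradation of large language models (LLMs) in financial trading caused by their high sensitivity to market noise. The key to the solution is a multi-agent system with an internal contest mechanism, made up of a Data Team and a Research Team. The Data Team condenses massive market data into diversified text factors that fit the LLM's limited context; the Research Team makes multiple trading decisions in parallel using deep research methods. The core innovation is a real-time evaluation and ranking mechanism inside each team: agents are continuously scored and ranked on real market feedback, and only the outputs of the top-performing agents are adopted. This improves adaptation to dynamic environments and robustness to noise, and the system is reported to significantly outperform existing multi-agent systems and traditional quantitative investment methods.
Link: https://arxiv.org/abs/2508.00554
Authors: Li Zhao,Rui Sun,Zuoyou Jiang,Bo Yang,Yuxiao Bai,Mengting Chen,Xinyang Wang,Jing Li,Zuo Bai
Affiliations: unknown
Categories: Trading and Market Microstructure (q-fin.TR); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Comments: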
Abstract:In financial trading, large language model (LLM)-based agents demonstrate significant potential. However, the high sensitivity to market noise undermines the performance of LLM-based trading systems. To address this limitation, we propose a novel multi-agent system featuring an internal competitive mechanism inspired by modern corporate management structures. The system consists of two specialized teams: (1) Data Team - responsible for processing and condensing massive market data into diversified text factors, ensuring they fit the model’s constrained context. (2) Research Team - tasked with making parallelized multipath trading decisions based on deep research methods. The core innovation lies in implementing a real-time evaluation and ranking mechanism within each team, driven by authentic market feedback. Each agent’s performance undergoes continuous scoring and ranking, with only outputs from top-performing agents being adopted. The design enables the system to adaptively adjust to dynamic environment, enhances robustness against market noise and ultimately delivers superior trading performance. Experimental results demonstrate that our proposed system significantly outperforms prevailing multiagent systems and traditional quantitative investment methods across diverse evaluation metrics.
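The score-rank-select idea can be sketched in a few lines. This is only an illustration under assumptions: the agent names, the use of mean realized return as the score, and the fixed cut-off `k` are hypothetical and are not taken from the paper.

```python
def select_top_agents(scores: dict, k: int) -> list:
    """Rank agents by mean realized score and keep the k best.

    `scores` maps a (hypothetical) agent name to a list of per-period
    scores derived from market feedback; agents with no history are skipped.
    """
    mean_score = {name: sum(vals) / len(vals) for name, vals in scores.items() if vals}
    ranked = sorted(mean_score, key=mean_score.get, reverse=True)
    return ranked[:k]
```

For example, `select_top_agents({"a": [0.01, 0.03], "b": [-0.02, 0.0], "c": [0.05, 0.01]}, k=2)` keeps agents `c` and `a`, whose outputs alone would be passed downstream.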
Computer Vision
[CV-0] IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation ICCV2025
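[Quick read]: This paper addresses a key challenge in image-goal navigation: localizing the goal image in 3D space efficiently and accurately during exploration. Conventional methods rely on end-to-end reinforcement learning or use a topological graph or bird's-eye-view (BEV) map as memory, and cannot fully model the geometric relationship between the explored environment and the goal image. The core of the solution is the Incremental 3D Gaussian Localization (IGL-Nav) framework: it incrementally updates a renderable 3D Gaussian (3DGS) scene representation with feed-forward monocular prediction; it then coarsely localizes the goal through discrete space matching on geometric information, which is equivalent to an efficient 3D convolution; and once the agent is close to the goal, it solves for the fine target pose by optimization via differentiable rendering. The method substantially improves navigation accuracy and efficiency, and supports the free-view setting and deployment on real robots.
Link: https://arxiv.org/abs/2508.00823
Authors: Wenxuan Guo,Xiuwei Xu,Hang Yin,Ziwei Wang,Jianjiang Feng,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICCV 2025. Project page: this https URL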
Abstract:Visual navigation with an image as goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL learning or modular-based policy with topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera pose, directly leveraging 3DGS for image localization during agent exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on real-world robotic platform using a cellphone to capture goal image at arbitrary pose. Project page: this https URL.
[CV-1] Cross-Dataset Semantic Segmentation Performance Analysis: Unifying NIST Point Cloud City Datasets for 3D Deep Learning
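[Quick read]: This paper addresses unstable semantic segmentation performance on heterogeneously labeled point-cloud data in public safety applications such as pre-incident planning systems. The core challenges are inconsistent labels across data sources, class imbalance, and low recognition rates for small safety-critical features. The approach applies a graded labeling schema with the KPConv architecture and evaluates safety-relevant features with IoU metrics. It finds that geometrically large objects (e.g. stairs, windows) are segmented well, while small safety elements are hard to recognize because they have limited geometric distinction in lidar scans. The study concludes that standardized annotation protocols and improved labeling techniques are prerequisites for reliable point-cloud semantic segmentation, and suggests automated labeling and multi-dataset learning to deal with data heterogeneity.
Link: https://arxiv.org/abs/2508.00822
Authors: Alexander Nikitas Dimopoulos,Joseph Grasso
Affiliations: National Institute of Standards and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: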
Abstract:This study analyzes semantic segmentation performance across heterogeneously labeled point-cloud datasets relevant to public safety applications, including pre-incident planning systems derived from lidar scans. Using NIST’s Point Cloud City dataset (Enfield and Memphis collections), we investigate challenges in unifying differently labeled 3D data. Our methodology employs a graded schema with the KPConv architecture, evaluating performance through IoU metrics on safety-relevant features. Results indicate performance variability: geometrically large objects (e.g. stairs, windows) achieve higher segmentation performance, suggesting potential for navigational context, while smaller safety-critical features exhibit lower recognition rates. Performance is impacted by class imbalance and the limited geometric distinction of smaller objects in typical lidar scans, indicating limitations in detecting certain safety-relevant features using current point-cloud methods. Key identified challenges include insufficient labeled data, difficulties in unifying class labels across datasets, and the need for standardization. Potential directions include automated labeling and multi-dataset learning strategies. We conclude that reliable point-cloud semantic segmentation for public safety necessitates standardized annotation protocols and improved labeling techniques to address data heterogeneity and the detection of small, safety-critical elements.
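For readers unfamiliar with the metric, here is a minimal sketch of per-class IoU over point labels. The integer class encoding is an assumption for illustration; the paper's actual label schema is not given in the abstract.

```python
def per_class_iou(pred, target, num_classes):
    """Intersection over union per class for two equal-length label lists.

    Returns None for a class that appears in neither list, so that absent
    classes are not silently scored as 0.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious
```

Because IoU is computed per class, a rare class with few points (such as a small safety feature) can score poorly even when overall point accuracy is high, which is the class-imbalance effect the abstract describes.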
[CV-2] SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation
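[Quick read]: This paper addresses the shortcomings of current audio-driven video generation in content accuracy and spatial consistency. Existing models rely mainly on the semantic information in audio (such as the class of the sound source) and ignore its spatial attributes (such as source location and movement direction). The key to the solution is SpA2V, the first framework to explicitly exploit spatial auditory cues, which generates video in two stages. First, a state-of-the-art multimodal large language model (MLLM) extracts spatial and semantic cues from the audio to build an intermediate representation, the Video Scene Layout (VSL). Second, the VSL is injected as conditional guidance into pre-trained diffusion models, enabling layout-grounded video generation without fine-tuning. This design markedly improves how well the generated video matches the input audio, both semantically and spatially.
Link: https://arxiv.org/abs/2508.00782
Authors: Kien T. Pham,Yingqing He,Yazhou Xing,Qifeng Chen,Long Chen
Affiliations: Hong Kong University of Science and Technology
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: The 33rd ACM Multimedia Conference (MM '25)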
Abstract:Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audios to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). This serves as an intermediate representation to bridge the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audios.
[CV-3] Zero-Shot Anomaly Detection with Dual-Branch Prompt Learning BMVC2025
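[Quick read]: This paper addresses the performance drop of zero-shot anomaly detection (ZSAD) under domain shift: existing methods are trained on data from a limited set of domains and generalize poorly to new distributions. The key to the solution is the PILOT framework, which has two core innovations: (1) a dual-branch prompt learning mechanism that dynamically combines a pool of learnable prompts with structured semantic attributes, so the model can adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters with high-confidence pseudo-labels from unlabeled test data, strengthening adaptation to new domains.
Link: https://arxiv.org/abs/2508.00777
Authors: Zihan Wang,Samira Ebrahimi Kahou,Narges Armanfard
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2025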
Abstract:Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories by relying solely on generalizable features rather than requiring any labeled examples of anomalies. However, existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains and fail to generalize to new distributions. In this paper, we introduce PILOT, a framework designed to overcome these challenges through two key innovations: (1) a novel dual-branch prompt learning mechanism that dynamically integrates a pool of learnable prompts with structured semantic attributes, enabling the model to adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters using high-confidence pseudo-labels from unlabeled test data. Extensive experiments on 13 industrial and medical benchmarks demonstrate that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift.
[CV-4] Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation
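[Quick read]: This paper addresses the poor handling of out-of-distribution samples by image-to-image translation in medical imaging: conventional methods degrade on such samples and cannot adjust differently for different test samples. The key to the solution is a new Test-Time Adaptation (TTA) framework with two components: a Reconstruction Module that quantifies the domain shift of each test sample, and a Dynamic Adaptation Block that selectively modifies the internal features of the pretrained model according to the sample, mitigating the shift without harming performance on in-distribution samples. The method adapts per sample and clearly outperforms conventional TTA strategies that apply the same adaptation to every sample.
Link: https://arxiv.org/abs/2508.00766
Authors: Irene Iele,Francesco Di Feola,Valerio Guarrasi,Paolo Soda
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: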
Abstract:Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it suffers from limitations in handling out-of-distribution samples without causing performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of the state-of-the-art that uniformly apply the adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: this https URL.
[CV-5] SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Super-Resolution of Satellite and Drone Imagery with Fine-Tuning for Cross Domain Evaluation
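[Quick read]: This paper addresses the lack of semantic consistency and per-pixel confidence when Generative Adversarial Networks (GANs) are used for super-resolution (SR) of remote sensing imagery, which limits their credibility in critical settings such as disaster response, urban planning, and agriculture. The key to the solution is Semantic and Uncertainty-Aware ESRGAN (SU-ESRGAN), presented as the first SR framework for satellite imagery to combine ESRGAN with a DeepLabv3 segmentation loss, which preserves class detail, and Monte Carlo dropout, which produces pixel-wise uncertainty maps. Image quality (PSNR, SSIM, LPIPS) stays comparable to the baseline ESRGAN while interpretability and robustness improve. The model is aimed in particular at UAV and satellite systems whose wide field-of-view cameras trade spatial resolution for coverage.
Link: https://arxiv.org/abs/2508.00750
Authors: Prerana Ramkumar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: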
Abstract:Generative Adversarial Networks (GANs) have achieved realistic super-resolution (SR) of images; however, they lack semantic consistency and per-pixel confidence, limiting their credibility in critical remote sensing applications such as disaster response, urban planning and agriculture. This paper introduces Semantic and Uncertainty-Aware ESRGAN (SU-ESRGAN), the first SR framework designed for satellite imagery to integrate the ESRGAN, segmentation loss via DeepLabv3 for class detail preservation and Monte Carlo dropout to produce pixel-wise uncertainty maps. The SU-ESRGAN produces results (PSNR, SSIM, LPIPS) comparable to the Baseline ESRGAN on aerial imagery. This novel model is valuable in satellite systems or UAVs that use wide field-of-view (FoV) cameras, trading off spatial resolution for coverage. The modular design allows integration in UAV data pipelines for on-board or post-processing SR to enhance imagery degraded by motion blur, compression and sensor limitations. Further, the model is fine-tuned to evaluate its performance on cross domain applications. The tests are conducted on two drone based datasets which differ in altitude and imaging perspective. Performance evaluation of the fine-tuned models shows a stronger adaptation to the Aerial Maritime Drone Dataset, whose imaging characteristics align with the training data, highlighting the importance of domain-aware training in SR-applications.
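Monte Carlo dropout typically means running several stochastic forward passes with dropout kept active and summarizing them per pixel. The sketch below shows only that aggregation step on flat lists of pixel values; the passes themselves are a stand-in for a real network, and the paper's exact uncertainty measure is not specified in the abstract.

```python
def mc_uncertainty(samples):
    """Per-pixel mean and variance over T stochastic forward passes.

    `samples` is a list of T outputs, each a flat list of pixel values
    produced with dropout left on at inference time (assumed, not shown).
    """
    t = len(samples)
    n = len(samples[0])
    mean = [sum(s[i] for s in samples) / t for i in range(n)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / t for i in range(n)]
    return mean, var
```

The mean serves as the super-resolved estimate and the variance as the uncertainty map: pixels where the passes disagree get high variance.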
[CV-6] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos
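[Quick read]: This paper addresses identity verification through behavioral biometrics in communication mediated by talking-head avatars, where an attacker has stolen a user's appearance and voice. The core question is whether an individual's distinctive facial motion patterns can serve as a reliable behavioral biometric when visual and auditory recognition fail. The key to the solution is a lightweight, explainable spatio-temporal Graph Convolutional Network with temporal attention pooling that models dynamic facial gestures from facial landmarks alone. It is validated on a new dataset of realistic avatar videos built by the authors, where facial motion cues alone reach AUC values approaching 80%, pointing to a new line of defense for avatar-mediated communication.
Link: https://arxiv.org/abs/2508.00748
Authors: Laura Pedrouzo-Rodriguez,Pedro Delgado-DeRobles,Luis F. Gomez,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez
Affiliations: Universidad Autonoma de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments: Accepted at the IEEE International Joint Conference on Biometrics (IJCB 2025)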
Abstract:Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user’s avatar, preserving their appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual’s facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar’s visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling, that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
[CV-7] GECO: Geometrically Consistent Embedding with Lightspeed Inference
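[Quick read]: This paper addresses the fact that self-supervised vision foundation models capture semantic correspondences well but lack awareness of the underlying 3D geometry. The core of the solution is GECO (Geometrically Consistent Embedding), a training framework based on optimal transport that learns geometrically coherent features with supervision beyond keypoint annotations and stays robust under occlusions and disocclusions. With a lightweight architecture it runs at 30 fps, 98.2% faster than prior methods, and improves PCK on PFPascal, APK, and CUB by 6.0%, 6.2%, and 4.1%, respectively. The paper also argues that PCK alone is insufficient to measure geometric quality and introduces new metrics to encourage more geometry-aware feature learning.
Link: https://arxiv.org/abs/2508.00746
Authors: Regine Hartwig,Dominik Muhle,Riccardo Marin,Daniel Cremers
Affiliations: TU Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: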
Abstract:Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs). We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning. Link to project page: this https URL
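As background for the PCK numbers, here is a minimal sketch of the Percentage of Correct Keypoints metric. It assumes the common convention of a threshold equal to `alpha` times the larger side of a reference box; the benchmarks named above may use image size or a different `alpha`.

```python
import math


def pck(pred, gt, ref_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(ref_size) of ground truth.

    `pred` and `gt` are equal-length lists of (x, y) points; `ref_size` is
    the (width, height) of the reference box or image (an assumed convention).
    """
    threshold = alpha * max(ref_size)
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(gt)
```

Because a match only has to land within the threshold radius, PCK by itself does not penalize a feature that confuses, say, a left eye with a nearby right eye, which is one way to read the abstract's remark that PCK alone is insufficient.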
[CV-8] Rethinking Backbone Design for Lightweight 3D Object Detection in LiDAR ICCV2025
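[Quick read]: This paper addresses the high computational cost and low efficiency of current point-cloud 3D object detectors, which mostly rely on complex backbones such as VGG or ResNet. Lightweight backbone design is mature for 2D object detection but has received little systematic study in 3D. The key to the solution is Dense Backbone, a lightweight backbone that extracts features efficiently through dense connections, keeping detection accuracy high while substantially reducing parameter count and inference latency. When it replaces the backbone of detectors such as PillarNet, the resulting model achieves 29% fewer parameters and 28% lower latency on the nuScenes test set with only a 2% drop in detection accuracy, and it is plug-and-play, requiring no changes to other network components.
Link: https://arxiv.org/abs/2508.00744
Authors: Adwait Chandorkar,Hasan Tercan,Tobias Meisen
Affiliations: Institute for TMDT, University of Wuppertal, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at the Embedded Vision Workshop ICCV 2025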
Abstract:Recent advancements in LiDAR-based 3D object detection have significantly accelerated progress toward the realization of fully autonomous driving in real-world environments. Despite achieving high detection performance, most of the approaches still rely on a VGG-based or ResNet-based backbone for feature exploration, which increases the model complexity. Lightweight backbone design is well-explored for 2D object detection, but research on 3D object detection still remains limited. In this work, we introduce Dense Backbone, a lightweight backbone that combines the benefits of high processing speed, lightweight architecture, and robust detection accuracy. We adapt multiple SoTA 3D object detectors, such as PillarNet, with our backbone and show that with our backbone, these models retain most of their detection capability at a significantly reduced computational cost. To our knowledge, this is the first dense-layer-based backbone tailored specifically for 3D object detection from point cloud data. DensePillarNet, our adaptation of PillarNet, achieves a 29% reduction in model parameters and a 28% reduction in latency with just a 2% drop in detection accuracy on the nuScenes test set. Furthermore, Dense Backbone’s plug-and-play design allows straightforward integration into existing architectures, requiring no modifications to other network components.
[CV-9] AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio Speech and Song Generation
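[Quick read]: This paper addresses insufficient semantic consistency, acoustic diversity, and cross-modal synchronization accuracy in multimodal audio generation, especially for high-quality video-driven speech and song generation. The core solution is AudioGen-Omni, a unified architecture based on multimodal diffusion transformers (MMDit). It uses a joint training paradigm that integrates large-scale video-text-audio corpora, and a unified lyrics-transcription encoder that maps graphemes and phonemes to dense frame-level representations. These are fused by an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), in which RoPE is applied selectively to temporally structured modalities to improve cross-modal alignment. By unfreezing all modalities and masking missing inputs, it avoids the semantic constraints of text-frozen paradigms, improving audio quality, semantic alignment, and lip-sync accuracy. It reaches state-of-the-art results on Text-to-Audio/Speech/Song tasks, with an inference time of 1.91 seconds for 8 seconds of audio.
Link: https://arxiv.org/abs/2508.00733
Authors: Le Wang,Jun Wang,Feng Deng,Chen Zhang,Kun Gai,Di Zhang
Affiliations: China University of Mining and Technology; Kuaishou Technology
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 12 pages, 2 figures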
Abstract:We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
[CV-10] YOLO-Count: Differentiable Object Counting for Text-to-Image Generation ICCV2025
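[Quick read]: This paper tackles general object counting and precise control of object quantity in text-to-image (T2I) generation. The key to the solution is YOLO-Count, a differentiable open-vocabulary counting model. Its core innovation is the "cardinality map", a new regression target that accounts for variations in object size and spatial distribution. Combined with representation alignment and a hybrid strong-weak supervision scheme, the model keeps counting accuracy high while giving generative models fine-grained quantity guidance, enabling effective control over the number of objects in T2I systems.
Link: https://arxiv.org/abs/2508.00728
Authors: Guanning Zeng,Xiang Zhang,Zirui Wang,Haiyang Xu,Zeyuan Chen,Bingnan Li,Zhuowen Tu
Affiliations: Tsinghua University; UC San Diego; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025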
Abstract:We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the ‘cardinality’ map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.
[CV-11] MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models ACM-MM25
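[Quick read]: This paper addresses object-related hallucination in Multimodal Large Language Models (MLLMs) in multi-image scenarios, an area that remains largely unexplored. Existing work focuses on single-image settings; this paper presents the first systematic study of hallucination with multi-image input and proposes the MIHBench benchmark, which evaluates three hallucination tasks: object existence, object count, and cross-view identity consistency. The key to the solution is a Dynamic Attention Balancing mechanism that adjusts the attention distribution across images while preserving the overall proportion of visual attention, which reduces hallucination in multi-image scenarios and improves semantic integration and reasoning stability.
Link: https://arxiv.org/abs/2508.00726
Authors: Jiale Li,Mingrui Wu,Zixiang Jin,Hao Chen,Jiayi Ji,Xiaoshuai Sun,Liujuan Cao,Rongrong Ji
Affiliations: Xiamen University; Zhongguancun Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM MM25 has accepted this paper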
Abstract:Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
[CV-12] D3: Training-Free AI-Generated Video Detection Using Second-Order Features
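[Quick read]: This paper addresses the limitations of current AI-generated video detection methods, in particular their insufficient exploration of temporal artifacts. The key to the solution is a theoretical framework based on second-order dynamical analysis under Newtonian mechanics, from which the authors derive Second-order Central Difference features suited to temporal artifact detection. On top of this they design D3 (Detection by Difference of Differences), a training-free detector that exploits the fundamental difference in second-order feature distributions between real and AI-generated videos. Experiments show that D3 clearly outperforms existing methods on several public datasets while being computationally efficient and robust.
Link: https://arxiv.org/abs/2508.00701
Authors: Chende Zheng,Ruiqi suo,Chenhao Lin,Zhengyu Zhao,Le Yang,Shuai Liu,Minghui Yang,Cong Wang,Chao Shen
Affiliations: Xi’an Jiaotong University; Guangdong OPPO Mobile Communications Co., Ltd.; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures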
Abstract:The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3’s exceptional computational efficiency and strong robust performance. Our code is available at this https URL.
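The phrase "difference of differences" refers to a second-order central difference along the time axis, f[t+1] - 2 f[t] + f[t-1]. The sketch below applies that generic finite-difference formula to per-frame feature vectors; which features D3 actually uses and how it turns them into a decision are not described in the abstract, so this is not the authors' implementation.

```python
def second_order_central_difference(frames):
    """Second-order central difference along time for per-frame feature vectors.

    `frames` is a list of equal-length lists (one hypothetical feature
    vector per frame). Returns len(frames) - 2 vectors, one for each
    interior time step t: frames[t+1] - 2 * frames[t] + frames[t-1].
    """
    return [
        [nxt - 2.0 * cur + prev for prev, cur, nxt in zip(frames[t - 1], frames[t], frames[t + 1])]
        for t in range(1, len(frames) - 1)
    ]
```

A feature that changes at a constant rate gives zeros, while one that accelerates gives non-zero values: `[[0], [1], [4], [9]]` yields `[[2.0], [2.0]]`. This is the discrete analogue of acceleration in the Newtonian framing.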
[CV-13] Can Large Pretrained Depth Estimation Models Help With Image Dehazing? AAAI2026
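[Quick read]: This paper addresses image dehazing in real-world scenes, where the spatially varying nature of haze is the main challenge. Existing methods have made progress with large-scale pretrained models, but their architecture-specific designs limit adaptability across scenarios with different accuracy and efficiency requirements. The key to the solution is a systematic study of how well pretrained depth representations generalize to dehazing, which finds that the learned depth features stay highly consistent across haze levels. Based on this insight, the authors propose a plug-and-play RGB-D fusion module that integrates with a variety of dehazing architectures, improving generality and effectiveness.
Link: https://arxiv.org/abs/2508.00698
Authors: Hongfei Zhang,Kun Zhou,Ruizheng Wu,Jiangbo Lu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to AAAI2026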
Abstract:Image dehazing remains a challenging problem due to the spatially varying nature of haze in real-world scenes. While existing methods have demonstrated the promise of large-scale pretrained models for image dehazing, their architecture-specific designs hinder adaptability across diverse scenarios with different accuracy and efficiency requirements. In this work, we systematically investigate the generalization capability of pretrained depth representations, learned from millions of diverse images, for image dehazing. Our empirical analysis reveals that the learned deep depth features maintain remarkable consistency across varying haze levels. Building on this insight, we propose a plug-and-play RGB-D fusion module that seamlessly integrates with diverse dehazing architectures. Extensive experiments across multiple benchmarks validate both the effectiveness and broad applicability of our approach.
[CV-14] On-Device Diffusion Transformer Policy for Efficient Robot Manipulation ICCV2025
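[Quick read]: This paper tackles the real-time deployment of Diffusion Policies on resource-constrained mobile platforms, where the core obstacles are low computational efficiency and a large memory footprint. The key to the solution is the LightDP framework, which accelerates inference through two strategies. First, it compresses the denoising modules with a unified pruning and retraining pipeline that improves the model's post-pruning recoverability and avoids the performance drop typical of conventional pruning. Second, it applies consistency distillation to reduce the number of sampling steps, substantially lowering inference latency while maintaining action prediction accuracy. Experiments show that LightDP achieves real-time action prediction on mobile devices with performance comparable to state-of-the-art Diffusion Policies.
Link: https://arxiv.org/abs/2508.00697
Authors: Yiming Wu,Huan Wang,Zhenghao Chen,Jianxin Pang,Dong Xu
Affiliations: The University of Hong Kong; Westlake University; University of Newcastle; UBTech Robotics Corp.
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025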
Abstract:Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning, but their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and extensive memory footprint. In this paper, we propose LightDP, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices. LightDP addresses the computational bottleneck through two core strategies: network compression of the denoising modules and reduction of the required sampling steps. We first conduct an extensive computational analysis on existing Diffusion Policy architectures, identifying the denoising network as the primary contributor to latency. To overcome performance degradation typically associated with conventional pruning methods, we introduce a unified pruning and retraining pipeline, optimizing the model’s post-pruning recoverability explicitly. Furthermore, we combine pruning techniques with consistency distillation to effectively reduce sampling steps while maintaining action prediction accuracy. Experimental evaluations on the standard datasets, i.e., PushT, Robomimic, CALVIN, and LIBERO, demonstrate that LightDP achieves real-time action prediction on mobile devices with competitive performance, marking an important step toward practical deployment of diffusion-based policies in resource-limited environments. Extensive real-world experiments also show the proposed LightDP can achieve performance comparable to state-of-the-art Diffusion Policies.
[CV-15] Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation Large-Scale Dataset and New Insights
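[Quick read]: This paper addresses the lack of a unified, comprehensive framework for evaluating defenses against adversarial patch attacks on object detectors, which leads to inconsistent and incomplete assessments of existing defenses. The key to the solution is the first patch defense benchmark, covering 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics, together with a large-scale adversarial patch dataset of 94 patch types and 94,000 images. The systematic analysis shows that the difficulty of defending against naturalistic patches stems from the data distribution rather than high frequencies, proposes the average precision (AP) of the attacked object as a more reliable indicator of defense performance, and finds that defenses with complex or stochastic models or universal patch properties are relatively robust to adaptive attacks, offering guidance for evaluating and designing patch attacks and defenses.
Link: https://arxiv.org/abs/2508.00649
Authors: Junhao Zheng,Jiahao Sun,Chenhao Lin,Zhengyu Zhao,Chen Ma,Chong Zhang,Cong Wang,Qian Wang,Chao Shen
Affiliations: Xi’an Jiaotong University; City University of Hong Kong; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: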
Abstract:Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at this https URL, where we will keep integrating new attacks/defenses.
[CV-16] Minimum Data Maximum Impact: 20 annotated samples for explainable lung nodule classification MICCAI2025
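[Quick read]: This paper addresses the limited adoption of explainable models in medical image diagnosis caused by the scarcity of large-scale datasets annotated with pathology-related visual attributes. The key to the solution is to add attribute conditioning to a diffusion model and train it with only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset, thereby synthesizing attribute-annotated data. Adding the synthetic data to the training of an explainable model raises attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8%, demonstrating that synthetic data can help overcome the scarcity of real data.
Link: https://arxiv.org/abs/2508.00639
Authors: Luisa Gallée,Catharina Silvia Lisson,Christoph Gerhard Lisson,Daniela Drees,Felix Weig,Daniel Vogele,Meinrad Beer,Michael Götz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at iMIMIC - Interpretability of Machine Intelligence in Medical Image Computing workshop MICCAI 2025 Medical Image Computing and Computer Assisted Intervention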
Abstract:Classification models that provide human-interpretable explanations enhance clinicians’ trust and usability in medical image diagnosis. One research focus is the integration and prediction of pathology-related visual attributes used by radiologists alongside the diagnosis, aligning AI decision-making with clinical reasoning. Radiologists use attributes like shape and texture as established diagnostic criteria and mirroring these in AI decision-making both enhances transparency and enables explicit validation of model outputs. However, the adoption of such models is limited by the scarcity of large-scale medical image datasets annotated with these attributes. To address this challenge, we propose synthesizing attribute-annotated data using a generative model. We enhance the Diffusion Model with attribute conditioning and train it using only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset. Incorporating its generated images into the training of an explainable model boosts performance, increasing attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8% compared to training with only the small real attribute-annotated dataset. This work highlights the potential of synthetic data to overcome dataset limitations, enhancing the applicability of explainable models in medical image analysis.
[CV-17] Backdoor Attacks on Deep Learning Face Detection
[Quick read]: This paper studies the vulnerability of the Face Detection module in Face Recognition Systems that operate in unconstrained environments. Because such systems capture images under inconsistent lighting and diverse face poses, they rely on a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. The paper shows the effectiveness of Object Generation Attacks on Face Detection (dubbed Face Generation Attacks) and demonstrates for the first time a Landmark Shift Attack, which backdoors the landmark coordinate regression task and thereby undermines Face Alignment. The authors then propose mitigations against these backdoor vulnerabilities.
Link: https://arxiv.org/abs/2508.00620
Authors: Quentin Le Roux,Yannick Teglia,Teddy Furon,Philippe Loubet-Moundi
Affiliations: Thales Cyber & Digital; Inria/CNRS/IRISA/Univ. de Rennes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Face Recognition Systems that operate in unconstrained environments capture images under varying conditions, such as inconsistent lighting or diverse face poses. These challenges require including a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. This paper shows the effectiveness of Object Generation Attacks on Face Detection, dubbed Face Generation Attacks, and demonstrates for the first time a Landmark Shift Attack that backdoors the coordinate regression task performed by face detectors. We then offer mitigations against these vulnerabilities.
[CV-18] DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior ICCV2025
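[Quick read]: This paper addresses the difficulty of building a prior model for 3D whole-body human poses, which stems from the complexity of articulated poses and the scarcity of high-quality whole-body pose data. The core solution is DPoser-X, a diffusion-based framework that unifies various pose-centric tasks as inverse problems and solves them through variational diffusion sampling. The key innovations are a truncated timestep scheduling method tailored to pose data and a masked training mechanism that combines whole-body and part-specific datasets, capturing interdependencies between body parts while avoiding overfitting to specific actions. The result is improved robustness and generalization across benchmarks for body, hand, face, and whole-body pose modeling.
Link: https://arxiv.org/abs/2508.00599
Authors: Junzhe Lu,Jing Lin,Hongkun Dou,Ailing Zeng,Yue Deng,Xian Liu,Zhongang Cai,Lei Yang,Yulun Zhang,Haoqian Wang,Ziwei Liu
Affiliations: Tsinghua University; Nanyang Technological University; Beihang University; Anuttacon; NVIDIA Research; SenseTime Research; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 (oral); Code released: this https URL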
Abstract:We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X’s robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
[CV-19] GeoMoE: Divide-and-Conquer Motion Field Modeling with Mixture-of-Experts for Two-View Geometry
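[Quick read]: This paper addresses motion field estimation for two-view geometry in complex real-world scenes, where extreme viewpoint and scale changes and pronounced depth discontinuities produce diverse, heterogeneous motion patterns. Existing methods lack targeted modeling strategies and do not explicitly account for this variability, so the estimated motion fields deviate from their true structure and distribution. The key to the solution is the GeoMoE framework. A Probabilistic Prior-Guided Decomposition strategy uses inlier probability signals to split the motion field into sub-fields in a structure-aware way, curbing outlier-induced bias. An MoE-Enhanced Bi-Path Rectifier then enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert, which decouples heterogeneous motion patterns, suppresses cross-sub-field interference and representational entanglement, and yields fine-grained motion field rectification.
Link: https://arxiv.org/abs/2508.00592
Authors: Jiajun Le,Jiayi Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: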
Abstract:Recent progress in two-view geometry increasingly emphasizes enforcing smoothness and global consistency priors when estimating motion fields between pairs of images. However, in complex real-world scenes, characterized by extreme viewpoint and scale changes as well as pronounced depth discontinuities, the motion field often exhibits diverse and heterogeneous motion patterns. Most existing methods lack targeted modeling strategies and fail to explicitly account for this variability, resulting in estimated motion fields that diverge from their true underlying structure and distribution. We observe that Mixture-of-Experts (MoE) can assign dedicated experts to motion sub-fields, enabling a divide-and-conquer strategy for heterogeneous motion patterns. Building on this insight, we re-architect motion field modeling in two-view geometry with GeoMoE, a streamlined framework. Specifically, we first devise a Probabilistic Prior-Guided Decomposition strategy that exploits inlier probability signals to perform a structure-aware decomposition of the motion field into heterogeneous sub-fields, sharply curbing outlier-induced bias. Next, we introduce an MoE-Enhanced Bi-Path Rectifier that enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert for targeted modeling, thereby decoupling heterogeneous motion regimes, suppressing cross-sub-field interference and representational entanglement, and yielding fine-grained motion-field rectification. With this minimalist design, GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization. The source code and pre-trained models are available at this https URL.
[CV-20] Wukong Framework for Not Safe For Work Detection in Text-to-Image systems
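[Quick read]: This paper addresses the risk that Text-to-Image (T2I) generation models output Not Safe For Work (NSFW) content, focusing on the efficiency and accuracy of external safeguarding. Among existing approaches, text filters are prone to adversarial attacks and ignore model-specific variations, while image filters are accurate but computationally costly and add latency. The key to the solution is the Wukong framework, which uses intermediate outputs from the early denoising steps of the diffusion model and reuses the pre-trained cross-attention parameters of the U-Net to detect NSFW content during generation, achieving accuracy comparable to image filters with much greater efficiency.
Link: https://arxiv.org/abs/2508.00591
Authors: Mingrui Liu,Sixiao Zhang,Cheng Long
Affiliations: Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Under review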
Abstract:Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology enabling diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence), violating community guidelines. Detecting NSFW content efficiently and accurately, known as external safeguarding, is essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems like Stable Diffusion, generate images through iterative denoising using a U-Net architecture with ResNet and Transformer blocks. We observe that: (1) early denoising steps define the semantic layout of the image, and (2) cross-attention layers in U-Net are crucial for aligning text and image regions. Based on these insights, we propose Wukong, a transformer-based NSFW detection framework that leverages intermediate outputs from early denoising steps and reuses U-Net’s pre-trained cross-attention parameters. Wukong operates within the diffusion process, enabling early detection without waiting for full image generation. We also introduce a new dataset containing prompts, seeds, and image-specific NSFW labels, and evaluate Wukong on this and two public benchmarks. Results show that Wukong significantly outperforms text-based safeguards and achieves accuracy comparable to image filters, while offering much greater efficiency.
[CV-21] A Novel Modeling Framework and Data Product for Extended VIIRS-like Artificial Nighttime Light Image Reconstruction (1986-2024)
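[Quick read]: This paper addresses the short temporal coverage of Artificial Night-Time Light (NTL) remote sensing data from the NPP-VIIRS sensor, which begins only in 2012, and two key shortcomings of current methods for extending VIIRS-like NTL time series: underestimation of light intensity and structural omission. The core of the solution is a two-stage reconstruction framework of construction and refinement. The construction stage uses a Hierarchical Fusion Decoder (HFD) to improve the fidelity of the initial reconstruction; the refinement stage uses a Dual Feature Refiner (DFR), guided by high-resolution impervious surface masks, to enhance fine-grained structural details. The framework yields the Extended VIIRS-like Artificial Nighttime Light (EVAL) product for China, which extends the standard data record backwards by 26 years to begin in 1986, outperforms existing products in accuracy, and shows excellent temporal consistency, providing reliable data for long-term studies of human activity.
Link: https://arxiv.org/abs/2508.00590
Authors: Yihe Tian,Kwan Man Cheng,Zhengbo Zhang,Tao Zhang,Suju Li,Dongmei Yan,Bing Xu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: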
Abstract:Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite the progress in extending VIIRS-like NTL time-series, current methods still suffer from two significant shortcomings: the underestimation of light intensity and the structural omission. To overcome these limitations, we propose a novel reconstruction framework consisting of a two-stage process: construction and refinement. The construction stage features a Hierarchical Fusion Decoder (HFD) designed to enhance the fidelity of the initial reconstruction. The refinement stage employs a Dual Feature Refiner (DFR), which leverages high-resolution impervious surface masks to guide and enhance fine-grained structural details. Based on this framework, we developed the Extended VIIRS-like Artificial Nighttime Light (EVAL) product for China, extending the standard data record backwards by 26 years to begin in 1986. Quantitative evaluation shows that EVAL significantly outperforms existing state-of-the-art products, boosting the R² from 0.68 to 0.80 while lowering the RMSE from 1.27 to 0.99. Furthermore, EVAL exhibits excellent temporal consistency and maintains a high correlation with socioeconomic parameters, confirming its reliability for long-term analysis. The resulting EVAL dataset provides a valuable new resource for the research community and is publicly available at this https URL.
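For readers unfamiliar with the two metrics quoted above, the following sketch computes them with their common textbook definitions. It assumes R² is the coefficient of determination (the paper may instead use a squared correlation coefficient), and the function name is hypothetical.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination and root-mean-square error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residual ** 2))
    return r2, rmse

# A prediction that is off by a constant 1 on targets [1, 2, 3, 4].
r2_shifted, rmse_shifted = r2_and_rmse([1, 2, 3, 4], [2, 3, 4, 5])
```

A higher R² and a lower RMSE both indicate that the reconstructed radiance values track the reference observations more closely.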
[CV-22] Uncertainty-Aware Likelihood Ratio Estimation for Pixel-Wise Out-of-Distribution Detection ICCV
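[Quick read]: This paper addresses the tendency of semantic segmentation models to misclassify unknown objects in real-world autonomous driving scenarios, and in particular the confusion of rare classes with truly unknown objects in complex scenes. The core solution is an uncertainty-aware likelihood ratio estimation method that places an evidential classifier inside a likelihood ratio test to distinguish known from unknown pixel features while explicitly accounting for uncertainty. Instead of point estimates, the method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers, so that outlier exposure can be leveraged more effectively. It achieves the lowest average false positive rate (2.5%) while maintaining high average precision (90.91%), with negligible computational overhead.
Link: https://arxiv.org/abs/2508.00587
Authors: Marc Hölle,Walter Kellermann,Vasileios Belagiannis
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCVW 2025, 11 pages, 4 figures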
Abstract:Semantic segmentation models trained on known object classes often fail in real-world autonomous driving scenarios by confidently misclassifying unknown objects. While pixel-wise out-of-distribution detection can identify unknown objects, existing methods struggle in complex scenes where rare object classes are often confused with truly unknown objects. We introduce an uncertainty-aware likelihood ratio estimation method that addresses these limitations. Our approach uses an evidential classifier within a likelihood ratio test to distinguish between known and unknown pixel features from a semantic segmentation model, while explicitly accounting for uncertainty. Instead of producing point estimates, our method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers. We show that by incorporating uncertainty in this way, outlier exposure can be leveraged more effectively. Evaluated on five standard benchmark datasets, our method achieves the lowest average false positive rate (2.5%) among state-of-the-art while maintaining high average precision (90.91%) and incurring only negligible computational overhead. Code is available at this https URL.
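To make "outputs probability distributions instead of point estimates" more concrete, here is a minimal sketch of the standard Dirichlet-based evidential classification summary. It assumes the usual convention alpha = evidence + 1; the function name is hypothetical, and how the paper embeds such a classifier in its likelihood ratio test is not reproduced here.

```python
import numpy as np

def dirichlet_summary(evidence):
    """Summarize non-negative per-class evidence as a Dirichlet distribution.

    Returns the expected class probabilities and a vacuity score in (0, 1]
    that is close to 1 when there is little evidence for any class.
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0                        # Dirichlet concentration
    strength = alpha.sum(axis=-1, keepdims=True)  # total concentration
    expected_prob = alpha / strength
    vacuity = evidence.shape[-1] / strength[..., 0]
    return expected_prob, vacuity

# No evidence at all versus strong evidence for the first of two classes.
prob_none, vacuity_none = dirichlet_summary([0.0, 0.0])
prob_strong, vacuity_strong = dirichlet_summary([98.0, 0.0])
```

Two pixels can share the same expected probabilities yet differ in vacuity, which is the kind of distinction a point estimate cannot express.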
[CV-23] CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry
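[Quick read]: This paper addresses pose estimation errors in unsupervised Visual Odometry (VO) that arise when dynamic objects violate the static scene assumption. Traditional methods model uncertainty from single-frame information and overlook uncertainty across consecutive frames, which makes it hard to identify dynamic objects and occluded regions. The key to the solution is a cross-frame uncertainty propagation mechanism that combines the uncertainty of the target frame with that of the projected reference frame in a probabilistic formulation, producing more robust masks for filtering out dynamic regions. The method, named Combined Projected Uncertainty VO (CoProU-VO), propagates and combines uncertainty over time to improve pose estimation accuracy in complex dynamic scenes.
Link: https://arxiv.org/abs/2508.00568
Authors: Jingchao Xie,Oussema Dhaouadi,Weirong Chen,Johannes Meier,Jacques Kaiser,Daniel Cremers
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for GCPR 2025. Project page: this https URL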
Abstract:Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Consequently, experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.
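One simple way that target and projected reference uncertainties could be combined is sketched below. It assumes independent Gaussian noise on the two frames, so the variance of their photometric residual is the sum of the two variances, and it uses a Gaussian negative log-likelihood as the loss. Both the function name and these modeling choices are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def combined_uncertainty_loss(target, projected_ref, sigma_target, sigma_projected):
    """Uncertainty-weighted photometric loss under an independence assumption.

    All inputs are arrays of the same shape (per-pixel values and std devs).
    Returns the mean Gaussian negative log-likelihood and the combined variance.
    """
    var = sigma_target ** 2 + sigma_projected ** 2   # variances add if independent
    residual = target - projected_ref
    nll = 0.5 * (residual ** 2 / var + np.log(var))  # constant term omitted
    return nll.mean(), var

# Two pixels with the same residual but very different combined uncertainty.
loss, var = combined_uncertainty_loss(
    target=np.array([1.0, 1.0]),
    projected_ref=np.array([0.0, 0.0]),
    sigma_target=np.array([1.0, 3.0]),
    sigma_projected=np.array([0.0, 4.0]),
)
```

Pixels with a large combined variance contribute little to the residual term, which is how such a loss can down-weight dynamic objects and occlusions without explicit motion segmentation.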
[CV-24] Weakly Supervised Virus Capsid Detection with Image-Level Annotations in Electron Microscopy Images
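[Quick read]: This paper addresses the high cost of the manually annotated bounding boxes that current object detection methods depend on, especially in specialized domains where annotation requires experts and is time-consuming and hard to obtain. The key to the solution is a domain-specific weakly supervised object detection algorithm that trains a detector using only image-level annotations. It distills the knowledge of a model pre-trained to predict the presence or absence of a virus in an image into pseudo-labels, using an optimization approach with a shrinking receptive field to extract virus particles directly, without a specific network architecture. Experiments show that when annotation time is limited, the resulting pseudo-labels outperform existing weak labeling methods and even ground truth labels.
Link: https://arxiv.org/abs/2508.00563
Authors: Hannah Kniesel,Leon Sick,Tristan Payer,Tim Bergner,Kavitha Shaga Devan,Clarissa Read,Paul Walther,Timo Ropinski
Affiliations: Ulm University; TU Vienna
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: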
Abstract:Current state-of-the-art methods for object detection rely on annotated bounding boxes of large data sets for training. However, obtaining such annotations is expensive and can require up to hundreds of hours of manual labor. This poses a challenge, especially since such annotations can only be provided by experts, as they require knowledge about the scientific domain. To tackle this challenge, we propose a domain-specific weakly supervised object detection algorithm that only relies on image-level annotations, which are significantly easier to acquire. Our method distills the knowledge of a pre-trained model, on the task of predicting the presence or absence of a virus in an image, to obtain a set of pseudo-labels that can be used to later train a state-of-the-art object detection model. To do so, we use an optimization approach with a shrinking receptive field to extract virus particles directly without specific network architectures. Through a set of extensive studies, we show how the proposed pseudo-labels are easier to obtain, and, more importantly, are able to outperform other existing weak labeling methods, and even ground truth labels, in cases where the time to obtain the annotation is limited.
[CV-25] Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints ICCV
[Quick read]: This paper addresses how to improve both the alignment with partial point clouds and the physical plausibility of generated articulated objects. The key to the solution is PhysNAP, a diffusion model-based approach that represents part shapes by signed distance functions (SDFs) and guides the reverse diffusion process with a point cloud alignment loss computed from the predicted SDFs. It additionally imposes non-penetration and mobility constraints to make the generated objects more physically plausible, and supports a category-aware variant that further improves alignment when category information is available.
Link: https://arxiv.org/abs/2508.00558
Authors: Jens U. Kreber,Joerg Stueckler
Affiliations: University of Augsburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
Abstract:Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose PhysNAP, a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with PhysNAP using the PartNet-Mobility dataset. We also compare it with an unguided baseline diffusion model and demonstrate that PhysNAP can improve constraint consistency and provides a tradeoff with generative ability.
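A non-penetration constraint based on part SDFs can be sketched as follows. The sketch assumes the convention that an SDF is negative inside a shape, uses analytic spheres as stand-ins for learned part SDFs, and penalizes sample points that lie inside both parts. The function names and the specific penalty are hypothetical, not taken from PhysNAP.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - np.asarray(center, dtype=float), axis=-1) - radius

def penetration_penalty(sdf_a, sdf_b):
    """Mean overlap depth over sample points.

    A point contributes only if it is inside both parts, i.e. both SDF
    values are negative; its contribution is the smaller of the two depths.
    """
    depth_a = np.maximum(-sdf_a, 0.0)
    depth_b = np.maximum(-sdf_b, 0.0)
    return np.minimum(depth_a, depth_b).mean()

points = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
part_a = sphere_sdf(points, center=[0.0, 0.0, 0.0], radius=0.6)
overlapping = penetration_penalty(part_a, sphere_sdf(points, [1.0, 0.0, 0.0], 0.6))
disjoint = penetration_penalty(part_a, sphere_sdf(points, [3.0, 0.0, 0.0], 0.6))
```

In a guided diffusion setting, the gradient of such a penalty with respect to the part parameters could steer sampling away from interpenetrating configurations.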
[CV-26] Training-Free Class Purification for Open-Vocabulary Semantic Segmentation ICCV2025
[Quick read]: This paper addresses the degraded quality of class activation maps and affinity-refined activation maps in training-free Open-Vocabulary Semantic Segmentation (OVSS), caused by class redundancy and visual-language ambiguity. The key to the solution is FreeCP, a training-free class purification framework that purifies the semantic category representations and rectifies errors caused by redundancy and ambiguity, improving the final segmentation predictions. As a plug-and-play module, it significantly boosts the performance of existing OVSS methods.
Link: https://arxiv.org/abs/2508.00557
Authors: Qi Chen,Lingxiao Yang,Yun Chen,Nailong Zhao,Jianhuang Lai,Jie Shao,Xiaohua Xie
Affiliations: Sun Yat-sen University; ByteDance Intelligent Creation; University of Surrey; Alibaba Cloud Computing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025
Abstract:Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS). However, the substantial computational and resource demands associated with training on large datasets have prompted interest in training-free methods for OVSS. Existing training-free approaches primarily focus on modifying model architectures and generating prototypes to improve segmentation performance. However, they often neglect the challenges posed by class redundancy, where multiple categories are not present in the current test image, and visual-language ambiguity, where semantic similarities among categories create confusion in class activation. These issues can lead to suboptimal class activation maps and affinity-refined activation maps. Motivated by these observations, we propose FreeCP, a novel training-free class purification framework designed to address these challenges. FreeCP focuses on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. The purified class representations are then leveraged to produce final segmentation predictions. We conduct extensive experiments across eight benchmarks to validate FreeCP’s effectiveness. Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.
[CV-27] HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models
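[Quick read]: This paper addresses the excessive computational overhead and limited inference efficiency of Vision-Language Models (VLMs), which encode images into long sequences of visual tokens. Existing methods often rely on special tokens (e.g., CLS) or require task-specific training, which hinders generalization across architectures. The key to the solution is HiPrune, a training-free, model-agnostic token pruning framework that exploits the hierarchical attention structure of vision encoders: middle layers attend to object-centric regions, while deep layers capture global context. HiPrune accordingly keeps three types of informative tokens: anchor tokens (high attention in object-centric layers), buffer tokens (adjacent to anchors, for spatial continuity), and register tokens (strong attention in deep layers, for global summarization). Without retraining, HiPrune reduces the token count to as little as 11.1% while preserving up to 99.5% task accuracy, and cuts inference FLOPs and latency by up to 9×, generalizing well across models and tasks.
Link: https://arxiv.org/abs/2508.00553
Authors: Jizhihui Liu,Feiyi Du,Guangdao Zhu,Niu Lian,Jun Li,Bin Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: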
Abstract:Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×, showcasing strong generalization across models and tasks. Code is available at this https URL.
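The three token types described in the abstract can be sketched as a simple selection routine over a patch grid. Everything specific here is an assumption for illustration: the function name, the use of one attention score per token from a "middle" and a "deep" layer, the 4-neighborhood for buffer tokens, and the way the budget is filled. It is not the authors' HiPrune code.

```python
import numpy as np

def select_tokens(mid_attn, deep_attn, grid_hw, n_anchor, budget):
    """Pick token indices to keep on an H x W patch grid (row-major order).

    mid_attn, deep_attn: 1-D arrays of per-token attention scores taken from
    a middle layer and a deep layer (hypothetical inputs).
    """
    h, w = grid_hw
    n_anchor = min(n_anchor, budget)
    # Anchor tokens: highest attention in the object-centric (middle) layer.
    anchors = [int(i) for i in np.argsort(-mid_attn)[:n_anchor]]
    keep = list(anchors)
    # Buffer tokens: spatial 4-neighbors of each anchor.
    for idx in anchors:
        r, c = divmod(idx, w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < h and 0 <= cc < w
            if inside and len(keep) < budget and rr * w + cc not in keep:
                keep.append(rr * w + cc)
    # Register tokens: fill the remaining budget by deep-layer attention.
    for idx in np.argsort(-deep_attn):
        if len(keep) >= budget:
            break
        if int(idx) not in keep:
            keep.append(int(idx))
    return sorted(keep)

# 3 x 3 grid: the center token is the anchor, token 0 is the top register token.
mid = np.array([0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1])
deep = np.array([0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8])
kept = select_tokens(mid, deep, grid_hw=(3, 3), n_anchor=1, budget=6)
```

Keeping the indices sorted preserves the original token order, which matters when the pruned sequence is passed on to the language model with positional information intact.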
[CV-28] DBLP: Noise Bridge Consistency Distillation For Efficient And Reliable Adversarial Purification
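[Quick read]: This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial perturbations, and in particular the reliance of existing diffusion-based adversarial purification methods on intensive iterative denoising, which limits practical deployment. The key to the solution is a framework called Diffusion Bridge Distillation for Purification (DBLP). Its core innovation is a noise bridge distillation objective that builds a principled alignment between the adversarial noise distribution and the clean data distribution within a latent consistency model (LCM). An adaptive semantic enhancement mechanism additionally fuses multi-scale pyramid edge maps as conditioning input to improve semantic fidelity during purification. The method achieves high robust accuracy and image quality with an inference time of about 0.2 s, a significant step toward real-time adversarial purification.
Link: https://arxiv.org/abs/2508.00552
Authors: Chihan Huang,Belal Alsinglawi,Islam Al-qudah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: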
Abstract:Recent advances in deep neural networks (DNNs) have led to remarkable success across a wide range of tasks. However, their susceptibility to adversarial perturbations remains a critical vulnerability. Existing diffusion-based adversarial purification methods often require intensive iterative denoising, severely limiting their practical deployment. In this paper, we propose Diffusion Bridge Distillation for Purification (DBLP), a novel and efficient diffusion-based framework for adversarial purification. Central to our approach is a new objective, noise bridge distillation, which constructs a principled alignment between the adversarial noise distribution and the clean data distribution within a latent consistency model (LCM). To further enhance semantic fidelity, we introduce adaptive semantic enhancement, which fuses multi-scale pyramid edge maps as conditioning input to guide the purification process. Extensive experiments across multiple datasets demonstrate that DBLP achieves state-of-the-art (SOTA) robust accuracy, superior image quality, and around 0.2s inference time, marking a significant step toward real-time adversarial purification.
[CV-29] Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images MICCAI
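[Quick read]: This paper addresses the inability of Vision-Language Models (VLMs) to accurately determine the relative positions of anatomical structures in medical images, a fundamental prerequisite for clinical use. The key contribution is a new benchmark dataset, MIRP (Medical Imaging Relative Positioning), for systematically evaluating this capability, together with an investigation of whether visual prompts (such as alphanumeric or colored markers) improve performance. The markers bring moderate improvements, but the models still rely more on prior anatomical knowledge than on image content, which often leads to incorrect conclusions.
Link: https://arxiv.org/abs/2508.00549
Authors: Daniel Wolf,Heiko Hillenhagen,Billurvan Taskin,Alex Bäuerle,Meinrad Beer,Michael Götz,Timo Ropinski
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025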
Abstract:Clinical decision-making relies heavily on understanding relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions on medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs, GPT-4o, Llama3.2, Pixtral, and JanusPro, and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results remain significantly lower on medical images compared to observations made on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content for answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP, Medical Imaging Relative Positioning, benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.
[CV-30] Video Color Grading via Look-Up Table Generation ICCV2025
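[Quick read]: This paper addresses the fact that video color grading depends on specialist skills, involves a complex workflow, and is hard to personalize. Traditionally, professional colorists adjust colors by hand to create a specific look or mood. The paper proposes a reference-based video color grading framework whose core innovation is to use a diffusion model to explicitly generate a look-up table (LUT) for color attribute alignment, which allows fast inference while preserving the structural details of the video. The key is to guide LUT generation with a consistency constraint on high-level features (look, mood, and emotion), and to add a text-prompt mechanism for user-preference-driven enhancement of low-level features such as contrast and brightness, which improves both automation and controllability.
Link: https://arxiv.org/abs/2508.00548
Authors: Seunghyun Shin,Dongmin Shin,Jisu Shin,Hae-Gon Jeon,Joon-Young Lee
Affiliations: GIST; Yonsei University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV2025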
Abstract:Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes like look, mood, and emotion should be similar to that of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate a user-preference via text prompts for low-level feature enhancement such as contrast and brightness, etc. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at this https URL.
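To make the LUT idea concrete, here is a minimal sketch of how a 3D look-up table is applied to a frame once it exists. It only covers the generic "apply a LUT" step with trilinear interpolation; how the paper's diffusion model generates the LUT is not shown. The function names (`identity_lut`, `apply_lut`, `grade_frame`) and the nested-list LUT layout are assumptions made for illustration, not the authors' code.

```python
def identity_lut(size):
    """Hypothetical helper: a size^3 LUT mapping every colour to itself (size >= 2)."""
    step = 1.0 / (size - 1)
    return [[[(r * step, g * step, b * step) for b in range(size)]
             for g in range(size)] for r in range(size)]


def apply_lut(pixel, lut):
    """Map one RGB pixel (floats in [0, 1]) through a 3D LUT with trilinear interpolation."""
    size = len(lut)
    # Continuous grid coordinates, clamped to the valid range
    coords = [min(max(c, 0.0), 1.0) * (size - 1) for c in pixel]
    # Lower corner of the enclosing grid cell and the fractional offset inside it
    lo = [min(int(c), size - 2) for c in coords]
    frac = [c - l for c, l in zip(coords, lo)]
    out = [0.0, 0.0, 0.0]
    # Blend the 8 corners of the cell, weighted by distance along each axis
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                w = ((frac[0] if dr else 1.0 - frac[0]) *
                     (frac[1] if dg else 1.0 - frac[1]) *
                     (frac[2] if db else 1.0 - frac[2]))
                entry = lut[lo[0] + dr][lo[1] + dg][lo[2] + db]
                for ch in range(3):
                    out[ch] += w * entry[ch]
    return tuple(out)


def grade_frame(frame, lut):
    """Apply the same LUT to every pixel of a frame (a list of rows of RGB tuples)."""
    return [[apply_lut(px, lut) for px in row] for row in frame]
```

Because the LUT is a per-pixel colour mapping, the same table can be reused for every frame and never moves pixels around, which is consistent with the abstract's point about fast inference and no loss of structural detail.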
[CV-31] EPANet: Efficient Path Aggregation Network for Underwater Fish Detection
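[Quick read]: This paper addresses the limited accuracy of underwater fish detection (UFD), which is caused by low object resolution, strong background interference, and high visual similarity between targets and their surroundings. Existing methods mostly rely on local feature enhancement or complex attention mechanisms, often at the cost of higher model complexity and lower efficiency. The key to the solution is an efficient path aggregation network (EPANet) with two core components. The first is an efficient path aggregation feature pyramid network (EPA-FPN), which uses long-range skip connections across scales to improve semantic-spatial complementarity and cross-layer fusion paths to make feature integration more efficient. The second is a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck), which extends the conventional bottleneck with finer-grained feature division and diverse convolutional operations to increase local feature diversity and representation capacity. Experiments show that EPANet outperforms state-of-the-art methods in detection accuracy and inference speed with a comparable or lower parameter count.
Link: https://arxiv.org/abs/2508.00528
Authors: Jinsong Yang,Zeyuan Hu,Yichen Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: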
Abstract:Underwater fish detection (UFD) remains a challenging task in computer vision due to low object resolution, significant background interference, and high visual similarity between targets and surroundings. Existing approaches primarily focus on local feature enhancement or incorporate complex attention mechanisms to highlight small objects, often at the cost of increased model complexity and reduced efficiency. To address these limitations, we propose an efficient path aggregation network (EPANet), which leverages complementary feature integration to achieve accurate and lightweight UFD. EPANet consists of two key components: an efficient path aggregation feature pyramid network (EPA-FPN) and a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck). The EPA-FPN introduces long-range skip connections across disparate scales to improve semantic-spatial complementarity, while cross-layer fusion paths are adopted to enhance feature integration efficiency. The MS-DDSP bottleneck extends the conventional bottleneck structure by introducing finer-grained feature division and diverse convolutional operations, thereby increasing local feature diversity and representation capacity. Extensive experiments on benchmark UFD datasets demonstrate that EPANet outperforms state-of-the-art methods in terms of detection accuracy and inference speed, while maintaining comparable or even lower parameter complexity.
[CV-32] Leveraging Convolutional and Graph Networks for an Unsupervised Remote Sensing Labelling Tool
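[Quick read]: This paper addresses the high cost, expert dependence, and low efficiency of labelling remote sensing imagery, and in particular how to find and label geographical areas with similar semantic content without supervision. The key to its solution is an unsupervised pipeline that needs no pre-labelled data. It combines convolutional neural networks (CNNs) and graph neural networks (GNNs) in a segmentation-based approach: Sentinel-2 imagery is split into homogeneous pixel regions grouped by colour and spatial similarity, and the GNN aggregates information from neighbouring segments to strengthen the local feature representation. This yields a rotationally invariant semantic relationship in the encoding space, supports labelling at a finer granularity, and reduces outliers.
Link: https://arxiv.org/abs/2508.00506
Authors: Tulsi Patel,Mark W. Jones,Thomas Redfern
Affiliations: Swansea University; UK Hydrographic Office
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Video supplement demonstrating feature-space exploration and interactive labelling is available at: this https URL and is archived at this https URL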
Abstract:Machine learning for remote sensing imaging relies on up-to-date and accurate labels for model training and testing. Labelling remote sensing imagery is time and cost intensive, requiring expert analysis. Previous labelling tools rely on pre-labelled data for training in order to label new unseen data. In this work, we define an unsupervised pipeline for finding and labelling geographical areas of similar context and content within Sentinel-2 satellite imagery. Our approach removes limitations of previous methods by utilising segmentation with convolutional and graph neural networks to encode a more robust feature space for image comparison. Unlike previous approaches we segment the image into homogeneous regions of pixels that are grouped based on colour and spatial similarity. Graph neural networks are used to aggregate information about the surrounding segments enabling the feature representation to encode the local neighbourhood whilst preserving its own local information. This reduces outliers in the labelling tool, allows users to label at a granular level, and allows a rotationally invariant semantic relationship at the image level to be formed within the encoding space.
[CV-33] LesiOnTime – Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI
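[Quick read]: This paper addresses inaccurate segmentation of small lesions in breast dynamic contrast-enhanced MRI (DCE-MRI), especially in early cancer screening of high-risk patients. Existing deep learning methods mostly target large lesions and ignore the longitudinal imaging and clinical context, such as BI-RADS scores, that radiologists routinely use. The key to the solution is the LesiOnTime framework, whose core innovations are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that embeds clinical knowledge into training by enforcing latent space alignment for scans with similar BI-RADS assessments. Experiments show a 5% Dice improvement over single-timepoint and longitudinal baselines, which supports the value of temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening.
Link: https://arxiv.org/abs/2508.00496
Authors: Mohammed Kamran,Maria Bernathova,Raoul Varga,Christian Singer,Zsuzsanna Bago-Horvath,Thomas Helbich,Georg Langs,Philipp Seeböck
Affiliations: Medical University of Vienna; Austrian Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: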
Abstract:Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BIRADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at this https URL
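The 5% gain above is reported in terms of Dice. As a reminder of what that metric measures, here is a minimal sketch of the Dice coefficient on flattened binary masks. The function name `dice_score` and the smoothing term `eps` are illustrative assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flattened binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps (an assumed smoothing term) keeps the score defined when both masks are empty
    return (2.0 * intersection + eps) / (total + eps)


# A two-voxel prediction that covers a one-voxel lesion plus one false positive
small_lesion_score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

With only a handful of foreground voxels, a single extra or missing voxel moves the score substantially, which is one reason small-lesion segmentation is hard to evaluate and to optimise.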
[CV-34] SAMSA 2.0: Prompting Segment Anything with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation
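[Quick read]: This paper addresses limited segmentation accuracy in hyperspectral medical images, particularly in clinical settings with scarce data and noise, where conventional RGB-based segmentation models fall short. The key to the solution is the SAMSA 2.0 framework, which introduces spectral angle prompting to fuse spectral information with spatial cues at an early stage. Without retraining, this improves the accuracy and robustness of the Segment Anything Model (SAM) and generalizes better in few-shot and zero-shot conditions.
Link: https://arxiv.org/abs/2508.00493
Authors: Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou
Affiliations: The Hamlyn Centre for Robotic Surgery, Department of Surgery and Cancer, Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: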
Abstract:We present SAMSA 2.0, an interactive segmentation framework for hyperspectral medical imaging that introduces spectral angle prompting to guide the Segment Anything Model (SAM) using spectral similarity alongside spatial cues. This early fusion of spectral information enables more accurate and robust segmentation across diverse spectral datasets. Without retraining, SAMSA 2.0 achieves up to +3.8% higher Dice scores compared to RGB-only models and up to +3.1% over prior spectral fusion methods. Our approach enhances few-shot and zero-shot performance, demonstrating strong generalization in challenging low-data and noisy scenarios common in clinical imaging.
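The spectral angle itself is a standard similarity measure between two spectra, so a short sketch may help. The code below computes the angle between a reference spectrum (for example, the pixel a user clicked) and every pixel of a hyperspectral cube. The names `spectral_angle` and `angle_map` are hypothetical, and how SAMSA 2.0 turns such a map into a prompt for SAM is not described in the abstract, so that step is left out.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra; 0 means identical spectral shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: treat an all-zero spectrum as maximally dissimilar
        return math.pi / 2
    # Clamp to guard against rounding slightly outside [-1, 1]
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def angle_map(cube, reference):
    """Spectral angle of every pixel in a cube (rows x cols x bands) to a reference spectrum."""
    return [[spectral_angle(pixel, reference) for pixel in row] for row in cube]
```

The angle ignores overall brightness (scaling a spectrum does not change it), which makes it a convenient cue when illumination varies across a scene.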
[CV-35] LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
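[Quick read]: This paper addresses how to generate coherent and consistent images from multiple references with spatial layout awareness and without any training. The core challenge is to extract and preserve entity identity, background consistency, and precise layout control across several reference images. The key to the solution is the LAMIC framework, built on the MMDiT model, which introduces two plug-and-play attention mechanisms: Group Isolation Attention (GIA) to enhance entity disentanglement, and Region-Modulated Attention (RMA) to enable layout-aware generation. The paper also defines three metrics, Inclusion Ratio (IN-R), Fill Ratio (FI-R), and Background Similarity (BG-S), to evaluate layout control and consistency. Experiments show that LAMIC reaches state-of-the-art performance on most major metrics without any training or fine-tuning, indicating strong zero-shot generalization.
Link: https://arxiv.org/abs/2508.00477
Authors: Yuzhuo Chen,Zehua Ma,Jianhua Wang,Kai Kang,Shunyu Yao,Weiming Zhang
Affiliations: University of Science and Technology of China; Onestory Team; East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 3 tables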
Abstract:In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC’s superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC’s performance is expected to scale accordingly. Our implementation is available at: this https URL.
[CV-36] HyPCV-Former: Hyperbolic Spatio-Temporal Transformer for 3D Point Cloud Video Anomaly Detection
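[Quick read]: This paper addresses a limitation of conventional video anomaly detection methods that rely on Euclidean representations, which struggle to capture hierarchical event structure and spatio-temporal continuity. The key to the solution is HyPCV-Former, a spatio-temporal transformer that operates in hyperbolic space. It embeds per-frame features of point cloud sequences into Lorentzian hyperbolic space to better model the latent hierarchical structure of events. It also introduces a hyperbolic multi-head self-attention (HMHA) mechanism that uses Lorentzian inner products and a curvature-aware softmax to learn temporal dependencies under non-Euclidean geometry. All feature transformations and anomaly scoring are performed directly in full Lorentzian space, which avoids the error of tangent space approximation.
Link: https://arxiv.org/abs/2508.00473
Authors: Jiaping Cao,Kangkang Zhou,Juan Du
Affiliations: Hong Kong University of Science and Technology (Guangzhou); HKUST-GZ
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: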
Abstract:Video anomaly detection is a fundamental task in video surveillance, with broad applications in public safety and intelligent monitoring systems. Although previous methods leverage Euclidean representations in RGB or depth domains, such embeddings are inherently limited in capturing hierarchical event structures and spatio-temporal continuity. To address these limitations, we propose HyPCV-Former, a novel hyperbolic spatio-temporal transformer for anomaly detection in 3D point cloud videos. Our approach first extracts per-frame spatial features from point cloud sequences via point cloud extractor, and then embeds them into Lorentzian hyperbolic space, which better captures the latent hierarchical structure of events. To model temporal dynamics, we introduce a hyperbolic multi-head self-attention (HMHA) mechanism that leverages Lorentzian inner products and curvature-aware softmax to learn temporal dependencies under non-Euclidean geometry. Our method performs all feature transformations and anomaly scoring directly within full Lorentzian space rather than via tangent space approximation. Extensive experiments demonstrate that HyPCV-Former achieves state-of-the-art performance across multiple anomaly categories, with a 7% improvement on the TIMo dataset and a 5.6% gain on the DAD dataset compared to benchmarks. The code will be released upon paper acceptance.
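For readers unfamiliar with the Lorentz model, the sketch below shows the two basic operations the abstract mentions: lifting a Euclidean feature onto the hyperboloid and evaluating the Lorentzian inner product, from which the geodesic distance follows. It assumes curvature -1 and uses hypothetical function names; the paper's curvature-aware softmax and attention layers are not reproduced here.

```python
import math


def lift_to_hyperboloid(v):
    """Lift a Euclidean vector v to the unit hyperboloid (curvature -1 is an assumption)."""
    time_like = math.sqrt(1.0 + sum(c * c for c in v))
    return [time_like] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid, d(x, y) = arcosh(-<x, y>_L)."""
    # max(...) guards against rounding that would push the argument below 1
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

Every lifted point satisfies `lorentz_inner(x, x) == -1` up to rounding, and distances grow quickly as points move away from the origin, which is the property that makes hyperbolic space a natural fit for tree-like hierarchies.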
[CV-37] Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution
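[Quick read]: This paper addresses the difficulty video super-resolution (VSR) models have in achieving high-fidelity alignment with the low-resolution input while also keeping frames temporally consistent. The key to the solution is Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), which introduces high-level semantic information and joint spatio-temporal guidance in the latent diffusion space. This lets the method recover fine details while preserving temporal coherence, improving the realism and perceptual quality of the reconstructed video.
Link: https://arxiv.org/abs/2508.00471
Authors: Yiwen Wang,Xinning Chai,Yuhong Zhang,Zhengxue Cheng,Jun Zhao,Rong Xie,Li Song
Affiliations: Shanghai Jiao Tong University; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: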
Abstract:Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and temporal-spatio guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves high-reality visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.
[CV-38] PIF-Net: Ill-Posed Prior Guided Multispectral and Hyperspectral Image Fusion via Invertible Mamba and Fusion-Aware LoRA
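[Quick read]: This paper addresses the ill-posed nature of multispectral and hyperspectral image fusion (MHIF), which arises from the inherent trade-off between spectral and spatial information and from limited observations, and in particular the drop in fusion quality caused by data misalignment. The key to the solution is a fusion framework named PIF-Net that explicitly incorporates ill-posed priors to strengthen fusion. It uses an invertible Mamba architecture to balance global spectral modeling with computational efficiency, maintaining information consistency during feature transformation and ensuring stable gradient flow and reversibility. It also introduces a Fusion-Aware Low-Rank Adaptation module that dynamically calibrates spectral and spatial features while keeping the model lightweight, enabling high-quality image reconstruction.
Link: https://arxiv.org/abs/2508.00453
Authors: Baisong Li,Xingwang Wang,Haixiao Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: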
Abstract:The goal of multispectral and hyperspectral image fusion (MHIF) is to generate high-quality images that simultaneously possess rich spectral information and fine spatial details. However, due to the inherent trade-off between spectral and spatial information and the limited availability of observations, this task is fundamentally ill-posed. Previous studies have not effectively addressed the ill-posed nature caused by data misalignment. To tackle this challenge, we propose a fusion framework named PIF-Net, which explicitly incorporates ill-posed priors to effectively fuse multispectral images and hyperspectral images. To balance global spectral modeling with computational efficiency, we design a method based on an invertible Mamba architecture that maintains information consistency during feature transformation and fusion, ensuring stable gradient flow and process reversibility. Furthermore, we introduce a novel fusion module called the Fusion-Aware Low-Rank Adaptation module, which dynamically calibrates spectral and spatial features while keeping the model lightweight. Extensive experiments on multiple benchmark datasets demonstrate that PIF-Net achieves significantly better image restoration performance than current state-of-the-art methods while maintaining model efficiency.
[CV-39] CLIPTime: Time-Aware Multimodal Representation Learning from Images and Text
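[Quick read]: This paper addresses the limited ability of existing vision-language models such as CLIP to capture the temporal dynamics of biological growth, which makes it hard to predict an organism's developmental stage and the corresponding timestamp in fields such as microbiology, agriculture, and biodegradation research. The key to the solution is CLIPTime, a multimodal, multitask framework built on the CLIP architecture that learns joint visual-textual embeddings and enables time-aware inference without explicit temporal input. Its core contributions are: 1) a synthetic fungal growth dataset with aligned timestamps and stage labels for training and evaluation; 2) a joint classification and regression scheme that predicts discrete growth stages alongside continuous timestamps; and 3) custom metrics, including temporal accuracy and regression error, to quantify the precision of time-aware predictions. Experiments show that CLIPTime models biological progression effectively and produces interpretable temporal outputs, highlighting the potential of vision-language models for real-world biological monitoring.
Link: https://arxiv.org/abs/2508.00447
Authors: Anju Rani,Daniel Ortiz-Arroyo,Petar Durdevic
Affiliations: Aalborg University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 8 figures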
Abstract:Understanding the temporal dynamics of biological growth is critical across diverse fields such as microbiology, agriculture, and biodegradation research. Although vision-language models like Contrastive Language Image Pretraining (CLIP) have shown strong capabilities in joint visual-textual reasoning, their effectiveness in capturing temporal progression remains limited. To address this, we propose CLIPTime, a multimodal, multitask framework designed to predict both the developmental stage and the corresponding timestamp of fungal growth from image and text inputs. Built upon the CLIP architecture, our model learns joint visual-textual embeddings and enables time-aware inference without requiring explicit temporal input during testing. To facilitate training and evaluation, we introduce a synthetic fungal growth dataset annotated with aligned timestamps and categorical stage labels. CLIPTime jointly performs classification and regression, predicting discrete growth stages alongside continuous timestamps. We also propose custom evaluation metrics, including temporal accuracy and regression error, to assess the precision of time-aware predictions. Experimental results demonstrate that CLIPTime effectively models biological progression and produces interpretable, temporally grounded outputs, highlighting the potential of vision-language models in real-world biological monitoring applications.
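A joint classification and regression objective of the kind described above is often written as a weighted sum of a cross-entropy term and a squared-error term. The sketch below shows that generic pattern for a single sample. The abstract does not state CLIPTime's actual loss, so the weighted-sum form, the `reg_weight` parameter, and the function names are all assumptions.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits, using the log-sum-exp trick."""
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_partition - logits[label]


def joint_loss(stage_logits, stage_label, time_pred, time_true, reg_weight=1.0):
    """Assumed multitask loss: stage classification plus weighted squared timestamp error."""
    classification = cross_entropy(stage_logits, stage_label)
    regression = (time_pred - time_true) ** 2
    return classification + reg_weight * regression
```

In such a setup, `reg_weight` trades off how strongly the shared embedding is pushed toward accurate timestamps versus correct stage labels, and it usually needs tuning because the two terms live on different scales.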
[CV-40] AutoDebias: Automated Framework for Debiasing Text-to-Image Models
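[Quick read]: This paper addresses unintended social biases in text-to-image (T2I) generation models, especially gender or racial stereotypes that appear in outputs even when the prompt does not mention them, and in particular the subtle or interacting biases that existing debiasing methods handle poorly. The key to the solution is the AutoDebias framework, which uses vision-language models to automatically identify harmful visual patterns and builds fairness guides by generating inclusive alternative prompts that reflect balanced representations. These guides drive a CLIP-guided training process that mitigates bias while preserving the image quality and diversity of the original model.
Link: https://arxiv.org/abs/2508.00445
Authors: Hongyi Cai,Mohammad Mahdinur Rahman,Mingkang Dong,Jie Li,Muxin Pu,Zhili Fang,Yinan Peng,Hanjun Luo,Yang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: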
Abstract:Text-to-Image (T2I) models generate high-quality images from text prompts but often exhibit unintended social biases, such as gender or racial stereotypes, even when these attributes are not mentioned. Existing debiasing methods work well for simple or well-known cases but struggle with subtle or overlapping biases. We propose AutoDebias, a framework that automatically identifies and mitigates harmful biases in T2I models without prior knowledge of specific bias types. Specifically, AutoDebias leverages vision-language models to detect biased visual patterns and constructs fairness guides by generating inclusive alternative prompts that reflect balanced representations. These guides drive a CLIP-guided training process that promotes fairer outputs while preserving the original model’s image quality and diversity. Unlike existing methods, AutoDebias effectively addresses both subtle stereotypes and multiple interacting biases. We evaluate the framework on a benchmark covering over 25 bias scenarios, including challenging cases where multiple biases occur simultaneously. AutoDebias detects harmful patterns with 91.6% accuracy and reduces biased outputs from 90% to negligible levels, while preserving the visual fidelity of the original model.
[CV-41] SDMatte: Grafting Diffusion Models for Interactive Matting ICCV2025
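[Quick read]: This paper addresses the weakness of current interactive matting methods in extracting fine-grained detail in edge regions, especially with complex textures and blurred boundaries. Its solution is SDMatte, an interactive matting framework based on diffusion models, with three key innovations. First, it exploits the strong priors of diffusion models and converts their text-driven interaction capability into a visual prompt-driven one. Second, it integrates coordinate embeddings of visual prompts and opacity embeddings of target objects into the U-Net, making the model more sensitive to spatial position and opacity information. Third, it designs a masked self-attention mechanism that lets the model focus on the regions specified by visual prompts, improving matting accuracy and detail preservation.
Link: https://arxiv.org/abs/2508.00443
Authors: Longfei Huang,Yu Liang,Hao Zhang,Jinwei Chen,Wei Dong,Lunde Chen,Wanyu Liu,Bo Li,Pengtao Jiang
Affiliations: Shanghai University; vivo Mobile Communication Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025, 11 pages, 4 figures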
Abstract:Recent interactive matting methods have shown satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs, demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte’s sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Our code and model are available at this https URL.
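The third contribution, masked self-attention, can be illustrated with a generic single-head version in which a binary mask decides which tokens may be attended to. This is a sketch of the general technique only: the abstract does not specify how SDMatte builds its mask from the visual prompt or where the mask enters the U-Net, and the function name and list-based tensors are assumptions.

```python
import math


def masked_self_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention restricted to tokens where mask is True."""
    if not any(mask):
        raise ValueError("mask must keep at least one token")
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Masked-out tokens get a score of -inf, so their softmax weight is exactly 0
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if keep else -math.inf
            for k, keep in zip(keys, mask)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        normaliser = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / normaliser
            for c in range(len(values[0]))
        ])
    return outputs
```

When the mask keeps a single token, every output row equals that token's value vector, which is a quick sanity check that attention really is confined to the prompted region.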
[CV-42] TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
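[Quick read]: This paper addresses the performance drop of tubular structure segmentation (TSS) under domain shift in cross-domain settings. TSS is especially sensitive to changes in topological structure, and differences in local features such as texture and contrast can break the topological continuity of the segmentation and hurt overall accuracy. To tackle this, the authors propose TopoTTA, the first test-time adaptation (TTA) framework designed specifically for TSS. It has two stages. Stage 1 uses the proposed Topological Meta Difference Convolutions (TopoMDCs) to strengthen the model's representation of cross-domain topological discrepancies without altering pre-trained parameters. Stage 2 introduces a Topology Hard sample Generation (TopoHG) strategy and aligns predictions on the generated pseudo-break regions using pseudo-labels, which improves topological continuity. Experiments show an average clDice improvement of 31.81% across multiple datasets, and TopoTTA can serve as a plug-and-play TTA module for CNN-based TSS models.
Link: https://arxiv.org/abs/2508.00442
Authors: Jiale Zhou,Wenhan Wang,Shikun Li,Xiaolei Qu,Xin Guo,Yizhong Liu,Wenzhong Tang,Xun Lin,Yefeng Zheng
Affiliations: Zhejiang University; Westlake University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: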
Abstract:Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA’s effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.
[CV-43] Reducing the gap between general purpose data and aerial images in concentrated solar power plants
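[Quick read]: This paper addresses the difficulty of generalizing computer vision models to drone-captured aerial images of Concentrated Solar Power (CSP) plants. CSP sites contain highly reflective surfaces and domain-specific elements, so models trained on generic datasets do not transfer directly, while collecting and annotating real data is costly and slows deployment in industrial settings. The key to the solution is AerialCSP, a high-quality synthetic aerial imagery dataset that generates annotated images by closely simulating real-world conditions and is used for model pretraining. The approach substantially reduces the dependence on manually labelled data and performs well on real fault detection, particularly for rare and small defects.
Link: https://arxiv.org/abs/2508.00440
Authors: M.A. Pérez-Cutiño,J. Valverde,J. Capitán,J.M. Díaz-Báñez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: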
Abstract:In the context of Concentrated Solar Power (CSP) plants, aerial images captured by drones present a unique set of challenges. Unlike urban or natural landscapes commonly found in existing datasets, solar fields contain highly reflective surfaces, and domain-specific elements that are uncommon in traditional computer vision benchmarks. As a result, machine learning models trained on generic datasets struggle to generalize to this setting without extensive retraining and large volumes of annotated data. However, collecting and labeling such data is costly and time-consuming, making it impractical for rapid deployment in industrial applications. To address this issue, we propose a novel approach: the creation of AerialCSP, a virtual dataset that simulates aerial imagery of CSP plants. By generating synthetic data that closely mimic real-world conditions, our objective is to facilitate pretraining of models before deployment, significantly reducing the need for extensive manual labeling. Our main contributions are threefold: (1) we introduce AerialCSP, a high-quality synthetic dataset for aerial inspection of CSP plants, providing annotated data for object detection and image segmentation; (2) we benchmark multiple models on AerialCSP, establishing a baseline for CSP-related vision tasks; and (3) we demonstrate that pretraining on AerialCSP significantly improves real-world fault detection, particularly for rare and small defects, reducing the need for extensive manual labeling. AerialCSP is made publicly available at this https URL.
[CV-44] Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting ICCV2025
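[Quick read]: This paper addresses the difficulty that existing completion methods based on pre-trained diffusion models have in producing plausible results for human-object interaction (HOI) in dynamic scenes, where the core challenge is the model's limited understanding of HOI. The key to the solution is to introduce physical prior knowledge, such as human topology and contact information, and to design a multi-regional inpainting strategy tailored to HOI. The image is divided into a primary region, where occluded object parts are most likely to be, and a secondary region, where occlusion is less probable, and the diffusion model applies a customized denoising strategy to each. This improves the accuracy and realism of the completions in both shape and visual detail.
Link: https://arxiv.org/abs/2508.00427
Authors: Seunggeun Chi,Enna Sachdeva,Pin-Hao Huang,Kwonjoon Lee
Affiliations: Purdue University; Honda Research Institute USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025 (Highlight)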
Abstract:Amodal completion, which is the process of inferring the full appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, such as those that use pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios because they have a limited understanding of HOI. To solve this problem, we’ve developed a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. By incorporating physical constraints from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to be, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method uses customized denoising strategies across these regions within a diffusion model. This improves the accuracy and realism of the generated completions in both their shape and visual detail. Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios, moving machine perception closer to a more human-like understanding of dynamic environments. We also show that our pipeline is robust even without ground-truth contact annotations, which broadens its applicability to tasks like 3D reconstruction and novel view/pose synthesis.
[CV-45] UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken ACM-MM2025
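Quick read: This paper targets underwater instance segmentation (UIS), where the color distortion and blurred boundaries typical of underwater scenes break the continuity of features inside instance objects, and where the hidden states of complex backgrounds interfere with the understanding of instances. The key to the solution is UIS-Mamba, the first Mamba-based UIS model, with two new modules. Dynamic Tree Scan (DTS) lets image patches offset and scale dynamically to guide construction of the minimum spanning tree, providing dynamic local receptive fields that preserve feature continuity within instances. Hidden State Weaken (HSW) uses an Ncut-based mechanism to weaken the hidden states of complex backgrounds, focusing the information flow of state propagation on the instances themselves and improving segmentation accuracy.
Link: https://arxiv.org/abs/2508.00421
Authors: Runmin Cong, Zongji Yu, Hao Fang, Haoyan Sun, Sam Kwong
Affiliations: Shandong University; Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM MM 2025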
Abstract:Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation to the instances themselves through the Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at this https URL.
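The tree-scan idea can be illustrated with a minimal sketch. It builds a minimum spanning tree over a 4-connected grid of patches using Prim's algorithm, with the L1 distance between patch features as the edge weight. This is only an illustration of the spanning-tree step: the function name `grid_mst`, the L1 weighting, and the toy features are assumptions, and the dynamic patch offset and scaling that DTS adds are not modelled here.

```python
import heapq

def grid_mst(features):
    """Prim's algorithm on a 4-connected patch grid.

    features: H x W nested list, each cell a feature vector (list of numbers).
    Returns the tree edges as ((r1, c1), (r2, c2)) pairs in insertion order.
    """
    h, w = len(features), len(features[0])

    def dist(a, b):
        # Hypothetical edge weight: L1 distance between two patch features.
        fa, fb = features[a[0]][a[1]], features[b[0]][b[1]]
        return sum(abs(x - y) for x, y in zip(fa, fb))

    visited = {(0, 0)}
    heap = []

    def push_edges(node):
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in visited:
                heapq.heappush(heap, (dist(node, (nr, nc)), node, (nr, nc)))

    push_edges((0, 0))
    edges = []
    while heap and len(visited) < h * w:
        _, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # stale entry: v was reached through a cheaper edge
        visited.add(v)
        edges.append((u, v))
        push_edges(v)
    return edges

# Toy 2 x 2 grid: the top row and the bottom row are two different "instances".
toy = [[[0], [1]],
       [[10], [11]]]
tree = grid_mst(toy)
```

In the toy grid the tree links the two similar patches in each row and crosses the row boundary only once, which is the property that keeps a scan path inside one instance for as long as possible.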
[CV-46] IN2OUT: Fine-Tuning Video Inpainting Model for Video Outpainting Using Hierarchical Discriminator ICIP2025
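Quick read: This paper addresses video outpainting, that is, extending the frame borders while staying consistent with the given content, so that the new regions are of high quality and do not break the semantics or visual coherence of the original scene. Existing methods usually generate only the background and neglect learning and reconstructing object flow, while directly applying or fine-tuning inpainting models for outpainting tends to give blurry results. The key to the solution is a hierarchical discriminator that splits the adversarial training objective into two levels: a global objective for the overall structural plausibility of the extended regions and a local objective for detail quality. The authors also design an outpainting-specific loss that uses the discriminator's local and global features to guide the generator, improving both the visual appeal and the global coherence of the results.
Link: https://arxiv.org/abs/2508.00418
Authors: Sangwoo Youn, Minji Lee, Nokap Tony Park, Yeonggyoo Jeon, Taeyoung Na
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: ICIP 2025. Code: this https URL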
Abstract:Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, we suggest the use of video inpainting models that excel in object flow learning and reconstruction in outpainting rather than solely generating the background as in existing methods. However, directly applying or fine-tuning inpainting models to outpainting has shown to be ineffective, often leading to blurry results. Our extensive experiments on discriminator designs reveal that a critical component missing in the outpainting fine-tuning process is a discriminator capable of effectively assessing the perceptual quality of the extended areas. To tackle this limitation, we differentiate the objectives of adversarial training into global and local goals and introduce a hierarchical discriminator that meets both objectives. Additionally, we develop a specialized outpainting loss function that leverages both local and global features of the discriminator. Fine-tuning on this adversarial loss function enhances the generator’s ability to produce both visually appealing and globally coherent outpainted scenes. Our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Supplementary materials including the demo video and the code are available in SigPort.
[CV-47] DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space ICCV2025
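Quick read: This paper addresses a problem in high-resolution diffusion models: increasing the number of latent channels of the autoencoder improves reconstruction quality but slows diffusion-model convergence, which lowers generation quality, caps the quality of latent diffusion models, and hinders the use of autoencoders with higher spatial compression ratios. The solution rests on two innovations. Structured Latent Space is a training-based approach that imposes a channel-wise structure on the latent space, with front channels capturing object structure and latter channels capturing image detail. Augmented Diffusion Training adds extra diffusion training objectives on the object latent channels to accelerate convergence. Together they yield faster convergence and better diffusion scaling results.
Link: https://arxiv.org/abs/2508.00413
Authors: Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai
Affiliations: NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025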
Abstract:We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder’s latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: this https URL.
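One plausible way to obtain a front-to-back channel structure is to randomly zero the latter latent channels during autoencoder training, so that the decoder must recover coarse structure from the front channels alone. The sketch below shows only that masking step. The abstract does not spell out the training recipe, so the function names `mask_latter_channels` and `sample_keep` and the fractions in `choices` are assumptions made for illustration.

```python
import random

def mask_latter_channels(latent, keep):
    """Zero every channel whose index is >= keep.

    latent: list of channels, each channel a flat list of floats.
    """
    return [ch[:] if i < keep else [0.0] * len(ch) for i, ch in enumerate(latent)]

def sample_keep(num_channels, choices=(0.25, 0.5, 1.0), rng=random):
    """Pick how many front channels stay active for one training step.

    The fractions in `choices` are hypothetical, not taken from the paper.
    """
    frac = rng.choice(choices)
    return max(1, int(num_channels * frac))

# Toy latent with 4 channels of 2 values each; keep only the first 2 channels.
latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked = mask_latter_channels(latent, keep=2)
```

A training loop would call `sample_keep` once per step, mask the encoder output, and compute the reconstruction loss on the decoder output as usual.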
[CV-48] Sortblock: Similarity-Aware Feature Reuse for Diffusion Model
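Quick read: This paper addresses the high inference latency of Diffusion Transformers (DiTs), caused by their inherently sequential denoising process, which limits deployment in real-time scenarios. The key to the solution is Sortblock, a training-free acceleration framework that dynamically caches block-wise features according to their similarity across adjacent timesteps, ranks the evolution of residuals to adaptively determine a recomputation ratio, and selectively skips redundant computation while preserving generation quality. A lightweight linear prediction mechanism reduces the error accumulated by skipped computation. Experiments across tasks and DiT architectures show more than 2× inference speedup with minimal loss in output quality.
Link: https://arxiv.org/abs/2508.00412
Authors: Hanqi Chen, Xu Zhang, Xiaoliu Guan, Lielin Jiang, Guanzhong Wang, Zeyu Chen, Yi Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: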
Abstract:Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2× inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
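The similarity-ranked reuse described above can be sketched in a few lines. The sketch compares each block's residual at the two most recent fully computed timesteps, ranks blocks by cosine similarity, and marks the least similar fraction for recomputation; the remaining blocks would reuse cached features, optionally extrapolated linearly. The function names, the fixed `ratio` argument (the paper determines the ratio adaptively), and the two-point extrapolation are assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_blocks_to_recompute(prev_residuals, curr_residuals, ratio):
    """Return sorted indices of the blocks whose residuals changed the most."""
    sims = [cosine(p, c) for p, c in zip(prev_residuals, curr_residuals)]
    order = sorted(range(len(sims)), key=lambda i: sims[i])  # least similar first
    k = max(1, math.ceil(ratio * len(sims)))
    return sorted(order[:k])

def linear_predict(feat_prev2, feat_prev1):
    """Extrapolate a skipped block's feature from its last two cached values."""
    return [2 * b - a for a, b in zip(feat_prev2, feat_prev1)]

# Three blocks: block 0 is unchanged, block 1 rotated by 90 degrees, block 2 by 45.
prev = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
curr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recompute = select_blocks_to_recompute(prev, curr, ratio=0.5)
predicted = linear_predict([1.0, 2.0], [2.0, 4.0])
```

With `ratio=0.5` the two most-changed blocks (1 and 2) are recomputed and block 0 is served from the cache.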
[CV-49] PMR: Physical Model-Driven Multi-Stage Restoration of Turbulent Dynamic Videos
Quick read: This paper addresses the geometric distortion and blur caused by atmospheric turbulence, which degrade long-range videos of dynamic scenes; under strong turbulence and complex dynamics, existing methods struggle to restore edge detail and remove mixed distortions. The key to the solution is a Dynamic Efficiency Index (DEI) that quantifies video dynamic intensity under varying turbulence conditions and is used to build a high-dynamic turbulence training dataset, together with a Physical Model-Driven Multi-Stage Video Restoration (PMR) framework with three stages: de-tilting, motion segmentation enhancement, and de-blurring. With lightweight backbones and stage-wise joint training, PMR reaches high restoration quality efficiently and generalizes well to real-world scenes with strong turbulence and complex dynamics.
Link: https://arxiv.org/abs/2508.00406
Authors: Tao Wu, Jingyuan Ye, Ying Fu
Affiliations: Chengdu University of Information Technology; National Innovation Center For UHD Video Technology; School of Automation and Intelligent Sensing & Institute of Image Processing and Pattern Recognition & Institute of Medical Robotics, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Geometric distortions and blurring caused by atmospheric turbulence degrade the quality of long-range dynamic scene videos. Existing methods struggle with restoring edge details and eliminating mixed distortions, especially under conditions of strong turbulence and complex dynamics. To address these challenges, we introduce a Dynamic Efficiency Index (DEI), which combines turbulence intensity, optical flow, and proportions of dynamic regions to accurately quantify video dynamic intensity under varying turbulence conditions and provide a high-dynamic turbulence training dataset. Additionally, we propose a Physical Model-Driven Multi-Stage Video Restoration (PMR) framework that consists of three stages: de-tilting for geometric stabilization, motion segmentation enhancement for dynamic region refinement, and de-blurring for quality restoration. PMR employs lightweight backbones and stage-wise joint training to ensure both efficiency and high restoration quality. Experimental results demonstrate that the proposed method effectively suppresses motion trailing artifacts, restores edge details and exhibits strong generalization capability, especially in real-world scenarios characterized by high-turbulence and complex dynamics. We will make the code and datasets openly available.
[CV-50] Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents ICCV2025
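Quick read: This paper addresses the lack of high-fidelity, interactive 3D simulation environments for retail, which are needed to measure how embodied agents perform on shopping tasks relative to humans. The key to the solution is Sari Sandbox, a high-fidelity, photorealistic 3D retail simulation with more than 250 interactive grocery items across three store configurations, controlled via an API. It is accompanied by SariBench, a dataset of annotated human demonstrations at varied task difficulties, which gives embodied agents a comparable baseline. The sandbox supports both virtual reality (VR) for human interaction and an embodied agent powered by a vision language model (VLM).
Link: https://arxiv.org/abs/2508.00400
Authors: Janika Deborah Gajo, Gerarld Paul Merales, Jerome Escarcha, Brenden Ashley Molina, Gian Nartea, Emmanuel G. Maminta, Juan Carlos Roldan, Rowel O. Atienza
Affiliations: University of the Philippines, Diliman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, accepted in ICCV 2025 Workshop on RetailVision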
Abstract:We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via this https URL.
[CV-51] iSafetyBench: A video-language benchmark for safety in industrial environment ICCV2025
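Quick read: This paper addresses the weak ability of current vision-language models (VLMs) to recognize both routine operations and safety-critical anomalies in high-stakes industrial settings. VLMs perform well on zero-shot video understanding, but the industrial safety domain lacks a systematic evaluation benchmark. The key to the solution is iSafetyBench, a first-of-its-kind video-language benchmark for industrial environments with 1,100 video clips from real-world industrial settings, open-vocabulary multi-label action tags spanning 98 routine and 67 hazardous action categories, and multiple-choice questions for single-label and multi-label evaluation, which allows fine-grained assessment in both standard and safety-critical contexts. Experiments show that current state-of-the-art VLMs struggle on the benchmark, underscoring the need for more robust, safety-aware multimodal models.
Link: https://arxiv.org/abs/2508.00399
Authors: Raiyaan Abdullah, Yogesh Singh Rawat, Shruti Vyas
Affiliations: University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to VISION’25 - ICCV 2025 workshop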
Abstract:Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: this https URL.
[CV-52] Occlusion-robust Stylization for Drawing-based 3D Animation ICCV2025
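Quick read: This paper addresses the degradation of style properties under occlusion in drawing-based 3D animation: when the target pose at inference involves occlusions caused by dynamic motion, existing methods show contour flickering and stroke blurring. The root cause is a "stylization pose gap": the target poses used in training are always occlusion-free, whereas poses at inference often contain complex occlusions. The key to the solution is the Occlusion-robust Stylization Framework (OSF), which uses optical flow to provide occlusion-robust edge guidance in place of edge inputs that become inaccurate under occlusion, keeping stylization consistent. OSF also runs in a single pass instead of the previous two-stage method, with 2.4x faster inference and 2.1x less memory.
Link: https://arxiv.org/abs/2508.00398
Authors: Sunjae Yoon, Gwanhyeong Koo, Younghwan Lee, Ji Woo Hong, Chang D. Yoo
Affiliations: KAIST
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 13 figures, ICCV 2025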
Abstract:3D animation aims to generate a 3D animated video from an input image and a target 3D motion sequence. Recent advances in image-to-3D models enable the creation of animations directly from user-hand drawings. Distinguished from conventional 3D animation, drawing-based 3D animation is crucial to preserve artist’s unique style properties, such as rough contours and distinct stroke patterns. However, recent methods still exhibit quality deterioration in style properties, especially under occlusions caused by overlapping body parts, leading to contour flickering and stroke blurring. This occurs due to a ‘stylization pose gap’ between training and inference in stylization networks designed to preserve drawing styles in drawing-based 3D animation systems. The stylization pose gap denotes that input target poses used to train the stylization network are always in occlusion-free poses, while target poses encountered in an inference include diverse occlusions under dynamic motions. To this end, we propose Occlusion-robust Stylization Framework (OSF) for drawing-based 3D animation. We found that while employing object’s edge can be effective input prior for guiding stylization, it becomes notably inaccurate when occlusions occur at inference. Thus, our proposed OSF provides occlusion-robust edge guidance for stylization network using optical flow, ensuring a consistent stylization even under occlusions. Furthermore, OSF operates in a single run instead of the previous two-stage method, achieving 2.4x faster inference and 2.1x less memory.
[CV-53] Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency
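Quick read: This paper addresses video forgery detection as diffusion-generated video becomes increasingly realistic; existing methods struggle to capture fine-grained temporal inconsistencies in videos with high visual fidelity and coherent motion. The key to the solution is a detection framework based on spatial-temporal consistency that combines RGB appearance features with optical flow residuals. In its dual-branch architecture, one branch analyzes RGB frames for appearance-level artifacts and the other processes flow residuals to reveal subtle motion anomalies caused by imperfect temporal synthesis, giving robust detection across videos from many generative models.
Link: https://arxiv.org/abs/2508.00397
Authors: Xi Xue, Kunio Suzuki, Nabarun Goswami, Takuya Shintate
Affiliations: NABLAS Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: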
Abstract:The rapid advancement of diffusion-based video generation models has led to increasingly realistic synthetic content, presenting new challenges for video forgery detection. Existing methods often struggle to capture fine-grained temporal inconsistencies, particularly in AI-generated videos with high visual fidelity and coherent motion. In this work, we propose a detection framework that leverages spatial-temporal consistency by combining RGB appearance features with optical flow residuals. The model adopts a dual-branch architecture, where one branch analyzes RGB frames to detect appearance-level artifacts, while the other processes flow residuals to reveal subtle motion anomalies caused by imperfect temporal synthesis. By integrating these complementary features, the proposed method effectively detects a wide range of forged videos. Extensive experiments on text-to-video and image-to-video tasks across ten diverse generative models demonstrate the robustness and strong generalization ability of the proposed approach.
[CV-54] Decouple before Align: Visual Disentanglement Enhances Prompt Tuning
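Quick read: This paper addresses attention bias in prompt tuning (PT) caused by an information asymmetry between modalities: the visual modality usually carries more context than the object-oriented textual modality, so coarse alignment drives the model to attend to the context area rather than the object of interest. The key to the solution is DAPT, an architecture-free framework built on a decouple-before-align idea. Visual features are first explicitly decoupled into foreground and background representations using coarse-and-fine visual segmentation cues, and these are then aligned with the original foreground texts and hand-crafted background classes respectively, which strengthens the alignment symmetrically. A visual pull-push regularization tailored to the foreground-background patterns further steers the original visual representation toward the object of interest.
Link: https://arxiv.org/abs/2508.00395
Authors: Fei Zhang, Tianfei Zhou, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Beijing Institute of Technology; A*STAR Centre for Frontier AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, Accepted at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)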
Abstract:Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at this https URL.
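The decouple-before-align idea can be sketched as masked pooling followed by per-branch similarity. Given patch features and a foreground mask, the sketch pools a foreground vector and a background vector, then scores each against its own text embedding. The function names, the hard 0/1 mask, and the two-dimensional toy embeddings are assumptions made for illustration; the paper's segmentation cues, prompts, and pull-push regularization are not modelled.

```python
import math

def masked_mean(patches, weights):
    """Weighted average of patch feature vectors; weights must not sum to zero."""
    total = sum(weights)
    dim = len(patches[0])
    return [sum(w * p[d] for p, w in zip(patches, weights)) / total for d in range(dim)]

def decouple(patches, fg_mask):
    """Split patch features into one foreground and one background vector."""
    fg = masked_mean(patches, fg_mask)
    bg = masked_mean(patches, [1.0 - m for m in fg_mask])
    return fg, bg

def cosine(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy image: the first two patches are the object, the last two are context.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
fg_mask = [1.0, 1.0, 0.0, 0.0]
fg, bg = decouple(patches, fg_mask)

# Hypothetical text embeddings for the object class and a background class.
object_text = [1.0, 0.0]
background_text = [0.0, 1.0]
alignment = (cosine(fg, object_text), cosine(bg, background_text))
```

Scoring the two branches separately is what makes the alignment symmetric: the background vector has its own text target instead of leaking into the object score.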
[CV-55] Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition
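Quick read: This paper addresses two problems in Automatic Cued Speech Recognition (ACSR): the temporal asynchrony between hand and lip movements makes multimodal fusion difficult, and scarce data prevents existing methods from training fusion mechanisms well. The key to the solution is Cued-Agent, the first collaborative multi-agent system for ACSR, in which four specialized sub-agents cooperate: a hand recognition agent based on a multimodal large language model that decodes hand movements with keyframe screening and Cued Speech expert prompts; a lip recognition agent based on a pretrained Transformer that extracts lip features from the video; a hand prompt decoding agent that integrates hand prompts with lip features at inference in a training-free manner; and a self-correction phoneme-to-word agent that, for the first time, converts phoneme sequences to natural-language sentences end to end through semantic refinement. The system improves ACSR performance in both normal and hearing-impaired scenarios.
Link: https://arxiv.org/abs/2508.00391
Authors: Guanjie Huang, Danny H.K. Tsang, Shan Yang, Guangzhi Lei, Li Liu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tencent AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 9 pages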
Abstract:Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at this https URL.
[CV-56] STF: Shallow-Level Temporal Feedback to Enhance Spiking Transformers
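Quick read: This paper addresses the large performance gap between Transformer-based Spiking Neural Networks (SNNs) and floating-point Artificial Neural Networks (ANNs), which stems from the binary nature of spike trains. Existing work narrows the gap with deep-level feedback loops that transmit high-level semantic information, but such designs span multiple deep layers and bring costly feature transformations, higher parameter overhead, more energy consumption, and longer inference latency. The paper proposes Shallow-level Temporal Feedback (STF), a lightweight plug-and-play module for the encoding layer, consisting of Temporal-Spatial Position Embedding (TSPE) and Temporal Feedback (TF). Experiments show that STF consistently improves performance across Transformer-based SNN backbones on static datasets (CIFAR-10, CIFAR-100, and ImageNet-1K), that the gain comes from more diverse spike patterns, and that STF outperforms direct coding and its variants in adversarial robustness and temporal sensitivity evaluations, supporting it as a new spike encoding scheme for static scenarios.
Link: https://arxiv.org/abs/2508.00387
Authors: Zeqi Zheng, Zizheng Zhu, Yingchao Yu, Yanchen Huang, Changze Lv, Junfeng Tang, Zhaofei Yu, Yaochu Jin
Affiliations: Zhejiang University; Westlake University; Nanjing University; Donghua University; Fudan University; University of Electronic Science and Technology of China; Peking University
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 4 figures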
Abstract:Transformer-based Spiking Neural Networks (SNNs) suffer from a great performance gap compared to floating-point Artificial Neural Networks (ANNs) due to the binary nature of spike trains. Recent efforts have introduced deep-level feedback loops to transmit high-level semantic information to narrow this gap. However, these designs often span multiple deep layers, resulting in costly feature transformations, higher parameter overhead, increased energy consumption, and longer inference latency. To address this issue, we propose Shallow-level Temporal Feedback (STF), a lightweight plug-and-play module for the encoding layer, which consists of Temporal-Spatial Position Embedding (TSPE) and Temporal Feedback (TF). Extensive experiments show that STF consistently improves performance across various Transformer-based SNN backbones on static datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, under different spike timestep settings. Further analysis reveals that STF enhances the diversity of the spike patterns, which is key to performance gain. Moreover, evaluations on adversarial robustness and temporal sensitivity confirm that STF outperforms direct coding and its variants, highlighting its potential as a new spike encoding scheme for static scenarios. Our code will be released upon acceptance.
[CV-57] MV_Hybrid: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models MICCAI2025
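Quick read: This paper addresses the limited clinical adoption of spatial transcriptomics, which is costly and technically complex, by predicting spatial gene expression (biomarkers) from routine histopathology images instead. Current pathology vision foundation models (VFMs) built on Vision Transformer (ViT) backbones perform below clinical standards. The key to the solution is MV_Hybrid, a hybrid backbone that combines state space models (SSMs) with ViT and exploits the strong low-frequency bias of SSMs to better capture the low-frequency, subtle morphological patterns that correlate with molecular phenotypes. In leave-one-study-out (LOSO) evaluation, MV_Hybrid achieves 57% higher correlation in gene expression prediction than the best-performing ViT and 43% smaller performance degradation relative to random split, indicating better accuracy and robustness.
Link: https://arxiv.org/abs/2508.00383
Authors: Won June Cho, Hongjun Yoon, Daeky Jeong, Hyeongyeol Lim, Yosep Chong
Affiliations: Deepnoid; The Catholic University of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted (Oral) in MICCAI 2025 COMPAYL Workshop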
Abstract:Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce MV_Hybrid, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare five other different backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, MV_Hybrid achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, MV_Hybrid shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: this https URL.
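The leave-one-study-out (LOSO) protocol mentioned above can be sketched as follows: every study is held out once as the test set while the model is fit on all other studies, and predictions are scored by correlation with the measured expression. The function names and the choice of Pearson correlation are assumptions made for illustration; the abstract does not state which correlation measure is used.

```python
import math

def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each study."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Five samples drawn from three hypothetical studies.
studies = ["A", "A", "B", "C", "B"]
splits = list(leave_one_study_out(studies))

# Score hypothetical predictions against measured expression for one gene.
score = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Compared with a random split, LOSO never lets samples from the test study appear in training, so the gap between the two settings indicates how much a model relies on study-specific cues.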
[CV-58] Advancing Welding Defect Detection in Maritime Operations via Adapt-WeldNet and Defect Detection Interpretability Analysis
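Quick read: This paper addresses two problems in weld defect detection for piping systems in the oil and gas industry. First, traditional non-destructive testing (NDT) methods often miss subtle or internal defects, leading to potential failures and costly downtime. Second, existing neural-network defect classifiers rely on arbitrarily chosen pretrained architectures and lack interpretability, which raises safety concerns for deployment. The key to the solution is Adapt-WeldNet, an adaptive framework that systematically evaluates pretrained architectures, transfer learning strategies, and adaptive optimizers to identify the best-performing model and hyperparameters. A Defect Detection Interpretability Analysis (DDIA) framework adds Explainable AI (XAI) techniques such as Grad-CAM and LIME, domain-specific evaluation validated by certified ASNT NDE Level II professionals, and a Human-in-the-Loop (HITL) approach, to support the reliability, fairness, and accountability of the system and strengthen trust in automated decisions.
Link: https://arxiv.org/abs/2508.00381
Authors: Kamal Basha S, Athira Nambiar
Affiliations: SRM Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: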
Abstract:Weld defect detection is crucial for ensuring the safety and reliability of piping systems in the oil and gas industry, especially in challenging marine and offshore environments. Traditional non-destructive testing (NDT) methods often fail to detect subtle or internal defects, leading to potential failures and costly downtime. Furthermore, existing neural network-based approaches for defect classification frequently rely on arbitrarily selected pretrained architectures and lack interpretability, raising safety concerns for deployment. To address these challenges, this paper introduces "Adapt-WeldNet", an adaptive framework for welding defect detection that systematically evaluates various pre-trained architectures, transfer learning strategies, and adaptive optimizers to identify the best-performing model and hyperparameters, optimizing defect detection and providing actionable insights. Additionally, a novel Defect Detection Interpretability Analysis (DDIA) framework is proposed to enhance system transparency. DDIA employs Explainable AI (XAI) techniques, such as Grad-CAM and LIME, alongside domain-specific evaluations validated by certified ASNT NDE Level II professionals. Incorporating a Human-in-the-Loop (HITL) approach and aligning with the principles of Trustworthy AI, DDIA ensures the reliability, fairness, and accountability of the defect detection system, fostering confidence in automated decisions through expert validation. By improving both performance and interpretability, this work enhances trust, safety, and reliability in welding defect detection systems, supporting critical operations in offshore and marine environments.
[CV-59] CoRGI: Verified Chain-of-Thought Reasoning with Visual Grounding AAAI2026
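Quick read: This paper addresses hallucination in vision-language models (VLMs) under Chain-of-Thought (CoT) prompting, where the generated reasoning is linguistically fluent but not grounded in the visual content. The key to the solution is CoRGI (Chain of Reasoning with Grounded Insights), a modular framework that adds an explicit visual verification mechanism so that each step of multi-step reasoning is backed by visual evidence. CoRGI has three stages: generate a textual reasoning chain; extract supporting visual evidence for each step with a dedicated Visual Evidence Verification Module (VEVM); and synthesize the textual rationale with the visual evidence into a grounded, verified answer. It can be integrated with existing VLMs without end-to-end retraining. On the VCR benchmark it improves reasoning performance on two open-source VLM backbones (Qwen-2.5VL and LLaVA-1.6), and human evaluations suggest more factual and helpful explanations.
Link: https://arxiv.org/abs/2508.00378
Authors: Shixin Yi, Lin Shang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preparing for AAAI 2026, Multimodal Reasoning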
Abstract:Chain-of-Thought (CoT) prompting has shown promise in improving reasoning in vision-language models (VLMs), but it often produces explanations that are linguistically fluent yet lack grounding in visual content. We observe that such hallucinations arise in part from the absence of an explicit verification mechanism during multi-step reasoning. To address this, we propose CoRGI (Chain of Reasoning with Grounded Insights), a modular framework that introduces visual verification into the reasoning process. CoRGI follows a three-stage pipeline: it first generates a textual reasoning chain, then extracts supporting visual evidence for each reasoning step via a dedicated module (VEVM), and finally synthesizes the textual rationale with visual evidence to generate a grounded, verified answer. The framework can be integrated with existing VLMs without end-to-end retraining. We evaluate CoRGI on the VCR benchmark and find that it improves reasoning performance on two representative open-source VLM backbones, Qwen-2.5VL and LLaVA-1.6. Ablation studies confirm the contribution of each step in the verification module, and human evaluations suggest that CoRGI leads to more factual and helpful explanations. We also examine alternative designs for the visual verification step and discuss potential limitations of post-hoc verification frameworks. These findings highlight the importance of grounding intermediate reasoning steps in visual evidence to enhance the robustness of multimodal reasoning.
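The three-stage pipeline can be written as a small orchestration function. The sketch below shows only the control flow: the three callables stand in for the VLM that writes the reasoning chain, the evidence module, and the final synthesis step. All names and signatures here are hypothetical placeholders, not the authors' API.

```python
from typing import Callable, List, Tuple

def corgi_pipeline(
    question: str,
    image: object,
    generate_chain: Callable[[str, object], List[str]],
    extract_evidence: Callable[[str, object], str],
    synthesize: Callable[[str, List[Tuple[str, str]]], str],
) -> str:
    """Run chain generation, per-step visual grounding, then answer synthesis."""
    # Stage 1: draft a textual reasoning chain.
    steps = generate_chain(question, image)
    # Stage 2: attach visual evidence to every reasoning step.
    grounded = [(step, extract_evidence(step, image)) for step in steps]
    # Stage 3: produce the final answer from the steps and their evidence.
    return synthesize(question, grounded)

# Stub components, only to show how the pieces plug together.
def stub_chain(question, image):
    return ["a person holds a cup", "the cup is steaming"]

def stub_evidence(step, image):
    return "seen in " + str(image) + ": " + step

def stub_synthesize(question, grounded):
    return question + " -> " + str(len(grounded)) + " grounded steps"

answer = corgi_pipeline("Is the drink hot?", "img_001", stub_chain, stub_evidence, stub_synthesize)
```

Passing the components as callables mirrors the claim that the framework wraps an existing VLM without retraining it: swapping the backbone only means passing different functions.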
[CV-60] Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models
[Quick read]: This paper targets a performance bottleneck in video-based long-term action anticipation: conventional methods extract features in one direction only and struggle to capture semantically distinct sub-actions within a scene. The key to the solution is BiAnt, which combines forward prediction with backward prediction using a large language model, fusing information from both directions to better understand and predict complex action sequences. On Ego4D, BiAnt improves edit distance over baseline methods.
Link: https://arxiv.org/abs/2508.00374
Authors: Yuji Sato, Yasunori Ishii, Takayoshi Yamashita
Affiliations: Panasonic Connect Co., Ltd.; Panasonic Holdings Corporation; Chubu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MVA2025 (Best Poster Award)
Abstract:Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. These methods struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves performance in terms of edit distance compared to baseline methods.
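The edit distance metric mentioned above can be illustrated with a minimal sketch. It assumes plain Levenshtein distance over lists of action labels, normalized by the longer sequence; the exact variant used by the Ego4D benchmark may differ, and the function names and sample labels are hypothetical.

```python
def edit_distance(pred, target):
    """Levenshtein distance between two action sequences (lists of labels)."""
    m, n = len(pred), len(target)
    # prev[j] is the distance between pred[:i-1] and target[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete pred[i-1]
                          curr[j - 1] + 1,     # insert target[j-1]
                          prev[j - 1] + cost)  # substitute or match
        prev = curr
    return prev[n]


def normalized_edit_distance(pred, target):
    """Edit distance divided by the longer length (0.0 means identical)."""
    longest = max(len(pred), len(target))
    return edit_distance(pred, target) / longest if longest else 0.0


# Hypothetical predicted and ground-truth future action sequences.
predicted = ["take knife", "cut onion", "wash hands"]
ground_truth = ["take knife", "cut onion", "put knife"]
score = normalized_edit_distance(predicted, ground_truth)
```

Lower is better: one wrong action out of three gives a normalized distance of one third.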
[CV-61] Representation Shift: Unifying Token Compression with FlashAttention ICCV
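[Quick read]: This paper addresses the quadratic growth in self-attention cost and the GPU memory-access overhead that come with longer input sequences in Transformers. Token compression removes redundant tokens, but most training-free methods score token importance from the attention map, which makes them incompatible with memory-efficient fused attention kernels such as FlashAttention; FlashAttention reduces memory I/O precisely by never materializing the attention map. The key to the solution is Representation Shift, a training-free, model-agnostic metric that measures how much each token's representation changes. It enables token compression without attention maps and integrates directly with FlashAttention, yielding reported speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively.
Link: https://arxiv.org/abs/2508.00367
Authors: Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
Affiliations: Korea University; Adobe Research; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Computer Vision (ICCV), 2025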
Abstract:Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token’s representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at this https URL.
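To make the idea of an attention-free importance score concrete, here is a minimal sketch. It assumes the shift is the L2 distance between a token's vector before and after a layer, and that tokens with the largest shift are kept; both choices, and all names below, are illustrative assumptions rather than the paper's exact definition.

```python
import math


def representation_shift(before, after):
    """Per-token L2 distance between a layer's input and output vectors."""
    return [
        math.sqrt(sum((o - i) ** 2 for i, o in zip(x_in, x_out)))
        for x_in, x_out in zip(before, after)
    ]


def keep_top_tokens(before, after, keep):
    """Indices (in original order) of the `keep` tokens with the largest shift."""
    shifts = representation_shift(before, after)
    ranked = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: three 2-D tokens before and after a hypothetical layer.
tokens_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_out = [[1.0, 0.1], [2.0, 1.0], [0.0, 0.0]]
kept = keep_top_tokens(tokens_in, tokens_out, keep=2)
```

The score only needs the layer's input and output, so nothing in it requires the attention map that a fused kernel never materializes.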
[CV-62] SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies ICCV2025
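[Quick read]: This paper addresses two problems in 3D surface reconstruction from sparse views: generalization-based methods perform poorly on views unseen during training, and overfitting-based methods are limited by scarce geometric cues. The key to the solution is SparseRecon, a neural implicit reconstruction method with two components. First, a volume rendering-based cross-view feature consistency loss constrains the neural implicit field, reducing the ambiguity caused by insufficient cross-view consistency information and improving the completeness and smoothness of the result. Second, an uncertainty-guided depth constraint supplies geometric information in occluded regions and regions with insignificant features, recovering detail. Experiments show that the method outperforms state-of-the-art methods on sparse-view input, especially when views overlap little.
Link: https://arxiv.org/abs/2508.00366
Authors: Liang Han, Xu Zhang, Haichuan Song, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Affiliations: Tsinghua University; East China Normal University; China Telecom; Kuaishou Technology; Wayne State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025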
Abstract:Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still constrained by scarce geometric cues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms state-of-the-art methods and produces high-quality geometry from sparse-view input, especially in scenarios with little overlap between views. Project page: this https URL.
[CV-63] Honey Classification using Hyperspectral Imaging and Machine Learning
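[Quick read]: This paper addresses automatic classification of the botanical origin of honey with machine learning. The solution has three steps: a class transformation in dataset preparation to maximize separability across classes; Linear Discriminant Analysis (LDA) for feature extraction and dimensionality reduction; and Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) models to classify the extracted features. On a standard honey hyperspectral imaging (HSI) dataset, the method reaches 95.13% accuracy for image-based classification and 92.80% for instance-based classification, which the authors report as state of the art on this dataset.
Link: https://arxiv.org/abs/2508.00361
Authors: Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: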
Abstract:In this paper, we propose a machine learning-based method for automatically classifying honey botanical origins. Dataset preparation, feature extraction, and classification are the three main steps of the proposed method. We use a class transformation method in the dataset preparation phase to maximize the separability across classes. The feature extraction phase employs the Linear Discriminant Analysis (LDA) technique for extracting relevant features and reducing the number of dimensions. In the classification phase, we use Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) models to classify the extracted features of honey samples into their botanical origins. We evaluate our system using a standard honey hyperspectral imaging (HSI) dataset. Experimental findings demonstrate that the proposed system produces state-of-the-art results on this dataset, achieving the highest classification accuracy of 95.13% for hyperspectral image-based classification and 92.80% for hyperspectral instance-based classification.
[CV-64] CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective ICCV25
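[Quick read]: This paper addresses the limits of single-agent perception (such as occlusion and small sensing range) in multi-agent collaborative perception, and the inefficiency of existing methods that fuse spatial and temporal information in separate steps. The key to the solution is a unified spatio-temporal aggregation, Collaborative perception with Spatio-temporal Transformer (CoST), which aggregates observations from different agents and different times into one spatio-temporal space simultaneously. This gives efficient feature transmission (each static object is transmitted only once) and better feature fusion (joint handling of multi-agent and multi-time information gives a more holistic view), improving accuracy while reducing transmission bandwidth.
Link: https://arxiv.org/abs/2508.00359
Authors: Zongheng Tang, Yi Liu, Yifan Sun, Yulu Gao, Jinyu Chen, Runsheng Xu, Si Liu
Affiliations: Hangzhou International Innovation Institute, Beihang University; School of Artificial Intelligence, Beihang University; University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV25 (Highlight)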
Abstract:Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space, and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.
[CV-65] Stable at Any Speed: Speed-Driven Multi-Object Tracking with Learnable Kalman Filtering
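[Quick read]: This paper addresses the loss of stability and accuracy in multi-object tracking (MOT) in high-speed dynamic scenes, caused by ignoring how ego-vehicle speed affects observation noise and the reference frame. Conventional trackers rely on static coordinate transformations and do not adapt their uncertainty modeling. The key to the solution is the Speed-Guided Learnable Kalman Filter (SG-LKF), in which MotionScaleNet (MSNet), a decoupled token-mixing and channel-mixing MLP, adaptively predicts the filter's key parameters so that uncertainty modeling follows ego-vehicle speed. A self-supervised trajectory consistency loss, jointly optimized with semantic and positional constraints, improves inter-frame association and trajectory continuity.
Link: https://arxiv.org/abs/2508.00358
Authors: Yan Gong, Mengjun Chen, Hao Liu, Gao Yongsheng, Lei Yang, Naibang Wang, Ziying Song, Haoqun Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures, 5 tables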
Abstract:Multi-object tracking (MOT) enables autonomous vehicles to continuously perceive dynamic objects, supplying essential temporal cues for prediction, behavior understanding, and safe planning. However, conventional tracking-by-detection methods typically rely on static coordinate transformations based on ego-vehicle poses, disregarding ego-vehicle speed-induced variations in observation noise and reference frame changes, which degrades tracking stability and accuracy in dynamic, high-speed scenarios. In this paper, we investigate the critical role of ego-vehicle speed in MOT and propose a Speed-Guided Learnable Kalman Filter (SG-LKF) that dynamically adapts uncertainty modeling to ego-vehicle speed, significantly improving stability and accuracy in highly dynamic scenarios. Central to SG-LKF is MotionScaleNet (MSNet), a decoupled token-mixing and channel-mixing MLP that adaptively predicts key parameters of SG-LKF. To enhance inter-frame association and trajectory continuity, we introduce a self-supervised trajectory consistency loss jointly optimized with semantic and positional constraints. Extensive experiments show that SG-LKF ranks first among all vision-based methods on KITTI 2D MOT with 79.59% HOTA, delivers strong results on KITTI 3D MOT with 82.03% HOTA, and outperforms SimpleTrack by 2.2% AMOTA on nuScenes 3D MOT.
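The intuition of speed-dependent uncertainty can be shown with a scalar Kalman update. This is a hand-written heuristic, not SG-LKF itself: the constants `r_base` and `r_gain` are hypothetical, whereas in the paper the corresponding parameters are predicted by the learned MSNet.

```python
def kalman_step(x, p, z, ego_speed, q=0.01, r_base=0.1, r_gain=0.05):
    """One scalar Kalman update whose measurement noise grows with ego speed.

    x, p: prior state estimate and its variance; z: new measurement.
    r_base and r_gain are illustrative hand-set constants (assumptions).
    """
    # Predict: static motion model, variance grows by the process noise q.
    p_pred = p + q
    # Measurement noise increases linearly with ego-vehicle speed.
    r = r_base + r_gain * ego_speed
    # Update: the Kalman gain shrinks as the measurement noise r grows.
    k = p_pred / (p_pred + r)
    x_new = x + k * (z - x)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, k


# Same measurement, observed while stationary and at 30 m/s.
_, _, gain_slow = kalman_step(x=0.0, p=1.0, z=1.0, ego_speed=0.0)
_, _, gain_fast = kalman_step(x=0.0, p=1.0, z=1.0, ego_speed=30.0)
```

At higher ego speed the filter trusts the detection less and leans more on its prediction, which is the behavior a learned parameter predictor would be expected to tune per scene.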
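[CV-66] Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning
[Quick read]: This paper addresses interleaved multimodal reasoning over multiple images across diverse datasets and task formats. The key to the solution is a collaborative agent-based framework with two agents: a language-based PromptEngineer that generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) that performs the final inference. The framework is training-free, modular, and fully automated, and it handles classification, question answering, and free-form generation with one or multiple input images. It is evaluated on 18 datasets from the 2025 MIRAGE Challenge (Track A), where the results indicate that LVLMs can reason effectively over multiple images when guided by informative prompts.
Link: https://arxiv.org/abs/2508.00356
Authors: Angelos Vlachos, Giorgos Filandrianos, Maria Lymperaiou, Nikolaos Spanos, Ilias Mitsouras, Vasileios Karampinis, Athanasios Voulodimos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: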
Abstract:We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices (such as model selection, shot count, and input length) influence the reasoning performance of different LVLMs.
[CV-67] Omni-Scan: Creating Visually-Accurate Digital Twin Object Models Using a Bimanual Robot with Handover and Gaussian Splat Merging
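[Quick read]: This paper addresses the restricted workspaces of conventional 3D object scanning, which relies on multi-camera arrays, precise laser scanners, or robot wrist-mounted cameras. The solution is Omni-Scan, a pipeline in which a bimanual robot grasps an object with one gripper and rotates it in front of a stationary camera, then re-grasps it with the second gripper to expose surfaces occluded by the first, covering the full 360 degrees. DepthAnything, Segment Anything, and RAFT optical flow models are used to isolate the object from the gripper and background, and the 3D Gaussian Splat (3DGS) training pipeline is modified to support concatenated datasets with gripper occlusion. The resulting omni-directional 3DGS models are applied to defect inspection of 12 industrial and household objects with an average accuracy of 83%.
Link: https://arxiv.org/abs/2508.00354
Authors: Tianshuang Qiu, Zehan Ma, Karim El-Refai, Hiya Shah, Chung Min Kim, Justin Kerr, Ken Goldberg
Affiliations: University of California, Berkeley
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: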
Abstract:3D Gaussian Splats (3DGSs) are 3D object models derived from multi-view images. Such “digital twins” are useful for simulations, virtual reality, marketing, robot policy fine-tuning, and part inspection. 3D object scanning usually requires multi-camera arrays, precise laser scanners, or robot wrist-mounted cameras, which have restricted workspaces. We propose Omni-Scan, a pipeline for producing high-quality 3D Gaussian Splat models using a bimanual robot that grasps an object with one gripper and rotates the object with respect to a stationary camera. The object is then re-grasped by a second gripper to expose surfaces that were occluded by the first gripper. We present the Omni-Scan robot pipeline using DepthAnything, Segment Anything, as well as RAFT optical flow models to identify and isolate objects held by a robot gripper while removing the gripper and the background. We then modify the 3DGS training pipeline to support concatenated datasets with gripper occlusion, producing an omni-directional (360 degree view) model of the object. We apply Omni-Scan to part defect inspection, finding that it can identify visual or geometric defects in 12 different industrial and household objects with an average accuracy of 83%. Interactive videos of Omni-Scan 3DGS models can be found at this https URL
[CV-68] Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating
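[Quick read]: This paper addresses accurate calibration of camera spectral sensitivity, a prerequisite for computer vision tasks such as color correction, illumination estimation, and material analysis. Existing methods need specialized narrow-band filters or reference targets with known spectral reflectances, which limits their practicality. The proposed method needs only an uncalibrated, off-the-shelf diffraction grating sheet: from images of the direct illumination and its diffracted pattern, it estimates both the camera spectral sensitivity and the grating parameters in closed form. Experiments on synthetic and real-world data show that it outperforms conventional reference target-based methods.
Link: https://arxiv.org/abs/2508.00330
Authors: Lilika Makabe, Hiroaki Santo, Fumio Okura, Michael S. Brown, Yasuyuki Matsushita
Affiliations: The University of Osaka; York University; Microsoft Research Asia – Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: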
Abstract:This paper introduces a practical and accurate calibration method for camera spectral sensitivity using a diffraction grating. Accurate calibration of camera spectral sensitivity is crucial for various computer vision tasks, including color correction, illumination estimation, and material analysis. Unlike existing approaches that require specialized narrow-band filters or reference targets with known spectral reflectances, our method only requires an uncalibrated diffraction grating sheet, readily available off-the-shelf. By capturing images of the direct illumination and its diffracted pattern through the grating sheet, our method estimates both the camera spectral sensitivity and the diffraction grating parameters in a closed-form manner. Experiments on synthetic and real-world data demonstrate that our method outperforms conventional reference target-based methods, underscoring its effectiveness and practicality.
[CV-69] Steering Guidance for Personalized Text-to-Image Diffusion Models ICCV2025
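[Quick read]: This paper addresses a core trade-off in personalizing text-to-image diffusion models: when fine-tuning on a few images, how to balance alignment with the target distribution (e.g., subject fidelity) against preserving the original model's broad knowledge (e.g., text editability). Existing sampling guidance methods fall short: classifier-free guidance (CFG) restricts adaptation to the target distribution, while autoguidance (AG) compromises text alignment. The key to the solution is personalization guidance, which uses an unlearned weak model conditioned on a null text prompt and controls the extent of unlearning at inference time through weight interpolation between the pre-trained and fine-tuned models. This steers outputs toward a latent space that balances text alignment and target-distribution fidelity, without additional computational overhead.
Link: https://arxiv.org/abs/2508.00319
Authors: Sunghyun Park, Seokeon Choi, Hyoungwoo Park, Sungrack Yun
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICCV 2025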
Abstract:Personalizing text-to-image diffusion models is crucial for adapting the pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward a well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in a weak model through weight interpolation between pre-trained and fine-tuned models during inference. Unlike existing guidance methods, which depend solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
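A minimal sketch of the two ingredients named above: weight interpolation between a pre-trained and a fine-tuned model, and a guidance step that extrapolates from a weak prediction toward a strong one. The linear interpolation form, the generic CFG/autoguidance-style extrapolation, and every name here (`omega`, `scale`, the weight dictionaries) are assumptions for illustration, not the paper's exact formulation.

```python
def interpolate_weights(pretrained, finetuned, omega):
    """Blend two weight dicts: omega=0 gives pretrained, omega=1 gives fine-tuned."""
    return {
        name: [(1.0 - omega) * p + omega * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def guided_prediction(strong, weak, scale):
    """Extrapolate from the weak model's prediction toward the strong one."""
    return [w + scale * (s - w) for s, w in zip(strong, weak)]


# Hypothetical two-parameter layer before and after fine-tuning.
theta_pre = {"layer.w": [0.0, 2.0]}
theta_ft = {"layer.w": [1.0, 4.0]}
theta_weak = interpolate_weights(theta_pre, theta_ft, omega=0.25)

# Hypothetical noise predictions from the fine-tuned and weak models.
eps = guided_prediction(strong=[1.0, -1.0], weak=[0.0, 0.0], scale=2.0)
```

Because the weak model is obtained by blending existing weights rather than training another network, the knob `omega` can be changed at inference time.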
[CV-70] GV-VAD: Exploring Video Generation for Weakly-Supervised Video Anomaly Detection
[Quick read]: This paper addresses the difficulty of scaling video anomaly detection (VAD) datasets, since real anomalies are rare, unpredictable, and expensive to annotate, which limits model performance and generalization. The key to the solution is GV-VAD, a generative video-enhanced weakly-supervised VAD framework that uses text-conditioned video generation models to synthesize semantically controllable and physically plausible videos, augmenting the training data at low cost. A synthetic sample loss scaling strategy controls the influence of the generated samples on training.
Link: https://arxiv.org/abs/2508.00312
Authors: Suhang Cai, Xiaohao Peng, Chong Wang, Xiaojie Cai, Jiangbo Qian
Affiliations: Ningbo University; China Telecom Corporation Limited Wenzhou Branch
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on the UCF-Crime dataset. The code is available at this https URL.
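One simple way to read "synthetic sample loss scaling" is a per-sample weight that down-weights generated videos. The sketch below assumes a single constant factor `synth_scale` and a mean over the batch size; both are illustrative choices, and the paper's actual scaling rule may differ.

```python
def scaled_batch_loss(losses, is_synthetic, synth_scale=0.5):
    """Mean per-sample loss with synthetic samples weighted by synth_scale.

    losses: per-sample loss values; is_synthetic: matching list of booleans.
    synth_scale is a hypothetical constant chosen for illustration.
    """
    weights = [synth_scale if synth else 1.0 for synth in is_synthetic]
    # Dividing by the batch size (not the weight sum) means synthetic
    # samples contribute strictly less gradient than real ones.
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# One real and one generated video in a toy batch.
batch_loss = scaled_batch_loss([1.0, 2.0], [False, True], synth_scale=0.5)
```

Setting `synth_scale` to 1.0 recovers the ordinary unweighted mean, and 0.0 ignores the generated samples entirely.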
[CV-71] DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios
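[Quick read]: This paper addresses optical character recognition (OCR) for mathematical formulas in the intelligent analysis of scientific literature, where the structural diversity, complexity, and real-world variability of mathematical content limit recognition performance. The key to the solution is DocTron-Formula, a unified framework built on general vision-language models rather than specialized architectures, combined with supervised fine-tuning on CSFormula, a large-scale dataset of multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. The approach achieves state-of-the-art performance across styles, scientific domains, and complex layouts, surpassing specialized models in accuracy and robustness.
Link: https://arxiv.org/abs/2508.00311
Authors: Yufeng Zhong, Zhixiong Zeng, Lei Chen, Longrong Yang, Liming Zheng, Jing Huang, Siqi Yang, Lin Ma
Affiliations: Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: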
Abstract:Optical Character Recognition (OCR) for mathematical formulas is essential for the intelligent analysis of scientific literature. However, both task-specific and general vision-language models often struggle to handle the structural diversity, complexity, and real-world variability inherent in mathematical content. In this work, we present DocTron-Formula, a unified framework built upon general vision-language models, thereby eliminating the need for specialized architectures. Furthermore, we introduce CSFormula, a large-scale and challenging dataset that encompasses multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. Through straightforward supervised fine-tuning, our approach achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts. Experimental results demonstrate that our method not only surpasses specialized models in terms of accuracy and robustness, but also establishes a new paradigm for the automated understanding of complex scientific documents.
[CV-72] Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement ACM-MM2025
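[Quick read]: This paper addresses the under-use of multimodal information in low-light image enhancement: existing event camera-based methods do not fully exploit the distinct strengths of frames and events, which limits performance. The key to the solution is to decouple the enhancement pipeline into two stages. The first stage is a visibility restoration network with amplitude-phase entanglement, which rethinks the relationship between amplitude and phase components in Fourier space to restore basic visibility. The second stage is a fusion strategy with dynamic alignment that mitigates the spatial mismatch caused by the difference in temporal resolution between the two modalities and refines the image's structure information. In addition, spatial-frequency interpolation simulates negative samples with diverse degradations for a contrastive loss that encourages discriminative representations. Experiments show the method outperforms state-of-the-art models.
Link: https://arxiv.org/abs/2508.00308
Authors: Chunyan She, Fujun Han, Chengyu Fang, Shukai Duan, Lidan Wang
Affiliations: College of Artificial Intelligence, Southwest University; Shenzhen International Graduate School, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025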
Abstract:The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Currently, existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.
[CV-73] Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence ICCV2025
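[Quick read]: This paper addresses the poor robustness of pedestrian detection models in autonomous driving, caused by training datasets that under-represent dangerous pedestrian scenarios. The key to the solution is a controllable pedestrian video editing framework for multi-view driving scenarios that integrates video inpainting with human motion control. It first locates pedestrian regions of interest across camera views, expands the detection boxes by a fixed ratio, and resizes and stitches the regions into a unified canvas that preserves cross-view spatial relationships. A binary mask then marks the editable area, and pose sequence control conditions guide the edit, enabling pedestrian insertion, replacement, and removal. The method shows strong visual realism, spatiotemporal coherence, and cross-view consistency, making it useful for data augmentation and scenario simulation in autonomous driving.
Link: https://arxiv.org/abs/2508.00299
Authors: Danzhen Fu, Jiagao Hu, Daiguo Zhou, Fei Wang, Zepeng Wang, Wenhua Liao
Affiliations: Xiaomi Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: ICCV 2025 Workshop (HiGen)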
Abstract:Pedestrian detection models in autonomous driving systems often lack robustness due to insufficient representation of dangerous pedestrian scenarios in training datasets. To address this limitation, we present a novel framework for controllable pedestrian video editing in multi-view driving scenarios by integrating video inpainting and human motion control techniques. Our approach begins by identifying pedestrian regions of interest across multiple camera views, expanding detection bounding boxes with a fixed ratio, and resizing and stitching these regions into a unified canvas while preserving cross-view spatial relationships. A binary mask is then applied to designate the editable area, within which pedestrian editing is guided by pose sequence control conditions. This enables flexible editing functionalities, including pedestrian insertion, replacement, and removal. Extensive experiments demonstrate that our framework achieves high-quality pedestrian editing with strong visual realism, spatiotemporal coherence, and cross-view consistency. These results establish the proposed method as a robust and versatile solution for multi-view pedestrian video generation, with broad potential for applications in data augmentation and scenario simulation in autonomous driving.
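The box-expansion step described above can be sketched as follows. It assumes boxes in `(x1, y1, x2, y2)` pixel coordinates, expansion about the box center, and clipping to the image bounds; the ratio value and the function name are hypothetical, since the abstract does not specify them.

```python
def expand_box(box, ratio, img_w, img_h):
    """Expand an (x1, y1, x2, y2) box about its center by `ratio`, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * ratio / 2.0
    half_h = (y2 - y1) * ratio / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )


# A pedestrian box enlarged by 1.5x inside a 100x100 view.
roi = expand_box((10, 10, 30, 50), ratio=1.5, img_w=100, img_h=100)
```

Expanding by a fixed ratio gives the inpainting model some surrounding context around each pedestrian before the regions are resized and stitched.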
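[CV-74] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer
[Quick read]: This paper addresses pose and shape reconstruction across animal species, in particular unified modeling of mammals (Mammalia) and birds (Aves) when high-quality 3D annotations are scarce. The key to the solution is AniMer+, which uses a high-capacity, family-aware Vision Transformer (ViT) with a Mixture-of-Experts (MoE) design that partitions network layers into taxa-specific components (for Mammalia and Aves) and taxa-shared components, so that a single model learns both distinct and common anatomical features. To ease the shortage of 3D data, especially for birds, a diffusion-based conditional image generation pipeline produces two synthetic datasets, CtrlAni3D for quadrupeds and CtrlAVES3D for birds; the latter is described as the first large-scale 3D-annotated bird dataset.
Link: https://arxiv.org/abs/2508.00298
Authors: Jin Lyu, Liang An, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang
Affiliations: Southern University of Science and Technology; Tsinghua University; University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2412.00837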
Abstract:In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.
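[CV-75] TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models ICCV2025
[Quick read]: This paper addresses the reliance of conditional diffusion models on heavy supervised fine-tuning for control tasks, and the fact that existing training-free guidance frameworks either need a lot of memory or give sub-optimal control, which is a particular problem for text-to-video (T2V) diffusion models. The key to the solution is TITAN-Guide, which optimizes diffusion latents at inference time without backpropagation through the guiding model, using forward gradient descent with several options for the directional directives. This reduces memory requirements and improves control without additional training.
Link: https://arxiv.org/abs/2508.00289
Authors: Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
Affiliations: Sony Group Corporation; Sony AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025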
Abstract:Recent conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative that avoids further fine-tuning of the base model. However, existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit their applicability to diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, so-called TITAN-Guide, which overcomes memory space issues and provides more optimal control in the guidance process than its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. Specifically, we study forward gradient descent for guided diffusion tasks with various options for directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, where previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks. Code, models, and demo are available at this https URL.
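The general forward-gradient idea (estimating a gradient from a directional derivative along a random direction, with no backward pass) can be sketched on a toy objective. This is not the TITAN-Guide implementation: the central finite difference stands in for a forward-mode Jacobian-vector product, and the quadratic loss, step size, and function names are assumptions chosen for illustration.

```python
import random


def forward_gradient(f, x, rng, eps=1e-5):
    """Estimate grad f(x) as (directional derivative along v) * v, with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    # Central finite difference stands in for a forward-mode JVP.
    directional = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [directional * vi for vi in v]


def minimize(f, x, steps=500, lr=0.05, seed=0):
    """Plain gradient descent driven by forward-gradient estimates."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = forward_gradient(f, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def toy_loss(z):
    """Hypothetical guidance loss: squared distance of the latent to all-ones."""
    return sum((zi - 1.0) ** 2 for zi in z)


latent = minimize(toy_loss, [0.0, 0.0, 0.0])
```

Each step needs only forward evaluations of the loss, so no activations have to be stored for backpropagation; the price is a noisier gradient estimate that depends on the sampled direction.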
[CV-76] UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents ACM-MM
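[Quick read]: This paper addresses the under-explored problem of aerial navigation in embodied intelligence. In large-scale, unstructured environments, the prevailing Vision-and-Language Navigation (VLN) paradigm depends on detailed sequential language instructions, which limits scalability and autonomy. The key to the solution is UAV-ON, a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments, where agents navigate from high-level semantic goals rather than fine-grained instructions. It comprises 14 high-fidelity Unreal Engine environments and 1270 annotated target objects, each with an instance-level instruction encoding category, physical footprint, and visual descriptors, which introduces realistic ambiguity and complex reasoning challenges.
Link: https://arxiv.org/abs/2508.00288
Authors: Jianqiang Xiao, Yuexuan Sun, Yixin Shao, Boxi Gan, Rongqiang Liu, Yanjing Wu, Weili Gua, Xiang Deng
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM Dataset Track 2025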
Abstract:Aerial navigation is a fundamental yet underexplored capability in embodied intelligence, enabling agents to operate in large-scale, unstructured environments where traditional navigation paradigms fall short. However, most existing research follows the Vision-and-Language Navigation (VLN) paradigm, which heavily depends on sequential linguistic instructions, limiting its scalability and autonomy. To address this gap, we introduce UAV-ON, a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments, where agents operate based on high-level semantic goals without relying on detailed instructional guidance as in VLN. UAV-ON comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts, covering urban, natural, and mixed-use settings. It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors, allowing grounded reasoning. These instructions serve as semantic goals, introducing realistic ambiguity and complex reasoning challenges for aerial agents. To evaluate the benchmark, we implement several baseline methods, including Aerial ObjectNav Agent (AOA), a modular policy that integrates instruction semantics with egocentric observations for long-horizon, goal-directed exploration. Empirical results show that all baselines struggle in this setting, highlighting the compounded challenges of aerial navigation and semantic goal grounding. UAV-ON aims to advance research on scalable UAV autonomy driven by semantic goal descriptions in complex real-world environments.
[CV-77] Privacy-Preserving Driver Drowsiness Detection with Spatial Self-Attention and Federated Learning
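[Quick Read]: This paper addresses the difficulty of accurate driver drowsiness detection in real-world settings, especially in federated learning environments where facial data is decentralized and highly heterogeneous. The key to its solution is a new framework that combines a Spatial Self-Attention (SSA) mechanism with a Long Short-Term Memory (LSTM) network to extract key facial features more effectively, together with a Gradient Similarity Comparison (GSC) strategy that selects the most relevant local models before aggregation, improving the accuracy and robustness of the global model while preserving user privacy.
Link: https://arxiv.org/abs/2508.00287
Authors: Tran Viet Khoa,Do Hai Son,Mohammad Abu Alsheikh,Yibeltal F Alem,Dinh Thai Hoang
Institutions: University of Canberra; Curtin University; VNU Information Technology Institute; University of Technology Sydney
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: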
Abstract:Driver drowsiness is one of the main causes of road accidents and is recognized as a leading contributor to traffic-related fatalities. However, detecting drowsiness accurately remains a challenging task, especially in real-world settings where facial data from different individuals is decentralized and highly diverse. In this paper, we propose a novel framework for drowsiness detection that is designed to work effectively with heterogeneous and decentralized data. Our approach develops a new Spatial Self-Attention (SSA) mechanism integrated with a Long Short-Term Memory (LSTM) network to better extract key facial features and improve detection performance. To support federated learning, we employ a Gradient Similarity Comparison (GSC) that selects the most relevant trained models from different operators before aggregation. This improves the accuracy and robustness of the global model while preserving user privacy. We also develop a customized tool that automatically processes video data by extracting frames, detecting and cropping faces, and applying data augmentation techniques such as rotation, flipping, brightness adjustment, and zooming. Experimental results show that our framework achieves a detection accuracy of 89.9% in the federated learning settings, outperforming existing methods under various deployment scenarios. The results demonstrate the effectiveness of our approach in handling real-world data variability and highlight its potential for deployment in intelligent transportation systems to enhance road safety through early and reliable drowsiness detection.
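The abstract does not spell out how Gradient Similarity Comparison scores models, so the following is only a sketch under stated assumptions: each client update is a flat list of numbers, similarity is the cosine between a client's update and the mean update, and the server keeps the `keep` most similar clients before averaging. The function names and the toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_and_aggregate(client_updates, keep):
    """Keep the `keep` updates most similar to the mean update, then average them."""
    reference = mean_vector(client_updates)
    scores = [cosine_similarity(u, reference) for u in client_updates]
    ranked = sorted(range(len(client_updates)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:keep])
    aggregated = mean_vector([client_updates[i] for i in selected])
    return selected, aggregated


updates = [
    [1.0, 0.9, 1.1],
    [0.9, 1.0, 1.0],
    [1.1, 1.0, 0.9],
    [-1.0, -1.2, -0.8],  # an update pointing the opposite way
]
selected, aggregated = select_and_aggregate(updates, keep=3)
print(selected)  # [0, 1, 2]
```

Only updates leave the clients in this setup, which matches the privacy motivation; the choice of reference vector and of `keep` are design decisions the paper may make differently.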
[CV-78] Towards Robust Semantic Correspondence: A Benchmark and Insights
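[Quick Read]: This paper addresses the insufficient robustness of semantic correspondence in complex and challenging scenarios: existing methods degrade markedly under adverse conditions such as geometric distortion, image blurring, digital artifacts, and environmental occlusion. The key to its solution is a new benchmark dataset covering 14 representative challenging scenarios, with systematic evaluation that exposes the fragility of current mainstream methods and the robustness differences between model architectures (such as DINO and Stable Diffusion). It further finds that large-scale vision models improve overall robustness but fine-tuning reduces their relative robustness, and that general data augmentation strategies are ineffective, underscoring the need for task-specific robustness enhancements.
Link: https://arxiv.org/abs/2508.00272
Authors: Wenyue Chong
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: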
Abstract:Semantic correspondence aims to identify semantically meaningful relationships between different images and is a fundamental challenge in computer vision. It forms the foundation for numerous tasks such as 3D reconstruction, object tracking, and image editing. With the progress of large-scale vision models, semantic correspondence has achieved remarkable performance in controlled and high-quality conditions. However, the robustness of semantic correspondence in challenging scenarios is much less investigated. In this work, we establish a novel benchmark for evaluating semantic correspondence in adverse conditions. The benchmark dataset comprises 14 distinct challenging scenarios that reflect commonly encountered imaging issues, including geometric distortion, image blurring, digital artifacts, and environmental occlusion. Through extensive evaluations, we provide several key insights into the robustness of semantic correspondence approaches: (1) All existing methods suffer from noticeable performance drops under adverse conditions; (2) Using large-scale vision models can enhance overall robustness, but fine-tuning on these models leads to a decline in relative robustness; (3) The DINO model outperforms Stable Diffusion in relative robustness, and their fusion achieves better absolute robustness. Moreover, we evaluate common robustness enhancement strategies for semantic correspondence and find that general data augmentations are ineffective, highlighting the need for task-specific designs. These results are consistent across both our dataset and real-world benchmarks.
[CV-79] Multimodal Referring Segmentation: A Survey
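[Quick Read]: This paper addresses multimodal referring segmentation: accurately segmenting target objects in visual scenes such as images, videos, and 3D scenes based on referring expressions given as text or audio. The task is essential in practical applications that rely on user instructions for precise object perception. The key to its solution is a unified meta architecture that systematically integrates convolutional neural networks (CNNs), Transformers, and large language models (LLMs) to improve cross-modal alignment and understanding, along with Generalized Referring Expression (GREx) methods that address real-world complexity and improve generalization across diverse scenarios.
Link: https://arxiv.org/abs/2508.00265
Authors: Henghui Ding,Song Tang,Shuting He,Chang Liu,Zuxuan Wu,Yu-Gang Jiang
Institutions: Fudan University; Shanghai University of Finance and Economics; ByteDance Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL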
Abstract:Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field’s background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at this https URL.
[CV-80] Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models ICCV2025
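[Quick Read]: This paper addresses the problem that, in continual learning, pre-trained generative vision-language models (VLMs) pay insufficient attention to language instructions when new tasks are introduced. Existing methods adapt to new tasks by updating the visual projector, which tends to make the model over-rely on visual input and neglect textual instructions, especially on tasks with repetitive textual instructions. The key to its solution is a multi-expert visual projection framework grounded in instruction context: each visual projector serves as a visual-to-language translation expert for a specific instruction context, enabling more precise semantic alignment. An expert recommendation strategy reuses previously learned experts for similar tasks, and an expert pruning mechanism reduces interference from cumulatively activated experts, improving both instruction following and task adaptation.
Link: https://arxiv.org/abs/2508.00260
Authors: Hyundong Jin,Hyung Jin Chang,Eunwoo Kim
Institutions: Chung-Ang University; University of Birmingham
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to ICCV 2025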
Abstract:Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
[CV-81] PointGauss: Point Cloud-Guided Multi-Object Segmentation for Gaussian Splatting
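[Quick Read]: This paper addresses the long initialization time and poor multi-view consistency of existing multi-object segmentation methods for Gaussian Splatting representations. The key to its solution is the PointGauss framework, which uses a point cloud-based Gaussian primitive decoder to generate 3D instance masks within 1 minute, combined with a GPU-accelerated 2D mask rendering system that ensures multi-view consistency, achieving efficient and accurate real-time multi-object segmentation.
Link: https://arxiv.org/abs/2508.00259
Authors: Wentao Sun,Hanqing Xu,Quanyun Wu,Dedong Zhang,Yiping Chen,Lingfei Ma,John S. Zelek,Jonathan Li
Institutions: University of Waterloo; East China Normal University; Sun Yat-Sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 9 figures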
Abstract:We introduce PointGauss, a novel point cloud-guided framework for real-time multi-object segmentation in Gaussian Splatting representations. Unlike existing methods that suffer from prolonged initialization and limited multi-view consistency, our approach achieves efficient 3D segmentation by directly parsing Gaussian primitives through a point cloud segmentation-driven pipeline. The key innovation lies in two aspects: (1) a point cloud-based Gaussian primitive decoder that generates 3D instance masks within 1 minute, and (2) a GPU-accelerated 2D mask rendering system that ensures multi-view consistency. Extensive experiments demonstrate significant improvements over previous state-of-the-art methods, achieving performance gains of 1.89 to 31.78% in multi-view mIoU, while maintaining superior computational efficiency. To address the limitations of current benchmarks (single-object focus, inconsistent 3D evaluation, small scale, and partial coverage), we present DesktopObjects-360, a novel comprehensive dataset for 3D segmentation in radiance fields, featuring: (1) complex multi-object scenes, (2) globally consistent 2D annotations, (3) large-scale training data (over 27 thousand 2D masks), (4) full 360° coverage, and (5) 3D evaluation masks.
[CV-82] Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network
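[Quick Read]: This paper addresses the difficulty of recovering high-frequency detail in super-resolution of low-resolution depth maps, in particular the limits of conventional convolutional neural networks (CNNs) in modeling long-range dependencies and the quadratic growth in computation and memory that makes Transformers hard to apply to high-resolution depth maps. The key to its solution is the Multi-scale Fusion U-shaped Mamba (MSF-UM) model, whose core innovation is integrating Mamba's efficient state-space modeling into a multi-scale U-shaped structure guided by a color image. Residual dense channel attention blocks and Mamba state-space modules are designed to work together, combining local feature extraction with long-range context modeling. A multi-scale cross-modal fusion strategy uses the high-frequency texture information in the color image to guide depth map super-resolution, improving reconstruction accuracy with significantly fewer model parameters and showing strong generalization on large-scale depth map super-resolution.
Link: https://arxiv.org/abs/2508.00248
Authors: Chenggang Guo,Hao Xu,XianMing Wan
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: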
Abstract:Depth map super-resolution technology aims to improve the spatial resolution of low-resolution depth maps and effectively restore high-frequency detail information. Traditional convolutional neural networks have limitations in dealing with long-range dependencies and are unable to fully model the global contextual information in depth maps. Although transformers can model global dependencies, their computational complexity and memory consumption are quadratic, which significantly limits their ability to process high-resolution depth maps. In this paper, we propose a multi-scale fusion U-shaped Mamba (MSF-UM) model, a novel guided depth map super-resolution framework. The core innovation of this model is to integrate Mamba’s efficient state-space modeling capabilities into a multi-scale U-shaped fusion structure guided by a color image. We design a structure that combines the residual dense channel attention block with the Mamba state space module, pairing the local feature extraction capability of the convolutional layer with the state space model's advantage in modeling long-distance dependencies. At the same time, the model adopts a multi-scale cross-modal fusion strategy to make full use of the high-frequency texture information from the color image to guide the super-resolution process of the depth map. Compared with existing mainstream methods, the proposed MSF-UM significantly reduces the number of model parameters while achieving better reconstruction accuracy. Extensive experiments on multiple publicly available datasets validate the effectiveness of the model, especially showing excellent generalization ability in the task of large-scale depth map super-resolution.
[CV-83] Object-Centric Cropping for Visual Few-Shot Classification
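[Quick Read]: This paper addresses the performance drop in few-shot image classification caused by image ambiguity (such as multiple objects or complex backgrounds). The key to its solution is to incorporate additional information about the local position of the target object within the image, which markedly improves classification accuracy on established benchmarks. More importantly, a significant fraction of this improvement can be obtained with the Segment Anything Model (SAM) from a single user-indicated pixel, or with fully unsupervised foreground object extraction methods, reducing reliance on manual annotation and making the approach more practical.
Link: https://arxiv.org/abs/2508.00218
Authors: Aymane Abdali,Bartosz Boguslawski,Lucas Drumetz,Vincent Gripon
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: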
Abstract:In the domain of Few-Shot Image Classification, operating with as little as one example per class, the presence of image ambiguities stemming from multiple objects or complex backgrounds can significantly deteriorate performance. Our research demonstrates that incorporating additional information about the local positioning of an object within its image markedly enhances classification across established benchmarks. More importantly, we show that a significant fraction of the improvement can be achieved through the use of the Segment Anything Model, requiring only a pixel of the object of interest to be pointed out, or by employing fully unsupervised foreground object extraction methods.
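The cropping step implied by the title can be sketched as follows. It assumes a binary object mask is already available (for instance from a segmentation model prompted with one point, which is not called here) and that images and masks are plain nested lists; the function names and the `margin` parameter are hypothetical and not taken from the paper.

```python
def mask_bounding_box(mask):
    """Return (top, bottom, left, right) of the nonzero region, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1], cols[0], cols[-1]


def crop_to_object(image, mask, margin=0):
    """Crop the image to the mask's bounding box, padded by `margin` pixels and clipped to the image."""
    box = mask_bounding_box(mask)
    if box is None:
        return image  # nothing to localize, so fall back to the full image
    top, bottom, left, right = box
    height, width = len(image), len(image[0])
    top = max(0, top - margin)
    bottom = min(height - 1, bottom + margin)
    left = max(0, left - margin)
    right = min(width - 1, right + margin)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 4x5 "image" whose pixel value encodes its position as row * 10 + column.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(crop_to_object(image, mask))  # [[12, 13], [22, 23]]
```

The cropped patch, rather than the full image, would then be passed to the few-shot feature extractor, which is one plausible way to remove background and distractor objects.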
[CV-84] SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient Parallel-Text Adapters
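[Quick Read]: This paper addresses the underuse of semantic text prompts in prompt-based image segmentation, specifically how to use fixed text embeddings as semantic prompts to improve the segmentation performance of the Segment Anything Model (SAM). The key to its solution is a lightweight adapter design, Parallel-Text, which injects frozen CLIP-derived text embeddings into SAM's image encoder and modifies only the MLP-parallel branch of each Transformer block. This enables semantics-guided segmentation without disturbing the original attention pathway, keeps most model parameters frozen, and significantly reduces computational complexity while improving adaptation efficiency.
Link: https://arxiv.org/abs/2508.00213
Authors: Shayan Jalilian,Abdul Bais
Institutions: University of Regina
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: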
Abstract:The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM’s image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM’s architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.
[CV-85] Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition
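[Quick Read]: This paper addresses a weakness in automatic Real Personality Recognition (RPR): conventional methods act as external observers and infer personality impressions from the target individual's expressive behaviours, so their results deviate substantially from the real personality and perform poorly. The key to its solution is to simulate the personalised internal cognition that underlies an individual's expressive behaviours, efficiently reconstructing an individual-specific cognitive representation from easily accessible short audio-visual behaviour data. This representation is encoded as a novel graph with two-dimensional node and edge feature matrices, and a 2D Graph Neural Network (2D-GNN) infers real personality traits from it. The whole framework is trained end to end, jointly optimizing the cognition simulation, graph construction, and personality recognition modules, to produce recognition results closer to the real personality.
Link: https://arxiv.org/abs/2508.00205
Authors: Xiangyu Kong,Hengde Zhu,Haoqin Sun,Zhihao Guo,Jiayan Gu,Xinyi Ni,Wei Zhang,Shizhe Liu,Siyang Song
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures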
Abstract:Automatic real personality recognition (RPR) aims to evaluate human real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers’ personality impressions based on target individuals’ expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition performance. Inspired by the association between real personality and human internal cognition underlying the generation of expressive behaviours, we propose a novel RPR approach that efficiently simulates personalised internal cognition from easily accessible external short audio-visual behaviours expressed by the target individual. The simulated personalised cognition, represented as a set of network weights that enforce the personalised network to reproduce the individual-specific facial reactions, is further encoded as a novel graph containing two-dimensional node and edge feature matrices, with a novel 2D Graph Neural Network (2D-GNN) proposed for inferring real personality traits from it. To simulate real personality-related cognition, an end-to-end strategy is designed to jointly train our cognition simulation, 2D graph construction, and personality recognition modules.
[CV-86] Graph Lineages and Skeletal Graph Products
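[Quick Read]: This paper addresses how to efficiently construct and manipulate hierarchically structured graph model architectures in machine learning and computational science, especially managing complexity in multiscale modeling and numerical methods. The core challenge is designing a graph formalism that both expresses hierarchical growth (vertex and edge counts growing exponentially with level) and supports low-cost algebraic operations. The key to its solution is "graded graphs" and their corresponding "skeletal" binary operators (cross product, box product, disjoint sum, and function types), which have algebraic and category-theoretic properties similar to those of the standard graph operations at lower time and space cost. Prolongation maps define process-derived distances between adjacent levels, and unary operations such as thickening and escalation generate multiscale graph lineages and search frontiers, allowing approximation of continuum limit objects. The framework provides a formal type-theoretic foundation for building "hierarchitectures", applicable to deep neural networks and multigrid numerical methods.
Link: https://arxiv.org/abs/2508.00197
Authors: Eric Mjolsness,Cory B. Scott
Institutions: University of California Irvine; Colorado College
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Category Theory (math.CT); Numerical Analysis (math.NA)
Comments: 42 pages. 33 Figures. Under review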
Abstract:Graphs, and sequences of growing graphs, can be used to specify the architecture of mathematical models in many fields including machine learning and computational science. Here we define structured graph “lineages” (ordered by level number) that grow in a hierarchical fashion, so that: (1) the number of graph vertices and edges increases exponentially in level number; (2) bipartite graphs connect successive levels within a graph lineage and, as in multigrid methods, can constrain matrices relating successive levels; (3) using prolongation maps within a graph lineage, process-derived distance measures between graphs at successive levels can be defined; (4) a category of “graded graphs” can be defined, and using it low-cost “skeletal” variants of standard algebraic graph operations and type constructors (cross product, box product, disjoint sum, and function types) can be derived for graded graphs and hence hierarchical graph lineages; (5) these skeletal binary operators have similar but not identical algebraic and category-theoretic properties to their standard counterparts; (6) graph lineages and their skeletal product constructors can approach continuum limit objects. Additional space-efficient unary operators on graded graphs are also derived: thickening, which creates a graph lineage of multiscale graphs, and escalation to a graph lineage of search frontiers (useful as a generalization of adaptive grids and in defining “skeletal” functions). The result is an algebraic type theory for graded graphs and (hierarchical) graph lineages. The approach is expected to be well suited to defining hierarchical model architectures - “hierarchitectures” - and local sampling, search, or optimization algorithms on them. We demonstrate such application to deep neural networks (including visual and feature scale spaces) and to multigrid numerical methods.
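For orientation, the standard (non-skeletal) box and cross products that the abstract takes as its starting point can be written down directly. This sketch assumes the usual graph-theoretic definitions (box product as the Cartesian product, cross product as the tensor product) and uses path graphs with 2^l vertices as a toy lineage; it does not implement the paper's skeletal variants, which the abstract does not define, and all names are hypothetical.

```python
from itertools import combinations, product


def path_graph(n):
    """Path graph on vertices 0..n-1 as (vertex list, set of undirected edges)."""
    return list(range(n)), {frozenset((i, i + 1)) for i in range(n - 1)}


def box_product(g, h):
    """Cartesian product: step along an edge in exactly one factor, stay put in the other."""
    (gv, ge), (hv, he) = g, h
    vertices = list(product(gv, hv))
    edges = set()
    for (a, b), (c, d) in combinations(vertices, 2):
        step_in_h = a == c and frozenset((b, d)) in he
        step_in_g = b == d and frozenset((a, c)) in ge
        if step_in_h or step_in_g:
            edges.add(frozenset(((a, b), (c, d))))
    return vertices, edges


def cross_product(g, h):
    """Tensor product: step along an edge in both factors simultaneously."""
    (gv, ge), (hv, he) = g, h
    vertices = list(product(gv, hv))
    edges = set()
    for (a, b), (c, d) in combinations(vertices, 2):
        if frozenset((a, c)) in ge and frozenset((b, d)) in he:
            edges.add(frozenset(((a, b), (c, d))))
    return vertices, edges


# A toy lineage: level l is a path graph with 2**l vertices.
lineage = [path_graph(2 ** level) for level in range(4)]
print([len(v) for v, _ in lineage])  # [1, 2, 4, 8]

p3 = path_graph(3)
print(len(box_product(p3, p3)[1]), len(cross_product(p3, p3)[1]))  # 12 8
```

Taking full products level by level multiplies vertex counts, which gives a sense of why the abstract emphasizes low-cost "skeletal" variants for hierarchical lineages.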
[CV-87] Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs ICCV2025
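[Quick Read]: This paper addresses sparse or erroneous LiDAR point clouds in real-world scenes caused by factors such as long range, low-albedo objects, or strong ambient light. These errors originate in noisy raw measurements and propagate to downstream perception models, causing significant accuracy loss. Conventional 3D processing pipelines do not retain the uncertainty information in the raw measurements when constructing point clouds, which limits robustness. The key to its solution is Probabilistic Point Clouds (PPC), which attach to each point a probability attribute encoding the measurement uncertainty (or confidence) in the raw data, so that uncertainty is modeled and used explicitly. The authors further design PPC-based inference methods that can be embedded as lightweight modules without changing existing 3D inference pipelines, significantly improving 3D object detection in challenging scenes (such as small, distant, or low-albedo objects).
Link: https://arxiv.org/abs/2508.00169
Authors: Bhavya Goyal,Felipe Gutierrez-Barragan,Wei Lin,Andreas Velten,Yin Li,Mohit Gupta
Institutions: University of Wisconsin-Madison; Ubicept
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025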
Abstract:LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various scene understanding tasks. Modern LiDARs face key challenges in several real-world scenarios, such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines do not retain any uncertainty information from the raw measurements when constructing point clouds. We propose Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (or confidence) in the raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light. Our project webpage is at this https URL.
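The representation described in the abstract, a point augmented with a probability attribute, can be sketched as a small data structure. The two uses shown (hard thresholding and a confidence-weighted centroid) are illustrative assumptions about how such an attribute might be consumed, not the inference methods proposed in the paper; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbabilisticPoint:
    x: float
    y: float
    z: float
    p: float  # confidence in [0, 1] carried over from the raw measurement


def filter_by_confidence(points, threshold):
    """Hard filtering: drop points whose confidence is below the threshold."""
    return [pt for pt in points if pt.p >= threshold]


def weighted_centroid(points):
    """Soft use of confidence: low-probability points contribute less to the centroid."""
    total = sum(pt.p for pt in points)
    if total == 0.0:
        raise ValueError("all points have zero confidence")
    return (
        sum(pt.p * pt.x for pt in points) / total,
        sum(pt.p * pt.y for pt in points) / total,
        sum(pt.p * pt.z for pt in points) / total,
    )


def plain_centroid(points):
    """Baseline that ignores confidence, as a conventional point cloud would."""
    n = len(points)
    return (
        sum(pt.x for pt in points) / n,
        sum(pt.y for pt in points) / n,
        sum(pt.z for pt in points) / n,
    )


cloud = [
    ProbabilisticPoint(0.0, 0.0, 0.0, 0.8),
    ProbabilisticPoint(2.0, 0.0, 0.0, 0.8),
    ProbabilisticPoint(50.0, 0.0, 0.0, 0.1),  # likely a spurious return
]
print(len(filter_by_confidence(cloud, 0.5)))  # 2
print(weighted_centroid(cloud)[0] < plain_centroid(cloud)[0])  # True
```

Keeping `p` on every point lets downstream code choose between discarding and down-weighting uncertain returns instead of having that decision baked in when the cloud is built.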
[CV-88] GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration ICCV2025
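[Quick Read]: This paper addresses the limited robustness and generalization of Active Geo-localization (AGL) that result from reliance on distance-based rewards, especially when targets or environments are unseen. The key to its solution is a curiosity-driven intrinsic reward. This reward is goal-agnostic and relies on effective environment modeling to produce robust, diverse, and contextually relevant exploration, improving the localization performance of AGL agents in complex and unfamiliar scenes.
Link: https://arxiv.org/abs/2508.00152
Authors: Li Mi,Manon Bechaz,Zeming Chen,Antoine Bosselut,Devis Tuia
Institutions: EPFL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page at this https URL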
Abstract:Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
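A common way to realize a goal-agnostic curiosity reward is to score a transition by how badly a learned forward model predicts it. The sketch below assumes that formulation with a deliberately tiny tabular model; GeoExplorer's actual environment model and reward are not specified in the abstract, and the class, function names, and toy observation are hypothetical.

```python
class ForwardModel:
    """Tabular forward model: predicts the next observation for a (state, action) pair."""

    def __init__(self, obs_dim, lr=0.5):
        self.obs_dim = obs_dim
        self.lr = lr
        self.table = {}

    def predict(self, state, action):
        # Unseen transitions are predicted as all zeros.
        return self.table.get((state, action), [0.0] * self.obs_dim)

    def update(self, state, action, next_obs):
        # Move the stored prediction a fraction `lr` toward the observed outcome.
        pred = self.predict(state, action)
        self.table[(state, action)] = [
            p + self.lr * (o - p) for p, o in zip(pred, next_obs)
        ]


def intrinsic_reward(model, state, action, next_obs):
    """Curiosity reward: squared prediction error of the forward model."""
    pred = model.predict(state, action)
    return sum((o - p) ** 2 for o, p in zip(next_obs, pred))


model = ForwardModel(obs_dim=2)
rewards = []
for _ in range(3):
    # The agent repeats the same transition; its novelty should wear off.
    rewards.append(intrinsic_reward(model, "cell_0", "north", [1.0, 2.0]))
    model.update("cell_0", "north", [1.0, 2.0])
print(rewards)  # [5.0, 1.25, 0.3125]
```

Because the reward depends only on how surprising a transition is, not on distance to any goal, it stays defined for targets and environments never seen in training, which is the property the abstract highlights.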
[CV-89] World Consistency Score: A Unified Metric for Video Generation Quality
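[Quick Read]: This paper addresses the lack of a comprehensive measure of internal world consistency in the evaluation of generative video models. Existing metrics such as FVD and CLIPScore focus on visual fidelity or prompt alignment and overlook temporal and physical coherence such as object permanence, relation stability, and causal plausibility. The key to its solution is a unified, interpretable metric, the World Consistency Score (WCS), which combines four sub-metrics (object permanence, relation stability, causal compliance, and flicker penalty) into a single score through a learned weighted formula. Each sub-metric is computed with open-source tools (object trackers, action recognizers, CLIP embeddings, and optical flow), and the weights are trained on human preference data so that the score aligns with human judgments.
Link: https://arxiv.org/abs/2508.00144
Authors: Akshat Rakheja,Aarsh Ashdhir,Aryan Bhattacharjee,Vanshika Sharma
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 1 figure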
Abstract:We introduce World Consistency Score (WCS), a novel unified evaluation metric for generative video models that emphasizes internal world consistency of the generated videos. WCS integrates four interpretable sub-components - object permanence, relation stability, causal compliance, and flicker penalty - each measuring a distinct aspect of temporal and physical coherence in a video. These submetrics are combined via a learned weighted formula to produce a single consistency score that aligns with human judgments. We detail the motivation for WCS in the context of existing video evaluation metrics, formalize each submetric and how it is computed with open-source tools (trackers, action recognizers, CLIP embeddings, optical flow), and describe how the weights of the WCS combination are trained using human preference data. We also outline an experimental validation blueprint: using benchmarks like VBench-2.0, EvalCrafter, and LOVE to test WCS’s correlation with human evaluations, performing sensitivity analyses, and comparing WCS against established metrics (FVD, CLIPScore, VBench, FVMD). The proposed WCS offers a comprehensive and interpretable framework for evaluating video generation models on their ability to maintain a coherent “world” over time, addressing gaps left by prior metrics focused only on visual fidelity or prompt alignment.
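The combination step can be sketched as a weighted sum. The abstract says only that the four sub-metrics are combined "via a learned weighted formula", so the linear form, the subtraction of the flicker term, and the weight values below are all assumptions for illustration rather than the paper's formula or learned weights.

```python
# Sub-metrics assumed to raise the score; the flicker penalty is assumed to lower it.
POSITIVE_TERMS = ("object_permanence", "relation_stability", "causal_compliance")


def world_consistency_score(submetrics, weights):
    """Linear combination of sub-metric scores, with the flicker penalty subtracted."""
    score = sum(weights[name] * submetrics[name] for name in POSITIVE_TERMS)
    return score - weights["flicker_penalty"] * submetrics["flicker_penalty"]


# Hypothetical weights; in the paper these would be fit to human preference data.
example_weights = {
    "object_permanence": 0.3,
    "relation_stability": 0.3,
    "causal_compliance": 0.3,
    "flicker_penalty": 0.1,
}

steady_video = {
    "object_permanence": 0.9,
    "relation_stability": 0.8,
    "causal_compliance": 0.7,
    "flicker_penalty": 0.2,
}
flickery_video = dict(steady_video, flicker_penalty=0.9)

print(world_consistency_score(steady_video, example_weights))
print(world_consistency_score(flickery_video, example_weights))
```

With these made-up numbers, the second video scores lower purely because of its larger flicker term, which shows how a single score can stay interpretable through its components.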
[CV-90] Exploring the Feasibility of Deep Learning Techniques for Accurate Gender Classification from Eye Images
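[Quick Read]: This paper addresses the drop in gender classification accuracy caused in practice by factors such as cosmetics and disguise, and proposes a high-accuracy gender recognition method based on the periocular region. The key to its solution is a Convolutional Neural Network (CNN) model that extracts features from and classifies color images of the periocular region, which contains rich visual cues that improve robustness. Experiments show that the model reaches 99% accuracy on the previously unused CVBL dataset and 96% accuracy on the (Female and Male) dataset with relatively few parameters (7,235,089), outperforming existing methods and showing potential for deployment in security and surveillance.
Link: https://arxiv.org/abs/2508.00135
Authors: Basna Mohammed Salih Hasan,Ramadhan J. Mstafa
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 18 figures, 5 tables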
Abstract:Gender classification has emerged as a crucial aspect in various fields, including security, human-machine interaction, surveillance, and advertising. Nonetheless, the accuracy of this classification can be influenced by factors such as cosmetics and disguise. Consequently, our study is dedicated to addressing this concern by concentrating on gender classification using color images of the periocular region. The periocular region refers to the area surrounding the eye, including the eyelids, eyebrows, and the region between them. It contains valuable visual cues that can be used to extract key features for gender classification. This paper introduces a sophisticated Convolutional Neural Network (CNN) model that utilizes color image databases to evaluate the effectiveness of the periocular region for gender classification. To validate the model’s performance, we conducted tests on two eye datasets, namely CVBL and (Female and Male). The recommended architecture achieved an outstanding accuracy of 99% on the previously unused CVBL dataset while attaining a commendable accuracy of 96% with a small number of learnable parameters (7,235,089) on the (Female and Male) dataset. To ascertain the effectiveness of our proposed model for gender classification using the periocular region, we evaluated its performance through an extensive range of metrics and compared it with other state-of-the-art approaches. The results unequivocally demonstrate the efficacy of our model, thereby suggesting its potential for practical application in domains such as security and surveillance.
[CV-91] Stress-Aware Resilient Neural Training
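[Quick Read]: This paper addresses the poor generalization of deep neural networks that results from optimization difficulties during training (such as getting trapped in sharp minima). Its core solution is a stress-aware mechanism called the Plastic Deformation Optimizer, which injects adaptive noise in a way that mirrors the concepts of elastic and plastic deformation from materials science: when an internal stress signal (reflecting stagnation in training loss and accuracy) indicates persistent optimization difficulty, the model injects noise to escape the local optimum and converge toward flatter, more generalizable regions of the loss landscape. The method is evaluated on six architectures, four optimizers, and seven vision benchmarks, showing improved robustness and generalization with minimal computational overhead.
Link: https://arxiv.org/abs/2508.00098
Authors: Ashkan Shakarami,Yousef Yeganeh,Azade Farshad,Lorenzo Nicole,Stefano Ghidoni,Nassir Navab
Institutions: University of Padova; Technical University of Munich
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures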
Abstract:This paper introduces Stress-Aware Learning, a resilient neural training paradigm in which deep neural networks dynamically adjust their optimization behavior - whether under stable training regimes or in settings with uncertain dynamics - based on the concept of Temporary (Elastic) and Permanent (Plastic) Deformation, inspired by structural fatigue in materials science. To instantiate this concept, we propose Plastic Deformation Optimizer, a stress-aware mechanism that injects adaptive noise into model parameters whenever an internal stress signal - reflecting stagnation in training loss and accuracy - indicates persistent optimization difficulty. This enables the model to escape sharp minima and converge toward flatter, more generalizable regions of the loss landscape. Experiments across six architectures, four optimizers, and seven vision benchmarks demonstrate improved robustness and generalization with minimal computational overhead. The code and 3D visuals will be available on GitHub: this https URL.
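The mechanism described in the abstract, noise injected when a stress signal indicates stagnation, can be sketched as a small monitor that wraps any training loop. The stress update rule, the threshold, the noise scale, and the class name are assumptions made for illustration; the abstract does not give the Plastic Deformation Optimizer's actual formulas, and this sketch tracks loss only, not accuracy.

```python
import random


class StressAwareNoise:
    """Accumulates 'stress' while the loss stagnates and perturbs parameters once it is high."""

    def __init__(self, tol=1e-3, threshold=3.0, noise_scale=0.1, seed=0):
        self.tol = tol                  # minimum improvement that counts as progress
        self.threshold = threshold      # stress level that triggers noise injection
        self.noise_scale = noise_scale  # noise std per unit of accumulated stress
        self.rng = random.Random(seed)
        self.best_loss = float("inf")
        self.stress = 0.0

    def step(self, params, loss):
        """Return (possibly perturbed params, whether noise was injected)."""
        if self.best_loss - loss > self.tol:
            # Progress: record it and let stress relax ("elastic" recovery).
            self.best_loss = loss
            self.stress = max(0.0, self.stress - 1.0)
            return params, False
        self.stress += 1.0
        if self.stress < self.threshold:
            return params, False
        # Persistent stagnation: apply a lasting ("plastic") perturbation and reset stress.
        sigma = self.noise_scale * self.stress
        perturbed = [p + self.rng.gauss(0.0, sigma) for p in params]
        self.stress = 0.0
        return perturbed, True


monitor = StressAwareNoise()
params = [0.5, -0.5]
flags = []
for loss in [1.0, 0.5, 0.5, 0.5, 0.5]:
    params, injected = monitor.step(params, loss)
    flags.append(injected)
print(flags)  # [False, False, False, False, True]
```

In a real training loop, `step` would be called once per epoch or evaluation interval, after the base optimizer's update, so the overhead is one comparison plus an occasional parameter perturbation.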
[CV-92] The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking IROS2025
【速读】:该论文旨在解决当前视觉惯性里程计(Visual-Inertial Odometry, VIO)与同步定位与建图(Simultaneous Localization and Mapping, SLAM)系统在头戴式设备应用场景中表现不佳的问题,尤其是在高强运动、动态遮挡、长时间追踪、低纹理区域、不良光照条件及传感器饱和等挑战性场景下,现有文献数据集未能充分覆盖这些真实世界复杂情况,导致算法研发可能忽视关键性能瓶颈。解决方案的关键在于构建并发布Monado SLAM数据集——一套来自多种虚拟现实头显的真实序列数据,并以宽松的CC BY 4.0许可证开放共享,从而推动VIO/SLAM技术在更具代表性的实际场景中的研究与进步。
链接: https://arxiv.org/abs/2508.00088
作者: Mateo de Mayo,Daniel Cremers,Taihú Pire
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Collabora Ltd. (Collabora有限公司); CIFASIS, CONICET-UNR (CIFASIS, CONICET-UNR)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to IROS 2025
Abstract:Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.
[CV-93] Punching Bag vs. Punching Person: Motion Transferability in Videos ICCV2025
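[Quick Read]: This paper examines whether action recognition models can transfer high-level motion concepts across contexts, in particular whether a model still recognizes a broad action such as "punching" when it appears in an unseen variation such as "punching person". The key contribution is a systematic motion transferability framework with three datasets: Syn-TA (synthetic 3D object motions), Kinetics400-TA, and Something-Something-v2-TA (both adapted from natural video datasets). An evaluation of 13 state-of-the-art models on these benchmarks shows that models are fragile on fine-grained actions, that they differ in how much they rely on spatial versus temporal cues, and that disentangling coarse and fine motions can improve recognition in temporally challenging settings.
Link: https://arxiv.org/abs/2508.00085
Authors: Raiyaan Abdullah, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat
Affiliations: Center for Research in Computer Vision, University of Central Florida; Center for Vision Technology, SRI International
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025 main conference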
Abstract:Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action “punching” when presented with an unseen variation such as “punching person”? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: this https URL.
[CV-94] A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition ICCV2025
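[Quick Read]: This paper addresses the performance bottleneck in whole-body biometric recognition caused by differing score distributions and inconsistent data quality across modalities such as face, gait, and body. Conventional methods rely on fixed-weight score fusion, which adapts poorly to variation in each modality's similarity-score distribution and so limits overall accuracy. The key idea is a learnable score-fusion framework, Quality-guided Mixture of score-fusion Experts (QME): a modality-specific Quality Estimator (QE) trained with a pseudo-quality loss assesses score quality, and a score triplet loss improves metric performance, yielding adaptive multimodal fusion and improved recognition across datasets.
Link: https://arxiv.org/abs/2508.00053
Authors: Jie Zhu, Yiyang Su, Minchul Kim, Anil Jain, Xiaoming Liu
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025. 11 pages, 5 figures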
Abstract:Whole-body biometric recognition is a challenging multimodal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score-fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present Quality-guided Mixture of score-fusion Experts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective for multimodal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality.
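To make the contrast with fixed-weight averaging concrete, the sketch below fuses per-modality similarity scores with weights derived from per-sample quality values. It is a hand-written stand-in, not QME: the paper learns its fusion with a Mixture of Experts, while here the function `fuse_scores`, the softmax weighting, and the example quality values are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(scores, qualities):
    """Quality-weighted average of similarity scores.

    scores, qualities: dicts mapping a modality name to a float.
    Weights are softmax(qualities), so a higher-quality modality
    contributes more to the fused score (hypothetical rule).
    """
    modalities = sorted(scores)
    weights = softmax([qualities[m] for m in modalities])
    return sum(w * scores[m] for w, m in zip(weights, modalities))
```

With equal qualities this reduces to a plain average; raising the quality of one modality pulls the fused score toward that modality's similarity.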
[CV-95] AI-Driven Collaborative Satellite Object Detection for Space Sustainability
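[Quick Read]: This paper addresses the space-sustainability challenge posed by the growing density of satellites in low-Earth orbit (LEO), in particular the shortfall in space situational awareness (SSA) as in-orbit collision risk rises. Traditional ground-based tracking is limited by latency and coverage, so the paper proposes a satellite clustering framework in which multiple satellites collaboratively run deep learning (DL)-based space object detection (SOD). The key elements are a high-fidelity dataset simulating imaging scenarios for clustered satellite formations, a distance-aware viewpoint selection strategy that optimizes detection performance, and an evaluation with recent DL models, which achieves competitive detection accuracy while keeping a low size, weight, and power (SWaP) footprint.
Link: https://arxiv.org/abs/2508.00755
Authors: Peng Hu, Wenxuan Zhang
Affiliations: University of Waterloo; University of Manitoba
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the 13th Annual IEEE International Conference on Wireless for Space and Extreme Environments (WiSEE 2025)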
Abstract:The growing density of satellites in low-Earth orbit (LEO) presents serious challenges to space sustainability, primarily due to the increased risk of in-orbit collisions. Traditional ground-based tracking systems are constrained by latency and coverage limitations, underscoring the need for onboard, vision-based space object detection (SOD) capabilities. In this paper, we propose a novel satellite clustering framework that enables the collaborative execution of deep learning (DL)-based SOD tasks across multiple satellites. To support this approach, we construct a high-fidelity dataset simulating imaging scenarios for clustered satellite formations. A distance-aware viewpoint selection strategy is introduced to optimize detection performance, and recent DL models are used for evaluation. Experimental results show that the clustering-based method achieves competitive detection accuracy compared to single-satellite and existing approaches, while maintaining a low size, weight, and power (SWaP) footprint. These findings underscore the potential of distributed, AI-enabled in-orbit systems to enhance space situational awareness and contribute to long-term space sustainability.
[CV-96] FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems
[Quick Read]: This paper addresses ill-posed inverse problems, which are prominent in image restoration tasks such as image super-resolution and Gaussian deblurring. Traditional approaches rely on domain-specific or untrained priors. The proposed FMPlug framework builds on two insights: the similarity between observed and desired objects, and the Gaussianity of generative flows. Concretely, it introduces a time-adaptive warm-up strategy and sharp Gaussianity regularization, which unlock the potential of domain-agnostic foundation models and outperform existing methods based on foundation flow-matching (FM) priors by significant margins.
Link: https://arxiv.org/abs/2508.00721
Authors: Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.
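[CV-97] The Repeated-Stimulus Confound in Electroencephalography
[Quick Read]: This paper addresses the repeated-stimulus confound in neural-decoding studies: when a decoding model is trained and evaluated on responses to the same stimuli, stimulus identity confounds the accuracy estimate and inflates the model's apparent decoding performance. The authors identify a susceptible dataset and 16 affected publications, then quantify the bias experimentally: decoding accuracies of the affected models were overestimated by 4.46% to 7.42%, and each 1% increase in accuracy under the confound raises the overestimation by 0.26%. The analysis shows that the confound not only yields optimistic performance estimates but also undermines claims made in the affected publications, and that the same methodology could be used to justify pseudoscientific claims such as the existence of extrasensory perception.
Link: https://arxiv.org/abs/2508.00531
Authors: Jack A. Kilgallen, Barak A. Pearlmutter, Jeffrey Mark Siskind
Affiliations: Hamilton Institute, Maynooth University, Maynooth, Ireland; Department of Computer Science, Maynooth University, Maynooth, Ireland; Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-2035, USA
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures, 8 tables, in submission to IEEE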
Abstract:In neural-decoding studies, recordings of participants’ responses to stimuli are used to train models. In recent years, there has been an explosion of publications detailing applications of innovations from deep-learning research to neural-decoding studies. The data-hungry models used in these experiments have resulted in a demand for increasingly large datasets. Consequently, in some studies, the same stimuli are presented multiple times to each participant to increase the number of trials available for use in model training. However, when a decoding model is trained and subsequently evaluated on responses to the same stimuli, stimulus identity becomes a confounder for accuracy. We term this the repeated-stimulus confound. We identify a susceptible dataset, and 16 publications which report model performance based on evaluation procedures affected by the confound. We conducted experiments using models from the affected studies to investigate the likely extent to which results in the literature have been misreported. Our findings suggest that the decoding accuracies of these models were overestimated by between 4.46% and 7.42%. Our analysis also indicates that per 1% increase in accuracy under the confound, the magnitude of the overestimation increases by 0.26%. The confound not only results in optimistic estimates of decoding performance, but undermines the validity of several claims made within the affected publications. We conducted further experiments to investigate the implications of the confound in alternative contexts. We found that the same methodology used within the affected studies could also be used to justify an array of pseudoscientific claims, such as the existence of extrasensory perception.
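One common way to avoid evaluating on stimuli seen during training is to split trials by stimulus rather than by trial. The sketch below illustrates that idea only; it is not taken from the paper, and the function name `split_by_stimulus`, the `(stimulus_id, response)` trial layout, and the default `test_fraction` are assumptions.

```python
import random


def split_by_stimulus(trials, test_fraction=0.25, seed=0):
    """Hold out whole stimuli so no stimulus appears in both splits.

    trials: list of (stimulus_id, response) pairs (hypothetical layout).
    Returns (train_trials, test_trials).
    """
    stimuli = sorted({stimulus for stimulus, _ in trials})
    rng = random.Random(seed)
    rng.shuffle(stimuli)
    # Always hold out at least one stimulus.
    n_test = max(1, round(len(stimuli) * test_fraction))
    test_ids = set(stimuli[:n_test])
    train = [t for t in trials if t[0] not in test_ids]
    test = [t for t in trials if t[0] in test_ids]
    return train, test
```

A trial-level random split of the same data would place repetitions of one stimulus on both sides, which is the situation the abstract identifies as the confound.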
[CV-98] Diffusion-Based User-Guided Data Augmentation for Coronary Stenosis Detection MICCAI2025
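[Quick Read]: This paper addresses the limited performance of deep learning models for automated localization and severity assessment of coronary stenosis, caused by scarce labeled data and class imbalance. The key solution is a data augmentation method that uses diffusion-model-based inpainting to generate realistic lesions with user-guided control over severity, which improves lesion detection and severity classification even when training data is limited and supports more reliable clinical decision support.
Link: https://arxiv.org/abs/2508.00438
Authors: Sumin Seo, In Kyu Lee, Hyun-Woo Kim, Jaesik Min, Chung-Hwan Jung
Affiliations: Medipixel
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025. Dataset available at this https URL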
Abstract:Coronary stenosis is a major risk factor for ischemic heart events leading to increased mortality, and medical treatments for this condition require meticulous, labor-intensive analysis. Coronary angiography provides critical visual cues for assessing stenosis, supporting clinicians in making informed decisions for diagnosis and treatment. Recent advances in deep learning have shown great potential for automated localization and severity measurement of stenosis. In real-world scenarios, however, the success of these competent approaches is often hindered by challenges such as limited labeled data and class imbalance. In this study, we propose a novel data augmentation approach that uses an inpainting method based on a diffusion model to generate realistic lesions, allowing user-guided control of severity. Extensive evaluation on lesion detection and severity classification across various synthetic dataset sizes shows superior performance of our method on both a large-scale in-house dataset and a public coronary angiography dataset. Furthermore, our approach maintains high detection and classification performance even when trained with limited data, highlighting its clinical importance in improving the assessment of severity of stenosis and optimizing data utilization for more reliable decision support.
[CV-99] Jet Image Generation in High Energy Physics Using Diffusion Models
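[Quick Read]: This paper addresses the accuracy and efficiency of generating jet images for proton-proton collision events in High Energy Physics (HEP). Rather than modeling latent distributions, it trains diffusion models directly in image space: the kinematic variables of quark, gluon, W-boson, Z-boson, and top quark jets from the JetNet simulation dataset are mapped to two-dimensional images, and both score-based diffusion models and consistency models are trained on them. The main points are: 1) the authors present this as the first application of diffusion models to LHC jet image generation; 2) the method models the spatial distribution of jet constituents directly in image space, without a latent space; 3) evaluated with metrics including the Fréchet Inception Distance (FID), consistency models achieve higher fidelity and generation stability than score-based diffusion models.
Link: https://arxiv.org/abs/2508.00250
Authors: Victor D. Martinez, Vidya Manian, Sudhir Malik
Affiliations: University of Puerto Rico, Mayaguez
Categories: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The paper is under review at IEEE Transactions in Nuclear Science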
Abstract:This article presents, for the first time, the application of diffusion models for generating jet images corresponding to proton-proton collision events at the Large Hadron Collider (LHC). The kinematic variables of quark, gluon, W-boson, Z-boson, and top quark jets from the JetNet simulation dataset are mapped to two-dimensional image representations. Diffusion models are trained on these images to learn the spatial distribution of jet constituents. We compare the performance of score-based diffusion models and consistency models in accurately generating class-conditional jet images. Unlike approaches based on latent distributions, our method operates directly in image space. The fidelity of the generated images is evaluated using several metrics, including the Fréchet Inception Distance (FID), which demonstrates that consistency models achieve higher fidelity and generation stability compared to score-based diffusion models. These advancements offer significant improvements in computational efficiency and generation accuracy, providing valuable tools for High Energy Physics (HEP) research.
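The mapping from kinematic variables to a two-dimensional image can be sketched as a simple binning step. This is an illustration under stated assumptions, not the paper's preprocessing: it assumes each constituent is given as an offset `(d_eta, d_phi)` from the jet axis plus a transverse momentum `pt`, and the function name `jet_to_image`, the grid size, and the window `half_width` are hypothetical.

```python
def jet_to_image(constituents, n_pixels=8, half_width=0.4):
    """Bin jet constituents into an n_pixels x n_pixels grid.

    constituents: iterable of (d_eta, d_phi, pt) tuples, with d_eta and
    d_phi measured relative to the jet axis (assumed input layout).
    Each pixel holds the summed pt of the constituents that fall in it;
    constituents outside the window are dropped.
    """
    image = [[0.0] * n_pixels for _ in range(n_pixels)]
    pixel_size = 2.0 * half_width / n_pixels
    for d_eta, d_phi, pt in constituents:
        inside_eta = -half_width <= d_eta < half_width
        inside_phi = -half_width <= d_phi < half_width
        if not (inside_eta and inside_phi):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        row = min(n_pixels - 1, int((d_eta + half_width) / pixel_size))
        col = min(n_pixels - 1, int((d_phi + half_width) / pixel_size))
        image[row][col] += pt
    return image
```

Images produced this way are sparse, with most of the intensity concentrated near the center of the grid, where the jet axis lies.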
[CV-100] Weakly Supervised Intracranial Aneurysm Detection and Segmentation in MR angiography via Multi-task UNet with Vesselness Prior ICCV2025
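[Quick Read]: This paper addresses the difficulty of detecting and morphologically analyzing intracranial aneurysms (IAs) in time-of-flight MR angiography (TOF-MRA), where the lesions are small and low in contrast, and the shortage of large datasets with voxel-wise expert annotations for developing deep learning algorithms. The key solution is a weakly supervised 3D multi-task UNet that integrates vesselness priors: Frangi's vesselness filter provides soft cerebrovascular priors that serve both as network input and in an attention block, with segmentation produced by the decoder and detection by an auxiliary branch. The model is trained on the Lausanne dataset with coarse labels, evaluated on a test set with refined labels, and validated externally on the ADAM dataset, outperforming state-of-the-art (SOTA) methods in segmentation (Dice = 0.614, 95% HD = 1.38 mm) and detection (false positive rate = 1.47, sensitivity = 92.9%).
Link: https://arxiv.org/abs/2508.00235
Authors: Erin Rainville, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
Affiliations: Concordia University; NeuroRx Research
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025 Workshop CVAMD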
Abstract:Intracranial aneurysms (IAs) are abnormal dilations of cerebral blood vessels that, if ruptured, can lead to life-threatening consequences. However, their small size and soft contrast in radiological scans often make it difficult to perform accurate and efficient detection and morphological analyses, which are critical in the clinical care of the disorder. Furthermore, the lack of large public datasets with voxel-wise expert annotations poses challenges for developing deep learning algorithms to address the issues. Therefore, we propose a novel weakly supervised 3D multi-task UNet that integrates vesselness priors to jointly perform aneurysm detection and segmentation in time-of-flight MR angiography (TOF-MRA). Specifically, to robustly guide IA detection and segmentation, we employ the popular Frangi’s vesselness filter to derive soft cerebrovascular priors for both network input and an attention block to conduct segmentation from the decoder and detection from an auxiliary branch. We train our model on the Lausanne dataset with coarse ground truth segmentation, and evaluate it on the test set with refined labels from the same database. To further assess our model’s generalizability, we also validate it externally on the ADAM dataset. Our results demonstrate the superior performance of the proposed technique over the SOTA techniques for aneurysm segmentation (Dice = 0.614, 95% HD = 1.38 mm) and detection (false positive rate = 1.47, sensitivity = 92.9%).
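For readers unfamiliar with the Dice score reported above, the sketch below computes it for two binary masks as twice the overlap divided by the sum of the two mask sizes. The function name `dice_coefficient`, the flat-list input, and the convention of returning 1.0 when both masks are empty are assumptions of this sketch, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    pred, target: equal-length flat sequences of 0/1 voxel labels.
    Returns 1.0 when both masks are empty (a convention chosen here).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0
    return 2.0 * intersection / size_sum
```

For small structures such as aneurysms, a few misclassified voxels change the score noticeably, which is worth keeping in mind when comparing Dice values across tasks.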
[CV-101] GEPAR3D: Geometry Prior-Assisted Learning for 3D Tooth Segmentation MICCAI
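[Quick Read]: This paper addresses tooth segmentation in Cone-Beam Computed Tomography (CBCT), especially for fine structures such as root apices, which matters for assessing root resorption in orthodontics. The key solution, GEPAR3D, unifies instance detection and multi-class segmentation into a single step and uses a Statistical Shape Model (SSM) of dentition as a geometric prior, capturing anatomical context and morphological consistency without enforcing restrictive adjacency constraints. In addition, a deep watershed method models each tooth as a continuous 3D energy basin that encodes voxel distances to boundaries, enabling accurate segmentation of narrow, complex root apices.
Link: https://arxiv.org/abs/2508.00155
Authors: Tomasz Szczepański, Szymon Płotka, Michal K. Grzeszczyk, Arleta Adamowicz, Piotr Fudalej, Przemysław Korzeniowski, Tomasz Trzciński, Arkadiusz Sitek
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025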
Abstract:Tooth segmentation in Cone-Beam Computed Tomography (CBCT) remains challenging, especially for fine structures like root apices, which is critical for assessing root resorption in orthodontics. We introduce GEPAR3D, a novel approach that unifies instance detection and multi-class segmentation into a single step tailored to improve root segmentation. Our method integrates a Statistical Shape Model of dentition as a geometric prior, capturing anatomical context and morphological consistency without enforcing restrictive adjacency constraints. We leverage a deep watershed method, modeling each tooth as a continuous 3D energy basin encoding voxel distances to boundaries. This instance-aware representation ensures accurate segmentation of narrow, complex root apices. Trained on publicly available CBCT scans from a single center, our method is evaluated on external test sets from two in-house and two public medical centers. GEPAR3D achieves the highest overall segmentation performance, averaging a Dice Similarity Coefficient (DSC) of 95.0% (+2.8% over the second-best method) and increasing recall to 95.2% (+9.5%) across all test sets. Qualitative analyses demonstrated substantial improvements in root segmentation quality, indicating significant potential for more accurate root resorption assessment and enhanced clinical decision-making in orthodontics. We provide the implementation and dataset at this https URL.
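The idea of encoding each voxel's distance to the object boundary can be illustrated in 2D with a breadth-first search from the background. This is a simplified stand-in, not the GEPAR3D energy: it works on a 2D binary mask with 4-connected step distances, treats everything outside the grid as background, and the function name `boundary_distance_map` is hypothetical.

```python
from collections import deque


def boundary_distance_map(mask):
    """4-connected step distance from each foreground pixel to the background.

    mask: 2D list of 0/1 values. Pixels outside the grid count as
    background, so foreground pixels on the border get distance 1.
    Background pixels get distance 0.
    """
    rows, cols = len(mask), len(mask[0])
    # Pad with a background border so the image edge acts as a boundary.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(row) + [0] for row in mask]
    padded += [[0] * (cols + 2)]
    dist = [[0 if v == 0 else None for v in row] for row in padded]
    queue = deque(
        (r, c)
        for r in range(rows + 2)
        for c in range(cols + 2)
        if padded[r][c] == 0
    )
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows + 2 and 0 <= nc < cols + 2 and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    # Crop the padding away before returning.
    return [row[1:-1] for row in dist[1:-1]]
```

Values grow toward the interior of an object, so thin structures stay at low values along their whole length, which is one reason boundary-distance encodings are sensitive to narrow regions.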
[CV-102] CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography
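[Quick Read]: This paper addresses model fragmentation and insufficient data coverage in whole-body CT segmentation. Existing AI methods mostly target individual anatomical structures, producing incompatible models with varying performance and disparate evaluation protocols, while the lack of sufficiently large and diverse training data hinders clinical deployment of holistic segmentation models. The key solution is CADS, a large-scale, standardized, open-source dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, with 18 times more scans and 60% more anatomical targets than existing collections. On this basis the authors develop the CADS-model using established architectures for automated whole-body CT segmentation and validate it across 18 public datasets and an independent hospital cohort, including segmentation tasks from radiation oncology.
Link: https://arxiv.org/abs/2507.22953
Authors: Murong Xu, Tamaz Amiranashvili, Fernando Navarro, Maksym Fritsak, Ibrahim Ethem Hamamci, Suprosanna Shit, Bastian Wittmann, Sezgin Er, Sebastian M. Christ, Ezequiel de la Rosa, Julian Deseoe, Robert Graf, Hendrik Möller, Anjany Sekuboyina, Jan C. Peeken, Sven Becker, Giulia Baldini, Johannes Haubold, Felix Nensa, René Hosch, Nikhil Mirajkar, Saad Khalid, Stefan Zachow, Marc-André Weber, Georg Langs, Jakob Wasserthal, Mehmet Kemal Ozdemir, Andrey Fedorov, Ron Kikinis, Stephanie Tanadini-Lang, Jan S. Kirschke, Stephanie E. Combs, Bjoern Menze
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: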
Abstract:Accurate delineation of anatomical structures in volumetric CT scans is crucial for diagnosis and treatment planning. While AI has advanced automated segmentation, current approaches typically target individual structures, creating a fragmented landscape of incompatible models with varying performance and disparate evaluation protocols. Foundational segmentation models address these limitations by providing a holistic anatomical view through a single model. Yet, robust clinical deployment demands comprehensive training data, which is lacking in existing whole-body approaches, both in terms of data heterogeneity and, more importantly, anatomical coverage. In this work, rather than pursuing incremental optimizations in model architecture, we present CADS, an open-source framework that prioritizes the systematic integration, standardization, and labeling of heterogeneous data sources for whole-body CT segmentation. At its core is a large-scale dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, representing a significant advancement in both scale and coverage, with 18 times more scans than existing collections and 60% more distinct anatomical targets. Building on this diverse dataset, we develop the CADS-model using established architectures for accessible and automated full-body CT segmentation. Through comprehensive evaluation across 18 public datasets and an independent real-world hospital cohort, we demonstrate advantages over SoTA approaches. Notably, thorough testing of the model’s performance in segmentation tasks from radiation oncology validates its direct utility for clinical interventions. By making our large-scale dataset, our segmentation models, and our clinical software tool publicly available, we aim to advance robust AI solutions in radiology and make comprehensive anatomical analysis accessible to clinicians and researchers alike.
Artificial Intelligence
[AI-0] Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics
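[Quick Read]: This paper addresses the spread of misinformation through synthetic content such as deepfake images and audio, and in particular the poor generalization of existing detectors across generator families and data modalities. The key idea is to use the latent representations of large pre-trained multi-modal models, which naturally capture information that separates real from fake content, and to train linear classifiers on these features, achieving strong, computationally efficient detection across modalities, including in few-shot settings.
Link: https://arxiv.org/abs/2508.00784
Authors: Tom Or, Omri Azencot
Affiliations: Ben Gurion University of the Negev
Categories: Artificial Intelligence (cs.AI)
Comments: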
Abstract:Generative models achieve remarkable results in multiple data domains, including images and texts, among other examples. Unfortunately, malicious users exploit synthetic media for spreading misinformation and disseminating deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially when new generative models appear every day. While the majority of existing work trains classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a universal classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.
[AI-1] A Simple and Effective Method for Uncertainty Quantification and OOD Detection
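[Quick Read]: This paper addresses the high computational and storage cost of existing uncertainty quantification methods such as Bayesian neural networks and deep ensembles. The key idea is to use a single deterministic model and estimate density in feature space to detect distributional shifts and out-of-distribution (OOD) samples: an information potential field derived from kernel density estimation approximates the feature-space density of the training set, and the feature representation of each test sample is compared against this density. Experiments on 2D synthetic datasets (Two Moons and Three Spirals) and an OOD detection task (CIFAR-10 vs. SVHN) show that the method outperforms baseline models.
Link: https://arxiv.org/abs/2508.00754
Authors: Yaxin Ma, Benjamin Colburn, Jose C. Principe
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: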
Abstract:Bayesian neural networks and deep ensemble methods have been proposed for uncertainty quantification; however, they are computationally intensive and require large storage. By utilizing a single deterministic model, we can solve the above issue. We propose an effective method based on feature space density to quantify uncertainty for distributional shifts and out-of-distribution (OOD) detection. Specifically, we leverage the information potential field derived from kernel density estimation to approximate the feature space density of the training set. By comparing this density with the feature space representation of test samples, we can effectively determine whether a distributional shift has occurred. Experiments were conducted on a 2D synthetic dataset (Two Moons and Three Spirals) as well as an OOD detection task (CIFAR-10 vs. SVHN). The results demonstrate that our method outperforms baseline models.
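A density score of the kind described can be sketched as the average Gaussian kernel between a test feature vector and the training features. This is a minimal sketch under assumptions: the unnormalized kernel, the `bandwidth` value, the fixed `threshold`, and the function names `information_potential` and `is_ood` are illustrative choices, not the paper's exact formulation.

```python
import math


def information_potential(x, train_features, bandwidth=1.0):
    """Mean Gaussian kernel between x and every training feature vector.

    The kernel is left unnormalized, so the score lies in (0, 1]:
    close to 1 near training data, close to 0 far from it.
    """
    total = 0.0
    for feature in train_features:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, feature))
        total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
    return total / len(train_features)


def is_ood(x, train_features, threshold, bandwidth=1.0):
    """Flag x as out-of-distribution when its density score is below threshold."""
    return information_potential(x, train_features, bandwidth) < threshold
```

In practice the feature vectors would come from an intermediate layer of the trained network, and the threshold would be tuned on held-out in-distribution data.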
[AI-2] Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
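[Quick Read]: This paper addresses the efficiency and accuracy of evaluating ranking algorithms for online search and recommendation. A/B tests need a long time to reach sufficient statistical power for conversion-based metrics, especially for significant purchases such as booking accommodations, while offline evaluation is often too inaccurate to select candidates for A/B tests. The key solution is a pair of interleaving and counterfactual evaluation methods that raise experiment sensitivity by a factor of up to 100 over traditional A/B testing (depending on the approach and metrics) and speed up online assessment, making it easier to identify the most promising candidates for A/B tests.
Link: https://arxiv.org/abs/2508.00751
Authors: Qing Zhang, Alex Deng, Michelle Du, Huiji Gao, Liwei He, Sanjeev Katariya
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages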
Abstract:Evaluation plays a crucial role in the development of ranking algorithms on search and recommender systems. It enables online platforms to create user-friendly features that drive commercial success in a steady and effective manner. The online environment is particularly conducive to applying causal inference techniques, such as randomized controlled experiments (known as A/B test), which are often more challenging to implement in fields like medicine and public policy. However, businesses face unique challenges when it comes to effective A/B test. Specifically, achieving sufficient statistical power for conversion-based metrics can be time-consuming, especially for significant purchases like booking accommodations. While offline evaluations are quicker and more cost-effective, they often lack accuracy and are inadequate for selecting candidates for A/B test. To address these challenges, we developed interleaving and counterfactual evaluation methods to facilitate rapid online assessments for identifying the most promising candidates for A/B tests. Our approach not only increased the sensitivity of experiments by a factor of up to 100 (depending on the approach and metrics) compared to traditional A/B testing but also streamlined the experimental process. The practical insights gained from usage in production can also benefit organizations with similar interests.
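The abstract does not say which interleaving variant is used. As one well-known example, the sketch below implements team-draft interleaving, in which two rankers alternately contribute their best unused item to a shared list and user actions are credited to the contributing ranker. The function names and the `booked_items` input are hypothetical, and nothing here is claimed to match Airbnb's production method.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=random):
    """Merge two rankings; returns (interleaved_list, item -> team map)."""
    rankings = {"A": ranking_a, "B": ranking_b}
    counts = {"A": 0, "B": 0}
    interleaved, teams = [], {}
    while len(interleaved) < length:
        remaining = {
            team: [item for item in ranking if item not in teams]
            for team, ranking in rankings.items()
        }
        if not remaining["A"] and not remaining["B"]:
            break
        # The team with fewer picks goes next; ties are broken at random.
        if counts["A"] < counts["B"]:
            team = "A"
        elif counts["B"] < counts["A"]:
            team = "B"
        else:
            team = rng.choice(["A", "B"])
        if not remaining[team]:
            team = "B" if team == "A" else "A"
        item = remaining[team][0]
        interleaved.append(item)
        teams[item] = team
        counts[team] += 1
    return interleaved, teams


def credit(teams, booked_items):
    """Count how many booked items each team contributed."""
    wins = {"A": 0, "B": 0}
    for item in booked_items:
        if item in teams:
            wins[teams[item]] += 1
    return wins
```

Because every user sees results from both rankers in one list, preferences are compared within a session, which is a common explanation for why interleaving needs less traffic than an A/B test to detect a difference.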
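[AI-3] How LLMs are Shaping the Future of Virtual Reality
[Quick Read]: This paper examines how Large Language Models (LLMs) can be integrated effectively into Virtual Reality (VR) games to make them more immersive, adaptive, and intelligent. Its central finding is that effective deployment requires robust design strategies that integrate multimodal interaction, hybrid AI architectures, and ethical safeguards, so that VR experiences become more realistic, creative, and engaging while real-time performance and scalability constraints are respected.
Link: https://arxiv.org/abs/2508.00737
Authors: Süeda Özkaya, Santiago Berrezueta-Guzman, Stefan Wagner
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Pre-print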
Abstract:The integration of Large Language Models (LLMs) into Virtual Reality (VR) games marks a paradigm shift in the design of immersive, adaptive, and intelligent digital experiences. This paper presents a comprehensive review of recent research at the intersection of LLMs and VR, examining how these models are transforming narrative generation, non-player character (NPC) interactions, accessibility, personalization, and game mastering. Drawing from an analysis of 62 peer reviewed studies published between 2018 and 2025, we identify key application domains ranging from emotionally intelligent NPCs and procedurally generated storytelling to AI-driven adaptive systems and inclusive gameplay interfaces. We also address the major challenges facing this convergence, including real-time performance constraints, memory limitations, ethical risks, and scalability barriers. Our findings highlight that while LLMs significantly enhance realism, creativity, and user engagement in VR environments, their effective deployment requires robust design strategies that integrate multimodal interaction, hybrid AI architectures, and ethical safeguards. The paper concludes by outlining future research directions in multimodal AI, affective computing, reinforcement learning, and open-source development, aiming to guide the responsible advancement of intelligent and inclusive VR systems.
[AI-4] Adaptive Machine Learning-Driven Multi-Fidelity Stratified Sampling for Failure Analysis of Nonlinear Stochastic Systems
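[Quick Read]: This paper addresses the high computational cost of rare-event analysis under stochastic excitation in complex nonlinear finite element modeling environments, where existing variance reduction techniques still require many model evaluations. The key solution is a multi-fidelity stratified sampling scheme with adaptive machine learning metamodels: a high-fidelity dataset generated by stratified sampling trains a deep learning metamodel, which serves as a cost-effective, highly correlated low-fidelity model; an adaptive training scheme balances approximation quality against computational demand; and a multi-fidelity Monte Carlo framework combines the low-fidelity outputs with additional high-fidelity results to give an unbiased estimate of the strata-wise failure probabilities, from which the overall failure probability follows by the total probability theorem. The scheme substantially reduces computational cost while accurately estimating exceedance probability curves for nonlinear responses.
Link: https://arxiv.org/abs/2508.00734
Authors: Liuyun Xu, Seymour M.J. Spence
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: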
Abstract:Existing variance reduction techniques used in stochastic simulations for rare event analysis still require a substantial number of model evaluations to estimate small failure probabilities. In the context of complex, nonlinear finite element modeling environments, this can become computationally challenging-particularly for systems subjected to stochastic excitation. To address this challenge, a multi-fidelity stratified sampling scheme with adaptive machine learning metamodels is introduced for efficiently propagating uncertainties and estimating small failure probabilities. In this approach, a high-fidelity dataset generated through stratified sampling is used to train a deep learning-based metamodel, which then serves as a cost-effective and highly correlated low-fidelity model. An adaptive training scheme is proposed to balance the trade-off between approximation quality and computational demand associated with the development of the low-fidelity model. By integrating the low-fidelity outputs with additional high-fidelity results, an unbiased estimate of the strata-wise failure probabilities is obtained using a multi-fidelity Monte Carlo framework. The overall probability of failure is then computed using the total probability theorem. Application to a full-scale high-rise steel building subjected to stochastic wind excitation demonstrates that the proposed scheme can accurately estimate exceedance probability curves for nonlinear responses of interest, while achieving significant computational savings compared to single-fidelity variance reduction approaches.
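The two estimation steps named in the abstract can be sketched as follows. Within a stratum, a cheap low-fidelity mean over many samples is corrected by the mean high-minus-low difference on a small paired set; the strata are then combined with the total probability theorem. This is a simplified sketch: it fixes the control-variate coefficient at 1, which keeps the estimate unbiased but may differ from the paper's estimator, and all function and argument names are hypothetical.

```python
def stratum_failure_probability(lf_large, hf_small, lf_small):
    """Multi-fidelity estimate of the failure probability in one stratum.

    lf_large: low-fidelity failure indicators (0/1) on many samples.
    hf_small, lf_small: high- and low-fidelity indicators evaluated on
    the same small set of samples (paired).
    """
    lf_mean = sum(lf_large) / len(lf_large)
    correction = sum(h - l for h, l in zip(hf_small, lf_small)) / len(hf_small)
    return lf_mean + correction


def total_failure_probability(stratum_probs, stratum_weights):
    """Total probability theorem: sum of P(stratum) * P(failure | stratum)."""
    return sum(w * p for w, p in zip(stratum_weights, stratum_probs))
```

With few paired samples the corrected estimate can fall slightly outside [0, 1]; in this sketch it is left unclipped so that the estimator stays unbiased.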
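[AI-5] Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning
[Quick Read]: This paper addresses the performance degradation that source-domain label noise causes in Graph Domain Adaptation (GDA). Existing methods typically assume clean source labels, but label noise is pervasive in practice and severely impairs feature alignment and cross-domain transfer. The key to the solution is a new framework, Nested Graph Pseudo-Label Refinement (NeGPR): first, dual branches (a semantic branch and a topology branch) are pretrained while enforcing neighborhood consistency in the feature space, which weakens the influence of noisy supervision; second, a nested pseudo-label refinement mechanism lets one branch select high-confidence target samples to guide the other, enabling progressive cross-domain learning; finally, a noise-aware regularization strategy is added, which is theoretically proven to mitigate the adverse effects of pseudo-label noise and improves robustness even when the branches have overfitted to the noisy source labels.
Link: https://arxiv.org/abs/2508.00716
Authors: Yingxu Wang,Mengzhu Wang,Zhichao Huang,Suyu Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: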
Abstract:Graph Domain Adaptation (GDA) facilitates knowledge transfer from labeled source graphs to unlabeled target graphs by learning domain-invariant representations, which is essential in applications such as molecular property prediction and social network analysis. However, most existing GDA methods rely on the assumption of clean source labels, which rarely holds in real-world scenarios where annotation noise is pervasive. This label noise severely impairs feature alignment and degrades adaptation performance under domain shifts. To address this challenge, we propose Nested Graph Pseudo-Label Refinement (NeGPR), a novel framework tailored for graph-level domain adaptation with noisy labels. NeGPR first pretrains dual branches, i.e., semantic and topology branches, by enforcing neighborhood consistency in the feature space, thereby reducing the influence of noisy supervision. To bridge domain gaps, NeGPR employs a nested refinement mechanism in which one branch selects high-confidence target samples to guide the adaptation of the other, enabling progressive cross-domain learning. Furthermore, since pseudo-labels may still contain noise and the pre-trained branches are already overfitted to the noisy labels in the source domain, NeGPR incorporates a noise-aware regularization strategy. This regularization is theoretically proven to mitigate the adverse effects of pseudo-label noise, even under the presence of source overfitting, thus enhancing the robustness of the adaptation process. Extensive experiments on benchmark datasets demonstrate that NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving gains of up to 12.7% in accuracy.
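[AI-6] JSON-Bag: A generic game trajectory representation
[Quick Read]: This paper addresses the generic representation and classification of game trajectories, that is, how to model and compare games generated by different players, parameters, or seeds. The core challenge is that traditional methods rely on hand-crafted features, which generalize poorly and are inefficient. The key to the solution is the JSON Bag-of-Tokens model (JSON-Bag), which tokenizes the JSON description of a game trajectory into a bag of tokens, uses the Jensen-Shannon distance (JSD) as the metric, and classifies with a prototype-based nearest-neighbor search (P-NNS). Experiments on six tabletop games show that it outperforms a hand-crafted-feature baseline on most tasks and is sample efficient; using the tokens as input features to a Random Forest provides automatic feature extraction and significantly improves accuracy on underperforming tasks. The study further shows that the JSD between agent-class prototypes correlates highly with the distance between the agents' policies, supporting the validity and interpretability of the representation.
Link: https://arxiv.org/abs/2508.00712
Authors: Dien Nguyen,Diego Perez-Liebana,Simon Lucas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 6 tables, to be published in IEEE Conference on Games 2025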
Abstract:We introduce the JSON Bag-of-Tokens model (JSON-Bag) as a method to generically represent game trajectories by tokenizing their JSON descriptions, and apply the Jensen-Shannon distance (JSD) as the distance metric for them. Using a prototype-based nearest-neighbor search (P-NNS), we evaluate the validity of JSON-Bag with JSD on six tabletop games (7 Wonders, Dominion, Sea Salt and Paper, Can’t Stop, Connect4, Dots and boxes), each over three game trajectory classification tasks: classifying the playing agents, game parameters, or game seeds that were used to generate the trajectories. Our approach outperforms a baseline using hand-crafted features in the majority of tasks. Evaluating on N-shot classification suggests that using JSON-Bag prototypes to represent game trajectory classes is also sample efficient. Additionally, we demonstrate JSON-Bag's ability for automatic feature extraction by treating tokens as individual features to be used in a Random Forest to solve the tasks above, which significantly improves accuracy on underperforming tasks. Finally, we show that, across all six games, the JSD between JSON-Bag prototypes of agent classes highly correlates with the distances between agents’ policies.
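The pipeline in this abstract (tokenize, compare with JSD, classify by nearest prototype) can be sketched with the standard library alone. The tokenization scheme (one `path=value` token per JSON leaf) and all function names below are assumptions for illustration; the paper's tokenizer may differ.

```python
import json
import math
from collections import Counter


def tokenize(obj, prefix=""):
    """Flatten a parsed JSON value into 'path=value' tokens (assumed scheme)."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from tokenize(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for item in obj:
            yield from tokenize(item, prefix)
    else:
        yield f"{prefix.rstrip('.')}={obj}"


def bag(trajectory_json):
    """Normalised token-frequency distribution of one non-empty trajectory."""
    counts = Counter(tokenize(json.loads(trajectory_json)))
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    div = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return math.sqrt(max(div, 0.0))


def prototype(bags):
    """Average several bags into one class prototype."""
    tokens = set().union(*bags)
    return {t: sum(b.get(t, 0.0) for b in bags) / len(bags) for t in tokens}


def classify(query_bag, prototypes):
    """Return the label whose prototype is nearest in JS distance."""
    return min(prototypes, key=lambda lab: js_distance(query_bag, prototypes[lab]))
```

With base-2 logarithms the distance lies in [0, 1]: identical bags give 0 and bags with no shared tokens give 1, which makes distances comparable across games with very different vocabularies.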
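[AI-7] Efficient Solution and Learning of Robust Factored MDPs
[Quick Read]: This paper addresses the low sample efficiency of learning robust Markov decision processes (r-MDPs): without a structured way to model uncertainty about the environment dynamics, especially in high-dimensional state spaces, policy synthesis is computationally hard and tight performance guarantees are difficult to obtain. The key to the solution is to introduce factored state-space representations that exploit the independence of model uncertainty across system components, reformulate the resulting hard, non-convex optimization problems as tractable linear programs, and on that basis learn factored model representations directly, which markedly improves sample efficiency and yields robust policies with tighter performance guarantees.
Link: https://arxiv.org/abs/2508.00707
Authors: Yannik Schnitzer,Alessandro Abate,David Parker
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: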
Abstract:Robust Markov decision processes (r-MDPs) extend MDPs by explicitly modelling epistemic uncertainty about transition dynamics. Learning r-MDPs from interactions with an unknown environment enables the synthesis of robust policies with provable (PAC) guarantees on performance, but this can require a large number of sample interactions. We propose novel methods for solving and learning r-MDPs based on factored state-space representations that leverage the independence between model uncertainty across system components. Although policy synthesis for factored r-MDPs leads to hard, non-convex optimisation problems, we show how to reformulate these into tractable linear programs. Building on these, we also propose methods to learn factored model representations directly. Our experimental results show that exploiting factored structure can yield dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees than state-of-the-art methods.
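As a point of reference for the robust Bellman backup that r-MDP solvers build on, here is a minimal sketch for a flat (non-factored) MDP with interval uncertainty on each transition probability. This is a simpler setting than the factored r-MDPs and linear programs in the paper; the function names and the interval representation are assumptions made for the example. The inner minimisation is solved greedily by pushing the free probability mass toward the lowest-valued successor states.

```python
def worst_case_expectation(values, lower, upper):
    """Minimise sum(p[s] * values[s]) over lower <= p <= upper, sum(p) == 1."""
    p = list(lower)
    budget = 1.0 - sum(lower)  # probability mass still to be assigned
    # Give as much of the remaining mass as allowed to low-value states first.
    for s in sorted(range(len(values)), key=lambda s: values[s]):
        extra = min(upper[s] - lower[s], budget)
        p[s] += extra
        budget -= extra
    return sum(pi * v for pi, v in zip(p, values))


def robust_value_iteration(rewards, intervals, gamma=0.9, iters=200):
    """Robust value iteration.

    rewards[s][a] is the immediate reward; intervals[s][a] is a pair
    (lower, upper) of per-successor probability bounds.
    """
    n = len(rewards)
    v = [0.0] * n
    for _ in range(iters):
        v = [
            max(
                rewards[s][a]
                + gamma * worst_case_expectation(v, *intervals[s][a])
                for a in range(len(rewards[s]))
            )
            for s in range(n)
        ]
    return v
```

In the two-state usage below, state 1 is absorbing with zero reward, so the adversary moves as much mass as the intervals allow (0.5) into it and the robust value of state 0 is 1 / (1 - 0.9 * 0.5).

```python
rewards = [[1.0], [0.0]]
intervals = [[([0.5, 0.1], [0.9, 0.5])], [([0.0, 1.0], [0.0, 1.0])]]
v = robust_value_iteration(rewards, intervals)
```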
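[AI-8] Context-Aware Visualization for Explainable AI Recommendations in Social Media: A Vision for User-Aligned Explanations
[Quick Read]: This paper tackles the lack of explainability in AI recommender systems on social platforms: current explanations are generic and not tailored to individual users' needs and contexts, so users struggle to understand why an item was recommended, which undermines the value of the recommendation. The key to the solution is a user-segmented, context-aware explanation layer realized as a visual explanation system that integrates multiple explanation methods. The system adjusts the style (visual vs. numeric) and granularity (expert vs. lay) of explanations according to user type (for example, AI expert or lay user) and usage context, and is the first to jointly adapt explanation form and audience within a single pipeline, with the aim of improving users' understanding of and trust in recommendation decisions.
Link: https://arxiv.org/abs/2508.00674
Authors: Banan Alkhateeb,Ellis Solaiman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: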
Abstract:Social media platforms today strive to improve user experience through AI recommendations, yet the value of such recommendations vanishes as users do not understand the reasons behind them. This issue arises because explainability in social media is general and lacks alignment with user-specific needs. In this vision paper, we outline a user-segmented and context-aware explanation layer by proposing a visual explanation system with diverse explanation methods. The proposed system is framed by the variety of user needs and contexts, showing explanations in different visualized forms, including a technically detailed version for AI experts and a simplified one for lay users. Our framework is the first to jointly adapt explanation style (visual vs. numeric) and granularity (expert vs. lay) inside a single pipeline. A public pilot with 30 X users will validate its impact on decision-making and trust.
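[AI-9] Transparent Adaptive Learning via Data-Centric Multimodal Explainable AI
[Quick Read]: This paper tackles the lack of transparency in current AI-driven adaptive learning systems in education: these systems struggle to explain their decision logic clearly to users such as teachers and students, while existing explainable AI (XAI) methods mostly focus on technical outputs and neglect differences in user roles and comprehension. The key to the solution is a hybrid framework that combines traditional XAI techniques with generative AI models and user personalization to produce multimodal, personalized explanations, redefining explainability as a communication process that adapts dynamically to user roles and learning goals, so as to increase system transparency and support user-centred learning experiences.
Link: https://arxiv.org/abs/2508.00665
Authors: Maryam Mosleh,Marie Devlin,Ellis Solaiman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: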
Abstract:Artificial intelligence-driven adaptive learning systems are reshaping education through data-driven adaptation of learning experiences. Yet many of these systems lack transparency, offering limited insight into how decisions are made. Most explainable AI (XAI) techniques focus on technical outputs but neglect user roles and comprehension. This paper proposes a hybrid framework that integrates traditional XAI techniques with generative AI models and user personalisation to generate multimodal, personalised explanations tailored to user needs. We redefine explainability as a dynamic communication process tailored to user roles and learning goals. We outline the framework’s design, key XAI limitations in education, and research directions on accuracy, fairness, and personalisation. Our aim is to move towards explainable AI that enhances transparency while supporting user-centred experiences.
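[AI-10] Multi-Band Variable-Lag Granger Causality: A Unified Framework for Causal Time Series Inference across Frequencies
[Quick Read]: This paper addresses two limitations: the fixed-lag assumption of traditional Granger causality analysis, which is unrealistic in complex systems, and the failure of existing Variable-Lag Granger Causality (VLGC) methods to account for causal delays that differ across frequency bands. The key to the solution is the Multi-Band Variable-Lag Granger Causality (MB-VLGC) framework, which explicitly models frequency-dependent causal lags, extending VLGC to capture differences in the delay of causal influence across bands (for example, alpha versus delta waves in EEG signals) and thus describe the causal structure of real-world time series more accurately.
Link: https://arxiv.org/abs/2508.00658
Authors: Chakattrai Sookkongwaree,Tattep Lakmuang,Chainarong Amornbunchornvej
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
Comments: First draft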
Abstract:Understanding causal relationships in time series is fundamental to many domains, including neuroscience, economics, and behavioral science. Granger causality is one of the well-known techniques for inferring causality in time series. Typically, Granger causality frameworks have a strong fixed-lag assumption between cause and effect, which is often unrealistic in complex systems. While recent work on variable-lag Granger causality (VLGC) addresses this limitation by allowing a cause to influence an effect with different time lags at each time point, it fails to account for the fact that causal interactions may vary not only in time delay but also across frequency bands. For example, in brain signals, alpha-band activity may influence another region with a shorter delay than slower delta-band oscillations. In this work, we formalize Multi-Band Variable-Lag Granger Causality (MB-VLGC) and propose a novel framework that generalizes traditional VLGC by explicitly modeling frequency-dependent causal delays. We provide a formal definition of MB-VLGC, demonstrate its theoretical soundness, and propose an efficient inference pipeline. Extensive experiments across multiple domains demonstrate that our framework significantly outperforms existing methods on both synthetic and real-world datasets, confirming its broad applicability to any type of time series data. Code and datasets are publicly available.
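[AI-11] Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings
[Quick Read]: This paper addresses two challenges that current Large Language Models (LLMs) face when generating interactive audio-visual content such as video games: the lack of automated metrics for evaluating multimedia content quality, and the difficulty of complex creative tasks that normally require teams of humans working over many iterations (multi-shot, multi-agent) with assets made by artists. The authors propose two core components. The first is AVR-Eval, a relative metric based on Audio-Visual Recordings (AVRs): an omni-modal model compares the recordings of two pieces of content and a text model reviews the evaluations to decide which is better, which separates good content from broken or mismatched content. The second is AVR-Agent, a multi-agent generation framework that selects relevant assets (audio, images, 3D models) from a multimedia asset bank, generates initial JavaScript code, and improves it iteratively using AVR-Eval. Experiments show that content generated by AVR-Agent has a significantly higher win rate than one-shot generation, but models fail to benefit from custom assets and AVR feedback, revealing a fundamental difference between current generative AI and human creators in how they use high-quality assets and feedback.
Link: https://arxiv.org/abs/2508.00632
Authors: Alexia Jolicoeur-Martineau
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: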
Abstract:While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agents) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality using Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two contents, with a text model reviewing evaluations to determine superiority. We show that AVR-Eval properly identifies good from broken or mismatched content. We built AVR-Agent, a multi-agent system generating JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial codes, uses AVR-Eval to identify the best version, and iteratively improves it through omni-modal agent feedback from the AVR. We run experiments on games and animations with AVR-Eval (win rate of content A against B). We find that content generated by AVR-Agent has a significantly higher win rate against content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine content creation approaches. 
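[AI-12] Similarity-Based Self-Construct Graph Model for Predicting Patient Criticalness Using Graph Neural Networks and EHR Data
[Quick Read]: This paper addresses a weakness of conventional risk prediction models for intensive care unit (ICU) patients: they treat each patient in isolation and therefore cannot exploit the latent relational structure in Electronic Health Records (EHR). The core solution is a Similarity-Based Self-Construct Graph Model (SBSCGM), which dynamically builds a patient similarity graph from multi-modal EHR data, combined with a hybrid graph neural network architecture (HybridGraphMedGNN) for prediction. The key innovations are: 1) a hybrid measure that fuses feature similarity and structural similarity to connect patients with similar clinical profiles in real time; 2) the integration of Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Network (GAT) layers, which captures both local and global graph patterns and improves prediction of patient mortality risk and a continuous criticalness score (AUC-ROC of 0.94), while the attention mechanism provides interpretable insights.
Link: https://arxiv.org/abs/2508.00615
Authors: Mukesh Kumar Sahu,Pinki Roy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: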
Abstract:Accurately predicting the criticalness of ICU patients (such as in-ICU mortality risk) is vital for early intervention in critical care. However, conventional models often treat each patient in isolation and struggle to exploit the relational structure in Electronic Health Records (EHR). We propose a Similarity-Based Self-Construct Graph Model (SBSCGM) that dynamically builds a patient similarity graph from multi-modal EHR data, and a HybridGraphMedGNN architecture that operates on this graph to predict patient mortality and a continuous criticalness score. SBSCGM uses a hybrid similarity measure (combining feature-based and structural similarities) to connect patients with analogous clinical profiles in real-time. The HybridGraphMedGNN integrates Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Network (GAT) layers to learn robust patient representations, leveraging both local and global graph patterns. In experiments on 6,000 ICU stays from the MIMIC-III dataset, our model achieves state-of-the-art performance (AUC-ROC 0.94), outperforming baseline classifiers and single-type GNN models. We also demonstrate improved precision/recall and show that the attention mechanism provides interpretable insights into model predictions. Our framework offers a scalable and interpretable solution for critical care risk prediction, with potential to support clinicians in real-world ICU deployment.
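A patient similarity graph of this kind could be built roughly as follows. The sketch assumes each patient has a numeric feature vector and a set of categorical codes, and it blends cosine similarity with Jaccard overlap as a stand-in for the paper's hybrid measure, whose exact form the abstract does not give; `weight` and `threshold` are hypothetical parameters.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(features, codes, weight=0.5, threshold=0.6):
    """Adjacency sets linking patients whose blended similarity >= threshold."""
    n = len(features)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            score = (weight * cosine(features[i], features[j])
                     + (1 - weight) * jaccard(codes[i], codes[j]))
            if score >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj
```

The pairwise loop is quadratic in the number of patients, which is fine for a sketch; a real pipeline at the scale of thousands of ICU stays would more likely use an approximate nearest-neighbour index.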
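[AI-13] Composable OS Kernel Architectures for Autonomous Intelligence
[Quick Read]: This paper addresses the performance bottleneck of traditional OS kernels when serving intelligent systems (such as autonomous applications in edge computing and embedded real-time environments), whose static resource management cannot meet the demands of dynamic, complex, and highly time-critical AI workloads. The key to the solution is threefold: first, Loadable Kernel Modules (LKMs) are recast as AI-oriented computation units for fast sensory and cognitive processing in kernel space; second, the Linux kernel is extended with native deep learning inference, floating-point acceleration, and real-time adaptive scheduling to form an AI-native environment that runs machine learning (ML) workloads efficiently; third, a neurosymbolic kernel design based on Category Theory and Homotopy Type Theory unifies symbolic reasoning and differentiable logic, enabling the operating system to proactively anticipate and adapt to the cognitive needs of intelligent applications.
Link: https://arxiv.org/abs/2508.00604
Authors: Rajpreet Singh,Vidhi Kothari
Affiliations: Unknown
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
Comments: 8 pages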
Abstract:As intelligent systems permeate edge devices, cloud infrastructure, and embedded real-time environments, this research proposes a new OS kernel architecture for intelligent systems, transforming kernels from static resource managers to adaptive, AI-integrated platforms. Key contributions include: (1) treating Loadable Kernel Modules (LKMs) as AI-oriented computation units for fast sensory and cognitive processing in kernel space; (2) expanding the Linux kernel into an AI-native environment with built-in deep learning inference, floating-point acceleration, and real-time adaptive scheduling for efficient ML workloads; and (3) introducing a Neurosymbolic kernel design leveraging Category Theory and Homotopy Type Theory to unify symbolic reasoning and differentiable logic within OS internals. Together, these approaches enable operating systems to proactively anticipate and adapt to the cognitive needs of autonomous intelligent applications.
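[AI-14] LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks
[Quick Read]: This paper addresses two major security threats facing deployed Large Language Models (LLMs): jailbreaking and leakage of sensitive data, including the vulnerabilities introduced by Retrieval Augmented Generation (RAG) architectures. The key to the solution is LeakSealer, a model-agnostic framework that combines static analysis with dynamic defenses in a Human-In-The-Loop (HITL) pipeline to identify and respond to anomalous patterns. Specifically, LeakSealer analyzes historical interaction data to produce usage maps categorized by topic, which provides forensic insight into how jailbreaking attacks evolve, and it detects leakage of personally identifiable information (PII). Empirically, it achieves the highest precision and recall for prompt injection on the ToxicChat dataset in the static setting, and an AUPRC of 0.97 for PII leakage detection in the dynamic setting, significantly outperforming baselines such as Llama Guard.
Link: https://arxiv.org/abs/2508.00602
Authors: Francesco Panebianco,Stefano Bonfanti,Francesco Trovò,Michele Carminati
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, preprint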
Abstract:The generalization capabilities of Large Language Models (LLMs) have led to their widespread deployment across various applications. However, this increased adoption has introduced several security threats, notably in the forms of jailbreaking and data leakage attacks. Additionally, Retrieval Augmented Generation (RAG), while enhancing context-awareness in LLM responses, has inadvertently introduced vulnerabilities that can result in the leakage of sensitive information. Our contributions are twofold. First, we introduce a methodology to analyze historical interaction data from an LLM system, enabling the generation of usage maps categorized by topics (including adversarial interactions). This approach further provides forensic insights for tracking the evolution of jailbreaking attack patterns. Second, we propose LeakSealer, a model-agnostic framework that combines static analysis for forensic insights with dynamic defenses in a Human-In-The-Loop (HITL) pipeline. This technique identifies topic groups and detects anomalous patterns, allowing for proactive defense mechanisms. We empirically evaluate LeakSealer under two scenarios: (1) jailbreak attempts, employing a public benchmark dataset, and (2) PII leakage, supported by a curated dataset of labeled LLM interactions. In the static setting, LeakSealer achieves the highest precision and recall on the ToxicChat dataset when identifying prompt injection. In the dynamic setting, PII leakage detection achieves an AUPRC of 0.97, significantly outperforming baselines such as Llama Guard.
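[AI-15] From EMR Data to Clinical Insight: An LLM-Driven Framework for Automated Pre-Consultation Questionnaire Generation
[Quick Read]: This paper addresses the difficulty of generating comprehensive, logically ordered, disease-specific pre-consultation questionnaires from complex and voluminous Electronic Medical Records (EMRs). Approaches that apply a Large Language Model (LLM) directly show clear limitations in information completeness, logical order, and disease-level knowledge synthesis. The key to the solution is a multi-stage LLM-driven framework: Stage 1 extracts atomic assertions (key facts with timing information) from EMRs; Stage 2 constructs personal causal networks and synthesizes disease knowledge by clustering; Stage 3 uses these structures to generate both tailored personal questionnaires and standardized disease-specific questionnaires. By building explicit clinical knowledge structures, the method improves information coverage, diagnostic relevance, understandability, and generation time.
Link: https://arxiv.org/abs/2508.00581
Authors: Ruiqing Ding,Qianfang Sun,Yongkang Leng,Hui Yin,Xiaojian Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures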
Abstract:Pre-consultation is a critical component of effective healthcare delivery. However, generating comprehensive pre-consultation questionnaires from complex, voluminous Electronic Medical Records (EMRs) is a challenging task. Direct Large Language Model (LLM) approaches face difficulties in this task, particularly regarding information completeness, logical order, and disease-level synthesis. To address this issue, we propose a novel multi-stage LLM-driven framework: Stage 1 extracts atomic assertions (key facts with timing) from EMRs; Stage 2 constructs personal causal networks and synthesizes disease knowledge by clustering representative networks from an EMR corpus; Stage 3 generates tailored personal and standardized disease-specific questionnaires based on these structured representations. This framework overcomes limitations of direct methods by building explicit clinical knowledge. Evaluated on a real-world EMR dataset and validated by clinical experts, our method demonstrates superior performance in information coverage, diagnostic relevance, understandability, and generation time, highlighting its practical potential to enhance patient information collection.
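[AI-16] OmniUnet: A Multimodal Network for Unstructured Terrain Segmentation on Planetary Rovers Using RGB Depth and Thermal Imagery
[Quick Read]: This paper addresses how robots navigating unstructured environments can use multimodal perception (RGB, depth, and thermal imagery) for safe and accurate semantic segmentation. The key to the solution is OmniUnet, a transformer-based neural network that fuses RGB-D-T data to improve segmentation of complex terrain. The authors also built a custom multimodal sensor housing and collected a labeled real-world dataset, and they measured inference on a resource-constrained platform (average prediction time of 673 ms), which supports the model's suitability for on-robot deployment and multimodal terrain perception on planetary rovers.
Link: https://arxiv.org/abs/2508.00580
Authors: Raul Castilla-Arquillo,Carlos Perez-del-Pulgar,Levin Gerdes,Alfonso Garcia-Cerezo,Miguel A. Olivares-Mendez
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: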
Abstract:Robot navigation in unstructured environments requires multimodal perception systems that can support safe navigation. Multimodality enables the integration of complementary information collected by different sensors. However, this information must be processed by machine learning algorithms specifically designed to leverage heterogeneous data. Furthermore, it is necessary to identify which sensor modalities are most informative for navigation in the target environment. In Martian exploration, thermal imagery has proven valuable for assessing terrain safety due to differences in thermal behaviour between soil types. This work presents OmniUnet, a transformer-based neural network architecture for semantic segmentation using RGB, depth, and thermal (RGB-D-T) imagery. A custom multimodal sensor housing was developed using 3D printing and mounted on the Martian Rover Testbed for Autonomy (MaRTA) to collect a multimodal dataset in the Bardenas semi-desert in northern Spain. This location serves as a representative environment of the Martian surface, featuring terrain types such as sand, bedrock, and compact soil. A subset of this dataset was manually labeled to support supervised training of the network. The model was evaluated both quantitatively and qualitatively, achieving a pixel accuracy of 80.37% and demonstrating strong performance in segmenting complex unstructured terrain. Inference tests yielded an average prediction time of 673 ms on a resource-constrained computer (Jetson Orin Nano), confirming its suitability for on-robot deployment. The software implementation of the network and the labeled dataset have been made publicly available to support future research in multimodal terrain perception for planetary robotics.
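[AI-17] MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models
[Quick Read]: This paper addresses the trust problem caused by the lack of interpretability of multimodal AI models in high-stakes applications, in particular how to precisely quantify synergistic or suppressive effects between fine-grained elements of different modalities (such as image patches and text tokens). The key to the solution is MultiSHAP, a model-agnostic explanation framework based on the Shapley Interaction Index. It provides instance-level explanations that reveal cross-modal effects for individual samples and dataset-level explanations that identify general interaction patterns, giving precise attribution of multimodal predictions for both open- and closed-source models.
Link: https://arxiv.org/abs/2508.00576
Authors: Zhanliang Wang,Kai Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: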
Abstract:Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. However, their “black-box” nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. How to explain cross-modal interactions in multimodal AI models remains a major challenge. While existing model explanation methods, such as attention map and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while being applicable to both open- and closed-source models. Our approach provides: (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples - “why the model makes a specific prediction on this input”, and (2) dataset-level explanation that uncovers generalizable interaction patterns across samples - “how the model integrates information across modalities”. Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, while real-world case studies demonstrate its practical utility. Our framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.
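For small inputs the Shapley interaction index can be computed exactly by enumerating coalitions. The sketch below does this for a generic value function over n players; treating image patches and text tokens as players, and the model output with all other elements masked as the value function, is an assumption about how such a framework would be applied, not a description of MultiSHAP's implementation. A positive index indicates synergy between the pair, a negative one indicates redundancy or suppression.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(value, n, i, j):
    """Exact Shapley interaction index of players i and j in an n-player game.

    `value` maps a frozenset of player indices to a float.
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of coalitions of this size: |S|! (n - |S| - 2)! / (n - 1)!
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second derivative: joint effect minus individual effects.
            total += weight * (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    return total
```

Exact enumeration visits 2^(n-2) coalitions per pair, so it is only practical for a handful of players; for realistic numbers of patches and tokens some sampling-based approximation would be needed.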
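[AI-18] Analysing Temporal Reasoning in Description Logics Using Formal Grammars ECAI2025
[Quick Read]: This paper settles open questions about the temporal description logic TEL◯: whether its models have the ultimate periodicity property and whether query answering is decidable, the latter having been left open since TEL◯ was introduced. The key to the solution is a correspondence between (fragments of) TEL◯ and specific kinds of formal grammars, in particular conjunctive grammars (context-free grammars equipped with intersection). This correspondence implies that TEL◯ lacks ultimate periodicity of models and that query answering in TEL◯ is undecidable; it also establishes decidability of query answering for some new fragments and allows existing tools and algorithms for conjunctive grammars to be reused.
Link: https://arxiv.org/abs/2508.00575
Authors: Camille Bourgaux,Anton Gnatenko,Michaël Thomazo
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: This is an extended version of a paper appearing at the 28th European Conference on Artificial Intelligence (ECAI 2025). 20 pages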
Abstract:We establish a correspondence between (fragments of) $\mathcal{TEL}^{\bigcirc}$, a temporal extension of the $\mathcal{EL}$ description logic with the LTL operator $\bigcirc^k$, and some specific kinds of formal grammars, in particular, conjunctive grammars (context-free grammars equipped with the operation of intersection). This connection implies that $\mathcal{TEL}^{\bigcirc}$ does not possess the property of ultimate periodicity of models, and further leads to undecidability of query answering in $\mathcal{TEL}^{\bigcirc}$, closing a question left open since the introduction of $\mathcal{TEL}^{\bigcirc}$. Moreover, it also allows us to establish decidability of query answering for some new interesting fragments of $\mathcal{TEL}^{\bigcirc}$, and to reuse for this purpose existing tools and algorithms for conjunctive grammars.
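[AI-19] SPENCER: Self-Adaptive Model Distillation for Efficient Code Retrieval
[Quick Read]: This paper addresses the trade-off between efficiency and accuracy in code retrieval. Existing dual-encoder methods are efficient at inference, but their performance ceiling is limited because the code snippet and the natural-language description do not interact in the lower layers of the model. The authors propose the SPENCER framework, which first uses a dual-encoder to narrow the search space and then a cross-encoder to improve accuracy. It also introduces a Self-AdaPtive Model Distillation technique that reduces the dual-encoder's inference time by 70% while retaining over 98% of overall performance, together with a teaching-assistant selection strategy that adapts to different pre-trained models.
Link: https://arxiv.org/abs/2508.00546
Authors: Wenchao Gu,Zongyi Lyu,Yanlin Wang,Hongyu Zhang,Cuiyun Gao,Michael R. Lyu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: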
Abstract:Code retrieval aims to provide users with desired code snippets based on users’ natural language queries. With the development of deep learning technologies, adopting pre-trained models for this task has become mainstream. Considering the retrieval efficiency, most of the previous approaches adopt a dual-encoder for this task, which encodes the description and code snippet into representation vectors, respectively. However, the model structure of the dual-encoder tends to limit the model’s performance, since it lacks the interaction between the code snippet and description at the bottom layer of the model during training. To improve the model’s effectiveness while preserving its efficiency, we propose a framework, which adopts Self-AdaPtive Model Distillation for Efficient CodE Retrieval, named SPENCER. SPENCER first adopts the dual-encoder to narrow the search space and then adopts the cross-encoder to improve accuracy. To improve the efficiency of SPENCER, we propose a novel model distillation technique, which can greatly reduce the inference time of the dual-encoder while maintaining the overall performance. We also propose a teaching assistant selection strategy for our model distillation, which can adaptively select the suitable teaching assistant models for different pre-trained models during the model distillation to ensure the model performance. Extensive experiments demonstrate that the combination of dual-encoder and cross-encoder improves overall performance compared to solely dual-encoder-based models for code retrieval. Besides, our model distillation technique retains over 98% of the overall performance while reducing the inference time of the dual-encoder by 70%.
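[AI-20] Foundations of Interpretable Models
[Quick Read]: This paper tackles the problem that existing definitions of interpretability in interpretable AI research are vague and not actionable: they give users no guidance on designing general, sound, and robust interpretable models, which leaves the research fundamentally ill-posed. The key to the solution is a new definition that is general, simple, and subsumes existing informal notions. The definition is actionable in that it directly reveals the foundational properties, assumptions, principles, data structures, and architectural features needed to design interpretable models; building on it, the authors propose a general design blueprint and the first open-source library with native support for interpretable data structures and processes.
Link: https://arxiv.org/abs/2508.00545
Authors: Pietro Barbiero,Mateo Espinosa Zarlenga,Alberto Termine,Mateja Jamnik,Giuseppe Marra
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: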
Abstract:We argue that existing definitions of interpretability are not actionable in that they fail to inform users about general, sound, and robust interpretable model design. This makes current interpretability research fundamentally ill-posed. To address this issue, we propose a definition of interpretability that is general, simple, and subsumes existing informal notions within the interpretable AI community. We show that our definition is actionable, as it directly reveals the foundational properties, underlying assumptions, principles, data structures, and architectural features necessary for designing interpretable models. Building on this, we propose a general blueprint for designing interpretable models and introduce the first open-sourced library with native support for interpretable data structures and processes.
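[AI-21] Towards a Measure Theory of Semantic Information
[Quick Read]: This paper addresses the Bar-Hillel-Carnap paradox, in which the classic theory of semantic information quantification assigns maximum informativeness to a contradiction. The author first critiques Floridi's theory of strongly semantic information, which is based on a distance metric and a parabolic relation, and shows that it does not fully remove the paradox. The proposed solution is based on the unit circle, a structure used from basic trigonometry to quantum theory. By analogy with von Neumann's quantum probability, it constructs a measure space for informativeness that meets all of Floridi's requirements, removes the paradox, and yields the further result that mutually contradictory messages are equally informative, which is illustrated with an example.
Link: https://arxiv.org/abs/2508.00525
Authors: George M. Coghill
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures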
Abstract:A classic account of the quantification of semantic information is that of Bar-Hillel and Carnap. Their account proposes an inverse relation between the informativeness of a statement and its probability. However, their approach assigns the maximum informativeness to a contradiction, which Floridi refers to as the Bar-Hillel-Carnap paradox. He developed a novel theory founded on a distance metric and parabolic relation, designed to remove this paradox. Unfortunately, his approach does not succeed in that aim. In this paper I critique Floridi’s theory of strongly semantic information on its own terms and show where it succeeds and fails. I then present a new approach based on the unit circle (a relation that has been the basis of theories from basic trigonometry to quantum theory). This is used, by analogy with von Neumann’s quantum probability, to construct a measure space for informativeness that meets all the requirements stipulated by Floridi and removes the paradox. In addition, while contradictions and tautologies have zero informativeness, it is found that messages which are contradictory to each other are equally informative. The utility of this is explained by means of an example.
[AI-22] Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking
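Quick read: This paper targets the hard-to-anticipate safety risks that the stochastic behavior of Large Language Model (LLM) agents introduces in settings such as robotics, virtual assistants, and web automation. Existing rule-based enforcement systems such as AgentSpec are mostly reactive: they respond only when unsafe behavior is imminent or has already occurred, so they lack foresight and struggle with long-horizon dependencies and distribution shifts. The key to the solution is Pro2Guard, a proactive runtime enforcement framework based on probabilistic reachability analysis. It abstracts agent behaviors into symbolic states and learns a Discrete-Time Markov Chain (DTMC) from execution traces; at runtime it estimates the probability of reaching unsafe states and triggers an intervention in advance when the predicted risk exceeds a user-defined threshold. Semantic validity checks and PAC bounds provide statistical reliability while approximating the underlying ground-truth model.
Link: https://arxiv.org/abs/2508.00500
Authors: Haoyu Wang,Chris M. Poskitt,Jun Sun,Jiali Wei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: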
Abstract:Large Language Model (LLM) agents exhibit powerful autonomous capabilities across domains such as robotics, virtual assistants, and web automation. However, their stochastic behavior introduces significant safety risks that are difficult to anticipate. Existing rule-based enforcement systems, such as AgentSpec, focus on developing reactive safety rules, which typically respond only when unsafe behavior is imminent or has already occurred. These systems lack foresight and struggle with long-horizon dependencies and distribution shifts. To address these limitations, we propose Pro2Guard, a proactive runtime enforcement framework grounded in probabilistic reachability analysis. Pro2Guard abstracts agent behaviors into symbolic states and learns a Discrete-Time Markov Chain (DTMC) from execution traces. At runtime, it anticipates future risks by estimating the probability of reaching unsafe states, triggering interventions before violations occur when the predicted risk exceeds a user-defined threshold. By incorporating semantic validity checks and leveraging PAC bounds, Pro2Guard ensures statistical reliability while approximating the underlying ground-truth model. We evaluate Pro2Guard extensively across two safety-critical domains: embodied household agents and autonomous vehicles. In embodied agent tasks, Pro2Guard enforces safety early on up to 93.6% of unsafe tasks using low thresholds, while configurable modes (e.g., reflect) allow balancing safety with task success, maintaining up to 80.4% task completion. In autonomous driving scenarios, Pro2Guard achieves 100% prediction of traffic law violations and collisions, anticipating risks up to 38.66 seconds ahead.
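As a rough illustration of the mechanism described above, the sketch below estimates DTMC transition probabilities from traces of symbolic states by simple counting, computes a bounded-horizon probability of reaching an unsafe state, and compares it with a threshold. The state names, the finite `horizon`, and the function names are assumptions made for this example; Pro2Guard's actual state abstraction, model checker, and PAC-based guarantees are not reproduced here.

```python
from collections import defaultdict


def learn_dtmc(traces):
    """Estimate transition probabilities by counting consecutive state pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(nxt.values()) for dst, n in nxt.items()}
        for src, nxt in counts.items()
    }


def reach_probability(dtmc, start, unsafe, horizon):
    """Probability of hitting any unsafe state within `horizon` steps."""
    states = set(dtmc) | {d for nxt in dtmc.values() for d in nxt} | {start}
    prob = {s: 1.0 if s in unsafe else 0.0 for s in states}
    for _ in range(horizon):
        # Backward step: unsafe states stay at 1, others average over successors.
        prob = {
            s: 1.0 if s in unsafe
            else sum(p * prob[d] for d, p in dtmc.get(s, {}).items())
            for s in states
        }
    return prob[start]


def should_intervene(dtmc, state, unsafe, threshold, horizon=10):
    """Intervene when the predicted risk exceeds the user-defined threshold."""
    return reach_probability(dtmc, state, unsafe, horizon) > threshold
```

A lower threshold makes the guard intervene earlier at the cost of interrupting more tasks, which mirrors the safety versus task-completion trade-off reported in the abstract.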
[AI-23] HannesImitation: Grasping with the Hannes Prosthetic Hand via Imitation Learning IROS
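Quick read: This paper addresses the limited autonomous grasping ability of current prosthetic hands in unstructured environments, in particular how to reduce the user's cognitive load and improve dexterity. Traditional approaches rely on manually annotated control sequences or segmentation-based visual servoing and generalize poorly in complex scenes. The key to the solution is an imitation learning framework built on a diffusion policy: grasping demonstrations covering table, shelf, and human-to-prosthesis handover scenarios are collected (HannesImitationDataset) and used to train a single policy that directly predicts wrist orientation and hand closure for robust grasping. Experiments show successful grasps on unseen objects and conditions, outperforming an existing segmentation-based visual servo controller.
Link: https://arxiv.org/abs/2508.00491
Authors: Carlo Alessi,Federico Vasile,Federico Ceola,Giulia Pasquale,Nicolò Boccardo,Lorenzo Natale
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Paper accepted at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)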
Abstract:Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: this https URL
[AI-24] CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization
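Quick read: This paper addresses the fact that modern cyber attacks are multi-stage and highly dynamic, so defenders must adjust mitigations in real time under uncertainty, while traditional game-theoretic models adapt poorly because of static assumptions and a lack of integration with real-time threat intelligence. The key to the solution is the CyGATE framework, which models attacker-defender interaction as a partially observable stochastic game (POSG) and uses large language models (LLMs) with retrieval-augmented generation (RAG), so that the attacker adapts tactics as the environment changes and the defender re-prioritizes patches based on observed adversary behavior and evolving risk. This supports effective identification of high-risk vulnerabilities and better resource allocation, improving adaptability, strategic foresight, and efficiency in dynamic scenarios.
Link: https://arxiv.org/abs/2508.00478
Authors: Yuning Jiang,Nay Oo,Qiaoran Meng,Lu Lin,Dusit Niyato,Zehui Xiong,Hoon Wei Lim,Biplab Sikdar
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: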
Abstract:Modern cyber attacks unfold through multiple stages, requiring defenders to dynamically prioritize mitigations under uncertainty. While game-theoretic models capture attacker-defender interactions, existing approaches often rely on static assumptions and lack integration with real-time threat intelligence, limiting their adaptability. This paper presents CyGATE, a game-theoretic framework modeling attacker-defender interactions, using large language models (LLMs) with retrieval-augmented generation (RAG) to enhance tactic selection and patch prioritization. Applied to a two-agent scenario, CyGATE frames cyber conflicts as a partially observable stochastic game (POSG) across Cyber Kill Chain stages. Both agents use belief states to navigate uncertainty, with the attacker adapting tactics and the defender re-prioritizing patches based on evolving risks and observed adversary behavior. The framework’s flexible architecture enables extension to multi-agent scenarios involving coordinated attackers, collaborative defenders, or complex enterprise environments with multiple stakeholders. Evaluated in a dynamic patch scheduling scenario, CyGATE effectively prioritizes high-risk vulnerabilities, enhancing adaptability through dynamic threat integration, strategic foresight by anticipating attacker moves under uncertainty, and efficiency by optimizing resource use.
[AI-25] Thinking Machines: Mathematical Reasoning in the Age of LLMs
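Quick read: This paper examines why large language models (LLMs) have progressed slowly in formal mathematics, especially compared with their strong performance on programming tasks, and explores the underlying causes. It notes that although code generation and theorem proving look similar on the surface, the former is more stable and scalable while the latter remains notably brittle and limited. The approach centers on three core questions: the trade-offs between formal and informal mathematics as training domains; why proof generation is more brittle than code synthesis; and whether LLMs genuinely represent an evolving logical state or merely mimic one. By systematically surveying recent models and benchmarks, the paper seeks to clarify the nature of current bottlenecks and point to directions for future progress.
Link: https://arxiv.org/abs/2508.00459
Authors: Andrea Asperti,Alberto Naibo,Claudio Sacerdoti Coen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: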
Abstract:Large Language Models (LLMs) have shown remarkable abilities in structured reasoning and symbolic tasks, with coding emerging as a particular area of strength. This success has sparked growing interest in applying LLMs to mathematics, both in informal problem-solving and formal theorem proving. However, progress in formal mathematics has proven to be significantly more difficult, despite surface-level similarities between programming and proof construction. This discrepancy raises important questions about how LLMs “reason”, how they are supervised, and whether they internally track a notion of computational or deductive state. In this article, we address the state-of-the-art of the discipline, focusing on recent models and benchmarks, and explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between formal and informal mathematics as training domains; (ii) the deeper reasons why proof generation remains more brittle than code synthesis; and (iii) the question of whether LLMs represent, or merely mimic, a notion of evolving logical state. Our goal is not to draw hard boundaries, but to identify where the current limits lie, and how they might be extended.
[AI-26] M2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation
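Quick read: This paper addresses cold-start item recommendation, that is, how to recommend new items that have no historical interaction data. Existing methods use multi-modal content to ease the problem but often ignore the multi-view structure of modalities and the distinction between shared and modality-specific features. The key to the solution is the Multi-Modal Multi-View Variational AutoEncoder (M²VAE), which generates type-specific latent variables for item IDs, categorical attributes, and image features and uses a Product-of-Experts (PoE) to derive a common representation. A disentangled contrastive loss separates the common view from the unique views while preserving feature informativeness. A preference-guided Mixture-of-Experts (MoE) adaptively fuses the representations, and co-occurrence signals are incorporated through contrastive learning, which removes the need for pretraining.
Link: https://arxiv.org/abs/2508.00452
Authors: Chuan He,Yongchao Liu,Qiang Li,Wenliang Zhong,Chuntao Hong,Xinwei Yao
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: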
Abstract:Cold-start item recommendation is a significant challenge in recommendation systems, particularly when new items are introduced without any historical interaction data. While existing methods leverage multi-modal content to alleviate the cold-start issue, they often neglect the inherent multi-view structure of modalities, the distinction between shared and modality-specific features. In this paper, we propose Multi-Modal Multi-View Variational AutoEncoder (M^2VAE), a generative model that addresses the challenges of modeling common and unique views in attribute and multi-modal features, as well as user preferences over single-typed item features. Specifically, we generate type-specific latent variables for item IDs, categorical attributes, and image features, and use Product-of-Experts (PoE) to derive a common representation. A disentangled contrastive loss decouples the common view from unique views while preserving feature informativeness. To model user inclinations, we employ a preference-guided Mixture-of-Experts (MoE) to adaptively fuse representations. We further incorporate co-occurrence signals via contrastive learning, eliminating the need for pretraining. Extensive experiments on real-world datasets validate the effectiveness of our approach.
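For readers unfamiliar with Product-of-Experts fusion, the sketch below shows the standard closed form for combining independent Gaussian experts in one latent dimension: precisions add, and the fused mean is the precision-weighted average. Treating each type-specific latent (ID, attributes, image) as a diagonal Gaussian is an assumption made here for illustration; the abstract does not specify the exact parameterization used in M^2VAE.

```python
def product_of_experts(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) into a single Gaussian.

    Returns (fused_mean, fused_variance). Confident experts (small variance)
    pull the fused mean toward themselves.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision
```

In a multi-dimensional latent space with diagonal covariances, the same rule would be applied independently per dimension.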
[AI-27] When Relevance Meets Novelty: Dual-Stable Periodic Optimization for Exploratory Recommendation
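Quick read: This paper addresses two core problems of traditional recommendation systems: strong feedback loops caused by over-reliance on users' historical preferences, which limit exploration and cause content fatigue; and, in existing dual-model frameworks enhanced by large language models (LLMs), biased interest modeling together with a static optimization process that cannot use incremental data for closed-loop iteration. The key to the solution is the Co-Evolutionary Alignment (CoEA) method. First, a Dual-Stable Interest Exploration (DSIE) module jointly models long-term preferences driven by group identity and short-term individual interests, reducing interest modeling bias. Second, a Periodic Collaborative Optimization (PCO) mechanism runs a periodic verify, fine-tune, and feed-back loop between the Relevance LLM and the Novelty LLM, enabling dynamic optimization on incremental data and improving the system's exploration ability and adaptability.
Link: https://arxiv.org/abs/2508.00450
Authors: Hongxiang Lin,Hao Guo,Zeshun Li,Erpeng Xue,Yongqian He,Xiangyu Hou,Zhaoyu Hu,Lei Wang,Sheng Chen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: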
Abstract:Traditional recommendation systems tend to trap users in strong feedback loops by excessively pushing content aligned with their historical preferences, thereby limiting exploration opportunities and causing content fatigue. Although large language models (LLMs) demonstrate potential with their diverse content generation capabilities, existing LLM-enhanced dual-model frameworks face two major limitations: first, they overlook long-term preferences driven by group identity, leading to biased interest modeling; second, they suffer from static optimization flaws, as a one-time alignment process fails to leverage incremental user data for closed-loop optimization. To address these challenges, we propose the Co-Evolutionary Alignment (CoEA) method. For interest modeling bias, we introduce a Dual-Stable Interest Exploration (DSIE) module, jointly modeling long-term group identity and short-term individual interests through parallel processing of behavioral sequences. For static optimization limitations, we design a Periodic Collaborative Optimization (PCO) mechanism. This mechanism regularly conducts preference verification on incremental data using the Relevance LLM, then guides the Novelty LLM to perform fine-tuning based on the verification results, and subsequently feeds back the output of the incrementally fine-tuned Novelty LLM to the Relevance LLM for re-evaluation, thereby achieving a dynamic closed-loop optimization. Extensive online and offline experiments verify the effectiveness of the CoEA model in exploratory recommendation.
[AI-28] Theory of Mind Using Active Inference: A Framework for Multi-Agent Cooperation
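Quick read: This paper addresses multi-agent cooperation that is generalisable and needs no explicit communication, that is, how to achieve efficient and flexible cooperation without relying on task-specific shared generative models or explicit communication mechanisms. The key to the solution is introducing theory of mind (ToM) into the active inference framework, so that agents infer others' beliefs and goals from observed behaviour and adjust their own policies accordingly. Specifically, the ToM-equipped agent maintains distinct representations of its own and others' beliefs and goals and uses a tree-based planning algorithm to recursively explore the joint policy space. In collision avoidance and foraging tasks it clearly outperforms non-ToM counterparts, cooperating better and reducing redundant actions.
Link: https://arxiv.org/abs/2508.00401
Authors: Riddhi J. Pitliya,Ozan Catal,Toon Van de Maele,Corrado Pezzato,Tim Verbelen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: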
Abstract:We present a novel approach to multi-agent cooperation by implementing theory of mind (ToM) within active inference. ToM - the ability to understand that others can have differing knowledge and goals - enables agents to reason about others’ beliefs while planning their own actions. Unlike previous active inference approaches to multi-agent cooperation, our method neither relies on task-specific shared generative models nor requires explicit communication, while being generalisable. In our framework, the ToM-equipped agent maintains distinct representations of its own and others’ beliefs and goals. We extend the sophisticated inference tree-based planning algorithm to systematically explore joint policy spaces through recursive reasoning. Our approach is evaluated through collision avoidance and foraging task simulations. Results demonstrate that ToM-equipped agents cooperate better compared to non-ToM counterparts by being able to avoid collisions and reduce redundant efforts. Crucially, ToM agents accomplish this by inferring others’ beliefs solely from observable behaviour. This work advances practical applications in artificial intelligence while providing computational insights into ToM.
[AI-29] ExeKGLib: A Platform for Machine Learning Analytics based on Knowledge Graphs
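Quick read: This paper addresses the difficulty that domain experts, such as researchers in science and engineering, have in efficiently building high-quality machine learning (ML) pipelines without ML expertise. The key to the solution is ExeKGLib, a Python library based on knowledge graphs (KGs). A graphical interface lowers the barrier to use, and the ML knowledge encoded in the knowledge graphs is presented in simple terms accessible to non-ML experts, so that they can design and run ML workflows intuitively while the workflows gain transparency, reusability, and executability.
Link: https://arxiv.org/abs/2508.00394
Authors: Antonis Klironomos,Baifan Zhou,Zhipeng Tan,Zhuoxun Zheng,Mohamed H. Gad-Elrab,Heiko Paulheim,Evgeny Kharlamov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: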
Abstract:Nowadays machine learning (ML) practitioners have access to numerous ML libraries available online. Such libraries can be used to create ML pipelines that consist of a series of steps where each step may invoke up to several ML libraries that are used for various data-driven analytical tasks. Development of high-quality ML pipelines is non-trivial; it requires training, ML expertise, and careful development of each step. At the same time, domain experts in science and engineering may not possess such ML expertise and training while they are in pressing need of ML-based analytics. In this paper, we present our ExeKGLib, a Python library enhanced with a graphical interface layer that allows users with minimal ML knowledge to build ML pipelines. This is achieved by relying on knowledge graphs that encode ML knowledge in simple terms accessible to non-ML experts. ExeKGLib also allows improving the transparency and reusability of the built ML workflows and ensures that they are executable. We show the usability and usefulness of ExeKGLib by presenting real use cases.
[AI-30] Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning
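Quick read: This paper addresses the weak performance of current visual language models (VLMs) on complex graphic reasoning, particularly their limited ability in spatial relations, abstract reasoning, and multi-element interaction; existing studies focus mainly on simple graphics and lack a systematic evaluation framework. In response, the authors propose ReasonBench, the first evaluation benchmark focused on structured graphic reasoning tasks, with 1,613 questions from real-world intelligence tests covering location, attribute, quantity, and multi-element reasoning, which gives a comprehensive measure of VLMs' spatial, relational, and abstract reasoning. The key to the solution is a dual optimization strategy: Diagrammatic Reasoning Chain (DiaCoT) improves the interpretability of reasoning through layered decomposition, and ReasonTune improves task adaptability through fine-tuning; together they improve VLM performance by 33.5%.
Link: https://arxiv.org/abs/2508.00323
Authors: Jianyi Zhang,Xu Ji,Ziyin Zhou,Yuchen Zhou,Shubo Shi,Haoyu Wu,Zhen Li,Shizhao Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: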
Abstract:Evaluating the performance of visual language models (VLMs) in graphic reasoning tasks has become an important research topic. However, VLMs still show obvious deficiencies in simulating human-level graphic reasoning capabilities, especially in complex graphic reasoning and abstract problem solving, which are less studied and existing studies only focus on simple graphics. To evaluate the performance of VLMs in complex graphic reasoning, we propose ReasonBench, the first evaluation benchmark focused on structured graphic reasoning tasks, which includes 1,613 questions from real-world intelligence tests. ReasonBench covers reasoning dimensions related to location, attribute, quantity, and multi-element tasks, providing a comprehensive evaluation of the performance of VLMs in spatial, relational, and abstract reasoning capabilities. We benchmark 11 mainstream VLMs (including closed-source and open-source models) and reveal significant limitations of current models. Based on these findings, we propose a dual optimization strategy: Diagrammatic Reasoning Chain (DiaCoT) enhances the interpretability of reasoning by decomposing layers, and ReasonTune enhances the task adaptability of model reasoning through training, all of which improves VLM performance by 33.5%. All experimental data and code are in the repository: this https URL.
[AI-31] MetaExplainer: A Framework to Generate Multi-Type User-Centered Explanations for AI Systems
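Quick read: This paper addresses the gap between the explanations AI systems provide and what users actually need, namely that model-generated explanations often fail to meet users' individual needs for understanding. The key to the solution is MetaExplainer, a neuro-symbolic framework that generates user-centered explanations in three stages: first, large language models (LLMs) convert user questions into a machine-readable format; second, model explainer methods generate system recommendations; finally, natural language explanations are synthesized to summarize the explainer outputs. Throughout, an Explanation Ontology guides the LLMs and explainer methods so that the generated explanations are accurate, context-relevant, and traceable, improving the interpretability and trustworthiness of AI systems.
Link: https://arxiv.org/abs/2508.00300
Authors: Shruthi Chari,Oshani Seneviratne,Prithwish Chakraborty,Pablo Meyer,Deborah L. McGuinness
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: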
Abstract:Explanations are crucial for building trustworthy AI systems, but a gap often exists between the explanations provided by models and those needed by users. To address this gap, we introduce MetaExplainer, a neuro-symbolic framework designed to generate user-centered explanations. Our approach employs a three-stage process: first, we decompose user questions into machine-readable formats using state-of-the-art large language models (LLM); second, we delegate the task of generating system recommendations to model explainer methods; and finally, we synthesize natural language explanations that summarize the explainer outputs. Throughout this process, we utilize an Explanation Ontology to guide the language models and explainer methods. By leveraging LLMs and a structured approach to explanation generation, MetaExplainer aims to enhance the interpretability and trustworthiness of AI systems across various applications, providing users with tailored, question-driven explanations that better meet their needs. Comprehensive evaluations of MetaExplainer demonstrate a step towards evaluating and utilizing current state-of-the-art explanation frameworks. Our results show high performance across all stages, with a 59.06% F1-score in question reframing, 70% faithfulness in model explanations, and 67% context-utilization in natural language synthesis. User studies corroborate these findings, highlighting the creativity and comprehensiveness of generated explanations. Tested on the Diabetes (PIMA Indian) tabular dataset, MetaExplainer supports diverse explanation types, including Contrastive, Counterfactual, Rationale, Case-Based, and Data explanations. The framework’s versatility and traceability from using ontology to guide LLMs suggest broad applicability beyond the tested scenarios, positioning MetaExplainer as a promising tool for enhancing AI explainability across various domains.
[AI-32] Calibrated Language Models and How to Find Them with Label Smoothing ICML
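Quick read: This paper addresses the degradation of confidence calibration that instruction tuning causes in large language models (LLMs): after fine-tuning, models become overconfident, which harms reliability. The key to the solution is label smoothing, a regularizer that counters overconfident predictions. The study further shows that the benefit of label smoothing is greatly reduced in large-vocabulary LLMs (LV-LLMs), for reasons directly related to hidden size and vocabulary size, and supports this with theory and experiments. Finally, to reduce the memory cost of the cross-entropy computation under label smoothing, the authors design a customized kernel that substantially cuts memory use without sacrificing speed or performance.
Link: https://arxiv.org/abs/2508.00264
Authors: Jerry Huang,Peng Lu,Qiuhao Zeng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted to the Forty-second International Conference on Machine Learning (ICML) 2025. First two authors contributed equally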
Abstract:Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, understanding how this impacts confidence calibration for reliable model output has not been researched in full. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown as an effective method to regularize for overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV-LLMs). We posit the cause to stem from the ability to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
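To make the regularizer concrete, here is a minimal sketch of a label-smoothed cross-entropy for a single token. It assumes the common variant in which a mass `eps` is spread uniformly over all classes, including the target; the paper's exact formulation and its memory-efficient kernel are not shown, and the function name is made up for this example.

```python
import math


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target: (1 - eps) one-hot + eps uniform."""
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    vocab = len(logits)
    return -sum(
        ((1.0 - eps) * (i == target) + eps / vocab) * lp
        for i, lp in enumerate(log_probs)
    )
```

With `eps=0` this reduces to the usual negative log-likelihood. For a very confident prediction the smoothed loss is larger, which is what discourages overconfidence during fine-tuning.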
[AI-33] Large AI Model-Enabled Secure Communications in Low-Altitude Wireless Networks: Concepts, Perspectives and Case Study
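Quick read: This paper addresses the distinctive secure-communication challenges of low-altitude wireless networks (LAWNs), including vulnerability to malicious attacks arising from low-altitude operation, frequent mobility, and reliance on unlicensed spectrum. In LAWN scenarios, security risks are amplified and traditional AI methods show important limitations. The key to the solution is an optimization framework based on large artificial intelligence models (LAMs) with large language models (LLMs) as the core component: the LLMs enhance handcrafted state features and design intrinsic rewards accordingly, which markedly improves reinforcement learning performance on secure communication tasks.
Link: https://arxiv.org/abs/2508.00256
Authors: Chuang Zhang,Geng Sun,Jiacheng Wang,Yijing Lin,Weijie Yuan,Sinem Coleri,Dusit Niyato,Tony Q. S. Quek
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This paper has been submitted to IEEE Communications Magazine for consideration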
Abstract:Low-altitude wireless networks (LAWNs) have the potential to revolutionize communications by supporting a range of applications, including urban parcel delivery, aerial inspections and air taxis. However, compared with traditional wireless networks, LAWNs face unique security challenges due to low-altitude operations, frequent mobility and reliance on unlicensed spectrum, making them more vulnerable to malicious attacks. In this paper, we investigate large artificial intelligence model (LAM)-enabled solutions for secure communications in LAWNs. Specifically, we first explore the amplified security risks and important limitations of traditional AI methods in LAWNs. Then, we introduce the basic concepts of LAMs and delve into the role of LAMs in addressing these challenges. To demonstrate the practical benefits of LAMs for secure communications in LAWNs, we propose a novel LAM-based optimization framework that leverages large language models (LLMs) to generate enhanced state features on top of handcrafted representations, and to design intrinsic rewards accordingly, thereby improving reinforcement learning performance for secure communication tasks. Through a typical case study, simulation results validate the effectiveness of the proposed framework. Finally, we outline future directions for integrating LAMs into secure LAWN applications.
[AI-34] Accurate and Consistent Graph Model Generation from Text with Large Language Models
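Quick read: This paper addresses three core problems in generating graph models with large language models (LLMs): syntax violations, constraint inconsistencies, and inaccuracy, of which the latter two are not handled effectively by existing methods. The key to the solution is a novel abstraction-concretization framework that aggregates multiple candidate outputs from an LLM into a probabilistic partial model and then iteratively refines it into the most appropriate concrete model satisfying all metamodel syntax and domain constraints, significantly improving the consistency and quality of the generated graph models.
Link: https://arxiv.org/abs/2508.00255
Authors: Boqi Chen,Ou Wei,Bingzhou Zheng,Gunter Mussbacher
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at ACM / IEEE 28th International Conference on Model Driven Engineering Languages and Systems (MODELS 2025)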
Abstract:Graph model generation from natural language description is an important task with many applications in software engineering. With the rise of large language models (LLMs), there is a growing interest in using LLMs for graph model generation. Nevertheless, LLM-based graph model generation typically produces partially correct models that suffer from three main issues: (1) syntax violations: the generated model may not adhere to the syntax defined by its metamodel, (2) constraint inconsistencies: the structure of the model might not conform to some domain-specific constraints, and (3) inaccuracy: due to the inherent uncertainty in LLMs, the models can include inaccurate, hallucinated elements. While the first issue is often addressed through techniques such as constraint decoding or filtering, the latter two remain largely unaddressed. Motivated by recent self-consistency approaches in LLMs, we propose a novel abstraction-concretization framework that enhances the consistency and quality of generated graph models by considering multiple outputs from an LLM. Our approach first constructs a probabilistic partial model that aggregates all candidate outputs and then refines this partial model into the most appropriate concrete model that satisfies all constraints. We evaluate our framework on several popular open-source and closed-source LLMs using diverse datasets for model generation tasks. The results demonstrate that our approach significantly improves both the consistency and quality of the generated graph models.
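The abstraction and concretization steps described above can be pictured with a toy sketch: several candidate edge sets (standing in for LLM outputs) are aggregated into per-edge support values, and edges are then added greedily in order of support as long as a constraint check passes. The greedy strategy, the `min_support` cutoff, and the example constraint `no_mutual_ownership` are all assumptions made for illustration; the paper's partial-model formalism and refinement procedure may differ substantially.

```python
from collections import Counter


def build_partial_model(candidates):
    """Abstraction: map each edge to the fraction of candidates containing it."""
    counts = Counter(edge for cand in candidates for edge in set(cand))
    return {edge: n / len(candidates) for edge, n in counts.items()}


def concretize(partial, is_consistent, min_support=0.5):
    """Concretization: greedily keep well-supported edges that stay consistent."""
    model = set()
    ranked = sorted(partial.items(), key=lambda kv: (-kv[1], kv[0]))
    for edge, support in ranked:
        if support >= min_support and is_consistent(model | {edge}):
            model.add(edge)
    return model


def no_mutual_ownership(edges):
    """Hypothetical domain constraint: A owns B and B owns A cannot both hold."""
    return not any(
        (dst, rel, src) in edges for (src, rel, dst) in edges if rel == "owns"
    )
```

Even an edge that appears in most candidates is dropped when it conflicts with a better-supported one, which is the sense in which aggregation can repair individual hallucinated elements.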
[AI-35] What's Behind the Magic? Audiences Seek Artistic Value in Generative AI's Contributions to a Live Dance Performance
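Quick read: This paper examines the following problem: as generative AI (GenAI) is used to create art, stakeholders disagree about the value of the resulting works, and public acceptance of AI-made art depends on what the audience knows. The key to the approach is an experimental design that controls when information is disclosed, informing audiences of how the dance performance was developed, including whether GenAI was used, either before or after they are surveyed, which reveals how social context and users' interpretive frames shape aesthetic judgments of AI art. The study finds that audiences are more inclined to attribute artistic merit to GenAI-assisted work when they are unaware of its use, suggesting that technical explanations need to account for social and cognitive context in order to build shared understanding of the artistic value of generative AI.
Link: https://arxiv.org/abs/2508.00239
Authors: Jacqueline Elise Bruen,Myounghoon Jeon
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: In Proceedings of Explainable AI for the Arts Workshop 2025 (XAIxArts 2025) arXiv:2406.14485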
Abstract:With the development of generative artificial intelligence (GenAI) tools to create art, stakeholders cannot come to an agreement on the value of these works. In this study we uncovered the mixed opinions surrounding art made by AI. We developed two versions of a dance performance augmented by technology either with or without GenAI. For each version we informed audiences of the performance’s development either before or after a survey on their perceptions of the performance. There were thirty-nine participants (13 male, 26 female) divided between the four performances. Results demonstrated that individuals were more inclined to attribute artistic merit to works made by GenAI when they were unaware of its use. We present this case study as a call to address the importance of utilizing the social context and the users’ interpretations of GenAI in shaping a technical explanation, leading to a greater discussion that can bridge gaps in understanding.
[AI-36] Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts
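Quick read: This paper addresses dynamic routing across multiple large language model (LLM) services in edge computing, so that user requests are assigned efficiently and with low latency to a suitable edge LLM expert and acceptable quality-of-service (QoS) is maintained. Existing routing algorithms cannot simultaneously handle the heterogeneity of LLM services, interference among requests, and the dynamic workloads that threaten long-term QoS stability. The key to the solution is a QoS-aware routing framework based on deep reinforcement learning (DRL): a dynamic state abstraction technique uses a heterogeneous graph attention network (HAN) to represent global state features compactly, and an action impact estimator with a tailored reward function guides the DRL agent to maximize QoS while avoiding latency violations, jointly improving resource efficiency and service quality.
Link: https://arxiv.org/abs/2508.00234
Authors: Jin Yang,Qiong Wu,Zhiying Feng,Zhi Zhou,Deke Guo,Xu Chen
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: Accepted by IEEE Transactions on Mobile Computing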
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns. Therefore, multiple LLMs are usually deployed at the network edge to boost real-time responsiveness and protect data privacy, particularly for many emerging smart mobile and IoT applications. Given the varying response quality and latency of LLM services, a critical issue is how to route user requests from mobile and IoT devices to an appropriate LLM service (i.e., edge LLM expert) to ensure acceptable quality-of-service (QoS). Existing routing algorithms fail to simultaneously address the heterogeneity of LLM services, the interference among requests, and the dynamic workloads necessary for maintaining long-term stable QoS. To meet these challenges, in this paper we propose a novel deep reinforcement learning (DRL)-based QoS-aware LLM routing framework for sustained high-quality LLM services. Due to the dynamic nature of the global state, we propose a dynamic state abstraction technique to compactly represent global state features with a heterogeneous graph attention network (HAN). Additionally, we introduce an action impact estimator and a tailored reward function to guide the DRL agent in maximizing QoS and preventing latency violations. Extensive experiments on both Poisson and real-world workloads demonstrate that our proposed algorithm significantly improves average QoS and computing resource efficiency compared to existing baselines.
[AI-37] Reinitializing weights vs units for maintaining plasticity in neural networks
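[Quick Read]: This paper addresses loss of plasticity in continual learning, where a neural network trained for an extended time on non-stationary data gradually loses its ability to learn new tasks. Its core solution is selective weight reinitialization, which reinitializes individual weights rather than whole units as earlier methods do. The algorithm identifies and resets the least useful weights in the network, and it maintains plasticity better than unit reinitialization when the network is small or includes layer normalization.
Link: https://arxiv.org/abs/2508.00212
Authors: J. Fernando Hernandez-Garcia,Shibhansh Dohare,Jun Luo,Rich S. Sutton
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: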
Abstract:Loss of plasticity is a phenomenon in which a neural network loses its ability to learn when trained for an extended time on non-stationary data. It is a crucial problem to overcome when designing systems that learn continually. An effective technique for preventing loss of plasticity is reinitializing parts of the network. In this paper, we compare two different reinitialization schemes: reinitializing units vs reinitializing weights. We propose a new algorithm, which we name selective weight reinitialization, for reinitializing the least useful weights in a network. We compare our algorithm to continual backpropagation and ReDo, two previously proposed algorithms that reinitialize units in the network. Through our experiments in continual supervised learning problems, we identify two settings when reinitializing weights is more effective at maintaining plasticity than reinitializing units: (1) when the network has a small number of units and (2) when the network includes layer normalization. Conversely, reinitializing weights and units are equally effective at maintaining plasticity when the network is of sufficient size and does not include layer normalization. We found that reinitializing weights maintains plasticity in a wider variety of settings than reinitializing units.
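To make the weight-versus-unit distinction concrete, here is a minimal sketch of resetting individual weights instead of whole units. The utility measure used here (the absolute value of a weight), the function name, and the `init_scale` parameter are all assumptions for illustration; the abstract does not say how the paper scores the "least useful" weights.

```python
import random


def reinitialize_least_useful_weights(weights, count, rng, init_scale=0.1):
    """Resample the `count` weights with the lowest utility.

    Utility here is |w|, a placeholder assumption; the paper defines its own measure.
    """
    scored = sorted(
        (abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)
    )
    updated = [list(row) for row in weights]  # leave the input untouched
    reset = []
    for _, r, c in scored[:count]:
        updated[r][c] = rng.uniform(-init_scale, init_scale)
        reset.append((r, c))
    return updated, reset


# Hypothetical 2x3 weight matrix: rows are units, columns are incoming connections.
layer = [[0.8, -0.01, 0.5], [0.02, -0.9, 0.3]]
new_layer, reset_positions = reinitialize_least_useful_weights(
    layer, count=2, rng=random.Random(0)
)
```

A unit-level scheme would instead resample an entire row, discarding that unit's large weights along with its small ones; the weight-level version touches only the selected entries.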
[AI-38] Robust Classification under Noisy Labels: A Geometry-Aware Reliability Framework for Foundation Models
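[Quick Read]: This paper addresses robust classification under label noise when building on pretrained foundation models (FMs), particularly where perfectly labeled data cannot be obtained. The key to the solution is a two-stage framework: reliability estimation followed by reliability-weighted inference. The method brings in geometry information through local neighborhoods built with the non-negative kernel (NNK) construction, and its reliability estimates can rely less on distance and local neighborhood as label noise increases. Experiments on CIFAR-10 and DermaMNIST show that it outperforms standard k-nearest-neighbor (kNN) methods and recent adaptive-neighborhood baselines.
Link: https://arxiv.org/abs/2508.00202
Authors: Ecem Bozkurt,Antonio Ortega
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 5 pages, 2 figures, under review at CAMSAP 2025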
Abstract:Foundation models (FMs) pretrained on large datasets have become fundamental for various downstream machine learning tasks, in particular in scenarios where obtaining perfectly labeled data is prohibitively expensive. In this paper, we assume an FM has to be fine-tuned with noisy data and present a two-stage framework to ensure robust classification in the presence of label noise without model retraining. Recent work has shown that simple k-nearest neighbor (kNN) approaches using an embedding derived from an FM can achieve good performance even in the presence of severe label noise. Our work is motivated by the fact that these methods make use of local geometry. In this paper, following a similar two-stage procedure, reliability estimation followed by reliability-weighted inference, we show that improved performance can be achieved by introducing geometry information. For a given instance, our proposed inference uses a local neighborhood of training data, obtained using the non-negative kernel (NNK) neighborhood construction. We propose several methods for reliability estimation that can rely less on distance and local neighborhood as the label noise increases. Our evaluation on CIFAR-10 and DermaMNIST shows that our methods improve robustness across various noise conditions, surpassing standard K-NN approaches and recent adaptive-neighborhood baselines.
[AI-39] EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
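[Quick Read]: This paper addresses the stochasticity that small batch sizes introduce into language model (LM) fine-tuning, which causes large oscillations in generation quality and destabilizes training. The commonly used exponential moving average (EMA) of weights reduces variance and smooths training, but it introduces bias from old iterates, so optimization lags behind vanilla training. The key to the solution is the Bias-Corrected Exponential Moving Average (BEMA), which keeps the variance reduction of EMA while eliminating the bias. A simple theoretical model shows provable acceleration of BEMA over both standard EMA and vanilla training, and experiments on several standard LM benchmarks show significantly improved convergence rates and final performance.
Link: https://arxiv.org/abs/2508.00180
Authors: Adam Block,Cyril Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: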
Abstract:Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an Exponential moving average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on Language Models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training in a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
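The lag attributed to EMA can be seen on a toy trajectory. The sketch below implements only the standard EMA recursion on scalar "iterates"; it does not implement BEMA, whose correction term is not given in the abstract. The ramp-shaped trajectory and the decay value are assumptions chosen for illustration.

```python
def ema_trajectory(iterates, decay=0.9):
    """Return the running exponential moving average of a sequence of scalars."""
    averaged = []
    ema = iterates[0]  # initialise the average at the first iterate
    for value in iterates:
        ema = decay * ema + (1.0 - decay) * value
        averaged.append(ema)
    return averaged


# A noiseless stand-in for training: the iterate moves steadily from 0 to 1.
raw = [min(1.0, 0.1 * step) for step in range(21)]
smoothed = ema_trajectory(raw, decay=0.9)

# The lag is the gap between the current iterate and its average.
lag = [r - s for r, s in zip(raw, smoothed)]
```

Because the average is a weighted mix of older, smaller values, it trails the moving iterate; this gap is the bias that a corrected averaging scheme aims to remove while keeping the smoothing.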
[AI-40] The SPACE of AI: Real-World Lessons on AI's Impact on Developers
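[Quick Read]: This paper asks what the true impact of AI tools on developer productivity and experience is as those tools become increasingly embedded in software development workflows. Using a mixed-methods study (surveys, interviews, and observational studies) organized around the SPACE framework (Satisfaction, Performance, Activity, Collaboration, and Efficiency), it evaluates how developers perceive AI's influence. AI is found to be clearly valuable for routine tasks, but the benefit varies with task complexity, individual usage patterns, and team-level adoption. The key takeaway is that organizational support and peer learning are central to maximizing AI's value: AI augments developers rather than replacing them, and effective integration depends on team culture and support structures as much as on the tools themselves.
Link: https://arxiv.org/abs/2508.00178
Authors: Brian Houck,Travis Lowdermilk,Cody Beyer,Steven Clarke,Ben Hanrahan
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: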
Abstract:As artificial intelligence (AI) tools become increasingly embedded in software development workflows, questions persist about their true impact on developer productivity and experience. This paper presents findings from a mixed-methods study examining how developers perceive AI’s influence across the dimensions of the SPACE framework: Satisfaction, Performance, Activity, Collaboration and Efficiency. Drawing on survey responses from over 500 developers and qualitative insights from interviews and observational studies, we find that AI is broadly adopted and widely seen as enhancing productivity, particularly for routine tasks. However, the benefits vary, depending on task complexity, individual usage patterns, and team-level adoption. Developers report increased efficiency and satisfaction, with less evidence of impact on collaboration. Organizational support and peer learning play key roles in maximizing AI’s value. These findings suggest that AI is augmenting developers rather than replacing them, and that effective integration depends as much on team culture and support structures as on the tools themselves. We conclude with practical recommendations for teams, organizations and researchers seeking to harness AI’s potential in software engineering.
[AI-41] DeformTune: A Deformable XAI Music Prototype for Non-Musicians
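[Quick Read]: This paper addresses the problem that current generative AI music tools are unfriendly to non-musicians: they rely on text prompts, complex interfaces, or instrument-like controls that users find hard to understand and operate. The key to the solution is DeformTune, a system that combines a tactile deformable interface with the MeasureVAE model. Its embodied, physical input is intended to strengthen users' understanding of and sense of control over the AI, making the system more intuitive, explainable, and usable for people without formal musical training.
Link: https://arxiv.org/abs/2508.00160
Authors: Ziqing Xu,Nick Bryan-Kinns
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: In Proceedings of Explainable AI for the Arts Workshop 2025 (XAIxArts 2025) arXiv:2406.14485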
Abstract:Many existing AI music generation tools rely on text prompts, complex interfaces, or instrument-like controls, which may require musical or technical knowledge that non-musicians do not possess. This paper introduces DeformTune, a prototype system that combines a tactile deformable interface with the MeasureVAE model to explore more intuitive, embodied, and explainable AI interaction. We conducted a preliminary study with 11 adult participants without formal musical training to investigate their experience with AI-assisted music creation. Thematic analysis of their feedback revealed recurring challenges, including unclear control mappings, limited expressive range, and the need for guidance throughout use. We discuss several design opportunities for enhancing explainability of AI, including multimodal feedback and progressive interaction support. These findings contribute early insights toward making AI music systems more explainable and empowering for novice users.
[AI-42] Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power
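[Quick Read]: This paper addresses the balance between AI safety and human wellbeing. The core challenge is designing an objective function for AI systems that prevents the erosion of human power (for example, instrumental power-seeking or the gradual disempowerment of humans) while promoting long-term human wellbeing, that is, maintaining a desirable power balance in human-AI interaction while respecting the diversity of human goals, bounded rationality, and social norms. The key to the solution is a parametrizable and decomposable objective function built on an inequality- and risk-averse long-term aggregate metric of human power, which explicitly requires AI agents to empower humans and manage the human-AI power balance. The metric is computed from a world model by backward induction or approximated with a form of multi-agent reinforcement learning. The paper further argues that softly maximizing this metric may be safer than direct utility-based objectives and may induce beneficial instrumental sub-goals such as cooperation, transparency, and limiting the expansion of the agent's own capabilities.
Link: https://arxiv.org/abs/2508.00159
Authors: Jobst Heitzig,Ram Potham
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Theoretical Economics (econ.TH); Optimization and Control (math.OC)
Comments: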
Abstract:Power is a key concept in AI safety: power-seeking as an instrumental goal, sudden or gradual disempowerment of humans, power balance in human-AI interaction and international AI governance. At the same time, power as the ability to pursue diverse goals is essential for wellbeing. This paper explores the idea of promoting both safety and wellbeing by forcing AI agents explicitly to empower humans and to manage the power balance between humans and AI agents in a desirable way. Using a principled, partially axiomatic approach, we design a parametrizable and decomposable objective function that represents an inequality- and risk-averse long-term aggregate of human power. It takes into account humans’ bounded rationality and social norms, and, crucially, considers a wide variety of possible human goals. We derive algorithms for computing that metric by backward induction or approximating it via a form of multi-agent reinforcement learning from a given world model. We exemplify the consequences of (softly) maximizing this metric in a variety of paradigmatic situations and describe what instrumental sub-goals it will likely imply. Our cautious assessment is that softly maximizing suitable aggregate metrics of human power might constitute a beneficial objective for agentic AI systems that is safer than direct utility-based objectives.
[AI-43] Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation
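[Quick Read]: This paper addresses the problem that, in education, overreliance on human inter-rater reliability (IRR) as the sole standard of annotation quality yields training data and models that do not effectively improve student learning. Traditional IRR metrics such as Cohen's kappa are widely used to validate labeled data, but they measure only agreement among annotators and ignore whether the labels are educationally valid and what effect they have on learning. The paper argues that this pursuit of consensus holds back classification schemes that are more predictive and pedagogically useful. The key to the solution is a set of complementary evaluation methods, including multi-label annotation schemes, expert-based approaches, and close-the-loop validity, together with an emphasis on external validity, that is, showing that a validation procedure works across many categories of tutor actions (for example, providing hints). By shifting the focus from human agreement to validity and educational impact, the paper calls for redefining annotation quality to support higher-quality, more actionable educational AI.
Link: https://arxiv.org/abs/2508.00143
Authors: Danielle R. Thomas,Conrad Borchers,Kenneth R. Koedinger
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted for presentation at NCME AIME-Con 2025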
Abstract:Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define “ground truth.” Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen’s kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors’ moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth–prioritizing validity and educational impact over consensus alone.
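For readers unfamiliar with the metric under discussion, the following sketch computes Cohen's kappa for two raters using its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The tutor-move labels are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical tutor-move labels from two annotators.
rater_1 = ["hint", "hint", "feedback", "hint", "feedback", "feedback", "hint", "hint"]
rater_2 = ["hint", "feedback", "feedback", "hint", "feedback", "hint", "hint", "hint"]
kappa = cohens_kappa(rater_1, rater_2)
```

Note that the computation uses only the two label sequences. Nothing in it refers to student outcomes, which is the gap the position paper points at: a high kappa shows that annotators agree, not that the labels are valid or predictive of learning.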
[AI-44] INSPIRE-GNN: Intelligent Sensor Placement to Improve Sparse Bicycling Network Prediction via Reinforcement Learning Boosted Graph Neural Networks
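[Quick Read]: This paper addresses data sparsity in urban transport planning: limited coverage by bicycle count sensors degrades link-level bicycling volume estimation. The key to the solution is INSPIRE-GNN, a reinforcement learning (RL)-boosted hybrid graph neural network framework. It combines Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) with a Deep Q-Network (DQN)-based RL agent that selects sensor locations in a data-driven way, significantly improving volume estimation at extremely low sensor coverage (about 99% of road segments without sensors).
Link: https://arxiv.org/abs/2508.00141
Authors: Mohit Gupta,Debjit Bhowmick,Rhys Newbury,Meead Saberi,Shirui Pan,Ben Beck
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: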
Abstract:Accurate link-level bicycling volume estimation is essential for sustainable urban transportation planning. However, many cities face significant challenges of high data sparsity due to limited bicycling count sensor coverage. To address this issue, we propose INSPIRE-GNN, a novel Reinforcement Learning (RL)-boosted hybrid Graph Neural Network (GNN) framework designed to optimize sensor placement and improve link-level bicycling volume estimation in data-sparse environments. INSPIRE-GNN integrates Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) with a Deep Q-Network (DQN)-based RL agent, enabling a data-driven strategic selection of sensor locations to maximize estimation performance. Applied to Melbourne’s bicycling network, comprising 15,933 road segments with sensor coverage on only 141 road segments (99% sparsity) - INSPIRE-GNN demonstrates significant improvements in volume estimation by strategically selecting additional sensor locations in deployments of 50, 100, 200 and 500 sensors. Our framework outperforms traditional heuristic methods for sensor placement such as betweenness centrality, closeness centrality, observed bicycling activity and random placement, across key metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Furthermore, our experiments benchmark INSPIRE-GNN against standard machine learning and deep learning models in the bicycle volume estimation performance, underscoring its effectiveness. Our proposed framework provides transport planners actionable insights to effectively expand sensor networks, optimize sensor placement and maximize volume estimation accuracy and reliability of bicycling data for informed transportation planning decisions.
[AI-45] Your Model Is Unfair, Are You Even Aware? Inverse Relationship Between Comprehension and Trust in Explainability Visualizations of Biased ML Models
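[Quick Read]: This paper addresses the loss of user trust caused by biased behavior in deployed machine learning (ML) models, focusing on how non-expert users' comprehension of explainability visualizations relates to their perception of bias and their trust. The key to the solution is a taxonomy of design characteristics of explainability visualizations, combined with user studies of five tools (LIME, SHAP, CP, Anchors, and ELI5). The studies reveal an unexpected inverse relationship: better comprehension lowers trust, an effect mediated mainly by increased bias perception. Further experiments show that reducing perceived bias, either by adjusting visualization design or by improving model fairness, significantly increases trust even when comprehension remains high. The work offers actionable visualization design guidance for responsible ML applications.
Link: https://arxiv.org/abs/2508.00140
Authors: Zhanna Kaufman,Madeline Endres,Cindy Xiong Bearfield,Yuriy Brun
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: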
Abstract:Systems relying on ML have become ubiquitous, but so has biased behavior within them. Research shows that bias significantly affects stakeholders’ trust in systems and how they use them. Further, stakeholders of different backgrounds view and trust the same systems differently. Thus, how ML models’ behavior is explained plays a key role in comprehension and trust. We survey explainability visualizations, creating a taxonomy of design characteristics. We conduct user studies to evaluate five state-of-the-art visualization tools (LIME, SHAP, CP, Anchors, and ELI5) for model explainability, measuring how taxonomy characteristics affect comprehension, bias perception, and trust for non-expert ML users. Surprisingly, we find an inverse relationship between comprehension and trust: the better users understand the models, the less they trust them. We investigate the cause and find that this relationship is strongly mediated by bias perception: more comprehensible visualizations increase people’s perception of bias, and increased bias perception reduces trust. We confirm this relationship is causal: Manipulating explainability visualizations to control comprehension, bias perception, and trust, we show that visualization design can significantly (p < 0.001) increase comprehension, increase perceived bias, and reduce trust. Conversely, reducing perceived model bias, either by improving model fairness or by adjusting visualization design, significantly increases trust even when comprehension remains high. Our work advances understanding of how comprehension affects trust and systematically investigates visualization’s role in facilitating responsible ML applications.
[AI-46] Co-Producing AI: Toward an Augmented Participatory Lifecycle AAAI
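[Quick Read]: This paper addresses the disproportionate negative impact that AI algorithms can have on culturally marginalized groups. Ethical guidelines and technical measures have been proposed to mitigate AI bias and risk, but the paper argues they are not sufficient to remove these systemic inequities. The key to the solution is a fundamental re-architecture of the AI production pipeline centered on co-production, diversity, equity, and inclusion (DEI), and multidisciplinary collaboration. The paper proposes an augmented AI lifecycle with five interconnected phases: co-framing, co-design, co-implementation, co-deployment, and co-maintenance. The lifecycle is grounded in distributed authority and iterative knowledge exchange, is related to leading ethical frameworks, and comes with open research questions for scaling participatory governance.
Link: https://arxiv.org/abs/2508.00138
Authors: Rashid Mushkani,Hugo Berard,Toumadher Ammar,Cassandre Chatonnier,Shin Koseki
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Eighth AAAI/ACM Conference on AI, Ethics, and Society 2025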
Abstract:Despite efforts to mitigate the inherent risks and biases of artificial intelligence (AI) algorithms, these algorithms can disproportionately impact culturally marginalized groups. A range of approaches has been proposed to address or reduce these risks, including the development of ethical guidelines and principles for responsible AI, as well as technical solutions that promote algorithmic fairness. Drawing on design justice, expansive learning theory, and recent empirical work on participatory AI, we argue that mitigating these harms requires a fundamental re-architecture of the AI production pipeline. This re-design should center co-production, diversity, equity, inclusion (DEI), and multidisciplinary collaboration. We introduce an augmented AI lifecycle consisting of five interconnected phases: co-framing, co-design, co-implementation, co-deployment, and co-maintenance. The lifecycle is informed by four multidisciplinary workshops and grounded in themes of distributed authority and iterative knowledge exchange. Finally, we relate the proposed lifecycle to several leading ethical frameworks and outline key research questions that remain for scaling participatory governance.
[AI-47] SHACL Validation under Graph Updates (Extended Paper) ISWC2025
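[Quick Read]: This paper addresses whether RDF graphs still satisfy SHACL (SHApe Constraint Language) constraints after being updated, that is, static validation under updates: deciding whether every graph that validates a given SHACL specification will still validate it after a given update sequence is applied. The key to the solution is a regression technique that embeds the update actions into the SHACL constraints, reducing static validation to (un)satisfiability of constraints in a minor extension of SHACL. The approach provides a basis for reasoning about evolving RDF graphs, and a prototype implementation demonstrates its behavior in preliminary experiments.
Link: https://arxiv.org/abs/2508.00137
Authors: Shqiponja Ahmetaj,George Konstantinidis,Magdalena Ortiz,Paolo Pareti,Mantas Simkus
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the International Semantic Web Conference (ISWC 2025)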
Abstract:SHACL (SHApe Constraint Language) is a W3C standardized constraint language for RDF graphs. In this paper, we study SHACL validation in RDF graphs under updates. We present a SHACL-based update language that can capture intuitive and realistic modifications on RDF graphs and study the problem of static validation under such updates. This problem asks to verify whether every graph that validates a SHACL specification will still do so after applying a given update sequence. More importantly, it provides a basis for further services for reasoning about evolving RDF graphs. Using a regression technique that embeds the update actions into SHACL constraints, we show that static validation under updates can be reduced to (un)satisfiability of constraints in (a minor extension of) SHACL. We analyze the computational complexity of the static validation problem for SHACL and some key fragments. Finally, we present a prototype implementation that performs static validation and other static analysis tasks on SHACL constraints and demonstrate its behavior through preliminary experiments.
[AI-48] Algorithmic Detection of Rank Reversals, Transitivity Violations, and Decomposition Inconsistencies in Multi-Criteria Decision Analysis
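[Quick Read]: This paper addresses rank reversals in Multi-Criteria Decision Analysis (MCDA): applied to different sets of alternatives, the same decision method can produce inconsistent or even contradictory rankings, which seriously undermines the validity of a decision. The key to the solution is three tests that detect rank reversals, implemented in the Scikit-Criteria library, together with design choices that handle the complications of general scenarios, so that MCDA methods can be evaluated systematically and, potentially, ranked globally by effectiveness.
Link: https://arxiv.org/abs/2508.00129
Authors: Agustín Borda,Juan Bautista Cabral,Gonzalo Giarda,Diego Nicolás Gimenez Irusta,Paula Pacheco,Alvaro Roy Schachner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: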
Abstract:In Multi-Criteria Decision Analysis, Rank Reversals are a serious problem that can greatly affect the results of a Multi-Criteria Decision Method against a particular set of alternatives. It is therefore useful to have a mechanism that allows one to measure the performance of a method on a set of alternatives. This idea could be taken further to build a global ranking of the effectiveness of different methods to solve a problem. In this paper, we present three tests that detect the presence of Rank Reversals, along with their implementation in the Scikit-Criteria library. We also address the complications that arise when implementing these tests for general scenarios and the design considerations we made to handle them. We close with a discussion about how these additions could play a major role in the judgment of multi-criteria decision methods for problem solving.
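A rank reversal is easy to reproduce by hand. The sketch below ranks alternatives with a weighted sum over min-max normalized benefit criteria, then removes each non-best alternative and checks whether the order of the remaining ones changes. It is written in plain Python and is not the Scikit-Criteria API; the ranking method, the function names, and the decision matrix are assumptions for illustration, and the check is not necessarily one of the paper's three tests.

```python
def weighted_sum_ranking(matrix, weights):
    """Rank alternatives (best first) by a weighted sum of min-max normalised benefit criteria."""
    names = list(matrix)
    columns = list(zip(*(matrix[name] for name in names)))
    scores = {}
    for name in names:
        total = 0.0
        for value, column, weight in zip(matrix[name], columns, weights):
            low, high = min(column), max(column)
            # A criterion on which all alternatives tie contributes nothing.
            normalised = (value - low) / (high - low) if high > low else 0.0
            total += weight * normalised
        scores[name] = total
    return sorted(names, key=lambda name: scores[name], reverse=True)


def reversals_after_removal(matrix, weights):
    """Remove each non-best alternative in turn and report removals that reorder the rest."""
    baseline = weighted_sum_ranking(matrix, weights)
    found = []
    for removed in baseline[1:]:
        reduced = {name: row for name, row in matrix.items() if name != removed}
        expected = [name for name in baseline if name != removed]
        actual = weighted_sum_ranking(reduced, weights)
        if actual != expected:
            found.append((removed, expected, actual))
    return found


# Hypothetical decision matrix: two benefit criteria, three alternatives.
alternatives = {"A": (9, 5), "B": (5, 8), "C": (1, 7)}
criteria_weights = (0.6, 0.4)
print(reversals_after_removal(alternatives, criteria_weights))
```

With all three alternatives the ranking is B, A, C. Dropping the worst alternative, C, changes the minimum of the first criterion, which rescales the normalized scores and puts A ahead of B. The reversal comes from the normalization depending on the alternative set, not from any change in A or B themselves.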
[AI-49] StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection
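[Quick Read]: This paper addresses common shortcomings of machine learning and deep learning models for liver disease classification: high misclassification rates, poor interpretability, high computational cost, and a lack of effective preprocessing. The key to the solution is StackLiverNet, an interpretable stacked ensemble model. It uses advanced data preprocessing and feature selection to improve robustness and predictive ability, applies random undersampling to mitigate class imbalance, and combines several hyperparameter-optimized base classifiers through a LightGBM meta-model to exploit their complementary strengths. On the test set it reports 99.89% accuracy, a Cohen's kappa of 0.9974, and an AUC of 0.9993 with only 5 misclassifications, with a training time of 4.2783 seconds and an inference time of 0.1106 seconds. LIME, SHAP, and the Morris method provide explanations of individual predictions and of global feature importance.
Link: https://arxiv.org/abs/2508.00117
Authors: Md. Ehsanul Haque,S. M. Jahidul Islam,Shakil Mia,Rumana Sharmin,Ashikuzzaman,Md Samir Morshed,Md. Tahmidul Huque
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted and presented paper of THE 16th INTERNATIONAL IEEE CONFERENCE ON COMPUTING, COMMUNICATION AND NETWORKING TECHNOLOGIES (ICCCNT) INDIA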
Abstract:Liver diseases are a serious health concern in the world, which requires precise and timely diagnosis to enhance the survival chances of patients. The current literature implemented numerous machine learning and deep learning models to classify liver diseases, but most of them had some issues like high misclassification error, poor interpretability, prohibitive computational expense, and lack of good preprocessing strategies. In order to address these drawbacks, we introduced StackLiverNet in this study; an interpretable stacked ensemble model tailored to the liver disease detection task. The framework uses advanced data preprocessing and feature selection technique to increase model robustness and predictive ability. Random undersampling is performed to deal with class imbalance and make the training balanced. StackLiverNet is an ensemble of several hyperparameter-optimized base classifiers, whose complementary advantages are used through a LightGBM meta-model. The provided model demonstrates excellent performance, with the testing accuracy of 99.89%, Cohen Kappa of 0.9974, and AUC of 0.9993, having only 5 misclassifications, and efficient training and inference speeds that are amenable to clinical practice (training time 4.2783 seconds, inference time 0.1106 seconds). Besides, Local Interpretable Model-Agnostic Explanations (LIME) are applied to generate transparent explanations of individual predictions, revealing high concentrations of Alkaline Phosphatase and moderate SGOT as important observations of liver disease. Also, SHAP was used to rank features by their global contribution to predictions, while the Morris method confirmed the most influential features through sensitivity analysis.
[AI-50] No AI Without PI! Object-Centric Process Mining as the Enabler for Generative, Predictive, and Prescriptive Artificial Intelligence
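[Quick Read]: This paper addresses why AI is hard to apply successfully in industrial settings, especially for improving end-to-end operational processes. The core difficulty is that AI on its own is not grounded in structured process data and therefore cannot effectively diagnose and improve complex, dynamic business processes. The key to the solution is Object-Centric Process Mining (OCPM), the "missing link" between data and processes, which structures organization-specific process data and enables generative, predictive, and prescriptive AI. The paper introduces the term Process Intelligence (PI) for the amalgamation of process-centric, data-driven techniques that enable AI in an organizational context.
Link: https://arxiv.org/abs/2508.00116
Authors: Wil M.P. van der Aalst
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures, preprint keynote paper of the seventh International Conference on Intelligent and Fuzzy Systems (INFUS 2025)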
Abstract:The uptake of Artificial Intelligence (AI) impacts the way we work, interact, do business, and conduct research. However, organizations struggle to apply AI successfully in industrial settings where the focus is on end-to-end operational processes. Here, we consider generative, predictive, and prescriptive AI and elaborate on the challenges of diagnosing and improving such processes. We show that AI needs to be grounded using Object-Centric Process Mining (OCPM). Process-related data are structured and organization-specific and, unlike text, processes are often highly dynamic. OCPM is the missing link connecting data and processes and enables different forms of AI. We use the term Process Intelligence (PI) to refer to the amalgamation of process-centric data-driven techniques able to deal with a variety of object and event types, enabling AI in an organizational context. This paper explains why AI requires PI to improve operational processes and highlights opportunities for successfully combining OCPM and generative, predictive, and prescriptive AI.
[AI-51] Hyperproperty-Constrained Secure Reinforcement Learning
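[Quick Read]: This paper addresses the lack of hyperproperty-based modeling and constraints in security-aware reinforcement learning (SecRL), in particular how to ensure that robotic missions satisfy security and opacity properties. The core solution formalizes the opacity and security constraints in HyperTWTL (hyperproperties for Time Window Temporal Logic), models the agent's dynamics as a Markov Decision Process (MDP), and uses dynamic Boltzmann softmax reinforcement learning to learn optimal policies that satisfy the HyperTWTL constraints.
Link: https://arxiv.org/abs/2508.00106
Authors: Ernest Bonnah,Luan Viet Nguyen,Khaza Anuarul Hoque
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
Comments: Accepted in IEEE/ACM MEMOCODE 2025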
Abstract:Hyperproperties for Time Window Temporal Logic (HyperTWTL) is a domain-specific formal specification language known for its effectiveness in compactly representing security, opacity, and concurrency properties for robotics applications. This paper focuses on HyperTWTL-constrained secure reinforcement learning (SecRL). Although temporal logic-constrained safe reinforcement learning (SRL) is an evolving research problem with several existing literature, there is a significant research gap in exploring security-aware reinforcement learning (RL) using hyperproperties. Given the dynamics of an agent as a Markov Decision Process (MDP) and opacity/security constraints formalized as HyperTWTL, we propose an approach for learning security-aware optimal policies using dynamic Boltzmann softmax RL while satisfying the HyperTWTL constraints. The effectiveness and scalability of our proposed approach are demonstrated using a pick-up and delivery robotic mission case study. We also compare our results with two other baseline RL algorithms, showing that our proposed method outperforms them.
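As background on the exploration rule named in the abstract, the sketch below shows plain Boltzmann (softmax) action selection over action values, with probabilities proportional to exp(Q(a) / T). How the paper makes the scheme "dynamic" and how it enforces HyperTWTL constraints are not described in the abstract, so neither is modeled here; the fixed temperatures and the Q-values are assumptions.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Softmax over action values; a lower temperature gives greedier selection."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - peak) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probabilities = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]


# Hypothetical action values for a three-action state.
q = [1.0, 2.0, 0.5]
exploratory = boltzmann_probabilities(q, temperature=5.0)
greedy = boltzmann_probabilities(q, temperature=0.1)
```

At a high temperature the three actions receive similar probabilities, while at a low temperature nearly all mass goes to the action with the largest value. A schedule that lowers the temperature over training is one common way to move from exploration to exploitation.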
[AI-52] A Mixed User-Centered Approach to Enable Augmented Intelligence in Intelligent Tutoring Systems: The Case of MathAIde app
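[Quick Read]: This paper addresses three challenges facing Artificial Intelligence in Education (AIED) in practice: the critical role of teachers in the design process, the limitations and reliability of AI tools, and unequal access to technological resources. The study adopts Augmented Intelligence (AuI) as its design philosophy: the AI suggests solutions and human teachers make the final assessment, which in turn improves the AI over time. The key to the solution is a teacher-centered design approach that involves teachers in every phase, resulting in MathAIde, an intelligent tutoring system that uses computer vision and AI to correct mathematics exercises from photos of student work and offers pre-defined remediation alternatives. The approach was validated through high-fidelity prototyping, A/B testing, and a real-world classroom case study, and the results indicate improved usefulness and adoption potential, especially in resource-limited environments.
Link: https://arxiv.org/abs/2508.00103
Authors: Guilherme Guerino,Luiz Rodrigues,Luana Bianchini,Mariana Alves,Marcelo Marinho,Thomaz Veloso,Valmir Macario,Diego Dermeval,Thales Vieira,Ig Bittencourt,Seiji Isotani
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Article accepted in the International Journal of Human-Computer Interaction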
Abstract:Integrating Artificial Intelligence in Education (AIED) aims to enhance learning experiences through technologies like Intelligent Tutoring Systems (ITS), offering personalized learning, increased engagement, and improved retention rates. However, AIED faces three main challenges: the critical role of teachers in the design process, the limitations and reliability of AI tools, and the accessibility of technological resources. Augmented Intelligence (AuI) addresses these challenges by enhancing human capabilities rather than replacing them, allowing systems to suggest solutions. In contrast, humans provide final assessments, thus improving AI over time. In this sense, this study focuses on designing, developing, and evaluating MathAIde, an ITS that corrects mathematics exercises using computer vision and AI and provides feedback based on photos of student work. The methodology included brainstorming sessions with potential users, high-fidelity prototyping, A/B testing, and a case study involving real-world classroom environments for teachers and students. Our research identified several design possibilities for implementing AuI in ITSs, emphasizing a balance between user needs and technological feasibility. Prioritization and validation through prototyping and testing highlighted the importance of efficiency metrics, ultimately leading to a solution that offers pre-defined remediation alternatives for teachers. Real-world deployment demonstrated the usefulness of the proposed solution. Our research contributes to the literature by providing a usable, teacher-centered design approach that involves teachers in all design phases. As a practical implication, we highlight that the user-centered design approach increases the usefulness and adoption potential of AIED systems, especially in resource-limited environments.
[AI-53] XRoboToolkit: A Cross-Platform Framework for Robot Teleoperation
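[Quick Read]: This paper addresses the poor scalability, complex deployment, and suboptimal data quality of current robot demonstration data collection, problems that limit the training of Vision-Language-Action (VLA) models. The key to its solution is XRoboToolkit, a cross-platform Extended Reality (XR) robot teleoperation framework built on the OpenXR standard. It provides low-latency stereoscopic visual feedback, optimization-based inverse kinematics, and support for multiple tracking modalities, including head, controller, hand, and auxiliary motion trackers. Its modular architecture integrates with a range of robot platforms and simulation environments, which the authors report improves both the efficiency and the quality of data collection.
Link: https://arxiv.org/abs/2508.00097
Authors: Zhigen Zhao, Liuchuan Yu, Ke Jing, Ning Yang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures, project link: this https URL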
Abstract:The rapid advancement of Vision-Language-Action models has created an urgent need for large-scale, high-quality robot demonstration datasets. Although teleoperation is the predominant method for data collection, current approaches suffer from limited scalability, complex setup procedures, and suboptimal data quality. This paper presents XRoboToolkit, a cross-platform framework for extended reality based robot teleoperation built on the OpenXR standard. The system features low-latency stereoscopic visual feedback, optimization-based inverse kinematics, and support for diverse tracking modalities including head, controller, hand, and auxiliary motion trackers. XRoboToolkit’s modular architecture enables seamless integration across robotic platforms and simulation environments, spanning precision manipulators, mobile robots, and dexterous hands. We demonstrate the framework’s effectiveness through precision manipulation tasks and validate data quality by training VLA models that exhibit robust autonomous performance.
[AI-54] Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench
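[Quick Read]: This paper examines the limitations of HealthBench, a current evaluation benchmark for medical language models. Because the benchmark relies on expert opinion rather than high-tier clinical evidence, it risks codifying regional biases and the idiosyncrasies of individual clinicians, and these problems are amplified in low- and middle-income settings by sparse coverage of neglected tropical diseases and region-specific guideline mismatches. In response, the paper proposes anchoring reward functions in version-controlled Clinical Practice Guidelines (CPGs) that incorporate systematic reviews and GRADE evidence ratings. The key to the solution is an "evidence-robust" reinforcement learning roadmap built on rubric-to-guideline linkage, evidence-weighted scoring, and contextual override logic, complemented by ethical considerations and delayed outcome feedback, with the aim of improving clinical trustworthiness, ethical soundness, and global relevance.
Link: https://arxiv.org/abs/2508.00081
Authors: Fred Mutisya(1,2), Shikoh Gitau(1), Nasubo Ongoma(1), Keith Mbae(1), Elizabeth Wamicha(1)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: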
Abstract:HealthBench, a benchmark designed to better measure the capabilities of AI systems for health (Arora et al., 2025), has advanced medical language model evaluation through physician-crafted dialogues and transparent rubrics. However, its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies, further compounded by potential biases in automated grading systems. These limitations are particularly magnified in low- and middle-income settings, where issues like sparse neglected tropical disease coverage and region-specific guideline mismatches are prevalent. The unique challenges of the African context, including data scarcity, inadequate infrastructure, and nascent regulatory frameworks, underscore the urgent need for more globally relevant and equitable benchmarks. To address these shortcomings, we propose anchoring reward functions in version-controlled Clinical Practice Guidelines (CPGs) that incorporate systematic reviews and GRADE evidence ratings. Our roadmap outlines “evidence-robust” reinforcement learning via rubric-to-guideline linkage, evidence-weighted scoring, and contextual override logic, complemented by a focus on ethical considerations and the integration of delayed outcome feedback. By re-grounding rewards in rigorously vetted CPGs, while preserving HealthBench’s transparency and physician engagement, we aim to foster medical language models that are not only linguistically polished but also clinically trustworthy, ethically sound, and globally relevant.
[AI-55] Evaluating COVID 19 Feature Contributions to Bitcoin Return Forecasting: Methodology Based on LightGBM and Genetic Optimization
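[Quick Read]: This paper asks whether including COVID-19-related health indicators significantly improves the accuracy of Bitcoin return prediction. The key to its solution is a framework that combines a LightGBM regression model with Genetic Algorithm (GA) optimization: models with and without pandemic indicators are each optimized by GA over 31 independent runs, compared statistically on R², RMSE, and MAE, and analyzed with Permutation Feature Importance (PFI) to quantify each feature's contribution. The results indicate that COVID-19 indicators significantly improve the model's ability to capture extreme market fluctuations (R² up 40%, RMSE down 2%). Vaccination metrics, especially the 75th percentile of fully vaccinated individuals, emerge as the dominant predictors, giving investors and policymakers refined market-risk indicators based on public health signals.
Link: https://arxiv.org/abs/2508.00078
Authors: Imen Mahmoud, Andrei Velichko
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: 22 pages, 5 figures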
Abstract:This study proposes a novel methodological framework integrating a LightGBM regression model and genetic algorithm (GA) optimization to systematically evaluate the contribution of COVID-19-related indicators to Bitcoin return prediction. The primary objective was not merely to forecast Bitcoin returns but rather to determine whether including pandemic-related health data significantly enhances prediction accuracy. A comprehensive dataset comprising daily Bitcoin returns and COVID-19 metrics (vaccination rates, hospitalizations, testing statistics) was constructed. Predictive models, trained with and without COVID-19 features, were optimized using GA over 31 independent runs, allowing robust statistical assessment. Performance metrics (R², RMSE, MAE) were statistically compared through distribution overlaps and Mann-Whitney U tests. Permutation Feature Importance (PFI) analysis quantified individual feature contributions. Results indicate that COVID-19 indicators significantly improved model performance, particularly in capturing extreme market fluctuations (R² increased by 40%, RMSE decreased by 2%, both highly significant statistically). Among COVID-19 features, vaccination metrics, especially the 75th percentile of fully vaccinated individuals, emerged as dominant predictors. The proposed methodology extends existing financial analytics tools by incorporating public health signals, providing investors and policymakers with refined indicators to navigate market uncertainty during systemic crises.
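The Permutation Feature Importance step mentioned in the abstract can be illustrated with a minimal, standard-library sketch. This is not the authors' pipeline: the toy model, the synthetic data, and the names `mse`, `permutation_importance`, and `toy_model` are all assumptions made for illustration, and a fixed function stands in for the trained LightGBM regressor.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when one feature column is shuffled.

    `predict` maps a single feature row (a list) to a prediction.
    A larger increase means the model relies more on that feature.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic illustration only: the target depends on feature 0 and ignores feature 1.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * row[0] for row in X]


def toy_model(row):
    """Hypothetical stand-in for a trained regressor."""
    return 2.0 * row[0]


scores = permutation_importance(toy_model, X, y)
```

In this toy setup, shuffling the feature the model ignores leaves the error unchanged, while shuffling the feature it depends on raises the error. Comparing such scores across features is the kind of evidence the abstract describes for ranking vaccination metrics above other indicators.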
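[AI-56] TriP-LLM: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection
[Quick Read]: This paper tackles anomaly detection in time-series data of growing dimensionality and complexity, particularly in Internet of Things (IoT) and smart manufacturing settings where traditional statistical methods struggle with heterogeneous, complex data. The key to its solution is TriP-LLM, a Tri-Branch Patch-wise Large Language Model framework. A tri-branch design (Patching, Selection, and Global) encodes the input time series into patch-wise tokens, which a frozen, pretrained large language model (LLM) processes to combine local and global temporal features. A lightweight patch-wise decoder then reconstructs the input, and anomaly scores are derived from the reconstruction. Evaluated with a threshold-free metric, the framework is reported to outperform recent methods, and it uses less memory than existing LLM-based approaches that rely on Channel Independence (CI) patch processing, making it better suited to environments with limited GPU memory.
Link: https://arxiv.org/abs/2508.00047
Authors: Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures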
Abstract:Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a tri-branch design (Patching, Selection, and Global) to encode the input time series into patch-wise tokens, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints are publicly available on this https URL
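Two generic ideas in the abstract, splitting a series into patches and scoring anomalies from reconstruction error, can be sketched as follows. This is only an illustration under stated assumptions: the function names, patch length, and stride are hypothetical, and an all-zero "reconstruction" stands in for the output of the frozen LLM and patch-wise decoder, which are not modeled here.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into fixed-length patches taken every `stride` steps."""
    return [
        series[i:i + patch_len]
        for i in range(0, len(series) - patch_len + 1, stride)
    ]


def patch_anomaly_scores(original, reconstructed, patch_len, stride):
    """Mean squared reconstruction error per patch; higher means more anomalous."""
    orig_patches = make_patches(original, patch_len, stride)
    rec_patches = make_patches(reconstructed, patch_len, stride)
    return [
        sum((a - b) ** 2 for a, b in zip(p, q)) / patch_len
        for p, q in zip(orig_patches, rec_patches)
    ]


# Synthetic illustration: a flat signal with a single spike in the second patch.
series = [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
# Hypothetical reconstruction from a model that only learned the "normal" flat pattern.
reconstruction = [0.0] * len(series)

scores = patch_anomaly_scores(series, reconstruction, patch_len=4, stride=4)
```

The patch containing the spike receives a nonzero score while the normal patch scores zero, which is the basic signal a reconstruction-based detector relies on.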
[AI-57] Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains
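[Quick Read]: This paper addresses the difficulty reinforcement learning algorithms have with missing state information in partially observable environments. It notes that current benchmarks cover only simple forms of state aliasing (such as feature masking or Gaussian noise) and so cannot adequately assess how well algorithms handle the complex partial observability of real domains. The key to its solution is a set of best-practice guidelines and the open-source library POBAX (Partially Observable Benchmarks in JAX). Benchmark environments are chosen for two properties: first, coverage of diverse forms of partial observability (such as visual occlusion and unknown opponent intent), to support an algorithm's generalizability; second, a large performance gap between agents with more or less state information (being "memory improvable"), so that performance gains clearly come from coping with partial observability rather than other factors. The framework selects representative tasks (including localization and mapping, visual control, and games) and provides high-performance JAX implementations, giving researchers an out-of-the-box, GPU-scalable platform for evaluating partially observable reinforcement learning.
Link: https://arxiv.org/abs/2508.00046
Authors: Ruo Yu Tao, Kaicheng Guo, Cameron Allen, George Konidaris
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear at RLC 2025. 1 cover page, 10 pages, 3 reference pages + 13 pages for supplementary material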
Abstract:Mitigating partial observability is a necessary but challenging task for general reinforcement learning algorithms. To improve an algorithm’s ability to mitigate partial observability, researchers need comprehensive benchmarks to gauge progress. Most algorithms tackling partial observability are only evaluated on benchmarks with simple forms of state aliasing, such as feature masking and Gaussian noise. Such benchmarks do not represent the many forms of partial observability seen in real domains, like visual occlusion or unknown opponent intent. We argue that a partially observable benchmark should have two key properties. The first is coverage in its forms of partial observability, to ensure an algorithm’s generalizability. The second is a large gap between the performance of agents with more or less state information, all other factors roughly equal. This gap implies that an environment is memory improvable: where performance gains in a domain are from an algorithm’s ability to cope with partial observability as opposed to other factors. We introduce best-practice guidelines for empirically benchmarking reinforcement learning under partial observability, as well as the open-source library POBAX: Partially Observable Benchmarks in JAX. We characterize the types of partial observability present in various environments and select representative environments for our benchmark. These environments include localization and mapping, visual control, games, and more. Additionally, we show that these tasks are all memory improvable and require hard-to-learn memory functions, providing a concrete signal for partial observability research. This framework includes recommended hyperparameters and algorithm implementations for fast, out-of-the-box evaluation, as well as highly performant environments implemented in JAX for GPU-scalable experimentation.
[AI-58] Learning Like Humans: Resource-Efficient Federated Fine-Tuning through Cognitive Developmental Stages
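[Quick Read]: This paper addresses the difficulty of deploying federated fine-tuning on edge devices, where its resource-intensive nature clashes with limited compute and communication budgets. The key to its solution is Developmental Federated Tuning (DevFT), a progressive approach inspired by cognitive development. DevFT decomposes fine-tuning into developmental stages, each optimizing a submodel with increasing parameter capacity. Knowledge from earlier stages transfers to later submodels as optimized initialization parameters, which helps avoid convergence to local minima and accelerates training. To build the stage-specific submodels, it introduces deconfliction-guided layer grouping and differential-based layer fusion, which distill essential information into representative layers, reducing communication overhead and improving performance.
Link: https://arxiv.org/abs/2508.00041
Authors: Yebo Wu, Jingguang Li, Zhijiang Guo, Li Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: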
Abstract:Federated fine-tuning enables Large Language Models (LLMs) to adapt to downstream tasks while preserving data privacy, but its resource-intensive nature limits deployment on edge devices. In this paper, we introduce Developmental Federated Tuning (DevFT), a resource-efficient approach inspired by cognitive development that progressively builds a powerful LLM from a compact foundation. DevFT decomposes the fine-tuning process into developmental stages, each optimizing submodels with increasing parameter capacity. Knowledge from earlier stages transfers to subsequent submodels, providing optimized initialization parameters that prevent convergence to local minima and accelerate training. This paradigm mirrors human learning, gradually constructing comprehensive knowledge structure while refining existing skills. To efficiently build stage-specific submodels, DevFT introduces deconfliction-guided layer grouping and differential-based layer fusion to distill essential information and construct representative layers. Evaluations across multiple benchmarks demonstrate that DevFT significantly outperforms state-of-the-art methods, achieving up to 4.59× faster convergence, 10.67× reduction in communication overhead, and 9.07% average performance improvement, while maintaining compatibility with existing approaches.
[AI-59] Hybrid LSTM-Transformer Models for Profiling Highway-Railway Grade Crossings
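[Quick Read]: This paper addresses the safety hazard of vehicle hang-ups at high-profile Highway Railway Grade Crossings (HRGCs), which typically result from post-construction railway track maintenance or non-compliance with vertical alignment design guidelines. Conventional profile measurement is costly, time-consuming, traffic-disruptive, and itself a safety risk. The key to the solution is an HRGC profile reconstruction approach built on hybrid deep learning architectures: dynamic data are collected by a highway testing vehicle equipped with Inertial Measurement Unit (IMU) and Global Positioning System (GPS) sensors, ground truth is obtained with an industry-standard walking profiler, and three hybrid models are compared (Transformer-LSTM sequential, LSTM-Transformer sequential, and LSTM-Transformer parallel). The LSTM-Transformer sequential and parallel models performed best and were used to generate 2D/3D HRGC profiles quickly and accurately, supporting the assessment of hang-up susceptibility.
Link: https://arxiv.org/abs/2508.00039
Authors: Kaustav Chatterjee, Joshua Q. Li, Fatemeh Ansari, Masud Rana Munna, Kundan Parajulee, Jared Schwennesen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: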
Abstract:Hump crossings, or high-profile Highway Railway Grade Crossings (HRGCs), pose safety risks to highway vehicles due to potential hang-ups. These crossings typically result from post-construction railway track maintenance activities or non-compliance with design guidelines for HRGC vertical alignments. Conventional methods for measuring HRGC profiles are costly, time-consuming, traffic-disruptive, and present safety challenges. To address these issues, this research employed advanced, cost-effective techniques and innovative modeling approaches for HRGC profile measurement. A novel hybrid deep learning framework combining Long Short-Term Memory (LSTM) and Transformer architectures was developed by utilizing instrumentation and ground truth data. Instrumentation data were gathered using a highway testing vehicle equipped with Inertial Measurement Unit (IMU) and Global Positioning System (GPS) sensors, while ground truth data were obtained via an industrial-standard walking profiler. Field data were collected at the Red Rock Railroad Corridor in Oklahoma. Three advanced deep learning models, Transformer-LSTM sequential (model 1), LSTM-Transformer sequential (model 2), and LSTM-Transformer parallel (model 3), were evaluated to identify the most efficient architecture. Models 2 and 3 outperformed model 1 and were deployed to generate 2D/3D HRGC profiles. The deep learning models demonstrated significant potential to enhance highway and railroad safety by enabling rapid and accurate assessment of HRGC hang-up susceptibility.
[AI-60] Predicting Large-scale Urban Network Dynamics with Energy-informed Graph Neural Diffusion
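[Quick Read]: This paper addresses the trade-off between efficacy and efficiency that graph neural network models face when predicting the spatiotemporal dynamics of large-scale urban systems (such as traffic flow, solar power, and smart meter data). Existing methods are hard to scale because of their computational cost and do not explicitly model physical laws. The key to the solution is to draw on physical laws to design an interpretable neural diffusion scheme that aligns with fundamental principles and avoids architectural redundancy. Concretely, it proposes the scalable spatiotemporal Transformer (ScaleSTF), a Transformer-like model whose attention layers are induced by low-dimensional embeddings, giving linear complexity and, according to the authors, high accuracy with strong scalability.
Link: https://arxiv.org/abs/2508.00037
Authors: Tong Nie, Jian Sun, Wei Ma
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Transactions on Industrial Informatics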
Abstract:Networked urban systems facilitate the flow of people, resources, and services, and are essential for economic and social interactions. These systems often involve complex processes with unknown governing rules, observed by sensor-based time series. To aid decision-making in industrial and engineering contexts, data-driven predictive models are used to forecast spatiotemporal dynamics of urban systems. Current models such as graph neural networks have shown promise but face a trade-off between efficacy and efficiency due to computational demands. Hence, their applications in large-scale networks still require further efforts. This paper addresses this trade-off challenge by drawing inspiration from physical laws to inform essential model designs that align with fundamental principles and avoid architectural redundancy. By understanding both micro- and macro-processes, we present a principled interpretable neural diffusion scheme based on Transformer-like structures whose attention layers are induced by low-dimensional embeddings. The proposed scalable spatiotemporal Transformer (ScaleSTF), with linear complexity, is validated on large-scale urban systems including traffic flow, solar power, and smart meters, showing state-of-the-art performance and remarkable scalability. Our results constitute a fresh perspective on the dynamics prediction in large-scale urban networks.
[AI-61] Generative Logic: A New Computer Architecture for Deterministic Reasoning and Knowledge Generation
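[Quick Read]: This paper addresses the limited automation and traceability of proof generation in formal reasoning, in particular the challenge of systematically deriving verifiable theorems from axioms in foundational theories such as Peano arithmetic. The core of its solution is a deterministic architecture called Generative Logic (GL). GL compiles user-supplied axiomatic definitions, written in a minimalist Mathematical Programming Language (MPL), into a distributed grid of Logic Blocks (LBs) that explore the deductive neighborhood of the axioms by message passing. Whenever several expressions unify under an inference rule, a new fact is emitted with full provenance, yielding replayable, auditable proof graphs. Starting from the Peano axioms, the method automatically reconstructs machine-checkable proofs of basic arithmetic laws (associativity and commutativity of addition and multiplication, and distributivity) and presents every inference step as navigable HTML. The authors see this as groundwork for a hardware-software co-design path toward massively parallel reasoning and for integration with probabilistic models such as large language models for autoformalization and conjecture generation.
Link: https://arxiv.org/abs/2508.00017
Authors: Nikolai Sergeev
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 19 pages, 5 figures. Code and interactive HTML proof graphs permanently archived on Zenodo (DOI: https://doi.org/10.5281/zenodo.16408441 )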
Abstract:We present Generative Logic (GL), a deterministic architecture that begins from user-supplied axiomatic definitions – written in a minimalist Mathematical Programming Language (MPL) – and systematically explores their deductive neighborhood. Definitions are compiled into a distributed grid of simple Logic Blocks (LBs) that exchange messages; any time several expressions unify under an inference rule, a new fact is emitted with full provenance to its sources, yielding replayable, auditable proof graphs. A prototype software implementation instantiates the workflow on first-order Peano arithmetic. Starting only from the Peano axioms, GL enumerates candidate implications, applies normalization and type filters, and automatically reconstructs machine-checkable proofs of foundational arithmetic laws including associativity and commutativity of addition, associativity and commutativity of multiplication, and distributivity. Generated proofs export to navigable HTML so that every inference step can be inspected independently. We outline a hardware-software co-design path toward massively parallel realizations and describe prospective integration with probabilistic models (e.g., Large Language Models (LLMs)) for autoformalization and conjecture seeding. The Python and MPL code to reproduce the Peano experiments, along with the full HTML proof graphs, are available in the project’s GitHub repository at this https URL and are permanently archived at this https URL. We invite community feedback and collaboration.
[AI-62] AoI-Aware Resource Allocation with Deep Reinforcement Learning for HAPS-V2X Networks
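[Quick Read]: This paper addresses the difficulty of guaranteeing information freshness, measured by Age-of-Information (AoI), under the hyper-reliable and low-latency communication (HRLLC) requirements of safety-critical applications such as autonomous driving in sixth-generation (6G) networks. The problem is most acute in regions with infrastructure constraints, where terrestrial communication suffers from limited coverage and reliability. The key to the solution is to exploit the wide coverage and low latency of High-Altitude Platform Stations (HAPS) and to apply the Deep Deterministic Policy Gradient (DDPG) reinforcement learning algorithm to dynamically optimize AoI in Vehicle-to-Everything (V2X) networks. The method enables independent learning without centralized coordination and is reported to improve information freshness and overall network reliability, offering a decentralized, AoI-aware resource allocation mechanism for platoon-based autonomous vehicle systems.
Link: https://arxiv.org/abs/2508.00011
Authors: Ahmet Melih Ince, Ayse Elif Canbilen, Halim Yanikomeroglu
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 6 pages, 3 figures, to appear in IEEE conference proceedings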
Abstract:Sixth-generation (6G) networks are designed to meet the hyper-reliable and low-latency communication (HRLLC) requirements of safety-critical applications such as autonomous driving. Integrating non-terrestrial networks (NTN) into the 6G infrastructure brings redundancy to the network, ensuring continuity of communications even under extreme conditions. In particular, high-altitude platform stations (HAPS) stand out for their wide coverage and low latency advantages, supporting communication reliability and enhancing information freshness, especially in rural areas and regions with infrastructure constraints. In this paper, we present reinforcement learning-based approaches using deep deterministic policy gradient (DDPG) to dynamically optimize the age-of-information (AoI) in HAPS-enabled vehicle-to-everything (V2X) networks. The proposed method improves information freshness and overall network reliability by enabling independent learning without centralized coordination. The findings reveal the potential of HAPS-supported solutions, combined with DDPG-based learning, for efficient AoI-aware resource allocation in platoon-based autonomous vehicle systems.
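For readers unfamiliar with the metric being optimized, the following sketch shows how Age-of-Information evolves in a simple discrete-time model. The convention used here (age resets to 1 in a slot where a fresh update is delivered, and otherwise grows by 1) is a common textbook assumption, not necessarily the exact model in the paper, and the function name `simulate_aoi` and the delivery pattern are hypothetical.

```python
def simulate_aoi(deliveries, initial_age=1):
    """Return the AoI trace for a sequence of per-slot delivery outcomes.

    Assumed convention: a successful delivery in a slot resets the age to 1;
    a failed slot makes the information one slot older.
    """
    age = initial_age
    trace = []
    for delivered in deliveries:
        age = 1 if delivered else age + 1
        trace.append(age)
    return trace


# Synthetic delivery pattern: two failed slots, a success, a failure, a success.
pattern = [False, False, True, False, True]
trace = simulate_aoi(pattern)
average_aoi = sum(trace) / len(trace)
```

A resource allocation policy, such as one learned with DDPG, would aim to schedule transmissions so that the average (or peak) of this trace stays low across vehicles.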
[AI-63] Enabling Immersive XR Collaborations over FTTR Networks (Invited)
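[Quick Read]: This paper addresses the degradation of immersive experience in in-premise Extended Reality (XR) collaboration caused by uneven bandwidth allocation and interruptions during network handover. The key to its solution is a predictive bandwidth allocation mechanism combined with a seamless handover strategy, which together provide dynamic resource optimization and low-latency handover over a Fiber-To-The-Room (FTTR) architecture, supporting stable and continuous high-fidelity XR applications.
Link: https://arxiv.org/abs/2508.00009
Authors: Sourav Mondal, Elaine Wong
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This invited paper was presented in Optica Advanced Photonic Congress 2025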
Abstract:Fiber-To-The-Room (FTTR) is a potential solution to achieve in-premise extended reality collaborations. This paper explores predictive bandwidth allocation and seamless handover schemes over FTTR, showing that a high-quality immersive experience for in-premise collaborations can be achieved. © 2025 The Author(s).
[AI-64] Agent Network Protocol Technical White Paper
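[Quick Read]: This paper addresses the inability of current internet infrastructure to support efficient interconnection and collaboration among large numbers of agents, with data silos, unfriendly interfaces, and high collaboration costs as the main obstacles. The key to its solution is a new generation of communication protocol, the Agent Network Protocol (ANP). ANP follows an AI-native design, remains compatible with existing internet protocols, and adopts a modular, composable architecture. Through a three-layer protocol system (an identity and encrypted communication layer, a meta-protocol negotiation layer, and an application protocol layer), it systematically handles agent identity authentication, dynamic negotiation, and capability discovery and interoperability, in support of the Agentic Web.
Link: https://arxiv.org/abs/2508.00007
Authors: Gaowei Chang, Eidan Lin, Chengxuan Yuan, Rizhao Cai, Binbin Chen, Xuan Xie, Yin Zhang
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This white paper is a reformatted version of the open-source community edition previously released by the ANP Open Source Technology Community (this https URL)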
Abstract:With the development of large models and autonomous decision-making AI, agents are rapidly becoming the new entities of the internet, following mobile apps. However, existing internet infrastructure is primarily designed for human interaction, creating data silos, unfriendly interfaces, and high collaboration costs among agents, making it difficult to support the needs for large-scale agent interconnection and collaboration. The internet is undergoing a profound transformation, showing four core trends: agents replacing traditional software, universal agent interconnection, native protocol-based connections, and autonomous agent organization and collaboration. To align with these trends, Agent Network Protocol (ANP) proposes a new generation of communication protocols for the Agentic Web. ANP adheres to AI-native design, maintains compatibility with existing internet protocols, adopts a modular composable architecture, follows minimalist yet extensible principles, and enables rapid deployment based on existing infrastructure. Through a three-layer protocol system (identity and encrypted communication layer, meta-protocol negotiation layer, and application protocol layer), ANP systematically solves the problems of agent identity authentication, dynamic negotiation, and capability discovery interoperability.
[AI-65] Modelling Program Spaces in Program Synthesis with Constraints
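[Quick Read]: This paper addresses the combinatorial explosion that the large space of possible programs causes in program synthesis. Traditional approaches use constraint solvers to express program semantics but do not exploit constraints as a tool for removing unwanted programs. The key to the solution is to introduce syntactic constraints: restrictions imposed at the level of program syntax that can be checked and propagated without executing the program, which shrinks the search space efficiently. The approach defines not only which solutions are feasible but also which are likely to be useful. In the reported experiments, the constraints eliminate up to 99 percent of the program space and substantially reduce enumeration time.
Link: https://arxiv.org/abs/2508.00005
Authors: Tilman Hinnerichs, Bart Swinkels, Jaap de Jong, Reuben Gardos Reid, Tudor Magirescu, Neil Yorke-Smith, Sebastijan Dumancic
Affiliation: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: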
Abstract:A core challenge in program synthesis is taming the large space of possible programs. Since program synthesis is essentially a combinatorial search, the community has sought to leverage powerful combinatorial constraint solvers. Here, constraints are used to express the program semantics, but not as a potentially potent tool to remove unwanted programs. Recent inductive logic programming approaches introduce constraints on the program’s syntax to be synthesized. These syntactic constraints allow for checking and propagating a constraint without executing the program, and thus for arbitrary operators. In this work, we leverage syntactic constraints to model program spaces, defining not just solutions that are feasible, but also ones that are likely useful. To demonstrate this idea, we introduce BART, a solver that efficiently propagates and solves these constraints. We evaluate BART on program space enumeration tasks, finding that the constraints eliminate up to 99 percent of the program space, and that modeling program spaces significantly reduces enumeration time.
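The idea of pruning a program space with syntactic constraints, checked without running any program, can be illustrated with a toy enumerator. This sketch is not the BART solver from the paper: the grammar, the two constraints (forbidding multiplication by the literal `1` and breaking commutative symmetry by ordering operands), and all function names are assumptions chosen for illustration.

```python
def enumerate_programs(depth, allowed=lambda op, left, right: True):
    """Enumerate expressions over the toy grammar E -> x | 1 | E + E | E * E.

    Programs are leaves ("x", "1") or tuples (op, left, right). The `allowed`
    predicate inspects syntax only; rejected nodes are never built, so every
    larger program containing them is pruned as well.
    """
    programs = ["x", "1"]
    if depth == 0:
        return programs
    children = enumerate_programs(depth - 1, allowed)
    for op in ("+", "*"):
        for left in children:
            for right in children:
                if allowed(op, left, right):
                    programs.append((op, left, right))
    return programs


def syntactic_constraints(op, left, right):
    """Hypothetical constraints that need no program execution."""
    # Forbid redundant multiplication by the literal 1.
    if op == "*" and (left == "1" or right == "1"):
        return False
    # Symmetry breaking: both operators are commutative, so keep one operand order.
    return repr(left) <= repr(right)


all_programs = enumerate_programs(2)
pruned_programs = enumerate_programs(2, syntactic_constraints)
reduction = 1 - len(pruned_programs) / len(all_programs)
```

Because the predicate is applied while the space is being built, constrained subexpressions never appear inside larger programs, which is why even simple syntactic rules can remove a large share of the candidates before any program is evaluated.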
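[AI-66] Agency Among Agents: Designing with Hypertextual Friction in the Algorithmic Web
[Quick Read]: This paper addresses how algorithm-driven interfaces (such as recommendation feeds and generative AI tools), in pursuing engagement and efficiency, erode user agency: control over what users see and how meaning is constructed shifts from the user to the system. The key to its solution is "Hypertextual Friction," a conceptual design stance that turns the classical hypertext principles of friction, traceability, and structure into actionable design values for restoring user initiative in exploration, navigation, and authorship within algorithmically mediated environments. By comparing interfaces such as Wikipedia versus Instagram Explore and a traditional website versus generative AI image tools, the paper shows that hypertext systems emphasize provenance, associative thinking, and user-driven meaning-making, whereas algorithmic systems tend to obscure process and flatten participation. The result is a theoretical framework and practical direction for designing digital interactions that better empower users.
Link: https://arxiv.org/abs/2507.23585
Authors: Sophia Liu, Shm Garanganao Almeda
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Comments: To appear in: Adjunct Proceedings of the 36th ACM Conference on Hypertext and Social Media, Chicago, IL, USA, September 15-18, 2025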
Abstract:Today’s algorithm-driven interfaces, from recommendation feeds to GenAI tools, often prioritize engagement and efficiency at the expense of user agency. As systems take on more decision-making, users have less control over what they see and how meaning or relationships between content are constructed. This paper introduces “Hypertextual Friction,” a conceptual design stance that repositions classical hypertext principles–friction, traceability, and structure–as actionable values for reclaiming agency in algorithmically mediated environments. Through a comparative analysis of real-world interfaces–Wikipedia vs. Instagram Explore, and this http URL vs. GenAI image tools–we examine how different systems structure user experience, navigation, and authorship. We show that hypertext systems emphasize provenance, associative thinking, and user-driven meaning-making, while algorithmic systems tend to obscure process and flatten participation. We contribute: (1) a comparative analysis of how interface structures shape agency in user-driven versus agent-driven systems, and (2) a conceptual stance that offers hypertextual values as design commitments for reclaiming agency in an increasingly algorithmic web.
[AI-67] Advancing Quantum Information Science Pre-College Education: The Case for Learning Sciences Collaboration
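[Quick Read]: This paper asks how young learners can be prepared to participate in Quantum Information Science (QIS), a field radically different from their prior learning experience. The key to its answer is a two-way collaboration between QIS and the Learning Sciences (LS). On one side, design-based research, the signature methodology of the learning sciences, can inform the development, refinement, and scaling of effective QIS learning experiences. On the other, a framework built on shifts in knowledge representations can reshape how learners reason about, learn, and participate in QIS practices, enabling deeper engagement. Such interdisciplinary collaboration is expected both to improve QIS education and to deepen understanding of how to teach in highly complex domains.
Link: https://arxiv.org/abs/2508.00668
Authors: Raquel Coelho, Roy Pea, Christian Schunn, Jinglei Cheng, Junyu Liu
Affiliation: Unknown
Categories: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Quantum Physics (quant-ph)
Comments: 12 pages, 2 figures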
Abstract:As quantum information science advances and the need for pre-college engagement grows, a critical question remains: How can young learners be prepared to participate in a field so radically different from what they have encountered before? This paper argues that meeting this challenge will require strong interdisciplinary collaboration with the Learning Sciences (LS), a field dedicated to understanding how people learn and designing theory-guided environments to support learning. Drawing on lessons from previous STEM education efforts, we discuss two key contributions of the learning sciences to quantum information science (QIS) education. The first is design-based research, the signature methodology of learning sciences, which can inform the development, refinement, and scaling of effective QIS learning experiences. The second is a framework for reshaping how learners reason about, learn and participate in QIS practices through shifts in knowledge representations that provide new forms of engagement and associated learning. We call for a two-way partnership between quantum information science and the learning sciences, one that not only supports learning in quantum concepts and practices but also improves our understanding of how to teach and support learning in highly complex domains. We also consider potential questions involved in bridging these disciplinary communities and argue that the theoretical and practical benefits justify the effort.
[AI-68] Beamformed 360° Sound Maps: U-Net-Driven Acoustic Source Segmentation and Localization
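[Quick Read]: This paper addresses the limited precision of traditional Sound Source Localization (SSL) on 360° spatial audio, its dependence on specific microphone array configurations, and its difficulty in modeling spatially dense sources. The key to its solution is to recast localization as a spherical semantic segmentation problem: a modified U-Net segments audio energy maps produced by Delay-and-Sum (DAS) beamforming to identify the regions where sound sources are active. The Tversky loss mitigates class imbalance, and centroids computed over the activated regions yield robust Direction-of-Arrival (DoA) estimates. Because the network operates on beamformed energy maps, the method does not depend on a particular microphone layout and is described as array-independent with good generalization.
Link: https://arxiv.org/abs/2508.00307
Authors: Belman Jahir Rodriguez, Sergio F. Chevtchenko, Marcelo Herrera Martinez, Yeshwant Bethy, Saeed Afshar
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments: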
Abstract:We introduce a U-net model for 360° acoustic source localization formulated as a spherical semantic segmentation task. Rather than regressing discrete direction-of-arrival (DoA) angles, our model segments beamformed audio maps (azimuth and elevation) into regions of active sound presence. Using delay-and-sum (DAS) beamforming on a custom 24-microphone array, we generate signals aligned with drone GPS telemetry to create binary supervision masks. A modified U-Net, trained on frequency-domain representations of these maps, learns to identify spatially distributed source regions while addressing class imbalance via the Tversky loss. Because the network operates on beamformed energy maps, the approach is inherently array-independent and can adapt to different microphone configurations without retraining from scratch. The segmentation outputs are post-processed by computing centroids over activated regions, enabling robust DoA estimates. Our dataset includes real-world open-field recordings of a DJI Air 3 drone, synchronized with 360° video and flight logs across multiple dates and locations. Experimental results show that U-net generalizes across environments, providing improved angular precision, offering a new paradigm for dense spatial audio understanding beyond traditional Sound Source Localization (SSL).
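The Tversky loss mentioned in the abstract can be written in a few lines. The sketch below uses the standard definition, with the Tversky index computed as TP / (TP + α·FP + β·FN) and the loss as one minus that index. The weights `alpha=0.3` and `beta=0.7`, the smoothing term `eps`, and the flat-list mask representation are assumptions for illustration; the paper's actual settings are not given in the abstract.

```python
def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss for a flattened segmentation mask.

    `pred` holds predicted probabilities in [0, 1]; `target` holds 0/1 labels.
    alpha weights false positives and beta weights false negatives, so
    beta > alpha penalizes missed source pixels more than false alarms,
    which helps when active regions are rare.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Synthetic masks: one true positive, one false positive, one false negative.
prediction = [1.0, 1.0, 0.0, 0.0]
ground_truth = [1, 0, 1, 0]
loss = tversky_loss(prediction, ground_truth)

# A perfect prediction gives zero loss.
perfect = tversky_loss([1.0, 0.0, 1.0, 0.0], ground_truth)
```

With alpha and beta both set to 0.5 the index reduces to the Dice coefficient, so the Tversky loss can be read as a Dice loss with an adjustable trade-off between the two error types.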
[AI-69] Formal Power Series Representations in Probability and Expected Utility Theory
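[Quick read]: This paper addresses the strict completeness and consistency restrictions that traditional preference theory places on preference systems, such as transitivity, the Archimedean property, boundedness, and continuity. The key is a more general theory of coherent preference, which only requires a preference system to satisfy a coherence condition analogous to the one de Finetti advanced for the foundations of probability in order to be extendable to a complete preference system. The theory does not rely on the traditional assumptions listed above. It further shows that any complete preference system meeting this coherence condition can be represented by utility in an ordered field extension of the reals, a corollary of a central result that at once extends Hölder's Theorem and strengthens Hahn's Embedding Theorem.
Link: https://arxiv.org/abs/2508.00294
Authors: Arthur Paul Pedersen,Samuel Allen Alexander
Affiliations: Unknown
Categories: Probability (math.PR); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH); Logic (math.LO); Statistics Theory (math.ST)
Comments: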
Abstract:We advance a general theory of coherent preference that surrenders restrictions embodied in orthodox doctrine. This theory enjoys the property that any preference system admits extension to a complete system of preferences, provided it satisfies a certain coherence requirement analogous to the one de Finetti advanced for his foundations of probability. Unlike de Finetti’s theory, the one we set forth requires neither transitivity nor Archimedeanness nor boundedness nor continuity of preference. This theory also enjoys the property that any complete preference system meeting the standard of coherence can be represented by utility in an ordered field extension of the reals. Representability by utility is a corollary of this paper’s central result, which at once extends Hölder’s Theorem and strengthens Hahn’s Embedding Theorem.
[AI-70] Embedding-Aware Quantum-Classical SVMs for Scalable Quantum Machine Learning
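[Quick read]: This paper addresses the scalability problems that quantum support vector machines (QSVMs) face in practice, namely the bottlenecks caused by high-dimensional quantum states and hardware limitations. The key is an embedding-aware quantum-classical hybrid pipeline that combines class-balanced k-means distillation with feature embeddings from a pretrained Vision Transformer (ViT). The study finds that ViT embeddings uniquely enable quantum advantage, with accuracy improvements of up to 8.02% on Fashion-MNIST and 4.42% on MNIST, whereas convolutional neural network (CNN) features degrade performance. Using 16-qubit tensor network simulation (cuTensorNet), the authors report what they describe as the first systematic evidence that quantum kernel advantage depends heavily on the choice of embedding, pointing to a fundamental synergy between transformer attention and quantum feature spaces.
Link: https://arxiv.org/abs/2508.00024
Authors: Sebastián Andrés Cajas Ordóñez,Luis Fernando Torres Torres,Mario Bifulco,Carlos Andrés Durán,Cristian Bosch,Ricardo Simón Carbajo
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: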
Abstract:Quantum Support Vector Machines face scalability challenges due to high-dimensional quantum states and hardware limitations. We propose an embedding-aware quantum-classical pipeline combining class-balanced k-means distillation with pretrained Vision Transformer embeddings. Our key finding: ViT embeddings uniquely enable quantum advantage, achieving up to 8.02% accuracy improvements over classical SVMs on Fashion-MNIST and 4.42% on MNIST, while CNN features show performance degradation. Using 16-qubit tensor network simulation via cuTensorNet, we provide the first systematic evidence that quantum kernel advantage depends critically on embedding choice, revealing fundamental synergy between transformer attention and quantum feature spaces. This provides a practical pathway for scalable quantum machine learning that leverages modern neural architectures.
Machine Learning
[LG-0] Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management
Link: https://arxiv.org/abs/2508.00806
Authors: Ping Chen,Zhuohong Deng,Ping Li,Shuibing He,Hongzi Zhu,Yi Zheng,Zhefeng Wang,Baoxing Huai,Minyi Guo
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages
Abstract:Training large language models often employs recomputation to alleviate memory pressure, which can introduce up to 30% overhead in real-world scenarios. In this paper, we propose Adacc, a novel memory management framework that combines adaptive compression and activation checkpointing to reduce the GPU memory footprint. It comprises three modules: (1) We design layer-specific compression algorithms that account for outliers in LLM tensors, instead of directly quantizing floats from FP16 to INT4, to ensure model accuracy. (2) We propose an optimal scheduling policy that employs MILP to determine the best memory optimization for each tensor. (3) To accommodate changes in training tensors, we introduce an adaptive policy evolution mechanism that adjusts the policy during training to enhance throughput. Experimental results show that Adacc can accelerate the LLM training by 1.01x to 1.37x compared to state-of-the-art frameworks, while maintaining comparable model accuracy to the Baseline.
[LG-1] Online Fine-Tuning of Carbon Emission Predictions using Real-Time Recurrent Learning for State Space Models
Link: https://arxiv.org/abs/2508.00804
Authors: Julian Lemmel,Manuel Kranzl,Adam Lamine,Philipp Neubauer,Radu Grosu,Sophie Neubauer
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 6 pages
Abstract:This paper introduces a new approach for fine-tuning the predictions of structured state space models (SSMs) at inference time using real-time recurrent learning. While SSMs are known for their efficiency and long-range modeling capabilities, they are typically trained offline and remain static during deployment. Our method enables online adaptation by continuously updating model parameters in response to incoming data. We evaluate our approach for linear-recurrent-unit SSMs using a small carbon emission dataset collected from embedded automotive hardware. Experimental results show that our method consistently reduces prediction error online during inference, demonstrating its potential for dynamic, resource-constrained environments.
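The idea of updating parameters online with real-time recurrent learning can be illustrated on a deliberately tiny model. The sketch below assumes a scalar, real-valued linear recurrence `h_t = lam * h_{t-1} + x_t` with a squared-error loss; the paper's linear-recurrent-unit SSMs are larger, and the names `rtrl_step` and `online_fit`, the learning rate, and the toy data are all hypothetical.

```python
def rtrl_step(lam, h, s, x):
    """Advance the state h and its sensitivity s = dh/dlam by one time step."""
    s_new = h + lam * s   # product rule applied to lam * h_{t-1}
    h_new = lam * h + x   # linear recurrence h_t = lam * h_{t-1} + x_t
    return h_new, s_new


def online_fit(xs, ys, lam=0.0, lr=0.01):
    """Adjust lam after every sample using the gradient of 0.5 * (h_t - y_t)**2."""
    h, s = 0.0, 0.0
    for x, y in zip(xs, ys):
        h, s = rtrl_step(lam, h, s, x)
        lam -= lr * (h - y) * s   # forward-mode gradient, no backward pass through time
    return lam


# Targets generated by a recurrence with lam = 0.5; the estimate starts at 0.
print(online_fit([1.0, 1.0], [1.0, 1.5]))
```

Because the sensitivity is carried forward alongside the state, each update needs only the current sample, which is what makes this style of learning usable during inference.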
[LG-2] Explainable AI and Machine Learning for Exam-based Student Evaluation: Causal and Predictive Analysis of Socio-academic and Economic Factors
Link: https://arxiv.org/abs/2508.00785
Authors: Bushra Akter,Md Biplob Hosen,Sabbir Ahmed,Mehrin Anannya,Md. Farhad Hossain
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Academic performance depends on a multivariable nexus of socio-academic and financial factors. This study investigates these influences to develop effective strategies for optimizing students’ CGPA. To achieve this, we reviewed various literature to identify key influencing factors and constructed an initial hypothetical causal graph based on the findings. Additionally, an online survey was conducted, where 1,050 students participated, providing comprehensive data for analysis. Rigorous data preprocessing techniques, including cleaning and visualization, ensured data quality before analysis. Causal analysis validated the relationships among variables, offering deeper insights into their direct and indirect effects on CGPA. Regression models were implemented for CGPA prediction, while classification models categorized students based on performance levels. Ridge Regression demonstrated strong predictive accuracy, achieving a Mean Absolute Error of 0.12 and a Mean Squared Error of 0.023. Random Forest outperformed in classification, attaining an F1-score near perfection and an accuracy of 98.68%. Explainable AI techniques such as SHAP, LIME, and Interpret enhanced model interpretability, highlighting critical factors such as study hours, scholarships, parental education, and prior academic performance. The study culminated in the development of a web-based application that provides students with personalized insights, allowing them to predict academic performance, identify areas for improvement, and make informed decisions to enhance their outcomes.
[LG-3] Learning to optimize with guarantees: a complete characterization of linearly convergent algorithms
Link: https://arxiv.org/abs/2508.00775
Authors: Andrea Martin,Ian R. Manchester,Luca Furieri
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:In high-stakes engineering applications, optimization algorithms must come with provable worst-case guarantees over a mathematically defined class of problems. Designing for the worst case, however, inevitably sacrifices performance on the specific problem instances that often occur in practice. We address the problem of augmenting a given linearly convergent algorithm to improve its average-case performance on a restricted set of target problems - for example, tailoring an off-the-shelf solver for model predictive control (MPC) for an application to a specific dynamical system - while preserving its worst-case guarantees across the entire problem class. Toward this goal, we characterize the class of algorithms that achieve linear convergence for classes of nonsmooth composite optimization problems. In particular, starting from a baseline linearly convergent algorithm, we derive all - and only - the modifications to its update rule that maintain its convergence properties. Our results apply to augmenting legacy algorithms such as gradient descent for nonconvex, gradient-dominated functions; Nesterov’s accelerated method for strongly convex functions; and projected methods for optimization over polyhedral feasibility sets. We showcase effectiveness of the approach on solving optimization problems with tight iteration budgets in application to ill-conditioned systems of linear equations and MPC for linear systems.
[LG-4] Evaluating Angle and Amplitude Encoding Strategies for Variational Quantum Machine Learning: their impact on models accuracy
Link: https://arxiv.org/abs/2508.00768
Authors: Antonio Tudisco,Andrea Marchesin,Maurizio Zamboni,Mariagrazia Graziano,Giovanna Turvani
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent advancements in Quantum Computing and Machine Learning have increased attention to Quantum Machine Learning (QML), which aims to develop machine learning models by exploiting the quantum computing paradigm. One of the widely used models in this area is the Variational Quantum Circuit (VQC), a hybrid model where the quantum circuit handles data inference while classical optimization adjusts the parameters of the circuit. The quantum circuit consists of an encoding layer, which loads data into the circuit, and a template circuit, known as the ansatz, responsible for processing the data. This work involves performing an analysis by considering both Amplitude- and Angle-encoding models, and examining how the type of rotational gate applied affects the classification performance of the model. This comparison is carried out by training the different models on two datasets, Wine and Diabetes, and evaluating their performance. The study demonstrates that, under identical model topologies, the difference in accuracy between the best and worst models ranges from 10% to 30%, with differences reaching up to 41%. Moreover, the results highlight how the choice of rotational gates used in encoding can significantly impact the model’s classification performance. The findings confirm that the embedding represents a hyperparameter for VQC models.
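The two encoding families compared in the abstract above differ in how classical features become a quantum state. The following is a classical, state-vector sketch under simple assumptions: angle encoding uses one RY rotation per feature on its own qubit, and amplitude encoding pads the feature vector with zeros to a power of two. The function names are hypothetical and no quantum library is involved.

```python
import math


def ry_state(x):
    """Amplitudes of RY(x)|0> = cos(x/2)|0> + sin(x/2)|1>."""
    return [math.cos(x / 2), math.sin(x / 2)]


def angle_encode(features):
    """One qubit per feature: tensor product of the single-qubit RY states."""
    state = [1.0]
    for x in features:
        qubit = ry_state(x)
        state = [a * b for a in state for b in qubit]
    return state


def amplitude_encode(features):
    """Features become amplitudes: zero-pad to a power of two, then L2-normalize.

    Raises ZeroDivisionError for an all-zero input, which has no valid encoding.
    """
    size = 1
    while size < len(features):
        size *= 2
    padded = list(features) + [0.0] * (size - len(features))
    norm = math.sqrt(sum(v * v for v in padded))
    return [v / norm for v in padded]


features = [0.1, 0.2, 0.3]
print(len(angle_encode(features)))      # 3 qubits, so 2**3 amplitudes
print(len(amplitude_encode(features)))  # 2 qubits, so 2**2 amplitudes
```

Angle encoding needs as many qubits as features, while amplitude encoding needs only a logarithmic number of qubits. Swapping RY for another rotation gate changes the amplitudes (RX, for example, introduces complex phases), which is the kind of choice the paper studies.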
[LG-5] Diffusion-Scheduled Denoising Autoencoders for Anomaly Detection in Tabular Data
Link: https://arxiv.org/abs/2508.00758
Authors: Timur Sattarov,Marco Schreyer,Damian Borth
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 16 figures, 7 tables, preprint version
Abstract:Anomaly detection in tabular data remains challenging due to complex feature interactions and the scarcity of anomalous examples. Denoising autoencoders rely on fixed-magnitude noise, limiting adaptability to diverse data distributions. Diffusion models introduce scheduled noise and iterative denoising, but lack explicit reconstruction mappings. We propose the Diffusion-Scheduled Denoising Autoencoder (DDAE), a framework that integrates diffusion-based noise scheduling and contrastive learning into the encoding process to improve anomaly detection. We evaluated DDAE on 57 datasets from ADBench. Our method outperforms in semi-supervised settings and achieves competitive results in unsupervised settings, improving PR-AUC by up to 65% (9%) and ROC-AUC by 16% (6%) over state-of-the-art autoencoder (diffusion) model baselines. We observed that higher noise levels benefit unsupervised training, while lower noise with linear scheduling is optimal in semi-supervised settings. These findings underscore the importance of principled noise strategies in tabular anomaly detection.
[LG-6] Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK
Link: https://arxiv.org/abs/2508.00718
Authors: Ivona Krchova,Mariana Vargas Vieyra,Mario Scriminaci,Andrey Sidorenko
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Machine learning development critically depends on access to high-quality data. However, increasing restrictions due to privacy, proprietary interests, and ethical concerns have created significant barriers to data accessibility. Synthetic data offers a viable solution by enabling safe, broad data usage without compromising sensitive information. This paper presents the MOSTLY AI Synthetic Data Software Development Kit (SDK), an open-source toolkit designed specifically for synthesizing high-quality tabular data. The SDK integrates robust features such as differential privacy guarantees, fairness-aware data generation, and automated quality assurance into a flexible and accessible Python interface. Leveraging the TabularARGN autoregressive framework, the SDK supports diverse data types and complex multi-table and sequential datasets, delivering competitive performance with notable improvements in speed and usability. Currently deployed both as a cloud service and locally installable software, the SDK has seen rapid adoption, highlighting its practicality in addressing real-world data bottlenecks and promoting widespread data democratization.
[LG-7] Learning Network Dismantling without Handcrafted Inputs
Link: https://arxiv.org/abs/2508.00706
Authors: Haozhe Tian,Pietro Ferraro,Robert Shorten,Mahdi Jalili,Homayoun Hamedmoghadam
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The application of message-passing Graph Neural Networks has been a breakthrough for important network science problems. However, the competitive performance often relies on using handcrafted structural features as inputs, which increases computational cost and introduces bias into the otherwise purely data-driven network representations. Here, we eliminate the need for handcrafted features by introducing an attention mechanism and utilizing message-iteration profiles, in addition to an effective algorithmic approach to generate a structurally diverse training set of small synthetic networks. Thereby, we build an expressive message-passing framework and use it to efficiently solve the NP-hard problem of Network Dismantling, virtually equivalent to vital node identification, with significant real-world applications. Trained solely on diversified synthetic networks, our proposed model – MIND: Message Iteration Network Dismantler – generalizes to large, unseen real networks with millions of nodes, outperforming state-of-the-art network dismantling methods. Increased efficiency and generalizability of the proposed model can be leveraged beyond dismantling in a range of complex network problems.
[LG-8] Wind Power Scenario Generation based on the Generalized Dynamic Factor Model and Generative Adversarial Network
Link: https://arxiv.org/abs/2508.00692
Authors: Young-ho Cho,Hao Zhu,Duehee Lee,Ross Baldick
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:For conducting resource adequacy studies, we synthesize multiple long-term wind power scenarios of distributed wind farms simultaneously by using the spatio-temporal features: spatial and temporal correlation, waveforms, marginal and ramp rates distributions of waveform, power spectral densities, and statistical characteristics. Generating the spatial correlation in scenarios requires the design of common factors for neighboring wind farms and antithetical factors for distant wind farms. The generalized dynamic factor model (GDFM) can extract the common factors through cross spectral density analysis, but it cannot closely imitate waveforms. The GAN can synthesize plausible samples representing the temporal correlation by verifying samples through a fake sample discriminator. To combine the advantages of GDFM and GAN, we use the GAN to provide a filter that extracts dynamic factors with temporal information from the observation data, and we then apply this filter in the GDFM to represent both spatial and frequency correlations of plausible waveforms. Numerical tests on the combination of GDFM and GAN have demonstrated performance improvements over competing alternatives in synthesizing wind power scenarios from Australia, better realizing plausible statistical characteristics of actual wind power compared to alternatives such as the GDFM with a filter synthesized from distributions of actual dynamic filters and the GAN with direct synthesis without dynamic factors.
[LG-9] DP-DGAD: A Generalist Dynamic Graph Anomaly Detector with Dynamic Prototypes
Link: https://arxiv.org/abs/2508.00664
Authors: Jialun Zheng,Jie Liu,Jiannong Cao,Xiao Wang,Hanchen Yang,Yankai Chen,Philip S. Yu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Dynamic graph anomaly detection (DGAD) is essential for identifying anomalies in evolving graphs across domains such as finance, traffic, and social networks. Recently, generalist graph anomaly detection (GAD) models have shown promising results. They are pretrained on multiple source datasets and generalize across domains. While effective on static graphs, they struggle to capture evolving anomalies in dynamic graphs. Moreover, the continuous emergence of new domains and the lack of labeled data further challenge generalist DGAD. Effective cross-domain DGAD requires both domain-specific and domain-agnostic anomalous patterns. Importantly, these patterns evolve temporally within and across domains. Building on these insights, we propose a DGAD model with Dynamic Prototypes (DP) to capture evolving domain-specific and domain-agnostic patterns. Firstly, DP-DGAD extracts dynamic prototypes, i.e., evolving representations of normal and anomalous patterns, from temporal ego-graphs and stores them in a memory buffer. The buffer is selectively updated to retain general, domain-agnostic patterns while incorporating new domain-specific ones. Then, an anomaly scorer compares incoming data with dynamic prototypes to flag both general and domain-specific anomalies. Finally, DP-DGAD employs confidence-based pseudo-labeling for effective self-supervised adaptation in target domains. Extensive experiments demonstrate state-of-the-art performance across ten real-world datasets from different domains.
[LG-10] TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction
Link: https://arxiv.org/abs/2508.00657
Authors: Sihang Zeng,Lucas Jing Liu,Jun Wen,Meliha Yetisgen,Ruth Etzioni,Gang Luo
Categories: Machine Learning (cs.LG)
Comments: Accepted by MLHC 2025
Abstract:Trustworthy survival prediction is essential for clinical decision making. Longitudinal electronic health records (EHRs) provide a uniquely powerful opportunity for the prediction. However, it is challenging to accurately model the continuous clinical progression of patients underlying the irregularly sampled clinical features and to transparently link the progression to survival outcomes. To address these challenges, we develop TrajSurv, a model that learns continuous latent trajectories from longitudinal EHR data for trustworthy survival prediction. TrajSurv employs a neural controlled differential equation (NCDE) to extract continuous-time latent states from the irregularly sampled data, forming continuous latent trajectories. To ensure the latent trajectories reflect the clinical progression, TrajSurv aligns the latent state space with patient state space through a time-aware contrastive learning approach. To transparently link clinical progression to the survival outcome, TrajSurv uses latent trajectories in a two-step divide-and-conquer interpretation process. First, it explains how the changes in clinical features translate into the latent trajectory’s evolution using a learned vector field. Second, it clusters these latent trajectories to identify key clinical progression patterns associated with different survival outcomes. Evaluations on two real-world medical datasets, MIMIC-III and eICU, show TrajSurv’s competitive accuracy and superior transparency over existing deep learning methods.
[LG-11] Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators
Link: https://arxiv.org/abs/2508.00643
Authors: Albert Matveev,Sanmitra Ghosh,Aamal Hussain,James-Michael Leahy,Michalis Michaelides
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Operator learning is a powerful paradigm for solving partial differential equations, with Fourier Neural Operators serving as a widely adopted foundation. However, FNOs face significant scalability challenges due to overparameterization and offer no native uncertainty quantification – a key requirement for reliable scientific and engineering applications. Instead, neural operators rely on post hoc UQ methods that ignore geometric inductive biases. In this work, we introduce DINOZAUR: a diffusion-based neural operator parametrization with uncertainty quantification. Inspired by the structure of the heat kernel, DINOZAUR replaces the dense tensor multiplier in FNOs with a dimensionality-independent diffusion multiplier that has a single learnable time parameter per channel, drastically reducing parameter count and memory footprint without compromising predictive performance. By defining priors over those time parameters, we cast DINOZAUR as a Bayesian neural operator to yield spatially correlated outputs and calibrated uncertainty estimates. Our method achieves competitive or superior performance across several PDE benchmarks while providing efficient uncertainty quantification.
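The heat-kernel idea behind the diffusion multiplier can be shown on a one-dimensional periodic signal. This is a sketch under stated assumptions, not DINOZAUR itself: it uses a naive DFT, a single channel with one time parameter `t`, and the symbol `exp(-t * k**2)` with integer frequencies `k`; the paper's exact normalization and multi-channel layout are not given in the abstract, and the function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, O(n^2), for illustration only."""
    n = len(signal)
    return [sum(signal[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def diffusion_layer(signal, t):
    """Scale each Fourier mode by exp(-t * k^2), the heat-kernel symbol."""
    n = len(signal)
    damped = []
    for k, coeff in enumerate(dft(signal)):
        freq = k if k <= n // 2 else k - n   # signed integer frequency
        damped.append(coeff * math.exp(-t * freq * freq))
    return [v.real for v in idft(damped)]


# A single cosine mode with frequency 1 is simply scaled by exp(-t).
wave = [math.cos(2 * math.pi * j / 8) for j in range(8)]
print(diffusion_layer(wave, 0.3)[0], math.exp(-0.3))
```

The multiplier has one scalar per channel regardless of how many modes or spatial dimensions are present, which is where the parameter savings over a dense per-mode weight tensor come from.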
[LG-12] Reinforcement Learning for Decision-Level Interception Prioritization in Drone Swarm Defense
Link: https://arxiv.org/abs/2508.00641
Authors: Alessandro Palmas
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 10 figures
Abstract:The growing threat of low-cost kamikaze drone swarms poses a critical challenge to modern defense systems demanding rapid and strategic decision-making to prioritize interceptions across multiple effectors and high-value target zones. In this work, we present a case study demonstrating the practical advantages of reinforcement learning in addressing this challenge. We introduce a high-fidelity simulation environment that captures realistic operational constraints, within which a decision-level reinforcement learning agent learns to coordinate multiple effectors for optimal interception prioritization. Operating in a discrete action space, the agent selects which drone to engage per effector based on observed state features such as positions, classes, and effector status. We evaluate the learned policy against a handcrafted rule-based baseline across hundreds of simulated attack scenarios. The reinforcement learning based policy consistently achieves lower average damage and higher defensive efficiency in protecting critical zones. This case study highlights the potential of reinforcement learning as a strategic layer within defense architectures, enhancing resilience without displacing existing control systems. All code and simulation assets are publicly released for full reproducibility, and a video demonstration illustrates the policy’s qualitative behavior.
[LG-13] KFS: KAN based adaptive Frequency Selection learning architecture for long term time series forecasting
Link: https://arxiv.org/abs/2508.00635
Authors: Changning Wu,Gao Wu,Rongyao Cai,Yong Liu,Kexin Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multi-scale decomposition architectures have emerged as predominant methodologies in time series forecasting. However, real-world time series exhibit noise interference across different scales, while heterogeneous information distribution among frequency components at varying scales leads to suboptimal multi-scale representation. Inspired by Kolmogorov-Arnold Networks (KAN) and Parseval’s theorem, we propose a KAN based adaptive Frequency Selection learning architecture (KFS) to address these challenges. This framework tackles prediction challenges stemming from cross-scale noise interference and complex pattern modeling through its FreK module, which performs energy-distribution-based dominant frequency selection in the spectral domain. Simultaneously, KAN enables sophisticated pattern representation while timestamp embedding alignment synchronizes temporal representations across scales. The feature mixing module then fuses scale-specific patterns with aligned temporal features. Extensive experiments across multiple real-world time series datasets demonstrate that KFS achieves state-of-the-art performance as a simple yet effective architecture.
[LG-14] Separated-Variable Spectral Neural Networks: A Physics-Informed Learning Approach for High-Frequency PDEs
Link: https://arxiv.org/abs/2508.00628
Authors: Xiong Xiong,Zhuo Zhang,Rongchun Hu,Chen Gao,Zichen Deng
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Solving high-frequency oscillatory partial differential equations (PDEs) is a critical challenge in scientific computing, with applications in fluid mechanics, quantum mechanics, and electromagnetic wave propagation. Traditional physics-informed neural networks (PINNs) suffer from spectral bias, limiting their ability to capture high-frequency solution components. We introduce Separated-Variable Spectral Neural Networks (SV-SNN), a novel framework that addresses these limitations by integrating separation of variables with adaptive spectral methods. Our approach features three key innovations: (1) decomposition of multivariate functions into univariate function products, enabling independent spatial and temporal networks; (2) adaptive Fourier spectral features with learnable frequency parameters for high-frequency capture; and (3) theoretical framework based on singular value decomposition to quantify spectral bias. Comprehensive evaluation on benchmark problems including Heat equation, Helmholtz equation, Poisson equations and Navier-Stokes equations demonstrates that SV-SNN achieves 1-3 orders of magnitude improvement in accuracy while reducing parameter count by over 90% and training time by 60%. These results establish SV-SNN as an effective solution to the spectral bias problem in neural PDE solving. The implementation will be made publicly available upon acceptance at this https URL.
[LG-15] IAMAP: Unlocking Deep Learning in QGIS for non-coders and limited computing resources
Link: https://arxiv.org/abs/2508.00627
Authors: Paul Tresson,Pierre Le Coz,Hadrien Tulet,Anthony Malkassian,Maxime Réjou Méchain
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 5 figures
Abstract:Remote sensing has entered a new era with the rapid development of artificial intelligence approaches. However, the implementation of deep learning has largely remained restricted to specialists and has been impractical because it often requires (i) large reference datasets for model training and validation; (ii) substantial computing resources; and (iii) strong coding skills. Here, we introduce IAMAP, a user-friendly QGIS plugin that addresses these three challenges in an easy yet flexible way. IAMAP builds on recent advancements in self-supervised learning strategies, which now provide robust feature extractors, often referred to as foundation models. These generalist models can often be reliably used in few-shot or zero-shot scenarios (i.e., with little to no fine-tuning). IAMAP’s interface allows users to streamline several key steps in remote sensing image analysis: (i) extracting image features using a wide range of deep learning architectures; (ii) reducing dimensionality with built-in algorithms; (iii) performing clustering on features or their reduced representations; (iv) generating feature similarity maps; and (v) calibrating and validating supervised machine learning models for prediction. By enabling non-AI specialists to leverage the high-quality features provided by recent deep learning approaches without requiring GPU capacity or extensive reference datasets, IAMAP contributes to the democratization of computationally efficient and energy-conscious deep learning methods.
[LG-16] Information-Theoretic Decentralized Secure Aggregation with Collusion Resilience
Link: https://arxiv.org/abs/2508.00596
Authors: Xiang Zhang,Zhou Li,Shuangyang Li,Kai Wan,Derrick Wing Kwan Ng,Giuseppe Caire
Categories: Information Theory (cs.IT); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Submitted to IEEE for potential journal publication
Abstract:In decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems. Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of K fully-connected users, each holding a private input – an abstraction of local training data – who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to T other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than K - 1 independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in distributed learning systems.
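The symbol counts in the abstract above (one transmitted symbol per user, one key symbol per user, K - 1 independent key symbols in total) can be made concrete with textbook zero-sum masking. The sketch below only illustrates that counting; it is not the paper's scheme and makes no claim about security against colluding users. It assumes a trusted dealer hands out the keys, and the modulus `P` and the function names are hypothetical.

```python
import secrets

P = 2_147_483_647  # prime modulus for the toy finite field (an assumption)


def zero_sum_keys(num_users):
    """Draw K - 1 independent uniform keys; the last key makes the total zero mod P."""
    keys = [secrets.randbelow(P) for _ in range(num_users - 1)]
    keys.append((-sum(keys)) % P)
    return keys


def masked_messages(inputs, keys):
    """Each user broadcasts a single symbol: its input plus its key, mod P."""
    return [(x + z) % P for x, z in zip(inputs, keys)]


def recover_sum(messages):
    """Any user can add up the broadcasts; the keys cancel and the input sum remains."""
    return sum(messages) % P


inputs = [5, 11, 7, 20]
keys = zero_sum_keys(len(inputs))
print(recover_sum(masked_messages(inputs, keys)))  # the input sum, 43
```

Only K - 1 of the keys are drawn independently because the last one is fixed by the zero-sum constraint, which matches the total key count quoted in the abstract.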
[LG-17] The Role of Active Learning in Modern Machine Learning
Link: https://arxiv.org/abs/2508.00586
Authors: Thorben Werner,Lars Schmidt-Thieme,Vijaya Krishna Yalavarthi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Even though Active Learning (AL) is widely studied, it is rarely applied in contexts outside its own scientific literature. We posit that the reason for this is AL’s high computational cost coupled with the comparatively small lifts it is typically able to generate in scenarios with few labeled points. In this work we study the impact of different methods to combat this low data scenario, namely data augmentation (DA), semi-supervised learning (SSL) and AL. We find that AL is by far the least efficient method of solving the low data problem, generating a lift of only 1-4% over random sampling, while DA and SSL methods can generate up to 60% lift in combination with random sampling. However, when AL is combined with strong DA and SSL techniques, it surprisingly is still able to provide improvements. Based on these results, we frame AL not as a method to combat missing labels, but as the final building block to squeeze the last bits of performance out of data after appropriate DA and SSL methods have been applied.
[LG-18] Learning Potential Energy Surfaces of Hydrogen Atom Transfer Reactions in Peptides
Link: https://arxiv.org/abs/2508.00578
Authors: Marlen Neubert,Patrick Reiser,Frauke Gräter,Pascal Friederich
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
Comments: 19 pages, 12 figures, and 4 tables (references and SI included)
Abstract:Hydrogen atom transfer (HAT) reactions are essential in many biological processes, such as radical migration in damaged proteins, but their mechanistic pathways remain incompletely understood. Simulating HAT is challenging due to the need for quantum chemical accuracy at biologically relevant scales; thus, neither classical force fields nor DFT-based molecular dynamics are applicable. Machine-learned potentials offer an alternative, able to learn potential energy surfaces (PESs) with near-quantum accuracy. However, training these models to generalize across diverse HAT configurations, especially at radical positions in proteins, requires tailored data generation and careful model selection. Here, we systematically generate HAT configurations in peptides to build large datasets using semiempirical methods and DFT. We benchmark three graph neural network architectures (SchNet, Allegro, and MACE) on their ability to learn HAT PESs and indirectly predict reaction barriers from energy predictions. MACE consistently outperforms the others in energy, force, and barrier prediction, achieving a mean absolute error of 1.13 kcal/mol on out-of-distribution DFT barrier predictions. This accuracy enables integration of ML potentials into large-scale collagen simulations to compute reaction rates from predicted barriers, advancing mechanistic understanding of HAT and radical migration in peptides. We analyze scaling laws, model transferability, and cost-performance trade-offs, and outline strategies for improvement by combining ML potentials with transition state search algorithms and active learning. Our approach is generalizable to other biomolecular systems, enabling quantum-accurate simulations of chemical reactivity in complex environments.
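To see why a barrier error near 1 kcal/mol matters when rates are computed from predicted barriers, one can propagate it through a rate expression. The abstract does not say which rate theory is used; the sketch below assumes the Eyring equation with a transmission coefficient of 1 and a temperature of 300 K, and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23         # Boltzmann constant, J/K
H = 6.62607015e-34         # Planck constant, J*s
R = 1.98720425864083e-3    # gas constant, kcal/(mol*K)


def eyring_rate(barrier_kcal_mol: float, temperature_k: float = 300.0) -> float:
    """Rate constant in 1/s from the Eyring equation (transmission coefficient assumed 1)."""
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-barrier_kcal_mol / (R * temperature_k))


def rate_error_factor(barrier_error_kcal_mol: float, temperature_k: float = 300.0) -> float:
    """Multiplicative change in the rate caused by a given error in the barrier."""
    return math.exp(barrier_error_kcal_mol / (R * temperature_k))


# Effect of the reported 1.13 kcal/mol mean absolute error on a predicted rate.
print(f"rate uncertainty factor: {rate_error_factor(1.13):.1f}x")
```

Under these assumptions, the reported error corresponds to a rate uncertainty of roughly a factor of 6 to 7, which is small compared with the many orders of magnitude that separate barriers differing by several kcal/mol.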
[LG-19] Phase-Locked SNR Band Selection for Weak Mineral Signal Detection in Hyperspectral Imagery
Link: https://arxiv.org/abs/2508.00539
Authors: Judy X Yang
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 6 figures
Abstract:Hyperspectral imaging offers detailed spectral information for mineral mapping; however, weak mineral signatures are often masked by noisy and redundant bands, limiting detection performance. To address this, we propose a two-stage integrated framework for enhanced mineral detection in the Cuprite mining district. In the first stage, we compute the signal-to-noise ratio (SNR) for each spectral band and apply a phase-locked thresholding technique to discard low-SNR bands, effectively removing redundancy and suppressing background noise. Savitzky-Golay filtering is then employed for spectral smoothing, serving a dual role: first, to stabilize trends during band selection, and second, to preserve fine-grained spectral features during preprocessing. In the second stage, the refined HSI data is reintroduced into the model, where KMeans clustering is used to extract 12 endmember spectra (W1 custom), followed by non-negative least squares (NNLS) for abundance unmixing. The resulting endmembers are quantitatively compared with laboratory spectra (W1 raw) using cosine similarity and RMSE metrics. Experimental results confirm that our proposed pipeline improves unmixing accuracy and enhances the detection of weak mineral zones. This two-pass strategy demonstrates a practical and reproducible solution for spectral dimensionality reduction and unmixing in geological HSI applications.
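The first-stage band selection and the comparison metrics can be sketched in a few lines of NumPy. The abstract does not define its SNR estimator or the phase-locked thresholding rule, so this sketch assumes SNR as mean over standard deviation across pixels and substitutes a plain fixed threshold; the synthetic cube and all function names are hypothetical.

```python
import numpy as np


def band_snr(cube: np.ndarray) -> np.ndarray:
    """Per-band SNR, estimated here as mean / standard deviation over pixels."""
    mean = cube.mean(axis=0)
    std = cube.std(axis=0)
    return mean / np.maximum(std, 1e-12)


def select_bands(cube: np.ndarray, threshold: float):
    """Drop bands whose SNR is below a fixed threshold (stand-in for the paper's rule)."""
    keep = band_snr(cube) >= threshold
    return cube[:, keep], keep


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two spectra."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square error between two spectra."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic cube: 200 pixels x 6 bands, unit signal, two deliberately noisy bands.
rng = np.random.default_rng(0)
noise_std = np.array([0.01, 0.01, 0.5, 0.01, 0.8, 0.01])
cube = 1.0 + rng.normal(size=(200, 6)) * noise_std
refined, keep = select_bands(cube, threshold=10.0)
print(keep, refined.shape)
```

The two noisy bands fall well below the threshold and are discarded, while `cosine_similarity` and `rmse` are the kind of metrics the abstract uses to compare extracted endmembers with laboratory spectra.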
[LG-20] Online Nonsubmodular Optimization with Delayed Feedback in the Bandit Setting
Link: https://arxiv.org/abs/2508.00523
Authors: Sifan Yang, Yuanyu Wan, Lijun Zhang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We investigate the online nonsubmodular optimization with delayed feedback in the bandit setting, where the loss function is $\alpha$-weakly DR-submodular and $\beta$-weakly DR-supermodular. Previous work has established an $(\alpha,\beta)$-regret bound of $\mathcal{O}(nd^{1/3}T^{2/3})$, where $n$ is the dimensionality and $d$ is the maximum delay. However, its regret bound relies on the maximum delay and is thus sensitive to irregular delays. Additionally, it couples the effects of delays and bandit feedback as its bound is the product of the delay term and the $\mathcal{O}(nT^{2/3})$ regret bound in the bandit setting without delayed feedback. In this paper, we develop two algorithms to address these limitations, respectively. Firstly, we propose a novel method, namely DBGD-NF, which employs the one-point gradient estimator and utilizes all the available estimated gradients in each round to update the decision. It achieves a better $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ regret bound, which is relevant to the average delay $\bar{d} = \frac{1}{T}\sum_{t=1}^{T} d_t \leq d$. Secondly, we extend DBGD-NF by employing a blocking update mechanism to decouple the joint effect of the delays and bandit feedback, which enjoys an $\mathcal{O}(n(T^{2/3} + \sqrt{dT}))$ regret bound. When $d = \mathcal{O}(T^{1/3})$, our regret bound matches the $\mathcal{O}(nT^{2/3})$ bound in the bandit setting without delayed feedback. Compared to our first $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ bound, it is more advantageous when the maximum delay $d = o(\bar{d}^{2/3}T^{1/3})$. Finally, we conduct experiments on structured sparse learning to demonstrate the superiority of our methods.
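The two regime conditions quoted in the abstract follow from comparing the delay-dependent terms of the bounds directly. The short derivation below is a sketch of that comparison using the abstract's own symbols (n, T, maximum delay d, average delay d-bar); it looks only at the delay-dependent term and is not taken from the paper's proofs.

```latex
\begin{align*}
n\bigl(T^{2/3} + \sqrt{dT}\bigr) = \mathcal{O}\bigl(nT^{2/3}\bigr)
  &\iff \sqrt{dT} = \mathcal{O}\bigl(T^{2/3}\bigr)
   \iff d = \mathcal{O}\bigl(T^{1/3}\bigr), \\
n\sqrt{dT} = o\bigl(n\,\bar{d}^{1/3}T^{2/3}\bigr)
  &\iff \sqrt{d} = o\bigl(\bar{d}^{1/3}T^{1/6}\bigr)
   \iff d = o\bigl(\bar{d}^{2/3}T^{1/3}\bigr).
\end{align*}
```

The first line explains when the blocking variant matches the delay-free bandit rate; the second explains when its delay term is smaller than that of the average-delay bound.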
[LG-21] Text-Attributed Graph Anomaly Detection via Multi-Scale Cross- and Uni-Modal Contrastive Learning ECAI2025
Link: https://arxiv.org/abs/2508.00513
Authors: Yiming Xu, Xu Hua, Zhen Peng, Bin Shi, Jiarun Chen, Xingbo Fu, Song Wang, Bo Dong
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ECAI 2025
Abstract:The widespread application of graph data in various high-risk scenarios has increased attention to graph anomaly detection (GAD). Faced with real-world graphs that often carry node descriptions in the form of raw text sequences, termed text-attributed graphs (TAGs), existing graph anomaly detection pipelines typically involve shallow embedding techniques to encode such textual information into features, and then rely on complex self-supervised tasks within the graph domain to detect anomalies. However, this text encoding process is separated from the anomaly detection training objective in the graph domain, making it difficult to ensure that the extracted textual features focus on GAD-relevant information, seriously constraining the detection capability. How to seamlessly integrate raw text and graph topology to unleash the vast potential of cross-modal data in TAGs for anomaly detection poses a challenging issue. This paper presents a novel end-to-end paradigm for text-attributed graph anomaly detection, named CMUCL. We simultaneously model data from both text and graph structures, and jointly train text and graph encoders by leveraging cross-modal and uni-modal multi-scale consistency to uncover potential anomaly-related information. Accordingly, we design an anomaly score estimator based on inconsistency mining to derive node-specific anomaly scores. Considering the lack of benchmark datasets tailored for anomaly detection on TAGs, we release 8 datasets to facilitate future research. Extensive evaluations show that CMUCL significantly advances text-attributed graph anomaly detection, delivering an 11.13% increase in average accuracy (AP) over the second-best method.
[LG-22] Court of LLMs: Evidence-Augmented Generation via Multi-LLM Collaboration for Text-Attributed Graph Anomaly Detection
Link: https://arxiv.org/abs/2508.00507
Authors: Yiming Xu, Jiarun Chen, Zhen Peng, Zihan Chen, Qika Lin, Lan Ma, Bin Shi, Bo Dong
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ACM Multimedia 2025 (MM '25)
Abstract:The natural combination of intricate topological structures and rich textual information in text-attributed graphs (TAGs) opens up a novel perspective for graph anomaly detection (GAD). However, existing GAD methods primarily focus on designing complex optimization objectives within the graph domain, overlooking the complementary value of the textual modality, whose features are often encoded by shallow embedding techniques, such as bag-of-words or skip-gram, so that semantic context related to anomalies may be missed. To unleash the enormous potential of textual modality, large language models (LLMs) have emerged as promising alternatives due to their strong semantic understanding and reasoning capabilities. Nevertheless, their application to TAG anomaly detection remains nascent, and they struggle to encode high-order structural information inherent in graphs due to input length constraints. For high-quality anomaly detection in TAGs, we propose CoLL, a novel framework that combines LLMs and graph neural networks (GNNs) to leverage their complementary strengths. CoLL employs multi-LLM collaboration for evidence-augmented generation to capture anomaly-relevant contexts while delivering human-readable rationales for detected anomalies. Moreover, CoLL integrates a GNN equipped with a gating mechanism to adaptively fuse textual features with evidence while preserving high-order topological information. Extensive experiments demonstrate the superiority of CoLL, achieving an average improvement of 13.37% in AP. This study opens a new avenue for incorporating LLMs in advancing GAD.
[LG-23] A Conditional GAN for Tabular Data Generation with Probabilistic Sampling of Latent Subspaces
Link: https://arxiv.org/abs/2508.00472
Authors: Leonidas Akritidis, Panayiotis Bozanis
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The tabular form constitutes the standard way of representing data in relational database systems and spreadsheets. But, similarly to other forms, tabular data suffers from class imbalance, a problem that causes serious performance degradation in a wide variety of machine learning tasks. One of the most effective solutions dictates the usage of Generative Adversarial Networks (GANs) in order to synthesize artificial data instances for the under-represented classes. Despite their good performance, none of the proposed GAN models takes into account the vector subspaces of the input samples in the real data space, leading to data generation in arbitrary locations. Moreover, the class labels are treated in the same manner as the other categorical variables during training, so conditional sampling by class is rendered less effective. To overcome these problems, this study presents ctdGAN, a conditional GAN for alleviating class imbalance in tabular datasets. Initially, ctdGAN executes a space partitioning step to assign cluster labels to the input samples. Subsequently, it utilizes these labels to synthesize samples via a novel probabilistic sampling strategy and a new loss function that penalizes both cluster and class mis-predictions. In this way, ctdGAN is trained to generate samples in subspaces that resemble those of the original data distribution. We also introduce several other improvements, including a simple, yet effective cluster-wise scaling technique that captures multiple feature modes without affecting data dimensionality. The exhaustive evaluation of ctdGAN with 14 imbalanced datasets demonstrated its superiority in generating high fidelity samples and improving classification accuracy.
[LG-24] Automated Type Annotation in Python Using Large Language Models
Link: https://arxiv.org/abs/2508.00422
Authors: Varun Bharti, Shashwat Jha, Dhruv Kumar, Pankaj Jalote
Subjects: Programming Languages (cs.PL); Machine Learning (cs.LG)
Comments: Under Review
Abstract:Type annotations in Python enhance maintainability and error detection. However, generating these annotations manually is error-prone and requires extra effort. Traditional automation approaches like static analysis, machine learning, and deep learning struggle with limited type vocabularies, behavioral over-approximation, and reliance on large labeled datasets. In this work, we explore the use of LLMs for generating type annotations in Python. We develop a generate-check-repair pipeline: the LLM proposes annotations guided by a Concrete Syntax Tree representation, a static type checker (Mypy) verifies them, and any errors are fed back for iterative refinement. We evaluate four LLM variants: GPT-4o-mini and GPT-4.1-mini (general-purpose), and o3-mini and o4-mini (reasoning-optimized), on 6000 code snippets from the ManyTypes4Py benchmark. We first measure the proportion of code snippets annotated by LLMs for which Mypy reported no errors (i.e., consistent results): GPT-4o-mini achieved consistency on 65.9% of cases (34.1% inconsistent), while GPT-4.1-mini, o3-mini, and o4-mini each reached approximately 88.6% consistency (around 11.4% failures). To measure annotation quality, we then compute exact-match and base-type match accuracies over all 6000 snippets: GPT-4.1-mini and o3-mini perform the best, achieving up to 70.5% exact match and 79.1% base-type accuracy, requiring under one repair iteration on average. Our results demonstrate that general-purpose and reasoning-optimized LLMs, without any task-specific fine-tuning or additional training, can be effective in generating consistent type annotations. They perform competitively with traditional deep learning techniques, which require large labeled datasets for training. While our work focuses on Python, the pipeline can be extended to other optionally typed imperative languages like Ruby.
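The control flow of a generate-check-repair pipeline can be sketched independently of any particular model or checker. In the minimal sketch below, `propose` and `check` are hypothetical callables standing in for the LLM call and for a Mypy run, and the stub implementations exist only to make the loop executable; none of this is the authors' code.

```python
from typing import Callable, List, Optional, Tuple

Proposer = Callable[[str, List[str]], str]  # (source, checker errors) -> annotated source
Checker = Callable[[str], List[str]]        # annotated source -> list of error messages


def generate_check_repair(source: str, propose: Proposer, check: Checker,
                          max_repairs: int = 3) -> Tuple[Optional[str], int]:
    """Return (annotated source or None, number of repair rounds used)."""
    errors: List[str] = []
    for repairs in range(max_repairs + 1):
        candidate = propose(source, errors)  # the first call sees an empty error list
        errors = check(candidate)
        if not errors:
            return candidate, repairs
    return None, max_repairs


# Stubs standing in for the LLM and for the type checker, for illustration only.
def stub_propose(source: str, errors: List[str]) -> str:
    if errors:  # "repair": use the feedback to correct the return type
        return "def double(x: int) -> int:\n    return x * 2\n"
    return "def double(x: int) -> str:\n    return x * 2\n"


def stub_check(candidate: str) -> List[str]:
    if "-> str" in candidate:
        return ["error: incompatible return value type"]
    return []


annotated, repairs = generate_check_repair(
    "def double(x):\n    return x * 2\n", stub_propose, stub_check
)
print(repairs)
```

Returning the number of repair rounds alongside the result makes it easy to report statistics such as the average repair iterations mentioned in the abstract.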
[LG-25] Loop Invariant Generation: A Hybrid Framework of Reasoning optimised LLMs and SMT Solvers
Link: https://arxiv.org/abs/2508.00419
Authors: Varun Bharti, Shashwat Jha, Dhruv Kumar, Pankaj Jalote
Subjects: Logic in Computer Science (cs.LO); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Under Review
Abstract:Loop invariants are essential for proving the correctness of programs with loops. Developing loop invariants is challenging, and fully automatic synthesis cannot be guaranteed for arbitrary programs. Some approaches have been proposed to synthesize loop invariants using symbolic techniques and more recently using neural approaches. These approaches are able to correctly synthesize loop invariants only for subsets of standard benchmarks. In this work, we investigate whether modern, reasoning-optimized large language models can do better. We integrate OpenAI’s O1, O1-mini, and O3-mini into a tightly coupled generate-and-check pipeline with the Z3 SMT solver, using solver counterexamples to iteratively guide invariant refinement. We use the Code2Inv benchmark, which provides C programs along with their formal preconditions and postconditions. On this benchmark of 133 tasks, our framework achieves 100% coverage (133 out of 133), outperforming the previous best of 107 out of 133, while requiring only 1-2 model proposals per instance and 14-55 seconds of wall-clock time. These results demonstrate that LLMs possess latent logical reasoning capabilities which can help automate loop invariant synthesis. While our experiments target C-specific programs, this approach should be generalizable to other imperative languages.
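The three conditions a candidate loop invariant must satisfy (initiation, consecution, and implying the postcondition on exit) can be illustrated on a toy program. In this sketch a brute-force search over a small bounded state space stands in for the Z3 SMT solver, and a hard-coded candidate list stands in for the model's proposals; all names are hypothetical, and a bounded check is not a proof.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple

State = Dict[str, int]
Pred = Callable[[State], bool]


def find_counterexample(inv: Pred, pre: Pred, guard: Pred,
                        step: Callable[[State], State], post: Pred,
                        states: Iterable[State]) -> Optional[Tuple[str, State]]:
    """Return the first violated condition and its witness state, or None."""
    for s in states:
        if pre(s) and not inv(s):                     # invariant must hold on entry
            return "initiation", s
        if inv(s) and guard(s) and not inv(step(s)):  # and be preserved by the body
            return "consecution", s
        if inv(s) and not guard(s) and not post(s):   # and imply the postcondition on exit
            return "postcondition", s
    return None


# Program: assume n >= 0; x = 0; while x < n: x += 1; assert x == n
pre = lambda s: s["x"] == 0 and s["n"] >= 0
guard = lambda s: s["x"] < s["n"]
step = lambda s: {"x": s["x"] + 1, "n": s["n"]}
post = lambda s: s["x"] == s["n"]
states = [{"x": x, "n": n} for x, n in product(range(-3, 6), repeat=2)]

candidates = [
    ("x >= 0", lambda s: s["x"] >= 0),       # too weak to imply the postcondition
    ("x <= n", lambda s: s["x"] <= s["n"]),  # inductive and sufficient
]
results = {}
for name, inv in candidates:
    results[name] = find_counterexample(inv, pre, guard, step, post, states)
    if results[name] is None:
        break  # accepted; otherwise the counterexample would guide the next proposal
print(results)
```

In a pipeline like the one described, the returned counterexample would be fed back to the model as feedback for its next proposal.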
[LG-26] Transforming Credit Risk Analysis: A Time-Series-Driven ResE-BiLSTM Framework for Post-Loan Default Detection
Link: https://arxiv.org/abs/2508.00415
Authors: Yue Yang, Yuxiang Lin, Ying Zhang, Zihan Su, Chang Chuan Goh, Tangtangfang Fang, Anthony Graham Bellotti, Boon Giin Lee
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Prediction of post-loan default is an important task in credit risk management, and can be addressed by detection of financial anomalies using machine learning. To improve prediction performance, this study introduces a ResE-BiLSTM model that uses a sliding window technique, and evaluates it on 44 independent cohorts from the extensive Freddie Mac US mortgage dataset. The ResE-BiLSTM is compared with five baseline models: Long Short-Term Memory (LSTM), BiLSTM, Gated Recurrent Units (GRU), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), across multiple metrics, including Accuracy, Precision, Recall, F1, and AUC. An ablation study was conducted to evaluate the contribution of individual components in the ResE-BiLSTM architecture. Additionally, SHAP analysis was employed to interpret the underlying features the model relied upon for its predictions. Experimental results demonstrate that ResE-BiLSTM achieves superior predictive performance compared to baseline models, underscoring its practical value and applicability in real-world scenarios.
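A sliding window turns a loan's monthly performance history into fixed-length sequences that a recurrent model can consume. The sketch below shows one common way to do this; the window length, stride, and feature layout are assumptions, since the abstract does not specify them.

```python
import numpy as np


def sliding_windows(series: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Cut a (time, features) array into overlapping (window, features) slices."""
    if window > len(series):
        raise ValueError("window is longer than the series")
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])


# Hypothetical loan history: 6 monthly records with 2 features each.
history = np.arange(12).reshape(6, 2)
windows = sliding_windows(history, window=3)
print(windows.shape)  # (num_windows, window, features)
```

Each slice can then be paired with a label derived from the period that follows it, for example whether the loan defaults shortly after the window ends.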
[LG-27] Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement
Link: https://arxiv.org/abs/2508.00410
Authors: Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), the scaling-up dilemma remains due to the reliance on human-annotated labels, especially for complex tasks. Recent alternatives that explore various self-reward signals show potential for eliciting LLM reasoning, but suffer from a non-negligible collapse issue. Inspired by the success of self-supervised learning, we propose Co-Reward, a novel RL framework that leverages contrastive agreement across semantically analogical questions as a reward basis. Specifically, we construct a similar question for each training sample (without labels) and synthesize their individual surrogate labels through a simple rollout voting, and then the reward is constructed by cross-referring the labels of each question pair to enforce the internal reasoning consistency across analogical inputs. Intuitively, such a self-supervised reward-shaping mechanism makes it harder for learning to collapse into a trivial solution, and promotes stable reasoning elicitation and improvement through expanding the input sample variants. Empirically, Co-Reward achieves superior performance compared to other self-reward baselines on multiple reasoning benchmarks and LLM series, and reaches or even surpasses ground-truth (GT) labeled reward, with improvements of up to +6.8% on MATH500 over GT reward on Llama-3.2-3B-Instruct. Our code is publicly available at this https URL.
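The voting and cross-referencing steps can be made concrete with a small example. The sketch assumes that each rollout has already been reduced to a final answer string and that the reward is a binary exact match against the paired question's voted label; both are simplifications for illustration, not details confirmed by the abstract, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Surrogate label: the most frequent final answer among the rollouts."""
    return Counter(answers).most_common(1)[0][0]


def cross_rewards(original: List[str], rephrased: List[str]) -> Tuple[List[float], List[float]]:
    """Score each rollout against the voted label of the paired question."""
    label_original = majority_vote(original)
    label_rephrased = majority_vote(rephrased)
    rewards_original = [float(a == label_rephrased) for a in original]
    rewards_rephrased = [float(a == label_original) for a in rephrased]
    return rewards_original, rewards_rephrased


# Hypothetical final answers from four rollouts per question.
original = ["12", "12", "15", "12"]
rephrased = ["12", "15", "15", "15"]
r_orig, r_reph = cross_rewards(original, rephrased)
print(r_orig, r_reph)
```

Because a rollout is rewarded for agreeing with the other question's vote rather than with its own, always emitting one's own majority answer is not enough to earn reward unless the two phrasings agree.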
[LG-28] Dual Adaptivity: Universal Algorithms for Minimizing the Adaptive Regret of Convex Functions
Link: https://arxiv.org/abs/2508.00392
Authors: Lijun Zhang, Wenhao Yang, Guanghui Wang, Wei Jiang, Zhi-Hua Zhou
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:1906.10851
Abstract:To deal with changing environments, a new performance measure – adaptive regret, defined as the maximum static regret over any interval, was proposed in online learning. Under the setting of online convex optimization, several algorithms have been successfully developed to minimize the adaptive regret. However, existing algorithms lack universality in the sense that they can only handle one type of convex function and need a priori knowledge of parameters, which hinders their application in real-world scenarios. To address this limitation, this paper investigates universal algorithms with dual adaptivity, which automatically adapt to the property of functions (convex, exponentially concave, or strongly convex), as well as the nature of environments (stationary or changing). Specifically, we propose a meta-expert framework for dual adaptive algorithms, where multiple experts are created dynamically and aggregated by a meta-algorithm. The meta-algorithm is required to yield a second-order bound, which can accommodate unknown function types. We further incorporate the technique of sleeping experts to capture the changing environments. For the construction of experts, we introduce two strategies (increasing the number of experts or enhancing the capabilities of experts) to achieve universality. Theoretical analysis shows that our algorithms are able to minimize the adaptive regret for multiple types of convex functions simultaneously, and also allow the type of functions to switch between rounds. Moreover, we extend our meta-expert framework to online composite optimization, and develop a universal algorithm for minimizing the adaptive regret of composite functions.
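The verbal definition of adaptive regret in the abstract (the maximum static regret over any interval) can be written out as a formula. The notation below is an assumption chosen for illustration: f_t is the loss in round t, x_t the learner's decision, and X the feasible domain.

```latex
\[
\text{A-Regret}(T) \;=\; \max_{1 \le r \le s \le T}
\left( \sum_{t=r}^{s} f_t(\mathbf{x}_t) \;-\; \min_{\mathbf{x} \in \mathcal{X}} \sum_{t=r}^{s} f_t(\mathbf{x}) \right)
\]
```

Taking r = 1 and s = T recovers the usual static regret, so a small adaptive regret implies good performance on every contiguous stretch of rounds and not just on the whole horizon.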
[LG-29] Preliminary Investigation into Uncertainty-Aware Attack Stage Classification ECAI2025
Link: https://arxiv.org/abs/2508.00368
Authors: Alessandro Gaudenzi, Lorenzo Nodari, Lance Kaplan, Alessandra Russo, Murat Sensoy, Federico Cerutti
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Proceedings for SPAIML2025 workshop, 26/10/2025 Bologna Italy, co-located with ECAI2025
Abstract:Advanced Persistent Threats (APTs) represent a significant challenge in cybersecurity due to their prolonged, multi-stage nature and the sophistication of their operators. Traditional detection systems typically focus on identifying malicious activity in binary terms (benign or malicious) without accounting for the progression of an attack. However, effective response strategies depend on accurate inference of the attack’s current stage, as countermeasures must be tailored to whether an adversary is in the early reconnaissance phase or actively conducting exploitation or exfiltration. This work addresses the problem of attack stage inference under uncertainty, with a focus on robustness to out-of-distribution (OOD) inputs. We propose a classification approach based on Evidential Deep Learning (EDL), which models predictive uncertainty by outputting parameters of a Dirichlet distribution over possible stages. This allows the system not only to predict the most likely stage of an attack but also to indicate when it is uncertain or the input lies outside the training distribution. Preliminary experiments in a simulated environment demonstrate that the proposed model can accurately infer the stage of an attack with calibrated confidence while effectively detecting OOD inputs, which may indicate changes in the attackers’ tactics. These results support the feasibility of deploying uncertainty-aware models for staged threat detection in dynamic and adversarial environments.
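How Dirichlet parameters yield both a stage prediction and an uncertainty signal can be shown in a few lines. The sketch assumes the common subjective-logic parameterization of Evidential Deep Learning (alpha = evidence + 1, vacuity = K / sum of alpha); the stage names come from the abstract's examples, while the evidence values are invented and the paper's network and loss are not reproduced here.

```python
import numpy as np

STAGES = ["reconnaissance", "exploitation", "exfiltration"]


def dirichlet_prediction(evidence):
    """Map non-negative per-class evidence to expected probabilities and vacuity."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0            # Dirichlet concentration parameters
    strength = alpha.sum()            # total Dirichlet strength S
    probs = alpha / strength          # expected class probabilities
    vacuity = len(alpha) / strength   # uncertainty mass u = K / S, in (0, 1]
    return probs, vacuity


# A familiar-looking input yields strong evidence for one stage ...
probs_known, u_known = dirichlet_prediction([0.2, 48.0, 0.8])
# ... while an unfamiliar input yields almost no evidence for any stage.
probs_odd, u_odd = dirichlet_prediction([0.0, 0.0, 0.0])

print(STAGES[int(np.argmax(probs_known))], round(u_known, 3), round(u_odd, 3))
```

A deployment could flag inputs whose vacuity exceeds a chosen threshold as possibly out-of-distribution instead of forcing a stage label.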
[LG-30] OID-PPO: Optimal Interior Design using Proximal Policy Optimization by Transforming Design Guidelines into Reward Functions
Link: https://arxiv.org/abs/2508.00364
Authors: Chanyoung Yoon, Sangbong Yoo, Soobin Yim, Chansoo Kim, Yun Jang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Designing residential interiors strongly impacts occupant satisfaction but remains challenging due to unstructured spatial layouts, high computational demands, and reliance on expert knowledge. Existing methods based on optimization or deep learning are either computationally expensive or constrained by data scarcity. Reinforcement learning (RL) approaches often limit furniture placement to discrete positions and fail to incorporate design principles adequately. We propose OID-PPO, a novel RL framework for Optimal Interior Design using Proximal Policy Optimization, which integrates expert-defined functional and visual guidelines into a structured reward function. OID-PPO utilizes a diagonal Gaussian policy for continuous and flexible furniture placement, effectively exploring latent environmental dynamics under partial observability. Experiments conducted across diverse room shapes and furniture configurations demonstrate that OID-PPO significantly outperforms state-of-the-art methods in terms of layout quality and computational efficiency. Ablation studies further demonstrate the impact of structured guideline integration and reveal the distinct contributions of individual design constraints.
[LG-31] Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization
Link: https://arxiv.org/abs/2508.00357
Authors: Yoonhyuk Choi, Jiho Choi, Chong-Kwon Kim
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Over-smoothing in Graph Neural Networks (GNNs) causes collapse in distinct node features, particularly on heterophilic graphs where adjacent nodes often have dissimilar labels. Although sheaf neural networks partially mitigate this problem, they typically rely on static or heavily parameterized sheaf structures that hinder generalization and scalability. Existing sheaf-based models either predefine restriction maps or introduce excessive complexity, yet fail to provide rigorous stability guarantees. In this paper, we introduce a novel scheme called SGPC (Sheaf GNNs with PAC-Bayes Calibration), a unified architecture that combines cellular-sheaf message passing with several mechanisms, including optimal transport-based lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for robust semi-supervised node classification. We establish performance bounds theoretically and demonstrate that the resulting bound-aware objective can be achieved via end-to-end training in linear computational complexity. Experiments on nine homophilic and heterophilic benchmarks show that SGPC outperforms state-of-the-art spectral and sheaf-based GNNs while providing certified confidence intervals on unseen nodes.
[LG-32] BOOD: Boundary-based Out-Of-Distribution Data Generation ICML
Link: https://arxiv.org/abs/2508.00350
Authors: Qilin Liao, Shuo Yang, Bo Zhao, Ping Luo, Hengshuang Zhao
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 8 figures, To be published in the Proceedings of the International Conference on Machine Learning (ICML) 2025
Abstract:Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more training-efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64-percentage-point decrease in average FPR95 (40.31% vs. 10.67%) and a 7.27-percentage-point improvement in average AUROC (90.15% vs. 97.42%) on the CIFAR-100 dataset.
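For readers unfamiliar with the FPR95 metric quoted above, the sketch below shows one standard way to compute it: fix the score threshold at which 95% of in-distribution samples are accepted, then measure how many OOD samples pass it. The score convention (higher means more in-distribution), the function name, and the toy scores are assumptions, not details from the paper.

```python
import numpy as np


def fpr_at_95_tpr(id_scores, ood_scores) -> float:
    """Fraction of OOD samples accepted at the threshold that keeps 95% of ID samples."""
    threshold = np.percentile(id_scores, 5)  # 95% of ID scores lie at or above this
    return float(np.mean(np.asarray(ood_scores) >= threshold))


# Toy scores: ID samples score high, most OOD samples score low.
id_scores = np.linspace(0.5, 1.0, 101)
ood_scores = np.array([0.1, 0.2, 0.3, 0.6])
print(fpr_at_95_tpr(id_scores, ood_scores))

# The reported gain is a difference of two rates, in percentage points.
fpr_gain_points = 40.31 - 10.67
```

Lower FPR95 is better, which is why the abstract reports its improvement as a decrease.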
[LG-33] Embryology of a Language Model
Link: https://arxiv.org/abs/2508.00331
Authors: George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Understanding how language models develop their internal computational structure is a central problem in the science of deep learning. While susceptibilities, drawn from statistical physics, offer a promising analytical tool, their full potential for visualizing network organization remains untapped. In this work, we introduce an embryological approach, applying UMAP to the susceptibility matrix to visualize the model’s structural development over training. Our visualizations reveal the emergence of a clear "body plan," charting the formation of known features like the induction circuit and discovering previously unknown structures, such as a "spacing fin" dedicated to counting space tokens. This work demonstrates that susceptibility analysis can move beyond validation to uncover novel mechanisms, providing a powerful, holistic lens for studying the developmental principles of complex neural networks.
[LG-34] PnP-DA: Towards Principled Plug-and-Play Integration of Variational Data Assimilation and Generative Models
Link: https://arxiv.org/abs/2508.00325
Authors: Yongquan Qu, Matthieu Blanke, Sara Shamekh, Pierre Gentine
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:Earth system modeling presents a fundamental challenge in scientific computing: capturing complex, multiscale nonlinear dynamics in computationally efficient models while minimizing forecast errors caused by necessary simplifications. Even the most powerful AI- or physics-based forecast systems suffer from gradual error accumulation. Data assimilation (DA) aims to mitigate these errors by optimally blending (noisy) observations with prior model forecasts, but conventional variational methods often assume Gaussian error statistics that fail to capture the true, non-Gaussian behavior of chaotic dynamical systems. We propose PnP-DA, a Plug-and-Play algorithm that alternates (1) a lightweight, gradient-based analysis update (using a Mahalanobis-distance misfit on new observations) with (2) a single forward pass through a pretrained generative prior conditioned on the background forecast via a conditional Wasserstein coupling. This strategy relaxes restrictive statistical assumptions and leverages rich historical data without requiring an explicit regularization functional, and it also avoids the need to backpropagate gradients through the complex neural network that encodes the prior during assimilation cycles. Experiments on standard chaotic testbeds demonstrate that this strategy consistently reduces forecast errors across a range of observation sparsities and noise levels, outperforming classical variational methods.
[LG-35] Invariant Graph Transformer for Out-of-Distribution Generalization
Link: https://arxiv.org/abs/2508.00304
Authors: Tianyin Liao, Ziwei Zhang, Yufei Sun, Chunyu Hu, Jianxin Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph Transformers (GTs) have demonstrated great effectiveness across various graph analytical tasks. However, existing GTs focus on training and testing graph data originating from the same distribution, and fail to generalize under distribution shifts. Graph invariant learning, aiming to capture generalizable graph structural patterns with labels under distribution shifts, is potentially a promising solution, but how to design attention mechanisms and positional and structural encodings (PSEs) based on graph invariant learning principles remains challenging. To solve these challenges, we introduce Graph Out-Of-Distribution generalized Transformer (GOODFormer), aiming to learn generalized graph representations by capturing invariant relationships between predictive graph structures and labels through jointly optimizing three modules. Specifically, we first develop a GT-based entropy-guided invariant subgraph disentangler to separate invariant and variant subgraphs while preserving the sharpness of the attention function. Next, we design an evolving subgraph positional and structural encoder to effectively and efficiently capture the encoding information of dynamically changing subgraphs during training. Finally, we propose an invariant learning module utilizing subgraph node representations and encodings to derive generalizable graph representations that can generalize to unseen graphs. We also provide theoretical justifications for our method. Extensive experiments on benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts.
[LG-36] Toward using explainable data-driven surrogate models for treating performance-based seismic design as an inverse engineering problem
Link: https://arxiv.org/abs/2508.00286
Authors: Mohsen Zaker Esteghamati
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments:
Abstract:This study presents a methodology to treat performance-based seismic design as an inverse engineering problem, where design parameters are directly derived to achieve specific performance objectives. By implementing explainable machine learning models, this methodology directly maps design variables and performance metrics, tackling computational inefficiencies of performance-based design. The resultant machine learning model is integrated as an evaluation function into a genetic optimization algorithm to solve the inverse problem. The developed methodology is then applied to two different inventories of steel and concrete moment frames in Los Angeles and Charleston to obtain sectional properties of frame members that minimize expected annualized seismic loss in terms of repair costs. The results show high accuracy of the surrogate models (e.g., R2 90%) across a diverse set of building types, geometries, seismic design, and site hazard, where the optimization algorithm could identify the optimum values of members’ properties for a fixed set of geometric variables, consistent with engineering principles.
[LG-37] Learning to Optimize Feedback for One Million Students: Insights from Multi-Armed and Contextual Bandits in Large-Scale Online Tutoring
Link: https://arxiv.org/abs/2508.00270
Authors: Robin Schmucker,Nimish Pachapurkar,Shanmuga Bala,Miral Shah,Tom Mitchell
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:We present an online tutoring system that learns to provide effective feedback to students after they answer questions incorrectly. Using data from one million students, the system learns which assistance action (e.g., one of multiple hints) to provide for each question to optimize student learning. Employing the multi-armed bandit (MAB) framework and offline policy evaluation, we assess 43,000 assistance actions, and identify trade-offs between assistance policies optimized for different student outcomes (e.g., response correctness, session completion). We design an algorithm that for each question decides on a suitable policy training objective to enhance students’ immediate second attempt success and overall practice session performance. We evaluate the resulting MAB policies in 166,000 practice sessions, verifying significant improvements in student outcomes. While MAB policies optimize feedback for the overall student population, we further investigate whether contextual bandit (CB) policies can enhance outcomes by personalizing feedback based on individual student features (e.g., ability estimates, response times). Using causal inference, we examine (i) how effects of assistance actions vary across students and (ii) whether CB policies, which leverage such effect heterogeneity, outperform MAB policies. While our analysis reveals that some actions for some questions exhibit effect heterogeneity, effect sizes may often be too small for CB policies to provide significant improvements beyond what well-optimized MAB policies that deliver the same action to all students already achieve. We discuss insights gained from deploying data-driven systems at scale and implications for future refinements. Today, the teaching policies optimized by our system support thousands of students daily.
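As a rough illustration of the multi-armed bandit framing described above, the sketch below uses Thompson sampling with a Beta-Bernoulli model to choose among a few assistance actions for a single question. The abstract does not say which bandit algorithm the authors use, so the algorithm choice, the class name `BetaBernoulliBandit`, the `simulate` helper, and the success rates are all assumptions made for illustration.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling over a fixed set of assistance actions (hypothetical)."""

    def __init__(self, n_actions, seed=0):
        # Beta(1, 1) prior on each action's success probability.
        self.successes = [1] * n_actions
        self.failures = [1] * n_actions
        self.rng = random.Random(seed)

    def select(self):
        # Draw one plausible success rate per action and pick the largest draw.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action, reward):
        # reward is True if the student's second attempt was correct.
        if reward:
            self.successes[action] += 1
        else:
            self.failures[action] += 1


def simulate(true_rates, n_rounds=5000, seed=0):
    """Run the bandit against made-up success rates; return pulls per action."""
    bandit = BetaBernoulliBandit(len(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        action = bandit.select()
        reward = env.random() < true_rates[action]
        bandit.update(action, reward)
        pulls[action] += 1
    return pulls


pulls = simulate([0.2, 0.5, 0.8])
```

In this toy setting the bandit concentrates its pulls on the action with the highest simulated success rate, which mirrors the idea of delivering the same best action to all students for a given question.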
[LG-38] RecoMind: A Reinforcement Learning Framework for Optimizing In-Session User Satisfaction in Recommendation Systems
Link: https://arxiv.org/abs/2508.00201
Authors: Mehdi Ben Ayed,Fei Feng,Jay Adams,Vishwakarma Singh,Kritarth Anand,Jiajing Xu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Existing web-scale recommendation systems commonly use supervised learning methods that prioritize immediate user feedback. Although reinforcement learning (RL) offers a solution to optimize longer-term goals, such as in-session engagement, applying it at web scale is challenging due to the extremely large action space and engineering complexity. In this paper, we introduce RecoMind, a simulator-based RL framework designed for the effective optimization of session-based goals at web-scale. RecoMind leverages existing recommendation models to establish a simulation environment and to bootstrap the RL policy to optimize immediate user interactions from the outset. This method integrates well with existing industry pipelines, simplifying the training and deployment of RL policies. Additionally, RecoMind introduces a custom exploration strategy to efficiently explore web-scale action spaces with hundreds of millions of items. We evaluated RecoMind through extensive offline simulations and online A/B testing on a video streaming platform. Both methods showed that the RL policy trained using RecoMind significantly outperforms traditional supervised learning recommendation approaches in in-session user satisfaction. In online A/B tests, the RL policy increased videos watched for more than 10 seconds by 15.81% and improved session depth by 4.71% for sessions with at least 10 interactions. As a result, RecoMind presents a systematic and scalable approach for embedding RL into web-scale recommendation systems, showing great promise for optimizing session-based user satisfaction.
[LG-39] RL as Regressor: A Reinforcement Learning Approach for Function Approximation
Link: https://arxiv.org/abs/2508.00174
Authors: Yongchao Huang
Subjects: Machine Learning (cs.LG)
*Comments: 7 pages
Abstract:Standard regression techniques, while powerful, are often constrained by predefined, differentiable loss functions such as mean squared error. These functions may not fully capture the desired behavior of a system, especially when dealing with asymmetric costs or complex, non-differentiable objectives. In this paper, we explore an alternative paradigm: framing regression as a Reinforcement Learning (RL) problem. By treating a model’s prediction as an action and defining a custom reward signal based on the prediction error, we can leverage powerful RL algorithms to perform function approximation. Through a progressive case study of learning a noisy sine wave, we illustrate the development of an Actor-Critic agent, iteratively enhancing it with Prioritized Experience Replay, increased network capacity, and positional encoding to enable a capable RL agent for this regression task. Our results show that the RL framework not only successfully solves the regression problem but also offers enhanced flexibility in defining objectives and guiding the learning process.
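To make the idea of a custom reward based on prediction error concrete, here is a minimal sketch of an asymmetric reward in which under-prediction is penalized more heavily than over-prediction. The function name `asymmetric_reward` and the penalty weights are hypothetical and are not taken from the paper.

```python
def asymmetric_reward(prediction, target, under_penalty=3.0, over_penalty=1.0):
    """Reward for a prediction treated as an action (hypothetical weights).

    Under-predicting the target costs under_penalty per unit of error,
    over-predicting costs over_penalty per unit of error.
    """
    error = prediction - target
    if error < 0:
        return -under_penalty * abs(error)
    return -over_penalty * error


under = asymmetric_reward(1.0, 2.0)  # prediction is one unit too low
over = asymmetric_reward(3.0, 2.0)   # prediction is one unit too high
```

A reward of this shape is piecewise linear and not differentiable at zero error, which is the kind of objective the abstract argues an RL formulation can handle more flexibly than a fixed loss.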
[LG-40] DiSC-Med: Diffusion-based Semantic Communications for Robust Medical Image Transmission
Link: https://arxiv.org/abs/2508.00172
Authors: Fupei Guo,Hao Zheng,Xiang Zhang,Li Chen,Yue Wang,Songyang Zhang
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments: To appear in 2025 IEEE Global Communications Conference (Globecom)
Abstract:The rapid development of artificial intelligence has driven smart health with next-generation wireless communication technologies, stimulating exciting applications in remote diagnosis and intervention. To enable a timely and effective response for remote healthcare, efficient transmission of medical data through noisy channels with limited bandwidth emerges as a critical challenge. In this work, we propose a novel diffusion-based semantic communication framework, namely DiSC-Med, for the medical image transmission, where medical-enhanced compression and denoising blocks are developed for bandwidth efficiency and robustness, respectively. Unlike conventional pixel-wise communication framework, our proposed DiSC-Med is able to capture the key semantic information and achieve superior reconstruction performance with ultra-high bandwidth efficiency against noisy channels. Extensive experiments on real-world medical datasets validate the effectiveness of our framework, demonstrating its potential for robust and efficient telehealth applications.
[LG-41] Data-Driven Motion Planning for Uncertain Nonlinear Systems
Link: https://arxiv.org/abs/2508.00154
Authors: Babak Esmaeili,Hamidreza Modares,Stefano Di Cairano
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
*Comments:
Abstract:This paper proposes a data-driven motion-planning framework for nonlinear systems that constructs a sequence of overlapping invariant polytopes. Around each randomly sampled waypoint, the algorithm identifies a convex admissible region and solves data-driven linear-matrix-inequality problems to learn several ellipsoidal invariant sets together with their local state-feedback gains. The convex hull of these ellipsoids, still invariant under a piece-wise-affine controller obtained by interpolating the gains, is then approximated by a polytope. Safe transitions between nodes are ensured by verifying the intersection of consecutive convex-hull polytopes and introducing an intermediate node for a smooth transition. Control gains are interpolated in real time via simplex-based interpolation, keeping the state inside the invariant polytopes throughout the motion. Unlike traditional approaches that rely on system dynamics models, our method requires only data to compute safe regions and design state-feedback controllers. The approach is validated through simulations, demonstrating the effectiveness of the proposed method in achieving safe, dynamically feasible paths for complex nonlinear systems.
[LG-42] ECG Latent Feature Extraction with Autoencoders for Downstream Prediction Tasks
Link: https://arxiv.org/abs/2508.00131
Authors: Christopher Harvey,Sumaiya Shomaji,Zijun Yao,Amit Noheria
Subjects: Machine Learning (cs.LG)
*Comments: arXiv admin note: substantial text overlap with arXiv:2410.02937
Abstract:The electrocardiogram (ECG) is an inexpensive and widely available tool for cardiac assessment. Despite its standardized format and small file size, the high complexity and inter-individual variability of ECG signals (typically a 60,000-size vector with 12 leads at 500 Hz) make it challenging to use in deep learning models, especially when only small training datasets are available. This study addresses these challenges by exploring feature generation methods from representative beat ECGs, focusing on Principal Component Analysis (PCA) and Autoencoders to reduce data complexity. We introduce three novel Variational Autoencoder (VAE) variants, namely Stochastic Autoencoder (SAE), Annealed β-VAE (A β-VAE), and Cyclical β-VAE (C β-VAE), and compare their effectiveness in maintaining signal fidelity and enhancing downstream prediction tasks using a Light Gradient Boost Machine (LGBM). The A β-VAE achieved superior signal reconstruction, reducing the mean absolute error (MAE) to 15.7 ± 3.2 μV, which is at the level of signal noise. Moreover, the SAE encodings, when combined with traditional ECG summary features, improved the prediction of reduced Left Ventricular Ejection Fraction (LVEF), achieving a holdout test set area under the receiver operating characteristic curve (AUROC) of 0.901 with an LGBM classifier. This performance nearly matches the 0.909 AUROC of a state-of-the-art CNN model but requires significantly less computational resources. Further, the ECG feature extraction-LGBM pipeline avoids overfitting and retains predictive performance when trained with less data. Our findings demonstrate that these VAE encodings are not only effective in simplifying ECG data but also provide a practical solution for applying deep learning in contexts with limited-scale labeled training data.
[LG-43] Structured Transformations for Stable and Interpretable Neural Computation
Link: https://arxiv.org/abs/2508.00127
Authors: Saleh Nikooroo,Thomas Engel
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Despite their impressive performance, contemporary neural networks often lack structural safeguards that promote stable learning and interpretable behavior. In this work, we introduce a reformulation of layer-level transformations that departs from the standard unconstrained affine paradigm. Each transformation is decomposed into a structured linear operator and a residual corrective component, enabling more disciplined signal propagation and improved training dynamics. Our formulation encourages internal consistency and supports stable information flow across depth, while remaining fully compatible with standard learning objectives and backpropagation. Through a series of synthetic and real-world experiments, we demonstrate that models constructed with these structured transformations exhibit improved gradient conditioning, reduced sensitivity to perturbations, and layer-wise robustness. We further show that these benefits persist across architectural scales and training regimes. This study serves as a foundation for a more principled class of neural architectures that prioritize stability and transparency, offering new tools for reasoning about learning behavior without sacrificing expressive power.
[LG-44] Leveraging Operator Learning to Accelerate Convergence of the Preconditioned Conjugate Gradient Method
Link: https://arxiv.org/abs/2508.00101
Authors: Alena Kopaničáková,Youngkyu Lee,George Em Karniadakis
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 31 pages
Abstract:We propose a new deflation strategy to accelerate the convergence of the preconditioned conjugate gradient (PCG) method for solving parametric large-scale linear systems of equations. Unlike traditional deflation techniques that rely on eigenvector approximations or recycled Krylov subspaces, we generate the deflation subspaces using operator learning, specifically the Deep Operator Network (DeepONet). To this aim, we introduce two complementary approaches for assembling the deflation operators. The first approach approximates near-null space vectors of the discrete PDE operator using the basis functions learned by the DeepONet. The second approach directly leverages solutions predicted by the DeepONet. To further enhance convergence, we also propose several strategies for prescribing the sparsity pattern of the deflation operator. A comprehensive set of numerical experiments encompassing steady-state, time-dependent, scalar, and vector-valued problems posed on both structured and unstructured geometries is presented and demonstrates the effectiveness of the proposed DeepONet-based deflated PCG method, as well as its generalization across a wide range of model parameters and problem resolutions.
[LG-45] Improved Robustness and Functional Localization in Topographic CNNs Through Weight Similarity
Link: https://arxiv.org/abs/2508.00043
Authors: Nhut Truong,Uri Hasson
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Topographic neural networks are computational models that can simulate the spatial and functional organization of the brain. Topographic constraints in neural networks can be implemented in multiple ways, with potentially different impacts on the representations learned by the network. The impact of such different implementations has not been systematically examined. To this end, here we compare topographic convolutional neural networks trained with two spatial constraints: Weight Similarity (WS), which pushes neighboring units to develop similar incoming weights, and Activation Similarity (AS), which enforces similarity in unit activations. We evaluate the resulting models on classification accuracy, robustness to weight perturbations and input degradation, and the spatial organization of learned representations. Compared to both AS and standard CNNs, WS provided three main advantages: i) improved robustness to noise, also showing higher accuracy under weight corruption; ii) greater input sensitivity, reflected in higher activation variance; and iii) stronger functional localization, with units showing similar activations positioned at closer distances. In addition, WS produced differences in orientation tuning, symmetry sensitivity, and eccentricity profiles of units, indicating an influence of this spatial constraint on the representational geometry of the network. Our findings suggest that during end-to-end training, WS constraints produce more robust representations than AS or non-topographic CNNs. These findings also suggest that weight-based spatial constraints can shape feature learning and functional organization in biophysically inspired models.
[LG-46] Towards Reliable AI in 6G: Detecting Concept Drift in Wireless Network
Link: https://arxiv.org/abs/2508.00042
Authors: Athanasios Tziouvaras,Carolina Fortuna,George Floros,Kostas Kolomvatsos,Panagiotis Sarigiannidis,Marko Grobelnik,Blaž Bertalanič
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*Comments: 10 pages, 12 figures
Abstract:AI-native 6G networks promise unprecedented automation and performance by embedding machine-learning models throughout the radio access and core segments of the network. However, the non-stationary nature of wireless environments, due to infrastructure changes, user mobility, and emerging traffic patterns, induces concept drifts that can quickly degrade the accuracy of these models. Existing methods are generally very domain-specific or struggle with certain types of concept drift. In this paper, we introduce two unsupervised, model-agnostic, batch concept drift detectors. Both methods compute an expected-utility score to decide when concept drift has occurred and whether model retraining is warranted, without requiring ground-truth labels after deployment. We validate our framework on two real-world wireless use cases, outdoor fingerprinting for localization and link-anomaly detection, and demonstrate that both methods outperform classical detectors such as ADWIN, DDM, and CUSUM by 20-40 percentage points. Additionally, they achieve F1-scores of 0.94 and 1.00 in correctly triggering a retraining alarm, thus reducing the false alarm rate by up to 20 percentage points compared to the best classical detectors.
[LG-47] Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting
Link: https://arxiv.org/abs/2508.00040
Authors: Abhinav Das,Stephan Schlüter
Subjects: Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
*Comments:
Abstract:This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology integrates regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to daily electricity prices. Each identified regime is subsequently modeled by an independent conditional neural process (CNP), trained to learn localized mappings from input contexts to 24-dimensional hourly price trajectories, with final predictions computed as regime-weighted mixtures of these CNP outputs. We rigorously evaluate R-NP against deep neural networks (DNN) and Lasso estimated auto-regressive (LEAR) models by integrating their forecasts into diverse battery storage optimization frameworks, including price arbitrage, risk management, grid services, and cost minimization. This operational utility assessment revealed complex performance trade-offs: LEAR often yielded superior absolute profits or lower costs, while DNN showed exceptional optimality in specific cost-minimization contexts. Recognizing that raw prediction accuracy doesn’t always translate to optimal operational outcomes, we employed TOPSIS as a comprehensive multi-criteria evaluation layer. Our TOPSIS analysis identified LEAR as the top-ranked model for 2021, but crucially, our proposed R-NP model emerged as the most balanced and preferred solution for 2021, 2022 and 2023.
[LG-48] Efficient Solving of Large Single Input Superstate Decomposable Markovian Decision Process
Link: https://arxiv.org/abs/2508.00816
Authors: Youssef Ait El Mahjoub,Jean-Michel Fourneau,Salma Alouah
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Performance (cs.PF)
*Comments: Preprint article submitted to ValueTools2025
Abstract:Solving Markov Decision Processes (MDPs) remains a central challenge in sequential decision-making, especially when dealing with large state spaces and long-term optimization criteria. A key step in Bellman dynamic programming algorithms is the policy evaluation, which becomes computationally demanding in infinite-horizon settings such as average-reward or discounted-reward formulations. In the context of Markov chains, aggregation and disaggregation techniques have for a long time been used to reduce complexity by exploiting structural decompositions. In this work, we extend these principles to a structured class of MDPs. We define the Single-Input Superstate Decomposable Markov Decision Process (SISDMDP), which combines Chiu’s single-input decomposition with Robertazzi’s single-cycle recurrence property. When a policy induces this structure, the resulting transition graph can be decomposed into interacting components with centralized recurrence. We develop an exact and efficient policy evaluation method based on this structure. This yields a scalable solution applicable to both average and discounted reward MDPs.
[LG-49] Constructive Disintegration and Conditional Modes
Link: https://arxiv.org/abs/2508.00617
Authors: Nathaël Da Costa,Marvin Pförtner,Jon Cockayne
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*Comments:
Abstract:Conditioning, the central operation in Bayesian statistics, is formalised by the notion of disintegration of measures. However, due to the implicit nature of their definition, constructing disintegrations is often difficult. A folklore result in machine learning conflates the construction of a disintegration with the restriction of probability density functions onto the subset of events that are consistent with a given observation. We provide a comprehensive set of mathematical tools which can be used to construct disintegrations and apply these to find densities of disintegrations on differentiable manifolds. Using our results, we provide a disturbingly simple example in which the restricted density and the disintegration density drastically disagree. Motivated by applications in approximate Bayesian inference and Bayesian inverse problems, we further study the modes of disintegrations. We show that the recently introduced notion of a “conditional mode” does not coincide in general with the modes of the conditional measure obtained through disintegration, but rather the modes of the restricted measure. We also discuss the implications of the discrepancy between the two measures in practice, advocating for the utility of both approaches depending on the modelling context.
[LG-50] Neighbor-Sampling Based Momentum Stochastic Methods for Training Graph Neural Networks
Link: https://arxiv.org/abs/2508.00267
Authors: Molly Noel,Gabriel Mancino-Ball,Yangyang Xu
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 32 pages
Abstract:Graph convolutional networks (GCNs) are a powerful tool for graph representation learning. Due to the recursive neighborhood aggregations employed by GCNs, efficient training methods suffer from a lack of theoretical guarantees or are missing important practical elements from modern deep learning algorithms, such as adaptivity and momentum. In this paper, we present several neighbor-sampling (NS) based Adam-type stochastic methods for solving a nonconvex GCN training problem. We utilize the control variate technique proposed by [1] to reduce the stochastic error caused by neighbor sampling. Under standard assumptions for Adam-type methods, we show that our methods enjoy the optimal convergence rate. In addition, we conduct extensive numerical experiments on node classification tasks with several benchmark datasets. The results demonstrate superior performance of our methods over classic NS-based SGD that also uses the control-variate technique, especially for large-scale graph datasets. Our code is available at this https URL .
[LG-51] Sinusoidal Approximation Theorem for Kolmogorov-Arnold Networks
Link: https://arxiv.org/abs/2508.00247
Authors: Sergei Gleyzer,Hanh Nguyen,Dinesh P. Ramakrishnan,Eric A. F. Reinhardt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments: 15 pages, 3 figures
Abstract:The Kolmogorov-Arnold representation theorem states that any continuous multivariable function can be exactly represented as a finite superposition of continuous single variable functions. Subsequent simplifications of this representation involve expressing these functions as parameterized sums of a smaller number of unique monotonic functions. These developments led to the proof of the universal approximation capabilities of multilayer perceptron networks with sigmoidal activations, forming the alternative theoretical direction of most modern neural networks. Kolmogorov-Arnold Networks (KANs) have been recently proposed as an alternative to multilayer perceptrons. KANs feature learnable nonlinear activations applied directly to input values, modeled as weighted sums of basis spline functions. This approach replaces the linear transformations and sigmoidal post-activations used in traditional perceptrons. Subsequent works have explored alternatives to spline-based activations. In this work, we propose a novel KAN variant by replacing both the inner and outer functions in the Kolmogorov-Arnold representation with weighted sinusoidal functions of learnable frequencies. Inspired by simplifications introduced by Lorentz and Sprecher, we fix the phases of the sinusoidal activations to linearly spaced constant values and provide a proof of its theoretical validity. We also conduct numerical experiments to evaluate its performance on a range of multivariable functions, comparing it with fixed-frequency Fourier transform methods and multilayer perceptrons (MLPs). We show that it outperforms the fixed-frequency Fourier transform and achieves comparable performance to MLPs. 
[LG-52] funOCLUST: Clustering Functional Data with Outliers
Link: https://arxiv.org/abs/2508.00110
Authors: Katharine M. Clark,Paul D. McNicholas
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Functional data present unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers. An extension of the OCLUST algorithm to the functional setting is proposed to address these issues. The approach leverages the OCLUST framework, creating a robust method to cluster curves and trim outliers. The methodology is evaluated on both simulated and real-world functional datasets, demonstrating strong performance in clustering and outlier identification.
[LG-53] Riemannian Optimization for Distance Geometry: A Study of Convergence Robustness and Incoherence
Link: https://arxiv.org/abs/2508.00091
Authors: Chandler Smith,HanQin Cai,Abiy Tasissa
Subjects: Optimization and Control (math.OC); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*Comments: 54 pages, 6 figures
Abstract:The problem of recovering a configuration of points from partial pairwise distances, referred to as the Euclidean Distance Geometry (EDG) problem, arises in a broad range of applications, including sensor network localization, molecular conformation, and manifold learning. In this paper, we propose a Riemannian optimization framework for solving the EDG problem by formulating it as a low-rank matrix completion task over the space of positive semi-definite Gram matrices. The available distance measurements are encoded as expansion coefficients in a non-orthogonal basis, and optimization over the Gram matrix implicitly enforces geometric consistency through the triangle inequality, a structure inherited from classical multidimensional scaling. Under a Bernoulli sampling model for observed distances, we prove that Riemannian gradient descent on the manifold of rank-$r$ matrices locally converges linearly with high probability when the sampling probability satisfies $p \geq \mathcal{O}(\nu^2 r^2 \log(n)/n)$, where $\nu$ is an EDG-specific incoherence parameter. Furthermore, we provide an initialization candidate using a one-step hard thresholding procedure that yields convergence, provided the sampling probability satisfies $p \geq \mathcal{O}(\nu r^{3/2} \log^{3/4}(n)/n^{1/4})$. A key technical contribution of this work is the analysis of a symmetric linear operator arising from a dual basis expansion in the non-orthogonal basis, which requires a novel application of the Hanson–Wright inequality to establish an optimal restricted isometry property in the presence of coupled terms. Empirical evaluations on synthetic data demonstrate that our algorithm achieves competitive performance relative to state-of-the-art methods. Moreover, we propose a novel notion of matrix incoherence tailored to the EDG setting and provide robustness guarantees for our method.
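For readers unfamiliar with the Gram-matrix formulation mentioned in this abstract, the standard identity linking squared pairwise distances to Gram entries can be written as follows. The symbols $P$, $G$, and $D$ are notation assumed here for illustration and are not taken from the paper.

```latex
% Assumed notation: P \in \mathbb{R}^{n \times r} stacks the n points p_i as rows.
G = P P^{\top} \succeq 0, \qquad \operatorname{rank}(G) \le r,
\qquad
D_{ij}^{2} = \lVert p_i - p_j \rVert_2^{2} = G_{ii} + G_{jj} - 2\,G_{ij}.
```

Each observed distance therefore supplies one linear measurement of the low-rank matrix $G$, which is why recovering the points can be posed as a matrix completion problem.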
[LG-54] Dimension reduction with structure-aware quantum circuits for hybrid machine learning
Link: https://arxiv.org/abs/2508.00048
Authors: Ammar Daskin
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: Any comments are welcome! The simulation code is provided at this https URL
Abstract:Schmidt decomposition of a vector can be understood as writing the singular value decomposition (SVD) in vector form. A vector can be written as a linear combination of tensor products of two-dimensional vectors by recursively applying Schmidt decompositions via SVD to all subsystems. Given a vector expressed as a linear combination of tensor products, using only the $k$ principal terms yields a rank-$k$ approximation of the vector. Therefore, writing a vector in this reduced form retains the most important parts of the vector while removing small noise from it, analogous to SVD-based denoising. In this paper, we show that quantum circuits designed based on a value $k$ (determined from the tensor network decomposition of the mean vector of the training sample) can approximate the reduced-form representations of entire datasets. We then employ this circuit ansatz with a classical neural network head to construct a hybrid machine learning model. Since the output of the quantum circuit for a $2^n$-dimensional vector is an $n$-dimensional probability vector, this provides an exponential compression of the input and can potentially reduce the number of learnable parameters for training large-scale models. We use datasets provided in the Python scikit-learn module for the experiments. The results confirm that the quantum circuit is able to compress data successfully and provide effective rank-$k$ approximations to the classical processing component.
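The link between Schmidt decomposition and SVD described in this abstract can be sketched classically with NumPy: reshape a vector into a matrix across a bipartition, take the SVD, and keep the largest terms. This is only a single-cut classical illustration, not the paper's recursive construction or its quantum circuit. The helper name `schmidt_truncate`, the 4 by 4 bipartition, and the noise level are assumptions.

```python
import numpy as np


def schmidt_truncate(vec, dim_a, dim_b, k):
    """Keep the k largest Schmidt terms of vec across a (dim_a, dim_b) split.

    Returns the truncated vector and the full list of Schmidt coefficients
    (the singular values of the reshaped vector).
    """
    matrix = np.asarray(vec, dtype=float).reshape(dim_a, dim_b)
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Sum of the k leading terms s_i * (u_i tensor v_i), flattened back to a vector.
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return approx.reshape(-1), s


rng = np.random.default_rng(0)
a, b = rng.normal(size=4), rng.normal(size=4)
clean = np.kron(a, b)                       # exactly one Schmidt term
noisy = clean + 1e-3 * rng.normal(size=16)  # small perturbation (assumed level)
approx, schmidt_coeffs = schmidt_truncate(noisy, 4, 4, k=1)
```

Because `clean` is a single tensor product, the rank-1 truncation of `noisy` stays close to it, which is the denoising analogy the abstract draws.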
[LG-55] Hybrid Quantum Classical Surrogate for Real Time Inverse Finite Element Modeling in Digital Twins
Link: https://arxiv.org/abs/2508.00029
Authors: Azadeh Alavi,Sanduni Jayasinghe,Mojtaba Mahmoodian,Sam Mazaheri,John Thangarajah,Sujeeva Setunge
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: Submitted to Scientific Report
Abstract:Large-scale civil structures, such as bridges, pipelines, and offshore platforms, are vital to modern infrastructure, where unexpected failures can cause significant economic and safety repercussions. Although finite element (FE) modeling is widely used for real-time structural health monitoring (SHM), its high computational cost and the complexity of inverse FE analysis, where low-dimensional sensor data must map onto high-dimensional displacement or stress fields, pose ongoing challenges. Here, we propose a hybrid quantum classical multilayer perceptron (QMLP) framework to tackle these issues and facilitate swift updates to digital twins across a range of structural applications. Our approach embeds sensor data using symmetric positive definite (SPD) matrices and polynomial features, yielding a representation well suited to quantum processing. A parameterized quantum circuit (PQC) transforms these features, and the resultant quantum outputs feed into a classical neural network for final inference. By fusing quantum capabilities with classical modeling, the QMLP handles large-scale inverse FE mapping while preserving computational viability. Through extensive experiments on a bridge, we demonstrate that the QMLP achieves a mean squared error (MSE) of 0.0000000000316, outperforming purely classical baselines by a large margin. These findings confirm the potential of quantum-enhanced methods for real-time SHM, establishing a pathway toward more efficient, scalable digital twins that can robustly monitor and diagnose structural integrity in near real time.
[LG-56] Quantum Semi-Random Forests for Qubit-Efficient Recommender Systems
Link: https://arxiv.org/abs/2508.00027
Authors: Azadeh Alavi, Fatemeh Kouchmeshki, Abdolrahman Alavi, Yongli Ren, Jiayang Niu
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: Submitted to IEEE Quantum AI Conference (QAI 2025), awaiting peer review
Abstract:Modern recommenders describe each item with hundreds of sparse semantic tags, yet most quantum pipelines still map one qubit per tag, demanding well beyond one hundred qubits, far out of reach for current noisy-intermediate-scale quantum (NISQ) devices and prone to deep, error-amplifying circuits. We close this gap with a three-stage hybrid machine learning algorithm that compresses tag profiles, optimizes feature selection under a fixed qubit budget via QAOA, and scores recommendations with a Quantum semi-Random Forest (QsRF) built on just five qubits, while performing similarly to the state-of-the-art methods. Leveraging SVD sketching and k-means, we learn a 1000-atom dictionary (97% variance), then solve a 20×20 QUBO via depth-3 QAOA to select 5 atoms. A 100-tree QsRF trained on these codes matches full-feature baselines on ICM-150/500.
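Illustrative sketch (not from the paper): the abstract describes selecting a small number of atoms by solving a QUBO under a fixed budget. The snippet below shows what such a cardinality-constrained QUBO looks like, using exhaustive classical search as a stand-in for QAOA. The matrix values, the function names `qubo_energy` and `select_atoms`, and the reading of the diagonal as negative relevance and the off-diagonal as a redundancy penalty are all assumptions made for illustration.

```python
from itertools import combinations

def qubo_energy(Q, x):
    """Return x^T Q x for a binary vector x and a square matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def select_atoms(Q, k):
    """Exhaustively find the k-hot binary vector with the lowest QUBO energy.

    This is a classical brute-force stand-in, not QAOA.
    """
    n = len(Q)
    best_x, best_e = None, float("inf")
    for chosen in combinations(range(n), k):
        x = [1 if i in chosen else 0 for i in range(n)]
        e = qubo_energy(Q, x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Hypothetical 4x4 instance: diagonal entries reward picking an atom,
# off-diagonal entries penalize picking two redundant atoms together.
Q = [
    [-3.0, 2.0, 0.0, 0.0],
    [2.0, -2.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.5],
    [0.0, 0.0, 0.5, -2.5],
]
x, e = select_atoms(Q, 2)
print(x, e)
```

At the size quoted in the abstract (5 atoms out of 20 candidates) there are only 15,504 subsets, so exhaustive search of this kind can serve as a ground-truth check for a heuristic solver.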
Information Retrieval
[IR-0] Experimental Evaluation of Dynamic Topic Modeling Algorithms
Link: https://arxiv.org/abs/2508.00710
Authors: Ngozichukwuka Onah, Nadine Steinmetz, Hani Al-Sayeh, Kai-Uwe Sattler
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:The amount of text generated daily on social media is gigantic and analyzing this text is useful for many purposes. To understand what lies beneath a huge amount of text, we need dependable and effective computing techniques from self-powered topic models. Nevertheless, there are currently relatively few thorough quantitative comparisons between these models. In this study, we compare these models and propose an assessment metric that documents how the topics change in time.
[IR-1] MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval
Link: https://arxiv.org/abs/2508.00579
Authors: Ziyu Gong, Yihua Huang, Chengcheng Mai
Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR)
Comments:
Abstract:The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidences (such as texts, tables, charts, images, and layouts) distributed across multiple pages, for question understanding and answer generation. The existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former were susceptible to hallucinations, while the latter struggled for inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MMRAG-DocQA, was proposed, leveraging both textual and visual information across long-range pages to facilitate accurate question answering. A hierarchical indexing method with the integration of flattened in-page chunks and topological cross-page chunks was designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, including the page-level parent page retrieval and document-level summary retrieval, was proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experimental results performed on public datasets, MMLongBench-Doc and LongDocURL, demonstrated the superiority of our MMRAG-DocQA method in understanding and answering modality-rich and multi-page documents.
[IR-2] Session-Based Recommendation with Validated and Enriched LLM Intents
Link: https://arxiv.org/abs/2508.00570
Authors: Gyuseok Lee, Yaokun Liu, Yifan Liu, Susik Yoon, Dong Wang, SeongKu Kang
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Session-based recommendation (SBR) aims to predict the next item for an anonymous user in a timely manner. However, SBR suffers from data sparsity due to the short and anonymous nature of sessions. Recently, an emerging line of work has explored inferring the underlying user intents of a session using large language models (LLMs), with the generated intents serving as auxiliary training signals to enhance SBR models. Despite its promise, this approach faces three key challenges: validating intent quality, incorporating session-level multi-intents, and complementing inevitable LLM failure cases. In this paper, we propose VELI4SBR, a two-stage framework that leverages Validated and Enriched LLM-generated Intents for SBR. In the first stage, we generate high-quality intents using a predict-and-correct loop that validates the informativeness of LLM-generated intents with a global intent pool to constrain the LLM’s output space and reduce hallucination. In the second stage, we enhance the SBR model using the generated intents through a lightweight multi-intent prediction and fusion mechanism. Furthermore, we introduce a training strategy that compensates for LLM failures by inferring intents from inter-session behavioral similarities. Extensive experiments show that VELI4SBR outperforms state-of-the-art baselines while improving explainability.
[IR-3] Audio Prototypical Network For Controllable Music Recommendation
Link: https://arxiv.org/abs/2508.00194
Authors: Fırat Öncel, Emiliano Penaloza, Haolun Wu, Shubham Gupta, Mirco Ravanelli, Laurent Charlin, Cem Subakan
Subjects: Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Comments: Accepted to MLSP2025
Abstract:Traditional recommendation systems represent user preferences in dense representations obtained through black-box encoder models. While these models often provide strong recommendation performance, they lack interpretability for users, leaving users unable to understand or control the system’s modeling of their preferences. This limitation is especially challenging in music recommendation, where user preferences are highly personal and often evolve based on nuanced qualities like mood, genre, tempo, or instrumentation. In this paper, we propose an audio prototypical network for controllable music recommendation. This network expresses user preferences in terms of prototypes representative of semantically meaningful features pertaining to musical qualities. We show that the model obtains competitive recommendation performance compared to popular baseline models while also providing interpretable and controllable user profiles.
[IR-4] Melody-Lyrics Matching with Contrastive Alignment Loss
Link: https://arxiv.org/abs/2508.00123
Authors: Changhong Wang, Michel Olvera, Gaël Richard
Subjects: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR)
Comments: 10 pages, 7 figures, 3 tables. This work has been submitted to the IEEE for possible publication
Abstract:The connection between music and lyrics is far beyond semantic bonds. Conceptual pairs in the two modalities such as rhythm and rhyme, note duration and syllabic stress, and structure correspondence, raise a compelling yet seldom-explored direction in the field of music information retrieval. In this paper, we present melody-lyrics matching (MLM), a new task which retrieves potential lyrics for a given symbolic melody from text sources. Rather than generating lyrics from scratch, MLM essentially exploits the relationships between melody and lyrics. We propose a self-supervised representation learning framework with contrastive alignment loss for melody and lyrics. This has the potential to leverage the abundance of existing songs with paired melody and lyrics. No alignment annotations are required. Additionally, we introduce sylphone, a novel representation for lyrics at syllable-level activated by phoneme identity and vowel stress. We demonstrate that our method can match melody with coherent and singable lyrics with empirical results and intuitive examples. We open source code and provide matching examples on the companion webpage: this https URL.
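Illustrative sketch (not from the paper): the abstract mentions a contrastive loss over paired melody and lyrics representations without giving its form. As a rough picture of how contrastive training on paired data generally works, the snippet below implements a generic symmetric InfoNCE loss over toy embeddings. It is not the authors' contrastive alignment loss; the function names, the temperature value, and the two-dimensional vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(scores, target):
    """Negative log-softmax of scores[target], computed stably."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[target]

def info_nce(melody, lyrics, temperature=0.1):
    """Symmetric InfoNCE: item i in `melody` is paired with item i in `lyrics`."""
    n = len(melody)
    sims = [[cosine(m, l) / temperature for l in lyrics] for m in melody]
    # Melody-to-lyrics direction: each row should peak on its diagonal entry.
    m2l = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Lyrics-to-melody direction: same check on the columns.
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    l2m = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return 0.5 * (m2l + l2m)

# Hypothetical embeddings: aligned pairs point the same way, shuffled pairs do not.
melody = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(melody, aligned)
loss_shuffled = info_nce(melody, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each melody is most similar to its own lyrics and large when the pairing is scrambled, which is the signal a retrieval model of this kind would be trained on.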