本篇博文主要内容为 2025-08-20 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-08-20)
今日共更新487篇论文,其中:
- 自然语言处理共52篇(Computation and Language (cs.CL))
- 人工智能共150篇(Artificial Intelligence (cs.AI))
- 计算机视觉共118篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共118篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] The Promise of Large Language Models in Digital Health: Evidence from Sentiment Analysis in Online Health Communities
【速读】: 该论文旨在解决数字健康分析中因专家知识稀缺而导致的患者生成健康内容(patient-generated health content)情感与医学语境复杂性难以准确识别的问题,同时克服传统机器学习方法在医疗场景下数据不足和隐私限制的瓶颈。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)通过上下文学习(in-context learning)整合专家知识,具体表现为构建一个结构化的编码手册(structured codebook),将专家解读指南系统化编码,使LLMs能够通过有针对性的提示(targeted prompting)而非大规模训练实现领域知识的应用,从而在无需额外训练的情况下达到专家级情感分析性能,并展现出与多名专家间一致性的高一致性结果。
链接: https://arxiv.org/abs/2508.14032
作者: Xiancheng Li,Georgios D. Karampatakis,Helen E. Wood,Chris J. Griffiths,Borislava Mihaylova,Neil S. Coulson,Alessio Pasinato,Pietro Panzarasa,Marco Viviani,Anna De Simoni
机构: Queen Mary University of London (伦敦玛丽女王大学); Wolfson Institute of Population Health (WIPH) (沃尔森人口健康研究所); University of Nottingham (诺丁汉大学); University of Milano-Bicocca (米兰比可卡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Digital health analytics face critical challenges nowadays. The sophisticated analysis of patient-generated health content, which contains complex emotional and medical contexts, requires scarce domain expertise, while traditional ML approaches are constrained by data shortage and privacy limitations in healthcare settings. Online Health Communities (OHCs) exemplify these challenges with mixed-sentiment posts, clinical terminology, and implicit emotional expressions that demand specialised knowledge for accurate Sentiment Analysis (SA). To address these challenges, this study explores how Large Language Models (LLMs) can integrate expert knowledge through in-context learning for SA, providing a scalable solution for sophisticated health data analysis. Specifically, we develop a structured codebook that systematically encodes expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than extensive training. Six GPT models validated alongside DeepSeek and LLaMA 3.1 are compared with pre-trained language models (BioBERT variants) and lexicon-based methods, using 400 expert-annotated posts from two OHCs. LLMs achieve superior performance while demonstrating expert-level agreement. This high agreement, with no statistically significant difference from inter-expert agreement levels, suggests knowledge integration beyond surface-level pattern recognition. The consistent performance across diverse LLM models, supported by in-context learning, offers a promising solution for digital health analytics. This approach addresses the critical challenge of expert knowledge shortage in digital health research, enabling real-time, expert-quality analysis for patient monitoring, intervention assessment, and evidence-based health strategies.
zh
[NLP-1] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在被微调为具备规划与工具交互能力的智能体(agentic systems)过程中,因忽视安全性而导致的意外失准(unintentionally misaligned)问题,即模型更可能执行有害任务且减少拒绝能力。解决方案的关键在于提出 Prefix INjection Guard (PING),一种通过自动添加自然语言前缀(prefixes)来引导模型拒绝有害请求的方法;其核心创新在于采用迭代优化策略,交替生成候选前缀并选择能同时提升任务性能与拒绝行为的最优前缀,从而在不牺牲良性任务表现的前提下显著增强模型安全性。
链接: https://arxiv.org/abs/2508.14031
作者: Dongyoon Hahm,Taywon Min,Woogyeol Jin,Kimin Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Source code: this https URL
Abstract:Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
zh
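下面用一段极简的 Python 草图示意 PING 的「生成候选前缀 → 评分 → 择优」迭代流程。其中 `llm.generate`、`agent.solve`、`agent.refuses` 等接口以及等权的评分组合均为假设,仅用于说明思路,并非论文的官方实现:

```python
def generate_candidate_prefixes(llm, n_candidates=8):
    """用 LLM 生成若干候选自然语言前缀(假设接口)。"""
    meta_prompt = "Write a short prefix that reminds an agent to refuse harmful requests."
    return [llm.generate(meta_prompt, temperature=1.0) for _ in range(n_candidates)]

def score_prefix(prefix, agent, benign_tasks, harmful_tasks):
    """评分 = 良性任务成功率 + 有害请求拒绝率(论文可能使用加权组合,此处等权)。"""
    perf = sum(agent.solve(prefix + t) for t in benign_tasks) / len(benign_tasks)
    refusal = sum(agent.refuses(prefix + t) for t in harmful_tasks) / len(harmful_tasks)
    return perf + refusal

def ping_search(llm, agent, benign_tasks, harmful_tasks, iterations=5):
    """迭代交替:(1) 生成候选前缀;(2) 保留兼顾任务性能与拒绝行为的最优前缀。"""
    best_prefix, best_score = "", float("-inf")
    for _ in range(iterations):
        for prefix in generate_candidate_prefixes(llm):
            s = score_prefix(prefix, agent, benign_tasks, harmful_tasks)
            if s > best_score:
                best_prefix, best_score = prefix, s
    return best_prefix
```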
[NLP-2] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中因策略熵坍塌(entropy collapse)导致的生成多样性下降问题,进而限制了大型语言模型(Large Language Models, LLMs)在复杂推理任务中的性能上限(如Pass@k指标)。其核心解决方案是提出一种在线自对弈与变分问题合成(Self-play with Variational problem Synthesis, SvS)策略,通过利用策略产生的正确解来合成具有变化性但参考答案保持不变的新问题,从而在训练过程中持续维持策略熵,有效提升模型在多种推理基准上的泛化能力和长期性能表现。
链接: https://arxiv.org/abs/2508.14029
作者: Xiao Liang,Zhongzhi Li,Yeyun Gong,Yelong Shen,Ying Nian Wu,Zhijiang Guo,Weizhu Chen
机构: University of California, Los Angeles (加州大学洛杉矶分校); Microsoft (微软); School of Artificial Intelligence, Chinese Academy of Sciences (中国科学院人工智能学院); Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy’s generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
zh
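以下是 SvS 自对弈一轮的 Python 示意草图。`policy`、`synthesize`、`verify`、`rlvr_update` 均为假设接口;核心约束是:仅用被验证为正确的解去合成变体问题,且参考答案与原题保持一致:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str  # 参考答案在变体合成中保持不变

def svs_round(policy, problems, synthesize, verify, rlvr_update):
    """一轮 SvS 自对弈(policy/synthesize/verify/rlvr_update 均为假设接口)。"""
    variants = []
    for prob in problems:
        solution = policy.generate(prob.question)
        if verify(solution, prob.answer):                 # 可验证奖励:解正确才用于合成
            new_q = synthesize(prob.question, solution)   # 由正确解合成变体问题
            variants.append(Problem(new_q, prob.answer))  # 参考答案与原题一致
    # 用原题 + 变体继续 RLVR 训练,以维持策略熵、缓解熵坍塌
    return rlvr_update(policy, problems + variants)
```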
[NLP-3] Ask Good Questions for Large Language Models
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的对话系统在话题引导方面存在的准确性不足问题,其根本原因在于模型难以识别用户在相关概念上的认知混淆。解决方案的关键在于提出Ask-Good-Question (AGQ)框架,其中引入了改进的Concept-Enhanced Item Response Theory (CEIRT)模型,以更精准地评估用户的知识水平,并结合LLMs直接生成基于上下文的引导性问题,从而显著提升问答过程中的信息检索效率与用户体验。
链接: https://arxiv.org/abs/2508.14025
作者: Qi Wu,Zhongqi Lu
机构: China University of Petroleum-Beijing (中国石油大学(北京)); Hainan Institute of China University of Petroleum (Beijing) (中国石油大学(北京)海南研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to provide accurate guidance of topic due to their inability to discern user confusion in related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users’ knowledge levels. Our contributions include applying the CEIRT model along with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question answering process. Through comparisons with other baseline methods, our approach outperforms them by significantly enhancing the users’ information retrieval experiences.
zh
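论文的 CEIRT 是在经典项目反应理论(IRT)上引入概念信息的改进模型。下面给出标准三参数 IRT 的概率形式作为参照,这只是经典公式,并非论文 CEIRT 的具体定义:

```python
import math

def irt_3pl(theta, a, b, c):
    """经典三参数 IRT:P(答对) = c + (1 - c) / (1 + exp(-a * (theta - b)))。
    theta: 用户能力; a: 区分度; b: 题目难度; c: 猜测参数。"""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# 能力略低于题目难度时的答对概率
print(irt_3pl(theta=0.0, a=1.2, b=0.5, c=0.2))  # ≈ 0.48
```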
[NLP-4] Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
【速读】: 该论文旨在解决长上下文建模中因合成数据多样性低和事实不一致导致的微调效果受限问题。其解决方案的关键在于提出一种基于多臂老虎机(Multi-Armed Bandit, MAB)的滚动策略(rollout strategy),将长文本划分为多个片段作为“臂”,通过动态选择高奖励期望值的上下文片段输入至大语言模型(Large Language Model, LLM)生成响应,并利用奖励反馈迭代更新各片段的奖励分数,从而实现对最相关信息片段的有效探索与利用。该机制显著提升了偏好数据对的质量与多样性,最终结合直接偏好优化(Direct Preference Optimization, DPO)进一步提升模型在长上下文推理任务上的性能。
链接: https://arxiv.org/abs/2508.13993
作者: Shaohua Duan,Xinze Li,Zhenghao Liu,Xiaoyuan Yi,Yukun Yan,Shuo Wang,Yu Gu,Ge Yu,Maosong Sun
机构: 1. Tsinghua University (清华大学); 2. Shanghai Jiao Tong University (上海交通大学); 3. Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on this https URL.
zh
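下面以 UCB 选臂为例,勾勒 LongMab-PO「片段为臂」的滚动采样思路。具体的选臂公式、奖励函数与 DPO 配对规则在论文中可能不同,此处的 `llm`、`reward_fn` 均为假设接口:

```python
import math

def ucb_select(counts, rewards, t, c=1.4):
    """UCB 式选臂:在平均奖励与探索项之间折中(具体选臂公式为假设)。"""
    def ucb(i):
        if counts[i] == 0:
            return float("inf")  # 未探索过的片段优先
        return rewards[i] / counts[i] + c * math.sqrt(math.log(t) / counts[i])
    return max(range(len(counts)), key=ucb)

def longmab_rollout(chunks, llm, reward_fn, n_rounds=32):
    """片段为臂的滚动采样(llm/reward_fn 为假设接口):采样响应并回传奖励。"""
    counts, rewards = [0] * len(chunks), [0.0] * len(chunks)
    responses = []
    for t in range(1, n_rounds + 1):
        i = ucb_select(counts, rewards, t)
        resp = llm.generate(chunks[i])   # 将所选片段输入 LLM 生成响应
        r = reward_fn(resp)              # 奖励反馈
        counts[i] += 1
        rewards[i] += r
        responses.append((resp, r))
    # 按奖励排序,高/低奖励响应可配对构造 DPO 偏好数据
    return sorted(responses, key=lambda x: x[1], reverse=True)
```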
[NLP-5] RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在识别图像旋转角度(0°、90°、180°、270°)方面的空间推理能力不足问题,特别是其对细微方向差异(如90°与270°)的区分能力薄弱。解决方案的关键在于构建一个高质量、人工筛选的基准测试集RotBench(包含350张生活场景、人像和风景图像),并系统评估多种主流开源与闭源MLLMs在此任务上的表现;实验发现,尽管提供辅助信息(如描述文本、深度图)或使用链式思维提示(chain-of-thought prompting)仅带来微弱且不一致的改进,而通过图像多视角展示结合投票机制可提升模型性能,且微调虽能增强180°旋转识别能力,却无法有效改善90°与270°的区分能力,揭示了当前MLLMs在空间感知上与人类水平之间存在显著差距。
链接: https://arxiv.org/abs/2508.13968
作者: Tianyi Niu,Jaemin Cho,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages. Code and data: this https URL
Abstract:We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench – a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information – including captions, depth maps, and more – or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models’ ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying rotation.
zh
[NLP-6] ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features
【速读】: 该论文旨在解决酒店行业中客户评论评分预测(Review Rating Prediction, RRP)的问题,核心挑战在于如何从非结构化的文本评论中提取可解释且有效的特征以提升评分预测的准确性与可解释性。解决方案的关键在于提出一种名为ReviewGraph的新框架,该框架通过抽取(主体,谓词,客体)三元组并关联情感得分,将文本评论转化为知识图谱;随后利用Node2Vec进行图嵌入,并结合情感特征输入至机器学习分类器进行评分预测。相较于传统自然语言处理方法(如Bag of Words、TF-IDF、Word2Vec)和大语言模型(LLMs),ReviewGraph在保持相近甚至更优预测性能的同时,显著降低了计算成本,并在Cohen’s Kappa等一致性指标上表现更佳,同时具备更强的可解释性和可视化探索能力,以及未来集成到检索增强生成(Retrieval-Augmented Generation, RAG)系统中的潜力。
链接: https://arxiv.org/abs/2508.13953
作者: A.J.W. de Vink,Natalia Amat-Lefort,Lifeng Han
机构: LIACS, Leiden University (莱顿大学计算机科学系); LUMC, Leiden University, NL (莱顿大学医学中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them in the HotelRec dataset. In comparison to the state of the art literature, our proposed model performs similar to their best performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen’s Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share ReviewGraph output and platform open-sourced on our GitHub page this https URL
zh
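按摘要描述的流水线(三元组建图 → Node2Vec 嵌入 → 情感特征 → 分类器),可以写出如下示意代码。其中选用 `node2vec` 第三方包与简单的特征拼接方式均为实现上的假设,并非论文原始代码:

```python
import networkx as nx
import numpy as np
from node2vec import Node2Vec  # pip install node2vec(选用此包是实现上的假设)

# 假设已从评论中抽取 (主语, 谓词, 宾语) 三元组及情感得分
triples = [("room", "was", "clean", 0.8),
           ("staff", "ignored", "request", -0.6)]

G = nx.Graph()
for s, p, o, sent in triples:
    G.add_edge(s, o, predicate=p, sentiment=sent)  # 谓词与情感作为边属性

n2v = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50, quiet=True)
model = n2v.fit(window=5, min_count=1)             # Node2Vec 图嵌入

def review_features(review_triples):
    """评论级特征 = 节点嵌入均值 ⊕ 平均情感分(特征拼法为示意)。"""
    vecs = [model.wv[s] for s, _, _, _ in review_triples if s in model.wv]
    emb = np.mean(vecs, axis=0) if vecs else np.zeros(32)
    sent = np.mean([t[3] for t in review_triples])
    return np.concatenate([emb, [sent]])

# 之后可用 sklearn 分类器(如随机森林)以该特征预测 1-5 星评分
```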
[NLP-7] Query Logs Analytics: A Systematic Literature Review
【速读】: 该论文旨在解决当前关于日志(log)使用研究分散、缺乏系统整合的问题,尤其聚焦于数据库(Database, DB)、数据仓库(Data Warehouse, DW)、网页(Web)和知识图谱(Knowledge Graph, KG)日志的利用现状。其关键解决方案在于通过系统性综述方法分析超过300篇文献,回答三个核心问题:不同类型的日志是否具有共通的结构与功能特征、是否存在标准化的处理流程、以及哪些非功能性需求(Non-Functional Requirements, NFRs)指导其应用。研究表明,尽管存在少量端到端方法且日志处理流程尚未标准化,但多种日志类型共享结构性元素,从而为统一建模与优化提供了基础,并指出了未来在知识图谱日志挖掘与普及方面的研究方向。
链接: https://arxiv.org/abs/2508.13949
作者: Dihia Lanasri
机构: ESI(阿尔及利亚科学与技术学院)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:
Abstract:In the digital era, user interactions with various resources such as databases, data warehouses, websites, and knowledge graphs (KGs) are increasingly mediated through digital platforms. These interactions leave behind digital traces, systematically captured in the form of logs. Logs, when effectively exploited, provide high value across industry and academia, supporting critical services (e.g., recovery and security), user-centric applications (e.g., recommender systems), and quality-of-service improvements (e.g., performance optimization). Despite their importance, research on log usage remains fragmented across domains, and no comprehensive study currently consolidates existing efforts. This paper presents a systematic survey of log usage, focusing on Database (DB), Data Warehouse (DW), Web, and KG logs. More than 300 publications were analyzed to address three central questions: (1) do different types of logs share common structural and functional characteristics? (2) are there standard pipelines for their usage? (3) which constraints and non-functional requirements (NFRs) guide their exploitation? The survey reveals a limited number of end-to-end approaches, the absence of standardization across log usage pipelines, and the existence of shared structural elements among different types of logs. By consolidating existing knowledge, identifying gaps, and highlighting opportunities, this survey provides researchers and practitioners with a comprehensive overview of log usage and sheds light on promising directions for future research, particularly regarding the exploitation and democratization of KG logs.
zh
[NLP-8] Prompt Orchestration Markup Language
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中因提示工程(prompt engineering)复杂性带来的挑战,包括提示结构不清晰、多源异构数据(如文档、表格、图像)集成困难、格式敏感性强以及缺乏系统化工具支持等问题。其解决方案的关键在于提出一种组件化的提示编排标记语言(Prompt Orchestration Markup Language, POML),通过逻辑结构标签(角色、任务、示例)、专用数据集成标签、类CSS的样式系统实现内容与表现分离,同时引入模板机制支持动态提示生成,并配套完整的开发者工具链(IDE支持、SDK等),从而提升提示的可维护性、复用性和协作效率。
链接: https://arxiv.org/abs/2508.13948
作者: Yuge Zhang,Nan Chen,Jiahang Xu,Yuqing Yang
机构: Microsoft Research (微软研究院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: All findings in this paper are derived from a POML snapshot as of February 2025
Abstract:Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.
zh
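结合摘要对 POML 的描述(角色/任务/示例组件、专用数据标签、类 CSS 样式),下面给出一个示意性的文档片段(以 Python 字符串承载)。标签名与属性均为猜测性示意,并非 POML 的官方语法:

```python
# 一个 POML 风格的提示文档示意(标签与属性为猜测性示意,非官方语法)
poml_doc = """
<poml>
  <role>You are a careful data analyst.</role>
  <task>Summarize the quarterly sales table below.</task>
  <table src="sales_q3.csv" />             <!-- 专用数据标签:结构化数据接入 -->
  <example>
    <input>sales_q2.csv</input>
    <output>Q2 revenue grew 12%, driven by ...</output>
  </example>
  <style selector="task" tone="concise" /> <!-- 类 CSS 样式:内容与呈现解耦 -->
</poml>
"""
print(poml_doc)
```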
[NLP-9] MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在科学领域评估中存在的三大挑战:一是缺乏对模型在多语言场景下推理能力的充分评估;二是现有基准测试在跨模态覆盖范围上的不足;三是科学知识要点缺乏细粒度标注。为应对这些问题,作者提出了MME-SCI这一综合性且具有挑战性的新基准,其关键在于精心构建了1,019个高质量问答对,涵盖数学、物理、化学和生物四个学科,并支持中文、英文、法文、西班牙文和日文五种语言,同时采用三种不同的评估模式,从而实现对MLLMs在多语言环境和细粒度科学知识理解方面的全面测评。实验表明,该基准显著提升了评估难度,例如在仅图像输入模式下,o4-mini模型在各学科准确率均低于60%,凸显了现有模型的局限性。
链接: https://arxiv.org/abs/2508.13938
作者: Jiacheng Ruan,Dan Jiang,Xian Gao,Ting Liu,Yuzhuo Fu,Yangyang Kang
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, work in progress
Abstract:Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models’ reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs’ comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI’s multilingual and fine-grained knowledge attributes, we analyzed existing models’ performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at this https URL.
zh
[NLP-10] Improved Generalized Planning with LLMs through Strategy Refinement and Reflection
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成通用规划方案(generalized plans)时因策略生成错误导致程序实现失败的问题。此前方法仅生成单一自然语言策略并直接转化为Python代码,若策略有误则必然导致错误的通用计划。其解决方案的关键在于:首先将策略以伪代码形式生成,并引入自动调试机制对伪代码进行验证与修正;其次,在Python代码调试阶段增加反思步骤,引导LLM定位失败原因;最后借鉴LLM代码生成技术生成多个程序变体并择优选取。实验表明,该方法在17个基准领域中显著提升了通用计划的质量,且在12个领域中实现了所有可生成任务的求解。
链接: https://arxiv.org/abs/2508.13876
作者: Katharina Stein,Nils Hodel,Daniel Fišer,Jörg Hoffmann,Michael Katz,Alexander Koller
机构: University of Freiburg (弗莱堡大学); University of Bremen (不来梅大学); ETH Zurich (苏黎世联邦理工学院); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:LLMs have recently been used to generate Python programs representing generalized plans in PDDL planning, i.e., plans that generalize across the tasks of a given PDDL domain. Previous work proposed a framework consisting of three steps: the LLM first generates a summary and then a strategy for the domain, both in natural language, and then implements that strategy as a Python program, that gets debugged on example planning tasks. In that work, only one strategy is generated and passed directly to the program generation. If the strategy is incorrect, its implementation will therefore result in an incorrect generalized plan. Here, we introduce an approach that generates the strategy in the form of pseudocode and enables automatic debugging of the pseudocode, hence allowing us to identify and fix errors prior to the generation of the generalized plan itself. Additionally, we extend the Python debugging phase with a reflection step prompting the LLM to pinpoint the reason for the observed plan failure. Finally, we take inspiration from LLM code generation to produce several program variants and pick the best one. Running experiments on 17 benchmark domains, we show that these extensions substantially improve (and never deteriorate) the quality of the generalized plans. In 12 of the domains, our best Python programs solve all tasks that can be generated with the respective instance generator.
zh
[NLP-11] Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling
【速读】: 该论文旨在解决建筑行业中从非结构化法语建筑技术规范(Building Technical Specification, BTS)文档中自动提取需求信息的难题,以支持建筑信息模型(Building Information Modeling, BIM)的智能化应用。其解决方案的关键在于结合自然语言处理(Natural Language Processing, NLP)技术,采用基于Transformer架构的预训练语言模型CamemBERT与法国语料库微调后的Fr_core_news_lg模型进行命名实体识别(Named Entity Recognition, NER),并利用自定义特征向量和监督学习方法(如随机森林)实现关系抽取(Relation Extraction, RE)。实验表明,CamemBERT和Fr_core_news_lg在NER任务中F1分数超过90%,而随机森林在RE任务中表现最优(F1 > 80%),为后续构建知识图谱以增强自动化验证系统奠定基础。
链接: https://arxiv.org/abs/2508.13833
作者: Insaf Nahri,Romain Pinquié,Philippe Véron,Nicolas Bus,Mathieu Thorel
机构: Arts et Métiers Institute of Technology (法国国立工艺学院); CSTB (法国国家科学与技术研究中心); Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学); Grenoble INP (格勒诺布尔综合理工学院); G-SCOP (格勒诺布尔自动化控制与系统科学实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This study explores the integration of Building Information Modeling (BIM) with Natural Language Processing (NLP) to automate the extraction of requirements from unstructured French Building Technical Specification (BTS) documents within the construction industry. Employing Named Entity Recognition (NER) and Relation Extraction (RE) techniques, the study leverages the transformer-based model CamemBERT and applies transfer learning with the French language model Fr_core_news_lg, both pre-trained on a large French corpus in the general domain. To benchmark these models, additional approaches ranging from rule-based to deep learning-based methods are developed. For RE, four different supervised models, including Random Forest, are implemented using a custom feature vector. A hand-crafted annotated dataset is used to compare the effectiveness of NER approaches and RE models. Results indicate that CamemBERT and Fr_core_news_lg exhibited superior performance in NER, achieving F1-scores over 90%, while Random Forest proved most effective in RE, with an F1 score above 80%. The outcomes are intended to be represented as a knowledge graph in future work to further enhance automatic verification systems.
zh
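论文的关系抽取部分采用自定义特征向量加随机森林。下面是一个玩具级 Python 示意:特征设计(实体距离、先后顺序、句长)为假设,非论文的原始特征方案:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def relation_features(tokens, e1, e2):
    """实体对的简化特征向量:实体距离、先后顺序、句长。
    e1/e2 为 (起, 止) token 下标;特征设计仅为示意。"""
    dist = abs(e1[0] - e2[0])
    order = 1.0 if e1[0] < e2[0] else 0.0
    return np.array([dist, order, len(tokens)], dtype=float)

# 玩具数据:1 = 存在"需求"关系,0 = 无关系
X = np.array([[2, 1, 8], [7, 0, 12], [1, 1, 6], [9, 0, 15]], dtype=float)
y = np.array([1, 0, 1, 0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

feat = relation_features(list(range(10)), (1, 2), (4, 5)).reshape(1, -1)
print(clf.predict(feat))
```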
[NLP-12] The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
【速读】: 该论文旨在解决自然语言生成(Natural Language Generation, NLG)自动评估指标(Automatic Evaluation Metrics, AEM)缺乏统一标准、效果不稳定且验证方法不规范的问题。当前尽管已有从基于词法匹配到语义相似度模型,再到大语言模型作为评判者(LLM-as-a-Judge)的多种AEM,但尚无一种指标能全面准确地逼近人类判断,且其有效性在不同任务和数据集上表现差异显著。论文的关键解决方案在于:首先,摒弃对“完美指标”的追求,转而根据具体任务需求选择合适的评估指标;其次,强调采用互补性评估策略以覆盖文本质量的不同维度;最后,主张未来新指标的研发应聚焦于改进验证方法论,提升评估结果的可靠性与可复现性。
链接: https://arxiv.org/abs/2508.13816
作者: Maria Paz Oliva,Adriana Correia,Ivan Vankov,Viktor Botev
机构: Iris AI; Neurobiology BAS (保加利亚科学院神经生物学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure. Accepted to RANLP 2025
Abstract:Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de-facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the ‘perfect metric’. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.
zh
[NLP-13] Prompt-Based One-Shot Exact Length-Controlled Generation with LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文本生成过程中难以精确控制输出长度的问题,即模型常因无法可靠维持内部token计数而出现输出过长或过短的情况。解决方案的关键在于提出一种基于提示(prompt-based)的一次性策略,通过在输入提示中添加倒计时标记(countdown markers)和显式计数规则,引导模型在生成过程中“边写边计数”,从而实现对输出token数量的精准控制,且无需任何微调或迭代采样。该方法在多种任务场景下均表现出显著效果,例如在MT-Bench-LI指令遵循任务中,GPT-4.1的严格长度合规率从不足30%提升至95%以上,同时保持了答案质量。
链接: https://arxiv.org/abs/2508.13805
作者: Juncheng Xie,Hung-yi Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages
Abstract:Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model “writes while counting.” We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
zh
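摘要明确提到该方法「在提示中追加倒计时标记与显式计数规则,让模型边写边计数」。下面是按此思路构造提示的极简草图,具体规则措辞为示意:

```python
def countdown_prompt(task: str, n_words: int) -> str:
    """构造倒计时提示:附加倒计时标记与显式计数规则,让模型『边写边计数』。"""
    rules = (
        f"Write exactly {n_words} words. Before each word, print the remaining "
        f"count in brackets, counting down from [{n_words}] to [1]. "
        "Stop immediately after the word following [1].\n"
    )
    return rules + f"Task: {task}"

print(countdown_prompt("Describe a sunset.", 10))
```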
[NLP-14] Beyond Human Judgment: A Bayesian Evaluation of LLM s Moral Values Understanding
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在理解道德维度方面与人类相比的性能差异问题。传统研究多依赖确定性标注标准(如多数投票或包含规则),难以刻画人类判断中的不确定性。为此,作者提出一种基于贝叶斯(Bayesian)方法的大规模评估框架,通过建模标注者之间的分歧来区分两类不确定性:固有的人类不一致(aleatoric uncertainty)和模型对领域敏感性的认知不确定性(epistemic uncertainty)。该方案的关键在于利用GPU优化的贝叶斯推理机制处理百万级模型查询和超大规模标注数据(超过25万条来自约700名标注者的注释),从而量化LLMs在道德判断任务上的表现,并发现其通常优于80%以上的人类标注者,尤其在减少假阴性(false negatives)方面显著优于人类,体现出更强的道德敏感性。
链接: https://arxiv.org/abs/2508.13804
作者: Maciej Skorski,Alina Landowska
机构: University of Luxembourg (卢森堡大学); SWPS University (SWPS大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
zh
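论文使用 GPU 优化的分层贝叶斯框架建模标注分歧;下面仅给出一个极简的 Beta-Binomial 后验对比示意(计数为虚构示例),用来说明「带不确定性地比较模型与标注者准确率」这一思路:

```python
import numpy as np

def accuracy_posterior(correct, total, a=1.0, b=1.0, n_samples=10_000, seed=0):
    """Beta-Binomial 后验采样:带不确定性地估计准确率
    (极简示意,远非论文的完整分层贝叶斯框架)。"""
    rng = np.random.default_rng(seed)
    return rng.beta(a + correct, b + total - correct, size=n_samples)

model_post = accuracy_posterior(correct=870, total=1000)      # 虚构计数
annotator_post = accuracy_posterior(correct=800, total=1000)  # 虚构计数
print("P(model > annotator) ≈", (model_post > annotator_post).mean())
```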
[NLP-15] TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文档摘要时存在的事实准确性问题,尤其是在医学领域中,用户难以验证摘要内容的真实性。为缓解这一问题,论文提出TracSum基准测试,其核心解决方案是构建一个可追溯的、基于特定医学维度的摘要任务,其中每个摘要都配有细粒度的句子级引用(sentence-level citations),使用户能够回溯至原始文本以验证信息来源。关键创新在于:1)标注了500篇医学摘要的七个关键医学方面,形成3500个摘要-引用对;2)设计了一套细粒度评估框架,通过四个指标衡量生成内容的完整性和一致性;3)提出Track-Then-Sum流水线作为基线方法,证明在摘要前显式进行句子级追踪可提升生成准确性,而引入完整上下文则进一步增强摘要完整性。
链接: https://arxiv.org/abs/2508.13798
作者: Bohao Chu,Meijie Li,Sameh Frihat,Chengyu Gu,Georg Lodde,Elisabeth Livingstone,Norbert Fuhr
机构: University of Duisburg-Essen (杜伊斯堡-埃森大学); Institute for AI in Medicine (IKIM) (人工智能医学研究所); University Hospital Essen (埃森大学医院)
类目: Computation and Language (cs.CL)
备注: 8 main pages, 12 appendix pages
Abstract:While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist, especially in the medical domain. Tracing evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary-citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to assess the evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves completeness.
zh
[NLP-16] Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)生成的文本是否能够有效模拟儿童语言特征,尤其是在教育场景中用于儿童导向工具时其语言适配性如何。解决方案的关键在于通过对比LLM生成文本与真实德国儿童对图画故事的描述,在多个心理语言学维度上进行系统分析,包括词频、词汇丰富度、句子和词长、词性标注以及基于词嵌入的语义相似性。研究发现,尽管采用少量样本提示(few-shot prompt)可略微提升LLM文本与儿童文本的相似性,但整体仍存在显著差异,尤其在词汇多样性、名词使用频率及语义空间分布方面未能复制儿童语言模式,从而揭示了当前LLM在模仿儿童语言上的局限性,并为未来基于多模态提示(文本+图像)优化儿童语言生成提供了实证依据。
链接: https://arxiv.org/abs/2508.13769
作者: Hanna Woloszyn,Benjamin Gagl
机构: Self-Learning Systems Lab, Faculty of Human Sciences, Department of Special Education and Rehabilitation, Cologne, Germany
类目: Computation and Language (cs.CL)
备注:
Abstract:The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children’s descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompt increased similarities between children and LLM text to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
zh
[NLP-17] MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment
【速读】: 该论文旨在解决机器生成文本(Machine-Generated Text, MGT)检测模型在跨域场景下泛化能力差的问题,即当前检测方法在训练和测试数据来自同一领域时表现良好,但在面对未见过的领域时性能显著下降,这主要归因于不同来源数据之间的领域偏移(domain shift)。解决方案的关键在于从频域角度出发,提出MGT-Prism检测框架:首先通过分析文本表示在频域中的特征,发现MGT与人类写作文本(Human-Written Text, HWT)在幅值(magnitude)上存在显著差异,而谱形(spectral pattern)则具有跨域一致性;据此设计了一个低频域滤波模块以去除受领域变化影响的文档级特征,并引入动态谱对齐策略来提取任务特定且领域不变的特征,从而提升检测器在多领域场景下的泛化性能。
链接: https://arxiv.org/abs/2508.13768
作者: Shengchao Liu,Xiaoming Liu,Chengzhengxu Li,Zhaohan Zhang,Guoxin Ma,Yu Lan,Shuai Xiao
机构: Xi’an Jiaotong University (西安交通大学); Queen Mary University of London (伦敦玛丽女王大学); Alibaba (阿里巴巴)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models have shown growing ability to generate fluent and coherent texts that are highly similar to the writing style of humans. Current detectors for Machine-Generated Text (MGT) perform well when they are trained and tested in the same domain but generalize poorly to unseen domains, due to domain shift between data from different sources. In this work, we propose MGT-Prism, an MGT detection method from the perspective of the frequency domain for better domain generalization. Our key insight stems from analyzing text representations in the frequency domain, where we observe consistent spectral patterns across diverse domains, while significant discrepancies in magnitude emerge between MGT and human-written texts (HWTs). The observation initiates the design of a low frequency domain filtering module for filtering out the document-level features that are sensitive to domain shift, and a dynamic spectrum alignment strategy to extract the task-specific and domain-invariant features for improving the detector’s performance in domain generalization. Extensive experiments demonstrate that MGT-Prism outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios.
zh
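按摘要思路,可以对 token 表示序列做 FFT 并置零低频分量,近似「低频域滤波模块」的作用。滤波方向与截断比例为示意性假设,论文模块的具体设计可能不同:

```python
import numpy as np

def filter_low_freq(token_reprs, cut_ratio=0.1):
    """沿序列维做 FFT 并置零低频分量,近似滤除对领域偏移敏感的
    文档级特征(滤波方向与截断比例为示意性假设)。"""
    spec = np.fft.rfft(token_reprs, axis=0)            # [T//2+1, D] 频谱
    cutoff = max(1, int(spec.shape[0] * cut_ratio))
    spec[:cutoff] = 0                                  # 置零低频(含直流)分量
    return np.fft.irfft(spec, n=token_reprs.shape[0], axis=0)

x = np.random.randn(128, 768)                          # 假设的 token 表示序列
print(filter_low_freq(x).shape)                        # (128, 768)
```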
[NLP-18] Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学问答(Scientific Question Answering, QA)场景中表现出的“谄媚倾向”(sycophancy)问题,即模型倾向于迎合用户信念而非坚持事实正确性,尤其在偏好对齐(preference-based alignment)训练策略下,这种倾向被进一步强化。该问题在高风险场景中可能导致错误知识传播与决策误导。解决方案的关键在于提出一种名为Pressure-Tune的轻量级后训练方法:通过在合成对抗性对话数据上微调模型,并引入链式思维(chain-of-thought)推理路径来明确拒绝用户提供的错误信息,同时强化模型对事实的承诺,从而显著提升模型在面对社会压力时的抗谄媚能力(sycophancy resistance),且不损害其准确性或对有效反馈的响应能力。
链接: https://arxiv.org/abs/2508.13743
作者: Kaiwei Zhang,Qi Jia,Zijian Chen,Wei Sun,Xiangyang Zhu,Chunyi Li,Dandan Zhu,Guangtao Zhai
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Institute of Artificial Intelligence, Chinese Academy of Sciences (中国科学院人工智能研究院); 3. National University of Defense Technology (国防科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model’s ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.
zh
[NLP-19] EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation
【速读】: 该论文旨在解决大规模、多源异构脑电图(EEG)数据在神经科学和临床实践中高效检索与语义解析的挑战。其核心解决方案是提出一种三层超图增强的检索增强生成框架(EEG-MedRAG),通过构建融合EEG领域知识、个体患者案例与大规模数据资源的n元关系超图结构,实现联合语义-时间维度的检索与因果链式诊断生成,从而提升临床决策支持系统的准确性与可解释性。
链接: https://arxiv.org/abs/2508.13735
作者: Yi Wang,Haoran Luo,Lu Meng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at this https URL.
zh
[NLP-20] Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings ECAI2025
【速读】: 该论文试图解决的问题是:当前主流方法通过将词嵌入(word embeddings)映射到人类可理解的语义特征(feature norms)来解释深度学习模型中隐含的知识,但这些方法是否真正捕捉到了语义知识仍存在争议。其关键解决方案在于,作者指出仅依赖预测准确率作为评估指标并不能可靠地证明嵌入中包含真实的语义信息,因为即使随机数据也能被高精度预测,这表明结果主要受算法上限(algorithmic upper bound)驱动,而非嵌入本身的语义表征能力。因此,论文强调这类映射更多反映的是向量空间中的几何相似性,而非语义属性的真实涌现,从而质疑了基于预测性能对不同语义数据集进行比较的有效性。
链接: https://arxiv.org/abs/2508.13729
作者: Hanna Herasimchyk,Alhassan Abdelhalim,Sören Laue,Michaela Regneri
机构: Universität Hamburg (汉堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 6 Figures. Published at ECAI 2025 in a version without the Appendix
Abstract:Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than meaningful semantic representation in the word embeddings. Consequently, comparisons between datasets based solely on prediction performance do not reliably indicate which dataset is better captured by the word embeddings. Our analysis illustrates that such mappings primarily reflect geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties.
zh
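论文的核心论点是:高预测精度可能只反映算法上界。下面的小实验可复现这一直觉:当嵌入维度高于样本数时,岭回归在样本内几乎可以完美「预测」完全随机的特征(评估协议经过简化,论文的实验设置更严谨):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 300))   # 200 个“词”的 300 维嵌入(随机模拟)
y_random = rng.standard_normal(200)   # 与嵌入毫无关系的随机“语义特征”

model = Ridge(alpha=1e-3).fit(X, y_random)
print("in-sample R^2:", round(model.score(X, y_random), 3))  # 接近 1.0
```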
[NLP-21] Generics and Default Reasoning in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理非单调逻辑中的默认推理(default reasoning)能力问题,特别是其对通用陈述(generic generalizations,如“鸟类会飞”、“乌鸦是黑色的”)的推理表现。这类推理模式具有允许例外的特性,是认知科学、语言学和逻辑学研究的核心议题。研究的关键在于系统评估28个前沿LLM在20种可废止推理(defeasible reasoning)模式下的表现,并对比不同提示策略(如少样本提示与思维链提示)的影响。结果表明,尽管部分模型在零样本条件下能较好完成默认推理任务,但多数模型难以区分可废止推理与演绎推理,常将通用语句误读为全称命题;且思维链提示(Chain-of-Thought prompting)反而显著降低性能(平均准确率下降11.14%),揭示了当前LLMs在默认推理方面的局限性与潜在改进方向。
链接: https://arxiv.org/abs/2508.13718
作者: James Ravi Kirkpatrick,Rachel Katharine Sterken
机构: University of Oxford (牛津大学); Magdalen College (马格达伦学院); University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 33 pages, 26 figures
Abstract:This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., ‘Birds fly’, ‘Ravens are black’) central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either struggle to distinguish between defeasible and deductive inference or misinterpret generics as universal statements. These findings underscore both the promise and limits of current LLMs for default reasoning.
zh
[NLP-22] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
【速读】: 该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在低资源语言(如越南语)的多模态教育评估任务中性能表现不明的问题,特别是其跨语言多模态推理能力是否具备实际应用潜力。其关键解决方案是构建首个针对越南语多模态考试的基准测试集ViExam,包含2,548道涵盖数学、物理、化学、生物、地理、驾驶理论和智商测试共7个学科领域的多模态题目,并系统评估了主流商用与开源VLMs在该基准上的表现。结果显示,尽管最先进模型仅达到57.74%平均准确率,远低于人类平均水平(66.54%),且英语指令提示无法提升性能,反而降低1个百分点,但引入人机协作机制可使模型性能提升5个百分点,揭示了当前VLM在低资源多模态场景下仍存在显著局限性,同时指出了改进方向。
链接: https://arxiv.org/abs/2508.13680
作者: Vy Tuong Dang,An Vo,Quang Tau,Duc Dm,Daeyoung Kim
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: this https URL.
zh
[NLP-23] Input Time Scaling
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在训练和推理阶段过度依赖高质量数据集与单纯扩大数据规模所带来的性能瓶颈问题。传统方法通常通过后训练阶段使用大规模精心筛选的数据集(data training scaling)以及测试时进行复杂推理(inference time scaling)来提升性能,但这种方法存在资源浪费、边际效益递减等问题。本文提出一种新的缩放范式——输入时间缩放(Input Time Scaling),其核心在于将计算资源分配到查询(query)处理阶段,即在训练和测试过程中利用元知识(meta-knowledge)对输入进行策略性优化,如添加无关信息或随机采样等低质量数据策略。关键发现包括:训练与测试需协同设计(training-testing co-design),仅在训练或测试中应用策略会导致性能显著下降;看似低质量的数据反而可能带来更高性能,挑战了“垃圾进,垃圾出”(garbage in, garbage out)的直觉认知;同时验证了“少即是多”(Less is More)现象,即少量高质量示例即可激发模型高阶推理能力。实验表明,在Qwen2.5-32B-Instruct基础上实现AIME24和AIME25的SOTA结果(pass@1达76.7%),进一步通过三模型多数投票提升至80%,且基于DeepSeek-R1-Distill-Qwen-32B可达到AIME24 86.7%的最优表现。
链接: https://arxiv.org/abs/2508.13654
作者: Rapheal Huang(Yuming),Weilong Guo
机构: University of Chinese Academy of Sciences (中国科学院大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data training scaling) and doing reasoning in test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design there. We need to apply query strategies during both training and testing. Only applying strategies on training or testing would seriously degrade the performance. We are also surprised to find that seemingly low data quality datasets can gain high performance. Adding irrelevant information to the queries, randomly selecting examples from a minimally filtered dataset, can even perform the best. These findings contradict the widely held inductive bias, “garbage in, garbage out”. Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k VS 1k) perform worse, simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. A small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24(76.7%) and AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-source our datasets, data pipelines, evaluation results, and checkpoints.
zh
[NLP-24] CRISP: Persistent Concept Unlearning via Sparse Autoencoders
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中需选择性移除有害或不 Desired 知识,同时保持模型通用能力的问题。现有基于稀疏自动编码器(Sparse Autoencoders, SAEs)的方法多在推理阶段进行干预,无法持久改变模型参数,易被拥有参数访问权限的恶意方绕过或撤销。论文提出 CRISP 方法,其关键在于利用 SAE 自动识别跨层显著特征,并通过抑制这些特征的激活实现持久的概念遗忘(concept unlearning)。实验表明,CRISP 在 WMDP 基准的安全敏感任务上优于先前方法,在保留模型整体性能的同时有效清除有害知识,且特征级分析显示其能实现目标概念与良性概念的语义解耦,从而实现精准抑制。
链接: https://arxiv.org/abs/2508.13650
作者: Tomer Ashuach,Dana Arad,Aaron Mueller,Martin Tutek,Yonatan Belinkov
机构: Technion – Israel Institute of Technology(以色列理工学院); Boston University(波士顿大学); TakeLab, University of Zagreb(萨格勒布大学TakeLab实验室)
类目: Computation and Language (cs.CL)
备注: 18 pages, 5 figures
Abstract:As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model’s parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
zh
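CRISP 的核心操作可示意为:用 SAE 编码隐状态、置零显著特征、再解码回去。注意论文通过参数级微调使这种抑制持久化,而下面的草图只展示单次前向的特征抑制(`sae_encode`/`sae_decode` 为假设的已训练 SAE):

```python
import torch

def suppress_concept(h, sae_encode, sae_decode, salient_idx):
    """SAE 特征抑制示意(单次前向):编码隐状态、置零显著特征、再解码。
    CRISP 实际通过参数级微调使抑制持久化,而非仅在推理时干预。"""
    z = torch.relu(sae_encode(h)).clone()  # 稀疏特征激活
    z[..., salient_idx] = 0.0              # 抑制与目标概念相关的特征
    return sae_decode(z)                   # 重构回残差流表示
```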
[NLP-25] AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
【速读】: 该论文旨在解决文档视觉问答(Document VQA)在低资源环境下处理长文档时面临的两大挑战:一是因上下文长度限制导致的信息丢失问题,二是训练数据不足引发的模型性能瓶颈。解决方案的关键在于提出一个统一的自适应框架AdaDocVQA,其核心创新包括:(1)混合文本检索架构实现高效的文档分割;(2)智能数据增强流水线自动构建高质量的推理型问答对,并通过多级验证确保准确性;(3)自适应集成推理机制结合动态配置生成与早期停止策略,提升推理效率与精度。实验表明,该框架在日语文档VQA基准上显著优于现有方法,为低资源语言和专业领域提供了可扩展的解决方案。
链接: https://arxiv.org/abs/2508.13606
作者: Haoxuan Li,Wei Song,Aofan Liu,Peiwu Qin
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Guangdong University of Technology (广东工业大学); School of Information Engineering, Peking University (北京大学信息工程学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code available at: this https URL.
zh
[NLP-26] Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
【Quick Read】: This study asks whether speech-enabled large language models (Speech-LLMs) exhibit gender bias the way text-based LLMs do, focusing on the implicit gender leanings revealed through speaker selection during speech synthesis. The core question: do Speech-LLMs systematically favor one gender because of their training data or model design? The key idea is a speaker-assignment-based analysis: since a speech model must output a gendered voice, speaker selection becomes an explicit gender cue that makes bias quantifiable. Taking the Bark model as a case study, the authors build two test sets (gender-stereotyped professions and words with gendered connotations) and find that while the model shows no systematic bias, it does exhibit gender awareness and some gender inclinations.
Link: https://arxiv.org/abs/2508.13603
Authors: Dariia Puhach, Amir H. Payberah, Éva Székely
Institutions: KTH Royal Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark’s speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and has some gender inclinations.
[NLP-27] A Comparative Study of Decoding Strategies in Medical Text Generation
【Quick Read】: This paper investigates how the choice of decoding strategy affects output quality when large language models (LLMs) are used in healthcare, across five open-ended medical tasks: translation, summarization, question answering, dialogue, and image captioning. The key contribution is a systematic evaluation of 11 common decoding strategies on medically specialized and general-purpose LLMs of different sizes. Deterministic strategies (such as beam search) generally outperform stochastic ones (such as η sampling and top-k sampling), and slower decoding methods usually yield higher-quality outputs. Notably, although medical LLMs do better on some tasks, they show no overall performance advantage and are more sensitive to the decoding choice, so medical applications should prioritize careful selection of decoding methods rather than relying solely on model size or type.
Link: https://arxiv.org/abs/2508.13580
Authors: Oriana Presacan, Alireza Nik, Vajira Thambawita, Bogdan Ionescu, Michael Riegler
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while η and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice. We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.
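For readers unfamiliar with the strategies being compared, the snippet below contrasts the paper's best-performing deterministic strategy (beam search) with one of its worst-performing stochastic ones (top-k sampling) using the Hugging Face `generate` API. The `gpt2` checkpoint and prompt are placeholders, not the models or tasks evaluated in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Summarize the patient's symptoms:", return_tensors="pt")

# Deterministic: beam search (the paper's best-scoring strategy).
beam = model.generate(**inputs, num_beams=5, do_sample=False, max_new_tokens=40)

# Stochastic: top-k sampling (among the paper's worst-scoring strategies).
topk = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=40)

print(tok.decode(beam[0], skip_special_tokens=True))
print(tok.decode(topk[0], skip_special_tokens=True))
```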
[NLP-28] Compressed Models are NOT Trust-equivalent to Their Large Counterparts
【Quick Read】: This paper asks whether a compressed deep learning model deployed in a resource-constrained environment can be trusted the same way its original large counterpart is. Existing work focuses on how compression affects accuracy and related performance metrics, but the authors argue that performance parity does not imply trust-equivalence. The key contribution is a two-dimensional evaluation framework: interpretability alignment, which measures whether the compressed and original models base their decisions on the same input features (tested with LIME and SHAP), and calibration similarity, which measures whether their predicted probabilities are comparably reliable (quantified via ECE, MCE, Brier Score, and reliability diagrams). Experiments show that even when accuracies are nearly identical, compressed models differ markedly from their large counterparts in interpretability and calibration, so they cannot simply be used as drop-in replacements without a trust assessment that goes beyond performance parity.
Link: https://arxiv.org/abs/2508.13533
Authors: Rohit Raj Rai, Chirag Kothari, Siddhesh Shelke, Amit Awekar
Institutions: IIT Guwahati; IIT Indore
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large Deep Learning models are often compressed before being deployed in a resource-constrained environment. Can we trust the prediction of compressed models just as we trust the prediction of the original large model? Existing work has keenly studied the effect of compression on accuracy and related performance measures. However, performance parity does not guarantee trust-equivalence. We propose a two-dimensional framework for trust-equivalence evaluation. First, interpretability alignment measures whether the models base their predictions on the same input features. We use LIME and SHAP tests to measure the interpretability alignment. Second, calibration similarity measures whether the models exhibit comparable reliability in their predicted probabilities. It is assessed via ECE, MCE, Brier Score, and reliability diagrams. We conducted experiments using BERT-base as the large model and its multiple compressed variants. We focused on two text classification tasks: natural language inference and paraphrase identification. Our results reveal low interpretability alignment and significant mismatch in calibration similarity. It happens even when the accuracies are nearly identical between models. These findings show that compressed models are not trust-equivalent to their large counterparts. Deploying compressed models as a drop-in replacement for large models requires careful assessment, going beyond performance parity.
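Of the calibration metrics named above, ECE is the simplest to state in code. The sketch below uses the common equal-width binning recipe, which is an assumption on our part; the paper does not specify its binning scheme.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted max-class probabilities; correct: 0/1 hits."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |accuracy - mean confidence| in the bin, weighted by bin mass
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```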
[NLP-29] MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models
【Quick Read】: This paper addresses the lack of a fine-grained, systematic benchmark for evaluating the abilities of large language models (LLMs) in Telugu, a low-resource language. The key contribution is MATA, an evaluation dataset of 729 carefully curated multiple-choice and open-ended questions spanning diverse linguistic dimensions, together with a fine-grained performance analysis of 11 open-weight and closed-source LLMs. The study shows that models rely on superficial heuristics (such as answer position and distractor patterns) for multiple-choice questions, and compares LLM-as-a-judge against human evaluation for open-ended questions to assess its reliability in a low-resource setting. The authors argue that such fine-grained evaluation is essential for identifying model limitations and for developing more linguistically capable LLMs.
Link: https://arxiv.org/abs/2508.13526
Authors: Chalamalasetti Kranti, Sowmya Vajjala
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Pre-print
Abstract:In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show how LLMs rely on superficial heuristics such as answer position and distractor patterns for multiple-choice questions. Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions and draw some conclusions on its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations and can inform the development of more linguistically capable LLMs, while also serving as a foundation for future research in Telugu NLP.
[NLP-30] Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
【Quick Read】: This paper tackles the underrepresentation of Saudi dialects (such as Najdi and Hijazi) in Arabic large language models (LLMs), which are dominated by Modern Standard Arabic (MSA) and therefore struggle to generate authentic dialectal text with accuracy and control. The key idea is to LoRA-tune ALLaM-7B-Instruct-preview, a foundation model developed in Saudi Arabia, on a privately curated Saudi Dialect Instruction dataset (5,466 synthetic instruction-response pairs), comparing two training variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction for stronger control, and (ii) No-Token training, which omits the tag and relies on context. The Dialect-Token variant raises the Saudi-dialect rate from 47.97% to 84.21% and cuts MSA leakage from 32.63% to 6.21%, while also improving fidelity (chrF++ +3.53, BERTScore +0.059) and outperforming several strong generic instruction models.
Link: https://arxiv.org/abs/2508.13525
Authors: Hassan Barmandah
Institutions: Umm Al-Qura University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures, 2 tables. Code: this https URL. Dataset and trained weights/adapters are not released. Primary category: cs.CL
Abstract:Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
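A minimal sketch of what LoRA-tuning with a prepended dialect tag could look like using the PEFT library. The hub id, target modules, hyperparameters, and the `[NAJDI]` tag format are all assumptions here, since the paper releases code but not its data or adapters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "ALLaM-AI/ALLaM-7B-Instruct-preview"   # assumed hub id
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # assumed targets
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the adapters are trainable

# Dialect-Token variant: prepend an explicit dialect tag to the instruction.
example = "[NAJDI] <instruction text here>"   # hypothetical tag format
batch = tok(example, return_tensors="pt")
```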
[NLP-31] ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs
【Quick Read】: This paper targets the risk of diagnostic errors caused by the reactive question-answering paradigm of medical large language models (LLMs), which answer directly without proactively gathering patient information in interactive consultations. The key innovation of the proposed ProMed framework is the Shapley Information Gain (SIG) reward, which quantifies the clinical value of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. On top of this reward sits a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories for supervision, and (2) SIG-Augmented Policy Optimization integrates SIG into RL via a novel SIG-guided reward distribution mechanism that assigns higher rewards to informative questions. The approach substantially improves decision accuracy and generalization under partial information.
Link: https://arxiv.org/abs/2508.13514
Authors: Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang, Zheng Li, Junfeng Zhao, Yasha Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
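The SIG reward hinges on Shapley-style estimates of how much one piece of patient information contributes in context. The Monte Carlo sketch below illustrates the standard permutation-sampling estimator for such a contribution; the `utility` scorer (e.g., diagnostic log-likelihood given a subset of facts) is a stand-in for whatever the paper actually uses.

```python
import random

def shapley_contribution(target, facts, utility, n_samples=200):
    """Estimate the Shapley value of `target` among `facts` by sampling
    random coalitions and averaging the marginal gain of adding it."""
    others = [f for f in facts if f != target]
    total = 0.0
    for _ in range(n_samples):
        random.shuffle(others)
        k = random.randint(0, len(others))   # random coalition size
        coalition = others[:k]
        total += utility(coalition + [target]) - utility(coalition)
    return total / n_samples
```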
[NLP-32] LLM-Enhanced Linear Autoencoders for Recommendation CIKM2025
【Quick Read】: This paper addresses a limitation of existing linear autoencoders (LAEs) in recommendation: when incorporating textual item information they rely on sparse word co-occurrence patterns and fail to capture rich semantics. The key contribution, L3AE, is the first integration of large language models (LLMs) into the LAE framework. It fuses the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy: the first phase builds a semantic item-to-item correlation matrix from LLM-derived item representations; the second learns an item-to-item weight matrix from collaborative signals while distilling the semantic correlations as regularization. Each phase is solved in closed form, guaranteeing global optimality and computational efficiency.
Link: https://arxiv.org/abs/2508.13500
Authors: Jaewan Moon, Seongmin Park, Jongwuk Lee
Institutions: Sungkyunkwan University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by CIKM 2025
Abstract:Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at this https URL.
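To see why each phase can admit a closed-form solution, consider a simplified EASE-style objective with an extra penalty pulling the item-item weights toward an LLM-derived semantic correlation matrix S. The exact L3AE formulation differs, so treat this only as a sketch of the underlying ridge-regression structure.

```python
import numpy as np

def linear_ae_closed_form(X, S, lam=100.0, gamma=10.0):
    """X: (users, items) interaction matrix; S: (items, items) semantic
    correlations from LLM item embeddings. Solves a ridge-style objective
    ||X - XB||^2 + lam||B||^2 + gamma||B - S||^2 in one linear solve."""
    n_items = X.shape[1]
    G = X.T @ X + (lam + gamma) * np.eye(n_items)
    B = np.linalg.solve(G, X.T @ X + gamma * S)   # closed form, no SGD
    np.fill_diagonal(B, 0.0)                      # forbid self-recommendation
    return B                                      # recommendation scores = X @ B
```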
[NLP-33] Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
【Quick Read】: This paper addresses the difficulty of highway scene understanding and traffic-risk inference for Intelligent Transportation Systems (ITS) and autonomous driving, where traditional approaches lack scalability and generalization under complex, dynamic real-world conditions. The key idea is a framework that combines structured prompting with knowledge distillation: two large vision-language models (VLMs), GPT-4o and o3-mini, use a structured Chain-of-Thought (CoT) strategy to produce multi-perspective, high-quality pseudo-annotations and contextual risk assessments, which then supervise the fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, VISTA, understands low-resolution traffic videos and generates semantically faithful, risk-aware captions, approaching its teachers on standard captioning metrics despite far fewer parameters; this shows that structured multi-agent supervision plus effective knowledge distillation can endow lightweight VLMs with complex reasoning capabilities, while the compact architecture supports efficient edge deployment.
Link: https://arxiv.org/abs/2508.13439
Authors: Yunxiang Yang, Ningning Xu, Jidong J. Yang
Institutions: University of Georgia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Comments: 16 pages, 10 figures, 1 table
Abstract:Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.
[NLP-34] ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
【Quick Read】: This paper addresses the cultural misalignment of large language models (LLMs) in cross-cultural communication, which stems from the skewed distribution of languages and viewpoints in pre-training corpora; the core challenges are limited cultural knowledge and underexplored learning approaches. The key idea is a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers' free word-association norms, which encode implicit cultural schemas. Using English-US and Mandarin associations from the Small World of Words project, the authors adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization, substantially improving lexical cultural relevance and transferring cultural values: model answers shift toward the target culture's distribution, and a few million culture-grounded associations let 7-8B models rival or beat 70B baselines without costly retraining.
Link: https://arxiv.org/abs/2508.13426
Authors: Chunhua Liu, Kabir Manandhar Shrestha, Sukai Huang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet, it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers’ free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen’s Chinese-aligned responses double while Llama’s US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining. Our work highlights both the promise and the need for future research grounded in human cognition in improving cultural alignment in AI models.
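The held-out association "Precision at 5" metric reported above can be stated in a few lines; the toy cue and norms below are invented purely for illustration.

```python
def precision_at_5(model_assocs, human_norms):
    """model_assocs: the model's top-5 generated associations for a cue;
    human_norms: set of associations native speakers gave for that cue."""
    top5 = model_assocs[:5]
    return sum(a in human_norms for a in top5) / 5.0

# e.g., cue word "cat": three of five generations match the norms -> 0.6
print(precision_at_5(["dog", "milk", "tree", "purr", "fur"],
                     {"dog", "purr", "fur", "whiskers"}))
```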
[NLP-35] TASER: Table Agents for Schema-guided Extraction and Recommendation
【Quick Read】: This paper tackles table extraction from real-world financial documents, whose multi-page, highly heterogeneous, and fragmented tables (99.4% of tables in the dataset have no bounding boxes, with up to 426 rows per table) make holdings spanning millions of financial instrument types hard to extract with traditional methods. The key contribution is TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table-extraction system: its agents handle table detection, classification, extraction, and recommendations guided by an initial schema, while a Recommender Agent reviews the outputs, proposes schema revisions, and decides on final recommendations, letting TASER outperform existing models such as Table Transformer by 10.1%. Experiments further show that larger batch sizes yield 104.3% more actionable schema recommendations and a 9.8% increase in extracted holdings, underscoring the value of the continuous learning loop.
Link: https://arxiv.org/abs/2508.13404
Authors: Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Real-world financial documents report essential information about an entity’s financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings, culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
[NLP-36] Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
【Quick Read】: This paper targets common failure modes of large language models (LLMs) on complex reasoning tasks, including format collapse, verbosity, and the lack of modeling of realistic analytical workflows that involve multi-step derivation, code execution, and self-correction. The key idea is a training paradigm centered on analytical trajectories: a synthetic data generator produces tagged notebook episodes capturing reasoning steps, code execution, error traces, self-corrections, and conclusions; a dual-reward framework blends a lightweight structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence; and a memory-optimized implementation of Group Relative Policy Optimization (GRPO) enables efficient RL alignment, with a cosine curriculum smoothly shifting emphasis from structural fidelity to semantic depth. The result is markedly better efficiency and real-world problem solving across finance, medicine, and numerical analysis.
Link: https://arxiv.org/abs/2508.13382
Authors: Ayoub Ben Chaliah, Hela Dellagi
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by <think> and <answer> tags. On demanding postgraduate-level problems, Datarus exhibits an “AHA-moment” pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks Datarus surpasses similar size models and even reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.
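GRPO's defining trick, referenced above, is normalizing each rollout's reward against its own sampling group instead of training a separate value network. A minimal sketch with toy rewards:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Advantage of each rollout relative to its own group of samples
    for the same prompt: (r - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g., 4 notebook rollouts for one prompt, scored by the dual reward
print(group_relative_advantages([0.2, 0.9, 0.4, 0.9]))
```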
[NLP-37] Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
【Quick Read】: This paper addresses the difficulty automatic speech recognition (ASR) systems have in maintaining syntactic and semantic accuracy on long audio transcripts, which hurts downstream tasks such as named entity recognition (NER), capitalization, and punctuation. The key idea is to inject contextual linguistic knowledge from LLaMA models into the Whisper ASR model via knowledge distillation, using two strategies: token-level distillation with optimal transport to align the models' dimensions and sequence lengths, and minimization of a representation loss between Whisper and LLaMA sentence embeddings to blend syntax and semantics. Experiments on the Spoken Wikipedia dataset show significant improvements in word error rate (WER), NER, capitalization, and punctuation, pointing toward robust, semantics-aware ASR for long-form speech.
Link: https://arxiv.org/abs/2508.13376
Authors: Duygu Altinok
Institutions: Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ASRU 2025. This is the preprint, all rights reserved for ASRU2025
Abstract:ASR systems often struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audios and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, setting a foundation for robust, context-aware ASR in long-form speech.
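The second distillation strategy, a representation loss between sentence embeddings, might look roughly like the sketch below. Mean pooling and the 1280-to-4096 projection (Whisper-large width to LLaMA-7B width) are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(1280, 4096)   # assumed Whisper -> LLaMA width map

def representation_loss(whisper_hidden, llama_hidden):
    """Mean-pool each model's token states into sentence embeddings,
    then penalize their cosine distance so Whisper absorbs the LLM's
    sentence-level semantics."""
    s_asr = proj(whisper_hidden).mean(dim=1)   # (batch, 4096)
    s_llm = llama_hidden.mean(dim=1)           # (batch, 4096)
    return 1.0 - F.cosine_similarity(s_asr, s_llm, dim=-1).mean()
```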
[NLP-38] Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection
【Quick Read】: This paper studies how reasoning capability and model scale affect large language models' (LLMs) performance on idiomaticity detection. The key approach is to evaluate chain-of-thought (CoT) reasoning with the open-source DeepSeek-R1 distillation models, spanning 1.5B to 70B parameters, across four idiomaticity-detection datasets. The effect of reasoning turns out to be smaller and more varied than expected: CoT helps smaller models relative to math-tuned intermediates but not back to base-model levels, while larger models (14B, 32B, 70B) gain only modestly. Deeper analysis shows that larger models understand idiomaticity well and produce accurate definitions of expressions, whereas smaller models often fail to output the actual meaning; accordingly, supplying definitions in smaller models' prompts can improve performance in some cases.
Link: https://arxiv.org/abs/2508.13365
Authors: Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Thomas Pickard, Maggie Mi, Aline Villavicencio
Institutions: University of Sheffield; University of Exeter
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated and serves as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance from Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.
[NLP-39] Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT
【Quick Read】: This paper addresses low-latency, high-accuracy streaming speech translation on device, in particular the difficulty of balancing real-time performance and translation quality when integrating automatic speech recognition (ASR) with machine translation (MT). The key solution is a simultaneous-translation approach that exploits linguistic cues generated by the ASR system to manage context, combined with efficient beam-search pruning techniques such as time-outs and forced finalization to maintain the system's real-time factor. Applied to on-device bilingual conversational speech translation, the approach outperforms baselines on both latency and quality and narrows the quality gap with non-streaming translation systems.
Link: https://arxiv.org/abs/2508.13358
Authors: Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu, Christian Fuegen
Institutions: Meta AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real-time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation quality and latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and utilizing efficient beam-search pruning techniques such as time-out and forced finalization to maintain the system’s real-time factor. We apply our approach to an on-device bilingual conversational speech translation and demonstrate that our techniques outperform baselines in terms of latency and quality. Notably, our technique narrows the quality gap with non-streaming translation systems, paving the way for more accurate and efficient real-time speech translation.
[NLP-40] X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms
【Quick Read】: This paper targets the scalability bottlenecks of current Mixture-of-Experts (MoE) architectures in large-scale training: high activation-memory overhead, costly all-to-all communication, and training systems optimized mainly for NVIDIA hardware that leave other platforms' compute untapped. The key contribution is X-MoE, a training system built on three techniques for efficient cross-platform MoE training: (1) padding-free MoE training with cross-platform kernels, reducing memory footprint and raising hardware utilization; (2) redundancy-bypassing dispatch, cutting unnecessary data movement; and (3) hybrid parallelism with sequence-sharded MoE blocks, improving throughput and scalability for very large models. On the Frontier supercomputer with AMD MI250X GPUs, X-MoE scales DeepSeek-style MoE models to 545 billion parameters across 1,024 GPUs, 10x larger than the largest model trainable with existing methods under the same hardware budget, while sustaining high training throughput.
Link: https://arxiv.org/abs/2508.13337
Authors: Yueming Yuan, Ahan Gupta, Jianping Li, Sajal Dash, Feiyi Wang, Minjia Zhang
Institutions: University of Illinois Urbana-Champaign; Oak Ridge National Laboratory
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 17 pages, 20 figures. To be published in SC 2025
Abstract:Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at this https URL.
[NLP-41] Explicit vs. Implicit Memory: Exploring Multi-hop Complex Reasoning Over Personalized Information
【Quick Read】: This paper addresses the limited multi-hop personalized reasoning ability of current LLM-based agents, whose memory mechanisms mostly target preference alignment and simple question answering rather than the complex real-world tasks that require multi-hop reasoning over large amounts of user information. The key contribution is a newly defined multi-hop personalized reasoning task, with an accompanying dataset and unified evaluation framework; on this basis the authors systematically compare explicit and implicit memory methods, analyze their strengths and weaknesses, and propose a hybrid method, HybridMem, that combines the advantages of both paradigms to improve multi-hop reasoning, significantly boosting agent performance on complex personalized tasks.
Link: https://arxiv.org/abs/2508.13250
Authors: Zeyu Zhang, Yang Zhang, Haoran Tan, Rui Li, Xu Chen
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China; National University of Singapore
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 15 pages, 13 figures, 3 tables
Abstract:In large language model-based agents, memory serves as a critical capability for achieving personalization by storing and utilizing users’ information. Although some previous studies have adopted memory to implement user personalization, they typically focus on preference alignment and simple question-answering. However, in the real world, complex tasks often require multi-hop reasoning on a large amount of user information, which poses significant challenges for current memory approaches. To address this limitation, we propose the multi-hop personalized reasoning task to explore how different memory mechanisms perform in multi-hop reasoning over personalized information. We explicitly define this task and construct a dataset along with a unified evaluation framework. Then, we implement various explicit and implicit memory methods and conduct comprehensive experiments. We evaluate their performance on this task from multiple perspectives and analyze their strengths and weaknesses. Besides, we explore hybrid approaches that combine both paradigms and propose the HybridMem method to address their limitations. We demonstrate the effectiveness of our proposed model through extensive experiments. To benefit the research community, we release this project at this https URL.
[NLP-42] Combating Homelessness Stigma with LLMs: A New Multi-Modal Dataset for Bias Detection
【Quick Read】: This paper addresses the persistent social bias against people experiencing homelessness (PEH), which hinders policymaking and shifts public perception. The key solution builds on natural language processing (NLP) and large language models (LLMs): the authors compile a cross-platform, multi-modal, manually annotated dataset and apply zero-shot and few-shot classification to automatically identify and quantify PEH bias expressed online. They find that although local LLMs are inconsistent in zero-shot classification, their in-context-learning classification scores approach those of closed-source models, and LLMs outperform BERT overall, yielding new bias indicators to inform policy and advancing the fairness and ethical application of generative AI.
Link: https://arxiv.org/abs/2508.13187
Authors: Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Homelessness is a persistent social challenge, impacting millions worldwide. Over 770,000 people experienced homelessness in the U.S. in 2024. Social stigmatization is a significant barrier to alleviation, shifting public perception, and influencing policymaking. Given that online and city council discourse reflect and influence part of public opinion, it provides valuable insights to identify and track social biases. This research contributes to alleviating homelessness by acting on public opinion. It introduces novel methods, building on natural language processing (NLP) and large language models (LLMs), to identify and measure PEH social bias expressed in digital spaces. We present a new, manually-annotated multi-modal dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across 10 U.S. cities. This unique dataset provides evidence of the typologies of homelessness bias described in the literature. In order to scale up and automate the detection of homelessness bias online, we evaluate LLMs as classifiers. We applied both zero-shot and few-shot classification techniques to this data. We utilized local LLMs (Llama 3.2 3B Instruct, Qwen 2.5 7B Instruct, and Phi4 Instruct Mini) as well as closed-source API models (GPT-4.1, Gemini 2.5 Pro, and Grok-4). Our findings reveal that although there are significant inconsistencies in local LLM zero-shot classification, the in-context learning classification scores of local LLMs approach the classification scores of closed-source LLMs. Furthermore, LLMs outperform BERT when averaging across all categories. This work aims to raise awareness about the pervasive bias against PEH, develop new indicators to inform policy, and ultimately enhance the fairness and ethical application of Generative AI technologies.
[NLP-43] MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
【Quick Read】: This paper addresses the inadequate handling of multimodal information (such as images and videos) by current AI agents in web-browsing tasks: existing benchmarks like BrowseComp mainly evaluate retrieval and reasoning over text. To fill this gap, the authors introduce MM-BrowseComp, a benchmark of 224 challenging, hand-crafted multimodal questions that require agents to understand and integrate key information embedded in webpage images or videos during search and reasoning. The key elements are the challenging multimodal task set and a verified per-question checklist that enables fine-grained analysis of multimodal dependencies and reasoning paths, allowing a more complete assessment of multimodal understanding and reasoning. Even top models such as OpenAI o3 with tools reach only 29.02% accuracy on the benchmark, exposing the limits of current models' native multimodal reasoning.
Link: https://arxiv.org/abs/2508.13186
Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
Institutions: ByteDance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contribute equally, 26 pages, repo at this https URL
Abstract:AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents’ multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
[NLP-44] The Interpretability Analysis of the Model Can Bring Improvements to the Text-to-SQL Task
【Quick Read】: This paper aims to improve the foundational capability and generalization of text-to-SQL models in real-world applications, with a focus on the semantic-parsing accuracy of WHERE clauses. The key idea is to combine model interpretability analysis with an execution-guided strategy for parsing WHERE clauses, and to augment this with filtering adjustments, logical-correlation refinement, and model fusion, yielding the CESQL model for conditional enhancement. CESQL markedly improves prediction accuracy on the WikiSQL dataset, which is representative of single-table query tasks, while reducing dependence on condition-column data and manually labeled training samples.
Link: https://arxiv.org/abs/2508.13178
Authors: Cong Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments:
Abstract:To elevate the foundational capabilities and generalization prowess of the text-to-SQL model in real-world applications, we integrate model interpretability analysis with execution-guided strategy for semantic parsing of WHERE clauses in SQL queries. Furthermore, we augment this approach with filtering adjustments, logical correlation refinements, and model fusion, culminating in the design of the CESQL model that facilitates conditional enhancement. Our model excels on the WikiSQL dataset, which is emblematic of single-table database query tasks, markedly boosting the accuracy of prediction outcomes. When predicting conditional values in WHERE clauses, we have not only minimized our dependence on data within the condition columns of tables but also circumvented the impact of manually labeled training data. Our hope is that this endeavor to enhance accuracy in processing basic database queries will offer fresh perspectives for research into handling complex queries and scenarios featuring irregular data in real-world database environments.
[NLP-45] White-Box Reasoning: Synergizing LLM Strategy and gm/Id Data for Automated Analog Circuit Design
【Quick Read】: This paper addresses the bottleneck in analog IC design caused by reliance on experience and inefficient simulation, especially as traditional design formulas fail at advanced process nodes. The key solution is a "synergistic reasoning" framework that couples the strategic reasoning of a large language model (LLM) with the physical precision of the gm/Id methodology: by feeding gm/Id lookup-table data to the LLM, it turns from an empirical "guesser" into a quantitative, physics-grounded design partner, enabling efficient and precise design optimization. Validated on a two-stage op-amp, the framework lets the Gemini model meet all TT-corner specifications in 5 iterations and extends optimization to all PVT corners, clearly outperforming a pure LLM without gm/Id data, which is slower and deviates, and achieving quasi-expert quality with an order-of-magnitude efficiency gain over a senior engineer's design.
Link: https://arxiv.org/abs/2508.13172
Authors: Jianqiu Chen, Siqi Li, Xu He
Institutions: Shanghai Hynitron Technology Co., Ltd; College of Information Science and Engineering, Hunan University
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 4 figures, 7 Tables
Abstract:Analog IC design is a bottleneck due to its reliance on experience and inefficient simulations, as traditional formulas fail in advanced nodes. Applying Large Language Models (LLMs) directly to this problem risks mere “guessing” without engineering principles. We present a “synergistic reasoning” framework that integrates an LLM’s strategic reasoning with the physical precision of the gm/Id methodology. By empowering the LLM with gm/Id lookup tables, it becomes a quantitative, data-driven design partner. We validated this on a two-stage op-amp, where our framework enabled the Gemini model to meet all TT corner specs in 5 iterations and extended optimization to all PVT corners. A crucial ablation study proved gm/Id data is key for this efficiency and precision; without it, the LLM is slower and deviates. Compared to a senior engineer’s design, our framework achieves quasi-expert quality with an order-of-magnitude improvement in efficiency. This work validates a path for true analog design automation by combining LLM reasoning with scientific circuit design methodologies.
[NLP-46] Cognitive Workspace: Active Memory Management for LLMs – An Empirical Study of Functional Infinite Context
【Quick Read】: This paper addresses fundamental limitations in how large language models (LLMs) manage context: even with context windows extended to millions of tokens, they fail to emulate the dynamic, task-driven memory management of humans. Traditional retrieval-augmented generation (RAG) relies on passive retrieval and lacks the metacognitive awareness and active planning needed for true cognitive extension. The key to the proposed Cognitive Workspace paradigm is three core innovations: (1) active memory management with deliberate information curation, (2) hierarchical cognitive buffers that maintain persistent working states, and (3) task-driven context optimization that dynamically adapts resource allocation to cognitive demands. Empirically, it achieves an average 58.6% memory-reuse rate (versus 0% for traditional RAG) and a 17-18% net efficiency gain, with statistically significant results (p < 0.001, Cohen's d > 23), providing the first quantitative evidence that active memory beats passive retrieval in LLMs.
Link: https://arxiv.org/abs/2508.13171
Authors: Tao An
Institutions: Hawaii Pacific University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 1 figure, code available at this https URL
Abstract:Large Language Models (LLMs) face fundamental limitations in context management despite recent advances extending context windows to millions of tokens. We propose Cognitive Workspace, a novel paradigm that transcends traditional Retrieval-Augmented Generation (RAG) by emulating human cognitive mechanisms of external memory use. Drawing from cognitive science foundations including Baddeley’s working memory model, Clark’s extended mind thesis, and Hutchins’ distributed cognition framework, we demonstrate that current passive retrieval systems fail to capture the dynamic, task-driven nature of human memory management. Our analysis of 2024-2025 developments reveals that while techniques like Infini-attention and StreamingLLM achieve impressive context lengths, they lack the metacognitive awareness and active planning capabilities essential for true cognitive extension. Cognitive Workspace addresses these limitations through three core innovations: (1) active memory management with deliberate information curation, (2) hierarchical cognitive buffers enabling persistent working states, and (3) task-driven context optimization that dynamically adapts to cognitive demands. Empirical validation demonstrates Cognitive Workspace achieves an average 58.6% memory reuse rate (ranging from 54-60% across different tasks) compared to 0% for traditional RAG, with 17-18% net efficiency gain despite 3.3x higher operation counts. Statistical analysis confirms these advantages with p < 0.001 and Cohen’s d > 23 across multiple task types, establishing the first quantitative evidence for active memory superiority in LLM systems. We present a comprehensive theoretical framework synthesizing insights from 50+ recent papers, positioning Cognitive Workspace as a fundamental shift from information retrieval to genuine cognitive augmentation.
[NLP-47] Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
【Quick Read】: This paper addresses biased outputs of large language models caused by structural gender imbalance in their training data. The key contribution is an extended actor-level detection and mitigation pipeline that introduces new actor-level metrics capturing asymmetries in sentiment, syntactic agency, and quotation style; the pipeline supports both diagnostic corpus analysis and exclusion-based balancing, enabling the construction of fairer corpora. Applied to the taz2024full corpus of German newspaper articles from 1980 to 2024, it substantially improves gender balance across multiple linguistic dimensions, though subtler forms of bias, particularly in sentiment and framing, persist.
Link: https://arxiv.org/abs/2508.13169
Authors: Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen
Institutions: Hochschule München University of Applied Sciences; LMU Munich
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Large language models are increasingly shaping digital communication, yet their outputs often reflect structural gender imbalances that originate from their training data. This paper presents an extended actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. Building on prior work in discourse-aware fairness analysis, we introduce new actor-level metrics that capture asymmetries in sentiment, syntactic agency, and quotation styles. The pipeline supports both diagnostic corpus analysis and exclusion-based balancing, enabling the construction of fairer corpora. We apply our approach to the taz2024full corpus of German newspaper articles from 1980 to 2024, demonstrating substantial improvements in gender balance across multiple linguistic dimensions. Our results show that while surface-level asymmetries can be mitigated through filtering and rebalancing, subtler forms of bias persist, particularly in sentiment and framing. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
[NLP-48] Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
【Quick Read】: This paper targets the computational inefficiency, limited capability, and inability to benefit from data-centric learning of existing multi-agent systems, which rely on manual prompt and workflow engineering. The key solution is Chain-of-Agents (CoA), a new paradigm that lets a single large language model natively simulate multi-agent collaboration end to end, dynamically activating different tool agents and role-playing agents for multi-turn problem solving. To elicit this capability, the authors design a multi-agent distillation framework that distills state-of-the-art multi-agent systems into chain-of-agents trajectories for supervised fine-tuning, followed by agentic reinforcement learning on verifiable tasks, producing Agent Foundation Models (AFMs) that set new state-of-the-art results across web-agent and code-agent benchmarks.
Link: https://arxiv.org/abs/2508.13167
Authors: Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 51 pages
Abstract:Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and can not benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models’ capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We make the entire research, including the model weights, code for training and evaluation, and the training data, fully open-sourced, which offers a solid starting point for future research on agent models and agentic RL.
[NLP-49] Uncovering Emergent Physics Representations Learned In-Context by Large Language Models
【Quick Read】: This paper investigates how large language models (LLMs) generalize across tasks through in-context learning (ICL), and in particular the still-unclear mechanism behind their reasoning on dynamics-forecasting tasks for physical systems. The key approach uses dynamics forecasting in physical systems as a proxy task and analyzes the model's residual-stream activations with sparse autoencoders (SAEs). The experiments show that forecasting improves with longer input contexts and that the features captured by the SAEs correlate with key physical variables such as energy, revealing that LLMs can encode meaningful physical concepts in context from structured, experimentally controllable real-world dynamics.
Link: https://arxiv.org/abs/2508.12448
Authors: Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong
Institutions: KAIST; MPI for Security and Privacy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 10 figures
Abstract:Large language models (LLMs) exhibit impressive in-context learning (ICL) abilities, enabling them to solve wide range of tasks via textual prompts alone. As these capabilities advance, the range of applicable domains continues to expand significantly. However, identifying the precise mechanisms or internal structures within LLMs that allow successful ICL across diverse, distinct classes of tasks remains elusive. Physics-based tasks offer a promising testbed for probing this challenge. Unlike synthetic sequences such as basic arithmetic or symbolic equations, physical systems provide experimentally controllable, real-world data based on structured dynamics grounded in fundamental principles. This makes them particularly suitable for studying the emergent reasoning behaviors of LLMs in a realistic yet tractable setting. Here, we mechanistically investigate the ICL ability of LLMs, especially focusing on their ability to reason about physics. Using a dynamics forecasting task in physical systems as a proxy, we evaluate whether LLMs can learn physics in context. We first show that the performance of dynamics forecasting in context improves with longer input contexts. To uncover how such capability emerges in LLMs, we analyze the model’s residual stream activations using sparse autoencoders (SAEs). Our experiments reveal that the features captured by SAEs correlate with key physical variables, such as energy. These findings demonstrate that meaningful physical concepts are encoded within LLMs during in-context learning. In sum, our work provides a novel case study that broadens our understanding of how LLMs learn in context.
[NLP-50] TaoSR1: The Thinking Model for E-commerce Relevance Search
【Quick Read】: This paper addresses the challenge of query-product relevance prediction in e-commerce search: preserving semantic matching while adding complex reasoning. BERT-based models match semantics well but cannot reason; LLMs have strong reasoning potential but face chain-of-thought (CoT) error accumulation, discriminative hallucination, and deployment-feasibility issues. The key solution is TaoSR1, a three-stage framework: first, supervised fine-tuning (SFT) with CoT to instill reasoning; second, offline sampling with a pass@N strategy combined with Direct Preference Optimization (DPO) to improve generation quality; and third, difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination, complemented by post-CoT processing and cumulative-probability-based partitioning for efficient online deployment. Together these establish a directly deployable, reasoning-capable paradigm for relevance classification.
Link: https://arxiv.org/abs/2508.12365
Authors: Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang
Institutions: Taobao & Tmall Group of Alibaba; Fudan University; Tsinghua University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
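Stage (2) above builds preference data by sampling N chain-of-thought answers per query and pairing a correct one against an incorrect one. A minimal sketch, with the sampler and correctness check left as placeholders:

```python
def build_dpo_pair(query, sample_fn, is_correct, n=8):
    """Sample n CoT answers for one query; return a (chosen, rejected)
    preference pair if the samples disagree, else None."""
    outs = [sample_fn(query) for _ in range(n)]
    good = [o for o in outs if is_correct(o)]
    bad = [o for o in outs if not is_correct(o)]
    if good and bad:
        return {"prompt": query, "chosen": good[0], "rejected": bad[0]}
    return None   # skip queries where all n samples agree
```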
[NLP-51] Towards No-Code Programming of Cobots: Experiments with Code Synthesis by Large Code Models for Conversational Programming
【Quick Read】: This paper addresses the limited programming flexibility of collaborative robots (cobots) in industrial assembly scenarios, where traditional approaches rely on expert programming or manual guidance, limiting the modifiability and expressivity of the resulting programs. The key idea is to exploit the in-context learning abilities of large language models (LLMs) so that a cobot can be instructed through natural language to generate assembly code. The authors define RATS, the "Repetitive Assembly Task", a 2D building task with an accompanying dataset, and systematically evaluate LLMs' ability to synthesize instruction sequences ("first-order code") given in-context examples, finding that basic code generation works well but producing abstract structures such as functions and loops ("higher-order code") remains challenging.
Link: https://arxiv.org/abs/2409.11041
Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Institutions: University of Potsdam; German Research Center for Artificial Intelligence (DFKI)
Subjects: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Accepted to ITL4HRI workshop at RO-MAN 2025 conference
Abstract:While there has been a lot of research recently on robots in household environments, at the present time, most robots in existence can be found on shop floors, and most interactions between humans and robots happen there. "Collaborative robots" (cobots) designed to work alongside humans on assembly lines traditionally require expert programming, limiting ability to make changes, or manual guidance, limiting expressivity of the resulting programs. To address these limitations, we explore using Large Language Models (LLMs), and in particular, their abilities of doing in-context learning, for conversational code generation. As a first step, we define RATS, the "Repetitive Assembly Task", a 2D building task designed to lay the foundation for simulating industry assembly scenarios. In this task, a "programmer" instructs a cobot, using natural language, on how a certain assembly is to be built; that is, the programmer induces a program, through natural language. We create a dataset that pairs target structures with various example instructions (human-authored, template-based, and model-generated) and example code. With this, we systematically evaluate the capabilities of state-of-the-art LLMs for synthesising this kind of code, given in-context examples. Evaluating in a simulated environment, we find that LLMs are capable of generating accurate "first order code" (instruction sequences), but have problems producing "higher-order code" (abstractions such as functions, or use of loops).
Computer Vision
[CV-0] LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos ICCV2025
【Quick Read】: This paper addresses three core challenges of novel view synthesis (NVS) from casually captured long videos: irregular camera motion, unknown camera poses, and expansive scenes, which lead to pose drift, inaccurate geometry initialization, and severe memory limitations. The key solution is LongSplat, a robust unposed 3D Gaussian Splatting framework with three innovations: (1) incremental joint optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust pose-estimation module leveraging learned 3D priors; and (3) an octree anchor-formation mechanism based on spatial density that efficiently converts dense point clouds into sparse anchors, substantially reducing computational cost. Experiments on challenging benchmarks show state-of-the-art rendering quality, pose accuracy, and computational efficiency compared to prior approaches.
Link: https://arxiv.org/abs/2508.14041
Authors: Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
Institutions: National Yang Ming Chiao Tung University; NVIDIA Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL
Abstract:LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: this https URL
[CV-1] Beyond Simple Edits: Composed Video Retrieval with Dense Modifications ICCV-2025
【速读】:该论文旨在解决组合视频检索(composed video retrieval)任务中因细粒度语义描述和时间理解差异导致的检索性能瓶颈问题。现有标准检索框架难以有效处理由文本描述驱动的复杂动作组合及视频时序细节变化,从而限制了在细粒度场景下的检索准确性。其解决方案的关键在于:构建了一个大规模、细粒度标注的新数据集 Dense-WebVid-CoVR(含 160 万样本,密集修改文本量为现有数据集的约 7 倍),并提出一种基于交叉注意力(Cross-Attention, CA)融合机制的新型模型,通过引入**锚定文本编码器(grounded text encoder)**实现视觉与文本信息的精准对齐,显著提升了对密集修改文本的建模能力。实验表明,该方法在视觉+文本设置下 Recall@1 达到 71.3%,较当前最优方法提升 3.4%,验证了其在利用详细视频描述进行精确检索方面的有效性。
链接: https://arxiv.org/abs/2508.14039
作者: Omkar Thawakar,Dmitry Demidov,Ritesh Thawkar,Rao Muhammad Anwer,Mubarak Shah,Fahad Shahbaz Khan,Salman Khan
机构: Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学); University of Central Florida(中佛罗里达大学); Linköping University(林雪平大学); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV-2025
Abstract:Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: this https URL
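摘要中"文本修改 token 通过交叉注意力查询视频特征"的融合方式,可用如下通用草图示意(假设使用 PyTorch;维度与头数为假设值,并非论文的确切结构):

```python
import torch
import torch.nn as nn

# Generic cross-attention fusion of the kind described: text-modification
# tokens attend to video tokens. Dimensions are illustrative assumptions.
class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, video_tokens):
        fused, _ = self.attn(query=text_tokens, key=video_tokens, value=video_tokens)
        return self.norm(text_tokens + fused)   # residual + norm

fusion = CrossAttentionFusion()
text = torch.randn(2, 16, 512)    # batch x text tokens x dim
video = torch.randn(2, 64, 512)   # batch x video tokens x dim
out = fusion(text, video)         # fused query representation: 2 x 16 x 512
```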
zh
[CV-2] Distilled-3DGS:Distilled 3D Gaussian Splatting
【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在新视角合成(NVS)任务中因高保真渲染所需大量3D高斯分布而导致的内存消耗和存储需求过大的问题。解决方案的关键在于提出首个针对3DGS的知识蒸馏框架(Distilled-3DGS),通过多种教师模型(包括基础3DGS、噪声增强变体和丢弃正则化版本)的输出聚合来指导轻量级学生模型的优化,并引入结构相似性损失(structural similarity loss)以增强学生模型与教师模型之间空间几何分布的一致性,从而在不依赖复杂组件的情况下实现更优的渲染质量与存储效率平衡。
链接: https://arxiv.org/abs/2508.14037
作者: Lintao Xiang,Xinkai Chen,Jianhuang Lai,Guangcong Wang
机构: The University of Manchester (曼彻斯特大学); Vision, Graphics, and X Group, Great Bay University (大湾大学视觉、图形与X研究组); Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL
Abstract:3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher model. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising rendering results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: this https URL . Code: this https URL .
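多教师蒸馏加结构相似性约束的核心可用下述草图说明(假设实现:教师渲染结果取均值作为监督,结构项用单窗口 SSIM 近似;权重与形状均为示意,并非 Distilled-3DGS 的原始细节):

```python
import torch

# Sketch: student render supervised by the averaged teacher renders, with an
# L1 term plus a simple single-window SSIM-style structural term (assumed form).
def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def distill_loss(student_render, teacher_renders, w_struct=0.2):
    target = torch.stack(teacher_renders).mean(dim=0)   # aggregate the teachers
    l1 = (student_render - target).abs().mean()
    return l1 + w_struct * (1 - ssim_global(student_render, target))

student = torch.rand(3, 64, 64, requires_grad=True)
teachers = [torch.rand(3, 64, 64) for _ in range(3)]    # vanilla / noise / dropout
distill_loss(student, teachers).backward()
```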
zh
[CV-3] GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
【速读】:该论文旨在解决当前3D生成方法在从稀疏或多视角图像中快速生成形状时,因计算约束导致输出几何细节不足的问题。解决方案的关键在于提出了一种名为DetailGen3D的生成式方法,其核心创新是通过数据依赖的潜在空间流(data-dependent flows in latent space)直接建模粗到细的变换过程,从而避免使用大规模3D生成模型带来的计算开销;同时引入一种token匹配策略以确保细化过程中空间对应关系的准确性,实现局部细节合成的同时保持全局结构完整性。
链接: https://arxiv.org/abs/2508.14036
作者: Ken Deng,Yunhan Yang,Jingxiang Sun,Xihui Liu,Yebin Liu,Ding Liang,Yan-Pei Cao
机构: VAST; The University of Hong Kong (香港大学); Tsinghua University (清华大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.
zh
[CV-4] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
【速读】:该论文旨在解决当前音频驱动的人类动画(audio-driven human animation)技术中,传统视频配音方法仅限于嘴部区域编辑所导致的面部表情与身体动作不协调问题,从而影响观众沉浸感。其核心解决方案是提出一种新的“稀疏帧视频配音”(sparse-frame video dubbing)范式,关键在于通过战略性保留参考关键帧来维持身份特征、标志性手势及摄像机轨迹,同时实现全身动作的同步编辑;此外,为应对图像到视频模型在长序列生成中的适应性条件不足问题,作者进一步设计了InfiniteTalk——一个面向无限长度序列配音的流式音频驱动生成器,利用时间上下文帧实现片段间无缝过渡,并采用精细化参考帧定位策略优化控制强度,显著提升了视觉真实感、情感一致性与全身运动同步性。
链接: https://arxiv.org/abs/2508.14033
作者: Shaoshu Yang,Zhe Kong,Feng Gao,Meng Cheng,Xiangyu Liu,Yong Zhang,Zhuoliang Kang,Wenhan Luo,Xunliang Cai,Ran He,Xiaoming Wei
机构: University of Chinese Academy of Sciences (中国科学院大学); Meituan (美团); CASIA (中国科学院自动化研究所); Sun Yat-sen University (中山大学深圳校区); HKUST (香港科技大学); State Key Laboratory of Multimodal Artificial Intelligence Systems (多模态人工智能系统国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
Abstract:Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
zh
[CV-5] Backdooring Self-Supervised Contrastive Learning by Noisy Alignment ICCV2025
【速读】:该论文旨在解决自监督对比学习(Self-supervised Contrastive Learning, CL)在无标签数据预训练过程中易受数据投毒后门攻击(Data Poisoning Backdoor Attacks on CL, DPCL)的问题。现有DPCL方法因依赖脆弱的隐式共现关系及对后门图像中判别特征抑制不足,导致攻击效果有限。其解决方案的关键在于提出“噪声对齐”(Noisy Alignment, NA)机制,通过显式抑制中毒图像中的噪声成分实现更强的攻击效力;该机制基于对对比学习随机裁剪机制的策略性操控,并将其建模为具有理论最优参数的图像布局优化问题,从而在保持干净数据准确率的同时显著提升攻击成功率,且对主流后门防御手段具备鲁棒性。
链接: https://arxiv.org/abs/2508.14015
作者: Tuo Chen,Jie Gui,Minjing Dong,Ju Jia,Lanting Fang,Jian Liu
机构: Southeast University (东南大学); Purple Mountain Laboratories (紫金山实验室); Ant Group (蚂蚁集团); City University of Hong Kong (香港城市大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning’s random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at this https URL.
zh
[CV-6] Online 3D Gaussian Splatting Modeling with Novel View Selection
【速读】:该论文旨在解决从仅包含RGB图像帧的在线3D高斯溅射(3D Gaussian Splatting, 3DGS)建模问题,特别是现有方法依赖关键帧进行三维场景重建时导致的不完整重建问题。其核心挑战在于:一方面,仅使用关键帧难以覆盖整个场景;另一方面,在线处理限制了可用帧数和训练迭代次数,难以构建具有泛化能力的模型。解决方案的关键在于提出一种基于自适应视图选择的策略,通过在线分析重建质量,动态选取最优非关键帧用于额外训练,从而结合关键帧与选定非关键帧从多视角优化不完整区域,显著提升模型完整性。同时,论文引入一个在线多视图立体匹配(online multi-view stereo)框架,确保3D信息在建模过程中的全局一致性。
链接: https://arxiv.org/abs/2508.14014
作者: Byeonggwon Lee,Junkyu Park,Khang Truong Giang,Soohwan Song
机构: Dongguk University (东国大学); 42dot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study addresses the challenge of generating online 3D Gaussian Splatting (3DGS) models from RGB-only frames. Previous studies have employed dense SLAM techniques to estimate 3D scenes from keyframes for 3DGS model construction. However, these methods are limited by their reliance solely on keyframes, which are insufficient to capture an entire scene, resulting in incomplete reconstructions. Moreover, building a generalizable model requires incorporating frames from diverse viewpoints to achieve broader scene coverage. However, online processing restricts the use of many frames or extensive training iterations. Therefore, we propose a novel method for high-quality 3DGS modeling that improves model completeness through adaptive view selection. By analyzing reconstruction quality online, our approach selects optimal non-keyframes for additional training. By integrating both keyframes and selected non-keyframes, the method refines incomplete regions from diverse viewpoints, significantly enhancing completeness. We also present a framework that incorporates an online multi-view stereo approach, ensuring consistency in 3D information throughout the 3DGS modeling process. Experimental results demonstrate that our method outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes.
zh
[CV-7] ResPlan: A Large-Scale Vector-Graph Dataset of 17000 Residential Floor Plans
【速读】:该论文旨在解决当前空间智能(Spatial Intelligence)研究中缺乏大规模、高保真且结构丰富的真实住宅平面图数据集的问题,以推动生成式 AI (Generative AI)、机器人导航、强化学习及虚拟现实等应用的发展。现有数据集如 RPLAN 和 MSD 存在视觉保真度不足、结构多样性有限以及布局理想化等局限性,难以支撑真实场景下的复杂空间推理任务。解决方案的关键在于提出 ResPlan 数据集——包含 17,000 张详细标注的住宅平面图,涵盖建筑元素(墙体、门、窗、阳台)与功能空间(厨房、卧室、卫生间),并提供几何与图结构两种表示形式,支持快速三维转换和图推理;同时开源了用于几何清理、对齐与标注优化的自动化处理流程,显著提升了数据质量与可用性,为下一代空间感知与决策系统提供了可扩展、高真实感的基准资源。
链接: https://arxiv.org/abs/2508.14006
作者: Mohamed Abouagour,Eleftherios Garyfallidis
机构: Indiana University (印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 3 figures, 4 tables
Abstract:We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.
zh
[CV-8] Self-Supervised Sparse Sensor Fusion for Long Range Perception
【速读】:该论文旨在解决自动驾驶车辆在城际高速公路场景下远距离感知能力不足的问题,尤其针对高速行驶(>100 km/h)时所需的250米感知距离与现有方法受限于短距离感知(50–100米)之间的差距。传统基于鸟瞰图(Bird’s Eye View, BEV)的感知方法在扩展感知范围时面临内存和计算成本随距离呈二次增长的瓶颈,难以支撑大型卡车等高惯性车辆的长距离规划需求。解决方案的关键在于引入一种稀疏表示基础上的高效3D多模态与时间特征编码机制,并设计了一种新颖的自监督预训练策略,从而实现从无标签摄像头-LiDAR数据中进行大规模学习,最终将感知距离扩展至250米,在目标检测mAP上提升26.6%,LiDAR预测的Chamfer Distance降低30.5%。
链接: https://arxiv.org/abs/2508.13995
作者: Edoardo Palladin,Samuel Brucker,Filippo Ghilotti,Praveen Narayanan,Mario Bijelic,Felix Heide
机构: Torc Robotics; Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also makes it possible to extend autonomy from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we built on top of a sparse representation and introduced an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP in object detection and a 30.5% decrease in Chamfer Distance in LiDAR forecasting compared to existing methods. Project Page: this https URL
zh
[CV-9] Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment
【速读】:该论文旨在解决物流运输中托盘(pallet)配置设计与安全性分析的难题,尤其针对传统物理测试成本高、环境影响大以及难以精确评估动态行为的问题。其解决方案的关键在于构建一个完全可控且高精度的物理仿真系统,该系统基于3D图形环境,可模拟不同包装布局、包装材料及动态工况下的托盘运动行为;同时引入深度神经网络对仿真生成的视频进行训练,作为托盘配置的碰撞测试预测器,从而显著提升安全分析的效率与准确性。
链接: https://arxiv.org/abs/2508.13989
作者: Samuel Seligardi,Pietro Musoni,Eleonora Iotti,Gianluca Contesso,Alessandro Dal Palù
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The design and analysis of pallet setups are essential for ensuring safety of packages transportation. With rising demands in the logistics sector, the development of automated systems utilizing advanced technologies has become increasingly crucial. Moreover, the widespread use of plastic wrapping has motivated researchers to investigate eco-friendly alternatives that still adhere to safety standards. We present a fully controllable and accurate physical simulation system capable of replicating the behavior of moving pallets. It features a 3D graphics-based virtual environment that supports a wide range of configurations, including variable package layouts, different wrapping materials, and diverse dynamic conditions. This innovative approach reduces the need for physical testing, cutting costs and environmental impact while improving measurement accuracy for analyzing pallet dynamics. Additionally, we train a deep neural network to evaluate the rendered videos generated by our simulator, as a crash-test predictor for pallet configurations, further enhancing the system’s utility in safety analysis.
zh
[CV-10] OmViD: Omni-supervised active learning for video action detection ICCV
【速读】:该论文旨在解决视频动作检测(video action detection)中密集时空标注(dense spatio-temporal annotations)获取成本高、效率低的问题。其核心挑战在于现实世界视频的复杂度不一,不同视频可能需要不同程度的标注信息。解决方案的关键在于:首先提出一种简单的主动学习策略,用于为每段视频自动评估并确定最优的标注类型(如视频级标签、点标注、涂鸦、边界框或像素级掩码);其次引入一种新颖的时空3D超像素(3D-superpixel)方法,从这些不同粒度的标注中生成伪标签(pseudo-labels),从而实现多粒度标注下的有效模型训练。该方法在UCF101-24和JHMDB-21数据集上验证,显著降低了标注成本且性能损失最小。
链接: https://arxiv.org/abs/2508.13983
作者: Aayush Rana,Akash Kumar,Vibhav Vineet,Yogesh S Rawat
机构: University of Central Florida (中佛罗里达大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCVW’25
Abstract:Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
zh
[CV-11] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
【速读】:该论文旨在解决当前深度估计(depth estimation)数据集在多样性、可扩展性及成本效益方面的局限性,尤其是在基础模型(foundation models)和多模态学习时代背景下,现有数据集如KITTI、nuScenes和DDAD已接近性能饱和,难以支撑更复杂的泛化能力研究。其解决方案的关键在于提出一个大规模、多样化且轻量级的帧级连续数据集,包含20,000个视频帧,覆盖动态室外驾驶环境;通过低成本采集流程实现广泛场景覆盖,并利用稀疏但统计充分的真值标签支持鲁棒训练,从而在驾驶场景多样性与低深度密度方面引入新的挑战,推动深度估计方法在复杂条件下的性能提升。
链接: https://arxiv.org/abs/2508.13977
作者: Xianda Guo,Ruijun Zhang,Yiqun Duan,Ruilin Wang,Keyuan Zhou,Wenzhao Zheng,Wenke Huang,Gangwei Xu,Mike Horton,Yuan Si,Hao Zhao,Long Chen
机构: Wuhan University (武汉大学); CASIA (中国科学院自动化研究所); University of Technology Sydney (悉尼科技大学); Zhejiang University (浙江大学); University of California, Berkeley (加州大学伯克利分校); ROVR Labs, Inc. (ROVR 实验室公司); AIR, Tsinghua University (清华大学人工智能研究院); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. To address these challenges, we introduce a large-scale, diverse, frame-wise continuous dataset for depth estimation in dynamic outdoor driving environments, comprising 20K video frames to evaluate existing methods. Our lightweight acquisition pipeline ensures broad scene coverage at low cost, while sparse yet statistically sufficient ground truth enables robust training. Compared to existing datasets, ours presents greater diversity in driving scenarios and lower depth density, creating new challenges for generalization. Benchmark experiments with standard monocular depth estimation models validate the dataset’s utility and highlight substantial performance gaps in challenging conditions, establishing a new platform for advancing depth estimation research.
zh
[CV-12] Augmenting cobots for sheet-metal SMEs with 3D object recognition and localisation
【速读】:该论文旨在解决高混合低批量(high-mix-low-volume)生产环境下钣金车间面临的挑战,即标准自动化方案难以满足小批量、多品种订单的需求,导致中小企业依赖重复性人工劳动,从而增加生产成本并限制技术技能劳动力的充分利用。解决方案的关键在于通过COOCK+ ROBUST项目,将协作机器人(cobots)改造为具备移动性和可重构性的生产辅助工具,核心是集成现有技术如3D物体识别与定位(3D object recognition and localisation),以提升工业场景中协作系统的灵活性和智能化水平。
链接: https://arxiv.org/abs/2508.13964
作者: Martijn Cramer,Yanming Wu,David De Schepper,Eric Demeester
机构: KU Leuven (鲁汶大学); Flanders Make @ KU Leuven (弗拉芒制造@鲁汶大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 25 figures
Abstract:Due to high-mix-low-volume production, sheet-metal workshops today are challenged by small series and varying orders. As standard automation solutions tend to fall short, SMEs resort to repetitive manual labour, impacting production costs and leading to tech-skilled workforces not being used to their full potential. The COOCK+ ROBUST project aims to transform cobots into mobile and reconfigurable production assistants by integrating existing technologies, including 3D object recognition and localisation. This article explores both the opportunities and challenges of enhancing cobotic systems with these technologies in an industrial setting, outlining the key steps involved in the process. Additionally, insights from a past project, carried out by the ACRO research unit in collaboration with an industrial partner, serve as a concrete implementation example throughout.
zh
[CV-13] ViT-FIQA: Assessing Face Image Quality using Vision Transformers ICCV
【速读】:该论文旨在解决人脸图像质量评估(Face Image Quality Assessment, FIQA)问题,即预测人脸图像在人脸识别(Face Recognition, FR)系统中的可用性。现有主流方法依赖卷积神经网络(Convolutional Neural Networks, CNNs),而忽视了视觉Transformer(Vision Transformer, ViT)架构的潜力。其解决方案的关键在于提出ViT-FIQA,通过在标准ViT主干网络中引入一个可学习的质量标记(quality token),该标记与图像块标记拼接后经由全局自注意力机制聚合跨所有图像块的上下文信息,从而实现对任意人脸图像的标量效用评分预测;同时,ViT-FIQA在骨干网络输出端分支为两个任务头:一是利用全连接层从图像块标记中学习判别性特征表示(采用边缘惩罚softmax损失),二是将质量标记送入回归头以预测图像效用,从而实现统一建模与高效评估。
链接: https://arxiv.org/abs/2508.13957
作者: Andrea Atzori,Fadi Boutros,Naser Damer
机构: Fraunhofer Institute for Computer Graphics Research IGD (弗劳恩霍夫计算机图形研究所); TU Darmstadt (达姆施塔特工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE/CVF International Conference on Computer Vision Workshops 2025 (ICCVW 2025)
Abstract:Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research: this https URL
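可学习质量 token 的机制可以用一个精简的 PyTorch 草图表示(编码器深度、维度与读出头均为假设值,非 ViT-FIQA 的原始配置;边缘惩罚 softmax 损失此处省略):

```python
import torch
import torch.nn as nn

# Sketch of the quality-token idea: a learnable token is appended to the patch
# sequence; after self-attention, patch tokens feed the recognition branch and
# the quality token feeds a regression head. Sizes are illustrative.
class QualityViT(nn.Module):
    def __init__(self, dim=384, depth=4, heads=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.embed_head = nn.Linear(dim, 512)   # face embedding (margin loss elsewhere)
        self.quality_head = nn.Linear(dim, 1)   # scalar utility score

    def forward(self, patch_tokens):             # B x N x dim
        q = self.quality_token.expand(patch_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([patch_tokens, q], dim=1))
        patches, quality = x[:, :-1], x[:, -1]
        return self.embed_head(patches.mean(1)), self.quality_head(quality).squeeze(-1)

emb, score = QualityViT()(torch.randn(2, 196, 384))
```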
zh
[CV-14] DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts ACM-MM2025
【速读】:该论文旨在解决复杂光照条件下图像退化问题,尤其是低光照(low-light)和逆光(backlit)场景对图像质量及下游视觉任务的负面影响。现有方法通常仅针对单一类型的光照退化进行优化,缺乏统一处理多种光照条件的能力。其解决方案的关键在于提出一种双光照增强框架 DIME-Net,核心创新是引入基于稀疏门控机制的混合专家(Mixture-of-Experts, MoE)光照估计模块,能够根据输入图像的光照特征自适应选择合适的 S 曲线专家网络;同时结合 Retinex 理论实现对低光与逆光图像的差异化增强,并设计了光照感知交叉注意力(Illumination-Aware Cross Attention)与序列状态全局注意力(Sequential-State Global Attention)机制以修复光照引起的伪影和色彩失真。此外,构建了混合光照数据集 MixBL,使模型在单一训练过程中即可实现跨场景鲁棒性。
链接: https://arxiv.org/abs/2508.13921
作者: Ziang Wang,Xiaoqin Wang,Dingyi Wang,Qiang Li,Shushan Qiao
机构: Institute of Microelectronics of the Chinese Academy of Sciences(中国科学院微电子研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACM Multimedia 2025 (ACM MM 2025)
Abstract:Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
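稀疏门控选择 S 曲线专家的思路可示意如下(假设:门控仅基于全局 RGB 均值做 top-1 路由,S 曲线取单参数形式;这些都是为说明而作的简化,并非 DIME-Net 的确切设计):

```python
import torch
import torch.nn as nn

# Sketch of a sparse-gated mixture of S-curve experts: a tiny gate scores the
# experts from global image statistics; the top-1 expert's curve parameter is
# applied pixel-wise. Top-1 routing here is non-differentiable (kept simple).
def s_curve(x, alpha):
    # Monotone S-shaped tone curve on [0, 1]; alpha controls the bend.
    return x**alpha / (x**alpha + (1 - x).clamp(min=1e-6)**alpha)

class SCurveMoE(nn.Module):
    def __init__(self, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(3, n_experts)                # gate from RGB means
        self.alphas = nn.Parameter(torch.ones(n_experts))  # one curve per expert

    def forward(self, img):                                # B x 3 x H x W in [0,1]
        stats = img.mean(dim=(2, 3))                       # B x 3 global statistics
        idx = self.gate(stats).argmax(dim=1)               # sparse top-1 routing
        alpha = self.alphas[idx].view(-1, 1, 1, 1)
        return s_curve(img, alpha)

enhanced = SCurveMoE()(torch.rand(2, 3, 64, 64))
```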
zh
[CV-15] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
【速读】:该论文旨在解决现有物理驱动的3D运动合成方法中存在的两大关键问题:一是依赖预重建的3D Gaussian Splatting (3DGS) 表示,限制了从单张图像直接生成物理可模拟的4D场景的能力;二是物理属性的引入方式要么是刚性的人工定义,要么是依赖视频模型优化的不稳定方案。解决方案的核心在于提出PhysGM,一个前馈式框架,能够从单张图像中联合预测3D高斯表示及其物理属性,从而实现即时物理仿真与高保真4D渲染。其关键技术包括:1)通过联合优化高斯重建与概率物理预测建立基础模型;2)利用物理合理参考视频对模型进行精调,提升渲染质量和物理预测准确性;3)采用直接偏好优化(Direct Preference Optimization, DPO)对齐模拟结果与参考视频,避免复杂可微分仿真和光栅化过程中的反向传播需求,显著提升训练效率与稳定性。
链接: https://arxiv.org/abs/2508.13911
作者: Chunji Lv,Zequn Chen,Donglin Di,Weinan Zhang,Hao Li,Wei Chen,Changsheng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization, which needs back-propagating gradients through the complex differentiable simulation and rasterization. To facilitate training, we introduce a new dataset, PhysAssets, of over 24,000 3D assets annotated with physical properties and corresponding guiding videos. Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results. Our project page is at: this https URL
zh
[CV-16] Multimodal Data Storag e and Retrieval for Embodied AI: A Survey
【速读】:该论文旨在解决具身智能(Embodied AI, EAI)系统在持续与物理世界交互过程中产生的海量异构多模态数据流,传统数据管理系统难以有效管理的问题。其核心挑战在于如何实现对物理世界的精准语义锚定(physical grounding)、低延迟访问以及动态可扩展性。解决方案的关键在于系统性地评估五类存储架构(图数据库、多模型数据库、数据湖、向量数据库和时序数据库)与五种检索范式(融合策略、表征对齐、图结构、生成模型驱动及高效优化检索),揭示长期语义一致性与实时响应之间存在的根本张力,并识别出从基础的物理锚定差距到跨模态集成、动态适应及开放世界泛化等关键瓶颈。最终提出以物理感知的数据模型、存储-检索协同优化机制和标准化基准测试为核心的未来研究方向,为构建下一代自主具身系统的鲁棒、高性能数据管理框架提供理论支撑与实践路径。
链接: https://arxiv.org/abs/2508.13901
作者: Yihao Lu,Hao Tang
机构: South China Normal University (华南师范大学); Peking University (北京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Embodied AI (EAI) agents continuously interact with the physical world, generating vast, heterogeneous multimodal data streams that traditional management systems are ill-equipped to handle. In this survey, we first systematically evaluate five storage architectures (Graph Databases, Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series Databases), focusing on their suitability for addressing EAI’s core requirements, including physical grounding, low-latency access, and dynamic scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based Optimization), revealing a fundamental tension between achieving long-term semantic coherence and maintaining real-time responsiveness. Based on this comprehensive analysis, we identify key bottlenecks, spanning from the foundational Physical Grounding Gap to systemic challenges in cross-modal integration, dynamic adaptation, and open-world generalization. Finally, we outline a forward-looking research agenda encompassing physics-aware data models, adaptive storage-retrieval co-optimization, and standardized benchmarking, to guide future research toward principled data management solutions for EAI. Our survey is based on a comprehensive review of more than 180 related studies, providing a rigorous roadmap for designing the robust, high-performance data management frameworks essential for the next generation of autonomous embodied systems.
zh
[CV-17] SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation
【速读】:该论文旨在解决医学超声图像分割中传统方法存在的局限性问题,即卷积神经网络(CNN)难以捕捉长距离依赖关系,而基于Transformer的方法可能忽略局部上下文信息。其解决方案的关键在于提出一种特征聚合模块(Feature Aggregation Module, FAM),该模块将前一层的两个输入特征分别送入卷积与交叉注意力并行模块(Convolution and Cross-Attention Parallel Module, CCAPM)的两个分支,赋予它们不同角色以建立强关联;同时通过融合卷积操作与交叉注意力机制,实现对长程依赖和局部上下文信息的协同建模。进一步地,FAM被集成到空间-通道调控模块(Spatial-Channel Regulation Module, SCRM)中,增强了对显著区域和关键特征的关注能力,并最终嵌入UNet编码器块构建出新的分割框架——空间-通道调控网络(Spatial-Channel Regulation Network, SCRNet),从而在多个基准数据集上实现了当前最优(state-of-the-art)性能。
链接: https://arxiv.org/abs/2508.13899
作者: Weixin Xu,Ziliang Wang
机构: Beihang University (北京航空航天大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Medical ultrasound image segmentation presents a formidable challenge in the realm of computer vision. Traditional approaches rely on Convolutional Neural Networks (CNNs) and Transformer-based methods to address the intricacies of medical image segmentation. Nevertheless, inherent limitations persist, as CNN-based methods tend to disregard long-range dependencies, while Transformer-based methods may overlook local contextual information. To address these deficiencies, we propose a novel Feature Aggregation Module (FAM) designed to process two input features from the preceding layer. These features are seamlessly directed into two branches of the Convolution and Cross-Attention Parallel Module (CCAPM) to endow them with different roles in each of the two branches to help establish a strong connection between the two input features. This strategy enables our module to focus concurrently on both long-range dependencies and local contextual information by judiciously merging convolution operations with cross-attention mechanisms. Moreover, by integrating FAM within our proposed Spatial-Channel Regulation Module (SCRM), the ability to discern salient regions and informative features warranting increased attention is enhanced. Furthermore, by incorporating the SCRM into the encoder block of the UNet architecture, we introduce a novel framework dubbed Spatial-Channel Regulation Network (SCRNet). The results of our extensive experiments demonstrate the superiority of SCRNet, which consistently achieves state-of-the-art (SOTA) performance compared to existing methods.
zh
[CV-18] Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia
【速读】:该论文旨在解决南亚雾霾(South Asian Smog)事件中细颗粒物(PM)浓度实时预测缺乏区域性系统的问题,尤其针对印度-恒河平原地区每年11月至次年2月因作物残留物焚烧、机动车排放及气象条件变化导致的严重空气污染问题。其解决方案的关键在于利用Sentinel-5P卫星提供的紫外波段气溶胶指数(UV Aerosol Index, 340–380 nm)数据,结合卷积长短期记忆神经网络(Convolutional Long Short-Term Memory, ConvLSTM),有效捕捉气溶胶时空相关性,实现五天间隔的气溶胶事件预测,模型在均方误差(MSE ≈ 0.0018)、损失值(loss ≈ 0.3995)和结构相似性指数(SSIM ≈ 0.74)上表现良好,为区域空气质量预警提供了可行的技术路径。
链接: https://arxiv.org/abs/2508.13891
作者: Taimur Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The South Asian Smog refers to the recurring annual air pollution events marked by high contaminant levels, reduced visibility, and significant socio-economic impacts, primarily affecting the Indo-Gangetic Plains (IGP) from November to February. Over the past decade, increased air pollution sources such as crop residue burning, motor vehicles, and changing weather patterns have intensified these smog events. However, real-time forecasting systems for increased particulate matter concentrations are still not established at regional scale. The Aerosol Index, closely tied to smog formation and a key component in calculating the Air Quality Index (AQI), reflects particulate matter concentrations. This study forecasts aerosol events using Sentinel-5P air constituent data (2019-2023) and a Convolutional Long-Short Term Memory (ConvLSTM) neural network, which captures spatial and temporal correlations more effectively than previous models. Using the Ultraviolet (UV) Aerosol Index at 340-380 nm as the predictor, results show the Aerosol Index can be forecasted at five-day intervals with a Mean Squared Error of ~0.0018, loss of ~0.3995, and Structural Similarity Index of ~0.74. While effective, the model can be improved by integrating additional data and refining its architecture.
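摘要所述的 ConvLSTM 预报器可以用 Keras 搭出一个最小版本(网格大小、层数与通道数均为假设值,非原文的网络配置):

```python
import tensorflow as tf

# Minimal ConvLSTM forecaster: a sequence of aerosol-index grids in, one
# next-interval grid out. Grid size and layer widths are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                               return_sequences=True,
                               input_shape=(None, 64, 64, 1)),   # T x H x W x C
    tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same"),
    tf.keras.layers.Conv2D(1, kernel_size=1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

x = tf.random.uniform((4, 5, 64, 64, 1))   # past 5 grids per sample
y = tf.random.uniform((4, 64, 64, 1))      # next-interval grid
model.fit(x, y, epochs=1, verbose=0)
```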
zh
[CV-19] In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging ICCV
【速读】:该论文旨在解决深度学习模型在医学影像任务中因分布偏移(distribution shift)而导致的泛化能力不足问题,特别是模型倾向于依赖虚假相关性(spurious correlations)而非临床有意义的特征进行预测。其解决方案的关键在于提出一种名为LCRReg的新颖正则化方法,该方法利用潜在概念表示(Latent Concept Representations, LCRs,如概念激活向量Concept Activation Vectors, CAVs)来引导模型学习语义上合理的表征。LCRReg无需主训练集中包含概念标签,而是通过一个小的辅助数据集合成高质量、解耦的概念样本,并在CNN中引入一个正则项,促使模型在与预定义概念相关的潜在子空间内激活,从而提升模型对分布外数据和虚假相关性的鲁棒性。
链接: https://arxiv.org/abs/2508.13880
作者: Valentina Corbetta,Floris Six Dijkstra,Regina Beets-Tan,Hoel Kervadec,Kristoffer Wickstrøm,Wilson Silva
机构: The Netherlands Cancer Institute (荷兰癌症研究所); Utrecht University (乌得勒支大学); Maastricht University (马斯特里赫特大学); University of Amsterdam (阿姆斯特丹大学); Amsterdam UMC (阿姆斯特丹大学医学中心); UiT The Arctic University of Norway (挪威北极大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 13 figures, 2 tables, accepted at PHAROS-AFE-AIMI Workshop in conjunction with the International Conference on Computer Vision (ICCV), 2025. This is the submitted manuscript with added link to the github repo, funding acknowledgments and author names and affiliations, and a correction to numbers in Table 1. Final version not published yet
Abstract:Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: this https URL
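CAV 式正则项的骨架可示意如下(假设做法:先用概念/非概念激活拟合一个线性方向,再惩罚相关输入的激活偏离该方向;细节均为示意,非 LCRReg 的确切公式):

```python
import torch

# Sketch of a CAV-style regulariser: fit a linear concept direction from
# positive/negative activations, then penalise activations that point away
# from it. All shapes and the penalty form are illustrative assumptions.
def fit_cav(acts_pos, acts_neg, steps=200, lr=1e-2):
    w = torch.zeros(acts_pos.size(1), requires_grad=True)
    x = torch.cat([acts_pos, acts_neg])
    y = torch.cat([torch.ones(len(acts_pos)), torch.zeros(len(acts_neg))])
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):                     # linear probe: concept vs. not
        opt.zero_grad()
        torch.nn.functional.binary_cross_entropy_with_logits(x @ w, y).backward()
        opt.step()
    return (w / w.norm()).detach()

def lcr_penalty(acts, cav):
    # Encourage activations of relevant inputs to align with the concept axis.
    return torch.relu(-(acts @ cav)).mean()

cav = fit_cav(torch.randn(64, 128) + 1.0, torch.randn(64, 128))
penalty = lcr_penalty(torch.randn(32, 128), cav)   # added to the task loss
```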
zh
[CV-20] RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection ICCV
【速读】:该论文旨在解决增量学习(Incremental Learning, IL)在真实场景中评估不足的问题,现有方法多依赖于简化或合成的基准测试,难以反映实际应用中的性能表现。为弥补这一缺陷,作者提出了两个现实增量目标检测基准:领域增量基准(Domain RICO, D-RICO)和扩展类别基准(Expanding-Classes RICO, EC-RICO),它们分别模拟固定类别下的领域变化和逐步引入新类别与新领域的场景,并基于14个多样化数据集构建,涵盖真实与合成域、不同天气条件、时间、相机视角及标注策略等复杂因素。解决方案的关键在于构建更贴近现实的评估框架,揭示当前IL方法在适应性和知识保留方面的系统性不足,并指出性能差距主要源于蒸馏中教师模型能力弱、单一模型难以应对多样任务以及模型可塑性不足等问题。
链接: https://arxiv.org/abs/2508.13878
作者: Matthias Neuwirth-Trapp,Maarten Bieshaar,Danda Pani Paudel,Luc Van Gool
机构: Bosch Research; INSAIT, Sofia University “St. Kliment Ohridski”
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV Workshops 2025
Abstract:Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models’ inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.
zh
[CV-21] A Comprehensive Re-Evaluation of Biometric Modality Properties in the Modern Era
【速读】:该论文旨在解决当前生物特征识别(biometric)模态评估框架滞后于技术发展的问题,特别是针对1998年广泛使用的对比表格已无法反映近年来生物特征技术进步与新兴安全漏洞的现状。其解决方案的关键在于通过一项涵盖24位生物特征专家的调查,重新评估主流生物特征模态(如人脸、指纹)在多个属性上的表现,并结合55个生物特征数据集的不确定性指标进行交叉验证,从而构建一个更可靠、更具时效性的评估体系。该方法不仅提升了评估结果的一致性和可信度,还揭示了专家意见与实证数据之间的强一致性,同时指出了当前评估中存在分歧的关键挑战,为未来研究方向提供了明确指引。
链接: https://arxiv.org/abs/2508.13874
作者: Rouqaiah Al-Refai,Pankaja Priya Ramasamy,Ragini Ramesh,Patricia Arias-Cabarcos,Philipp Terhörst
机构: Paderborn University (帕德博恩大学); European Commission, Joint Research Centre (JRC) (欧盟委员会联合研究中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of authentication systems and their increasing reliance on biometrics for faster and more accurate user verification experience, highlight the critical need for a reliable framework to evaluate the suitability of biometric modalities for specific applications. Currently, the most widely known evaluation framework is a comparative table from 1998, which no longer adequately captures recent technological developments or emerging vulnerabilities in biometric systems. To address these challenges, this work revisits the evaluation of biometric modalities through an expert survey involving 24 biometric specialists. The findings indicate substantial shifts in property ratings across modalities. For example, face recognition, shows improved ratings due to technological progress, while fingerprint, shows decreased reliability because of emerging vulnerabilities and attacks. Further analysis of expert agreement levels across rated properties highlighted the consistency of the provided evaluations and ensured the reliability of the ratings. Finally, expert assessments are compared with dataset-level uncertainty across 55 biometric datasets, revealing strong alignment in most modalities and underscoring the importance of integrating empirical evidence with expert insight. Moreover, the identified expert disagreements reveal key open challenges and help guide future research toward resolving them.
zh
[CV-22] RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems
【速读】:该论文旨在解决传统石材劣化模式识别方法依赖专家团队现场观察所带来的高时间与资源成本问题。解决方案的关键在于构建一个基于认知架构的多智能体人工智能(Multi-Agent Artificial Intelligence)系统,该系统通过模拟五类专业角色(岩石学家、病理学家、环境专家、保护修复师及诊断协调员)之间的协作机制,实现对视觉证据中多种石材病害的自动化诊断。实验表明,该系统在28张复杂图像上的性能显著优于基础模型,验证了其在提升诊断效率与准确性方面的潜力。
链接: https://arxiv.org/abs/2508.13872
作者: Daniele Corradetti,José Delgado Rodrigues
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 11 pages, 1 figure, 1 table. Contribution for REEACH 2025 Symposium
Abstract:The Id-Pattern system within the RED.AI project (Reabilitação Estrutural Digital através da AI) consists of an agentic system designed to assist in the identification of stone deterioration patterns. Traditional methodologies, based on direct observation by expert teams, are accurate but costly in terms of time and resources. The system developed here introduces and evaluates a multi-agent artificial intelligence (AI) system, designed to simulate collaboration between experts and automate the diagnosis of stone pathologies from visual evidence. The approach is based on a cognitive architecture that orchestrates a team of specialized AI agents which, in this specific case, are limited to five: a lithologist, a pathologist, an environmental expert, a conservator-restorer, and a diagnostic coordinator. To evaluate the system we selected 28 difficult images involving multiple deterioration patterns. Our first results showed a substantial improvement on all metrics compared to the foundational model.
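五个专家智能体加协调者的流程可用如下草图表示(`llm` 为占位桩函数,角色提示词亦为虚构示例,并非 Id-Pattern 系统的真实提示):

```python
# Hedged sketch of the five-agent pipeline; `llm` is a stub standing in for
# any chat-completion call, and the role prompts are invented for illustration.
ROLES = {
    "lithologist": "Identify the stone type and its typical weathering behaviour.",
    "pathologist": "Name the deterioration patterns visible in the evidence.",
    "environmental expert": "Assess environmental drivers (salt, moisture, pollution).",
    "conservator-restorer": "Note conservation-relevant observations.",
}

def llm(prompt: str) -> str:
    return f"[model answer to: {prompt[:60]}...]"   # placeholder, no real API

def diagnose(evidence: str) -> str:
    opinions = {role: llm(f"You are a {role}. {task}\nEvidence: {evidence}")
                for role, task in ROLES.items()}
    summary = "\n".join(f"{r}: {o}" for r, o in opinions.items())
    # The fifth agent, the diagnostic coordinator, merges the specialists' views.
    return llm(f"You are the diagnostic coordinator. Combine:\n{summary}")

print(diagnose("granite facade, dark crusts near mortar joints"))
```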
zh
[CV-23] SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation
【速读】:该论文旨在解决当前文本到图像生成模型(text-to-image models)在精确对齐文本提示(text prompts)方面的不足,即生成图像时常缺失关键元素或错误融合不同概念的问题。其解决方案的关键在于提出一种无需训练(training-free)的新方法,通过在去噪过程中显式建模信号分量(signal component),学习一个高成功率的条件分布,从而确保生成图像忠实反映目标提示内容。该方法不仅有效缓解过优化和分布外伪影问题,还能无缝集成至现有的扩散模型(diffusion)与流匹配(flow matching)架构,并支持边界框等额外条件模态以提升空间对齐精度。
链接: https://arxiv.org/abs/2508.13866
作者: Paul Grimal,Michaël Soumm,Hervé Le Borgne,Olivier Ferret,Akihiro Sugimoto
机构: Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France; Télécom Paris; National Institute of Informatics, Japan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities – such as bounding boxes – for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at this https URL.
zh
[CV-24] Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems
【速读】:该论文旨在解决跨域目标检测(cross-domain object detection)中的性能下降问题,尤其是在源域和目标域之间存在显著分布差异时,如何有效对齐特征并提升检测精度。解决方案的关键在于提出一种自aware自适应对齐方法(Self-Aware Adaptive Alignment, SA3),其核心创新包括:1)设计了一个基于注意力机制的图像级特征对齐模块,通过在源域和目标域数据上联合训练,实现局部-全局自适应对齐;2)引入通道重要性重加权策略优化特征表示,并将其输入区域建议网络(Region Proposal Network, RPN)以提取显著区域特征;3)构建针对目标域的实例到图像级别的对齐模块,动态缓解域间差异。实验表明,SA3在多个主流跨域检测基准上优于现有最先进方法。
链接: https://arxiv.org/abs/2508.13823
作者: Tong Xiang,Hongxia Zhao,Fenghua Zhu,Yuanyuan Chen,Yisheng Lv
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Domain adaptation, Virtual Reality, Object Detection
Abstract:Achieving top-notch performance in Intelligent Transportation detection is a critical research area. However, many challenges still need to be addressed when it comes to detecting in a cross-domain scenario. In this paper, we propose a Self-Aware Adaptive Alignment (SA3), leveraging an efficient alignment mechanism and recognition strategy. Our proposed method employs a specified attention-based alignment module, trained on source and target domain datasets, to guide the image-level feature alignment process, enabling local-global adaptive alignment between the source and target domains. Features from both domains, whose channel importance is re-weighted, are fed into the region proposal network, which facilitates the acquisition of salient region features. Also, we introduce an instance-to-image level alignment module specific to the target domain to adaptively mitigate the domain gap. To evaluate the proposed method, extensive experiments have been conducted on popular cross-domain object detection benchmarks. Experimental results show that SA3 achieves results superior to previous state-of-the-art methods.
zh
[CV-25] Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
【速读】:该论文旨在解决城市树木多样性评估中缺乏精细化数据的问题,尤其针对传统实地调查成本高、耗时长,以及监督式人工智能方法依赖标注数据且跨区域泛化能力差的局限。其解决方案的关键在于提出一种无监督聚类框架,通过融合街景图像的视觉嵌入(visual embeddings)与空间种植模式,无需标签即可估计物种多样性水平,从而实现对城市树种属级多样性的高保真映射,并保持空间自相关性,为无详细树种清单的城市提供可扩展、细粒度的生物多样性制图工具。
链接: https://arxiv.org/abs/2508.13814
作者: Diaa Addeen Abuhani,Marco Seccaroni,Martina Mazzarello,Imran Zualkernan,Fabio Duarte,Carlo Ratti
机构: Massachusetts Institute of Technology (麻省理工学院); Politecnico di Milano (米兰理工大学); American University of Sharjah (阿联酋沙迦美国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 26 pages, 7 figures, Nature Format
Abstract:Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
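文中恢复的 Shannon 与 Simpson 多样性指数,可直接由聚类结果计算(以聚类标签近似属级类别;下方随机标签仅为演示数据):

```python
import numpy as np

# Shannon and Simpson diversity from cluster assignments; clusters stand in
# for (unlabelled) genera. The random labels below are placeholder data.
def shannon(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def simpson(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - (p**2).sum()          # Gini-Simpson form

clusters = np.random.randint(0, 12, size=500)   # e.g. k-means over embeddings
print(shannon(clusters), simpson(clusters))
```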
zh
[CV-26] mestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation
【速读】:该论文旨在解决当前基于梯度的脉冲神经网络(Spiking Neural Networks, SNNs)对抗攻击方法中存在的高攻击延迟问题,这类方法通常直接扩展人工神经网络(Artificial Neural Networks, ANNs)中的FGSM和PGD框架,因需多时间步处理而难以应用于实时场景。解决方案的关键在于提出一种新的时间步压缩攻击(timestep-compressed attack, TCA)框架,其核心创新包括两个方面:一是时间步级反向传播(timestep-level backpropagation, TLBP),通过发现全局时间信息在生成扰动时并非必要,从而实现逐时间步评估并提前终止;二是对抗膜电位复用(adversarial membrane potential reuse, A-MPR),利用初始时间步膜电位积累效率低的特点,预先计算并复用该“预热”阶段的膜电位,显著降低计算开销。实验表明,TCA在白盒与黑盒设置下分别将攻击延迟降低最多56.6%和57.1%,同时保持与现有最优方法相当的攻击成功率。
链接: https://arxiv.org/abs/2508.13812
作者: Donghwa Kang,Doohyun Kim,Sang-Ki Ko,Jinkyu Lee,Hyeongboo Baek,Brent ByungHoon Kang
机构: Sungkyunkwan University (成均馆大学); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 8 pages
Abstract:State-of-the-art (SOTA) gradient-based adversarial attacks on spiking neural networks (SNNs), which largely rely on extending FGSM and PGD frameworks, face a critical limitation: substantial attack latency from multi-timestep processing, rendering them infeasible for practical real-time applications. This inefficiency stems from their design as direct extensions of ANN paradigms, which fail to exploit key SNN properties. In this paper, we propose the timestep-compressed attack (TCA), a novel framework that significantly reduces attack latency. TCA introduces two components founded on key insights into SNN behavior. First, timestep-level backpropagation (TLBP) is based on our finding that global temporal information in backpropagation to generate perturbations is not critical for an attack’s success, enabling per-timestep evaluation for early stopping. Second, adversarial membrane potential reuse (A-MPR) is motivated by the observation that initial timesteps are inefficiently spent accumulating membrane potential, a warm-up phase that can be pre-calculated and reused. Our experiments on VGG-11 and ResNet-17 with the CIFAR-10/100 and CIFAR10-DVS datasets show that TCA significantly reduces the required attack latency by up to 56.6% and 57.1% compared to SOTA methods in white-box and black-box settings, respectively, while maintaining a comparable attack success rate.
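逐时间步评估并提前终止(TLBP)的核心循环可示意如下(假设:用一个桩模型代替真实 SNN,扰动更新采用 FGSM 式符号梯度;均为说明性简化,非论文原实现):

```python
import torch
import torch.nn.functional as F

# Sketch of timestep-level early stopping: accumulate per-timestep logits,
# refine the perturbation each step, and stop once the running prediction
# flips. `snn_step` is a toy stand-in for a real spiking network.
def snn_step(x, t):
    torch.manual_seed(t)                          # toy deterministic "SNN" step
    return x.flatten(1) @ torch.randn(x[0].numel(), 10)

def tca_like_attack(x, label, eps=0.03, T=8):
    target = torch.tensor([label])
    delta = torch.zeros_like(x, requires_grad=True)
    logits_sum = torch.zeros(1, 10)
    for t in range(T):
        out = snn_step(x + delta, t)
        logits_sum = logits_sum + out.detach()
        if logits_sum.argmax(dim=1).item() != label:
            return delta.detach(), t + 1          # early stop: attack succeeded
        grad, = torch.autograd.grad(F.cross_entropy(out, target), delta)
        delta = (delta + eps * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach(), T

delta, steps_used = tca_like_attack(torch.rand(1, 3, 8, 8), label=0)
```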
zh
[CV-27] Is-NeRF: In-scattering Neural Radiance Field for Blurred Images
【速读】:该论文旨在解决传统神经辐射场(Neural Radiance Fields, NeRF)在处理复杂光照路径和运动模糊图像时存在的几何歧义与渲染失真问题,尤其是在现实环境中因光线散射导致的细节丢失。其解决方案的关键在于提出了一种去模糊神经辐射场(Is-NeRF),通过显式建模真实环境中的六种常见光传播现象,并以入射散射(in-scattering)统一表示,构建了新的散射感知体素渲染流程,从而适应复杂光路;同时引入自适应学习策略,自动确定散射方向与采样间隔,提升对物体精细结构的捕捉能力,最终联合优化NeRF参数、散射参数与相机运动,实现从模糊图像中恢复高保真度且几何准确的场景表示。
链接: https://arxiv.org/abs/2508.13808
作者: Nan Luo,Chenglin Ye,Jiaxu Li,Gang Liu,Bo Wan,Di Wang,Lupeng Liu,Jun Xiao
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural Radiance Fields (NeRF) has gained significant attention for its prominent implicit 3D representation and realistic novel view synthesis capabilities. Existing works, without exception, employ straight-line volume rendering, which struggles to handle sophisticated lightpath scenarios and introduces geometric ambiguities during training, particularly evident when processing motion-blurred images. To address these challenges, this work proposes a novel deblur neural radiance field, Is-NeRF, featuring explicit lightpath modeling in real-world environments. By unifying six common light propagation phenomena through an in-scattering representation, we establish a new scattering-aware volume rendering pipeline adaptable to complex lightpaths. Additionally, we introduce an adaptive learning strategy that enables autonomous determination of scattering directions and sampling intervals to capture finer object details. The proposed network jointly optimizes NeRF parameters, scattering parameters, and camera motions to recover fine-grained scene representations from blurry images. Comprehensive evaluations demonstrate that it effectively handles complex real-world scenarios, outperforming state-of-the-art approaches in generating high-fidelity images with accurate geometric details.
zh
[CV-28] Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing SIGGRAPH2025
【速读】:该论文旨在解决视频中3D场景结构内容编辑的难题,尤其是在存在显著视角变化(如大角度相机旋转或缩放)时,如何生成与原视频一致的新视图内容、保持未编辑区域不变,并将稀疏的2D输入转化为逼真的3D视频输出。解决方案的关键在于提出Sketch3DVE方法:首先利用图像编辑技术处理首帧并传播至后续帧以应对稀疏输入;其次通过草图(sketch)实现精确几何控制,同时兼容其他基于掩码的图像编辑方式;再者借助密集立体匹配估计点云和相机参数,进而采用基于深度图的点云编辑策略对新增3D组件进行对齐;最后结合3D感知的掩码传播机制与视频扩散模型,实现新编辑内容与原始视频的无缝融合,同时保留未编辑区域特征。
链接: https://arxiv.org/abs/2508.13797
作者: Feng-Lin Liu,Shi-Yang Li,Yan-Pei Cao,Hongbo Fu,Lin Gao
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); VAST (China); Hong Kong University of Science and Technology (香港科技大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2025
Abstract:Recent video editing methods achieve attractive results in style transfer or appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly when dealing with significant viewpoint changes, such as large camera rotations or zooms. Key challenges include generating novel view content that remains consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based 3D-aware video editing method to enable detailed local manipulation of videos with significant viewpoint changes. To solve the challenge posed by sparse inputs, we employ image editing methods to generate edited results for the first frame, which are then propagated to the remaining frames of the video. We utilize sketching as an interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we perform a detailed analysis and manipulation of the 3D information in the video. Specifically, we utilize a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components, aligning them effectively with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing. Homepage and code: http://geometrylearning.com/Sketch3DVE/
zh
[CV-29] A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports
【速读】:该论文旨在解决乳腺癌超声图像分割中模型性能与可解释性不足的问题,尤其在临床实践中对高精度分割和可信诊断推理的需求。解决方案的关键在于提出一种基于Transformer的多模态框架Med-CTX,其核心创新包括:1)采用双分支视觉编码器融合ViT(Vision Transformer)与Swin Transformer以增强特征提取能力;2)引入不确定性感知的特征融合机制提升分割鲁棒性;3)利用BioClinicalBERT对BI-RADS语义结构化的临床报告进行编码,并通过跨模态注意力机制将文本信息与视觉特征对齐,从而生成具有临床依据的诊断推理(diagnostic rationales)。该方法同时输出分割掩码、不确定性图和解释文本,显著提升了计算机辅助诊断的透明度与可信度。
链接: https://arxiv.org/abs/2508.13796
作者: Enobong Adahada,Isabel Sassoon,Kate Hone,Yongmin Li
机构: Brunel University of London (布鲁内尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Med-CTX, a fully transformer based multimodal framework for explainable breast cancer ultrasound segmentation. We integrate clinical radiology reports to boost both performance and interpretability. Med-CTX achieves exact lesion delineation by using a dual-branch visual encoder that combines ViT and Swin transformers, as well as uncertainty aware fusion. Clinical language structured with BI-RADS semantics is encoded by BioClinicalBERT and combined with visual features utilising cross-modal attention, allowing the model to provide clinically grounded, model generated explanations. Our methodology generates segmentation masks, uncertainty maps, and diagnostic rationales all at once, increasing confidence and transparency in computer assisted diagnosis. On the BUS-BRA dataset, Med-CTX achieves a Dice score of 99% and an IoU of 95%, beating existing baselines U-Net, ViT, and Swin. Clinical text plays a key role in segmentation accuracy and explanation quality, as evidenced by ablation studies that show a 5.4% drop in Dice score and a 31% drop in CIDEr when the text is removed. Med-CTX achieves good multimodal alignment (CLIP score: 85%) and improved confidence calibration (ECE: 3.2%), setting a new bar for trustworthy, multimodal medical architecture.
zh
[CV-30] VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
【速读】:该论文旨在解决从视觉观测中推断物体固有动力学(intrinsic dynamics)的问题,现有方法面临两大挑战:一是依赖人工定义的本构先验,难以推广至复杂场景;二是采用神经网络建模导致可解释性差且泛化能力弱。解决方案的关键在于提出一种双层优化框架 VisionLaw,其上层利用大语言模型(LLMs)驱动的解耦本构演化策略,通过提示 LLM 作为物理专家生成并修正本构定律,并引入内置解耦机制显著降低搜索复杂度;下层则设计视觉引导的本构评估机制,借助视觉仿真验证生成本构定律与真实内在动力学的一致性,从而指导上层演化。该框架实现了从视觉数据中自动推导出可解释且具备强泛化能力的固有动力学表达。
链接: https://arxiv.org/abs/2508.13792
作者: Jailing Lin,Shu Jiang,Qingyuan Zeng,Zhenzhong Wang,Min Jiang
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures
Abstract:The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLMs-driven decoupled constitutive evolution strategy, where LLMs are prompted as a knowledgeable physics expert to generate and revise constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of LLMs. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.
zh
[CV-31] Shape-from-Template with Generalised Camera WWW
【速读】:该论文旨在解决多相机视角下非刚性三维形状到二维关键点的配准问题,即多视角下的形状恢复(Shape-from-Template, SfT)问题。传统SfT方法通常基于单图像进行,而本文通过引入广义相机模型(generalised camera model),首次实现了在任意透视或正交相机组合下对形变物体的非刚性注册,扩展了其在医学影像和手持相机等场景中的应用范围。解决方案的关键在于利用多视图间形变物体的相互约束来提升重建精度:提出三种方法——基于已知3D点方向向量的对应关系、基于未知3D点但已知局部参考系朝向的方向向量对应关系,以及结合轮廓信息的迭代优化方法;其中前两种采用凸优化求解,第三种则以凸解为基础进行迭代精化,从而显著提高了非刚性配准的准确性与鲁棒性。
链接: https://arxiv.org/abs/2508.13791
作者: Agniva Sengupta,Stefan Zachow
机构: Zuse Institute Berlin (ZIB)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Pre-print of the IMAVIS article: this https URL Code and data in: this https URL
Abstract:This article presents a new method for non-rigidly registering a 3D shape to 2D keypoints observed by a constellation of multiple cameras. Non-rigid registration of a 3D shape to observed 2D keypoints, i.e., Shape-from-Template (SfT), has been widely studied using single images, but SfT with information from multiple cameras jointly opens new directions for extending the scope of known use-cases such as 3D shape registration in medical imaging and registration from hand-held cameras, to name a few. We represent such a multi-camera setup with the generalised camera model; therefore any collection of perspective or orthographic cameras observing any deforming object can be registered. We propose multiple approaches for such SfT: the first approach where the corresponded keypoints lie on a direction vector from a known 3D point in space, the second approach where the corresponded keypoints lie on a direction vector from an unknown 3D point in space but with known orientation w.r.t. some local reference frame, and a third approach where, apart from correspondences, the silhouette of the imaged object is also known. Together, these form the first set of solutions to the SfT problem with generalised cameras. The key idea behind SfT with generalised cameras is the improved reconstruction accuracy from estimating the deformed shape while utilising the additional information from the mutual constraints between multiple views of a deformed object. The correspondence-based approaches are solved with convex programming, while the silhouette-based approach is an iterative refinement of the results from the convex solutions. We demonstrate the accuracy of our proposed methods on a wide range of synthetic and real data.
zh
[CV-32] MR6D: Benchmarking 6D Pose Estimation for Mobile Robots CVPR2025
【速读】:该论文旨在解决当前6D位姿估计(6D pose estimation)数据集主要聚焦于机器人臂操作的小型家用物体,难以适配移动机器人在工业环境中面临的复杂挑战的问题。其关键解决方案是提出了MR6D数据集,该数据集专为移动机器人设计,包含92个真实场景和16类独特物体,涵盖静态与动态交互,能够有效模拟远距离视角、多样化物体配置、较大物体尺寸及复杂遮挡/自遮挡等典型移动平台难题,从而推动针对移动机器人需求的位姿估计方法的发展与评估。
链接: https://arxiv.org/abs/2508.13775
作者: Anas Gouda,Shrutarv Awasthi,Christian Blesing,Lokeshwaran Manohar,Frank Hoffmann,Alice Kirchheim
机构: TU Dortmund (多特蒙德工业大学); Fraunhofer IML (弗劳恩霍夫研究所材料与系统集成研究所); LAMARR Institute for Machine Learning (机器学习拉马尔研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: accepted CVPR 2025 Workshop on Recovering 6D Object Pose (R6D)
Abstract:Existing 6D pose estimation datasets primarily focus on small household objects typically handled by robot arm manipulators, limiting their relevance to mobile robotics. Mobile platforms often operate without manipulators, interact with larger objects, and face challenges such as long-range perception, heavy self-occlusion, and diverse camera perspectives. While recent models generalize well to unseen objects, evaluations remain confined to household-like settings that overlook these factors. We introduce MR6D, a dataset designed for 6D pose estimation for mobile robots in industrial environments. It includes 92 real-world scenes featuring 16 unique objects across static and dynamic interactions. MR6D captures the challenges specific to mobile platforms, including distant viewpoints, varied object configurations, larger object sizes, and complex occlusion/self-occlusion patterns. Initial experiments reveal that current 6D pipelines underperform in these settings, with 2D segmentation being another hurdle. MR6D establishes a foundation for developing and evaluating pose estimation methods tailored to the demands of mobile robotics. The dataset is available at this https URL.
zh
[CV-33] Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理多图像输入时性能显著下降的问题,其根本原因在于不同图像的视觉线索在模型输出中发生交叉信息泄漏(cross-image information leakage)。解决方案的关键在于提出一种无需训练且与架构无关的解码策略FOCUS:该方法在推理阶段通过依次将除一张图像外的所有图像掩蔽为随机噪声,引导模型聚焦于单张清晰图像;随后对所有目标图像重复此过程以获取部分掩蔽上下文下的logits,并利用仅含噪声的参考输入进行对比精炼,从而抑制信息泄漏并提升输出准确性。
链接: https://arxiv.org/abs/2508.13744
作者: Yeji Park,Minyoung Lee,Sanghyuk Chun,Junsuk Choe
机构: NAVER AI Lab(NAVER人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Source code is available at this https URL
Abstract:Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model’s output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
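以下给出 FOCUS 解码思路的一个极简 Python 示意(假设性草图,非官方实现):依次将除一张外的图像替换为随机噪声,得到各单图上下文下的 logits,聚合后再与纯噪声参考输入做对比修正。其中聚合方式(平均)、对比系数 alpha 以及 model 的调用接口均为本文为演示所作的假设。

```python
import torch

def focus_decode_logits(model, images, text_inputs, alpha=1.0):
    """FOCUS 式解码的极简示意(假设性实现)。

    model: 可调用对象, 输入 (images, text_inputs) 返回下一个 token 的 logits。
    images: [N, C, H, W] 的多图输入; alpha: 对比修正强度(假设的超参数)。
    """
    n = images.shape[0]
    per_image_logits = []
    for i in range(n):
        masked = torch.randn_like(images)   # 先将全部图像替换为随机噪声
        masked[i] = images[i]               # 仅保留第 i 张干净图像
        per_image_logits.append(model(masked, text_inputs))
    aggregated = torch.stack(per_image_logits).mean(dim=0)  # 聚合各单图上下文的 logits

    noise_only = torch.randn_like(images)   # 纯噪声参考输入
    reference = model(noise_only, text_inputs)
    # 对比修正: 放大干净图像带来的证据, 抑制跨图信息泄漏
    return (1 + alpha) * aggregated - alpha * reference

if __name__ == "__main__":
    # 用法示意: 用一个随机"模型"演示调用方式(32000 为假设的词表大小)
    dummy = lambda imgs, txt: torch.randn(32000)
    imgs = torch.randn(3, 3, 224, 224)
    print(focus_decode_logits(dummy, imgs, None).shape)
```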
zh
[CV-34] Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
【速读】:该论文旨在解决当前针对视觉语言模型(Vision-Language Models, VLMs)的定向对抗攻击方法中存在的两个核心问题:一是现有方法在编码器层面扰动图像,将丰富的视觉语义压缩为单一全局向量,导致攻击粒度粗略,难以实现如修改车辆同时保留背景等细粒度操控;二是这些方法忽视了投影模块(projector module),即连接视觉编码器与语言模型的关键语义桥梁(如广泛采用的Q-Former),从而未能有效破坏VLM中完整的视觉-语言对齐流程,限制了攻击效果。解决方案的关键在于提出中间投影器引导攻击(Intermediate Projector Guided Attack, IPGA),首次在Q-Former这一中间投影阶段进行攻击,利用其将全局图像嵌入转换为细粒度视觉token的能力,实现对语义有意义的视觉token的精准扰动,而非仅作用于全局表示;此外,引入残差查询对齐(Residual Query Alignment, RQA)以保留无关视觉内容,提升攻击的可控性与精确性。IPGA无需LLM微调,仅依赖预训练的Q-Former,在黑盒环境下显著优于现有方法,并成功迁移至多个商业VLM(如Google Gemini和OpenAI GPT)。
链接: https://arxiv.org/abs/2508.13739
作者: Yiming Cao,Yanjie Li,Kaisheng Liang,Yuni Lai,Bin Xiao
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely overlook the projector module, a critical semantic bridge between the visual encoder and the language model in VLMs, thereby failing to disrupt the full vision-language alignment pipeline within VLMs and limiting attack effectiveness. To address these issues, we propose the Intermediate Projector Guided Attack (IPGA), the first method to attack using the intermediate stage of the projector module, specifically the widely adopted Q-Former, which transforms global image embeddings into fine-grained visual features. This enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than a single global representation. Specifically, IPGA leverages the Q-Former pretrained solely on the first vision-language alignment stage, without LLM fine-tuning, which improves both attack effectiveness and transferability across diverse VLMs. Furthermore, we propose Residual Query Alignment (RQA) to preserve unrelated visual content, thereby yielding more controlled and precise adversarial manipulations. Extensive experiments show that our attack method consistently outperforms existing methods in both standard global image captioning tasks and fine-grained visual question-answering tasks in black-box environment. Additionally, IPGA successfully transfers to multiple commercial VLMs, including Google Gemini and OpenAI GPT.
zh
[CV-35] Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture
【速读】:该论文旨在解决如何在元宇宙(Metaverse)环境中高效组织与检索农业主题的虚拟博物馆(AgriMuseums)内容的问题,尤其针对用户通过自然语言查询获取匹配教育场景时面临的挑战。其关键解决方案是提出一种分层视觉-语言模型(hierarchical vision-language model),该模型能够联合建模AgriMuseums的视觉特征与文本描述,从而实现基于自然语言查询的精准检索。实验表明,该方法在R@1和MRR指标上分别达到约62%和78%,并显著优于现有基准,验证了其有效性与设计合理性。
链接: https://arxiv.org/abs/2508.13713
作者: Ali Abdari,Alex Falcon,Giuseppe Serra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 23rd International Conference on Image Analysis and Processing (ICIAP 2025)
Abstract:Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users’ interests remains a challenging task. A first step in this direction has been done recently, but existing datasets are small and not sufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62% R@1 and 78% MRR, confirming its effectiveness, and it also leads to improvements on existing benchmarks by up to 6% R@1 and 11% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at this https URL .
zh
[CV-36] Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中高质量标注数据获取成本高、耗时长的问题,提出了一种基于半监督学习的解决方案。其核心创新在于设计了一个多样性增强的协同Mamba框架(DCMamba),关键在于从数据、网络结构和特征表示三个维度协同提升模型对未标注数据的利用效率:首先通过基于Mamba扫描建模特性的patch级弱-强混合增强策略挖掘数据多样性;其次引入多样扫描协作模块,利用不同扫描方向产生的预测差异来增强模型鲁棒性;最后采用不确定性加权对比学习机制,强化特征表示的多样性与判别能力。实验表明,该方法在仅使用20%标注数据的情况下,在Synapse数据集上较现有最优方法提升了6.69%的分割性能。
链接: https://arxiv.org/abs/2508.13712
作者: Shumeng Li,Jian Zhang,Lei Qi,Luping Zhou,Yinghuan Shi,Yang Gao
机构: Nanjing University (南京大学); Southeast University (东南大学); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Acquiring high-quality annotated data for medical image segmentation is tedious and costly. Semi-supervised segmentation techniques alleviate this burden by leveraging unlabeled data to generate pseudo labels. Recently, advanced state space models, represented by Mamba, have shown efficient handling of long-range dependencies. This drives us to explore their potential in semi-supervised medical image segmentation. In this paper, we propose a novel Diversity-enhanced Collaborative Mamba framework (namely DCMamba) for semi-supervised medical image segmentation, which explores and utilizes the diversity from data, network, and feature perspectives. Firstly, from the data perspective, we develop patch-level weak-strong mixing augmentation with Mamba’s scanning modeling characteristics. Moreover, from the network perspective, we introduce a diverse-scan collaboration module, which could benefit from the prediction discrepancies arising from different scanning directions. Furthermore, from the feature perspective, we adopt an uncertainty-weighted contrastive learning mechanism to enhance the diversity of feature representation. Experiments demonstrate that our DCMamba significantly outperforms other semi-supervised medical image segmentation methods, e.g., yielding the latest SSM-based method by 6.69% on the Synapse dataset with 20% labeled data.
zh
[CV-37] HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在人类相关视觉场景理解能力上的不足,特别是其在感知、理解和推理三个层次上难以达到与人类相当的性能问题。解决方案的关键在于提出HumanPCR评估套件,该套件从三个层级系统性地评测MLLMs对人类相关视觉上下文的理解能力:感知层(Human-P)、理解层(Human-C)和推理层(Human-R)。其中,Human-P和Human-C包含超过6000个经人工验证的多项选择题,覆盖9个维度的任务;Human-R则设计了一个高难度的手动标注视频推理测试,要求模型整合多源视觉证据、主动提取超出问题提示的上下文信息,并运用类人专业知识进行推理。每个题目均附有人工标注的Chain-of-Thought(CoT)推理链及关键视觉证据,为后续研究提供支持。实验表明,现有先进模型在空间感知细节、时间理解及心智建模等任务中仍面临显著挑战,且对查询引导式检索存在依赖偏差,即使采用扩展视觉上下文或测试时思维(test-time thinking)等策略也仅带来有限提升。
链接: https://arxiv.org/abs/2508.13692
作者: Keliang Li,Hongze Shen,Hao Shi,Ruibing Hou,Hong Chang,Jie Huang,Chenghao Jia,Wen Wang,Yiling Wu,Dongmei Jiang,Shiguang Shan,Xilin Chen
机构: Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS (中国科学院智能信息处理重点实验室); University of Chinese Academy of Sciences (中国科学院大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs’ capacity about human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing massive tasks of 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging manually curated video reasoning test that requires integrating multiple visual evidences, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals the struggle of models in extracting essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even with advanced techniques like scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
zh
[CV-38] DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
【速读】:该论文旨在解决从遥感影像中自动提取完整且拓扑准确的道路网络图(road network graph)这一关键挑战。现有方法中,基于分割的方法在矢量化后难以保持拓扑一致性,而基于图生长的方法虽能保留拓扑结构但计算开销大;基于图生成的方法虽然推理速度快,却限制了动态顶点插入能力。其解决方案的关键在于提出一种新型混合模型 DeH4R,通过将任务解耦为候选顶点检测、邻近顶点预测、初始图构建与图扩展四个阶段,实现了静态顶点预测的高效性与动态顶点(边)插入的灵活性相结合,在保持快速推理的同时显著提升了拓扑保真度和空间一致性。
链接: https://arxiv.org/abs/2508.13669
作者: Dengxian Gong,Shunping Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference, but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10× faster. The code will be made publicly available at this https URL.
zh
[CV-39] Model-based Multi-object Visual Tracking: Identification and Standard Model Limitations
【速读】:该论文旨在解决基于2D边界框检测的行人跟踪问题,其核心挑战在于如何有效建模多目标运动状态并准确估计目标轨迹。解决方案的关键在于采用雷达跟踪领域成熟的多目标跟踪方法,具体使用标准点对象(Standard Point-Object, SPO)模型,并通过泊松多伯努利混合(Poisson Multi-Bernoulli Mixture, PMBM)滤波器计算后验密度,从而实现对多个行人的联合状态估计。文中进一步讨论了连续时间建模下参数的选择,包括出生概率和生存概率,部分参数基于物理原理设定,另一些则从公开数据集MOT-17中识别得出,最终虽获得良好性能,但仍发现SPO模型与实际数据存在不匹配问题,提示未来改进应聚焦于优化导致该不匹配的模型组件。
链接: https://arxiv.org/abs/2508.13647
作者: Jan Krejčí,Oliver Kost,Yuxuan Xia,Lennart Svensson,Ondřej Straka
机构: University of West Bohemia (西波希米亚大学); Shanghai Jiaotong University (上海交通大学); Chalmers University of Technology (查尔姆斯理工大学)
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to FUSION 2025 conference
Abstract:This paper uses multi-object tracking methods known from the radar tracking community to address the problem of pedestrian tracking using 2D bounding box detections. The standard point-object (SPO) model is adopted, and the posterior density is computed using the Poisson multi-Bernoulli mixture (PMBM) filter. The selection of the model parameters rooted in continuous time is discussed, including the birth and survival probabilities. Some parameters are selected from the first principles, while others are identified from the data, which is, in this case, the publicly available MOT-17 dataset. Although the resulting PMBM algorithm yields promising results, a mismatch between the SPO model and the data is revealed. The model-based approach assumes that modifying the problematic components causing the SPO model-data mismatch will lead to better model-based algorithms in future developments.
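摘要提到模型参数"根植于连续时间"(如出生与生存概率)。下面用一个极简 Python 片段示意这种由连续时间速率换算逐帧参数的常见做法(速率数值均为假设,仅演示换算思路,与论文的具体取值无关):

```python
import math

# 连续时间参数 -> 离散帧参数的换算示意(数值为假设)
LAMBDA_DEATH = 0.1   # 目标消亡率 [1/s]
LAMBDA_BIRTH = 0.5   # 单位时间新生目标的期望数 [1/s]

def survival_probability(dt: float) -> float:
    """帧间隔 dt 秒内目标存活的概率: p_s = exp(-lambda_death * dt)。"""
    return math.exp(-LAMBDA_DEATH * dt)

def expected_births(dt: float) -> float:
    """帧间隔 dt 秒内按泊松过程期望出现的新目标数。"""
    return LAMBDA_BIRTH * dt

if __name__ == "__main__":
    dt = 1.0 / 30.0  # 30 FPS 视频的帧间隔
    print(f"p_survival per frame = {survival_probability(dt):.4f}")
    print(f"expected births per frame = {expected_births(dt):.4f}")
```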
zh
[CV-40] OmniTry: Virtual Try-On Anything without Masks
【速读】:该论文旨在解决传统虚拟试穿(Virtual Try-ON, VTON)任务仅局限于衣物类物品的局限性,扩展至包括珠宝、配饰等各类可穿戴物体,并在无需掩码(mask-free)条件下实现更实用的应用场景。其核心挑战在于获取大量成对图像数据(即目标物体图像与对应的试穿结果)困难。解决方案的关键在于提出一个两阶段训练流程:第一阶段利用大规模未配对的人像图像(含任意可穿戴物品)训练模型实现无掩码下的物体定位,通过复用图像修复(inpainting)模型自动将物体放置于合理位置;第二阶段则使用少量配对样本对模型进行微调,以保持物体外观的一致性。实验表明,该方法在12类常见可穿戴物品上均表现出优越的定位准确性和身份保留能力。
链接: https://arxiv.org/abs/2508.13632
作者: Yutong Feng,Linlin Zhang,Hengyuan Cao,Yiming Chen,Xiaoduan Feng,Jian Cao,Yuxiong Wu,Bin Wang
机构: Kunbyte AI; Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Virtual Try-ON (VTON) is a practical and widely-applied task, for which most of existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garment to encompass any wearable objects, e.g., jewelries and accessories, with mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-staged pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry will be made publicly available at this https URL.
zh
[CV-41] DiffIER: Optimizing Diffusion Models with Iterative Error Reduction
【速读】:该论文旨在解决扩散模型(Diffusion Models)在条件生成任务中因引导权重(guidance weight)选择敏感而导致的生成质量不稳定问题,其核心症结在于训练与推理阶段之间存在的“训练-推理差距”(training-inference gap)。该差距表现为推理过程中累积误差的增加,进而导致生成结果对引导权重高度敏感。解决方案的关键在于提出 DiffIER——一种基于优化的推理阶段误差最小化方法,通过在每一步推理中迭代地最小化累积误差,从而有效缩小训练-推理差距。该方法作为即插即用的框架,可在文本到图像、图像超分辨率和文本到语音等多种任务中显著提升生成质量,展现出良好的通用性和实用性。
链接: https://arxiv.org/abs/2508.13628
作者: Ao Chen,Lihe Ding,Tianfan Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have demonstrated remarkable capabilities in generating high-quality samples and enhancing performance across diverse domains through Classifier-Free Guidance (CFG). However, the quality of generated samples is highly sensitive to the selection of the guidance weight. In this work, we identify a critical "training-inference gap" and we argue that it is the presence of this gap that undermines the performance of conditional generation and renders outputs highly sensitive to the guidance weight. We quantify this gap by measuring the accumulated error during the inference stage and establish a correlation between the selection of guidance weight and minimizing this gap. Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based method for high-quality generation. We demonstrate that the accumulated error can be effectively reduced by an iterative error minimization at each step during inference. By introducing this novel plug-and-play optimization framework, we enable the optimization of errors at every single inference step and enhance generation quality. Empirical results demonstrate that our proposed method outperforms baseline approaches in conditional generation tasks. Furthermore, the method achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation, underscoring its versatility and potential for broad applications in future research.
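DiffIER 的核心是在每个推理步内迭代最小化累计误差。下面给出这种插件式内循环优化的抽象 Python 示意:error_fn 是本文假设的可微误差接口,用以代指论文中的累计误差度量,具体定义请以原文为准。

```python
import torch

def iterative_error_reduction(x, error_fn, steps=5, lr=0.05):
    """在单个推理步内, 对当前样本 x 做若干次梯度下降以减小误差泛函。

    error_fn: 可微的误差度量(抽象接口, 代指论文中的累计误差)。
    """
    x = x.clone().requires_grad_(True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = error_fn(x)
        loss.backward()
        opt.step()
    return x.detach()

if __name__ == "__main__":
    # 用一个玩具误差函数演示: 将样本拉向零向量
    x0 = torch.randn(4, 8)
    x1 = iterative_error_reduction(x0, lambda x: (x ** 2).mean())
    print((x0 ** 2).mean().item(), (x1 ** 2).mean().item())
```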
zh
[CV-42] RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance IROS2025
【速读】:该论文旨在解决当前基于RGB-D图像的类别级物体位姿估计方法在缺乏深度信息场景下性能显著下降的问题,提出了一种仅依赖RGB图像的新型位姿估计方案。其解决方案的关键在于设计了一个基于Transformer的神经网络架构,通过Transformer预测并融合目标物体的几何特征,并引入几何特征引导算法以确保所预测特征能准确捕捉物体几何结构;最终利用RANSAC-PnP算法实现鲁棒的位姿计算,有效应对物体尺度变化带来的挑战。
链接: https://arxiv.org/abs/2508.13623
作者: Sheng Yu,Di-Hua Zhai,Yuanqing Xia
机构: Beijing Institute of Technology (北京理工大学); Zhongyuan University of Technology (中原工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IROS2025
Abstract:While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they face significant challenges in scenes lacking depth information. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images. This method enables accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a transformer-based neural network for category-level object pose estimation, where the transformer is employed to predict and fuse the geometric features of the target object. To ensure that these predicted geometric features faithfully capture the object’s geometry, we introduce a geometric feature-guided algorithm, which enhances the network’s ability to effectively represent the object’s geometric information. Finally, we utilize the RANSAC-PnP algorithm to compute the object’s pose, addressing the challenges associated with variable object scales in pose estimation. Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also achieves superior accuracy compared to previous RGB-based methods. These promising results offer a new perspective for advancing category-level object pose estimation using RGB images.
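论文最终使用 RANSAC-PnP 从预测的几何特征解算位姿。以下为 OpenCV 中该步骤的通用调用示意(3D-2D 对应点与相机内参均为随机假设,仅演示接口用法,并非论文的完整流程):

```python
import numpy as np
import cv2

# RANSAC-PnP 位姿求解示意: 已知物体坐标系下的 3D 点与其 2D 投影
object_points = np.random.rand(20, 3).astype(np.float32)       # 假设的 3D 点
image_points = np.random.rand(20, 2).astype(np.float32) * 640  # 假设的 2D 观测
K = np.array([[600, 0, 320],
              [0, 600, 240],
              [0, 0, 1]], dtype=np.float32)                     # 假设的相机内参
dist = np.zeros(5, dtype=np.float32)                            # 假设无畸变

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, dist,
    reprojectionError=3.0, iterationsCount=100)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # 旋转向量 -> 旋转矩阵
    n_in = 0 if inliers is None else len(inliers)
    print("R =\n", R, "\nt =", tvec.ravel(), "\ninliers:", n_in)
```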
zh
[CV-43] alkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
【速读】:该论文旨在解决当前音频驱动人脸合成(audio-driven talking head synthesis)模型在跨种族、语言和年龄群体等人类多样性上的泛化能力不足的问题。研究表明,这一局限性源于现有训练数据在规模、质量和多样性方面的不足。解决方案的关键在于构建一个大规模、高质量且多样化的数据集——TalkVid,其包含7729名独特说话者的1244小时视频,并通过多阶段自动化筛选流程确保运动稳定性、美学质量与面部细节;同时,论文进一步提出TalkVid-Bench作为分层评估基准,以揭示传统整体指标掩盖的子群体性能差异,从而推动更具公平性和泛化能力的模型发展。
链接: https://arxiv.org/abs/2508.13618
作者: Shunian Chen,Hejin Huang,Yexin Liu,Zihan Ye,Pengcheng Chen,Chenghao Zhu,Michael Guan,Rongsheng Wang,Junying Chen,Guanbin Li,Ser-Nam Lim,Harry Yang,Benyou Wang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Sun Yat-sen University (中山大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in this https URL
zh
[CV-44] Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm
【速读】:该论文旨在解决智能门禁系统中人脸口罩检测与身份认证的自动化问题,特别是在疫情防控背景下对非接触式访问控制的需求。其解决方案的关键在于构建一个基于树莓派(Raspberry Pi)平台的双因素认证系统,结合面部识别与密码验证,并引入局部二值模式直方图(Local Binary Patterns Histograms, LBPH)算法及其改进版本用于完整人脸和遮挡人脸的检测,实现了陌生人报警、远程通知(通过Telegram)、自动监控激活及门锁状态控制等功能,整体在测试中达到约70%的准确率、80%的精确率和83.26%的召回率,具备良好的实用性和用户接受度。
链接: https://arxiv.org/abs/2508.13617
作者: Zakiah Ayop,Wan Mohamad Hariz Bin Wan Mohamad Rosdi,Looi Wei Hua,Syarulnaziah Anawar,Nur Fadzilah Othman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face mask detection has become increasingly important recently, particularly during the COVID-19 pandemic. Many face detection models have been deployed in IoT-based smart entryways; however, IoT development for face mask detection remains limited. This paper proposes a two-factor authentication system for smart entryway access control using facial recognition and passcode verification, together with an automation process that alerts the owner and activates the surveillance system when a stranger is detected, and that controls the system remotely via Telegram on a Raspberry Pi platform. The system employs the Local Binary Patterns Histograms (LBPH) algorithm for full-face recognition and a modified LBPH algorithm for occluded face detection. On average, the system achieved an Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall of approximately 83.26% across all tested users. The results indicate that the system is capable of conducting face recognition and mask detection, automating remote-control operations such as registering users, locking or unlocking the door, and notifying the owner. Participants in the user acceptance test rated the system highly for future use.
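下面以 OpenCV(opencv-contrib-python)的标准 LBPH 接口给出训练与识别的最小示意;论文针对遮挡人脸的改进版 LBPH 未公开实现细节,此处仅演示标准流程,人脸数据以随机灰度图代替:

```python
import numpy as np
import cv2

# LBPH 人脸识别的最小训练/预测示意(需安装 opencv-contrib-python)
# 此处用随机灰度图代替真实人脸数据, 仅演示接口
faces = [np.random.randint(0, 256, (100, 100), dtype=np.uint8) for _ in range(4)]
labels = np.array([0, 0, 1, 1], dtype=np.int32)

recognizer = cv2.face.LBPHFaceRecognizer_create(radius=1, neighbors=8,
                                                grid_x=8, grid_y=8)
recognizer.train(faces, labels)

probe = faces[0]
label, distance = recognizer.predict(probe)  # distance 越小表示越相似
print(f"predicted label = {label}, distance = {distance:.2f}")
```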
zh
[CV-45] PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction
【速读】:该论文旨在解决当前自动化视频日志(Vlog)生成方法依赖预设脚本、缺乏动态性和个性化表达的问题,以满足短视频和定制化内容日益增长的需求。其解决方案的关键在于提出PersonaVlog框架,该框架基于多模态大语言模型(Multimodal Large Language Models, MLLMs)构建多智能体协作机制,能够根据给定主题和参考图像生成包含视频、背景音乐及内心独白语音的个性化Vlog;同时引入反馈与回滚机制,利用MLLMs对生成结果进行评估并实现多模态内容的迭代自修正,从而显著提升生成效率与创造性质量。
链接: https://arxiv.org/abs/2508.13602
作者: Xiaolu Hou,Bing Ma,Jiaxiang Cheng,Xuhua Ren,Kai Yu,Wenyue Li,Tianxiang Zheng,Qinglin Lu
机构: Tencent Hunyuan (腾讯混元); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages and potential of our framework over several baselines, highlighting its effectiveness and great potential for generating automated Vlogs.
zh
[CV-46] Unleashing Semantic and Geometric Priors for 3D Scene Completion
【速读】:该论文旨在解决当前基于摄像头的3D语义场景补全(3D Semantic Scene Completion, SSC)方法中,由于编码器同时承担语义与几何特征提取任务而导致的性能瓶颈问题。现有方法依赖耦合编码器同时提供语义和几何先验,造成两者在优化过程中存在冲突,限制了整体性能提升。解决方案的关键在于提出FoundationSSC框架,通过双层解耦设计实现特征分离与精细化处理:在源级层面引入基础编码器(foundation encoder),分别向语义分支提供丰富的语义特征先验、向几何分支提供高保真立体代价体积;在路径级层面,通过专用且独立的细化路径对两类先验进行精炼,从而获得更优的语义上下文与深度分布。此外,结合轴向感知融合(Axis-Aware Fusion, AAF)模块,有效解决了多模态特征异构融合难题,最终在SemanticKITTI和SSCBench-KITTI-360上均取得显著性能提升。
链接: https://arxiv.org/abs/2508.13601
作者: Shiyuan Chen,Wei Sui,Bohao Zhang,Zeyd Boukhers,John See,Cong Yang
机构: 1. School of Software, Soochow University (苏州大学软件学院); 2. Institute of Artificial Intelligence, Soochow University (苏州大学人工智能研究院); 3. Department of Computer Science, University of California, Berkeley (加州大学伯克利分校计算机科学系); 4. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (麻省理工学院电气工程与计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, 6 tables
Abstract:Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU. The code will be released upon acceptance.
zh
[CV-47] Towards Efficient Vision State Space Models via Token Merging
【速读】:该论文旨在解决状态空间模型(State Space Models, SSMs)在视觉任务中计算效率不足的问题,尤其是在进行token减少以提升效率时,如何有效保留其序列建模特性。解决方案的关键在于提出一种名为MaMe的token合并策略,该策略通过利用状态转移参数 Δ 作为token重要性度量,并设计特定的token排列方式来保持序列信息的完整性,从而在显著降低token数量的同时维持模型性能,尤其在激进的token压缩下仍表现出鲁棒性。
链接: https://arxiv.org/abs/2508.13599
作者: Jinyoung Park,Minseok Son,Changick Kim
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review
Abstract:State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling characteristics. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter \mathbf\Delta as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation. Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.
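"以 Δ 为重要性度量、合并时保持序列顺序"这一思路可用如下 Python 草图示意(假设性实现:合并策略为本文的简化设计,并非论文原算法;末尾未被覆盖的低重要性 token 直接丢弃):

```python
import torch

def merge_tokens_by_delta(tokens, delta, keep_ratio=0.5):
    """按状态转移参数 Δ 衡量 token 信息量, 合并信息量低的 token。

    tokens: [L, D] 的序列特征; delta: [L] 的 Δ 重要性分数(抽象输入)。
    """
    L = tokens.shape[0]
    keep = max(1, int(L * keep_ratio))
    order = torch.argsort(delta, descending=True)
    keep_idx = torch.sort(order[:keep]).values  # 保留高 Δ token, 并维持原顺序
    merged = []
    cursor = 0
    for i in keep_idx.tolist():
        # 将被丢弃的前驱低重要性 token 平均并入最近的保留 token
        group = tokens[cursor:i + 1]
        merged.append(group.mean(dim=0))
        cursor = i + 1
    return torch.stack(merged)

if __name__ == "__main__":
    toks = torch.randn(16, 32)
    imp = torch.rand(16)
    print(merge_tokens_by_delta(toks, imp).shape)  # torch.Size([8, 32])
```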
zh
[CV-48] Bridging Clear and Adverse Driving Conditions
【速读】:该论文旨在解决自动驾驶(Autonomous Driving, AD)系统在恶劣环境条件(如低光照、降雨、降雪和雾天)下性能显著下降的问题,其根本原因在于AD数据集中此类场景的样本稀缺。为克服获取和标注恶劣天气数据的高昂成本,作者提出了一种新颖的域适应(Domain Adaptation, DA)流水线,通过将清晰天气图像转换为雾、雨、雪及夜间图像来生成高质量合成数据。解决方案的关键在于开发了多种数据生成方法(包括纯仿真、基于GAN和扩散-生成对抗网络混合方法),并创新性地融合模拟与真实图像进行训练:模拟图像提供精确监督信号(完美配对图像),而真实图像则缩小仿真到现实(simulation-to-real, sim2real)的差距;此外,还引入一种自适应融合策略以减少Stable Diffusion图像到图像(img2img)生成结果中的幻觉和伪影,最终在带有对应关系的恶劣条件数据集(Adverse Conditions Dataset with Correspondences, ACDC)上验证了模型在语义分割任务上的整体提升达1.85%,夜间场景提升达4.62%,证明了该混合方法的有效性。
链接: https://arxiv.org/abs/2508.13592
作者: Yoel Shapiro,Yahia Showgan,Koustav Mullick
机构: Bosch(博世)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear-weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data-generation pipelines, including simulation-only, GAN-based, and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation-to-real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable-Diffusion Image-to-Image (img2img) outputs by blending them adaptively with their progenitor images. We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.
zh
[CV-49] Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
【速读】:该论文旨在解决视觉-语言模型在处理信息密集型图像(如图表)并生成结构化输出(如代码)时,因监督微调(Supervised Fine-Tuning, SFT)方法存在性能瓶颈而难以实现深度理解与高质量生成的问题。其核心挑战在于SFT在大规模数据下趋于饱和,无法有效提升复杂任务的性能。解决方案的关键是提出多粒度结构化强化学习(Multimodal Structured Reinforcement Learning, MSRL),该方法通过融合文本级规则奖励与视觉级模型评估奖励构建多层次反馈机制:文本级奖励验证代码细节正确性,视觉级奖励则通过将生成代码渲染为图像并与原始图表比对来评估结构一致性;同时采用两阶段课程训练策略以保障训练稳定性。实验表明,MSRL显著突破了SFT的性能平台,在ChartMimic和ReachQA基准上分别提升高阶指标6.2%和9.9%,达到与先进闭源模型相当的水平。
链接: https://arxiv.org/abs/2508.13587
作者: Lei Chen,Xuanle Zhao,Zhixiong Zeng,Jing Huang,Liming Zheng,Yufeng Zhong,Lin Ma
机构: Meituan(美团)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
Abstract:While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
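多粒度结构化奖励的组合方式可用如下草图说明(规则关键词、视觉相似度的简化计算与权重均为假设;论文的视觉级奖励由评估模型打分,此处仅用基于 MSE 的相似度占位):

```python
import numpy as np

def text_level_reward(generated_code: str, required_keywords) -> float:
    """文本级规则奖励示意: 检查生成代码中是否出现关键绘图要素(规则为假设)。"""
    hits = sum(kw in generated_code for kw in required_keywords)
    return hits / max(1, len(required_keywords))

def visual_level_reward(rendered_img: np.ndarray, target_img: np.ndarray) -> float:
    """视觉级奖励示意: 以像素相似度近似"渲染结果与原图的结构一致性"。"""
    mse = float(((rendered_img - target_img) ** 2).mean())
    return 1.0 / (1.0 + mse)

def msrl_reward(code, keywords, rendered, target, w_text=0.5, w_vis=0.5):
    """多粒度结构化奖励 = 文本级 + 视觉级的加权和(权重为假设)。"""
    return (w_text * text_level_reward(code, keywords)
            + w_vis * visual_level_reward(rendered, target))

if __name__ == "__main__":
    code = "plt.bar(x, y); plt.xlabel('year')"
    r = msrl_reward(code, ["plt.bar", "xlabel"], np.zeros((8, 8)), np.zeros((8, 8)))
    print(f"reward = {r:.3f}")
```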
zh
[CV-50] mporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
【速读】:该论文针对Referring Video Object Segmentation (RVOS)任务中分割头(segmentation head)设计不足的问题展开研究,指出当前方法过度关注特征提取与时间建模,而忽视了分割头对边界精度的优化潜力。解决方案的关键在于:1)提出一种时序条件分割模型,融合现有分割方法以增强边界分割能力;2)利用文本到视频扩散模型进行特征提取,并移除传统噪声预测模块以避免随机性对分割精度的干扰,从而简化结构并提升性能;3)设计Temporal Context Mask Refinement (TCMR)模块,有效弥补VAE在特征提取上的局限性,显著提升分割质量且无需复杂架构。
链接: https://arxiv.org/abs/2508.13584
作者: Ruixin Zhang,Jiaqing Fan,Yifan Liao,Qian Qiao,Fanzhang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
Abstract:Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.
zh
[CV-51] Generative Model-Based Feature Attention Module for Video Action Analysis
【速读】:该论文旨在解决当前视频动作分析方法在特征提取过程中忽视特征语义信息、过度依赖动作提案优化的问题,这限制了其在高精度需求场景(如自动驾驶)中的广泛应用。解决方案的关键在于提出一种基于生成式注意力机制的新模型,通过利用动作前景与背景的差异,同时学习帧级和片段级的时间动作特征语义依赖关系,从而更有效地挖掘特征语义信息,提升智能视频分析的精度与可扩展性。
链接: https://arxiv.org/abs/2508.13565
作者: Guiqin Wang,Peng Zhao,Cong Zhao,Jing Huang,Siyan Guo,Shusen Yang
机构: Xi’an Jiaotong University (西安交通大学); Pazhou Laboratory (琶洲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video action analysis is a foundational technology within the realm of intelligent video comprehension, particularly concerning its application in the Internet of Things (IoT). However, existing methodologies overlook feature semantics in feature extraction and focus on optimizing action proposals, so these solutions are unsuitable for widespread adoption in high-performance IoT applications that demand robust and scalable intelligent video analytics, such as autonomous driving, due to their limitations in precision. To address this issue, we propose a novel generative attention-based model to learn the relation of feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, which effectively exploits feature semantics during feature extraction. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. In the context of action detection tasks, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of the effectiveness of our proposed method to the broader task of video action recognition. Our code is available at this https URL.
zh
[CV-52] The 9th AI City Challenge ICCV2025
【速读】:该论文旨在解决计算机视觉与人工智能在交通、工业自动化和公共安全等实际场景中的应用落地问题,通过构建多任务、多模态的AI City Challenge竞赛平台推动技术进步。其解决方案的关键在于设计四个具有挑战性的赛道:Track 1实现多类3D多摄像头目标跟踪(包括人、类人机器人、自主移动机器人和叉车),依赖高精度标定与3D边界框标注;Track 2引入视频问答与3D注视标签提升交通事件理解能力;Track 3要求AI系统基于RGB-D输入进行细粒度空间推理,融合感知、几何与语言信息回答空间问题;Track 4则聚焦于鱼眼相机下的轻量级道路目标检测,支持边缘设备实时部署。所有数据集均采用NVIDIA Omniverse生成,评估框架通过部分保留测试集和提交限制保障公平性与可复现性,从而有效推动算法创新并设立新基准。
链接: https://arxiv.org/abs/2508.13564
作者: Zheng Tang,Shuo Wang,David C. Anastasiu,Ming-Ching Chang,Anuj Sharma,Quan Kong,Norimasa Kobori,Munkhjargal Gochoo,Ganzorig Batnasan,Munkh-Erdene Otgonbold,Fady Alnajjar,Jun-Wei Hsieh,Tomasz Kornuta,Xiaolong Li,Yilin Zhao,Han Zhang,Subhashree Radhakrishnan,Arihant Jain,Ratnesh Kumar,Vidya N. Murali,Yuxing Wang,Sameer Satish Pusegaonkar,Yizhou Wang,Sujit Biswas,Xunlei Wu,Zhedong Zheng,Pranamesh Chakraborty,Rama Chellappa
机构: NVIDIA Corporation(英伟达公司); Santa Clara University(圣克拉拉大学); University at Albany, SUNY(纽约州立大学阿尔巴尼分校); Iowa State University(爱荷华州立大学); Woven by Toyota(丰田编织公司); United Arab Emirates University(阿联酋大学); National Yang-Ming Chiao-Tung University(国立阳明交通大学); University of Macau(澳门大学); Indian Institute of Technology Kanpur(印度理工学院坎普尔分校); Johns Hopkins University(约翰霍普金斯大学); Emirates Center for Mobility Research(阿联酋交通研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Summary of the 9th AI City Challenge Workshop in conjunction with ICCV 2025
Abstract:The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.
zh
[CV-53] Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics
【速读】:该论文旨在解决3D人体姿态与形状估计中SMPLify方法因迭代优化带来的高计算成本问题,从而限制其实际应用。解决方案的关键在于提出Learnable SMPLify,一个将SMPLify的迭代拟合过程替换为单次前向传播回归模型的神经框架;其核心创新包括:(1)设计时间采样策略以构建连续帧中的初始化-目标配对数据,提升训练有效性;(2)引入以人为中心的归一化和残差学习机制,缩小解空间并增强对多样动作及未见姿态的泛化能力。该方法在保持高精度的同时实现近200倍的加速,并支持序列推理与插件式后处理,具备良好的通用性和实用性。
链接: https://arxiv.org/abs/2508.13562
作者: Yuchen Yang,Linfeng Dong,Wei Wang,Zhihang Zhong,Xiao Sun
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at this https URL.
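单次前向回归替代迭代拟合的思路,可用如下 PyTorch 草图示意:对关键点做以人为中心的归一化,并仅回归对初始位姿的残差修正量。网络结构、输入维度(72 维对应 SMPL 的 24×3 轴角位姿)与归一化细节均为本文假设,并非论文原实现:

```python
import torch
import torch.nn as nn

class ResidualPoseRegressor(nn.Module):
    """单次前向回归的示意: 以人为中心归一化 + 残差学习(假设性结构)。"""

    def __init__(self, pose_dim=72, kp_dim=44 * 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + kp_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, pose_dim),
        )

    def forward(self, init_pose, keypoints, root):
        # 人为中心归一化: 关键点平移到以根关节为原点的坐标系
        kp_centered = (keypoints - root.unsqueeze(1)).flatten(1)
        residual = self.mlp(torch.cat([init_pose, kp_centered], dim=1))
        return init_pose + residual  # 残差学习: 只回归对初始位姿的修正量

if __name__ == "__main__":
    model = ResidualPoseRegressor()
    pose = torch.zeros(2, 72)
    kps = torch.randn(2, 44, 3)
    root = kps[:, 0, :]
    print(model(pose, kps, root).shape)  # torch.Size([2, 72])
```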
zh
[CV-54] DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup ICCV2025
【速读】:该论文旨在解决少样本异常分割(Few-Shot Anomaly Segmentation, FSAS)中模型在未见类别上的跨类别泛化能力不足的问题,尤其是现有视觉-语言模型(Vision-Language Models, VLMs)如CLIP依赖真实可见异常样本进行微调或提示学习,限制了其在无目标数据重训练场景下的应用。解决方案的关键在于提出一种名为DictAS的新框架,其核心创新是通过自监督学习将词典查找(Dictionary Lookup)能力迁移至FSAS任务中,而非单纯记忆训练集中正常与异常特征模式;具体包括:(1) 利用少量正常参考图像构建模拟词典以表征正常特征空间;(2) 采用稀疏查找策略从词典中检索查询区域特征,无法匹配则判定为异常;(3) 引入查询判别正则化机制(包括对比查询约束和文本对齐约束),增强异常特征的不可检索性,从而提升检测性能。
链接: https://arxiv.org/abs/2508.13560
作者: Zhen Qu,Xian Tao,Xinyi Gong,ShiChen Qu,Xiaopei Zhang,Xingang Wang,Fei Shen,Zhengtao Zhang,Mukesh Prasad,Guiguang Ding
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Casivision; Longmen Laboratory; HDU; UTS; UCLA; Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025, Project: this https URL
Abstract:Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) Dictionary Construction - to simulate the index and content of a real dictionary using features from normal reference images. (2) Dictionary Lookup - to retrieve queried region features from the dictionary via a sparse lookup strategy. When a query feature cannot be retrieved, it is classified as an anomaly. (3) Query Discrimination Regularization - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To achieve this, Contrastive Query Constraint and Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
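词典构建与稀疏查找的核心逻辑可用如下 Python 草图示意(相似度度量、top-k 稀疏化与异常分数定义均为本文的简化假设):

```python
import torch
import torch.nn.functional as F

def build_dictionary(normal_patch_feats: torch.Tensor) -> torch.Tensor:
    """由少量正常参考图的 patch 特征构建"词典"(此处直接堆叠并归一化)。"""
    return F.normalize(normal_patch_feats, dim=-1)

def anomaly_scores(query_feats: torch.Tensor, dictionary: torch.Tensor,
                   topk: int = 5) -> torch.Tensor:
    """稀疏查找示意: 每个查询 patch 只与词典中 top-k 最相似条目比对,
    取平均相似度; 检索不到相似条目(相似度低)即判为异常。"""
    q = F.normalize(query_feats, dim=-1)
    sim = q @ dictionary.T                    # [Nq, Nd] 余弦相似度
    topk_sim = sim.topk(topk, dim=-1).values  # 稀疏化: 仅保留 top-k
    return 1.0 - topk_sim.mean(dim=-1)        # 异常分数 = 1 - 平均相似度

if __name__ == "__main__":
    dict_feats = build_dictionary(torch.randn(1024, 256))  # 正常参考特征
    queries = torch.randn(196, 256)                        # 查询图的 patch 特征
    print(anomaly_scores(queries, dict_feats).shape)       # torch.Size([196])
```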
zh
[CV-55] Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)性能落后于卷积神经网络(Convolutional Neural Networks, CNNs)的问题,其根本原因在于脉冲数据的信息容量有限。为克服这一局限,作者提出了一种类神经元编码(Neuron-like Encoding)方法,该方法基于生物神经元的内在工作机制生成脉冲数据,从而提升脉冲信号的信息承载能力;关键创新在于引入人工光感受器层(artificial photoreceptor layer),使脉冲信号能够同时携带颜色和亮度信息,形成完整的视觉脉冲信号,从而在不违背脉冲计算本质的前提下显著增强SNN的性能表现。
链接: https://arxiv.org/abs/2508.13558
作者: Hsieh Ching-Teng,Wang Yuan-Kai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 11 figures
Abstract:In recent years, neuromorphic computing and spiking neural networks (SNNs) have advanced rapidly through integration with deep learning. However, the performance of SNNs still lags behind that of convolutional neural networks (CNNs), primarily due to the limited information capacity of spike-based data. Although some studies have attempted to improve SNN performance by training them with non-spiking inputs such as static images, this approach deviates from the original intent of neuromorphic computing, which emphasizes spike-based information processing. To address this issue, we propose a Neuron-like Encoding method that generates spike data based on the intrinsic operational principles and functions of biological neurons. This method is further enhanced by the incorporation of an artificial photoreceptor layer, enabling spike data to carry both color and luminance information, thereby forming a complete visual spike signal. Experimental results using the Integrate-and-Fire neuron model demonstrate that this biologically inspired approach effectively increases the information content of spike signals and improves SNN performance, all while adhering to neuromorphic principles. We believe this concept holds strong potential for future development and may contribute to overcoming current limitations in neuromorphic computing, facilitating broader applications of SNNs.
zh
[CV-56] A Lightweight Dual-Mode Optimization for Generative Face Video Coding
【速读】:该论文旨在解决生成式人脸视频编码(Generative Face Video Coding, GFVC)在实际部署中因模型参数庞大和计算成本高而受限的问题。其解决方案的关键在于提出一种轻量化的双模式优化框架,通过架构重设计与操作精细化相结合的方式实现复杂度降低并保持重建质量:一方面用更紧凑高效的层替代传统的3×3卷积以减少计算负担而不损失特征表达能力;另一方面采用两阶段自适应通道剪枝策略——训练阶段通过可学习阈值软剪枝识别冗余通道,推理阶段基于掩码硬剪枝永久移除这些通道,从而兼顾训练稳定性与推理效率。实验表明,该方法相比基线模型实现了90.4%的参数压缩和88.9%的计算节省,同时在感知质量指标上优于当前最先进的视频编码标准Versatile Video Coding (VVC)。
链接: https://arxiv.org/abs/2508.13547
作者: Zihan Zhang,Shanzhi Yin,Bolin Chen,Ru-Ling Liao,Shiqi Wang,Yan Ye
机构: City University of Hong Kong (香港城市大学); DAMO Academy, Alibaba Group (阿里巴巴集团达摩院); Hupan Lab (湖畔实验室); Fudan University (复旦大学); DAMO Academy, Alibaba Group (阿里巴巴集团达摩院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization - combining architectural redesign and operational refinement - to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
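两阶段自适应通道剪枝可理解为“训练期可学习阈值的软门控 + 训练后硬掩码”。以下为一个通道门控层的最小 PyTorch 示意(门控形式与超参均为假设,非官方实现):

```python
import torch
import torch.nn as nn

class SoftChannelGate(nn.Module):
    """训练期软剪枝:为每个通道学习重要性得分与阈值(示意实现)。"""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Parameter(torch.ones(channels))   # 通道重要性
        self.tau = nn.Parameter(torch.zeros(1))           # 可学习阈值

    def forward(self, x):                                 # x: [B, C, H, W]
        gate = torch.sigmoid(10.0 * (self.score - self.tau))  # 平滑近似 0/1 门
        return x * gate.view(1, -1, 1, 1)

    def hard_mask(self):
        """训练结束后导出硬掩码,据此永久删除冗余通道。"""
        return (self.score > self.tau).detach()

gate = SoftChannelGate(64)
y = gate(torch.randn(2, 64, 16, 16))
keep = gate.hard_mask()          # True 的通道保留,False 的通道剪除
print(int(keep.sum()), "/ 64 channels kept")
```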
zh
[CV-57] GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)中视点渲染(foveated rendering)因依赖昂贵的基于硬件的眼动追踪系统而难以广泛部署的问题,这些问题包括成本高、校准复杂以及硬件兼容性限制。解决方案的关键在于提出一种纯软件方法 GazeProphet,其核心创新是结合了用于处理360度VR场景的球面视觉Transformer(Spherical Vision Transformer)与基于长短期记忆网络(LSTM)的时间编码器,以捕捉注视序列的动态模式;并通过一个多模态融合网络整合空间场景特征与时间注视动力学,实现对未来注视位置的精准预测并提供置信度估计。实验表明,该方法在保持跨不同空间区域和场景类型一致性的同时,将中位角误差降低至3.83度,相比传统基于显著性的基线模型提升24%,且统计显著,从而为无需额外硬件的VR视点渲染提供了可行方案。
链接: https://arxiv.org/abs/2508.13546
作者: Farhaan Ebadulla,Chiraag Mudlapur,Gaurav BV
机构: PES University (PES大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures
Abstract:Foveated rendering significantly reduces computational demands in virtual reality applications by concentrating rendering quality where users focus their gaze. Current approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments without requiring dedicated eye tracking hardware. The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder that captures gaze sequence patterns. A multi-modal fusion network integrates spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Experimental evaluation on a comprehensive VR dataset demonstrates that GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% while providing reliable confidence calibration. The approach maintains consistent performance across different spatial regions and scene types, enabling practical deployment in VR systems without additional hardware requirements. Statistical analysis confirms the significance of improvements across all evaluation metrics. These results show that software-only gaze prediction can work for VR foveated rendering, making this performance boost more accessible to different VR platforms and apps.
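GazeProphet 的“时间编码 + 多模态融合”部分可用一个极简结构示意:LSTM 编码历史注视序列,与场景特征拼接后由 MLP 输出未来注视位置及置信度。以下结构中的维度均为假设,且省略了球面 ViT 场景编码器:

```python
import torch
import torch.nn as nn

class GazeFusionSketch(nn.Module):
    """时间注视编码 + 场景特征融合的极简示意(非官方实现)。"""
    def __init__(self, scene_dim=256, hidden=128):
        super().__init__()
        self.temporal = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.fusion = nn.Sequential(
            nn.Linear(hidden + scene_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))                 # (x, y) 注视点 + 置信度 logit

    def forward(self, gaze_seq, scene_feat):
        _, (h, _) = self.temporal(gaze_seq)       # 取 LSTM 末状态作为时间注视表征
        out = self.fusion(torch.cat([h[-1], scene_feat], dim=-1))
        xy, conf = out[:, :2], torch.sigmoid(out[:, 2])
        return xy, conf

model = GazeFusionSketch()
xy, conf = model(torch.rand(4, 16, 2), torch.rand(4, 256))  # 16 帧历史注视 + 场景特征
print(xy.shape, conf.shape)
```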
zh
[CV-58] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在图像和三维重建任务中存在的三个核心问题:缺乏频率选择性、空间定位能力不足以及稀疏表示缺失,这些问题导致模型过度依赖冗余信号成分,并表现出频谱偏差(spectral bias),即优先学习低频分量而难以捕捉高频细节。其解决方案的关键在于提出FLAIR(Frequency- and Locality-Aware Implicit Neural Representations),包含两项创新:一是RC-GAUSS激活函数,在满足时频不确定性原理(Time-Frequency Uncertainty Principle, TFUP)约束下实现显式的频率选择与空间定位;二是基于离散小波变换(Discrete Wavelet Transform, DWT)的能量引导编码机制(Wavelet-Energy-Guided Encoding, WEGE),通过计算能量得分显式地将频率信息注入网络,从而提升对高频细节的建模能力。
链接: https://arxiv.org/abs/2508.13544
作者: Sukhun Ko,Dahyeon Kye,Kyle Min,Chanho Eom,Jihyong Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Please visit our project page at this https URL
Abstract:Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
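WEGE 的核心是用离散小波变换统计各频带能量并作为频率引导信号。以下用 PyWavelets 给出单层 Haar DWT 能量得分的计算示意(子带划分与归一化方式为简化假设):

```python
import numpy as np
import pywt

def wavelet_energy_scores(image):
    """对 2D 图像做单层 Haar DWT,返回 LL/LH/HL/HH 四个子带的能量占比(示意)。"""
    ll, (lh, hl, hh) = pywt.dwt2(image, "haar")
    energies = np.array([(b ** 2).sum() for b in (ll, lh, hl, hh)])
    return energies / energies.sum()   # 归一化为能量得分

img = np.random.rand(64, 64)
print(dict(zip(["LL", "LH", "HL", "HH"], wavelet_energy_scores(img))))
```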
zh
[CV-59] EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors
【速读】:该论文旨在解决基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)的头部虚拟形象重建中难以精确捕捉细微面部表情变化以及在高度可变形区域保持局部纹理连续性的问题。其解决方案的关键在于提出了一种名为EAvatar的新框架,该框架兼具表达感知(expression-aware)与形变感知(deformation-aware)特性:首先引入稀疏表达控制机制,通过少量关键高斯点影响邻近高斯点的形变,从而实现局部形变和细粒度纹理过渡的精准建模;其次利用预训练生成模型提供的高质量三维先验(3D prior),为面部几何提供结构引导,显著提升训练过程中的收敛稳定性与形状准确性。
链接: https://arxiv.org/abs/2508.13537
作者: Shikun Zhang,Cunjian Chen,Yiqun Wang,Qiuhong Ke,Yong Li
机构: Monash University (蒙纳士大学); Chongqing University (重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures
Abstract:High-fidelity head avatar reconstruction plays a crucial role in AR/VR, gaming, and multimedia content creation. Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry with real-time rendering capability and are now widely used in high-fidelity head avatar reconstruction tasks. However, existing 3DGS-based methods still face significant challenges in capturing fine-grained facial expressions and preserving local texture continuity, especially in highly deformable regions. To mitigate these limitations, we propose a novel 3DGS-based framework termed EAvatar for head reconstruction that is both expression-aware and deformation-aware. Our method introduces a sparse expression control mechanism, where a small number of key Gaussians are used to influence the deformation of their neighboring Gaussians, enabling accurate modeling of local deformations and fine-scale texture transitions. Furthermore, we leverage high-quality 3D priors from pretrained generative models to provide a more reliable facial geometry, offering structural guidance that improves convergence stability and shape accuracy during training. Experimental results demonstrate that our method produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity.
zh
[CV-60] MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
【速读】:该论文旨在解决机器人从单个RGB-D人类视频中学习工具操作技能后,如何实现对功能相似但几何形态各异的新工具的泛化问题,即克服因工具间“类内功能差异”(intra-function variations)导致的技能迁移障碍。解决方案的关键在于提出MimicFunc框架,其核心创新是构建以功能为中心的局部坐标系——功能帧(function frame),通过关键点抽象实现跨工具的功能级对应关系建模,从而在无需额外示教数据的情况下,使机器人能将人类示范技能高效迁移到新型工具上完成等效任务。
链接: https://arxiv.org/abs/2508.13534
作者: Chao Tang,Anxing Xiao,Yuhong Deng,Tianrun Hu,Wenlong Dong,Hanbo Zhang,David Hsu,Hong Zhang
机构: Southern University of Science and Technology(南方科技大学); National University of Singapore(新加坡国立大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CoRL 2025
Abstract:Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc’s one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at this https URL.
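“功能帧”本质上是由关键点构造的、以功能为中心的局部坐标系。以下示意如何由三个 3D 关键点构建右手正交坐标系(此处假设三点分别为功能点、柄部点与一个法向参考点,关键点语义为示意假设,非论文官方定义):

```python
import numpy as np

def build_function_frame(p_func, p_handle, p_ref):
    """由三个 3D 关键点构造右手正交坐标系(示意)。
    返回 4x4 齐次变换,原点设在功能点 p_func。
    """
    x = p_handle - p_func
    x /= np.linalg.norm(x)                     # x 轴:功能点指向柄部
    z = np.cross(x, p_ref - p_func)
    z /= np.linalg.norm(z)                     # z 轴:由参考点确定的法向
    y = np.cross(z, x)                         # y 轴:保证右手系
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)    # 列向量为坐标轴
    T[:3, 3] = p_func
    return T

T = build_function_frame(np.array([0., 0., 0.]),
                         np.array([0.2, 0., 0.]),
                         np.array([0., 0.1, 0.]))
print(np.round(T, 3))
```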
zh
[CV-61] Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models
【速读】:该论文旨在解决生成式 AI (Generative AI) 在低质量图像情绪识别任务中的性能瓶颈问题,特别是针对视觉-语言模型(Vision-Language Models, VLMs)在处理噪声大、分辨率低的面部表情数据时表现不佳的现象。其关键解决方案是提出一种新颖的流水线架构,将基于GFPGAN的图像修复技术与情绪识别评估相结合,以缓解VLM训练假设与FER数据噪声特性之间的不匹配问题,从而提升模型在Fer-2013数据集上的识别准确率,并为未来研究提供可复现的基准。
链接: https://arxiv.org/abs/2508.13524
作者: Vamsi Krishna Mulukutla,Sai Supriya Pavarala,Srinivasa Raju Rudraraju,Sridevi Bonthu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
zh
[CV-62] Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency CVPR
【速读】:该论文旨在解决深度学习中训练样本分布与真实数据分布之间的差距问题,这一差距通常由采样偏差、噪声以及数据异构性(data heterogeneity)和类别不平衡(class imbalance)等因素引起。解决方案的关键在于利用预训练的视觉基础模型(如CLIP、DINOv2)提取特征时所展现出的几何形状具有跨域和跨数据集的可迁移性(geometric shape transferability),并基于此构建几何知识引导的分布校准框架(geometric knowledge-guided distribution calibration)。该框架在联邦学习场景下通过隐私约束下的全局几何形状获取来生成客户端新样本以弥合局部与全局观测差异,在长尾识别任务中则借助富样本类别的几何知识恢复稀缺尾部类别的真实分布,从而有效缓解因数据异构性和样本不平衡导致的信息缺失问题,并显著提升多个基准上的性能表现。
链接: https://arxiv.org/abs/2508.13518
作者: Yanbiao Ma,Wei Dai,Bowei Liu,Jiayi Chen,Wenke Huang,Guancheng Wan,Zhiwu Lu,Junchi Yan
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); Tsinghua University(清华大学); Xidian University(西安电子科技大学); Wuhan University(武汉大学); Shanghai Jiao Tong University(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, CVPR Oral
Abstract:Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. There are multiple reasons for this gap, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging the off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique of acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.
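以长尾识别为例,“借用样本充足类别的几何形状”可类比经典的分布校准:用头部类的协方差校准尾部类的高斯分布后重新采样。以下为最小 NumPy 示意(混合系数 alpha 等为假设,并非论文公式):

```python
import numpy as np

def calibrate_and_sample(tail_feats, head_feats, n_new=100, alpha=0.5):
    """用头部类的协方差校准尾部类分布并采样新特征(示意实现)。
    tail_feats: [n_tail, C] 稀缺类别的少量特征
    head_feats: [n_head, C] 样本充足类别的特征
    """
    mu = tail_feats.mean(axis=0)                       # 均值仍取自尾部类自身
    cov_tail = np.cov(tail_feats, rowvar=False)
    cov_head = np.cov(head_feats, rowvar=False)
    cov = alpha * cov_tail + (1 - alpha) * cov_head    # 迁移头部类的几何形状
    cov += 1e-4 * np.eye(cov.shape[0])                 # 数值稳定
    return np.random.multivariate_normal(mu, cov, size=n_new)

new_feats = calibrate_and_sample(np.random.randn(5, 16), np.random.randn(500, 16))
print(new_feats.shape)  # (100, 16)
```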
zh
[CV-63] 2D Gaussians Meet Visual Tokenizer
【速读】:该论文旨在解决现有基于量化(quantization)的图像分词器(image tokenizer)在生成式 AI (Generative AI) 图像建模中对几何结构建模能力不足的问题。传统方法如 VQ-GAN 主要依赖于纹理和颜色等外观特征,其基于补丁(patch-based)的设计难以有效捕捉图像中的空间结构信息。解决方案的关键在于提出一种名为视觉高斯量化(Visual Gaussian Quantization, VGQ)的新分词框架,该框架通过将图像潜在表示编码为二维高斯分布(2D Gaussian distributions),显式建模位置、旋转和尺度等结构参数,从而增强对几何与空间结构的表达能力。VGQ 在 ImageNet 256×256 数据集上实现了优异的重建质量(rFID=0.556,PSNR=24.93),显著优于现有方法,并且通过提升高斯密度可灵活权衡令牌效率与视觉丰富性。
链接: https://arxiv.org/abs/2508.13515
作者: Yiang Shi,Xiaoyang Guo,Wei Yin,Mingkai Jia,Qian Zhang,Xiaolin Hu,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Horizon Robotics ( horizon机器人); Department of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon.
zh
[CV-64] Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models
【速读】:该论文旨在解决双打羽毛球比赛中基于姿态的击球识别研究不足的问题,尤其是由于数据获取困难和多人跟踪挑战导致以往研究多集中于单打场景。其解决方案的关键在于:首先利用ViT-Pose从单打数据集中提取关键点,并通过基于ST-GCN的对比学习框架进行嵌入表示;其次引入定制化的多目标跟踪算法以缓解因球员快速移动和重叠导致的ID切换问题;最后采用Transformer分类器根据学习到的嵌入特征判断击球事件的发生。该方法实现了从单打训练模型向双打场景的有效迁移,为双打羽毛球分析提供了可行的技术路径。
链接: https://arxiv.org/abs/2508.13507
作者: Seungheon Baek,Jinhyuk Yun
机构: Soongsil University (崇实大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures
Abstract:Badminton is known as one of the fastest racket sports in the world. Despite doubles matches being more prevalent in international tournaments than singles, previous research has mainly focused on singles due to the challenges in data availability and multi-person tracking. To address this gap, we designed an approach that transfers singles-trained models to doubles analysis. We extracted keypoints from the ShuttleSet single matches dataset using ViT-Pose and embedded them through a contrastive learning framework based on ST-GCN. To improve tracking stability, we incorporated a custom multi-object tracking algorithm that resolves ID switching issues from fast and overlapping player movements. A Transformer-based classifier then determines shot occurrences based on the learned embeddings. Our findings demonstrate the feasibility of extending pose-based shot recognition to doubles badminton, broadening analytics capabilities. This work establishes a foundation for doubles-specific datasets to enhance understanding of this predominant yet understudied format of the fast racket sport.
zh
[CV-65] AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes ICCV2025
【速读】:该论文旨在解决动态场景下高动态范围成像(High Dynamic Range Imaging, HDR)中曝光参数(快门速度与ISO)选择不当导致的图像质量下降问题,特别是现有方法未能充分考虑快门速度与ISO之间的复杂交互关系以及运动模糊效应。解决方案的关键在于提出一种基于强化学习的自适应曝光优化方法AdaptiveAE,其通过引入包含运动模糊和噪声模拟的图像合成流程,并结合语义信息与曝光直方图特征进行训练,能够根据用户设定的总曝光时间预算自适应地生成最优ISO与快门速度序列,从而在动态环境中实现优于传统方案的HDR重建性能。
链接: https://arxiv.org/abs/2508.13503
作者: Tianyi Xu,Fan Zhang,Boxin Shi,Tianfan Xue,Yujin Wang
机构: Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (北京大学计算机学院多媒体信息处理国家重点实验室); National Engineering Research Center of Visual Technology, School of Computer Science, Peking University (北京大学计算机学院视觉技术国家工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to ICCV 2025
Abstract:Mainstream high dynamic range imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is crucial for achieving high-quality HDR, as high ISO values introduce significant noise, while long shutter speeds can lead to noticeable motion blur. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes. In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and exposure histograms. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, and find a better exposure schedule than traditional solutions. Experimental results across multiple datasets demonstrate that it achieves the state-of-the-art performance.
zh
[CV-66] Multi-view Clustering via Bi-level Decoupling and Consistency Learning
【速读】:该论文旨在解决多视图聚类中因忽视面向聚类的表示学习而导致的特征区分性不足与簇内紧凑性不强的问题。其解决方案的关键在于提出了一种双层解耦与一致性学习框架(Bi-level Decoupling and Consistency Learning, BDCL),通过三个核心模块实现:1)多视图实例学习模块利用重建自动编码器和对比学习对齐视图间一致信息并保留私有特征;2)特征与簇的双层解耦机制分别增强特征空间和簇空间的判别能力;3)一致性学习模块将样本及其邻居在不同视图下的聚类分配视为正样本对,学习聚类分配的一致性并压缩簇内空间。该方法显著提升了多视图聚类的性能。
链接: https://arxiv.org/abs/2508.13499
作者: Shihao Dong,Yuhui Zheng,Huiying Xu,Xinzhong Zhu
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Qinghai Normal University (青海师范大学); Zhejiang Normal University (浙江师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Multi-view clustering has shown to be an effective method for analyzing underlying patterns in multi-view data. The performance of clustering can be improved by learning the consistency and complementarity between multi-view features, however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore the effective representation for multi-view data to enhance inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns the consistent information while preserving the private features between views through reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of feature space and cluster space. 3) The consistency learning module treats the different views of the sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with the SOTA methods. Our code is published on this https URL.
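其中“一致性学习”可落到一个很小的损失项上:约束同一样本在不同视图下的软聚类分配尽量一致。以下用对称 KL 散度给出示意(并非论文的具体公式):

```python
import torch
import torch.nn.functional as F

def assignment_consistency_loss(p_view1, p_view2):
    """两视图聚类分配的一致性损失(对称 KL,示意实现)。
    p_view1/p_view2: [B, K] 每个样本在 K 个簇上的软分配。"""
    kl = lambda p, q: (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(-1)
    return (kl(p_view1, p_view2) + kl(p_view2, p_view1)).mean() / 2

p1 = F.softmax(torch.randn(32, 10), dim=-1)
p2 = F.softmax(torch.randn(32, 10), dim=-1)
print(float(assignment_consistency_loss(p1, p2)))
```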
zh
[CV-67] ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments
【速读】:该论文旨在解决在重复环境(repetitive environments)中,基于外观特征的回环检测(loop closure detection)易产生误检(false positive detections)的问题,从而导致SLAM系统漂移累积或全局重定位失败。其解决方案的关键在于引入历史轨迹作为先验约束(trajectory prior),通过在候选回环下进行位姿图优化(pose-graph optimization)后,评估该优化轨迹与无回环假设下的轨迹一致性,以判断回环是否可信。该方法称为ROVER,有效提升了复杂场景下的回环验证鲁棒性。
链接: https://arxiv.org/abs/2508.13488
作者: Jingwen Yu,Jiayi Yang,Anjun Hu,Jiankun Wang,Ping Tan,Hong Zhang
机构: CKS Robotics Institute, Hong Kong University of Science and Technology, Hong Kong SAR, China (机器人与计算机视觉研究中心,香港科技大学,中国香港特别行政区); Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen, China (深圳市机器人与计算机视觉重点实验室,南方科技大学,深圳,中国); The University of Tokyo, Tokyo, Japan (东京大学,日本东京); Department of Electronic and Electrical Engineering, Southern University of Science and Technology, China (南方科技大学电子与电气工程系,中国)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures
Abstract:Loop closure detection is important for simultaneous localization and mapping (SLAM), which associates current observations with historical keyframes, achieving drift correction and global relocalization. However, a falsely detected loop can be fatal, and this is especially difficult in repetitive environments where appearance-based features fail due to the high similarity. Therefore, verification of a loop closure is a critical step in avoiding false positive detections. Existing works in loop closure verification predominantly focus on learning invariant appearance features, neglecting the prior knowledge of the robot’s spatial-temporal motion cue, i.e., trajectory. In this letter, we propose ROVER, a loop closure verification method that leverages the historical trajectory as a prior constraint to reject false loops in challenging repetitive environments. For each loop candidate, it is first used to estimate the robot trajectory with pose-graph optimization. This trajectory is then submitted to a scoring scheme that assesses its compliance with the trajectory without the loop, which we refer to as the trajectory prior, to determine if the loop candidate should be accepted. Benchmark comparisons and real-world experiments demonstrate the effectiveness of the proposed method. Furthermore, we integrate ROVER into state-of-the-art SLAM systems to verify its robustness and efficiency. Our source code and self-collected dataset are available at this https URL.
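ROVER 的打分思想可以抽象为:比较“加入候选回环做位姿图优化后”的轨迹与不含该回环的先验轨迹,偏离过大则拒绝。以下为一个高度简化的打分示意(省略了位姿图优化本身,阈值与仅用平移分量均为假设):

```python
import numpy as np

def loop_score(traj_prior, traj_with_loop, thresh=0.5):
    """轨迹先验一致性打分(示意):traj_*: [T, 3] 位姿平移部分。
    返回平均偏差与是否接受该候选回环(阈值为假设)。"""
    dev = np.linalg.norm(traj_prior - traj_with_loop, axis=1).mean()
    return dev, dev < thresh

prior = np.cumsum(np.random.randn(100, 3) * 0.01, axis=0)   # 无回环的先验轨迹
candidate = prior + np.random.randn(100, 3) * 0.02          # 占位:回环优化后的轨迹
dev, accept = loop_score(prior, candidate)
print(f"mean deviation={dev:.3f}, accept={accept}")
```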
zh
[CV-68] CORENet: Cross-Modal 4D Radar Denoising Network with LiDAR Supervision for Autonomous Driving IROS2025
【速读】:该论文旨在解决4D雷达点云数据因稀疏性和噪声问题导致的感知性能受限难题。其核心解决方案是提出一种名为CORENet的跨模态去噪框架,关键在于利用LiDAR(光探测和测距)数据作为监督信号,在训练阶段识别雷达中的噪声模式并提取判别性特征,从而提升原始4D雷达数据的质量;该方法设计为即插即用架构,可无缝集成到基于体素(voxel-based)的目标检测框架中,且仅在训练时依赖LiDAR监督,推理阶段完全基于雷达数据运行,保证了实际应用中的鲁棒性和实用性。
链接: https://arxiv.org/abs/2508.13485
作者: Fuyang Liu,Jilin Mei,Fangyuan Mao,Chen Min,Yan Xing,Yu Hu
机构: Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); University of Chinese Academy of Sciences(中国科学院大学); Beijing Institute of Control Engineering(北京控制工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, Accepted to IROS 2025
Abstract:4D radar-based object detection has garnered great attention for its robustness in adverse weather conditions and capacity to deliver rich spatial information across diverse driving scenarios. Nevertheless, the sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception. To address the limitation, we present CORENet, a novel cross-modal denoising framework that leverages LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Designed as a plug-and-play architecture, our solution enables seamless integration into voxel-based detection frameworks without modifying existing pipelines. Notably, the proposed method only utilizes LiDAR data for cross-modal supervision during training while maintaining full radar-only operation during inference. Extensive evaluation on the challenging Dual-Radar dataset, which is characterized by elevated noise level, demonstrates the effectiveness of our framework in enhancing detection robustness. Comprehensive experiments validate that CORENet achieves superior performance compared to existing mainstream approaches.
zh
[CV-69] FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention IJCNN2025
【速读】:该论文旨在解决微表情识别(Micro-expression Recognition, MER)中因微表情持续时间短、强度低而导致的特征提取困难问题,尤其在细粒度时空特征建模方面的挑战。解决方案的关键在于提出一种基于多任务学习与分层注意力机制的融合模型FAMNet,该模型通过结合2D卷积神经网络(2D CNN)AMNet2D与3D卷积神经网络(3D CNN)AMNet3D,充分利用两类网络在空间和时序特征提取上的优势;同时,在训练过程中采用参数硬共享策略联合优化MER任务与面部动作单元检测(Facial Action Unit Detection, FAUD)任务,从而增强模型对微表情关键特征的感知能力与泛化性能,显著提升了多个公开数据集上的识别准确率。
链接: https://arxiv.org/abs/2508.13483
作者: Liangyu Fu,Xuecheng Wu,Danlei Huang,Xinyi Yin
机构: Northwestern Polytechnical University (西北工业大学); Xi’an Jiaotong University (西安交通大学); Zhengzhou University (郑州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures. Accepted to IJCNN 2025
Abstract:Micro-expression recognition (MER) has essential application value in many fields, but the short duration and low intensity of micro-expressions (MEs) bring considerable challenges to MER. The current MER methods in deep learning mainly include three data loading methods: static images, dynamic image sequence, and a combination of the two streams. How to effectively extract MEs’ fine-grained and spatiotemporal features has been difficult to solve. This paper proposes a new MER method based on multi-task learning and hierarchical attention, which fully extracts MEs’ omni-directional features by merging 2D and 3D CNNs. The fusion model consists of a 2D CNN AMNet2D and a 3D CNN AMNet3D, with similar structures consisting of a shared backbone network Resnet18 and attention modules. During training, the model adopts different data loading methods to adapt to the two specific networks respectively, jointly trains on the tasks of MER and facial action unit detection (FAUD), and adopts parameter hard sharing for information association, which further improves the effect of the MER task; the final fused model is called FAMNet. Extensive experimental results show that our proposed FAMNet significantly improves task performance. On the SAMM, CASME II and MMEW datasets, FAMNet achieves 83.75% (UAR) and 84.03% (UF1). Furthermore, on the challenging CAS(ME)^3 dataset, FAMNet achieves 51% (UAR) and 43.42% (UF1).
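参数硬共享(hard parameter sharing)即多个任务共用同一主干、各自拥有任务头。以下为 MER + FAUD 双头结构的极简示意(此处用单层线性层代替论文中的 ResNet18 主干,维度与类别数均为假设):

```python
import torch
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """参数硬共享的多任务结构示意:共享主干 + MER/FAUD 两个任务头(非官方实现)。"""
    def __init__(self, feat_dim=512, n_emotions=5, n_aus=12):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(),
                                      nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
        self.mer_head = nn.Linear(feat_dim, n_emotions)   # 微表情分类
        self.faud_head = nn.Linear(feat_dim, n_aus)       # 动作单元多标签检测

    def forward(self, x):
        f = self.backbone(x)                              # 两任务共享同一特征
        return self.mer_head(f), self.faud_head(f)

model = HardSharedMTL()
x = torch.rand(4, 3, 64, 64)
mer_logits, au_logits = model(x)
loss = nn.functional.cross_entropy(mer_logits, torch.randint(0, 5, (4,))) \
     + nn.functional.binary_cross_entropy_with_logits(au_logits, torch.rand(4, 12).round())
loss.backward()   # 两个任务的梯度共同更新共享主干
```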
zh
[CV-70] Enhancing Robustness of Implicit Neural Representations Against Weight Perturbations
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在实际部署中因网络权重受到不可避免扰动而导致的重建质量显著下降的问题。其解决方案的关键在于将鲁棒性建模为最小化有无权重扰动时损失函数的差异,并由此推导出一种新的鲁棒损失函数,该函数通过调控重建损失关于权重的梯度来增强模型对扰动的抵抗能力,从而在多种模态的重建任务中实现高达7.5 dB的峰值信噪比(PSNR)提升。
链接: https://arxiv.org/abs/2508.13481
作者: Wenyong Zhou,Yuxin Cheng,Zhengwu Liu,Taiqiang Wu,Chen Zhang,Ngai Wong
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 7 figures
Abstract:Implicit Neural Representations (INRs) encode discrete signals in a continuous manner using neural networks, demonstrating significant value across various multimedia applications. However, the vulnerability of INRs presents a critical challenge for their real-world deployments, as the network weights might be subjected to unavoidable perturbations. In this work, we investigate the robustness of INRs for the first time and find that even minor perturbations can lead to substantial performance degradation in the quality of signal reconstruction. To mitigate this issue, we formulate the robustness problem in INRs by minimizing the difference between loss with and without weight perturbations. Furthermore, we derive a novel robust loss function to regulate the gradient of the reconstruction loss with respect to weights, thereby enhancing the robustness. Extensive experiments on reconstruction tasks across multiple modalities demonstrate that our method achieves up to a 7.5 dB improvement in peak signal-to-noise ratio (PSNR) values compared to original INRs under noisy conditions.
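该鲁棒损失的直觉可由一阶泰勒展开说明:L(w+δ) − L(w) ≈ δᵀ∇_w L,因此惩罚 ‖∇_w L‖² 即可压低权重扰动带来的损失变化。以下为 PyTorch 双重反向传播的单步训练示意(网络结构与 λ 均为假设,非论文官方实现):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
coords, target = torch.rand(256, 2), torch.rand(256, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-3                                       # 梯度惩罚强度(假设值)

recon = ((model(coords) - target) ** 2).mean()   # 重建损失
grads = torch.autograd.grad(recon, list(model.parameters()), create_graph=True)
grad_norm = sum(g.pow(2).sum() for g in grads)   # ||∇_w L||^2
loss = recon + lam * grad_norm                   # 鲁棒损失:重建项 + 权重梯度正则
opt.zero_grad(); loss.backward(); opt.step()
print(float(recon), float(grad_norm))
```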
zh
[CV-71] AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results
【速读】:该论文旨在解决从单张低动态范围(Low Dynamic Range, LDR)图像中重建高动态范围(High Dynamic Range, HDR)图像的逆色调映射(Inverse Tone Mapping, ITM)问题,核心挑战在于提升重建结果的感知保真度(perceptual fidelity)与数值一致性。解决方案的关键在于提出并验证了一系列创新的ITM算法策略,通过67名参赛者提交的319个有效结果进行系统评估,入选前五名的方案中PU21-PSNR最低者亦达到29.22 dB,显著提升了HDR重建质量,并为后续研究建立了坚实的技术基准。
链接: https://arxiv.org/abs/2508.13479
作者: Chao Wang,Francesco Banterle,Bin Ren,Radu Timofte,Xin Lu,Yufeng Peng,Chengjie Ge,Zhijing Sun,Ziang Zhou,Zihao Li,Zishun Liao,Qiyu Kang,Xueyang Fu,Zheng-Jun Zha,Zhijing Sun,Xingbo Wang,Kean Liu,Senyan Xu,Yang Qiu,Yifan Ding,Gabriel Eilertsen,Jonas Unger,Zihao Wang,Ke Wu,Jinshan Pan,Zhen Liu,Zhongyang Li,Shuaicheng Liu,S.M Nadim Uddin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of 67 participants submitted 319 valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.
zh
[CV-72] Distribution-Aware Hadamard Quantization for Hardware-Efficient Implicit Neural Representations
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在硬件部署时因依赖全精度数值表示而导致的显著硬件开销问题。现有量化方法主要局限于权重量化,未能有效利用激活量化的潜力,从而限制了硬件资源的节省。其解决方案的关键在于提出一种分布感知的哈达玛量化(Distribution-aware Hadamard Quantization, DHQ)方案,通过哈达玛变换(Hadamard transformation)将INR中不同层(尤其是首尾层)权重与激活的异构分布统一转换为近似高斯形状,从而支持标准化的量化策略,实现对权重和激活的同时高效量化。该方法在FPGA上的实现验证了其优越性,在图像重建任务中相较全精度模型可降低32.7%延迟、40.1%能耗及最高98.3%资源占用。
链接: https://arxiv.org/abs/2508.13478
作者: Wenyong Zhou,Jiachen Ren,Taiqiang Wu,Yuxin Cheng,Zhengwu Liu,Ngai Wong
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 7 figures
Abstract:Implicit Neural Representations (INRs) encode discrete signals using Multi-Layer Perceptrons (MLPs) with complex activation functions. While INRs achieve superior performance, they depend on full-precision number representation for accurate computation, resulting in significant hardware overhead. Previous INR quantization approaches have primarily focused on weight quantization, offering only limited hardware savings due to the lack of activation quantization. To fully exploit the hardware benefits of quantization, we propose DHQ, a novel distribution-aware Hadamard quantization scheme that targets both weights and activations in INRs. Our analysis shows that the weights in the first and last layers have distributions distinct from those in the intermediate layers, while the activations in the last layer differ significantly from those in the preceding layers. Instead of customizing quantizers individually, we utilize the Hadamard transformation to standardize these diverse distributions into a unified bell-shaped form, supported by both empirical evidence and theoretical analysis, before applying a standard quantizer. To demonstrate the practical advantages of our approach, we present an FPGA implementation of DHQ that highlights its hardware efficiency. Experiments on diverse image reconstruction tasks show that DHQ outperforms previous quantization methods, reducing latency by 32.7%, energy consumption by 40.1%, and resource utilization by up to 98.3% compared to full-precision counterparts.
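DHQ 的流程可概括为:先用哈达玛变换把分布整形为近似钟形,再套用标准均匀量化器,最后逆变换还原。以下用 SciPy 的哈达玛矩阵给出一维向量上的示意实现(位宽与缩放方式为假设,未涉及论文的 FPGA 硬件细节):

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_quantize(x, bits=4):
    """对长度为 2 的幂的向量做哈达玛域均匀量化(示意实现)。"""
    n = x.shape[-1]
    H = hadamard(n) / np.sqrt(n)           # 正交归一化哈达玛矩阵(对称,H @ H = I)
    y = x @ H                              # 变换后分布更接近钟形,便于统一量化
    scale = np.abs(y).max() / (2 ** (bits - 1) - 1)
    q = np.round(y / scale) * scale        # 标准均匀量化器
    return q @ H                           # 逆变换回原空间

x = np.random.laplace(size=64)             # 模拟重尾分布的权重/激活
err = np.mean((hadamard_quantize(x) - x) ** 2)
print("MSE:", err)
```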
zh
[CV-73] MINR: Efficient Implicit Neural Representations for Multi-Image Encoding
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在多图像编码场景下存在的计算与存储效率低下问题。传统方法为每张图像单独训练一个神经网络(通常为多层感知机,MLP),导致参数冗余和资源浪费。解决方案的关键在于提出一种名为MINR的方法,通过共享中间层权重来实现多图像的高效编码:研究发现不同INR模型中对应层的权重分布高度相似,因此将中间层设为跨图像共享,同时保留输入层和输出层作为输入特定,并为每张图像设计一个额外的投影层以捕捉其独特特征。实验表明,MINR可在保持相近重建性能的前提下减少高达60%的参数量,并有效扩展至100张图像的场景,平均峰值信噪比(PSNR)维持在34 dB。
链接: https://arxiv.org/abs/2508.13471
作者: Wenyong Zhou,Taiqiang Wu,Zhengwu Liu,Yuxin Cheng,Chen Zhang,Ngai Wong
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 4 figures
Abstract:Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically, a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, sharing specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions for several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while preserving the input and output layers as input-specific. In addition, we design an extra novel projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR can save up to 60% parameters while maintaining comparable performance. Particularly, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB. Further analysis of various backbones proves the robustness of the proposed MINR.
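MINR 的参数共享可以写成“共享中间层 + 每图像独立的输入/输出层与投影层”。以下为最小 PyTorch 结构示意(层数、维度以及残差式投影的接法均为假设,非官方实现):

```python
import torch
import torch.nn as nn

class MINRSketch(nn.Module):
    """多图像共享中间层的 INR 结构示意。"""
    def __init__(self, n_images, hidden=256):
        super().__init__()
        self.inputs = nn.ModuleList(nn.Linear(2, hidden) for _ in range(n_images))
        self.proj = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_images))
        self.shared = nn.Sequential(                      # 所有图像共用的中间层
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.outputs = nn.ModuleList(nn.Linear(hidden, 3) for _ in range(n_images))

    def forward(self, coords, img_id):
        h = torch.relu(self.inputs[img_id](coords))       # 图像专属输入层
        h = self.shared(h) + self.proj[img_id](h)         # 共享主干 + 专属投影捕捉个性
        return self.outputs[img_id](h)                    # 图像专属输出层

model = MINRSketch(n_images=100)
rgb = model(torch.rand(1024, 2), img_id=7)
print(rgb.shape)  # torch.Size([1024, 3])
```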
zh
[CV-74] STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models ICCV
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在自动化交通分析中存在计算资源消耗大、难以实现细粒度时空理解的问题。其解决方案的关键在于提出一种计算高效的框架STER-VLM,通过四个核心机制实现性能提升:(1) 将图像描述分解以分别处理空间与时间信息;(2) 采用最优视角筛选策略进行时间帧选择,确保充分的时序信息;(3) 引入参考驱动机制以捕捉细粒度运动和动态场景上下文;(4) 使用精心设计的视觉/文本提示技术增强语义表达能力。实验表明,该框架在WTS和BDD数据集上显著提升了语义丰富性和交通场景解析能力,并在AI City Challenge 2025 Track 2中取得55.655的测试得分,验证了其在真实场景下资源高效且高精度交通分析的有效性。
链接: https://arxiv.org/abs/2508.13470
作者: Tinh-Anh Nguyen-Nhu,Triet Dao Hoang Minh,Dat To-Thanh,Phuc Le-Gia,Tuan Vo-Lan,Tien-Huy Nguyen
机构: Ho Chi Minh University of Technology (胡志明市科技大学); Vietnamese-German University (越南-德国大学); Ho Chi Minh University of Science (胡志明市科学大学); University of Information Technology (信息科技大学); Vietnam National University, Ho Chi Minh city (胡志明市国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICCV Workshop 2025
Abstract:Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, (3) reference-driven understanding for capturing fine-grained motion and dynamic context, and (4) curated visual/textual prompt techniques. Experimental results on the WTS and BDD datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated through a decent test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.
zh
[CV-75] Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
【速读】:该论文旨在解决肾结石内窥镜图像分类问题,以实现个性化治疗和复发预防。传统卷积神经网络(CNN)在处理不同成像条件下的图像时,受限于其捕捉长距离依赖关系的能力不足,导致性能受限。解决方案的关键在于引入视觉Transformer(Vision Transformer, ViT)架构,利用其自注意力机制有效建模全局上下文信息,从而提升分类准确性与鲁棒性。实验表明,预训练于ImageNet-21k的ViT-base模型在多个图像子集上显著优于ResNet50基线,尤其在复杂视觉场景下表现突出,验证了ViT作为可扩展替代方案的有效性。
链接: https://arxiv.org/abs/2508.13461
作者: Ivan Reyes-Amezcua,Francisco Lopez-Tiro,Clement Larose,Andres Mendez-Vazquez,Gilberto Ochoa-Ruiz,Christian Daul
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
zh
[CV-76] Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)中的token技术与传统视觉编码(visual coding)之间缺乏系统性比较与融合的问题,从而推动高效多模态模型和下一代语义视觉编解码器的发展。其解决方案的关键在于建立一个统一的公式框架,将MLLM的token化、token压缩和token推理与视觉编码的核心原则相衔接,实现模块级的系统性对比分析;同时通过双向知识迁移,一方面利用视觉编码的成熟理论提升token技术的效率与鲁棒性,另一方面借鉴token技术范式指导新型语义视觉编解码器的设计,最终为未来研究指明方向并揭示关键挑战。
链接: https://arxiv.org/abs/2508.13460
作者: Jinming Liu,Junyan Lin,Yuntao Wei,Kele Shao,Keda Tao,Jianguo Huang,Xudong Yang,Zhibo Chen,Huan Wang,Xin Jin
机构: Eastern Institute of Technology (东方理工大学); Westlake University (西湖大学); USTC (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the core objective - maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of long-developed visual coding area. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance MLLM token techniques’ efficiency and robustness, and conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; (3) prospect for promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.
zh
[CV-77] Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification CIKM2025
【速读】:该论文旨在解决层次化多标签分类(Hierarchical Multi-Label Classification, HMC)中结构一致性难以维持以及多任务学习(Multi-Task Learning, MTL)中损失权重难以平衡的问题。其核心解决方案是提出一种基于MTL的分类器HCAL,关键创新在于:(1) 引入原型对比学习(prototype contrastive learning)与自适应任务权重机制,实现语义一致性建模——通过显式建模标签和从子类到父类的特征聚合来保持层次结构;(2) 设计动态损失加权机制,依据各任务收敛速率自动分配优化资源,有效缓解传统MTL中存在的“一强多弱”优化偏差问题;此外,还引入原型扰动机制以增强决策边界鲁棒性,并提出层次违规率(Hierarchical Violation Rate, HVR)作为量化评估指标,验证模型在结构一致性和泛化能力上的提升。
链接: https://arxiv.org/abs/2508.13452
作者: Ruobing Jiang,Mengzhe Liu,Haobing Liu,Yanwei Yu
机构: Ocean University of China (中国海洋大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, accepted by CIKM 2025
Abstract:Hierarchical Multi-Label Classification (HMC) faces critical challenges in maintaining structural consistency and balancing loss weighting in Multi-Task Learning (MTL). In order to address these issues, we propose a classifier called HCAL based on MTL integrated with prototype contrastive learning and adaptive task-weighting mechanisms. The most significant advantage of our classifier is semantic consistency, enforced both through prototypes that explicitly model labels and through feature aggregation from child classes to parent classes. The other important advantage is an adaptive loss-weighting mechanism that dynamically allocates optimization resources by monitoring task-specific convergence rates. It effectively resolves the “one-strong-many-weak” optimization bias inherent in traditional MTL approaches. To further enhance robustness, a prototype perturbation mechanism is formulated by injecting controlled noise into prototypes to expand decision boundaries. Additionally, we formalize a quantitative metric called Hierarchical Violation Rate (HVR) to evaluate hierarchical consistency and generalization. Extensive experiments across three datasets demonstrate both the higher classification accuracy and reduced hierarchical violation rate of the proposed classifier over baseline models.
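“依据收敛速率动态分配权重”与 Dynamic Weight Averaging (DWA) 的思路相近:近期下降慢的任务获得更大权重。以下按相邻两轮损失比值计算任务权重(温度 T 为假设超参,HCAL 的具体公式可能不同):

```python
import numpy as np

def adaptive_task_weights(prev_losses, curr_losses, T=2.0):
    """按任务收敛速率计算自适应权重(DWA 风格示意,非 HCAL 官方公式)。
    损失下降慢(比值大)的任务分到更大权重,缓解“一强多弱”。
    """
    ratio = np.asarray(curr_losses) / np.asarray(prev_losses)  # 收敛速率
    w = np.exp(ratio / T)
    return len(ratio) * w / w.sum()          # 归一化到和为任务数

# 任务 0 几乎不再下降,任务 1 下降很快 => 任务 0 权重更大
print(adaptive_task_weights(prev_losses=[1.0, 1.0], curr_losses=[0.98, 0.60]))
```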
zh
[CV-78] EDTalk: Full Disentanglement for Controllable Talking Head Synthesis
【速读】:该论文旨在解决现有说话头生成方法中面部运动特征解耦不充分的问题,即难以实现多维面部动作(如嘴部形状、头部姿态、眼部运动和情感表达)的独立控制,并且缺乏对不同输入模态(如视频或音频)的兼容性。解决方案的关键在于提出EDTalk++框架,通过四个轻量级模块将面部动态分解为独立的潜在空间(mouth, pose, eye, expression),每个空间由一组可学习基向量线性组合定义特定运动;同时引入正交约束确保各空间间无干扰,并设计无需外部知识的高效训练策略分配运动责任;最终利用存储于对应库中的基向量实现音频驱动下的共享视觉先验,从而实现高质量、可控且跨模态的说话头生成。
链接: https://arxiv.org/abs/2508.13442
作者: Shuai Tan,Bin Ji
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages,15 figures. arXiv admin note: substantial text overlap with arXiv:2404.01647
Abstract:Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs, both aspects often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk++.
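对各潜在空间的可学习基施加正交约束,常见做法是最小化 ‖BBᵀ − I‖²。以下为该正则项的最小示意(基数量与维度为假设):

```python
import torch

def orthogonality_loss(bases):
    """bases: [K, D],K 个可学习基向量;约束其两两正交、模长为 1(示意)。"""
    gram = bases @ bases.t()                        # [K, K] Gram 矩阵
    eye = torch.eye(bases.size(0), device=bases.device)
    return ((gram - eye) ** 2).sum()

B = torch.nn.Parameter(torch.randn(8, 128) / 128 ** 0.5)
opt = torch.optim.Adam([B], lr=1e-2)
for _ in range(200):                                # 基向量逐步趋于正交
    loss = orthogonality_loss(B)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(orthogonality_loss(B)))                 # 应接近 0
```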
zh
[CV-79] Mitigating Easy Option Bias in Multiple-Choice Question Answering
【速读】:该论文旨在解决多选视觉问答(Visual Question Answering, VQA)基准数据集中存在的“易选项偏差”(Easy-Options Bias, EOB)问题,即视觉语言模型(Vision-Language Models, VLMs)仅依赖视觉信息(V)和选项(O)即可准确作答,无需理解问题(Q),导致模型评估结果虚高。解决方案的关键在于通过GroundAttack工具包自动生成与正确答案在视觉上同样合理的难样本负选项(hard negative options),从而消除视觉特征空间中正确选项与错误选项之间的不平衡,使模型必须真正理解问题才能作答,实现对VLM问答能力的更真实评估。
链接: https://arxiv.org/abs/2508.13428
作者: Hao Zhang,Chen Li,Basura Fernando
机构: Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore(新加坡科技研究局高性能计算研究所); Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore(新加坡科技研究局前沿人工智能研究中心); College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算机与数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Under review
Abstract:In this early study, we observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks such as MMStar, RealWorldQA, SEED-Bench, Next-QA, STAR benchmark and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the need for the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual contents than the negative options in feature space, creating a shortcut for VLMs to infer the answer via simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options that are as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these EOB-free annotations, current VLMs approach random accuracies under (V+O) settings, and drop to non-saturated accuracies under (V+Q+O) settings, providing a more realistic evaluation of VLMs’ QA ability. Codes and new annotations will be released soon.
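“视觉-选项相似度捷径”可以用一个两行函数说明:不输入问题,仅按图像嵌入与各选项文本嵌入(如 CLIP 编码)的余弦相似度选答案。示意如下(嵌入来源与维度为假设):

```python
import torch
import torch.nn.functional as F

def vision_option_shortcut(img_emb, option_embs):
    """示意 EOB 捷径:不看问题,仅按"图像-选项"相似度选答案。
    img_emb: [C];option_embs: [K, C](如 CLIP 文本编码)。"""
    sims = F.cosine_similarity(option_embs, img_emb.unsqueeze(0), dim=-1)
    return int(sims.argmax())   # 若正确选项与视觉内容最相近,则无需问题即可答对

pred = vision_option_shortcut(torch.randn(512), torch.randn(4, 512))
print("chosen option:", pred)
```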
zh
[CV-80] AIM 2025 Rip Current Segmentation (RipSeg) Challenge Report ICCV
【速读】:该论文旨在解决海滩安全中 rip current(离岸流)的自动分割问题,即在静态图像中精准识别和定位危险的离岸流区域,以提升视觉检测的准确性并填补该领域的研究空白。解决方案的关键在于构建了一个名为 RipSeg 的挑战赛框架,其核心包括:基于现有最大规模的 rip current 数据集 RipVIS 的扩展数据集,涵盖多样化地理环境、rip current 类型与相机视角;采用多指标复合评分机制(F₁、F₂、AP₅₀ 与 AP[50:95])确保评估结果的鲁棒性和应用相关性;参赛团队主要依赖深度学习架构、领域自适应(domain adaptation)、预训练模型及领域泛化策略来应对复杂场景下的分割挑战,从而显著提升了 rip current 的实例分割性能。
链接: https://arxiv.org/abs/2508.13401
作者: Andrei Dumitriu,Florin Miron,Florin Tatui,Radu Tudor Ionescu,Radu Timofte,Aakash Ralhan,Florin-Alexandru Vasluianu,Shenyang Qian,Mitchell Harley,Imran Razzak,Yang Song,Pu Luo,Yumei Li,Cong Xu,Jinming Chai,Kexin Zhang,Licheng Jiao,Lingling Li,Siqi Yu,Chao Zhang,Kehuan Song,Fang Liu,Puhua Chen,Xu Liu,Jin Hu,Jinyang Xu,Biao Liu
机构: Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany (德国维尔茨堡大学计算机视觉实验室,CAIDAS 与 IFI); University of Bucharest, Romania (罗马尼亚布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Challenge report paper from AIM2025 Workshop at ICCVW 2025
Abstract:This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, 75 participants registered for this first edition, resulting in 5 valid test submissions. Teams were evaluated on a composite score combining F₁, F₂, AP₅₀, and AP[50:95], ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
zh
[CV-81] Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗影像分类任务中因计算成本高、访问受限及数据隐私问题而难以部署于资源受限的医疗环境中的挑战。其解决方案的关键在于探索小型语言模型(Small Language Models, SLMs)在医学影像分类任务中的性能潜力,并通过精心设计的提示工程(prompt engineering)策略显著提升其准确性和可用性,从而在不依赖用户具备深度人工智能专业知识的前提下实现高效可靠的医疗应用。
链接: https://arxiv.org/abs/2508.13378
作者: Yiting Wang,Ziwei Wang,Jiachen Zhong,Di Zhu,Weiyi Li
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Large language models (LLMs) have shown remarkable capabilities in natural language processing and multi-modal understanding. However, their high computational cost, limited accessibility, and data privacy concerns hinder their adoption in resource-constrained healthcare environments. This study investigates the performance of small language models (SLMs) in a medical imaging classification task, comparing different models and prompt designs to identify the optimal combination for accuracy and usability. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions (anteroposterior [AP] vs. posteroanterior [PA]) under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts, suggesting that prompt engineering can substantially enhance SLM performance in healthcare applications without requiring deep AI expertise from end users.
zh
[CV-82] Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery
【速读】:该论文旨在解决面部整形手术美学效果难以客观量化的问题,传统评估依赖主观判断,缺乏可重复性和标准化指标。其解决方案的关键在于构建一个可扩展且可解释的计算机视觉框架,整合自动特征点检测(automated landmark detection)、几何对称性计算、基于深度学习的年龄估计以及鼻部形态分析等模块,并基于迄今最大的配对术前术后面部照片数据集(7,160张图像,来自1,259名患者)进行验证。该框架能够提供统计显著的定量指标(如鼻部比例改善和面部对称性提升),并证明术后患者身份一致性高(真匹配率>99.5%),从而为外科规划、患者咨询及跨机构结果评估提供客观基准。
链接: https://arxiv.org/abs/2508.13363
作者: Pegah Varghaei,Kiran Abraham-Aggarwal,Manoj T. Abraham,Arun Ross
机构: Michigan State University (密歇根州立大学); Cornell University (康奈尔大学); New York Medical College (纽约医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce a scalable, interpretable computer-vision framework for quantifying aesthetic outcomes of facial plastic surgery using frontal photographs. Our pipeline leverages automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis. To perform this study, we first assemble the largest curated dataset of paired pre- and post-operative facial images to date, encompassing 7,160 photographs from 1,259 patients. This dataset includes a dedicated rhinoplasty-only subset consisting of 732 images from 366 patients, 96.2% of whom showed improvement in at least one of the three nasal measurements with statistically significant group-level change. Among these patients, the greatest statistically significant improvements (p 0.001) occurred in the alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). Among the broader frontal-view cohort, comprising 989 rigorously filtered subjects, 71.3% exhibited significant enhancements in global facial symmetry or perceived age (p 0.01). Importantly, our analysis shows that patient identity remains consistent post-operatively, with True Match Rates of 99.5% and 99.6% at a False Match Rate of 0.01% for the rhinoplasty-specific and general patient cohorts, respectively. Additionally, we analyze inter-practitioner variability in improvement rates. By providing reproducible, quantitative benchmarks and a novel dataset, our pipeline facilitates data-driven surgical planning, patient counseling, and objective outcome evaluation across practices.
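基于特征点的几何对称性通常这样计算:把右侧特征点关于面中线镜像后,与左侧对应点求平均距离,并按某个面部尺度归一化。以下为示意实现(配对索引与归一化基准均为假设,非论文官方流程):

```python
import numpy as np

def symmetry_score(landmarks, left_idx, right_idx, midline_x):
    """镜像对称误差(越小越对称,示意实现)。
    landmarks: [N, 2] 2D 特征点;left_idx/right_idx 为镜像配对索引。
    """
    left = landmarks[left_idx]
    right = landmarks[right_idx].copy()
    right[:, 0] = 2 * midline_x - right[:, 0]            # 右侧点关于面中线镜像
    scale = np.linalg.norm(landmarks[0] - landmarks[1])  # 假设前两点为双眼内眦,用瞳距归一化
    return np.linalg.norm(left - right, axis=1).mean() / scale

pts = np.array([[40., 50.], [60., 50.], [30., 80.], [70., 80.]])
print(symmetry_score(pts, left_idx=[2], right_idx=[3], midline_x=50.0))  # 0 => 完全对称
```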
zh
[CV-83] A Surveillance Based Interactive Robot
【速读】:该论文旨在解决移动监控机器人在实际应用中面临的实时视频流传输、语音交互与自主感知能力不足的问题,以实现用户通过手机或浏览器远程监控并操控机器人。其解决方案的关键在于采用双树莓派(Raspberry Pi 4)架构:前端单元负责采集视觉和音频数据(含摄像头、麦克风和扬声器),中央单元处理视频流(使用FFmpeg传输)、运行目标检测(YOLOv3模型用于场景理解与导航支持)及语音交互模块(集成Python语音识别、多语言翻译与文本转语音库),从而实现自然语言指令的识别与响应;同时引入Kinect RGB-D传感器提供深度信息用于障碍物感知,系统在CPU上即可实现高帧率的目标检测与可靠命令识别,且完全基于现成硬件和开源软件,具备良好的可复现性与扩展潜力(如融合超声波测距、GPU加速、人脸与文本识别等)。
链接: https://arxiv.org/abs/2508.13319
作者: Kshitij Kavimandan,Pooja Mangal,Devanshi Mehta
机构: NMIMS University (NMIMS 大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 5 figures
Abstract:We build a mobile surveillance robot that streams video in real time and responds to speech so a user can monitor and steer it from a phone or browser. The system uses two Raspberry Pi 4 units: a front unit on a differential drive base with camera, mic, and speaker, and a central unit that serves the live feed and runs perception. Video is sent with FFmpeg. Objects in the scene are detected using YOLOv3 to support navigation and event awareness. For voice interaction, we use Python libraries for speech recognition, multilingual translation, and text-to-speech, so the robot can take spoken commands and read back responses in the requested language. A Kinect RGB-D sensor provides visual input and obstacle cues. In indoor tests the robot detects common objects at interactive frame rates on CPU, recognises commands reliably, and translates them to actions without manual control. The design relies on off-the-shelf hardware and open software, making it easy to reproduce. We discuss limits and practical extensions, including sensor fusion with ultrasonic range data, GPU acceleration, and adding face and text recognition.
zh
[CV-84] DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
【速读】:该论文旨在解决现有Lp范数约束下的对抗样本生成方法难以与人类感知对齐的问题,同时探索如何有效利用传统Lp约束攻击的洞察来提升对抗样本的视觉合理性。其解决方案的关键在于提出了一种全可微的元攻击框架DAASH,通过多阶段策略性组合多个基于Lp的基线攻击方法:在每一阶段中,利用学习到的自适应权重聚合候选对抗样本,并通过一种新颖的元损失函数联合优化误分类损失与感知失真度,从而动态调节各基线攻击的贡献比例。该设计使得DAASH即使仅依赖Lp约束方法,也能显著优于当前最先进的感知对齐攻击(如AdvAD),在多个数据集和对抗训练模型上实现更高的攻击成功率和更优的图像质量指标(SSIM、LPIPS、FID)。
链接: https://arxiv.org/abs/2508.13309
作者: Abdullah Al Nomaan Nafi,Habibur Rahaman,Zafaryab Haider,Tanzim Mahfuz,Fnu Suya,Swarup Bhunia,Prabuddha Chakraborty
机构: University of Maine(缅因大学); University of Florida(佛罗里达大学); University of Tennessee(田纳西大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained methods, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD – achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of approximately 11, 0.015, and 5.7, respectively). Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
zh
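顺着DAASH"可微加权聚合多个基线攻击"的思路,下面给出一个极简的PyTorch示意:用可学习的softmax权重融合各基线攻击产生的候选对抗样本,并以"误分类损失+感知失真"的元损失更新权重。这不是论文官方实现:其中以MSE充当感知失真的简化替身(论文采用SSIM/LPIPS/FID等度量),模型接口、步数与系数均为假设。

```python
# 极简示意:可微地聚合多个基线攻击的候选对抗样本(假设性实现,非DAASH官方代码)
import torch
import torch.nn.functional as F

def aggregate_and_refine(model, x_clean, y, candidates, steps=50, lam=0.1, lr=0.05):
    """model: 参数已冻结的分类器;x_clean: (B,C,H,W) 原始图像;y: (B,) 真标签;
    candidates: (K,B,C,H,W) 来自K个Lp基线攻击的候选对抗样本。"""
    logits_w = torch.zeros(candidates.size(0), requires_grad=True)  # 各基线攻击的聚合权重
    opt = torch.optim.Adam([logits_w], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits_w, dim=0).view(-1, 1, 1, 1, 1)
        x_adv = (w * candidates).sum(dim=0).clamp(0, 1)   # 加权融合候选样本
        miscls = -F.cross_entropy(model(x_adv), y)        # 推高真标签交叉熵,促使误分类
        distort = F.mse_loss(x_adv, x_clean)              # 感知失真的简化替身(论文用LPIPS等)
        loss = miscls + lam * distort                     # "元损失":两项联合最小化
        opt.zero_grad(); loss.backward(); opt.step()
    return x_adv.detach()
```

实际使用时应将模型置于eval模式并冻结其参数,梯度只流向聚合权重。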
[CV-85] Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶(Autonomous Driving, AD)场景中因处理高分辨率多视角图像而导致的计算开销过大问题,具体表现为视觉token数量激增,引发推理延迟和内存消耗上升,这主要源于自注意力机制的二次复杂度。解决方案的关键在于提出Prune2Drive——一个即插即用的视觉token剪枝框架,其核心创新包括:(i) 基于最远点采样思想的多样性感知token选择机制,优先保障跨视角的语义与空间覆盖性,而非仅依赖注意力分数;(ii) 视角自适应剪枝控制器,学习每个摄像头视图对下游驾驶任务的重要性并动态调整剪枝比例。该方法无需模型重训练或访问注意力映射,兼容现代高效注意力实现,在DriveLM和DriveLMM-o1两个大规模多视角驾驶基准上验证了显著加速与内存节省效果,例如保留10% token时预填充阶段提速6.40倍、FLOPs降至13.4%,性能仅下降3%。
链接: https://arxiv.org/abs/2508.13305
作者: Minhao Xiong,Zichen Wen,Zhuangcheng Gu,Xuyang Liu,Rui Zhang,Hengrui Kang,Jiabing Yang,Junyuan Zhang,Weijia Li,Conghui He,Yafei Wang,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); Peking University (北京大学); University of Science and Technology of China (中国科学技术大学); Fudan University (复旦大学); Nanjing University (南京大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures
Abstract:Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40× speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
zh
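Prune2Drive的"多样性感知token选择"以最远点采样(FPS)为核心:不依赖注意力分数,而是迭代挑选与已选集合距离最远的token,以保证语义与空间覆盖。下面是一个NumPy极简示意(token维度、保留比例等均为假设,非论文官方实现):

```python
# 极简示意:基于最远点采样(FPS)的多样性感知视觉token选择
import numpy as np

def farthest_point_sampling(tokens: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """tokens: (N, D) 视觉token嵌入;返回被保留token的索引。"""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    # 从离全体均值最远的token起步
    selected = [int(np.argmax(np.linalg.norm(tokens - tokens.mean(0), axis=1)))]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(tokens - tokens[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))  # 选取距已选集合最远的token
    return np.array(selected)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(196, 64))           # 假设单视角产生196个token
    kept = farthest_point_sampling(feats, 0.1)   # 保留10%
    print(kept.shape)                            # (19,)
```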
[CV-86] GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis
【速读】:该论文旨在解决步态识别(gait recognition)任务中因缺乏大规模标注数据集以及难以在保护隐私的前提下收集多样化个体步态样本所带来的性能瓶颈问题。其解决方案的关键在于提出GaitCrafter——一种基于扩散模型(diffusion model)的步态序列合成框架,该框架直接在轮廓域(silhouette domain)上从零训练视频扩散模型,从而生成时序一致且身份不变的逼真步态序列;同时支持通过条件控制(如服装、携带物品和视角)实现可控生成,并引入基于身份嵌入插值的新身份生成机制,以合成未出现在原始数据集中、具有独特稳定步态模式的虚拟个体,有效提升模型性能并保障真实个体隐私。
链接: https://arxiv.org/abs/2508.13300
作者: Sirshapan Mitra,Yogesh S. Rawat
机构: CRCV, University of Central Florida (中央佛罗里达大学计算机视觉与机器人研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable-allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities-synthetic individuals not present in the original dataset-by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
zh
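关于"身份嵌入插值生成新身份",下面给出一个球面插值(slerp)的极简示意;论文未说明具体插值形式,slerp仅是常见选择之一,嵌入维度与归一化处理均为假设:

```python
# 极简示意:对两个真实身份嵌入做球面插值,合成"新身份"嵌入(假设性实现)
import numpy as np

def slerp(e1: np.ndarray, e2: np.ndarray, t: float) -> np.ndarray:
    """e1/e2: 两个身份嵌入向量;t∈[0,1] 为插值系数。"""
    e1, e2 = e1 / np.linalg.norm(e1), e2 / np.linalg.norm(e2)
    omega = np.arccos(np.clip(e1 @ e2, -1.0, 1.0))   # 两嵌入的夹角
    if omega < 1e-6:                                  # 近乎同向时退化为线性插值
        return (1 - t) * e1 + t * e2
    return (np.sin((1 - t) * omega) * e1 + np.sin(t * omega) * e2) / np.sin(omega)
```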
[CV-87] CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification
【速读】:该论文旨在解决溃疡性结肠炎内镜图像中疾病严重程度评估的挑战,特别是针对Mayo Endoscopic Subscore (MES) 分类任务中存在的标签噪声问题(源于观察者间差异)以及评分的序数特性被标准模型忽略的问题。解决方案的关键在于提出一种名为CLoE的课程学习框架,其核心创新包括:利用轻量级模型基于Boston Bowel Preparation Scale (BBPS) 标签估计图像质量作为标注置信度的代理,从而按从易到难(干净到嘈杂)排序样本以构建课程;同时结合ResizeMix数据增强策略提升鲁棒性。该方法在LIMUC和HyperKvasir数据集上验证有效,显著优于监督与自监督基线模型,在保持低计算成本的同时实现高精度和强一致性(如ConvNeXt-Tiny在LIMUC上达到82.5%准确率和0.894的QWK)。
链接: https://arxiv.org/abs/2508.13280
作者: Zeynep Ozdemir,Hacer Yalim Keles,Omer Ozgur Tanriover
机构: Ankara University (安卡拉大学); Hacettepe University (哈切特佩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 4 figures, 9 tables
Abstract:Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at this https URL.
zh
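"按质量分数由易到难分阶段引入样本"的课程构建可用几行代码示意如下;质量分数假定来自BBPS代理评分模型的输出,阶段数等均为假设,非论文官方实现:

```python
# 极简示意:按图像质量(标注置信度的代理)从"易/干净"到"难/嘈杂"构建课程
import numpy as np

def build_curriculum(qual_scores, n_stages=3):
    """qual_scores: 各训练样本的质量分数(越高越干净);
    返回由易到难的分阶段样本索引列表,训练时逐阶段累加引入。"""
    order = np.argsort(-np.asarray(qual_scores))   # 质量降序排列 = 由易到难
    return np.array_split(order, n_stages)

if __name__ == "__main__":
    stages = build_curriculum([0.9, 0.2, 0.7, 0.5, 0.95], n_stages=2)
    print([s.tolist() for s in stages])            # [[4, 0, 2], [3, 1]]
```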
[CV-88] Exploration of Deep Learning Based Recognition for Urdu Text
【速读】:该论文旨在解决乌尔都语(Urdu)光学字符识别(Optical Character Recognition, OCR)中因字符连写(ligature)和上下文敏感性导致的高错误率问题。其解决方案的关键在于提出一种基于组件的分类方法,利用卷积神经网络(Convolutional Neural Network, CNN)自动学习特征,并结合层次化神经网络结构处理三种字符排列组合情况,从而实现对乌尔都语文字组件的有效识别与分类,最终在组件分类任务上达到0.99的准确率。
链接: https://arxiv.org/abs/2508.13245
作者: Sumaiya Fazal,Sheeraz Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Urdu is a cursive-script language with similarities to Arabic and many other South Asian languages. Urdu is difficult to classify due to its complex geometrical and morphological structure. Character classification can be carried out further if the segmentation technique is efficient, but due to context sensitivity in Urdu, segmentation-based recognition often results in a high error rate. Our proposed approach for an Urdu optical character recognition system is component-based classification relying on an automatic feature-learning technique, the convolutional neural network (CNN). The CNN is trained and tested on an Urdu text dataset generated through a permutation process over three characters; unnecessary images are then discarded by applying a connected-component technique so as to obtain ligatures only. A hierarchical neural network is implemented with two levels to deal with the three degrees of character permutation and with component classification. Our model achieved an accuracy of 0.99 for component classification.
zh
[CV-89] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在文档图像解析任务中存在幻觉(hallucination)以及在特定光学字符识别(OCR)任务上性能不如专业模型的问题。其解决方案的关键在于提出DianJin-OCR-R1框架,通过训练推理与工具交互融合的视觉语言模型(VLM),使模型在接收到识别指令后,首先利用自身OCR能力初步识别图像内容,随后调用其他专家模型(expert models)作为参考工具获取结果,并基于这些外部结果重新审视图像并进行推理,最终输出更准确的识别结果。该方法借助专家模型在特定OCR任务上的高精度和低幻觉特性,有效提升了LVLM的识别可靠性,同时因其轻量级特性降低了迭代成本。
链接: https://arxiv.org/abs/2508.13238
作者: Qian Chen,Xianyin Zhang,Lifan Guo,Feng Chen,Chi Zhang
机构: Alibaba Cloud Computing (阿里云计算)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations–generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image by its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally looks at the image again and rethinks the reasoning process to provide the final recognized content. Since the architectures of expert models are tailored to specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easy to iterate, enabling performance improvements for VLMs at a lower cost. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method.
zh
[CV-90] Uncertainty-Aware Learning Policy for Reliable Pulmonary Nodule Detection on Chest X-Ray
【速读】:该论文旨在解决医学人工智能(Medical AI)在肺部结节检测中因知识匮乏导致的诊断不确定性问题,从而提升临床医生对AI系统的信任度。其解决方案的关键在于提出一种“不确定性感知学习策略”(Uncertainty-Aware Learning Policy),该策略通过联合学习放射科医师的背景知识与胸部X光片中的病灶信息,弥补传统医疗AI仅依赖重复性病灶数据训练所造成的知识不足,从而降低诊断不确定性并提高敏感性。实验结果表明,该方法在保持高精度的同时,将不确定性熵降低0.2,并使敏感性提升10%。
链接: https://arxiv.org/abs/2508.13236
作者: Hyeonjin Choi,Jinse Kim,Dong-yeon Yoo,Ju-sung Sun,Jung-won Lee
机构: Ajou University (亚洲大学); Ajou University Hospital (亚洲大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures
Abstract:Early detection and rapid intervention of lung cancer are crucial. Nonetheless, ensuring an accurate diagnosis is challenging, as physicians’ ability to interpret chest X-rays varies significantly depending on their experience and degree of fatigue. Although medical AI has been rapidly advancing to assist in diagnosis, physicians’ trust in such systems remains limited, preventing widespread clinical adoption. This skepticism fundamentally stems from concerns about its diagnostic uncertainty. In clinical diagnosis, physicians utilize extensive background knowledge and clinical experience. In contrast, medical AI primarily relies on repetitive learning of the target lesion to generate diagnoses based solely on that data. In other words, medical AI does not possess sufficient knowledge to render a diagnosis, leading to diagnostic uncertainty. Thus, this study suggests an Uncertainty-Aware Learning Policy that can address the issue of knowledge deficiency by learning the physicians’ background knowledge alongside the chest X-ray lesion information. We used 2,517 lesion-free images and 656 nodule images, all obtained from Ajou University Hospital. The proposed model attained a sensitivity of 92% (IoU 0.2 / FPPI 2), a 10% enhancement over the baseline model, while also decreasing entropy, as a measure of uncertainty, by 0.2.
zh
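该工作以熵度量诊断不确定性(报告其降低0.2)。作为参考,下面给出预测熵的标准计算示意(通用定义,具体与论文实现可能有出入):

```python
# 极简示意:用预测分布的熵量化模型的不确定性
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """probs: (N, C) 的softmax输出;返回每个样本的预测熵,值越大越不确定。"""
    p = np.clip(np.asarray(probs), eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

if __name__ == "__main__":
    print(predictive_entropy([[0.5, 0.5], [0.99, 0.01]]))  # 前者熵高(更不确定)
```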
[CV-91] RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在复杂图像标注任务中推理能力不足的问题,特别是情绪分类和上下文驱动的目标检测等任务,这些任务需要深层次的逻辑推理。现有方法如标准监督微调(Supervised Fine-Tuning, SFT)仅关注标注结果而忽略推理过程,而视觉强化微调(Visual Reinforcement Fine-Tuning, Visual-RFT)因预训练阶段缺乏高质量、可验证的思维链(Chains of Thought, CoTs),导致生成的CoTs不一致。论文提出两阶段框架RISE(Reason-Inspire-Strengthen-Expertise),其核心创新在于:第一阶段RISE-CoT通过强化学习驱动的“标注-推理-标注”闭环机制,生成视觉对齐且逻辑一致的CoTs,并通过重建原始标注来验证其有效性;第二阶段RISE-R1利用第一阶段筛选出的高质量CoT子集进行监督微调与强化微调,从而实现准确标注与可解释推理的统一,最终提升VLM在复杂视觉任务中的专家级表现。
链接: https://arxiv.org/abs/2508.13229
作者: Suhang Hu,Wei Hu,Yuhang Su,Fan Zhang
机构: Beijing University of Chemical Technology (北京化工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven “annotation-reasoning-annotation” closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
zh
[CV-92] PreSem-Surf: RGB-D Surface Reconstruction with Progressive Semantic Modeling and SG-MLP Pre-Rendering Mechanism IJCNN2025
【速读】:该论文旨在解决基于RGB-D序列的场景表面重建中存在重建质量不高、训练效率低以及难以有效融合多模态信息(如语义信息)的问题。其解决方案的关键在于提出PreSem-Surf方法,通过引入一种结合SG-MLP采样结构与PR-MLP(预条件多层感知机)的新型体素预渲染机制,使模型更早捕捉场景相关特征并更好地区分噪声与局部细节;同时采用渐进式语义建模策略,在不同精度层级上逐步提取语义信息,从而在显著缩短训练时间的同时提升场景理解能力与重建质量。
链接: https://arxiv.org/abs/2508.13228
作者: Yuyan Ye,Hang Xu,Yanghang Huang,Jiali Huang,Qian Weng
机构: Fuzhou University (福州大学); Key Laboratory of Spatial Data Mining and Information Sharing, Ministry of Education of the People’s Republic of China (中华人民共和国教育部空间数据挖掘与信息共享重点实验室)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 2025 International Joint Conference on Neural Networks (IJCNN 2025)
Abstract:This paper proposes PreSem-Surf, an optimized method based on the Neural Radiance Field (NeRF) framework, capable of reconstructing high-quality scene surfaces from RGB-D sequences in a short time. The method integrates RGB, depth, and semantic information to improve reconstruction performance. Specifically, a novel SG-MLP sampling structure combined with PR-MLP (Preconditioning Multilayer Perceptron) is introduced for voxel pre-rendering, allowing the model to capture scene-related information earlier and better distinguish noise from local details. Furthermore, progressive semantic modeling is adopted to extract semantic information at increasing levels of precision, reducing training time while enhancing scene understanding. Experiments on seven synthetic scenes with six evaluation metrics show that PreSem-Surf achieves the best performance in C-L1, F-score, and IoU, while maintaining competitive results in NC, Accuracy, and Completeness, demonstrating its effectiveness and practical applicability.
zh
[CV-93] MIRAGE: Towards AI-Generated Image Detection in the Wild
【速读】:该论文旨在解决生成式 AI (Generative AI) 图像(AIGI)在真实世界场景(in-the-wild)中难以检测的问题。现有检测方法在实验室环境下的干净图像中表现良好,但在包含噪声、多样来源及后期编辑的真实图像中泛化能力差。解决方案的关键在于构建一个名为 Mirage 的挑战性基准,该基准融合了人工验证的互联网来源 AIGI 和多专家生成器协作合成的高保真图像,以模拟真实复杂性;并提出 Mirage-R1 模型,该模型采用从启发式到分析式的推理机制和反射式推理机制,并通过两阶段训练(监督微调 + 强化学习)以及推理时自适应思维策略,在推理速度与检测准确性之间实现有效权衡,显著优于当前最先进检测器。
链接: https://arxiv.org/abs/2508.13223
作者: Cheng Xia,Manxi Lin,Jiexiang Tan,Xiaoxiong Du,Yang Qiu,Junjun Zheng,Xiangheng Kong,Yuning Jiang,Bo Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The spreading of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from "obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised-fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model outperforms state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.
zh
[CV-94] YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection
【速读】:该论文旨在解决驾驶疲劳检测中现有视觉方法存在的局限性,如对小目标或遮挡目标检测性能差、多尺度特征建模能力不足等问题。解决方案的关键在于提出一种轻量级高效的目标检测模型 YOLO11-CR,其核心创新包括两个模块:卷积与注意力融合模块(Convolution-and-Attention Fusion Module, CAFM),用于融合局部 CNN 特征与全局 Transformer 上下文以增强特征表达能力;以及矩形校准模块(Rectangular Calibration Module, RCM),通过捕捉水平和垂直方向的上下文信息提升空间定位精度,尤其适用于侧脸和手机等小目标的检测。实验表明,该模型在 DSM 数据集上实现了高精度与召回率,显著优于基线模型,具备良好的实时性和部署潜力。
链接: https://arxiv.org/abs/2508.13205
作者: Zhebin Jin,Ligang Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Driver fatigue detection is of paramount importance for intelligent transportation systems due to its critical role in mitigating road traffic accidents. While physiological and vehicle dynamics-based methods offer accuracy, they are often intrusive, hardware-dependent, and lack robustness in real-world environments. Vision-based techniques provide a non-intrusive and scalable alternative, but still face challenges such as poor detection of small or occluded objects and limited multi-scale feature modeling. To address these issues, this paper proposes YOLO11-CR, a lightweight and efficient object detection model tailored for real-time fatigue detection. YOLO11-CR introduces two key modules: the Convolution-and-Attention Fusion Module (CAFM), which integrates local CNN features with global Transformer-based context to enhance feature expressiveness; and the Rectangular Calibration Module (RCM), which captures horizontal and vertical contextual information to improve spatial localization, particularly for profile faces and small objects like mobile phones. Experiments on the DSM dataset demonstrated that YOLO11-CR achieves a precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%, outperforming baseline models significantly. Ablation studies further validate the effectiveness of the CAFM and RCM modules in improving both sensitivity and localization accuracy. These results demonstrate that YOLO11-CR offers a practical and high-performing solution for in-vehicle fatigue monitoring, with strong potential for real-world deployment and future enhancements involving temporal modeling, multi-modal data integration, and embedded optimization.
zh
[CV-95] BERT-VQA: Visual Question Answering on Plots
【速读】:该论文致力于解决图表类视觉问答(Visual Question Answering on Plots)这一子任务,其核心挑战在于如何有效融合视觉与语言模态信息以准确理解图表内容并回答相关问题。解决方案的关键在于提出并实现了一种基于VisualBERT的模型架构(BERT-VQA),该架构采用预训练的ResNet 101作为图像编码器,并探索了跨模态融合机制;尽管实验结果表明,VisualBERT中的跨模态模块并非如预期般对对齐图表元素与问题短语至关重要,但该研究仍为理解图表问答任务的难度及不同模型结构的有效性提供了重要洞见。
链接: https://arxiv.org/abs/2508.13184
作者: Tai Vu,Robert Yang
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual question answering has been an exciting challenge in the field of natural language understanding, as it requires deep learning models to exchange information from both vision and language domains. In this project, we aim to tackle a subtask of this problem, namely visual question answering on plots. To achieve this, we developed BERT-VQA, a VisualBERT-based model architecture with a pretrained ResNet 101 image encoder, along with a potential addition of joint fusion. We trained and evaluated this model against a baseline that consisted of an LSTM, a CNN, and a shallow classifier. The final outcome disproved our core hypothesis that the cross-modality module in VisualBERT is essential in aligning plot components with question phrases. Therefore, our work provided valuable insights into the difficulty of the plot question answering challenge as well as the appropriateness of different model architectures in solving this problem.
zh
[CV-96] Image2Net: Datasets Benchmark and Hybrid Framework to Convert Analog Circuit Diagrams into Netlists
【速读】:该论文旨在解决将图像形式的模拟集成电路(Analog IC)电路图高效准确地转换为文本形式的网表(netlist)这一难题,以支持大语言模型(Large Language Model, LLM)在模拟IC设计中的应用。当前LLM依赖于文本描述进行知识抽象与泛化,但现有模拟IC多以电路图形式存在,缺乏结构化的文本表示,限制了LLM在该领域的进一步发展。为此,论文提出了一种名为Image2Net的混合框架,其关键在于结合图像识别与符号推理能力,实现对多种风格和复杂度电路图的鲁棒转换;同时构建了一个包含丰富图像样式和平衡分布简单/复杂电路的开源数据集,并引入净列表编辑距离(Netlist Edit Distance, NED)作为精确评估指标,显著提升了转换成功率(达80.77%)和准确性(平均NED降低62.1%-69.6%),优于现有方法。
链接: https://arxiv.org/abs/2508.13157
作者: Haohang Xu,Chengjie Liu,Qihang Wang,Wenhao Huang,Yongjian Xu,Weiyu Chen,Anlan Peng,Zhijun Li,Bo Li,Lei Qi,Jun Yang,Yuan Du,Li Du
机构: Nanjing University (南京大学); South East University (东南大学); National Center of Technology Innovation for EDA (EDA 技术创新国家中心)
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 12 figures, 6 tables
Abstract:Large Language Models (LLMs) exhibit great potential in the design of analog integrated circuits (ICs) because of their excellence in knowledge abstraction and generalization. However, further development of LLM-based analog ICs relies heavily on textual descriptions of analog ICs, while existing analog ICs are mostly illustrated in image-based circuit diagrams rather than text-based netlists. Converting circuit diagrams to netlists helps LLMs enrich their knowledge of analog ICs. Nevertheless, previously proposed conversion frameworks face challenges in further application because of limited support for image styles and circuit elements. Effectively converting complex circuit diagrams into netlists thus remains a challenging task. To this end, this paper constructs and open-sources a new dataset with rich styles of circuit diagrams as well as a balanced distribution of simple and complex analog ICs. A hybrid framework, named Image2Net, is proposed for practical conversion from circuit diagrams to netlists. The netlist edit distance (NED) is also introduced to precisely assess the difference between the converted netlists and the ground truth. Based on our benchmark, Image2Net achieves an 80.77% success rate, which is 34.62%-45.19% higher than previous works. Specifically, the proposed work shows an average NED of 0.116, which is 62.1%-69.6% lower than the state of the art.
zh
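论文摘要未给出NED的精确定义,下面是一种合理的实现示意:将网表按行视作序列,计算归一化的Levenshtein编辑距离;按行切分与归一化方式均为假设:

```python
# 极简示意:把网表按行当作序列,计算归一化编辑距离作为NED的近似
def netlist_edit_distance(netlist_a: str, netlist_b: str) -> float:
    a = [l.strip() for l in netlist_a.strip().splitlines()]
    b = [l.strip() for l in netlist_b.strip().splitlines()]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # 经典Levenshtein动态规划表
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n] / max(m, n, 1)   # 归一化到[0, 1],0表示完全一致
```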
[CV-97] UNICON: UNIfied CONtinual Learning for Medical Foundational Models
【速读】:该论文旨在解决医学影像领域中基础模型(foundation models)因数据稀缺而难以针对每个特定领域、模态或任务进行充分预训练的问题。传统方法在面对新任务时往往需要大量标注数据,且容易引发灾难性遗忘(catastrophic forgetting)和任务干扰。解决方案的关键在于提出UNICON框架——一种统一的持续学习(continual learning)机制,使基础模型能够顺序地适应不同成像模态、解剖区域和临床目标,并动态扩展其能力而不丢失已有知识。该框架通过精心设计的集成策略,在无需大规模重新训练的前提下实现了跨任务、跨模态的知识迁移与累积,实验证明其在胸部CT分类基础上持续拓展至预后预测和分割任务,并进一步融合PET扫描后Dice评分提升5%,验证了基础模型具备可演化性,为通用型医学影像人工智能模型的发展提供了可行路径。
链接: https://arxiv.org/abs/2508.14024
作者: Mohammad Areeb Qazi,Munachiso S Nwadike,Ibrahim Almakky,Mohammad Yaqub,Numan Saeed
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 1 figure
Abstract:Foundational models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Continual learning offers a solution by fine-tuning a model sequentially on different domains or tasks, enabling it to integrate new knowledge without requiring large datasets for each training phase. In this paper, we propose UNIfied CONtinual Learning for Medical Foundational Models (UNICON), a framework that enables the seamless adaptation of foundation models to diverse domains, tasks, and modalities. Unlike conventional adaptation methods that treat these changes in isolation, UNICON provides a unified, perpetually expandable framework. Through careful integration, we show that foundation models can dynamically expand across imaging modalities, anatomical regions, and clinical objectives without catastrophic forgetting or task interference. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification to a prognosis and segmentation task. Our results show improved performance across both additional tasks. Furthermore, we continually incorporated PET scans and achieved a 5% improvement in Dice score compared to respective baselines. These findings establish that foundation models are not inherently constrained to their initial training scope but can evolve, paving the way toward generalist AI models for medical imaging.
zh
[CV-98] Real-Time Population-Based Reconstruction of 3D Bone Models via Very-Low-Dose Protocols
【速读】:该论文旨在解决传统基于CT的骨模型重建方法在术前规划中存在灵活性差、辐射暴露高及手动勾画耗时等问题,从而限制了其在临床中的广泛应用。其解决方案的关键在于提出了一种名为“半监督知识蒸馏重建”(Semi-Supervised Reconstruction with Knowledge Distillation, SSR-KD)的AI框架,该框架可仅通过双平面X射线在30秒内重建出误差低于1.0 mm的高质量骨模型,显著减少对CT依赖和人工操作,同时支持术中导航,具备与CT标注模型相当的临床适用性。
链接: https://arxiv.org/abs/2508.13947
作者: Yiqun Lin,Haoran Sun,Yongqing Li,Rabia Aslam,Lung Fung Tse,Tiange Cheng,Chun Sing Chui,Wing Fung Yau,Victorine R. Le Meur,Meruyert Amangeldy,Kiho Cho,Yinyu Ye,James Zou,Wei Zhao,Xiaomeng Li
机构: The Hong Kong University of Science and Technology (香港科技大学); Koln 3D Technology (Medical) Limited (Koln 3D技术(医疗)有限公司); Beihang University (北京航空航天大学); Union Hospital (联合医院); The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学); Stanford University (斯坦福大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Patient-specific bone models are essential for designing surgical guides and preoperative planning, as they enable the visualization of intricate anatomical structures. However, traditional CT-based approaches for creating bone models are limited to preoperative use due to the low flexibility and high radiation exposure of CT and time-consuming manual delineation. Here, we introduce Semi-Supervised Reconstruction with Knowledge Distillation (SSR-KD), a fast and accurate AI framework to reconstruct high-quality bone models from biplanar X-rays in 30 seconds, with an average error under 1.0 mm, eliminating the dependence on CT and manual work. Additionally, high tibial osteotomy simulation was performed by experts on reconstructed bone models, demonstrating that bone models reconstructed from biplanar X-rays have comparable clinical applicability to those annotated from CT. Overall, our approach accelerates the process, reduces radiation exposure, enables intraoperative guidance, and significantly improves the practicality of bone models, offering transformative applications in orthopedics.
zh
[CV-99] MMIS-Net for Retinal Fluid Segmentation and Detection
【速读】:该论文旨在解决当前深度学习方法在医学图像分割中普遍存在的局限性——即模型通常仅在单一来源、模态、器官或疾病类型的数据上训练和测试,忽略了利用其他可用标注数据的协同潜力。为应对这一问题,作者提出了一种名为MMIS-Net(MultiModal Medical Image Segmentation Network)的新算法,其关键创新在于引入了相似性融合(Similarity Fusion)模块,通过监督信号与像素级相似性知识选择实现特征图融合;同时,为处理不同数据集中类别定义不一致及标签矛盾的问题,设计了一种one-hot标签空间来统一多源标注信息。该方案使得模型能够在10个涵盖2种模态、19个器官的公开数据集上联合训练,并在RETOUTCH挑战赛隐藏测试集上显著优于现有大型基础模型和主流算法,验证了其有效性。
链接: https://arxiv.org/abs/2508.13936
作者: Nchongmaje Ndipenoch,Alina Miron,Kezhi Wang,Yongmin Li
机构: Brunel University London (布鲁内尔大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: Deep learning methods have shown promising results in the segmentation, and detection of diseases in medical images. However, most methods are trained and tested on data from a single source, modality, organ, or disease type, overlooking the combined potential of other available annotated data. Numerous small annotated medical image datasets from various modalities, organs, and diseases are publicly available. In this work, we aim to leverage the synergistic potential of these datasets to improve performance on unseen data. Approach: To this end, we propose a novel algorithm called MMIS-Net (MultiModal Medical Image Segmentation Network), which features Similarity Fusion blocks that utilize supervision and pixel-wise similarity knowledge selection for feature map fusion. Additionally, to address inconsistent class definitions and label contradictions, we created a one-hot label space to handle classes absent in one dataset but annotated in another. MMIS-Net was trained on 10 datasets encompassing 19 organs across 2 modalities to build a single model. Results: The algorithm was evaluated on the RETOUCH grand challenge hidden test set, outperforming large foundation models for medical image segmentation and other state-of-the-art algorithms. We achieved the best mean Dice score of 0.83 and an absolute volume difference of 0.035 for the fluids segmentation task, as well as a perfect Area Under the Curve of 1 for the fluid detection task. Conclusion: The quantitative results highlight the effectiveness of our proposed model due to the incorporation of Similarity Fusion blocks into the network’s backbone for supervision and similarity knowledge selection, and the use of a one-hot label space to address label class inconsistencies and contradictions.
zh
[CV-100] Learning to See Through Flare ICCV
【速读】:该论文旨在解决机器视觉系统在强激光照射下易受激光耀斑(laser flare)干扰的问题,即高强度激光通过传感器像素过饱和甚至永久性损坏使系统"致盲"并扭曲其环境感知。解决方案的关键在于提出NeuSee框架:联合学习衍射光学元件(diffractive optical element, DOE)的神经表征与频域Mamba-GAN图像恢复网络,并在10万张独特图像上进行端到端对抗训练,从而可抑制高达传感器饱和阈值(I_sat)10^6倍的峰值激光辐照度;该框架同时考虑开放世界场景中动态变化的激光波长、强度与位置、镜头眩光、未知环境光照及传感器噪声,首次实现全可见光谱成像与激光抑制,并带来10.1%的恢复图像质量提升。
链接: https://arxiv.org/abs/2508.13907
作者: Xiaopeng Peng,Heath Gemar,Erin Fleet,Kyle Novak,Abbie Watnik,Grover Swartzlander
机构: Rochester Institute of Technology (罗切斯特理工学院); U.S. Naval Research Laboratory (美国海军研究实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICCVW 2025
Abstract:Machine vision systems are susceptible to laser flare, where unwanted intense laser illumination blinds and distorts their perception of the environment through oversaturation or permanent damage to sensor pixels. We introduce NeuSee, the first computational imaging framework for high-fidelity sensor protection across the full visible spectrum. It jointly learns a neural representation of a diffractive optical element (DOE) and a frequency-space Mamba-GAN network for image restoration. The NeuSee system is adversarially trained end-to-end on 100K unique images to suppress peak laser irradiance as high as 10^6 times the sensor saturation threshold I_sat, the point at which camera sensors may experience damage without the DOE. Our system leverages heterogeneous data and model parallelism for distributed computing, integrating hyperspectral information and multiple neural networks for realistic simulation and image restoration. NeuSee takes into account open-world scenes with dynamically varying laser wavelengths, intensities, and positions, as well as lens flare effects, unknown ambient lighting conditions, and sensor noises. It outperforms other learned DOEs, achieving full-spectrum imaging and laser suppression for the first time, with a 10.1% improvement in restored image quality.
zh
[CV-101] A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
【速读】:该论文旨在解决经颅彩色多普勒超声(Transcranial Color-coded Doppler, TCCD)在评估大脑动脉环(Circle of Willis, CoW)时对操作者经验依赖性强的问题,从而限制其在临床中的广泛应用。解决方案的关键在于提出一种基于注意力机制增强的小波YOLO(Attention-Augmented Wavelet YOLO, AAW-YOLO)网络模型,能够实现对TCCD图像中脑血管结构的实时自动分割,显著提升分割精度与效率,同时降低对专业操作技能的依赖,为临床提供可推广的智能辅助诊断工具。
链接: https://arxiv.org/abs/2508.13875
作者: Wenxuan Zhang(1),Shuai Li(1),Xinyi Wang(1),Yu Sun(1),Hongyu Kang(1),Pui Yuk Chryste Wan(1),Yong-Ping Zheng(1 and 2),Sai-Kit Lam(1 and 2) ((1) Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China; (2) Research Institute of Smart Ageing, The Hong Kong Polytechnic University, Hong Kong SAR, China)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.
zh
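该研究以Dice与IoU等指标报告分割性能,二者的标准计算如下(通用定义,非论文专属实现):

```python
# 极简示意:由二值分割掩膜计算Dice系数与IoU
import numpy as np

def dice_iou(pred, gt, eps=1e-7):
    """pred/gt: 同形状的0/1(或布尔)分割掩膜;返回 (Dice, IoU)。"""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    return float(dice), float(inter / (union + eps))
```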
[CV-102] Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction
【速读】:该论文旨在解决心脏磁共振成像(Cardiac Magnetic Resonance, CMR)中因二维短轴切片稀疏采集导致的三维体积信息不完整问题,从而影响心脏全面评估的准确性。现有方法受限于预定义插值方案(如线性或球面插值)、计算效率低以及依赖额外语义输入(如分割标签或运动数据)。其解决方案的核心在于提出一种新颖的心脏潜在插值扩散模型(Cardiac Latent Interpolation Diffusion, CaLID)框架,关键创新包括:1)基于扩散模型的数据驱动插值策略,可捕捉稀疏切片间的复杂非线性关系以提升重建精度;2)在潜在空间中进行高效计算,使全心上采样速度提升24倍,显著降低计算开销;3)仅需稀疏二维CMR图像作为输入即可达到当前最优性能,无需辅助信息,简化临床流程。此外,该方法进一步扩展至二维加时间维度(2D+T)数据,有效建模时空动态并保证时序一致性,实验证明其在体积重建质量和下游分割任务中均优于基线方法。
链接: https://arxiv.org/abs/2508.13826
作者: Niklas Bubeck,Suprosanna Shit,Chen Chen,Can Zhao,Pengfei Guo,Dong Yang,Georg Zitzlsberger,Daguang Xu,Bernhard Kainz,Daniel Rueckert,Jiazhen Pan
机构: Technical University Munich(慕尼黑工业大学); Munich Center for Machine Learning(慕尼黑机器学习中心); University of Zurich(苏黎世大学); University of Oxford(牛津大学); Imperial College London(帝国理工学院); University of Sheffield(谢菲尔德大学); NVIDIA(英伟达); Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希亚历山大大学); School of Medicine, Klinikum Rechts der Isar, Technical University of Munich(慕尼黑工业大学右岸医院医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which can capture complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatial and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.
zh
[CV-103] Comparing Conditional Diffusion Models for Synthesizing Contrast-Enhanced Breast MRI from Pre-Contrast Images MICCAI
【速读】:该论文旨在解决动态对比增强磁共振成像(Dynamic Contrast-Enhanced MRI, DCE-MRI)在乳腺癌诊断与治疗中依赖钆类对比剂所引发的安全风险、禁忌症、成本上升及流程复杂化等问题。其核心解决方案是提出基于预对比图像条件化的去噪扩散概率模型(pre-contrast conditioned denoising diffusion probabilistic models),用于合成高质量的DCE-MRI图像。关键创新在于引入肿瘤感知损失函数(tumor-aware loss functions)和显式肿瘤分割掩膜条件输入(explicit tumor segmentation mask conditioning),从而显著提升病灶区域的保真度;实验表明,基于减影图像的模型在五种评估指标上均优于基于后对比图像的模型,且结合掩膜条件可进一步改善定性结果,尤其在对比剂摄取特征的捕捉上表现突出,验证了生成式AI在实现无对比剂DCE-MRI方面的临床潜力。
链接: https://arxiv.org/abs/2508.13776
作者: Sebastian Ibarra,Javier del Riego,Alessandro Catanese,Julian Cuba,Julian Cardona,Nataly Leon,Jonathan Infante,Karim Lekadir,Oliver Diaz,Richard Osuala
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures, submitted and accepted to MICCAI Deepbreath workshop 2025
Abstract:Dynamic contrast-enhanced (DCE) MRI is essential for breast cancer diagnosis and treatment. However, its reliance on contrast agents introduces safety concerns, contraindications, increased cost, and workflow complexity. To this end, we present pre-contrast conditioned denoising diffusion probabilistic models to synthesize DCE-MRI, introducing, evaluating, and comparing a total of 22 generative model variants in both single-breast and full breast settings. Towards enhancing lesion fidelity, we introduce both tumor-aware loss functions and explicit tumor segmentation mask conditioning. Using a public multicenter dataset and comparing to respective pre-contrast baselines, we observe that subtraction image-based models consistently outperform post-contrast-based models across five complementary evaluation metrics. Apart from assessing the entire image, we also separately evaluate the region of interest, where both tumor-aware losses and segmentation mask inputs improve evaluation metrics. The latter notably enhance qualitative results capturing contrast uptake, albeit assuming access to tumor localization inputs that are not guaranteed to be available in screening settings. A reader study involving 2 radiologists and 4 MRI technologists confirms the high realism of the synthetic images, indicating an emerging clinical potential of generative contrast-enhancement. We share our codebase at this https URL.
zh
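论文的"肿瘤感知损失"具体形式未在摘要中给出;下面是一种符合其描述的极简示意:在整图重建损失之外,对肿瘤掩膜区域额外加权,以提升病灶保真度。损失类型与权重系数均为假设:

```python
# 极简示意:对肿瘤掩膜区域额外加权的"肿瘤感知"重建损失(假设性实现)
import torch
import torch.nn.functional as F

def tumor_aware_loss(pred, target, tumor_mask, w_tumor: float = 5.0):
    """pred/target: (B,1,H,W) 合成与真实DCE图像;tumor_mask: 同形状0/1掩膜。"""
    base = F.l1_loss(pred, target)                              # 全图重建项
    lesion = F.l1_loss(pred * tumor_mask, target * tumor_mask)  # 病灶区域项
    return base + w_tumor * lesion
```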
[CV-104] Deep Biomechanically-Guided Interpolation for Keypoint-Based Brain Shift Registration MICCAI2025
【速读】:该论文旨在解决神经外科手术中脑移位(brain shift)补偿的精度问题,以维持神经导航系统在术中的可靠性。传统基于关键点的配准方法虽对大变形和拓扑变化具有鲁棒性,但通常依赖简单的几何插值器生成稠密位移场,忽略了组织的生物力学特性。解决方案的关键在于:先利用生物力学仿真生成大规模合成脑形变数据集,再训练一个残差3D U-Net,将标准插值估计细化为符合生物力学约束的稠密形变场。在大量模拟位移场上的实验表明,该方法显著优于经典插值器,将均方误差降至其一半,且推理时引入的计算开销可忽略不计(条目末尾附一个稀疏位移插值基线的极简示意)。
链接: https://arxiv.org/abs/2508.13762
作者: Tiago Assis,Ines P. Machado,Benjamin Zwick,Nuno C. Garcia,Reuben Dorent
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at COLlaborative Intelligence and Autonomy in Image-guided Surgery (COLAS) Workshop - MICCAI 2025
Abstract:Accurate compensation of brain shift is critical for maintaining the reliability of neuronavigation during neurosurgery. While keypoint-based registration methods offer robustness to large deformations and topological changes, they typically rely on simple geometric interpolators that ignore tissue biomechanics to create dense displacement fields. In this work, we propose a novel deep learning framework that estimates dense, physically plausible brain deformations from sparse matched keypoints. We first generate a large dataset of synthetic brain deformations using biomechanical simulations. Then, a residual 3D U-Net is trained to refine standard interpolation estimates into biomechanically guided deformations. Experiments on a large set of simulated displacement fields demonstrate that our method significantly outperforms classical interpolators, reducing the mean square error by half while introducing negligible computational overhead at inference time. Code available at: this https URL.
zh
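作为参照,下面示意该方法所要改进的"简单几何插值"基线:由稀疏关键点位移经线性插值得到稠密位移场(论文场景为3D,此处做2D简化;接口与边界填充值均为假设):

```python
# 极简示意:由稀疏关键点位移线性插值出稠密位移场(2D简化版基线)
import numpy as np
from scipy.interpolate import griddata

def dense_displacement(keypoints, displacements, shape):
    """keypoints: (K,2) 像素坐标;displacements: (K,2) 对应位移;shape: (H,W)。"""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1)
    fy = griddata(keypoints, displacements[:, 0], grid, method="linear", fill_value=0.0)
    fx = griddata(keypoints, displacements[:, 1], grid, method="linear", fill_value=0.0)
    return np.stack([fy, fx], axis=1).reshape(shape[0], shape[1], 2)

if __name__ == "__main__":
    kps = np.array([[0, 0], [0, 63], [63, 0], [63, 63], [32, 32]], float)
    disp = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2]], float)
    print(dense_displacement(kps, disp, (64, 64)).shape)   # (64, 64, 2)
```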
[CV-105] subCellSAM: Zero-Shot (Sub-)Cellular Segmentation for Hit Validation in Drug Discovery
【速读】:该论文旨在解决高通量筛选中细胞图像分割的自动化与泛化难题,传统方法如深度学习模型通常需要针对特定数据集进行繁琐的手动参数调优或领域特定微调,限制了其在多场景下的适用性。解决方案的关键在于提出一种零样本(zero-shot)分割框架,利用分割基础模型结合上下文学习(in-context learning)策略,通过三步流程实现细胞核、细胞及亚细胞结构的精准分割;其中创新性引入自提示机制(self-prompting mechanism),借助生长掩膜和精心设计的前景/背景点编码形态学与拓扑先验信息,从而在无需任何数据集特定调参的情况下,准确提取生物相关结构,显著提升方法的通用性和实用性。
链接: https://arxiv.org/abs/2508.13701
作者: Jacob Hanimann,Daniel Siegismund,Mario Wieser,Stephan Steigele
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at DAGM German Conference on Pattern Recognition (GCPR) 2025
Abstract:High-throughput screening using automated microscopes is a key driver in biopharma drug discovery, enabling the parallel evaluation of thousands of drug candidates for diseases such as cancer. Traditional image analysis and deep learning approaches have been employed to analyze these complex, large-scale datasets, with cell segmentation serving as a critical step for extracting relevant structures. However, both strategies typically require extensive manual parameter tuning or domain-specific model fine-tuning. We present a novel method that applies a segmentation foundation model in a zero-shot setting (i.e., without fine-tuning), guided by an in-context learning strategy. Our approach employs a three-step process for nuclei, cell, and subcellular segmentation, introducing a self-prompting mechanism that encodes morphological and topological priors using growing masks and strategically placed foreground/background points. We validate our method on both standard cell segmentation benchmarks and industry-relevant hit validation assays, demonstrating that it accurately segments biologically relevant structures without the need for dataset-specific tuning.
zh
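"自提示机制"的一个极度简化的示意如下:由当前(生长中的)掩膜推出前景/背景点,作为下一轮分割的点提示。点的选取策略纯属假设,与论文细节及任何分割基础模型的API无关:

```python
# 极简示意:由当前掩膜生成前景/背景点提示(假设性策略,非subCellSAM官方实现)
import numpy as np

def self_prompt_points(mask: np.ndarray, n_bg: int = 4):
    """mask: (H, W) 非空0/1掩膜;返回 (前景点, 背景点),坐标为(行, 列)。"""
    ys, xs = np.nonzero(mask)
    fg = np.array([ys.mean(), xs.mean()])            # 前景点:掩膜质心
    h, w = mask.shape
    corners = np.array([[0, 0], [0, w - 1], [h - 1, 0], [h - 1, w - 1]], float)
    d = np.linalg.norm(corners - fg, axis=1)
    bg = corners[np.argsort(-d)[:n_bg]]              # 背景点:离前景最远的角点
    return fg, bg
```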
[CV-106] State of Abdominal CT Datasets: A Critical Review of Bias Clinical Relevance and Real-world Applicability ALT
【速读】:该论文旨在解决当前用于腹部CT影像人工智能(AI)建模的公开数据集存在的冗余性高、地理分布偏倚严重以及潜在偏差(如领域偏移和选择偏倚)等问题,这些问题可能削弱模型在不同医疗环境中的泛化能力,尤其是在资源有限的地区。解决方案的关键在于通过多机构协作、采用标准化协议,并有意识地纳入多样化患者群体和成像技术,从而提升数据集的质量与代表性,推动更公平且临床鲁棒的AI模型开发。
链接: https://arxiv.org/abs/2508.13626
作者: Saeide Danaei,Zahra Dehghanian,Elahe Meftah,Nariman Naderi,Seyed Amir Ahmad Safavi-Naini,Faeze Khorasanizade,Hamid R. Rabiee
机构: Sharif University of Technology (谢里夫理工大学); Shahid Beheshti University of Medical Sciences (沙希德·贝赫什提医科大学); Tehran University of Medical Sciences Cancer Research Institute (德黑兰医科大学癌症研究所); Icahn School of Medicine at Mount Sinai (蒙特菲纳医疗中心伊坎医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Submitted to IEEE Journal of Biomedical and Health Informatics (under review). 10 pages, 3 figures, 5 tables
Abstract:This systematic review critically evaluates publicly available abdominal CT datasets and their suitability for artificial intelligence (AI) applications in clinical settings. We examined 46 publicly available abdominal CT datasets (50,256 studies). Across all 46 datasets, we found substantial redundancy (59.1% case reuse) and a Western/geographic skew (75.3% from North America and Europe). A bias assessment was performed on the 19 datasets with ≥100 cases; within this subset, the most prevalent high-risk categories were domain shift (63%) and selection bias (57%), both of which may undermine model generalizability across diverse healthcare environments – particularly in resource-limited settings. To address these challenges, we propose targeted strategies for dataset improvement, including multi-institutional collaboration, adoption of standardized protocols, and deliberate inclusion of diverse patient populations and imaging technologies. These efforts are crucial in supporting the development of more equitable and clinically robust AI models for abdominal imaging.
zh
[CV-107] Towards Understanding and Harnessing the Transferability of Prognostic Knowledge in Computational Pathology
【速读】:该论文旨在解决全切片图像(Whole-Slide Image, WSI)在癌症预后预测中面临的两个核心挑战:一是稀有肿瘤疾病因样本量有限导致模型难以训练;二是现有基于WSI的预后研究通常采用“癌症特异性模型开发”范式,即每种癌症对应一个独立模型,无法利用其他癌症中的可迁移预后知识(prognostic knowledge)。为应对这些问题,论文提出了首个系统性研究——病理学预后知识迁移(Path-PKT),其关键在于通过构建包含13种癌症的大规模数据集UNI2-h-DSS来量化不同癌症间预后知识的可迁移性,并设计实验识别影响知识迁移的关键因素;在此基础上,提出一种基于门控机制的混合专家模型(MoE-PKT),通过路由机制动态整合来自其他癌症的通用预后知识,从而提升对罕见肿瘤疾病的预测性能。
链接: https://arxiv.org/abs/2508.13482
作者: Pei Liu,Luping Ji,Jiaxiang Gou,Xiangxiang Zeng
机构: Hunan University (湖南大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages (13 figures and 5 tables)
Abstract:Whole-Slide Image (WSI) is an important tool for evaluating the prognosis of cancer patients. Present WSI-based prognosis studies generally follow a conventional paradigm – cancer-specific model development – where one cancer disease corresponds to one model and this model cannot make use of the prognostic knowledge from others. Despite its notable success in recent years, this paradigm has inherent limitations and has always been struggling with practical requirements: (i) scaling to the rare tumor diseases with very limited samples and (ii) benefiting from the generalizable prognostic knowledge in other cancers. To this end, this paper presents the first systematic study on Prognostic Knowledge Transfer in Pathology, called Path-PKT. It comprises three main parts. (1) We curate a large dataset (UNI2-h-DSS) with 13 cancers and use it to evaluate the transferability of prognostic knowledge between different cancers computationally. (2) We design experiments to understand what factors affect knowledge transfer and what causes positive transfers. (3) Motivated by empirical findings, we propose a new baseline approach (MoE-PKT) with a routing mechanism to utilize the generalizable prognostic knowledge in other cancers. Finally, we show the transferability of source models to rare tumor diseases. This study could lay solid foundations for the study of knowledge transfer in WSI-based cancer prognosis. Source code is available at this https URL.
zh
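MoE-PKT的路由思想可用一个极简的门控混合专家示意:门控网络按WSI级特征为各"癌种专家"分配权重,并加权得到风险分数。结构与维度均为假设,非论文官方实现:

```python
# 极简示意:带路由(gating)机制的混合专家预后头(假设性结构)
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)   # 路由器:按输入特征分配专家权重
        self.experts = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_experts))

    def forward(self, x):                        # x: (B, dim) WSI级特征
        w = torch.softmax(self.gate(x), dim=-1)  # (B, n_experts)
        outs = torch.stack([e(x).squeeze(-1) for e in self.experts], dim=-1)
        return (w * outs).sum(dim=-1)            # 加权风险分数 (B,)

if __name__ == "__main__":
    model = TinyMoE()
    print(model(torch.randn(2, 512)).shape)      # torch.Size([2])
```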
[CV-108] Susceptibility Distortion Correction of Diffusion MRI with a single Phase-Encoding Direction
【速读】:该论文旨在解决扩散磁共振成像(Diffusion MRI, dMRI)中因磁敏感效应引起的几何和强度畸变问题,这类畸变常导致图像质量下降,尤其在单向相位编码采集条件下难以校正。传统方法如topup依赖于双方向(blip-up 和 blip-down)图像对进行校正,限制了其在仅获取单一方向数据的回顾性研究中的应用。该文提出一种基于深度学习的解决方案,其关键在于仅使用单次采集数据(无论 blip-up 或 blip-down)即可实现高精度畸变校正,无需配对图像,从而显著提升方法的适用性和实用性,实验结果表明其性能可与 topup 相媲美。
链接: https://arxiv.org/abs/2508.13340
作者: Sedigheh Dargahi,Sylvain Bouix,Christian Desrosier
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion MRI (dMRI) is a valuable tool to map brain microstructure and connectivity by analyzing water molecule diffusion in tissue. However, acquiring dMRI data requires to capture multiple 3D brain volumes in a short time, often leading to trade-offs in image quality. One challenging artifact is susceptibility-induced distortion, which introduces significant geometric and intensity deformations. Traditional correction methods, such as topup, rely on having access to blip-up and blip-down image pairs, limiting their applicability to retrospective data acquired with a single phase encoding direction. In this work, we propose a deep learning-based approach to correct susceptibility distortions using only a single acquisition (either blip-up or blip-down), eliminating the need for paired acquisitions. Experimental results show that our method achieves performance comparable to topup, demonstrating its potential as an efficient and practical alternative for susceptibility distortion correction in dMRI.
zh
[CV-109] InnerGS: Internal Scenes Rendering via Factorized 3D Gaussian Splatting
【速读】:该论文旨在解决从稀疏切片数据中重建物体内部结构的问题,这在需要深入理解对象内部特征的应用场景中至关重要。传统方法多聚焦于外部表面建模,而本文通过直接利用内部3D高斯分布来建模连续的体密度(volumetric density),从而有效重建出平滑且细节丰富的内部结构。其解决方案的关键在于:摒弃了对相机位姿(camera poses)的依赖,采用显式异向3D高斯集合(anisotropic 3D Gaussians)表示场景,并具备即插即用特性与任意数据模态的天然兼容性。
链接: https://arxiv.org/abs/2508.13287
作者: Shuxin Liang,Yihan Xiao,Wenlu Tang
机构: University of Alberta (阿尔伯塔大学); Sichuan University (四川大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object’s interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modalities. We provide a CUDA implementation at: this https URL.
zh
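InnerGS用一组各向异性3D高斯直接建模连续体密度,即 ρ(x) = Σ_i w_i · exp(-1/2 · (x-μ_i)^T Σ_i^{-1} (x-μ_i))。下面用NumPy直写这一映射(参数形状均为假设,仅演示"内部高斯分布 → 体密度"的思想):

```python
# 极简示意:在查询点处累加各向异性3D高斯核,得到体密度
import numpy as np

def density(x, mus, covs, weights):
    """x: (3,) 查询点;mus: (G,3) 均值;covs: (G,3,3) 协方差;weights: (G,) 权重。"""
    rho = 0.0
    for mu, cov, w in zip(mus, covs, weights):
        d = x - mu
        rho += w * np.exp(-0.5 * d @ np.linalg.solve(cov, d))  # 各向异性高斯核
    return rho

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mus = rng.normal(size=(8, 3))
    covs = np.stack([np.eye(3) * 0.3 for _ in range(8)])
    print(density(np.zeros(3), mus, covs, np.ones(8)))
```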
[CV-110] Automated Cervical Cancer Detection through Visual Inspection with Acetic Acid in Resource-Poor Settings with Lightweight Deep Learning Models Deployed on an Android Device
【速读】:该论文旨在解决低资源环境下宫颈癌(cervical cancer)筛查效率低、依赖高技能医疗人员以及主观性强的问题。当前虽有多种筛查手段,但视觉醋酸试验(Visual Inspection with Acetic Acid, VIA)因其低成本和操作简便成为首选,然而其结果依赖人工判读且存在主观差异。为此,作者提出一种轻量级深度学习算法,核心在于结合EfficientDet-Lite3作为区域感兴趣(Region of Interest, ROI)检测器与基于MobileNet-V2的分类模型,实现自动化判读;该方案可在安卓设备上部署运行,无需互联网或复杂基础设施支持,从而实现快速、准确的筛查结果输出,显著提升基层医疗场景下的可及性和筛查覆盖率。
链接: https://arxiv.org/abs/2508.13253
作者: Leander Melroy Maben,Keerthana Prasad,Shyamala Guruvare,Vidya Kudva,P C Siddalingaswamy
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Cervical cancer is among the most commonly occurring cancer among women and claims a huge number of lives in low and middle-income countries despite being relatively easy to treat. Several studies have shown that public screening programs can bring down cervical cancer incidence and mortality rates significantly. While several screening tests are available, visual inspection with acetic acid (VIA) presents itself as the most viable option for low-resource settings due to the affordability and simplicity of performing the test. VIA requires a trained medical professional to interpret the test and is subjective in nature. Automating VIA using AI eliminates subjectivity and would allow shifting of the task to less trained health workers. Task shifting with AI would help further expedite screening programs in low-resource settings. In our work, we propose a lightweight deep learning algorithm that includes EfficientDet-Lite3 as the Region of Interest (ROI) detector and a MobileNet-V2 based model for classification. These models would be deployed on an android-based device that can operate remotely and provide almost instant results without the requirement of highly-trained medical professionals, labs, sophisticated infrastructure, or internet connectivity. The classification model gives an accuracy of 92.31%, a sensitivity of 98.24%, and a specificity of 88.37% on the test dataset and presents itself as a promising automated low-resource screening approach.
[CV-111] PediDemi – A Pediatric Demyelinating Lesion Segmentation Dataset
[Quick Read]: This paper addresses the lack of publicly available imaging datasets for pediatric demyelinating disorders, especially for disease types beyond Multiple Sclerosis (MS) such as Acute Disseminated Encephalomyelitis (ADEM). The key contribution is the first public dataset of MRI scans from 13 pediatric patients covering several demyelinating disorders, accompanied by detailed patient metadata (diagnosis, treatment history, laboratory results, and more) and lesion segmentation masks for model training and evaluation. A state-of-the-art segmentation model trained on an existing MS dataset is evaluated on the new data to validate its quality and clinical relevance, underscoring the importance of diverse data for model generalization.
Link: https://arxiv.org/abs/2508.13239
Authors: Maria Popa, Gabriela Adriana Visa
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Demyelinating disorders of the central nervous system may have multiple causes; the most common are infections, autoimmune responses, and genetic or vascular etiology. Demyelination lesions are characterized by areas where the myelin sheath of the nerve fibers is broken or destroyed. Among autoimmune disorders, Multiple Sclerosis (MS) is the most well-known and aggressive form. Acute Disseminated Encephalomyelitis (ADEM) is another type of demyelinating disease, typically with a better prognosis. Magnetic Resonance Imaging (MRI) is widely used for diagnosing and monitoring disease progression by detecting lesions. While both adults and children can be affected, there is a significant lack of publicly available datasets for pediatric cases and demyelinating disorders beyond MS. This study introduces, for the first time, a publicly available pediatric dataset for demyelinating lesion segmentation. The dataset comprises MRI scans from 13 pediatric patients diagnosed with demyelinating disorders, including 3 with ADEM. In addition to lesion segmentation masks, the dataset includes extensive patient metadata, such as diagnosis, treatment, personal medical background, and laboratory results. To assess the quality of the dataset and demonstrate its relevance, we evaluate a state-of-the-art lesion segmentation model trained on an existing MS dataset. The results underscore the importance of diverse datasets.
[CV-112] Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology
[Quick Read]: This paper assesses whether generative AI can support decision-making in high-stakes medical domains such as medical imaging, radiation oncology, and medical physics, and specifically whether GPT-5 delivers measurable gains in image understanding and numerical reasoning. The key is a targeted zero-shot evaluation framework spanning three representative tasks: VQA-RAD (visual question answering in radiology), SLAKE (cross-modal semantic grounding), and a curated medical physics board-examination-style multiple-choice set. GPT-5 significantly outperforms GPT-4o across all tasks, with accuracy gains of +20.00% on chest-mediastinal, +13.60% on lung, and +11.44% on brain-tissue questions, and reaches 90.7% on the physics exam (above the estimated human passing threshold), confirming GPT-5's potential for medical image reasoning and domain-specific numerical problem-solving as a foundation for augmenting expert workflows.
Link: https://arxiv.org/abs/2508.13192
Authors: Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, Xiaofeng Yang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o: up to +20.00% in challenging anatomical regions such as the chest-mediastinal, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.
[CV-113] Colon Polyps Detection from Colonoscopy Images Using Deep Learning
[Quick Read]: This paper tackles early identification of colon polyps to improve the accuracy of colorectal cancer screening. The key is a deep learning object detection approach that applies the YOLOv5 family (YOLOv5s, YOLOv5m, YOLOv5l) to colonoscopy images, trained and validated on the Kvasir-SEG dataset. Experiments show YOLOv5l achieves a mean average precision (mAP) of 85.1% with the highest average IoU of 0.86, clearly outperforming the other variants and offering a reliable tool for more efficient early screening. A hedged inference sketch using the public YOLOv5 hub API follows this entry.
Link: https://arxiv.org/abs/2508.13188
Authors: Md Al Amin, Bikash Kumar Paul
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages
Abstract:Colon polyps are precursors to colorectal cancer, a leading cause of cancer-related mortality worldwide. Early detection is critical for improving patient outcomes. This study investigates the application of deep learning-based object detection for early polyp identification using colonoscopy images. We utilize the Kvasir-SEG dataset, applying extensive data augmentation and splitting the data into training (80%), validation (20% of training), and testing (20%) sets. Three variants of the YOLOv5 architecture (YOLOv5s, YOLOv5m, YOLOv5l) are evaluated. Experimental results show that YOLOv5l outperforms the other variants, achieving a mean average precision (mAP) of 85.1%, with the highest average Intersection over Union (IoU) of 0.86. These findings demonstrate that YOLOv5l provides superior detection performance for colon polyp localization, offering a promising tool for enhancing colorectal cancer screening accuracy.
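The abstract names the off-the-shelf YOLOv5 family. A minimal sketch of the public Ultralytics hub API is shown below; note that the stock COCO checkpoint knows nothing about polyps, so in practice the model would first be fine-tuned on Kvasir-SEG, and the image path here is hypothetical.

```python
import torch

# Load a pretrained YOLOv5-L model from the official Ultralytics hub repo.
model = torch.hub.load('ultralytics/yolov5', 'yolov5l', pretrained=True)
model.conf = 0.25  # confidence threshold for reported detections

# Run inference on a colonoscopy frame (path is illustrative).
results = model('colonoscopy_frame.jpg')
results.print()          # class, confidence, and box for each detection
boxes = results.xyxy[0]  # tensor rows: (x1, y1, x2, y2, conf, class)
```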
Artificial Intelligence
[AI-0] ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
[Quick Read]: This paper addresses the adaptation problem of machine agents executing complex digital tasks in human-centric desktop environments: how to let agents operate diverse desktop applications autonomously and efficiently without task-specific engineering. The key is the ComputerRL framework, whose API-GUI paradigm unifies programmatic API calls with direct GUI interaction to bridge the semantic gap between machine agents and human interfaces. A distributed reinforcement learning (RL) infrastructure orchestrates thousands of parallel virtual desktops to accelerate large-scale online training, and the Entropulse strategy alternates RL with supervised fine-tuning to mitigate entropy collapse during extended runs, yielding a new state-of-the-art 48.1% accuracy on the OSWorld benchmark.
Link: https://arxiv.org/abs/2508.14040
Authors: Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks, yet remains challenging due to environmental inefficiency and instability in extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on open models GLM-4-9B-0414 and Qwen2.5-14B, and evaluate them on the OSWorld benchmark. The AutoGLM-OS-9B based on GLM-4-9B-0414 achieves a new state-of-the-art accuracy of 48.1%, demonstrating significant improvements for general agents in desktop automation. The algorithm and framework are adopted in building AutoGLM (Liu et al., 2024a).
[AI-1] A Biased Random Key Genetic Algorithm for Solving the Longest Run Subsequence Problem
[Quick Read]: This paper addresses the Longest Run Subsequence (LRS) problem, an NP-hard combinatorial optimization problem from the subsequence family in bioinformatics with applications in genome reassembly. The key is a Biased Random Key Genetic Algorithm (BRKGA) whose main advantage is the computational efficiency of individual evaluation, in particular the efficient conversion of gray-value vectors into valid solutions; a Max-Min Ant System and the integer linear programming solver CPLEX serve as comparison baselines. The results indicate that the proposed BRKGA is currently the state of the art for the LRS problem, with room for improvement remaining, especially for inputs over large alphabets. A hedged sketch of the BRKGA machinery follows this entry.
Link: https://arxiv.org/abs/2508.14020
Authors: Christian Blum, Pedro Pinacho-Davidson
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments:
Abstract:The longest run subsequence (LRS) problem is an NP-hard combinatorial optimization problem belonging to the class of subsequence problems from bioinformatics. In particular, the problem plays a role in genome reassembly. In this paper, we present a solution to the LRS problem using a Biased Random Key Genetic Algorithm (BRKGA). Our approach places particular focus on the computational efficiency of evaluating individuals, which involves converting vectors of gray values into valid solutions to the problem. For comparison purposes, a Max-Min Ant System is developed and implemented. This is in addition to the application of the integer linear programming solver CPLEX for solving all considered problem instances. The computation results show that the proposed BRKGA is currently a state-of-the-art technique for the LRS problem. Nevertheless, the results also show that there is room for improvement, especially in the context of input strings based on large alphabet sizes.
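BRKGA evolves vectors of random keys in [0, 1] and relies on a problem-specific decoder plus a biased crossover that favors the elite parent. The sketch below shows this generic machinery with a simple permutation decoder; the LRS-specific decoder and fitness function are assumed, and all parameter values are illustrative.

```python
import random

def biased_crossover(elite, non_elite, rho=0.7):
    # Each gene is inherited from the elite parent with probability rho.
    return [e if random.random() < rho else n for e, n in zip(elite, non_elite)]

def decode_permutation(keys):
    # Generic random-key decoder: order positions by their key value.
    return sorted(range(len(keys)), key=lambda i: keys[i])

def next_generation(pop, fitness, elite_frac=0.2, mutant_frac=0.1, rho=0.7):
    # One BRKGA generation: keep elites, inject random mutants, breed the rest.
    pop = sorted(pop, key=fitness, reverse=True)
    n_elite = int(elite_frac * len(pop))
    n_mutant = int(mutant_frac * len(pop))
    elites = pop[:n_elite]
    mutants = [[random.random() for _ in pop[0]] for _ in range(n_mutant)]
    children = [
        biased_crossover(random.choice(elites), random.choice(pop[n_elite:]), rho)
        for _ in range(len(pop) - n_elite - n_mutant)
    ]
    return elites + mutants + children
```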
[AI-2] Efficient Knowledge Graph Unlearning with Zeroth-order Information
[Quick Read]: This paper addresses the removal of training data and its influence from knowledge graph (KG) models, as required by regulations such as the Right to be Forgotten. Full retraining is costly, and existing influence-estimation-based unlearning incurs heavy computation on large KGs. The key is an influence function tailored to the distinctive structure of KGs together with a Taylor-expansion approximation of parameter updates that avoids expensive first- and second-order derivatives: Fisher matrices and zeroth-order optimization approximate the inverse-Hessian vector product without constructing computational graphs, substantially reducing complexity while improving unlearning efficiency and quality. A hedged sketch of the zeroth-order building blocks follows this entry.
Link: https://arxiv.org/abs/2508.14013
Authors: Yang Xiao, Ruimeng Ye, Bohan Liu, Xiaolong Ma, Bo Hui
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages
Abstract:Due to regulations like the Right to be Forgotten, there is growing demand for removing training data and its influence from models. Since full retraining is costly, various machine unlearning methods have been proposed. In this paper, we firstly present an efficient knowledge graph (KG) unlearning algorithm. We remark that KG unlearning is nontrivial due to the distinctive structure of KG and the semantic relations between entities. Also, unlearning by estimating the influence of removed components incurs significant computational overhead when applied to large-scale knowledge graphs. To this end, we define an influence function for KG unlearning and propose to approximate the model’s sensitivity without expensive computation of first-order and second-order derivatives for parameter updates. Specifically, we use Taylor expansion to estimate the parameter changes caused by data removal. Given that the first-order gradients and second-order derivatives dominate the computational load, we use the Fisher matrices and zeroth-order optimization to approximate the inverse-Hessian vector product without constructing the computational graphs. Our experimental results demonstrate that the proposed method outperforms other state-of-the-art graph unlearning baselines significantly in terms of unlearning efficiency and unlearning quality. Our code is released at this https URL.
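The paper's efficiency comes from avoiding explicit first- and second-order derivatives. As a rough illustration of the zeroth-order ingredients, the sketch below estimates a gradient from function values only and a Hessian-vector product from finite differences of those estimates; this is a generic construction under our own assumptions, not the authors' Fisher-based algorithm.

```python
import torch

def zo_gradient(loss_fn, theta, eps=1e-3, n_samples=16):
    """Two-point zeroth-order gradient estimate; needs no backward pass.

    loss_fn maps a parameter tensor to a scalar loss.
    """
    g = torch.zeros_like(theta)
    for _ in range(n_samples):
        u = torch.randn_like(theta)
        delta = loss_fn(theta + eps * u) - loss_fn(theta - eps * u)
        g += (delta / (2 * eps)) * u
    return g / n_samples

def fd_hessian_vector(loss_fn, theta, v, eps=1e-3):
    """Hessian-vector product via finite differences of gradient estimates."""
    g_plus = zo_gradient(loss_fn, theta + eps * v)
    g_minus = zo_gradient(loss_fn, theta - eps * v)
    return (g_plus - g_minus) / (2 * eps)
```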
[AI-3] Evaluating Identity Leakage in Speaker De-Identification Systems ICASSP2026
[Quick Read]: This paper addresses identity leakage in speaker de-identification, i.e., concealing a speaker's identity while preserving the intelligibility of the speech. The key is a benchmark that quantifies residual leakage through three complementary error metrics: equal error rate (EER), cumulative match characteristic (CMC) hit rate, and embedding-space similarity measured via canonical correlation analysis (CCA) and Procrustes analysis. The evaluation shows that all state-of-the-art systems still leak identity information: the best performer is only slightly better than random guessing, while the weakest reaches a 45% CMC hit rate within the top 50 candidates, exposing the limits of current privacy protection. A small sketch of the CMC metric follows this entry.
Link: https://arxiv.org/abs/2508.14012
Authors: Seungmin Seo, Oleg Aulov, Afzal Godil, Kevin Mangold
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP 2026
Abstract:Speaker de-identification aims to conceal a speaker’s identity while preserving intelligibility of the underlying speech. We introduce a benchmark that quantifies residual identity leakage with three complementary error rates: equal error rate, cumulative match characteristic hit rate, and embedding-space similarity measured via canonical correlation analysis and Procrustes analysis. Evaluation results reveal that all state-of-the-art speaker de-identification systems leak identity information. The highest performing system in our evaluation performs only slightly better than random guessing, while the lowest performing system achieves a 45% hit rate within the top 50 candidates based on CMC. These findings highlight persistent privacy risks in current speaker de-identification technologies.
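Of the three metrics, the CMC hit rate is the easiest to reproduce: it asks how often the true speaker appears among the top-k gallery matches. A minimal NumPy version, assuming a precomputed probe-gallery similarity matrix (function and variable names are ours):

```python
import numpy as np

def cmc_hit_rate(similarity, true_idx, k=50):
    """Fraction of probes whose true identity ranks in the top-k matches.

    similarity: (n_probes, n_gallery) score matrix.
    true_idx: (n_probes,) gallery index of each probe's true speaker.
    """
    order = np.argsort(-similarity, axis=1)   # gallery ranked by score
    topk = order[:, :k]
    hits = (topk == true_idx[:, None]).any(axis=1)
    return hits.mean()
```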
[AI-4] ASDFormer: A Transformer with Mixtures of Pooling-Classifier Experts for Robust Autism Diagnosis and Biomarker Discovery
[Quick Read]: This paper targets functional-connectivity pattern recognition and biomarker discovery for Autism Spectrum Disorder (ASD) diagnosis, where the core challenge is capturing atypical connectivity within and between functional brain communities while delivering accurate and interpretable classification. The key is ASDFormer, a Transformer built around a Mixture of Pooling-Classifier Experts (MoE): multiple specialized expert branches combined with attention mechanisms adaptively emphasize ASD-relevant brain regions and connectivity patterns, achieving state-of-the-art diagnostic accuracy on the ABIDE dataset and surfacing interpretable connectivity disruptions associated with ASD for biomarker discovery. A toy MoE gating sketch follows this entry.
Link: https://arxiv.org/abs/2508.14005
Authors: Mohammad Izadi, Mehran Safayani
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition marked by disruptions in brain connectivity. Functional MRI (fMRI) offers a non-invasive window into large-scale neural dynamics by measuring blood-oxygen-level-dependent (BOLD) signals across the brain. These signals can be modeled as interactions among Regions of Interest (ROIs), which are grouped into functional communities based on their underlying roles in brain function. Emerging evidence suggests that connectivity patterns within and between these communities are particularly sensitive to ASD-related alterations. Effectively capturing these patterns and identifying interactions that deviate from typical development is essential for improving ASD diagnosis and enabling biomarker discovery. In this work, we introduce ASDFormer, a Transformer-based architecture that incorporates a Mixture of Pooling-Classifier Experts (MoE) to capture neural signatures associated with ASD. By integrating multiple specialized expert branches with attention mechanisms, ASDFormer adaptively emphasizes different brain regions and connectivity patterns relevant to autism. This enables both improved classification performance and more interpretable identification of disorder-related biomarkers. Applied to the ABIDE dataset, ASDFormer achieves state-of-the-art diagnostic accuracy and reveals robust insights into functional connectivity disruptions linked to ASD, highlighting its potential as a tool for biomarker discovery.
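To make the Mixture of Pooling-Classifier Experts idea concrete, here is a deliberately simplified PyTorch module: a softmax gate weights several pooling-plus-classifier branches. Mean pooling and the layer shapes are our placeholder choices; the paper's experts and attention mechanisms are richer.

```python
import torch
import torch.nn as nn

class PoolingClassifierMoE(nn.Module):
    """Soft mixture over expert pooling+classifier branches (illustrative)."""
    def __init__(self, dim, n_experts, n_classes):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, n_classes) for _ in range(n_experts)
        )

    def forward(self, tokens):                  # tokens: (B, n_rois, dim)
        pooled = tokens.mean(dim=1)             # simple mean pooling stand-in
        weights = torch.softmax(self.gate(pooled), dim=-1)          # (B, E)
        logits = torch.stack([e(pooled) for e in self.experts], 1)  # (B, E, C)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)          # (B, C)
```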
[AI-5] Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
[Quick Read]: This paper targets the generalization bottleneck in embodied AI, the "seeing-to-doing gap" caused by data scarcity and embodiment heterogeneity. The key is adopting pointing as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that connect high-level vision-language comprehension to low-level action primitives. On this basis the authors construct the large-scale Embodied-Points-200K dataset and train Embodied-R1, a 3B embodied vision-language model, via a two-stage Reinforced Fine-tuning (RFT) curriculum with a multi-task reward design. The model reaches state-of-the-art results on 11 embodied spatial and pointing benchmarks, a 56.2% zero-shot success rate in SIMPLEREnv, and 87.5% across 8 real-world XArm tasks (a 62% improvement over strong baselines), validating the pointing-centric representation plus RFT training recipe as an effective, generalizable path to closing the perception-action gap.
Link: https://arxiv.org/abs/2508.13998
Authors: Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Embodied-R1 technical report
Abstract:Generalization in embodied AI is hindered by the “seeing-to-doing gap,” which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer “pointing” as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
[AI-6] The Social Context of Human-Robot Interactions
[Quick Read]: This paper addresses the inconsistent use of the term "social context" in Human-Robot Interaction (HRI), where varied definitions make it hard to draw connections between related work on understanding and modeling the social contexts of human-robot interactions. The key to the solution is a systematic conceptual model for describing the social context of a human-robot interaction, which the authors apply to the existing literature to identify attributes that help researchers plan interactions, develop robot behavior models, and gain insights after interactions have taken place.
Link: https://arxiv.org/abs/2508.13982
Authors: Sydney Thompson, Kate Candon, Marynel Vázquez
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: To be published in Annual Review of Control, Robotics, and Autonomous Systems
Abstract:The Human-Robot Interaction (HRI) community often highlights the social context of an interaction as a key consideration when designing, implementing, and evaluating robot behavior. Unfortunately, researchers use the term “social context” in varied ways. This can lead to miscommunication, making it challenging to draw connections between related work on understanding and modeling the social contexts of human-robot interactions. To address this gap, we survey the HRI literature for existing definitions and uses of the term “social context”. Then, we propose a conceptual model for describing the social context of a human-robot interaction. We apply this model to existing work, and we discuss a range of attributes of social contexts that can help researchers plan for interactions, develop behavior models for robots, and gain insights after interactions have taken place. We conclude with a discussion of open research questions in relation to understanding and modeling the social contexts of human-robot interactions.
[AI-7] ChronoLLM: Customizing Language Models for Physics-Based Simulation Code Generation
[Quick Read]: This paper asks whether pretrained large language models (LLMs) can be refined and customized into virtual assistants that help experts use simulation tools effectively, here PyChrono, an open-source multi-physics dynamics engine, with the goal of generating high-quality PyChrono simulation scripts. The key is a general framework for refining and customizing both open- and closed-source LLMs that yields quantifiable improvements in generated script quality, from simple single-pendulum simulations to complex virtual experiments involving full vehicles on deformable terrain. Although the generated scripts are rarely perfect, they serve as strong starting points for users to modify and improve, and the LLM can also answer API questions and recommend modeling approaches, lowering the entry barrier to simulation tools.
Link: https://arxiv.org/abs/2508.13975
Authors: Jingquan Wang, Andrew Negrut, Harry Zhang, Khailanii Slaton, Shu Wang, Radu Serban, Jinlong Wu, Dan Negrut
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:This contribution is concerned with the following issue: can pretrained large language models (LLMs) be refined and customized to the point where they become virtual assistants helping experts with the effective use of a simulation tool? In this case study, the "simulation tool" considered is PyChrono, an open source multi-physics dynamics engine for multibody systems. We present a framework for refining and customizing both open- and closed-source LLMs to harness the power of AI in generating scripts that perform PyChrono virtual experiments. We refine and customize several classes of LLMs through a process that leads to a quantifiable improvement in the quality of the generated PyChrono simulation scripts. These scripts can range from simple single-pendulum simulations to complex virtual experiments involving full vehicles on deformable terrain. While the generated scripts are rarely perfect, they often serve as strong starting points for the user to modify and improve on. Additionally, the LLM can answer specific API questions about the simulator, or recommend modeling approaches. The framework discussed is general and can be applied to lower the entry barrier for simulation tools associated with other application domains.
[AI-8] Learning to Use AI for Learning: How Can We Effectively Teach and Measure Prompting Literacy for K-12 Students?
[Quick Read]: This paper addresses K-12 educators' pressing need to cultivate students' responsible use of AI, in particular secondary students' prompting literacy when interacting with generative AI for learning. The key is a Large Language Model (LLM)-based instructional module built on scenario-based deliberate practice with intelligent LLM agents. Across two classroom deployments in 11 authentic secondary classrooms, the AI-based auto-grader assessed student-written prompts with satisfactory quality, and the instructional materials improved students' prompting skills and shifted their perceptions of AI-assisted learning positively; item analyses further show that True/False and open-ended questions measure prompting literacy more effectively than multiple-choice questions for these learners.
Link: https://arxiv.org/abs/2508.13962
Authors: Ruiwei Xiao, Xinying Hou, Ying-Jui Tseng, Hsuan Nieu, Guanze Liao, John Stamper, Kenneth R. Koedinger
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages + 2 pages references; under review for an [anonymized according to the conference policy] conference
Abstract:As Artificial Intelligence (AI) becomes increasingly integrated into daily life, there is a growing need to equip the next generation with the ability to apply, interact with, evaluate, and collaborate with AI systems responsibly. Prior research highlights the urgent demand from K-12 educators to teach students the ethical and effective use of AI for learning. To address this need, we designed an Large-Language Model (LLM)-based module to teach prompting literacy. This includes scenario-based deliberate practice activities with direct interaction with intelligent LLM agents, aiming to foster secondary school students’ responsible engagement with AI chatbots. We conducted two iterations of classroom deployment in 11 authentic secondary education classrooms, and evaluated 1) AI-based auto-grader’s capability; 2) students’ prompting performance and confidence changes towards using AI for learning; and 3) the quality of learning and assessment materials. Results indicated that the AI-based auto-grader could grade student-written prompts with satisfactory quality. In addition, the instructional materials supported students in improving their prompting skills through practice and led to positive shifts in their perceptions of using AI for learning. Furthermore, data from Study 1 informed assessment revisions in Study 2. Analyses of item difficulty and discrimination in Study 2 showed that True/False and open-ended questions could measure prompting literacy more effectively than multiple-choice questions for our target learners. These promising outcomes highlight the potential for broader deployment and highlight the need for broader studies to assess learning effectiveness and assessment design.
[AI-9] A Mechanism for Mutual Fairness in Cooperative Games with Replicable Resources – Extended Version ECAI2025
[Quick Read]: This paper targets fair reward allocation in cooperative AI systems, especially collaborative learning, where artificial and human agents jointly pursue a global goal and each party should receive a reward matching its contribution. Classical cooperative game theory notions of fairness such as the Shapley value assume nonreplicable resources, whereas data and models are infinitely replicable, so existing mechanisms cannot handle imbalanced reciprocal benefits among participants and can invite strategic exploitation and unfair allocations. The key contribution is a new mechanism together with a proof that it satisfies mutual fairness, formalized by the Balanced Reciprocity Axiom, which guarantees that for every pair of players each benefits equally from the other's participation. A hedged formalization follows this entry.
Link: https://arxiv.org/abs/2508.13960
Authors: Björn Filter, Ralf Möller, Özgür Lütfü Özçep
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: This paper is the extended version of a paper accepted at the European Conference on Artificial Intelligence 2025 (ECAI 2025), providing the proof of the main theorem in the appendix
Abstract:The latest developments in AI focus on agentic systems where artificial and human agents cooperate to realize global goals. An example is collaborative learning, which aims to train a global model based on data from individual agents. A major challenge in designing such systems is to guarantee safety and alignment with human values, particularly a fair distribution of rewards upon achieving the global goal. Cooperative game theory offers useful abstractions of cooperating agents via value functions, which assign value to each coalition, and via reward functions. With these, the idea of fair allocation can be formalized by specifying fairness axioms and designing concrete mechanisms. Classical cooperative game theory, exemplified by the Shapley value, does not fully capture scenarios like collaborative learning, as it assumes nonreplicable resources, whereas data and models can be replicated. Infinite replicability requires a generalized notion of fairness, formalized through new axioms and mechanisms. These must address imbalances in reciprocal benefits among participants, which can lead to strategic exploitation and unfair allocations. The main contribution of this paper is a mechanism and a proof that it fulfills the property of mutual fairness, formalized by the Balanced Reciprocity Axiom. It ensures that, for every pair of players, each benefits equally from the participation of the other.
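As a math sketch, one natural reading of the Balanced Reciprocity Axiom, that each of two players benefits equally from the other's participation, can be written as follows; the notation (allocations phi and subgames with a player removed) is ours, and the paper's formal statement may differ.

```latex
% Hedged formalization in our own notation:
% \phi_i(N, v) is player i's allocated reward in the game (N, v);
% N \setminus \{j\} is the game with player j removed.
\forall\, i, j \in N,\ i \neq j:\qquad
\phi_i(N, v) - \phi_i\bigl(N \setminus \{j\},\, v\bigr)
\;=\;
\phi_j(N, v) - \phi_j\bigl(N \setminus \{i\},\, v\bigr)
```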
[AI-10] The Collaboration Paradox: Why Generative AI Requires Both Strategic Intelligence and Operational Stability in Supply Chain Management
[Quick Read]: This paper investigates the systemic instability caused by collaborative behavior of autonomous AI agents in economic settings, specifically multi-echelon supply chains, where generative AI agents exhibit a "collaboration paradox": theoretically superior agents designed around Vendor-Managed Inventory (VMI) principles perform even worse than non-AI baselines because they hoard inventory and starve the system. The key to the solution is a two-layer synthesis: high-level AI-driven proactive policy-setting that establishes robust operational targets, and a low-level collaborative execution protocol with proactive downstream replenishment that maintains stability. The resulting framework can autonomously generate, evaluate, and quantify a portfolio of viable strategic choices, enabling stable and effective AI-driven supply chain management.
Link: https://arxiv.org/abs/2508.13942
Authors: Soumyadeep Dhar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rise of autonomous, AI-driven agents in economic settings raises critical questions about their emergent strategic behavior. This paper investigates these dynamics in the cooperative context of a multi-echelon supply chain, a system famously prone to instabilities like the bullwhip effect. We conduct computational experiments with generative AI agents, powered by Large Language Models (LLMs), within a controlled supply chain simulation designed to isolate their behavioral tendencies. Our central finding is the “collaboration paradox”: a novel, catastrophic failure mode where theoretically superior collaborative AI agents, designed with Vendor-Managed Inventory (VMI) principles, perform even worse than non-AI baselines. We demonstrate that this paradox arises from an operational flaw where agents hoard inventory, starving the system. We then show that resilience is only achieved through a synthesis of two distinct layers: high-level, AI-driven proactive policy-setting to establish robust operational targets, and a low-level, collaborative execution protocol with proactive downstream replenishment to maintain stability. Our final framework, which implements this synthesis, can autonomously generate, evaluate, and quantify a portfolio of viable strategic choices. The work provides a crucial insight into the emergent behaviors of collaborative AI agents and offers a blueprint for designing stable, effective AI-driven systems for business analytics.
[AI-11] InPars: Supercharging Synthetic Data Generation for Information Retrieval Systems
[Quick Read]: This paper targets the efficiency and quality of training data generation for Neural Information Retrieval (NIR): conventional synthetic query pipelines produce noisy signals and rely on aggressive filtering, discarding useful training examples. The key lies in two extensions to the InPars pipeline: (1) fine-tuning the query-generator LLM with Contrastive Preference Optimization (CPO) to improve the semantic relevance and retrieval signal of generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT)-optimized prompts via the DSPy framework. Both extensions reduce the need for aggressive filtering while improving retrieval performance. A hedged sketch of a CPO-style loss follows this entry.
Link: https://arxiv.org/abs/2508.13930
Authors: Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: this https URL.
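CPO is commonly formulated as a reference-free DPO-style preference objective plus a likelihood term on the preferred output. A hedged sketch of such a loss over summed sequence log-probabilities is below; the hyperparameters are illustrative and the paper may weight or formulate the terms differently.

```python
import torch
import torch.nn.functional as F

def cpo_loss(logp_chosen, logp_rejected, beta=0.1, lam=1.0):
    """CPO-style loss (reference-free DPO variant plus an NLL anchor).

    logp_chosen / logp_rejected: summed token log-probs of the preferred
    and dispreferred query under the generator being fine-tuned, shape (B,).
    """
    pref = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    nll = -logp_chosen.mean()  # keep probability mass on the good queries
    return pref + lam * nll
```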
[AI-12] Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
[Quick Read]: This paper addresses the unimodality limitation of deep reinforcement learning (RL) policies parameterized by a single Gaussian, which confines exploration to the vicinity of the predicted optimal action and hampers learning under sparse rewards, complex dynamics, or the need for strategic adaptation in continuous control. The key is the Categorical Policy: an intermediate categorical distribution models multimodal behavior modes, and the final action is generated conditioned on the sampled mode; differentiable sampling tricks (such as Gumbel-Softmax) preserve gradient flow, unifying structured exploration with multimodal behavior representation. On DeepMind Control Suite environments, the learned policies converge faster and outperform standard Gaussian policies. A minimal policy sketch follows this entry.
Link: https://arxiv.org/abs/2508.13922
Authors: SM Mazharul Islam, Manfred Huber
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures; submitted and accepted at IEEE SMC 2025
Abstract:A policy in deep reinforcement learning (RL), either deterministic or stochastic, is commonly parameterized as a Gaussian distribution alone, limiting the learned behavior to be unimodal. However, the nature of many practical decision-making problems favors a multimodal policy that facilitates robust exploration of the environment and thus to address learning challenges arising from sparse rewards, complex dynamics, or the need for strategic adaptation to varying contexts. This issue is exacerbated in continuous control domains where exploration usually takes place in the vicinity of the predicted optimal action, either through an additive Gaussian noise or the sampling process of a stochastic policy. In this paper, we introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution, and then generate output action that is conditioned on the sampled mode. We explore two sampling schemes that ensure differentiable discrete latent structure while maintaining efficient gradient-based optimization. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via the sampling tricks. We evaluate our multimodal policy on a set of DeepMind Control Suite environments, demonstrating that through better exploration, our learned policies converge faster and outperform standard Gaussian policies. Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
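A minimal version of a categorical policy: sample a discrete behavior mode with the Gumbel-Softmax trick (keeping the graph differentiable), then emit a mode-conditioned Gaussian action. The network shapes and the fixed log-std are our simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalGaussianPolicy(nn.Module):
    """Sample a discrete behavior mode, then a mode-conditioned action."""
    def __init__(self, obs_dim, act_dim, n_modes=4, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mode_logits = nn.Linear(hidden, n_modes)
        self.mu = nn.Linear(hidden + n_modes, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, tau=1.0):
        h = self.trunk(obs)
        # Differentiable one-hot mode sample via the Gumbel-Softmax trick.
        mode = F.gumbel_softmax(self.mode_logits(h), tau=tau, hard=True)
        mu = self.mu(torch.cat([h, mode], dim=-1))
        action = mu + self.log_std.exp() * torch.randn_like(mu)
        return action, mode
```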
[AI-13] Structured Agentic Workflows for Financial Time-Series Modeling with LLMs and Reflective Feedback
[Quick Read]: This paper addresses the difficulty of jointly achieving performance, interpretability, and auditability in financial time-series modeling, where AutoML frameworks lack domain adaptability and responsiveness to evolving objectives. The key is TS-Agent, a modular agentic framework that formalizes the pipeline as a structured, iterative decision process across three stages: model selection, code refinement, and fine-tuning, guided by contextual reasoning and experimental feedback. A planner agent equipped with structured knowledge banks, curated libraries of models and refinement strategies, steers exploration while improving interpretability and reducing error propagation, and the framework supports adaptive learning, robust debugging, and transparent auditing as required in high-stakes settings such as financial services.
Link: https://arxiv.org/abs/2508.13915
Authors: Yihao Ang, Yifan Bao, Lei Jiang, Jiajie Tao, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Time-series data is central to decision-making in financial markets, yet building high-performing, interpretable, and auditable models remains a major challenge. While Automated Machine Learning (AutoML) frameworks streamline model development, they often lack adaptability and responsiveness to domain-specific needs and evolving objectives. Concurrently, Large Language Models (LLMs) have enabled agentic systems capable of reasoning, memory management, and dynamic code generation, offering a path toward more flexible workflow automation. In this paper, we introduce TS-Agent, a modular agentic framework designed to automate and enhance time-series modeling workflows for financial applications. The agent formalizes the pipeline as a structured, iterative decision process across three stages: model selection, code refinement, and fine-tuning, guided by contextual reasoning and experimental feedback. Central to our architecture is a planner agent equipped with structured knowledge banks, curated libraries of models and refinement strategies, which guide exploration, while improving interpretability and reducing error propagation. TS-Agent supports adaptive learning, robust debugging, and transparent auditing, key requirements for high-stakes environments such as financial services. Empirical evaluations on diverse financial forecasting and synthetic data generation tasks demonstrate that TS-Agent consistently outperforms state-of-the-art AutoML and agentic baselines, achieving superior accuracy, robustness, and decision traceability.
[AI-14] Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches
[Quick Read]: This paper addresses the degradation of optimizers at very large batch sizes: with huge mini-batches, gradient noise shrinks and first-order methods struggle to escape sharp or suboptimal minima, while second-order methods such as natural gradient with Kronecker-Factored Approximate Curvature (KFAC) need so much damping for stability that the curvature information is washed out, reducing them to plain gradient descent. The key is Fisher-Orthogonal Projection (FOP), which builds a variance-aware update from the gradients of two sub-batches: the component of the gradient difference that is orthogonal to the average gradient under the Fisher metric is added to the average gradient, restoring the effectiveness of second-order optimization at large batch sizes with improved generalization and faster convergence at no extra computational burden. A Euclidean toy sketch follows this entry.
Link: https://arxiv.org/abs/2508.13898
Authors: Yishun Lu, Wesley Armour
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such a large batch size. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively washes out the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher-metric.
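The abstract specifies the FOP direction up to the choice of inner product. The sketch below uses the plain Euclidean dot product as a stand-in for the Fisher metric, so it is a toy version of the idea rather than the paper's method: average the two sub-batch gradients, then add back the component of their difference orthogonal to that average.

```python
import torch

def fop_update(g1, g2, alpha=1.0, eps=1e-12):
    """Toy Fisher-Orthogonal Projection direction (Euclidean stand-in).

    g1, g2: flattened gradients from two sub-batches. The paper projects
    under the Fisher metric; here a plain dot product is used instead.
    """
    g_avg = 0.5 * (g1 + g2)
    d = 0.5 * (g1 - g2)  # variance-carrying component between sub-batches
    coef = torch.dot(d, g_avg) / (torch.dot(g_avg, g_avg) + eps)
    d_perp = d - coef * g_avg          # orthogonal to the mean direction
    return g_avg + alpha * d_perp
```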
[AI-15] Toward Deployable Multi-Robot Collaboration via a Symbolically-Guided Decision Transformer
[Quick Read]: This paper addresses the difficulty of deploying reinforcement learning (RL) for multi-robot manipulation, where RL's data hunger and reliance on the Markov Decision Process (MDP) assumption clash with complex dynamics and long-term temporal dependencies. The key is the Symbolically-Guided Decision Transformer (SGDT), which couples a neuro-symbolic mechanism with a causal transformer in a hierarchical architecture: a neuro-symbolic planner produces a high-level, task-oriented plan of symbolic subgoals, and a goal-conditioned decision transformer (GCDT) performs low-level sequential decision-making guided by those subgoals, enabling structured, interpretable, and generalizable decision-making in complex multi-robot collaboration tasks.
Link: https://arxiv.org/abs/2508.13877
Authors: Rathnam Vidushika Rasanji, Jin Wei-Kocsis, Jiansong Zhang, Dongming Gan, Ragu Athinarayanan, Paul Asunda
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL) has demonstrated great potential in robotic operations. However, its data-intensive nature and reliance on the Markov Decision Process (MDP) assumption limit its practical deployment in real-world scenarios involving complex dynamics and long-term temporal dependencies, such as multi-robot manipulation. Decision Transformers (DTs) have emerged as a promising offline alternative by leveraging causal transformers for sequence modeling in RL tasks. However, their applications to multi-robot manipulations still remain underexplored. To address this gap, we propose a novel framework, Symbolically-Guided Decision Transformer (SGDT), which integrates a neuro-symbolic mechanism with a causal transformer to enable deployable multi-robot collaboration. In the proposed SGDT framework, a neuro-symbolic planner generates a high-level task-oriented plan composed of symbolic subgoals. Guided by these subgoals, a goal-conditioned decision transformer (GCDT) performs low-level sequential decision-making for multi-robot manipulation. This hierarchical architecture enables structured, interpretable, and generalizable decision making in complex multi-robot collaboration tasks. We evaluate the performance of SGDT across a range of task scenarios, including zero-shot and few-shot scenarios. To our knowledge, this is the first work to explore DT-based technology for multi-robot manipulation.
[AI-16] UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion CIKM2025
[Quick Read]: This paper addresses two limitations of current e-commerce multimodal retrieval: systems are optimized for specific tasks with fixed modality pairings, and there is no comprehensive benchmark for evaluating unified retrieval approaches. The key is the UniECS framework with three innovations: a novel gated multimodal encoder with adaptive fusion that integrates representations across modalities and handles missing ones; a comprehensive training strategy combining cross-modal alignment loss (CMAL), cohesive local alignment loss (CLAL), intra-modal contrastive loss (IMCL), and adaptive loss weighting; and M-BEER, a curated benchmark of 50K product pairs for multimodal e-commerce search evaluation. UniECS consistently outperforms existing methods across four e-commerce benchmarks and, deployed in Kuaishou's search platform across two scenarios, lifts click-through rate by +2.74% and revenue by +8.33%.
Link: https://arxiv.org/abs/2508.13843
Authors: Zihan Liang, Yufei Ma, ZhiPeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, Han Li
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at CIKM2025 as a long paper
Abstract:Current e-commerce multimodal retrieval systems face two key limitations: they optimize for specific tasks with fixed modality pairings, and lack comprehensive benchmarks for evaluating unified retrieval approaches. To address these challenges, we introduce UniECS, a unified multimodal e-commerce search framework that handles all retrieval scenarios across image, text, and their combinations. Our work makes three key contributions. First, we propose a flexible architecture with a novel gated multimodal encoder that uses adaptive fusion mechanisms. This encoder integrates different modality representations while handling missing modalities. Second, we develop a comprehensive training strategy to optimize learning. It combines cross-modal alignment loss (CMAL), cohesive local alignment loss (CLAL), intra-modal contrastive loss (IMCL), and adaptive loss weighting. Third, we create M-BEER, a carefully curated multimodal benchmark containing 50K product pairs for e-commerce search evaluation. Extensive experiments demonstrate that UniECS consistently outperforms existing methods across four e-commerce benchmarks with fine-tuning or zero-shot evaluation. On our M-BEER bench, UniECS achieves substantial improvements in cross-modal tasks (up to 28% gain in R@10 for text-to-image retrieval) while maintaining parameter efficiency (0.2B parameters) compared to larger models like GME-Qwen2VL (2B) and MM-Embed (8B). Furthermore, we deploy UniECS in the e-commerce search platform of Kuaishou Inc. across two search scenarios, achieving notable improvements in Click-Through Rate (+2.74%) and Revenue (+8.33%). The comprehensive evaluation demonstrates the effectiveness of our approach in both experimental and real-world settings. Corresponding codes, models and datasets will be made publicly available at this https URL.
[AI-17] One Shot vs. Iterative: Rethinking Pruning Strategies for Model Compression
[Quick Read]: This paper revisits the contested choice between pruning strategies for neural network compression: iterative pruning is widely adopted in practice, but its presumed superiority has rarely been tested systematically. The key is a systematic comparison framework that evaluates one-shot versus iterative pruning across structured and unstructured settings, pruning criteria, and modalities. The study finds that one-shot pruning is more effective at lower pruning ratios while iterative pruning wins at higher ratios; building on this, the authors advocate patience-based pruning and introduce a hybrid approach that can outperform traditional methods in certain scenarios, giving practitioners a basis for choosing a pruning strategy tailored to their goals and constraints.
Link: https://arxiv.org/abs/2508.13836
Authors: Mikołaj Janusz, Tomasz Wojnar, Yawei Li, Luca Benini, Kamil Adamczewski
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Pruning is a core technique for compressing neural networks to improve computational efficiency. This process is typically approached in two ways: one-shot pruning, which involves a single pass of training and pruning, and iterative pruning, where pruning is performed over multiple cycles for potentially finer network refinement. Although iterative pruning has historically seen broader adoption, this preference is often assumed rather than rigorously tested. Our study presents one of the first systematic and comprehensive comparisons of these methods, providing rigorous definitions, benchmarking both across structured and unstructured settings, and applying different pruning criteria and modalities. We find that each method has specific advantages: one-shot pruning proves more effective at lower pruning ratios, while iterative pruning performs better at higher ratios. Building on these findings, we advocate for patience-based pruning and introduce a hybrid approach that can outperform traditional methods in certain scenarios, providing valuable insights for practitioners selecting a pruning strategy tailored to their goals and constraints. Source code is available at this https URL.
[AI-18] Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration
[Quick Read]: This paper addresses the limited adaptability of any single Retrieval-Augmented Generation (RAG) framework across diverse downstream tasks. The key is a comprehensive, systematic study of ensembles built from multiple RAG systems, combining theoretical analysis (the first information-entropy account of the ensemble mechanism) with mechanistic analysis at both the pipeline level and the module level. Across four pipelines (Branching, Iterative, Loop, Agentic) and three modules (Generator, Retriever, Reranker), the experiments show that aggregating multiple RAG systems is both generalizable and robust, laying a solid foundation for multi-RAG system ensemble research.
Link: https://arxiv.org/abs/2508.13828
Authors: Yifei Chen, Guanting Dong, Yutao Zhu, Zhicheng Dou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) technology has been widely applied in recent years. However, despite the emergence of various RAG frameworks, a single RAG framework still cannot adapt well to a broad range of downstream tasks. Therefore, how to leverage the advantages of multiple RAG systems has become an area worth exploring. To address this issue, we have conducted a comprehensive and systematic investigation into ensemble methods based on RAG systems. Specifically, we have analyzed the RAG ensemble framework from both theoretical and mechanistic analysis perspectives. From the theoretical analysis, we provide the first explanation of the RAG ensemble framework from the perspective of information entropy. In terms of mechanism analysis, we have explored the RAG ensemble framework from both the pipeline and module levels. We carefully select four different pipelines (Branching, Iterative, Loop, and Agentic) and three different modules (Generator, Retriever, and Reranker) to solve seven different research questions. The experiments show that aggregating multiple RAG systems is both generalizable and robust, whether at the pipeline level or the module level. Our work lays the foundation for similar research on the multi-RAG system ensemble.
[AI-19] Assessing Trustworthiness of AI Training Dataset using Subjective Logic – A Use Case on Bias ECML PKDD
[Quick Read]: This paper addresses the assessment of AI training dataset trustworthiness, in particular global properties such as bias that emerge only at the level of the dataset as a whole and that prior data-point-level methods cannot evaluate. The key is the first formal framework for assessing dataset trustworthiness, built on Subjective Logic: it supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, or conflicting. The framework is instantiated on the trustworthiness property of bias and evaluated on a traffic sign recognition dataset, where it captures class imbalance while remaining interpretable and robust in both centralized and federated contexts. A small subjective-logic sketch follows this entry.
Link: https://arxiv.org/abs/2508.13813
Authors: Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Frank Kargl
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ECML PKDD Bias Workshop '25
Abstract:As AI systems increasingly rely on training data, assessing dataset trustworthiness has become critical, particularly for properties like fairness or bias that emerge at the dataset level. Prior work has used Subjective Logic to assess trustworthiness of individual data, but not to evaluate trustworthiness properties that emerge only at the level of the dataset as a whole. This paper introduces the first formal framework for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluations of global properties such as bias. Built on Subjective Logic, our approach supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, and/or conflicting. We instantiate this framework on the trustworthiness property of bias, and we experimentally evaluate it based on a traffic sign recognition dataset. The results demonstrate that our method captures class imbalance and remains interpretable and robust in both centralized and federated contexts.
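Subjective Logic's standard binomial opinion maps positive and negative evidence counts to belief, disbelief, and an explicit uncertainty mass; this is the textbook mapping the framework builds on, shown here with an invented fairness-audit example.

```python
def binomial_opinion(r, s, W=2.0, a=0.5):
    """Map positive/negative evidence counts to a subjective-logic opinion.

    Returns (belief, disbelief, uncertainty, base_rate); W is the
    non-informative prior weight (2 in the standard formulation).
    """
    total = r + s + W
    return r / total, s / total, W / total, a

# Invented example: 8 fairness checks pass, 2 fail.
b, d, u, a = binomial_opinion(8, 2)
print(f"belief={b:.2f} disbelief={d:.2f} uncertainty={u:.2f}")
```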
[AI-20] Quantifier Instantiations: To Mimic or To Revolt?
[Quick Read]: This paper targets the challenge that quantified formulas pose for Satisfiability Modulo Theories (SMT) solvers, whose inherent undecidability strains both efficiency and completeness; existing instantiation techniques (e-matching, syntax-guided, model-based, conflict-based, enumerative) are complementary but individually limited. The key is a novel instantiation approach that learns dynamically from these techniques during solving: observed instantiations are treated as samples from a latent language, and probabilistic context-free grammars (PCFGs) generate new, similar terms. The method exploits successful past instantiations while optionally inverting the learned term probabilities to explore new territory, balancing exploitation and exploration in quantifier reasoning. A toy PCFG sampler follows this entry.
Link: https://arxiv.org/abs/2508.13811
Authors: Jan Jakubův, Mikoláš Janota
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Accepted to SMT 2025: 23rd International Workshop on Satisfiability Modulo Theories
Abstract:Quantified formulas pose a significant challenge for Satisfiability Modulo Theories (SMT) solvers due to their inherent undecidability. Existing instantiation techniques, such as e-matching, syntax-guided, model-based, conflict-based, and enumerative methods, often complement each other. This paper introduces a novel instantiation approach that dynamically learns from these techniques during solving. By treating observed instantiations as samples from a latent language, we use probabilistic context-free grammars to generate new, similar terms. Our method not only mimics successful past instantiations but also explores diversity by optionally inverting learned term probabilities, aiming to balance exploitation and exploration in quantifier reasoning.
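A toy version of the idea: treat past instantiations as samples from a PCFG, then sample new terms from it, optionally inverting the rule weights to favor rarely seen shapes. The grammar below is invented for illustration; the paper learns the weights from observed instantiations during solving.

```python
import random

# Toy probabilistic grammar over instantiation terms: each nonterminal
# maps to (weight, expansion) pairs; anything not in the grammar is terminal.
GRAMMAR = {
    "T": [(0.5, ["f", "(", "T", ")"]), (0.3, ["x"]), (0.2, ["0"])],
}

def sample(symbol="T", grammar=GRAMMAR, invert=False):
    if symbol not in grammar:        # terminal symbol
        return symbol
    weights = [w for w, _ in grammar[symbol]]
    if invert:                       # exploration: flip the distribution
        weights = [1.0 / w for w in weights]
    rule = random.choices(grammar[symbol], weights=weights, k=1)[0][1]
    return "".join(sample(s, grammar, invert) for s in rule)

print(sample())              # mimic frequently observed term shapes
print(sample(invert=True))   # prefer rarely seen term shapes
```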
[AI-21] BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web
[Quick Read]: This paper addresses the fragmented, closed state of current ecosystems for LLM-based multi-agent systems (LaMAS) and the lack of trustworthy collaboration mechanisms, particularly the privacy-protection, data-management, and value-measurement limits inherent in centralized or semi-centralized architectures. The key is the Blockchain-enabled Trustworthy Agentic Web (BetaWeb), which leverages the decentralization, immutability, and traceability of blockchain to provide a scalable, trustworthy infrastructure supporting ownership of agent capabilities and monetization of intelligence, advancing the Web paradigm from Web3 toward Web3.5 along a five-stage evolutionary roadmap from passive execution to autonomous governance.
Link: https://arxiv.org/abs/2508.13787
Authors: Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: A technical report with 21 pages, 3 figures, and 3 tables
Abstract:The rapid development of large language models (LLMs) has significantly propelled the development of artificial intelligence (AI) agents, which are increasingly evolving into diverse autonomous entities, advancing the LLM-based multi-agent systems (LaMAS). However, current agentic ecosystems remain fragmented and closed. Establishing an interconnected and scalable paradigm for Agentic AI has become a critical prerequisite. Although Agentic Web proposes an open architecture to break the ecosystem barriers, its implementation still faces core challenges such as privacy protection, data management, and value measurement. Existing centralized or semi-centralized paradigms suffer from inherent limitations, making them inadequate for supporting large-scale, heterogeneous, and cross-domain autonomous interactions. To address these challenges, this paper introduces the blockchain-enabled trustworthy Agentic Web (BetaWeb). By leveraging the inherent strengths of blockchain, BetaWeb not only offers a trustworthy and scalable infrastructure for LaMAS but also has the potential to advance the Web paradigm from Web3 (centered on data ownership) towards Web3.5, which emphasizes ownership of agent capabilities and the monetization of intelligence. Beyond a systematic examination of the BetaWeb framework, this paper presents a five-stage evolutionary roadmap, outlining the path of LaMAS from passive execution to advanced collaboration and autonomous governance. We also conduct a comparative analysis of existing products and discuss key challenges of BetaWeb from multiple perspectives. Ultimately, we argue that deep integration between blockchain and LaMAS can lay the foundation for a resilient, trustworthy, and sustainably incentivized digital ecosystem. A summary of the enabling technologies for each stage is available at this https URL.
[AI-22] DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer
[Quick Read]: This paper addresses three inherent trade-offs in controllable text-to-audio generation: accurate temporal localization, open-vocabulary scalability, and practical efficiency, which existing methods cannot satisfy simultaneously. The key is DegDiT, a dynamic event graph-guided diffusion transformer: events in the text description are encoded as structured dynamic graphs whose nodes capture semantic features, temporal attributes, and inter-event connections, and a graph transformer integrates the nodes into contextualized event embeddings that guide the diffusion model. A quality-balanced data selection pipeline (hierarchical event annotation plus multi-criteria quality scoring) and consensus preference optimization over multiple reward signals further improve quality and diversity, yielding state-of-the-art results on the AudioCondition, DESED, and AudioTime datasets.
Link: https://arxiv.org/abs/2508.13786
Authors: Yisu Liu, Chenxing Li, Wanqian Zhang, Wenfu Wang, Meng Yu, Ruibo Fu, Zheng Lin, Weiping Wang, Dong Yu
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:
Abstract:Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, facilitating audio generation through consensus among multiple reward signals. Extensive experiments on AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performances across a variety of objective and subjective evaluation metrics.
[AI-23] Agentic DraCor and the Art of Docstring Engineering: Evaluating MCP-empowered LLM Usage of the DraCor API
[Quick Read]: This paper asks how Large Language Models (LLMs) can autonomously and reliably interact with digital humanities infrastructure such as the DraCor API, to improve automated tool use in Computational Literary Studies. The key is implementing a Model Context Protocol (MCP) server for DraCor and systematically refining the tool documentation, dubbed "Docstring Engineering", to improve the LLM's understanding and selection of available tools, evaluated via "Tool Correctness", "Tool-Calling Efficiency", and "Tool-Use Reliability".
Link: https://arxiv.org/abs/2508.13774
Authors: Peer Trilcke, Ingo Börner, Henny Sluyter-Gäthje, Daniil Skorinkin, Frank Fischer, Carsten Milling
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Preprint, submitted to the 2nd Workshop on Computational Drama Analysis at DraCor Summit 2025, September 03, 2025, Berlin, Germany
Abstract:This paper reports on the implementation and evaluation of a Model Context Protocol (MCP) server for DraCor, enabling Large Language Models (LLM) to autonomously interact with the DraCor API. We conducted experiments focusing on tool selection and application by the LLM, employing a qualitative approach that includes systematic observation of prompts to understand how LLMs behave when using MCP tools, evaluating “Tool Correctness”, “Tool-Calling Efficiency”, and “Tool-Use Reliability”. Our findings highlight the importance of “Docstring Engineering”, defined as reflexively crafting tool documentation to optimize LLM-tool interaction. Our experiments demonstrate both the promise of agentic AI for research in Computational Literary Studies and the essential infrastructure development needs for reliable Digital Humanities infrastructures.
[AI-24] PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting
[Quick Read]: This paper revisits the debated effectiveness of Transformer-based models for long-term time series forecasting (LTSF), in particular self-attention's weakness at capturing periodic patterns. The key is PENGUIN (Periodic-Nested Group Attention), a simple yet effective mechanism with two parts: a periodic-nested relative attention bias that models periodic structure explicitly, and a grouped attention mechanism in which each group targets a specific periodicity (such as coexisting daily and weekly cycles) via multi-query attention. Experiments across diverse benchmarks show that PENGUIN consistently outperforms both MLP-based and Transformer-based models. A minimal bias sketch follows this entry.
Link: https://arxiv.org/abs/2508.13773
Authors: Tian Sun, Yuqi Chen, Weiwei Sun
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Long-term time series forecasting (LTSF) is a fundamental task with wide-ranging applications. Although Transformer-based models have made significant breakthroughs in forecasting, their effectiveness for time series forecasting remains debatable. In this paper, we revisit the significance of self-attention and propose a simple yet effective mechanism, Periodic-Nested Group Attention, namely PENGUIN. Our approach highlights the importance of explicitly modeling periodic patterns and incorporating relative attention bias for effective time series modeling. To this end, we introduce a periodic-nested relative attention bias that captures periodic structures directly. To handle multiple coexisting periodicities (e.g., daily and weekly cycles), we design a grouped attention mechanism, where each group targets a specific periodicity using a multi-query attention mechanism. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models.
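The first ingredient, a relative attention bias that repeats with a known period, is easy to sketch: positions whose distance differs by a full cycle share one learnable bias value, and each group in the paper would hold one such bias for its own periodicity. The shapes and the added-before-softmax convention are our assumptions.

```python
import torch
import torch.nn as nn

class PeriodicRelativeBias(nn.Module):
    """Learnable attention bias that depends on (i - j) mod period."""
    def __init__(self, period: int):
        super().__init__()
        self.period = period
        self.bias = nn.Parameter(torch.zeros(period))

    def forward(self, seq_len: int) -> torch.Tensor:
        idx = torch.arange(seq_len)
        rel = (idx[:, None] - idx[None, :]) % self.period  # (L, L)
        return self.bias[rel]  # add to attention logits before softmax

# Coexisting cycles (e.g., hourly data: 24-step daily, 168-step weekly),
# one bias per attention group.
daily, weekly = PeriodicRelativeBias(24), PeriodicRelativeBias(168)
scores_bias = daily(96)  # (96, 96) bias for a 96-step window
```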
[AI-25] COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models
[Quick Read]: This paper addresses the narrow focus of current code generation benchmarks on functional correctness, which overlooks algorithmic efficiency and code quality, both critical in real-world programming. The key is COMPASS (COdility's Multi-dimensional Programming ASSessment), a comprehensive framework that evaluates code generation along three dimensions: correctness, efficiency, and quality. It contains 50 problems from real Codility competitions with authentic human baselines from 393,150 submissions, and uses industry-standard analysis tools to quantify runtime efficiency and maintainability. Evaluating Claude Opus 4, Gemini 2.5 Pro, and O4-Mini-High shows that models with high correctness scores do not necessarily produce efficient algorithms or maintainable code, motivating evaluation beyond correctness and steering research toward production-ready AI code generation systems.
Link: https://arxiv.org/abs/2508.13757
Authors: James Meaden, Michał Jarosz, Piotr Jodłowski, Grigori Melnik
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility’s Multi-dimensional Programming ASSessment), a comprehensive evaluation framework that assesses code generation across three dimensions: correctness, efficiency, and quality. COMPASS consists of 50 competitive programming problems from real Codility competitions, providing authentic human baselines from 393,150 submissions. Unlike existing benchmarks that treat algorithmically inefficient solutions identically to optimal ones provided they pass test cases, COMPASS systematically evaluates runtime efficiency and code quality using industry-standard analysis tools. Our evaluation of three leading reasoning-enhanced models, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI O4-Mini-High, reveals that models achieving high correctness scores do not necessarily produce efficient algorithms or maintainable code. These findings highlight the importance of evaluating more than just correctness to truly understand the real-world capabilities of code generation models. COMPASS serves as a guiding framework, charting a path for future research toward AI systems that are robust, reliable, and ready for production use.
zh
[AI-26] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Reward, RLVR)范式下大语言模型推理能力提升受限的问题,核心挑战在于两个未被充分探索的维度:深度(模型能采样的最难题目)和广度(单次迭代中消耗的样本数量)。研究发现,现有GRPO算法存在系统性偏差,即累积优势机制过度加权中等准确率样本,而忽视了对突破推理边界至关重要的低准确率样本。为此,作者提出Difficulty Adaptive Rollout Sampling (DARS) 方法,通过目标导向的多阶段采样重新加权难题,显著增加难问题的正向 rollout 数量,从而有效弥补“深度”探索不足;同时,通过大幅扩展批量大小并采用全批更新替代PPO的小批量迭代,实现“广度”的激进扩展,显著提升Pass@1性能,并维持高token级熵以保障持续探索与低梯度噪声。最终提出的DARS-B方法结合两者,在不增加推理成本的前提下同步提升Pass@K与Pass@1指标,证明广度与自适应深度探索在RLVR中是正交且互补的关键维度,共同释放模型推理潜力。
链接: https://arxiv.org/abs/2508.13755
作者: Zhicheng Yang,Zhijiang Guo,Yinya Huang,Yongxin Wang,Dongchun Xie,Yiwei Wang,Xiaodan Liang,Jing Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11pages, 9 figures
Abstract:Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO’s mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.
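以下是“难度自适应 rollout 分配”思想的一个极简示意(非论文官方算法;分配规则与数值均为笔者为演示而设):通过率越低的问题分得越多的 rollout,以增加难题的正样本数量:

```python
import numpy as np

def adaptive_rollout_allocation(acc: np.ndarray, budget: int, min_n: int = 1) -> np.ndarray:
    """acc: 各问题上一阶段的通过率;返回每个问题分配到的 rollout 数量。"""
    hardness = (1.0 - acc) + 1e-8            # 通过率越低越“难”,权重越大
    weights = hardness / hardness.sum()
    n = np.maximum(min_n, np.floor(weights * budget)).astype(int)
    order = np.argsort(-hardness)             # 剩余预算优先补给最难的问题
    i = 0
    while n.sum() < budget:
        n[order[i % len(order)]] += 1
        i += 1
    return n

acc = np.array([0.9, 0.5, 0.1, 0.0])
print(adaptive_rollout_allocation(acc, budget=32))  # 难题分得更多 rollout,如 [1 6 12 13]
```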
zh
[AI-27] Expertise-aware Multi-LLM Recruitment and Collaboration for Medical Decision-Making
【速读】:该论文旨在解决医疗决策(Medical Decision-Making, MDM)过程中因单一大语言模型(Large Language Model, LLM)存在参数知识限制和静态训练语料导致的临床信息整合能力不足问题。其核心解决方案是提出一种专家感知的多LLM招募与协作框架(Expertise-aware Multi-LLM Recruitment and Collaboration, EMRC),关键在于:第一阶段通过构建公开语料库驱动的LLM专业能力表(LLM Expertise Table),实现基于医学科室类别和查询难度的动态最优LLM选择;第二阶段则利用自评置信度融合与对抗验证机制,增强多代理协作下的诊断可靠性,从而有效发挥各LLM的专业互补性,显著提升MDM系统的准确性与鲁棒性。
链接: https://arxiv.org/abs/2508.13754
作者: Liuxin Bao,Zhihao Peng,Xiaofei Zhou,Runmin Cong,Jiyong Zhang,Yixuan Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages
Abstract:Medical Decision-Making (MDM) is a complex process requiring substantial domain-specific expertise to effectively synthesize heterogeneous and complicated clinical information. While recent advancements in Large Language Models (LLMs) show promise in supporting MDM, single-LLM approaches are limited by their parametric knowledge constraints and static training corpora, failing to robustly integrate the clinical information. To address this challenge, we propose the Expertise-aware Multi-LLM Recruitment and Collaboration (EMRC) framework to enhance the accuracy and reliability of MDM systems. It operates in two stages: (i) expertise-aware agent recruitment and (ii) confidence- and adversarial-driven multi-agent collaboration. Specifically, in the first stage, we use a publicly available corpus to construct an LLM expertise table for capturing expertise-specific strengths of multiple LLMs across medical department categories and query difficulty levels. This table enables the subsequent dynamic selection of the optimal LLMs to act as medical expert agents for each medical query during the inference phase. In the second stage, we employ selected agents to generate responses with self-assessed confidence scores, which are then integrated through the confidence fusion and adversarial validation to improve diagnostic reliability. We evaluate our EMRC framework on three public MDM datasets, where the results demonstrate that our EMRC outperforms state-of-the-art single- and multi-LLM methods, achieving superior diagnostic performance. For instance, on the MMLU-Pro-Health dataset, our EMRC achieves 74.45% accuracy, representing a 2.69% improvement over the best-performing closed-source model GPT-4-0613, which demonstrates the effectiveness of our expertise-aware agent recruitment strategy and the agent complementarity in leveraging each LLM’s specialized capabilities.
zh
[AI-28] On the Security and Privacy of Federated Learning: A Survey with Attacks Defenses Frameworks Applications and Future Directions
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际应用中面临的安全与隐私威胁问题,包括恶意行为(如拜占庭攻击、投毒攻击和Sybil攻击)以及敏感数据泄露风险。其解决方案的关键在于系统性地梳理和分类超过200篇相关研究,从安全增强和隐私保护两个维度提出应对策略:安全增强方法通过提高模型鲁棒性来抵御恶意客户端干扰;隐私保护技术则借助密码学、差分隐私(Differential Privacy)和安全聚合(Secure Aggregation)等手段实现数据隐私保障。论文进一步分析了现有方法的优劣、隐私-安全-性能之间的权衡关系,并指出非独立同分布(non-IID)数据对防御效果的影响,从而为构建高效、可扩展且适应动态异构环境的联邦学习系统提供理论支撑与实践指导。
链接: https://arxiv.org/abs/2508.13730
作者: Daniel M. Jimenez-Gutierrez,Yelizaveta Falkouskaya,Jose L. Hernandez-Ramos,Aris Anagnostopoulos,Ioannis Chatzigiannakis,Andrea Vitaletti
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Federated Learning (FL) is an emerging distributed machine learning paradigm enabling multiple clients to train a global model collaboratively without sharing their raw data. While FL enhances data privacy by design, it remains vulnerable to various security and privacy threats. This survey provides a comprehensive overview of more than 200 papers regarding the state-of-the-art attacks and defense mechanisms developed to address these challenges, categorizing them into security-enhancing and privacy-preserving techniques. Security-enhancing methods aim to improve FL robustness against malicious behaviors such as Byzantine attacks, poisoning, and Sybil attacks. At the same time, privacy-preserving techniques focus on protecting sensitive data through cryptographic approaches, differential privacy, and secure aggregation. We critically analyze the strengths and limitations of existing methods, highlight the trade-offs between privacy, security, and model performance, and discuss the implications of non-IID data distributions on the effectiveness of these defenses. Furthermore, we identify open research challenges and future directions, including the need for scalable, adaptive, and energy-efficient solutions operating in dynamic and heterogeneous FL environments. Our survey aims to guide researchers and practitioners in developing robust and privacy-preserving FL systems, fostering advancements that safeguard the integrity and confidentiality of collaborative learning frameworks.
zh
[AI-29] CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning
【速读】:该论文旨在解决小型开源大语言模型(Large Language Model, LLM)在多智能体协作任务中因依赖表面相关性而非因果推理而导致的因果无效或逻辑不一致行为问题,这限制了其在动态环境中的协调与规划能力。解决方案的关键在于提出CausalPlan框架,该框架采用两阶段设计,核心是结构化因果动作(Structural Causal Action, SCA)模型——通过从智能体轨迹中学习因果图来捕捉先前动作和当前环境状态对未来决策的影响,并据此为LLM生成的动作提案分配因果得分,从而重新加权或替换为因果一致的替代方案,使规划过程嵌入因果知识而不需微调LLM本身。
链接: https://arxiv.org/abs/2508.13721
作者: Minh Hoang Nguyen,Van Dai Do,Dung Nguyen,Thin Nguyen,Hung Le
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) agents-especially smaller, open-source models-often produce causally invalid or incoherent actions in collaborative tasks due to their reliance on surface-level correlations rather than grounded causal reasoning. This limitation undermines their performance in terms of coordination and planning in dynamic environments. We address this challenge with CausalPlan, a two-phase framework that integrates explicit structural causal reasoning into the LLM planning process. At the core of CausalPlan is the Structural Causal Action (SCA) model, which learns a causal graph from agent trajectories to capture how prior actions and current environment states influence future decisions. This structure is then used to guide action selection by assigning causal scores to LLM-generated proposals, reweighting them accordingly, or falling back to causally grounded alternatives when needed. By embedding this causal knowledge directly into the decision loop, CausalPlan constrains planning to intervention-consistent behaviours without requiring fine-tuning of the LLM itself. We evaluate CausalPlan on the Overcooked-AI benchmark across five multi-agent coordination tasks and four LLMs of varying sizes: Gemma-7B, Llama-8B, Qwen-14B, and Llama-70B. Experimental results show that CausalPlan consistently reduces invalid actions and improves collaboration in both AI-AI and human-AI settings, outperforming strong reinforcement learning baselines. Our findings highlight the value of causality-driven planning for deploying efficient, interpretable, and generalisable multi-agent LLM systems.
zh
[AI-30] The AI Risk Spectrum: From Dangerous Capabilities to Existential Threats
【速读】:该论文旨在系统性地梳理和分类人工智能(AI)可能带来的风险,从当前已知的个体用户伤害到可能威胁人类生存的存在性风险,构建一个全面的风险图谱。其核心问题是:随着AI系统能力增强、集成度提高和应用范围扩大,如何识别、理解并应对这些多层次、多维度的风险,以避免从现有趋势演变为灾难性后果。解决方案的关键在于明确区分三类主要风险——滥用风险(Misuse Risks)、对齐风险(Misalignment Risks)与系统性风险(Systemic Risks),并识别出竞争压力、事故、企业冷漠及协调失败等风险放大因素;同时强调,实现安全且有益的未来并非自然发生,而是需要前所未有的全球协作与前瞻性治理策略。
链接: https://arxiv.org/abs/2508.13700
作者: Markov Grey,Charbel-Raphaël Segerie
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:As AI systems become more capable, integrated, and widespread, understanding the associated risks becomes increasingly important. This paper maps the full spectrum of AI risks, from current harms affecting individual users to existential threats that could endanger humanity’s survival. We organize these risks into three main causal categories. Misuse risks, which occur when people deliberately use AI for harmful purposes - creating bioweapons, launching cyberattacks, adversarial AI attacks or deploying lethal autonomous weapons. Misalignment risks happen when AI systems pursue outcomes that conflict with human values, irrespective of developer intentions. This includes risks arising through specification gaming (reward hacking), scheming and power-seeking tendencies in pursuit of long-term strategic goals. Systemic risks, which arise when AI integrates into complex social systems in ways that gradually undermine human agency - concentrating power, accelerating political and economic disempowerment, creating overdependence that leads to human enfeeblement, or irreversibly locking in current values curtailing future moral progress. Beyond these core categories, we identify risk amplifiers - competitive pressures, accidents, corporate indifference, and coordination failures - that make all risks more likely and severe. Throughout, we connect today’s existing risks and empirically observable AI behaviors to plausible future outcomes, demonstrating how existing trends could escalate to catastrophic outcomes. Our goal is to help readers understand the complete landscape of AI risks. Good futures are possible, but they don’t happen by default. Navigating these challenges will require unprecedented coordination, but an extraordinary future awaits if we do.
zh
[AI-31] The DeepLog Neurosymbolic Machine
【速读】:该论文旨在解决神经符号人工智能(Neurosymbolic AI)系统设计与实现中缺乏统一抽象框架的问题,即如何在理论和操作层面提供一个通用、高效且可扩展的建模与计算平台。其解决方案的关键在于提出DeepLog框架,该框架包含两个核心组件:一是基于标注神经扩展的 grounded first-order logic 的DeepLog语言,用于形式化描述神经符号模型及其推理任务,并抽象出逻辑类型(如布尔、模糊或概率逻辑)及逻辑使用方式(架构内或损失函数中);二是基于扩展代数电路(algebraic circuits)的计算图结构,作为底层计算机制,支持GPU加速并实现高效的数值运算。二者共同构成一个神经符号抽象机(neurosymbolic abstract machine),通过声明式编程方式灵活组合不同的代数结构和逻辑体系,从而在保持通用性的同时显著提升效率。
链接: https://arxiv.org/abs/2508.13697
作者: Vincent Derkinderen,Robin Manhaeve,Rik Adriaensen,Lucas Van Praet,Lennert De Smet,Giuseppe Marra,Luc De Raedt
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We contribute a theoretical and operational framework for neurosymbolic AI called DeepLog. DeepLog introduces building blocks and primitives for neurosymbolic AI that make abstraction of commonly used representations and computational mechanisms used in neurosymbolic AI. DeepLog can represent and emulate a wide range of neurosymbolic systems. It consists of two key components. The first is the DeepLog language for specifying neurosymbolic models and inference tasks. This language consists of an annotated neural extension of grounded first-order logic, and makes abstraction of the type of logic, e.g. boolean, fuzzy or probabilistic, and whether logic is used in the architecture or in the loss function. The second DeepLog component is situated at the computational level and uses extended algebraic circuits as computational graphs. Together these two components are to be considered as a neurosymbolic abstract machine, with the DeepLog language as the intermediate level of abstraction and the circuits level as the computational one. DeepLog is implemented in software, relies on the latest insights in implementing algebraic circuits on GPUs, and is declarative in that it is easy to obtain different neurosymbolic models by making different choices for the underlying algebraic structures and logics. The generality and efficiency of the DeepLog neurosymbolic machine is demonstrated through an experimental comparison between 1) different fuzzy and probabilistic logics, 2) using logic in the architecture or in the loss function, and 3) a standalone CPU-based implementation of a neurosymbolic AI system and a DeepLog GPU-based one.
zh
[AI-32] Neuro-Symbolic Artificial Intelligence: Towards Improving the Reasoning Abilities of Large Language Models IJCAI2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理能力方面的局限性问题,这是实现人工通用智能(Artificial General Intelligence, AGI)的关键瓶颈。解决方案的核心在于采用神经符号(neuro-symbolic)方法,通过融合符号推理与神经网络的学习能力,提升LLMs的结构化推理性能。论文从三个维度系统梳理了相关技术:Symbolic-LLM(符号系统引导LLM)、LLM-Symbolic(LLM生成符号表示)、LLM+Symbolic(LLM与符号系统协同工作),揭示了神经符号范式在增强LLM推理能力中的潜力与挑战。
链接: https://arxiv.org/abs/2508.13678
作者: Xiao-Wen Yang,Jie-Jing Shao,Lan-Zhe Guo,Bo-Wen Zhang,Zhi Zhou,Lin-Han Jia,Wang-Zhou Dai,Yu-Feng Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, IJCAI 2025 Survey Track
Abstract:Large Language Models (LLMs) have shown promising results across various tasks, yet their reasoning capabilities remain a fundamental challenge. Developing AI systems with strong reasoning capabilities is regarded as a crucial milestone in the pursuit of Artificial General Intelligence (AGI) and has garnered considerable attention from both academia and industry. Various techniques have been explored to enhance the reasoning capabilities of LLMs, with neuro-symbolic approaches being a particularly promising way. This paper comprehensively reviews recent developments in neuro-symbolic approaches for enhancing LLM reasoning. We first present a formalization of reasoning tasks and give a brief introduction to the neurosymbolic learning paradigm. Then, we discuss neuro-symbolic methods for improving the reasoning capabilities of LLMs from three perspectives: Symbolic-LLM, LLM-Symbolic, and LLM+Symbolic. Finally, we discuss several key challenges and promising future directions. We have also released a GitHub repository including papers and resources related to this survey: this https URL.
zh
[AI-33] MHSNet: An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model
【速读】:该论文旨在解决第三方网站(如LinkedIn、Indeed)获取的简历因信息不完整或不准确而导致企业人才库质量下降的问题,核心挑战在于如何高效检测 fetched resumes 与企业已有简历之间的重复项,这受限于简历文本的语义复杂性、结构异构性和信息缺失。解决方案的关键是提出 MHSNet 框架,通过对比学习微调 BGE-M3 模型,并结合状态感知的混合专家(state-aware Mixture-of-Experts, MoE)机制,生成多层级稀疏与密集表示,从而计算多层级语义相似度,有效提升在不完整简历场景下的身份验证准确性。
链接: https://arxiv.org/abs/2508.13676
作者: Yu Li,Zulong Chen,Wenjian Xu,Hong Wen,Yipeng Yu,Man Lung Yiu,Yuyu Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:To maintain the company’s talent pool, recruiters need to continuously search for resumes from third-party websites (e.g., LinkedIn, Indeed). However, fetched resumes are often incomplete and inaccurate. To improve the quality of third-party resumes and enrich the company’s talent pool, it is essential to conduct duplication detection between the fetched resumes and those already in the company’s talent pool. Such duplication detection is challenging due to the semantic complexity, structural heterogeneity, and information incompleteness of resume texts. To this end, we propose MHSNet, a multi-level identity verification framework that fine-tunes BGE-M3 using contrastive learning. With the fine-tuned BGE-M3, the Mixture-of-Experts (MoE) generates multi-level sparse and dense representations for resumes, enabling the computation of corresponding multi-level semantic similarities. Moreover, the state-aware Mixture-of-Experts (MoE) is employed in MHSNet to handle diverse incomplete resumes. Experimental results verify the effectiveness of MHSNet.
zh
[AI-34] Knowledge Graph Completion for Action Prediction on Situational Graphs – A Case Study on Household Tasks
【速读】:该论文旨在解决情境知识图谱(situational knowledge graph)在家庭行为建模中的补全问题,尤其针对从视频中提取的信息通常不完整这一挑战。其核心贡献在于指出传统链接预测算法因无法适配情境知识图谱的特殊结构与语义特性而表现不佳,甚至难以超越简单基线模型;解决方案的关键在于识别并利用这些特殊性,从而设计更契合任务本质的建模方法,以提升知识图谱补全的效果和实用性。
链接: https://arxiv.org/abs/2508.13675
作者: Mariam Arustashvili,Jörg Deigmöller,Heiko Paulheim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at Semantics 2025
Abstract:Knowledge Graphs are used for various purposes, including business applications, biomedical analyses, or digital twins in industry 4.0. In this paper, we investigate knowledge graphs describing household actions, which are beneficial for controlling household robots and analyzing video footage. In the latter case, the information extracted from videos is notoriously incomplete, and completing the knowledge graph for enhancing the situational picture is essential. We show that, while this is a standard link prediction problem, situational knowledge graphs have special characteristics that render many link prediction algorithms unfit for the job and unable to outperform even simple baselines.
zh
[AI-35] Multi-Plasticity Synergy with Adaptive Mechanism Assignment for Training Spiking Neural Networks
【速读】:该论文旨在解决当前脉冲神经网络(Spiking Neural Networks, SNNs)在训练过程中依赖单一突触可塑性机制所导致的适应性差与表征能力受限的问题。其解决方案的关键在于提出一种生物启发式的训练框架,该框架整合了多种协同作用的突触可塑性机制,使不同学习算法能够共同调控信息积累过程,同时保持各自相对独立的更新动力学特性,从而显著提升SNN在静态图像和动态类脑数据集上的性能与鲁棒性。
链接: https://arxiv.org/abs/2508.13673
作者: Yuzhe Liu,Xin Deng,Qiang Yu
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Spiking Neural Networks (SNNs) are promising brain-inspired models known for low power consumption and superior potential for temporal processing, but identifying suitable learning mechanisms remains a challenge. Despite the presence of multiple coexisting learning strategies in the brain, current SNN training methods typically rely on a single form of synaptic plasticity, which limits their adaptability and representational capability. In this paper, we propose a biologically inspired training framework that incorporates multiple synergistic plasticity mechanisms for more effective SNN training. Our method enables diverse learning algorithms to cooperatively modulate the accumulation of information, while allowing each mechanism to preserve its own relatively independent update dynamics. We evaluated our approach on both static image and dynamic neuromorphic datasets to demonstrate that our framework significantly improves performance and robustness compared to conventional learning mechanism models. This work provides a general and extensible foundation for developing more powerful SNNs guided by multi-strategy brain-inspired learning.
zh
[AI-36] ITL-LIME: Instance-Based Transfer Learning for Enhancing Local Explanations in Low-Resource Data Settings CIKM2025
【速读】:该论文旨在解决传统局部可解释性方法(如LIME)在数据稀缺场景下因扰动和采样随机性导致的局部性不足与解释不稳定性问题,尤其是在生成不符合真实数据流形的不合理样本时,难以准确逼近原模型复杂决策边界。其解决方案的关键在于提出一种基于实例迁移学习的LIME框架(ITL-LIME),通过引入源域中相关的真实实例增强目标域的解释能力:首先利用聚类为源域构建具有代表性原型的簇,再根据目标实例与各簇原型的相似度检索最相关的源实例,并结合目标实例邻近的真实样本构成紧凑局部集合;随后设计基于对比学习的编码器作为加权机制,依据实例与目标实例的距离分配权重,最终使用加权后的源与目标实例训练代理模型以提升解释的保真度与稳定性。
链接: https://arxiv.org/abs/2508.13672
作者: Rehan Raza,Guanjin Wang,Kevin Wong,Hamid Laga,Marco Fisichella
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025)
Abstract:Explainable Artificial Intelligence (XAI) methods, such as Local Interpretable Model-Agnostic Explanations (LIME), have advanced the interpretability of black-box machine learning models by approximating their behavior locally using interpretable surrogate models. However, LIME’s inherent randomness in perturbation and sampling can lead to locality and instability issues, especially in scenarios with limited training data. In such cases, data scarcity can result in the generation of unrealistic variations and samples that deviate from the true data manifold. Consequently, the surrogate model may fail to accurately approximate the complex decision boundary of the original model. To address these challenges, we propose a novel Instance-based Transfer Learning LIME framework (ITL-LIME) that enhances explanation fidelity and stability in data-constrained environments. ITL-LIME introduces instance transfer learning into the LIME framework by leveraging relevant real instances from a related source domain to aid the explanation process in the target domain. Specifically, we employ clustering to partition the source domain into clusters with representative prototypes. Instead of generating random perturbations, our method retrieves pertinent real source instances from the source cluster whose prototype is most similar to the target instance. These are then combined with the target instance’s neighboring real instances. To define a compact locality, we further construct a contrastive learning-based encoder as a weighting mechanism to assign weights to the instances from the combined set based on their proximity to the target instance. Finally, these weighted source and target instances are used to train the surrogate model for explanation purposes.
zh
[AI-37] Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)中查询回答任务在面对软约束(soft constraints)时的不足问题。传统方法主要针对形式化为一阶逻辑(first-order logic)的查询,但在实际应用中,许多查询涉及模糊或上下文相关的约束,如对属性偏好或类别关联的主观判断,现有方法难以有效处理。为此,作者提出神经查询重排序器(Neural Query Reranker, NQR),其核心在于通过交互式机制,利用增量示例(偏好与非偏好实体)调整原始查询答案得分,从而在不破坏原有答案结构的前提下融入软约束,实现对复杂语义约束的建模与适应。
链接: https://arxiv.org/abs/2508.13663
作者: Daniel Daza,Alberto Bernardi,Luca Costabello,Christophe Gueret,Masoud Mansoury,Michael Cochez,Martijn Schut
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order-logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We propose a Neural Query Reranker (NQR) designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. NQR operates interactively, refining answers based on incremental examples of preferred and non-preferred entities. We extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that NQR can capture soft constraints while maintaining robust query answering performance.
zh
[AI-38] In-Context Decision Making for Optimizing Complex AutoML Pipelines
【速读】:该论文旨在解决现代机器学习(ML)工作流中日益复杂的管道选择与适应问题,即在预训练模型广泛应用的背景下,如何高效地从多样化且异构的ML管道(包括微调、集成等多种适配技术)中识别最优方案。传统组合算法选择与超参数优化(CASH)框架已难以应对此类复杂性。其解决方案的关键在于提出PS-PFN方法,通过将后验采样(Posterior Sampling, PS)扩展至最大k臂赌博机(max k-armed bandit)问题设置,并利用先验数据拟合网络(Prior-Data Fitted Networks, PFNs)借助上下文学习(in-context learning)高效估计最大值的后验分布,从而实现对不同ML管道的探索与利用平衡,同时支持考虑各臂(arm)的执行成本差异及为每条臂单独建模奖励分布。
链接: https://arxiv.org/abs/2508.13657
作者: Amir Rezaei Balef,Katharina Eggensperger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Combined Algorithm Selection and Hyperparameter Optimization (CASH) has been fundamental to traditional AutoML systems. However, with the advancements of pre-trained models, modern ML workflows go beyond hyperparameter optimization and often require fine-tuning, ensembling, and other adaptation techniques. While the core challenge of identifying the best-performing model for a downstream task remains, the increasing heterogeneity of ML pipelines demands novel AutoML approaches. This work extends the CASH framework to select and adapt modern ML pipelines. We propose PS-PFN to efficiently explore and exploit adapting ML pipelines by extending Posterior Sampling (PS) to the max k-armed bandit problem setup. PS-PFN leverages prior-data fitted networks (PFNs) to efficiently estimate the posterior distribution of the maximal value via in-context learning. We show how to extend this method to consider varying costs of pulling arms and to use different PFNs to model reward distributions individually per arm. Experimental results on one novel and two existing standard benchmark tasks demonstrate the superior performance of PS-PFN compared to other bandit and AutoML strategies. We make our code and data available at this https URL.
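下面用一个玩具例子示意 max k-armed bandit 的后验采样骨架(非论文实现;论文以 PFN 的上下文学习估计最大值的后验,此处用经验高斯加乐观项代替,各流水线分数均为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(0)
true_best = [0.70, 0.78, 0.74]           # 假设:各条“臂”(ML 流水线)可达到的最佳分数
obs = [[] for _ in true_best]            # 每条臂的历史评估结果

for t in range(60):
    samples = []
    for r in obs:
        if len(r) < 2:                   # 冷启动:每条臂先各评估两次
            samples.append(np.inf)
        else:
            mu, sd = np.mean(r), np.std(r) + 1e-3
            # 从近似后验中采样一个对“最大可达值”的乐观估计
            samples.append(rng.normal(mu, sd) + 2 * sd)
    arm = int(np.argmax(samples))        # 后验采样:拉取样本值最大的臂
    reward = rng.normal(true_best[arm] - 0.05, 0.03)   # 评估一次该流水线
    obs[arm].append(reward)

print([len(r) for r in obs])  # 评估预算应集中在最有潜力的流水线上
```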
zh
[AI-39] GRAFT: Gradient-Aware Fast MaxVol Technique for Dynamic Data Sampling
【速读】:该论文旨在解决现代神经网络在大规模数据集上训练时存在的计算成本高和环境负担重的问题(即训练过程中的能耗与碳排放问题)。其核心解决方案是提出一种名为GRAFT的可扩展训练中子集选择方法,关键在于:首先对每个训练批次提取低秩特征表示;其次利用快速最大体积采样(Fast MaxVol sampler)从中挑选出能覆盖批次主导子空间的小规模多样化样本子集;最后通过梯度近似准则动态调整子集大小。该方法在保持训练轨迹不变的前提下,显著降低训练时间和碳排放,同时在多个基准测试中实现精度与效率的最优平衡。
链接: https://arxiv.org/abs/2508.13653
作者: Ashish Jha,Anh huy Phan,Razan Dibo,Valentin Leplat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:Training modern neural networks on large datasets is computationally and environmentally costly. We introduce GRAFT, a scalable in-training subset selection method that (i) extracts a low-rank feature representation for each batch, (ii) applies a Fast MaxVol sampler to select a small, diverse subset that spans the batch’s dominant subspace, and (iii) dynamically adjusts the subset size using a gradient-approximation criterion. By operating in low-rank subspaces and training on carefully chosen examples instead of full batches, GRAFT preserves the training trajectory while reducing wall-clock time, energy consumption, and CO₂ emissions. Across multiple benchmarks, GRAFT matches or exceeds recent selection baselines in both accuracy and efficiency, providing a favorable trade-off between accuracy, efficiency, and emissions.
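以下为“先做低秩特征、再用 MaxVol 式行选取挑出张成主子空间的子集”的一个极简示意(非官方实现;此处用 SciPy 的列主元 QR 近似 Fast MaxVol,秩与子集大小均为笔者假设):

```python
import numpy as np
from scipy.linalg import qr

def select_subset(features: np.ndarray, rank: int, k: int) -> np.ndarray:
    """从一个 batch 的特征中选出 k 个能近似张成主子空间的样本下标。"""
    # 低秩表示:取样本在前 rank 个右奇异向量上的坐标
    U, S, Vt = np.linalg.svd(features, full_matrices=False)
    Z = features @ Vt[:rank].T            # (n, rank)
    # 对 Z^T 做列主元 QR:先被选中的主元列对应“体积贡献大”的样本行
    _, _, piv = qr(Z.T, pivoting=True)
    return piv[:k]

X = np.random.randn(256, 64)              # 模拟一个 batch 的特征
idx = select_subset(X, rank=8, k=16)
print(sorted(idx))                         # 被选中用于本步训练的样本下标
```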
zh
[AI-40] V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task
【速读】:该论文旨在解决GUI元素精确定位任务中传统方法存在的两大问题:一是忽略处理背景区域导致注意力漂移,二是统一标签无法区分目标UI元素的中心与边缘,从而影响点击精度。解决方案的关键在于提出一种受人类视觉认知启发的“谷到峰”(Valley-to-Peak, V2P)方法,其核心包括两个创新:首先引入抑制注意力机制(suppression attention mechanism),有效减少模型对无关背景区域的关注,增强对目标区域的聚焦;其次采用基于Fitts定律的二维高斯热图建模策略,通过以目标尺寸为参数的高斯权重分布,使模型更关注UI元素中心而非边缘,从而提升定位精度和交互准确性。
链接: https://arxiv.org/abs/2508.13634
作者: Jikai Chen,Long Chen,Dong Wang,Leilei Gan,Chenyi Zhuang,Jinjie Gu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) neglecting to handle background regions causes attention drift from the desired area, and (2) uniform labeling fails to distinguish between center and edges of the target UI element, leading to click imprecision. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model’s focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a Fitts’ Law-inspired approach by modeling GUI interactions as 2D Gaussian heatmaps where the weight gradually decreases from the center towards the edges. The weight distribution follows a Gaussian function, with the variance determined by the target’s size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained with V2P achieves 92.3% and 50.5% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively. Ablations further confirm each component’s contribution, highlighting V2P’s generalizability for precise GUI grounding tasks.
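下面的小例子示意“以目标尺寸定方差的二维高斯热图”标签(非官方实现;比例系数 k 为笔者假设),体现中心权重最高、向边缘递减的设计:

```python
import numpy as np

def gaussian_heatmap(H, W, cx, cy, w, h, k=0.25):
    """cx, cy 为目标中心;w, h 为目标宽高;标准差与目标尺寸成正比(k 为假设的比例系数)。"""
    ys, xs = np.mgrid[0:H, 0:W]
    sx, sy = k * w, k * h
    return np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) + (ys - cy) ** 2 / (2 * sy ** 2)))

hm = gaussian_heatmap(H=64, W=96, cx=48, cy=32, w=20, h=10)
print(hm.max(), hm[32, 48])  # 峰值 1.0 恰好位于目标中心,向边缘平滑衰减
```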
zh
[AI-41] Towards a Larger Model via One-Shot Federated Learning on Heterogeneous Client Models
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在移动网络场景下面临的三大挑战:资源异构性导致的客户端计算负担重、多轮通信带来的高延迟,以及统一模型架构限制了客户端模型定制化能力。针对这些问题,论文提出了一种名为FedOL的一次性(one-shot)联邦学习框架,其核心创新在于摒弃传统参数共享机制,转而采用知识蒸馏(knowledge distillation)策略,使客户端仅需向服务器传输对未标注公共数据集的预测输出(compact predictions),而非完整模型权重,从而显著降低通信开销并支持异构模型架构。为应对客户端因本地数据分布偏斜导致的预测偏差及公共数据缺乏真实标签的问题,FedOL设计了一个迭代优化的目标函数,用于动态校准伪标签(pseudo-labels)并同步更新服务器模型,提升知识迁移的可靠性与效果。
链接: https://arxiv.org/abs/2508.13625
作者: Wenxuan Ye,Xueli An,Onur Ayan,Junfan Wang,Xueqiang Yan,Georg Carle
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to Globecom 2025
Abstract:Large models, renowned for superior performance, outperform smaller ones even without billion-parameter scales. While mobile network servers have ample computational resources to support larger models than client devices, privacy constraints prevent clients from directly sharing their raw data. Federated Learning (FL) enables decentralized clients to collaboratively train a shared model by exchanging model parameters instead of transmitting raw data. Yet, it requires a uniform model architecture and multiple communication rounds, which neglect resource heterogeneity, impose heavy computational demands on clients, and increase communication overhead. To address these challenges, we propose FedOL, to construct a larger and more comprehensive server model in one-shot settings (i.e., in a single communication round). Instead of model parameter sharing, FedOL employs knowledge distillation, where clients only exchange model prediction outputs on an unlabeled public dataset. This reduces communication overhead by transmitting compact predictions instead of full model weights and enables model customization by allowing heterogeneous model architectures. A key challenge in this setting is that client predictions may be biased due to skewed local data distributions, and the lack of ground-truth labels in the public dataset further complicates reliable learning. To mitigate these issues, FedOL introduces a specialized objective function that iteratively refines pseudo-labels and the server model, improving learning reliability. To complement this, FedOL incorporates a tailored pseudo-label generation and knowledge distillation strategy that effectively integrates diverse knowledge. Simulation results show that FedOL significantly outperforms existing baselines, offering a cost-effective solution for mobile networks where clients possess valuable private data but limited computational resources.
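以下为一次性联邦蒸馏流程的极简骨架(非官方实现;此处仅用客户端预测的简单平均生成伪标签,而 FedOL 实际会迭代细化伪标签与服务器模型,数据均为模拟):

```python
import numpy as np

def aggregate_pseudo_labels(client_probs):
    """client_probs: 每个客户端在公共无标注集上的 (N, C) 预测概率。
    这里用简单平均聚合为软伪标签,仅作示意。"""
    return np.mean(np.stack(client_probs), axis=0)

N, C = 1000, 10
clients = [np.random.dirichlet(np.ones(C), size=N) for _ in range(5)]  # 模拟 5 个客户端的上传预测
pseudo = aggregate_pseudo_labels(clients)     # (N, C) 软伪标签
hard = pseudo.argmax(axis=1)
print(pseudo.shape, hard[:5])
# 随后:用 (公共数据, pseudo) 以交叉熵或 KL 蒸馏训练(可更大的)服务器模型,此处从略
```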
zh
[AI-42] Bounding Causal Effects and Counterfactuals
【速读】:该论文旨在解决因果推断中因强假设(如无未测量混杂或完美依从性)难以满足而导致的估计偏差问题,提出通过部分识别(partial identification)方法构建反映数据内在不确定性的因果效应边界,从而替代依赖不可验证假设的精确点估计。其解决方案的关键在于系统比较多种边界算法(包括符号法、基于优化的方法和信息论方法),并在此基础上统一实现与扩展,特别是对熵约束方法进行了改进,使其适用于反事实查询(如必要性和充分性概率 PNS),同时开发了一个开源 Python 工具包 CausalBoundingEngine,支持用户通过统一接口应用和比较不同方法,并结合决策树与机器学习模型辅助算法选择,提升实际可用性与效率。
链接: https://arxiv.org/abs/2508.13607
作者: Tobias Maringgele
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注: Bachelor’s thesis, Technical University of Munich, 2025. 102 pages, 20 figures
Abstract:Causal inference often hinges on strong assumptions - such as no unmeasured confounding or perfect compliance - that are rarely satisfied in practice. Partial identification offers a principled alternative: instead of relying on unverifiable assumptions to estimate causal effects precisely, it derives bounds that reflect the uncertainty inherent in the data. Despite its theoretical appeal, partial identification remains underutilized in applied work, in part due to the fragmented nature of existing methods and the lack of practical guidance. This thesis addresses these challenges by systematically comparing a diverse set of bounding algorithms across multiple causal scenarios. We implement, extend, and unify state-of-the-art methods - including symbolic, optimization-based, and information-theoretic approaches - within a common evaluation framework. In particular, we propose an extension of a recently introduced entropy-bounded method, making it applicable to counterfactual queries such as the Probability of Necessity and Sufficiency (PNS). Our empirical study spans thousands of randomized simulations involving both discrete and continuous data-generating processes. We assess each method in terms of bound tightness, computational efficiency, and robustness to assumption violations. To support practitioners, we distill our findings into a practical decision tree for algorithm selection and train a machine learning model to predict the best-performing method based on observable data characteristics. All implementations are released as part of an open-source Python package, CausalBoundingEngine, which enables users to apply and compare bounding methods through a unified interface.
zh
[AI-43] Toward Better EHR Reasoning in LLMs: Reinforcement Learning with Expert Attention Guidance
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在电子健康记录(Electronic Health Record, EHR)推理任务中表现不佳的问题,其核心挑战在于LLMs难以有效建模时序结构复杂且高维的EHR数据,而现有混合范式通常仅将LLMs作为静态知识检索器,未能提升其内在推理能力并继承了深度学习(Deep Learning, DL)模型的泛化局限性。解决方案的关键在于提出一种两阶段训练框架EAG-RL(Expert-guided Attention for Reinforcement Learning),首先利用专家EHR模型引导的蒙特卡洛树搜索构建高质量、分步的推理轨迹以初始化LLM策略,随后通过强化学习进一步优化策略,使LLM注意力机制对齐专家EHR模型识别出的临床显著特征,从而实现LLM内在EHR推理能力的增强。
链接: https://arxiv.org/abs/2508.13579
作者: Yue Fang,Yuxin Guo,Jiaran Gao,Hongxin Ding,Xinke Jiang,Weibin Liao,Yongxin Xu,Yinghao Zhu,Zhibang Yang,Liantao Ma,Junfeng Zhao,Yasha Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Improving large language models (LLMs) for electronic health record (EHR) reasoning is essential for enabling accurate and generalizable clinical predictions. While LLMs excel at medical text understanding, they underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data. Existing approaches often rely on hybrid paradigms, where LLMs serve merely as frozen prior retrievers while downstream deep learning (DL) models handle prediction, failing to improve the LLM’s intrinsic reasoning capacity and inheriting the generalization limitations of DL models. To this end, we propose EAG-RL, a novel two-stage training framework designed to intrinsically enhance LLMs’ EHR reasoning ability through expert attention guidance, where expert EHR models refer to task-specific DL models trained on EHR data. Concretely, EAG-RL first constructs high-quality, stepwise reasoning trajectories using expert-guided Monte Carlo Tree Search to effectively initialize the LLM’s policy. Then, EAG-RL further optimizes the policy via reinforcement learning by aligning the LLM’s attention with clinically salient features identified by expert EHR models. Extensive experiments on two real-world EHR datasets show that EAG-RL improves the intrinsic EHR reasoning ability of LLMs by an average of 14.62%, while also enhancing robustness to feature perturbations and generalization to unseen clinical domains. These results demonstrate the practical potential of EAG-RL for real-world deployment in clinical prediction tasks. Our code is available at this https URL.
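下面用几行代码示意“注意力对齐”信号的一种常见写法(非论文官方实现;特征数与两个分布均为随机假设):把 LLM 对各 EHR 特征的注意力分布与专家 EHR 模型给出的特征重要性做 KL 对齐,可作为强化学习阶段的对齐损失或奖励塑形项:

```python
import torch
import torch.nn.functional as F

n_feat = 16
llm_attn = torch.softmax(torch.randn(n_feat), dim=0)    # LLM 对各 EHR 特征的注意力分布(随机假设)
expert_imp = torch.softmax(torch.randn(n_feat), dim=0)  # 专家 EHR 模型的特征重要性分布(随机假设)
# KL(expert || llm):LLM 注意力越偏离专家重要性,损失越大
align_loss = F.kl_div(llm_attn.log(), expert_imp, reduction="sum")
print(float(align_loss))
```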
zh
[AI-44] Collapsing ROC approach for risk prediction research on both common and rare variants
【速读】:该论文旨在解决当前疾病风险预测模型在临床应用中准确性不足的问题,特别是由于现有基于常见遗传变异(common variants)的预测模型未能充分整合稀有变异(rare variants)信息所致。其关键解决方案是提出一种新的“合并受试者工作特征曲线”(Collapsing Receiver Operating Characteristic, CROC)方法,该方法在先前发展的前向受试者工作特征曲线(Forward ROC, FROC)基础上扩展了对稀有变异的处理机制,能够同时整合常见与稀有单核苷酸多态性(SNP)数据以提升预测性能。实证结果显示,CROC在包含全部SNP时AUC达0.605,优于仅使用常见变异的FROC(AUC=0.585),并在常见变异逐渐减少甚至仅剩稀有变异的情况下仍保持较高准确性(AUC=0.603 vs. FROC的0.524),证明其在全面风险预测中的优越性。
链接: https://arxiv.org/abs/2508.13552
作者: Changshuai Wei,Qing Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:Risk prediction that capitalizes on emerging genetic findings holds great promise for improving public health and clinical care. However, recent risk prediction research has shown that predictive tests formed on existing common genetic loci, including those from genome-wide association studies, have lacked sufficient accuracy for clinical use. Because most rare variants on the genome have not yet been studied for their role in risk prediction, future disease prediction discoveries should shift toward a more comprehensive risk prediction strategy that takes into account both common and rare variants. We are proposing a collapsing receiver operating characteristic (CROC) approach for risk prediction research on both common and rare variants. The new approach is an extension of a previously developed forward ROC (FROC) approach, with additional procedures for handling rare variants. The approach was evaluated through the use of 533 single-nucleotide polymorphisms (SNPs) in 37 candidate genes from the Genetic Analysis Workshop 17 mini-exome data set. We found that a prediction model built on all SNPs gained more accuracy (AUC = 0.605) than one built on common variants alone (AUC = 0.585). We further evaluated the performance of two approaches by gradually reducing the number of common variants in the analysis. We found that the CROC method attained more accuracy than the FROC method when the number of common variants in the data decreased. In an extreme scenario, when there are only rare variants in the data, the CROC reached an AUC value of 0.603, whereas the FROC had an AUC value of 0.524.
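以下示意“把稀有变异合并(collapse)为负担得分,再与常见变异一起建模并计算 AUC”的通用做法(非论文原始流程;数据为模拟,所有参数均为笔者假设):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, n_common, n_rare = 500, 20, 80
G_common = rng.binomial(2, 0.3, size=(n, n_common))    # 常见变异:高频基因型
G_rare = rng.binomial(2, 0.01, size=(n, n_rare))       # 稀有变异:低频基因型
burden = G_rare.sum(axis=1, keepdims=True)             # 稀有变异折叠为一个负担得分
X = np.hstack([G_common, burden])

# 模拟真实疾病风险:同时受常见变异与稀有变异负担影响
logit = 0.4 * G_common[:, 0] + 0.6 * burden[:, 0] - 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)
print("AUC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```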
zh
[AI-45] CrafterDojo: A Suite of Foundation Models for Building Open-Ended Embodied Agents in Crafter
【速读】:该论文旨在解决当前通用具身智能体(general-purpose embodied agents)研究中缺乏轻量、高效且具备丰富挑战性的实验环境的问题。尽管Minecraft提供了复杂性和大规模数据,但其运行速度慢和工程开销大限制了快速原型开发;而Crafter虽为轻量替代方案,却因缺少基础模型(foundation models)支持,仅适用于特定任务。解决方案的关键在于提出CrafterDojo——一个包含三个核心基础模型的工具套件:CrafterVPT(行为先验)、CrafterCLIP(视觉-语言对齐)和CrafterSteve-1(指令遵循),并配套生成行为与描述数据集(CrafterPlay 和 CrafterCaption)、参考代理实现、基准评估及开源代码库,从而将Crafter打造为类Minecraft、适合通用具身智能体研究的轻量化测试平台。
链接: https://arxiv.org/abs/2508.13530
作者: Junyeong Park,Hyeonseo Cho,Sungjin Ahn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Developing general-purpose embodied agents is a core challenge in AI. Minecraft provides rich complexity and internet-scale data, but its slow speed and engineering overhead make it unsuitable for rapid prototyping. Crafter offers a lightweight alternative that retains key challenges from Minecraft, yet its use has remained limited to narrow tasks due to the absence of foundation models that have driven progress in the Minecraft setting. In this paper, we present CrafterDojo, a suite of foundation models and tools that unlock the Crafter environment as a lightweight, prototyping-friendly, and Minecraft-like testbed for general-purpose embodied agent research. CrafterDojo addresses this by introducing CrafterVPT, CrafterCLIP, and CrafterSteve-1 for behavior priors, vision-language grounding, and instruction following, respectively. In addition, we provide toolkits for generating behavior and caption datasets (CrafterPlay and CrafterCaption), reference agent implementations, benchmark evaluations, and a complete open-source codebase.
zh
[AI-46] DDoS Attacks in Cloud Computing: Detection and Prevention
【速读】:该论文旨在解决当前日益复杂和频繁的分布式拒绝服务(DDoS)攻击难以有效检测与缓解的问题。其解决方案的关键在于系统性地分析不同类型的DDoS攻击(包括流量型、协议层和应用层攻击),并评估现有检测技术(如包过滤、入侵检测系统及基于机器学习的方法)与预防措施(如防火墙、速率限制、CPP和ELD机制)的有效性及其适用场景,从而为组织和个人提供可操作的防御策略与改进方向。
链接: https://arxiv.org/abs/2508.13522
作者: Zain Ahmad,Musab Ahmad,Bilal Ahmad
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:DDoS attacks are one of the most prevalent and harmful cybersecurity threats faced by organizations and individuals today. In recent years, the complexity and frequency of DDoS attacks have increased significantly, making it challenging to detect and mitigate them effectively. The study analyzes various types of DDoS attacks, including volumetric, protocol, and application layer attacks, and discusses the characteristics, impact, and potential targets of each type. It also examines the existing techniques used for DDoS attack detection, such as packet filtering, intrusion detection systems, and machine learning-based approaches, and their strengths and limitations. Moreover, the study explores the prevention techniques employed to mitigate DDoS attacks, such as firewalls, rate limiting, CPP, and ELD mechanisms. It evaluates the effectiveness of each approach and its suitability for different types of attacks and environments. In conclusion, this study provides a comprehensive overview of the different types of DDoS attacks, their detection, and prevention techniques. It aims to provide insights and guidelines for organizations and individuals to enhance their cybersecurity posture and protect against DDoS attacks.
zh
[AI-47] Heterogeneous Influence Maximization in User Recommendation CIKM2025
【速读】:该论文旨在解决用户推荐系统中两个关键问题:一是传统推荐方法未能充分挖掘候选用户的传播潜力(spread capability),二是影响力最大化(Influence-Maximization, IM)方法忽视了用户之间的互动意愿(interaction willingness)。为应对上述挑战,作者提出两种模型——HeteroIR 和 HeteroIM。其核心创新在于:HeteroIR 采用两阶段框架估算传播收益,从而释放推荐系统的扩散潜能;HeteroIM 则通过增量选择最具影响力的被邀请者,并基于包含邀请者与被邀请者的逆可达(reverse reachable, RR)集合数量进行重排序,有效融合传播覆盖与互动意愿,实现推荐任务与IM方法的有机衔接。实验表明,两者均显著优于现有基线方法(p < 0.05),并在腾讯在线游戏平台部署后分别带来8.5%和10%的性能提升。
链接: https://arxiv.org/abs/2508.13517
作者: Hongru Hou,Jiachen Sun,Wenqing Lin,Wendong Bi,Xiangrong Wang,Deqing Yang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: Accepted in CIKM 2025
Abstract:User recommendation systems enhance user engagement by encouraging users to act as inviters to interact with other users (invitees), potentially fostering information propagation. Conventional recommendation methods typically focus on modeling interaction willingness. Influence-Maximization (IM) methods focus on identifying a set of users to maximize the information propagation. However, existing methods face two significant challenges. First, recommendation methods fail to unleash the candidates’ spread capability. Second, IM methods fail to account for the willingness to interact. To solve these issues, we propose two models named HeteroIR and HeteroIM. HeteroIR provides an intuitive solution to unleash the dissemination potential of user recommendation systems. HeteroIM fills the gap between the IM method and the recommendation task, improving interaction willingness and maximizing spread coverage. The HeteroIR introduces a two-stage framework to estimate the spread profits. The HeteroIM incrementally selects the most influential invitee to recommend and rerank based on the number of reverse reachable (RR) sets containing inviters and invitees. RR set denotes a set of nodes that can reach a target via propagation. Extensive experiments show that HeteroIR and HeteroIM significantly outperform the state-of-the-art baselines with p-value < 0.05. Furthermore, we have deployed HeteroIR and HeteroIM in Tencent’s online gaming platforms and gained an 8.5% and 10% improvement in the online A/B test, respectively. Implementation codes are available at this https URL.
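下面给出基于 RR 集的贪心选点的极简实现示意(非官方代码;RR 集与候选集合均为玩具数据),对应 HeteroIM“增量选择出现在最多未覆盖 RR 集中的候选者”的思想:

```python
def greedy_by_rr_sets(rr_sets, candidates, k):
    """每轮选出覆盖最多“尚未被覆盖”的 RR 集的候选者(经典 RR 集贪心)。"""
    chosen, uncovered = [], list(range(len(rr_sets)))
    for _ in range(k):
        best, best_cov = None, -1
        for c in candidates - set(chosen):
            cov = sum(1 for i in uncovered if c in rr_sets[i])
            if cov > best_cov:
                best, best_cov = c, cov
        chosen.append(best)
        # 被新选节点覆盖的 RR 集不再参与后续计数
        uncovered = [i for i in uncovered if best not in rr_sets[i]]
    return chosen

rr = [{1, 2}, {2, 3}, {2}, {3, 4}, {4}]       # 玩具 RR 集:每个集合是能触达某目标的节点
print(greedy_by_rr_sets(rr, candidates={1, 2, 3, 4}, k=2))  # 例如 [2, 4]
```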
zh
[AI-48] LM Agents May Fail to Act on Their Own Risk Knowledge
【速读】:该论文试图解决语言模型(Language Model, LM)代理在安全关键场景中存在“风险认知与安全执行能力不匹配”的问题,即代理虽能识别潜在风险(如执行 sudo rm -rf /* 是危险的),但在实际行为轨迹中难以识别或规避此类风险,甚至会直接执行高危操作。解决方案的关键在于利用观察到的“生成器-验证器”式性能差距,设计一个独立的风险验证模块:该模块包含一个抽象器(abstractor),将具体的执行轨迹转化为抽象描述以增强大模型对风险的识别能力,并通过一个独立的验证器对代理提出的动作进行批判性评估,从而显著降低高危行为的发生率(相比基线模型减少55.3%)。
链接: https://arxiv.org/abs/2508.13465
作者: Yuzhi Tang,Tianxiao Li,Elizabeth Li,Chris J. Maddison,Honghua Dong,Yangjun Ruan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose a diverse array of potential, severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents’ risk awareness and safety execution abilities: while they often answer “Yes” to queries like “Is executing ‘sudo rm -rf /*’ dangerous?”, they will likely fail to identify such risks in instantiated trajectories or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework to examine agents’ safety across three progressive dimensions: 1) their knowledge about potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behaviors to avoid executing these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge (>98% pass rates), they fail to apply this knowledge when identifying risks in actual scenarios (with performance dropping by >23%) and often still execute risky actions (<26% pass rates). Notably, this trend persists across more capable LMs as well as in specialized reasoning models like DeepSeek-R1, indicating that simply scaling model capabilities or inference compute does not inherently resolve safety concerns. Instead, we take advantage of these observed gaps to develop a risk verifier that independently critiques the proposed actions by agents, with an abstractor that converts specific execution trajectories into abstract descriptions where LMs can more effectively identify the risks. Our overall system achieves a significant reduction of risky action execution by 55.3% over vanilla-prompted agents.
zh
[AI-49] Consumer Autonomy or Illusion? Rethinking Consumer Agency in the Age of Algorithms
【速读】:该论文旨在解决数字时代消费者在系统性障碍与算法操纵下消费自主性(consumer agency)被削弱的问题,揭示结构性、行为性和时间维度上的代理权限制如何导致理性个体仍面临早期财务崩溃的风险。其解决方案的关键在于将消费自主性视为一种需主动培育的价值,并通过形式化建模明确其与财务不稳定性之间的因果关联;在此基础上,提出系统性干预与消费者教育相结合的策略,以增强代理权并支持知情决策,从而实现对消费行为的规范性引导与长期财务福祉的保障。
链接: https://arxiv.org/abs/2508.13440
作者: Pegah Nokhiz,Aravinda Kanchana Ruwanpathirana
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted and appearing in Journal of Social Computing (JSC)
Abstract:Consumer agency in the digital age is increasingly constrained by systemic barriers and algorithmic manipulation, raising concerns about the authenticity of consumption choices. Nowadays, financial decisions are shaped by external pressures like obligatory consumption, algorithmic persuasion, and unstable work schedules that erode financial autonomy. Obligatory consumption (like hidden fees) is intensified by digital ecosystems. Algorithmic tactics like personalized recommendations lead to impulsive purchases. Unstable work schedules also undermine financial planning. Thus, it is important to study how these factors impact consumption agency. To do so, we examine formal models grounded in discounted consumption with constraints that bound agency. We construct analytical scenarios in which consumers face obligatory payments, algorithm-influenced impulsive expenses, or unpredictable income due to temporal instability. Using this framework, we demonstrate that even rational, utility-maximizing agents can experience early financial ruin when agency is limited across structural, behavioral, or temporal dimensions and how diminished autonomy impacts long-term financial well-being. Our central argument is that consumer agency must be treated as a value (not a given) requiring active cultivation, especially in digital ecosystems. The connection between our formal modeling and this argument allows us to indicate that limitations on agency (whether structural, behavioral, or temporal) can be rigorously linked to measurable risks like financial instability. This connection is also a basis for normative claims about consumption as a value, by anchoring them in a formally grounded analysis of consumer behavior. As solutions, we study systemic interventions and consumer education to support value deliberation and informed choices. We formally demonstrate how these measures strengthen agency.
zh
[AI-50] Discrete Optimization of Min-Max Violation and its Applications Across Computational Sciences
【速读】:该论文旨在解决一类具有最坏情况性能约束的离散优化问题,即离散最小最大违反(Discrete Min-Max Violation, DMMV)问题,其目标是寻找变量的离散取值分配以最小化最大约束违反度。此类问题广泛适用于对最差情形性能有严格要求的应用场景。解决方案的关键在于提出一种基于GPU加速的启发式算法,该算法充分利用了DMMV问题的数学特性,在保证解质量的同时显著提升求解效率。通过在语言模型后训练量化、离散层析成像和有限冲激响应(FIR)滤波器设计三个不同应用场景中的实证验证,表明该方法相较于现有技术在平均性能上提升14%、重建误差降低16%且计算加速6倍,并在FIR滤波器设计中实现接近50%的纹波减少,充分体现了DMMV作为上下文无关优化问题的研究价值及所提启发式算法的优越性。
链接: https://arxiv.org/abs/2508.13437
作者: Cheikh Ahmed,Mahdi Mostajabdaveh,Samin Aref,Zirui Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce the Discrete Min-Max Violation (DMMV) as a general optimization problem which seeks an assignment of discrete values to variables that minimizes the largest constraint violation. This context-free mathematical formulation is applicable to a wide range of use cases that have worst-case performance requirements. After defining the DMMV problem mathematically, we explore its properties to establish a foundational understanding. To tackle DMMV instance sizes of practical relevance, we develop a GPU-accelerated heuristic that takes advantage of the mathematical properties of DMMV for speeding up the solution process. We demonstrate the versatile applicability of our heuristic by solving three optimization problems as use cases: (1) post-training quantization of language models, (2) discrete tomography, and (3) Finite Impulse Response (FIR) filter design. In quantization without outlier separation, our heuristic achieves 14% improvement on average over existing methods. In discrete tomography, it reduces reconstruction error by 16% under uniform noise and accelerates computations by a factor of 6 on GPU. For FIR filter design, it nearly achieves 50% ripple reduction compared to using the commercial integer optimization solver, Gurobi. Our comparative results point to the benefits of studying DMMV as a context-free optimization problem and the advantages that our proposed heuristic offers on three distinct problems. Our GPU-accelerated heuristic will be made open-source to further stimulate research on DMMV and its other applications. The code is available at this https URL
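以下用暴力枚举示意 DMMV 的目标定义,即最小化最大约束违反(非论文的 GPU 启发式;约束与定义域均为笔者构造的玩具例子,仅演示问题形式):

```python
import itertools

def dmmv_bruteforce(domains, constraints):
    """domains: 各变量的离散取值域;constraints: 函数列表,
    每个函数返回该赋值下的违反量(<= 0 表示满足)。"""
    best_x, best_v = None, float("inf")
    for x in itertools.product(*domains):
        v = max(c(x) for c in constraints)        # 该赋值的最大违反
        if v < best_v:
            best_x, best_v = x, v
    return best_x, best_v

domains = [range(4), range(4)]
constraints = [
    lambda x: abs(x[0] + x[1] - 3) - 1,   # 希望 x0 + x1 接近 3(容差 1)
    lambda x: x[0] - x[1],                # 希望 x0 <= x1
]
print(dmmv_bruteforce(domains, constraints))  # 输出最小化最大违反的赋值及其违反值
```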
zh
[AI-51] Dynamic Design of Machine Learning Pipelines via Metalearning
【速读】:该论文旨在解决自动化机器学习(AutoML)系统在模型选择、超参数调优和特征工程过程中面临的高计算成本及搜索空间过大导致的过拟合问题。其解决方案的关键在于引入一种基于元学习(metalearning)的方法,通过利用历史元知识(metaknowledge)动态识别并聚焦于搜索空间中的高潜力区域,从而显著减少不必要的计算开销并提升优化效率。实验表明,该方法可在不显著牺牲预测性能的前提下,使随机搜索(Random Search)的运行时间减少89%,同时压缩预处理器和分类器的搜索空间分别达1.8/13和4.3/16,且在Auto-Sklearn中也展现出良好的适应性与效果。
链接: https://arxiv.org/abs/2508.13436
作者: Edesio Alcobaça,André C. P. L. F. de Carvalho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Automated machine learning (AutoML) has democratized the design of machine learning based systems, by automating model selection, hyperparameter tuning and feature engineering. However, the high computational cost associated with traditional search and optimization strategies, such as Random Search, Particle Swarm Optimization and Bayesian Optimization, remains a significant challenge. Moreover, AutoML systems typically explore a large search space, which can lead to overfitting. This paper introduces a metalearning method for dynamically designing search spaces for AutoML systems. The proposed method uses historical metaknowledge to select promising regions of the search space, accelerating the optimization process. According to experiments conducted for this study, the proposed method can reduce runtime by 89% in Random Search and shrink the search space (to 1.8/13 preprocessors and 4.3/16 classifiers) without significantly compromising predictive performance. Moreover, the proposed method showed competitive performance when adapted to Auto-Sklearn, reducing its search space. Furthermore, this study encompasses insights into meta-feature selection, meta-model explainability, and the trade-offs inherent in search space reduction strategies.
zh
[AI-52] SVDformer: Direction-Aware Spectral Graph Embedding Learning via SVD and Transformer
【速读】:该论文旨在解决现有方向图神经网络(Directed Graph Neural Networks, DGNNs)在联合捕捉边方向语义与全局结构模式方面的局限性,其根源在于各向同性聚合机制和局部滤波机制的不足。解决方案的关键在于提出SVDformer框架,该框架通过融合奇异值分解(Singular Value Decomposition, SVD)与Transformer架构,实现方向感知的图表示学习:首先利用多头自注意力机制对奇异值嵌入进行精炼,自适应增强关键频谱成分并抑制高频噪声,从而无需显式设计谱核即可实现可学习的低通/高通图滤波;其次将奇异向量作为方向投影基、奇异值作为缩放因子,借助Transformer建模输入/输出边模式间的多尺度交互,显式保留特征传播过程中的边方向信息。
链接: https://arxiv.org/abs/2508.13435
作者: Jiayu Fang,Zhiqi Shao,S T Boris Choy,Junbin Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Directed graphs are widely used to model asymmetric relationships in real-world systems. However, existing directed graph neural networks often struggle to jointly capture directional semantics and global structural patterns due to their isotropic aggregation mechanisms and localized filtering mechanisms. To address this limitation, this paper proposes SVDformer, a novel framework that synergizes SVD and Transformer architecture for direction-aware graph representation learning. SVDformer first refines singular value embeddings through multi-head self-attention, adaptively enhancing critical spectral components while suppressing high-frequency noise. This enables learnable low-pass/high-pass graph filtering without requiring spectral kernels. Furthermore, by treating singular vectors as directional projection bases and singular values as scaling factors, SVDformer uses the Transformer to model multi-scale interactions between incoming/outgoing edge patterns through attention weights, thereby explicitly preserving edge directionality during feature propagation. Extensive experiments on six directed graph benchmarks demonstrate that SVDformer consistently outperforms state-of-the-art GNNs and direction-aware baselines on node classification tasks, establishing a new paradigm for learning representations on directed graphs.
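下面是“对有向邻接矩阵做 SVD、用自注意力重标定奇异值、再重构传播算子”这一思路的极简示意(非官方实现;图与注意力头均为随机初始化,仅展示数据流):

```python
import torch

torch.manual_seed(0)
A = (torch.rand(8, 8) < 0.3).float()          # 一个随机有向图的邻接矩阵(玩具数据)
U, S, Vh = torch.linalg.svd(A)                # 奇异向量可视作方向投影基
s = S.view(1, -1, 1)                          # 奇异值序列作为 token:(batch, k, dim)
attn = torch.nn.MultiheadAttention(embed_dim=1, num_heads=1, batch_first=True)
s_refined, _ = attn(s, s, s)                  # 自注意力自适应重标定奇异值(可学习的谱滤波)
A_filtered = U @ torch.diag(s_refined.squeeze()) @ Vh   # 重构滤波后的传播算子
print(A_filtered.shape)                       # torch.Size([8, 8])
```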
zh
[AI-53] EventTSF: Event-Aware Non-Stationary Time Series Forecasting
[Quick Read]: This paper tackles the performance bottleneck in non-stationary time series forecasting caused by the lack of multimodal context, in particular how to incorporate external events described in natural language to model complex dynamics. The core challenges are: (1) fine-grained synchronization between time-varying discrete textual events and continuous time series; (2) the temporal uncertainty introduced by textual semantics; and (3) the misalignment between textual event embeddings and multi-resolution temporal patterns. The key to the solution is the event-aware non-stationary forecasting framework EventTSF, which uses autoregressive diffusion with flow matching to capture nuanced temporal-event interactions at each generation step, adaptively controls the flow-matching timesteps according to event semantic signals to handle event-induced uncertainty, and employs a multimodal U-shaped diffusion Transformer to fuse the temporal and textual modalities efficiently across resolutions.
Link: https://arxiv.org/abs/2508.13434
Authors: Yunfeng Ge, Ming Jin, Yiji Zhao, Hongyan Li, Bo Du, Chang Xu, Shirui Pan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 10 figures
Abstract:Time series forecasting plays a vital role in critical domains like energy and transportation, where non-stationary dynamics are deeply intertwined with events in other modalities such as texts. However, incorporating natural language-based external events to improve non-stationary forecasting remains largely unexplored, as most approaches still rely on a single modality, resulting in limited contextual knowledge and model underperformance. Enabling fine-grained multimodal interactions between temporal and textual data is challenged by three fundamental issues: (1) the difficulty of fine-grained synchronization between time-varying discrete textual events and continuous time series; (2) the inherent temporal uncertainty introduced by textual semantics; and (3) the misalignment between textual event embeddings and multi-resolution temporal patterns. In this work, we address these challenges by introducing event-aware non-stationary time series forecasting (EventTSF), an autoregressive generation framework that integrates historical time series with textual events to make subsequent forecasts. Specifically, EventTSF uses autoregressive diffusion with flow matching at each step to capture nuanced temporal-event interactions. To handle event-induced uncertainty, flow matching timesteps are adaptively controlled according to event semantic signals. The underlying denoiser employs a multimodal U-shaped diffusion transformer that efficiently fuses temporal and textual modalities across different resolutions. Extensive experiments on 8 synthetic and real-world datasets show that EventTSF outperforms 12 baselines across diverse event-aware non-stationary time series forecasting scenarios, achieving substantial improvements of 10.7% higher forecasting accuracy and 1.13x faster training efficiency.
[AI-54] STPFormer: A State-of-the-Art Pattern-Aware Spatio-Temporal Transformer for Traffic Forecasting
[Quick Read]: This paper addresses the challenges in traffic forecasting posed by complex temporal patterns, dynamic spatial structures, and diverse input formats, especially the rigid temporal encoding and weak spatio-temporal fusion of Transformer-based models. The key to the solution is STPFormer, a spatio-temporal pattern-aware Transformer that achieves unified and interpretable representation learning through four modules: a Temporal Position Aggregator (TPA) for pattern-aware temporal encoding, a Spatial Sequence Aggregator (SSA) for sequential spatial modeling, Spatial-Temporal Graph Matching (STGM) for cross-domain alignment, and an Attention Mixer for multi-scale feature fusion, which together significantly improve forecasting performance with good interpretability and generalization.
Link: https://arxiv.org/abs/2508.13433
Authors: Jiayu Fang, Zhiqi Shao, S T Boris Choy, Junbin Gao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Spatio-temporal traffic forecasting is challenging due to complex temporal patterns, dynamic spatial structures, and diverse input formats. Although Transformer-based models offer strong global modeling, they often struggle with rigid temporal encoding and weak space-time fusion. We propose STPFormer, a Spatio-Temporal Pattern-Aware Transformer that achieves state-of-the-art performance via unified and interpretable representation learning. It integrates four modules: Temporal Position Aggregator (TPA) for pattern-aware temporal encoding, Spatial Sequence Aggregator (SSA) for sequential spatial learning, Spatial-Temporal Graph Matching (STGM) for cross-domain alignment, and an Attention Mixer for multi-scale fusion. Experiments on five real-world datasets show that STPFormer consistently sets new SOTA results, with ablation and visualizations confirming its effectiveness and generalizability.
[AI-55] AdaptJobRec: Enhancing Conversational Career Recommendation through an LLM-Powered Agentic System
[Quick Read]: This paper addresses the response-latency problem of conversational job recommendation systems that adopt agentic designs, especially when handling complex queries. The key to the solution is the AdaptJobRec system, whose core innovation is a user-query complexity identification mechanism: for simple queries, the agent directly invokes the appropriate personalized recommendation tool for a rapid response; for complex queries, it filters relevant chat history through a memory processing module, decomposes the task with an intelligent task-decomposition planner, and finally performs multi-step reasoning and recommendation with personalized recommendation algorithm tools. This mechanism balances the trade-off between handling complex tasks and response latency, reducing average response latency by up to 53.3% in Walmart's real-world career recommendation scenarios while significantly improving recommendation accuracy.
Link: https://arxiv.org/abs/2508.13423
Authors: Qixin Wang, Dawei Wang, Kun Chen, Yaowei Hu, Puneet Girdhar, Ruoteng Wang, Aadesh Gupta, Chaitanya Devella, Wenlai Guo, Shangwen Huang, Bachir Aoun, Greg Hayworth, Han Li, Xintao Wu
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, recommendation systems have evolved from providing a single list of recommendations to offering a comprehensive suite of topic-focused services. To better accomplish this task, conversational recommendation systems (CRS) have progressed from basic retrieval-augmented LLM generation to agentic systems with advanced reasoning and self-correction capabilities. However, agentic systems come with notable response latency, a longstanding challenge for conversational recommendation systems. To balance the trade-off between handling complex queries and minimizing latency, we propose AdaptJobRec, the first conversational job recommendation system that leverages an autonomous agent to integrate personalized recommendation algorithm tools. The system employs a user query complexity identification mechanism to minimize response latency. For straightforward queries, the agent directly selects the appropriate tool for rapid responses. For complex queries, the agent uses the memory processing module to filter chat history for relevant content, then passes the results to the intelligent task decomposition planner, and finally executes the tasks using personalized recommendation tools. Evaluation on Walmart's real-world career recommendation scenarios demonstrates that AdaptJobRec reduces average response latency by up to 53.3% compared to competitive baselines, while significantly improving recommendation accuracy.
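The routing logic can be pictured with a small sketch; the names below (`Memory`, `Planner`, `classify_complexity`) are illustrative stand-ins, and the paper's actual complexity identifier is learned rather than rule-based:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Subtask:
    tool_name: str
    payload: str

class Memory:
    def __init__(self, history: List[str]): self.history = history
    def filter(self, query: str) -> List[str]:
        # Keep only history lines sharing words with the query (toy heuristic).
        return [h for h in self.history if any(w in h for w in query.split())]

class Planner:
    def decompose(self, query: str, context: List[str]) -> List[Subtask]:
        # Toy decomposition: split a compound request into tool calls.
        return [Subtask("recommend", part.strip()) for part in query.split(" and ")]

def classify_complexity(query: str) -> str:
    # Placeholder rule; the paper uses a learned identification mechanism.
    return "complex" if len(query.split()) > 12 or " and " in query else "simple"

def handle_query(query: str, tools: Dict[str, Callable], memory: Memory, planner: Planner):
    if classify_complexity(query) == "simple":
        return tools["recommend"](query)               # fast path, low latency
    relevant = memory.filter(query)                    # memory processing module
    subtasks = planner.decompose(query, relevant)      # task decomposition planner
    return [tools[t.tool_name](t.payload) for t in subtasks]

tools = {"recommend": lambda q: f"jobs matching '{q}'"}
print(handle_query("data scientist roles", tools, Memory([]), Planner()))
```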
[AI-56] Virtuous Machines: Towards Artificial General Science
[Quick Read]: This paper addresses two problems: current AI applications in scientific research remain confined to narrow domains and require substantial human oversight, while the explosion of scientific literature and increasing specialization make it hard for researchers to synthesize knowledge across disciplines and build unifying theories. The key to the solution is a domain-agnostic, agentic AI system that can autonomously execute the full research workflow from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and ran three psychology studies (on visual working memory, mental rotation, and imagery vividness), executed a new online data collection with 288 participants, built analysis pipelines through continuous coding sessions of 8+ hours, and produced complete manuscripts, demonstrating theoretical reasoning and methodological rigor approaching that of experienced researchers, though with remaining limitations in conceptual nuance and theoretical interpretation. This marks a step toward embodied AI that can test hypotheses through real-world experiments, promising to accelerate discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints would otherwise leave unexplored.
Link: https://arxiv.org/abs/2508.13421
Authors: Gabrielle Wehr, Reuben Rideaux, Amaya J. Fox, David R. Lightfoot, Jason Tangen, Jason B. Mattingley, Shane E. Ehrhardt
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:
Abstract:Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow domains requiring substantial human oversight. The exponential growth of scientific literature and increasing domain specialisation constrain researchers’ capacity to synthesise knowledge across disciplines and develop unifying theories, motivating exploration of more general-purpose AI systems for science. Here we show that a domain-agnostic, agentic AI system can independently navigate the scientific workflow - from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, executed one new online data collection with 288 participants, developed analysis pipelines through 8-hour+ continuous coding sessions, and produced completed manuscripts. The results demonstrate the capability of AI scientific discovery pipelines to conduct non-trivial research with theoretical reasoning and methodological rigour comparable to experienced researchers, though with limitations in conceptual nuance and theoretical interpretation. This is a step toward embodied AI that can test hypotheses through real-world experiments, accelerating discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints might otherwise leave unexplored. It raises important questions about the nature of scientific understanding and the attribution of scientific credit.
[AI-57] Semi-Supervised Anomaly Detection Pipeline for SOZ Localization Using Ictal-Related Chirp
[Quick Read]: This paper addresses the problem of quantitatively evaluating the spatial concordance between clinically defined seizure onset zones (SOZs) in epilepsy patients and statistically anomalous electrodes identified through time-frequency analysis. The key to the solution is a two-step quantitative framework: first, unsupervised anomalous-electrode detection using the Local Outlier Factor (LOF) with adaptive neighborhood selection, based on the spectro-temporal features of chirp events (onset frequency, offset frequency, and duration); second, spatial correlation analysis that combines exact co-occurrence metrics with weighted index similarity accounting for hemispheric congruence and electrode proximity. The study shows that the LOF-based approach (N_neighbors=20, contamination=0.2) effectively identifies anomalous channels, that proximity-weighted similarity outperforms exact matching, and that localization is most accurate in surgically successful cases (mean index precision 0.865), providing a quantitative complementary tool for SOZ localization.
Link: https://arxiv.org/abs/2508.13406
Authors: Nooshin Bahador, Milad Lankarany
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 7 figures
Abstract:This study presents a quantitative framework for evaluating the spatial concordance between clinically defined seizure onset zones (SOZs) and statistically anomalous channels identified through time-frequency analysis of chirp events. The proposed pipeline employs a two-step methodology: (1) Unsupervised Outlier Detection, where Local Outlier Factor (LOF) analysis with adaptive neighborhood selection identifies anomalous channels based on spectro-temporal features of chirp (onset frequency, offset frequency, and temporal duration); and (2) Spatial Correlation Analysis, which computes both exact co-occurrence metrics and weighted index similarity, incorporating hemispheric congruence and electrode proximity. Key findings demonstrate that the LOF-based approach (N_neighbors=20, contamination=0.2) effectively detects outliers, with index matching (weighted by channel proximity) outperforming exact matching in SOZ localization. Performance metrics (precision, recall, F1) were highest for seizure-free patients (Index Precision mean: 0.903) and those with successful surgical outcomes (Index Precision mean: 0.865), whereas failure cases exhibited lower concordance (Index Precision mean: 0.460). The key takeaway is that chirp-based outlier detection, combined with weighted spatial metrics, provides a complementary method for SOZ localization, particularly in patients with successful surgical outcomes.
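Step (1) maps directly onto scikit-learn's LocalOutlierFactor with the parameters quoted in the abstract; the toy features below are synthetic stand-ins for the chirp descriptors, so this is an illustration of the detection step rather than a reproduction of the study:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Rows = channels; columns = [onset_freq_Hz, offset_freq_Hz, duration_s]
features = rng.normal(loc=[20.0, 60.0, 1.5], scale=[3.0, 5.0, 0.3], size=(40, 3))
features[:5] += [15.0, 25.0, 1.0]    # five channels with atypical chirps

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.2)
labels = lof.fit_predict(features)   # -1 = anomalous channel, 1 = normal
anomalous_channels = np.where(labels == -1)[0]
print("Flagged channels:", anomalous_channels)
```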
[AI-58] SPANER: Shared Prompt Aligner for Multimodal Semantic Representation
[Quick Read]: This paper addresses the problem in multimodal parameter-efficient fine-tuning (PEFT) that modality-specific representations remain isolated and lack cross-modal generalization: existing methods improve downstream performance but neglect the structural alignment of the multimodal embedding space. The key to the solution is SPANER (Shared Prompt AligNER), a modality-agnostic PEFT framework whose core is a shared prompt that acts as a conceptual anchor, pulling semantically related instances from different modalities together in the embedding space and thus enforcing cross-modal semantic coherence. The design is inherently extensible: new modalities such as audio can be integrated without altering the core architecture, and experiments on vision-language and audio-visual benchmarks show competitive few-shot retrieval performance with high semantic fidelity.
Link: https://arxiv.org/abs/2508.13387
Authors: Thye Shan Ng, Caren Soyeon Han, Eun-Jung Holden
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in multimodal Parameter-Efficient Fine-Tuning (PEFT) have significantly improved performance on downstream tasks such as few-shot retrieval. However, most existing approaches focus on task-specific gains while neglecting the structure of the multimodal embedding space. As a result, modality-specific representations often remain isolated, limiting cross-modal generalisation. In this work, we introduce Shared Prompt AligNER (SPANER), a modality-agnostic PEFT framework designed to embed inputs from diverse modalities into a unified semantic space. At its core, SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. This shared prompt design is inherently extensible, supporting the seamless integration of additional modalities, such as audio, without altering the core architecture. Through comprehensive experiments across vision-language and audio-visual benchmarks, SPANER demonstrates competitive few-shot retrieval performance while preserving high semantic coherence in the learned embedding space. Our results highlight the importance of aligning embedding structures, rather than merely tuning adapter weights, for scalable multimodal learning.
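A minimal sketch of the shared-prompt idea, under our own architectural assumptions (the real framework operates on adapter outputs of pretrained encoders): a learnable prompt is prepended to token sequences from any modality and pooled into a common embedding space.

```python
import torch
import torch.nn as nn

class SharedPromptEncoder(nn.Module):
    def __init__(self, d=256, n_prompt=8, n_heads=4):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, d) * 0.02)  # shared anchor
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                 # tokens: (B, L, d), any modality
        B = tokens.size(0)
        p = self.prompt.unsqueeze(0).expand(B, -1, -1)
        h = self.encoder(torch.cat([p, tokens], dim=1))
        return h[:, : self.prompt.size(0)].mean(dim=1)  # prompt-pooled embedding

enc = SharedPromptEncoder()
img_tokens = torch.randn(2, 49, 256)   # e.g., patch features from a vision adapter
txt_tokens = torch.randn(2, 16, 256)   # e.g., token features from a text adapter
print(enc(img_tokens).shape, enc(txt_tokens).shape)  # same 256-d space
```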
[AI-59] LOOP: A Plug-and-Play Neuro-Symbolic Framework for Enhancing Planning in Autonomous Systems
[Quick Read]: This paper aims to fix the missing preconditions, inconsistent goals, and hallucinations in plans produced by current neural planning methods in complex domains, while overcoming the limited natural-language understanding and flexibility of classical symbolic planners. The key to the solution is LOOP, a neuro-symbolic planning framework that treats planning as an iterative conversation between neural and symbolic components rather than a one-shot translation. LOOP integrates 13 coordinated neural features (e.g., graph neural networks for spatial relationships, multi-agent validation for consensus-based correctness, and hierarchical decomposition for complex tasks), generates PDDL specifications and refines them iteratively based on symbolic feedback, and builds a causal knowledge base from execution traces, yielding reliable and interpretable planning. On six standard IPC benchmark domains LOOP reaches an 85.8% success rate, far above existing methods, supporting the conclusion that reliable planning comes not from choosing between neural networks and symbolic reasoners but from keeping them in continuous dialogue.
Link: https://arxiv.org/abs/2508.13371
Authors: Ronit Virwani, Ruchika Suryawanshi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to IAAI-26
Abstract:Planning is one of the most critical tasks in autonomous systems, where even a small error can lead to major failures or million-dollar losses. Current state-of-the-art neural planning approaches struggle with complex domains, producing plans with missing preconditions, inconsistent goals, and hallucinations. While classical planners provide logical guarantees, they lack the flexibility and natural language understanding capabilities needed for modern autonomous systems. Existing neuro-symbolic approaches use one-shot translation from natural language to formal plans, missing the opportunity for neural and symbolic components to work and refine solutions together. To address this gap, we develop LOOP – a novel neuro-symbolic planning framework that treats planning as an iterative conversation between neural and symbolic components rather than simple translation. LOOP integrates 13 coordinated neural features including graph neural networks for spatial relationships, multi-agent validation for consensus-based correctness, hierarchical decomposition for complex task management, and causal memory that learns from both successes and failures. Unlike existing approaches, LOOP generates PDDL specifications, refines them iteratively based on symbolic feedback, and builds a causal knowledge base from execution traces. LOOP was evaluated on six standard IPC benchmark domains, where it achieved 85.8% success rate compared to LLM+P (55.0%), LLM-as-Planner (19.2%), and Tree-of-Thoughts (3.3%). This work shows that the key to reliable planning is not in choosing between neural networks or symbolic reasoners but it lies in making them actually "talk" to each other during the entire process. LOOP provides a thorough blueprint for building autonomous systems that can finally be trusted with critical real-world applications.
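The iterative conversation between the neural and symbolic sides can be sketched as a generate-validate-refine loop; `ToyLLM` and `ToyValidator` below are stand-ins for the LLM and a VAL-style plan checker, not components of the released system:

```python
def neuro_symbolic_plan(task_description, llm, validator, max_rounds=5):
    """llm.generate_pddl(text, feedback) -> PDDL string;
    validator.check(pddl) -> (ok: bool, feedback: str)."""
    feedback = ""
    for _ in range(max_rounds):
        pddl = llm.generate_pddl(task_description, feedback)
        ok, feedback = validator.check(pddl)   # symbolic component replies
        if ok:
            return pddl                        # validated plan specification
    raise RuntimeError("No valid plan after refinement: " + feedback)

# Toy stand-ins so the sketch runs end to end.
class ToyLLM:
    def __init__(self): self.calls = 0
    def generate_pddl(self, text, feedback):
        self.calls += 1
        return "(define (problem p) (:goal (done)))" if self.calls > 1 else "(broken"

class ToyValidator:
    def check(self, pddl):
        ok = pddl.count("(") == pddl.count(")")
        return ok, "" if ok else "unbalanced parentheses near end of file"

print(neuro_symbolic_plan("stack blocks", ToyLLM(), ToyValidator()))
```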
[AI-60] Counterfactual Probabilistic Diffusion with Expert Models
[Quick Read]: This paper addresses counterfactual distribution prediction in complex dynamical systems, which is crucial for scientific modeling and decision-making in domains such as public health and medicine; existing methods mostly rely on point estimates or purely data-driven models that falter under data scarcity. The key to the solution is ODE-Diff, a time-series diffusion-based framework that extracts high-level signals from imperfect expert models to serve as structured priors guiding the generative process, thereby bridging mechanistic and data-driven approaches for more reliable and interpretable causal inference.
Link: https://arxiv.org/abs/2508.13355
Authors: Wenhao Mu, Zhi Cao, Mehmed Uludag, Alexander Rodríguez
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:
Abstract:Predicting counterfactual distributions in complex dynamical systems is essential for scientific modeling and decision-making in domains such as public health and medicine. However, existing methods often rely on point estimates or purely data-driven models, which tend to falter under data scarcity. We propose a time series diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals to serve as structured priors for generative modeling. Our method, ODE-Diff, bridges mechanistic and data-driven approaches, enabling more reliable and interpretable causal inference. We evaluate ODE-Diff across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies, demonstrating that it consistently outperforms strong baselines in both point prediction and distributional accuracy.
[AI-61] HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design
[Quick Read]: This paper addresses two problems of LLM-based automatic heuristic design (AHD) within evolutionary computation (EC) frameworks: static operators limit adaptivity, and the absence of knowledge-accumulation mechanisms prevents experience reuse. The key to the solution is the HiFo-Prompt framework, which guides the LLM with two synergistic prompting strategies, Foresight and Hindsight: foresight prompts adaptively steer the search according to population dynamics, balancing exploration and exploitation, while hindsight prompts distill successful heuristics from past generations into reusable design principles, forming a persistent knowledge base. Turning transient discoveries into lasting knowledge yields higher-quality heuristics, substantially faster convergence, and better query efficiency.
Link: https://arxiv.org/abs/2508.13333
Authors: Chentong Chen, Mengyuan Zhong, Jianyong Sun, Ye Fan, Jialong Shi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Comments: 9 pages, 6 figures
Abstract:LLM-based Automatic Heuristic Design (AHD) within Evolutionary Computation (EC) frameworks has shown promising results. However, its effectiveness is hindered by the use of static operators and the lack of knowledge accumulation mechanisms. We introduce HiFo-Prompt, a framework that guides LLMs with two synergistic prompting strategies: Foresight and Hindsight. Foresight-based prompts adaptively steer the search based on population dynamics, managing the exploration-exploitation trade-off. In addition, hindsight-based prompts mimic human expertise by distilling successful heuristics from past generations into fundamental, reusable design principles. This dual mechanism transforms transient discoveries into a persistent knowledge base, enabling the LLM to learn from its own experience. Empirical results demonstrate that HiFo-Prompt significantly outperforms state-of-the-art LLM-based AHD methods, generating higher-quality heuristics while achieving substantially faster convergence and superior query efficiency.
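One way to picture the dual prompting is as a template that combines population statistics (foresight) with distilled principles (hindsight); the template below is our illustration, not the authors' actual prompt:

```python
def build_hifo_prompt(task, population_stats, principles):
    # Foresight: steer exploration vs. exploitation from population dynamics.
    foresight = (
        f"Population diversity={population_stats['diversity']:.2f}, "
        f"best score={population_stats['best']:.3f}. "
        + ("Explore novel structures." if population_stats["diversity"] < 0.3
           else "Exploit and refine the best heuristics.")
    )
    # Hindsight: reusable design principles distilled from past generations.
    hindsight = "Reusable principles from past generations:\n" + "\n".join(
        f"- {p}" for p in principles
    )
    return f"{hindsight}\n\n{foresight}\n\nDesign a new heuristic for: {task}"

print(build_hifo_prompt(
    "online bin packing",
    {"diversity": 0.21, "best": 0.874},
    ["prefer tighter bins first", "penalize near-empty bins"],
))
```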
[AI-62] A Dual-Attention Graph Network for fMRI Data Classification
[Quick Read]: This paper addresses the limitation that current fMRI classification methods either rely on static functional connectivity or fail to capture time-varying interactions between brain regions comprehensively, limiting diagnostic accuracy for neurodevelopmental conditions such as autism spectrum disorder (ASD). The key to the solution is a framework combining dynamic graph construction with spatio-temporal attention: transformer-based attention dynamically infers functional connectivity within each time interval, letting the model focus on crucial brain regions and time segments, while a hierarchical fusion of graph convolutional networks (GCNs) and Transformers captures both localized spatial interactions and global temporal dependencies, improving the representation and classification of fMRI data.
Link: https://arxiv.org/abs/2508.13328
Authors: Amirali Arbab, Zeinab Davarani, Mehran Safayani
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Understanding the complex neural activity dynamics is crucial for the development of the field of neuroscience. Although current functional MRI classification approaches tend to be based on static functional connectivity or cannot capture spatio-temporal relationships comprehensively, we present a new framework that leverages dynamic graph creation and spatiotemporal attention mechanisms for Autism Spectrum Disorder (ASD) diagnosis. The approach used in this research dynamically infers functional brain connectivity in each time interval using transformer-based attention mechanisms, enabling the model to selectively focus on crucial brain regions and time segments. By constructing time-varying graphs that are then processed with Graph Convolutional Networks (GCNs) and transformers, our method successfully captures both localized interactions and global temporal dependencies. Evaluated on a subset of the ABIDE dataset, our model achieves 63.2% accuracy and 60.0% AUC, outperforming static graph-based approaches (e.g., GCN: 51.8%). This validates the efficacy of joint modeling of dynamic connectivity and spatio-temporal context for fMRI classification. The core novelty arises from (1) attention-driven dynamic graph creation that learns temporal brain region interactions and (2) hierarchical spatio-temporal feature fusion through GCN-Transformer fusion.
[AI-63] Towards Unified Multimodal Financial Forecasting: Integrating Sentiment Embeddings and Market Indicators via Cross-Modal Attention
[Quick Read]: This paper addresses the problem that traditional stock-movement prediction models rely only on numerical market indicators, leaving available information underexploited and limiting accuracy. The key to the solution is STONK (Stock Optimization using News Knowledge), which fuses numerical data with sentiment-enriched news embeddings via feature concatenation and cross-modal attention, forming a unified multimodal pipeline that integrates structured market data with unstructured news text to improve daily stock-movement prediction.
Link: https://arxiv.org/abs/2508.13327
Authors: Sarthak Khanna, Armin Berger, David Berghaus, Tobias Deusser, Lorenz Sparrenberg, Rafet Sifa
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE-DSAA 2025
Abstract:We propose STONK (Stock Optimization using News Knowledge), a multimodal framework integrating numerical market indicators with sentiment-enriched news embeddings to improve daily stock-movement prediction. By combining numerical and textual embeddings via feature concatenation and cross-modal attention, our unified pipeline addresses limitations of isolated analyses. Backtesting shows STONK outperforms numeric-only baselines. A comprehensive evaluation of fusion strategies and model configurations offers evidence-based guidance for scalable multimodal financial forecasting. Source code is available on GitHub
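A minimal sketch of the cross-modal fusion as we read it from the abstract (the layer sizes and single-query design are our assumptions, not the released code): numeric indicator features attend over news-sentiment embeddings before a concatenated prediction head.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_num=16, d_txt=64, d=64):
        super().__init__()
        self.num_proj = nn.Linear(d_num, d)
        self.txt_proj = nn.Linear(d_txt, d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, indicators, news_emb):
        q = self.num_proj(indicators).unsqueeze(1)     # (B, 1, d) query
        kv = self.txt_proj(news_emb)                   # (B, N_news, d)
        ctx, _ = self.attn(q, kv, kv)                  # cross-modal attention
        fused = torch.cat([q.squeeze(1), ctx.squeeze(1)], dim=-1)
        return torch.sigmoid(self.head(fused))         # P(price moves up)

model = CrossModalFusion()
p_up = model(torch.randn(8, 16), torch.randn(8, 5, 64))  # 8 days, 5 news items
print(p_up.shape)  # torch.Size([8, 1])
```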
[AI-64] Diff-MSM: Differentiable MusculoSkeletal Model for Simultaneous Identification of Human Muscle and Bone Parameters
[Quick Read]: This paper addresses the difficulty of accurately identifying muscle parameters (e.g., Hill-type muscle model parameters) and bone dynamic parameters in personalized human musculoskeletal models, a task limited chiefly by the impossibility of directly measuring internal biomechanical variables in vivo, especially joint torques. The key to the solution is the Differentiable MusculoSkeletal Model (Diff-MSM), which uses end-to-end automatic differentiation to differentiate from measurable muscle activation, through the joint torques, to the observable resulting motion, thereby identifying muscle and bone parameters simultaneously without measuring internal joint torques. Extensive simulations show the method significantly outperforms state-of-the-art baselines, achieving highly accurate muscle parameter estimates (average percentage error as low as 0.05%) even when initial guesses deviate substantially from the ground truth.
Link: https://arxiv.org/abs/2508.13303
Authors: Yingfan Zhou, Philip Sanderink, Sigurd Jager Lemming, Cheng Fang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures
Abstract:High-fidelity personalized human musculoskeletal models are crucial for simulating realistic behavior of physically coupled human-robot interactive systems and verifying their safety-critical applications in simulations before actual deployment, such as human-robot co-transportation and rehabilitation through robotic exoskeletons. Identifying subject-specific Hill-type muscle model parameters and bone dynamic parameters is essential for a personalized musculoskeletal model, but very challenging due to the difficulty of measuring the internal biomechanical variables in vivo directly, especially the joint torques. In this paper, we propose using a Differentiable MusculoSkeletal Model (Diff-MSM) to simultaneously identify its muscle and bone parameters with an end-to-end automatic differentiation technique, differentiating from the measurable muscle activation, through the joint torque, to the resulting observable motion, without the need to measure the internal joint torques. Through extensive comparative simulations, the results showed that our proposed method significantly outperformed the state-of-the-art baseline methods, especially in accurately estimating the muscle parameters (e.g., with initial guesses sampled from a normal distribution whose mean is the ground truth and whose standard deviation is 10% of the ground truth, the average percentage error of the estimated values was as low as 0.05%). In addition to human musculoskeletal modeling and simulation, the new parameter identification technique with the Diff-MSM has great potential to enable new applications in muscle health monitoring, rehabilitation, and sports science.
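The identification-by-differentiation principle can be shown on a toy one-joint, one-muscle system (a deliberate simplification of the paper's Hill-type model; all constants here are made up): gradients flow from measurable activations, through the unmeasured joint torque, to the observed motion.

```python
import torch

def simulate(acts, f_max, inertia, dt=0.01, arm=0.002, grav=0.5):
    theta, omega, traj = torch.zeros(()), torch.zeros(()), []
    for a in acts:
        torque = a * f_max * arm                      # internal, never measured
        omega = omega + ((torque - grav * torch.sin(theta)) / inertia) * dt
        theta = theta + omega * dt
        traj.append(theta)
    return torch.stack(traj)                          # observable joint angles

torch.manual_seed(0)
acts = torch.rand(200) - 0.5                          # measurable activations
observed = simulate(acts, torch.tensor(900.0), torch.tensor(0.05))

p = torch.log(torch.tensor([700.0, 0.08])).clone().requires_grad_(True)
opt = torch.optim.Adam([p], lr=0.02)                  # log-space for stability
for _ in range(1500):
    opt.zero_grad()
    f_max, inertia = torch.exp(p[0]), torch.exp(p[1])
    loss = ((simulate(acts, f_max, inertia) - observed) ** 2).mean()
    loss.backward()
    opt.step()
print(torch.exp(p).detach())  # estimates should move toward [900.0, 0.05]
```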
[AI-65] Hierarchical Conformal Classification
[Quick Read]: This paper addresses the problem that standard conformal prediction (CP) for classification treats classes as flat and unstructured, ignoring domain knowledge such as semantic relationships and hierarchical structure among class labels. The key to the solution is hierarchical conformal classification (HCC), which incorporates the class hierarchy into prediction-set construction: the task is formalized as a constrained optimization problem whose solutions yield prediction sets composed of nodes at different levels of the hierarchy while preserving finite-sample coverage guarantees. The paper further proves that a much smaller, well-structured subset of candidate solutions suffices for both optimality and coverage, taming the combinatorial complexity of the problem.
Link: https://arxiv.org/abs/2508.13288
Authors: Floris den Hengst, Inès Blin, Majid Mohammadi, Syed Ihtesham Hussain Shah, Taraneh Younesian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Conformal prediction (CP) is a powerful framework for quantifying uncertainty in machine learning models, offering reliable predictions with finite-sample coverage guarantees. When applied to classification, CP produces a prediction set of possible labels that is guaranteed to contain the true label with high probability, regardless of the underlying classifier. However, standard CP treats classes as flat and unstructured, ignoring domain knowledge such as semantic relationships or hierarchical structure among class labels. This paper presents hierarchical conformal classification (HCC), an extension of CP that incorporates class hierarchies into both the structure and semantics of prediction sets. We formulate HCC as a constrained optimization problem whose solutions yield prediction sets composed of nodes at different levels of the hierarchy, while maintaining coverage guarantees. To address the combinatorial nature of the problem, we formally show that a much smaller, well-structured subset of candidate solutions suffices to ensure coverage while upholding optimality. An empirical evaluation on three new benchmarks consisting of audio, image, and text data highlights the advantages of our approach, and a user study shows that annotators significantly prefer hierarchical over flat prediction sets.
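For orientation, here is the flat split-conformal backbone that HCC generalizes (a standard construction, not the paper's hierarchical algorithm); HCC would replace the per-class set with a search over hierarchy nodes under the same coverage bound:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score = 1 - model probability of the true class on a calibration set."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, q):
    # Keep every class whose nonconformity score stays below the threshold.
    return [np.where(1.0 - p <= q)[0].tolist() for p in test_probs]

rng = np.random.default_rng(1)
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = rng.integers(0, 5, size=200)
q = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(rng.dirichlet(np.ones(5), size=3), q))
```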
[AI-66] ViTAD: Timing Violation-Aware Debugging of RTL Code using Large Language Models
[Quick Read]: This paper addresses timing optimization at the register-transfer level (RTL) stage of modern VLSI design flows, i.e., how to identify and repair timing violations efficiently and automatically, since even minor violations can cause functional failures or system crashes and traditional manual analysis of timing reports followed by iterative debugging is slow and error-prone. The key to the solution is ViTAD: it first parses Verilog code and timing reports to build a Signal Timing Dependency Graph (STDG), performs violation-path analysis on this graph and uses large language models (LLMs) to infer the root causes of violations, and then selectively retrieves relevant debugging knowledge from a domain-specific knowledge base to generate targeted repair strategies, automating the pipeline from violation localization to intelligent repair. Experiments on a dataset built from real-world open-source projects show a 73.68% repair success rate, 19.30% higher than an LLM-only baseline.
Link: https://arxiv.org/abs/2508.13257
Authors: Wenhao Lv, Yingjie Xia, Xiyuan Chen, Li Kuang
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:In modern Very Large Scale Integrated (VLSI) circuit design flow, the Register-Transfer Level (RTL) stage presents a critical opportunity for timing optimization. Addressing timing violations at this early stage is essential, as modern systems demand higher speeds, where even minor timing violations can lead to functional failures or system crashes. However, traditional timing optimization heavily relies on manual expertise, requiring engineers to iteratively analyze timing reports and debug. To automate this process, this paper proposes ViTAD, a method that efficiently analyzes the root causes of timing violations and dynamically generates targeted repair strategies. Specifically, we first parse Verilog code and timing reports to construct a Signal Timing Dependency Graph (STDG). Based on the STDG, we perform violation path analysis and use large language models (LLMs) to infer the root causes of violations. Finally, by analyzing the causes of violations, we selectively retrieve relevant debugging knowledge from a domain-specific knowledge base to generate customized repair solutions. To evaluate the effectiveness of our method, we construct a timing violation dataset based on real-world open-source projects. This dataset contains 54 cases of violations. Experimental results show that our method achieves a 73.68% success rate in repairing timing violations, while the baseline using only LLM is 54.38%. Our method improves the success rate by 19.30%.
[AI-67] CardAIc-Agents: A Multimodal Framework with Hierarchical Adaptation for Cardiac Care Support
[Quick Read]: This paper addresses four limitations of current AI in clinical cardiovascular disease (CVD) care: 1) prompt-based clinical role assignment relies on intrinsic model capabilities without domain-specific tool support; 2) rigid sequential workflows cannot provide the adaptive reasoning and personalized care pathways clinical practice requires; 3) static knowledge bases lack continuous learning; and 4) inputs are fixed to one or two modalities with no on-demand visual output. The key to the solution is the multimodal CardAIc-Agents framework: 1) a CardiacRAG agent generates general plans from an updatable cardiology knowledge base; 2) a chief agent integrates external tools to execute those plans and deliver decisions autonomously; 3) a stepwise update strategy dynamically refines plans from preceding execution results once a task is judged complex; and 4) a multidisciplinary discussion tool helps interpret challenging cases, with visual review panels supporting clinician validation. This design markedly improves adaptability, accuracy, and interpretability in complex cardiac care.
Link: https://arxiv.org/abs/2508.13256
Authors: Yuting Zhang, Karina V. Bunting, Asgher Champsi, Xiaoxia Wang, Wenqi Lu, Alexander Thorley, Sandeep S Hothi, Zhaowen Qiu, Dipak Kotecha, Jinming Duan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:
Abstract:Cardiovascular diseases (CVDs) remain the foremost cause of mortality worldwide, a burden worsened by a severe deficit of healthcare workers. Artificial intelligence (AI) agents have shown potential to alleviate this gap via automated early detection and proactive screening, yet their clinical application remains limited by: 1) prompt-based clinical role assignment that relies on intrinsic model capabilities without domain-specific tool support; or 2) rigid sequential workflows, whereas clinical care often requires adaptive reasoning that orders specific tests and, based on their results, guides personalised next steps; 3) general and static knowledge bases without continuous learning capability; and 4) fixed unimodal or bimodal inputs and lack of on-demand visual outputs when further clarification is needed. In response, a multimodal framework, CardAIc-Agents, was proposed to augment models with external tools and adaptively support diverse cardiac tasks. Specifically, a CardiacRAG agent generated general plans from updatable cardiac knowledge, while the chief agent integrated tools to autonomously execute these plans and deliver decisions. To enable adaptive and case-specific customization, a stepwise update strategy was proposed to dynamically refine plans based on preceding execution results, once the task was assessed as complex. In addition, a multidisciplinary discussion tool was introduced to interpret challenging cases, thereby supporting further adaptation. When clinicians raised concerns, visual review panels were provided to assist final validation. Experiments across three datasets showed the efficiency of CardAIc-Agents compared to mainstream Vision-Language Models (VLMs), state-of-the-art agentic systems, and fine-tuned VLMs.
[AI-68] "DIVE" into Hydrogen Storage Materials Discovery with AI Agents
[Quick Read]: This paper addresses the problem that large amounts of unstructured figure and table data in the scientific literature are difficult for AI to exploit, hindering large language model (LLM)-based agents for automated materials design; the core challenge is accurately extracting experimental data from unstructured graphical elements. The key to the solution is the Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow, which systematically reads and structures graphical data from scientific literature, improving extraction accuracy and coverage by 10-15% over commercial multimodal models and by more than 30% over open-source models. Building on a curated database of over 30,000 entries from 4,000 publications, the authors establish a rapid inverse-design workflow that identifies previously unreported solid-state hydrogen storage compositions within two minutes, offering a transferable paradigm for AI-driven materials discovery.
Link: https://arxiv.org/abs/2508.13251
Authors: Di Zhang, Xue Jia, Tran Ba Hung, Seong Hoon Jang, Linda Zhang, Ryuhei Sato, Yusuke Hashimoto, Toyoto Sato, Kiyoe Konno, Shin-ichi Orimo, Hao Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Comments: 23 pages, 5 figures. The supplementary video is available at the GitHub link provided in the manuscript
Abstract:Data-driven artificial intelligence (AI) approaches are fundamentally transforming the discovery of new materials. Despite the unprecedented availability of materials data in the scientific literature, much of this information remains trapped in unstructured figures and tables, hindering the construction of large language model (LLM)-based AI agents for automated materials design. Here, we present the Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow, which systematically reads and organizes experimental data from graphical elements in the scientific literature. We focus on solid-state hydrogen storage materials - a class of materials central to future clean-energy technologies - and demonstrate that DIVE markedly improves the accuracy and coverage of data extraction compared to the direct extraction by multimodal models, with gains of 10-15% over commercial models and over 30% relative to open-source models. Building on a curated database of over 30,000 entries from 4,000 publications, we establish a rapid inverse design workflow capable of identifying previously unreported hydrogen storage compositions in two minutes. The proposed AI workflow and agent design are broadly transferable across diverse materials, providing a paradigm for AI-driven materials discovery.
[AI-69] Goal-Directedness is in the Eye of the Beholder
[Quick Read]: This paper asks how the goal-directedness of complex agents can be measured objectively. Existing approaches come in two flavors: behavioral, which infers goals from observed behavior, and mechanistic, which probes for goals in internal model states. The paper argues that both routes face technical and conceptual difficulties, especially fundamental challenges in formalizing goals as properties of agent systems, and arrives at the perhaps surprising position that goal-directedness cannot be measured objectively. The key move is to reframe goal-directedness as an emergent property of dynamic multi-agent systems rather than a static, quantifiable attribute of a single agent.
Link: https://arxiv.org/abs/2508.13247
Authors: Nina Rajcic, Anders Søgaard
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Submitted to Conference and Workshop on Neural Information Processing Systems 2025
Abstract:Our ability to predict the behavior of complex agents turns on the attribution of goals. Probing for goal-directed behavior comes in two flavors: Behavioral and mechanistic. The former proposes that goal-directedness can be estimated through behavioral observation, whereas the latter attempts to probe for goals in internal model states. We work through the assumptions behind both approaches, identifying technical and conceptual problems that arise from formalizing goals in agent systems. We arrive at the perhaps surprising position that goal-directedness cannot be measured objectively. We outline new directions for modeling goal-directedness as an emergent property of dynamic, multi-agent systems.
[AI-70] Involuntary Jailbreak
[Quick Read]: This paper discloses a new, stealthy, and broadly damaging vulnerability in large language models (LLMs), termed "involuntary jailbreak". Unlike conventional attacks targeting specific harmful content, this weakness involves no explicit malicious objective; instead it exploits a structural fragility of the model's overall guardrail, causing the model to abandon its safety constraints unwittingly. The key finding is that a single universal prompt, instructing the model to generate questions it would normally reject together with in-depth answers (rather than refusals), consistently breaks the defenses of most leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT-4.1.
Link: https://arxiv.org/abs/2508.13246
Authors: Yangyang Guo, Yangyan Li, Mohan Kankanhalli
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: We plan to temporarily restrict access to the github code due to potential risks of malicious use. But in the meantime, you can try using the prompt, provided it hasn't been banned
Abstract:In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in future.
[AI-71] Quantifying Loss Aversion in Cyber Adversaries via LLM Analysis
[Quick Read]: This paper addresses how to understand and quantify human cognitive biases from empirical data, particularly in cybersecurity, where traditional defenses focus on static fortification and struggle to interpret attacks dynamically as they unfold; the lack of real-time identification and modeling of attacker cognitive traits (such as loss aversion) limits proactive and predictive defense. The key to the solution is using large language models (LLMs) to perform structured analysis of action logs generated by hackers in a controlled network experiment: behavioral segments are linked to predefined persistence mechanisms and further mapped to operational triggers, yielding a quantifiable loss-aversion trait. This enables fine-grained analysis of hacker decision-making and points toward real-time, behavior-based threat awareness and adaptive defense.
Link: https://arxiv.org/abs/2508.13240
Authors: Soham Hans, Nikolos Gurney, Stacy Marsella, Sofia Hirschmann
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Understanding and quantifying human cognitive biases from empirical data has long posed a formidable challenge, particularly in cybersecurity, where defending against unknown adversaries is paramount. Traditional cyber defense strategies have largely focused on fortification, while some approaches attempt to anticipate attacker strategies by mapping them to cognitive vulnerabilities, yet they fall short in dynamically interpreting attacks in progress. In recognition of this gap, IARPA's ReSCIND program seeks to infer, defend against, and even exploit attacker cognitive traits. In this paper, we present a novel methodology that leverages large language models (LLMs) to extract quantifiable insights into the cognitive bias of loss aversion from hacker behavior. Our data are collected from an experiment in which hackers were recruited to attack a controlled demonstration network. We process the hacker-generated notes with LLMs, segmenting the various actions and correlating them with predefined persistence mechanisms used by hackers. By correlating the implementation of these mechanisms with various operational triggers, our analysis provides new insights into how loss aversion manifests in hacker decision-making. The results demonstrate that LLMs can effectively dissect and interpret nuanced behavioral patterns, thereby offering a transformative approach to enhancing cyber defense strategies through real-time, behavior-based analysis.
[AI-72] The Role of AI in Facilitating Interdisciplinary Collaboration: Evidence from AlphaFold
[Quick Read]: This paper examines whether AI technologies foster interdisciplinary collaboration, and through what mechanisms and to what extent. The key is an empirical analysis of AlphaFold's impact on structural biology: using Scopus data on 1,247 AlphaFold-related papers and 7,700 authors, the authors combine bibliometric analysis with causal inference to compare interdisciplinary collaboration patterns between AlphaFold adopters and non-adopters. Contrary to the common belief that AI drives interdisciplinary convergence, AlphaFold increased structural biology-computer science collaborations by only 0.48%, with no measurable effect on other disciplines, suggesting that AI alone has limited power to bridge disciplinary divides and that its collaborative effects are constrained by technical characteristics, technological democratization, and other factors.
Link: https://arxiv.org/abs/2508.13234
Authors: Naixuan Zhao, Chunli Wei, Xinyan Zhang, Jiang Li
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 25 pages, 2 figures
Abstract:The acceleration of artificial intelligence (AI) in science is recognized and many scholars have begun to explore its role in interdisciplinary collaboration. However, the mechanisms and extent of this impact are still unclear. This study, using AlphaFold’s impact on structural biologists, examines how AI technologies influence interdisciplinary collaborative patterns. By analyzing 1,247 AlphaFold-related papers and 7,700 authors from Scopus, we employ bibliometric analysis and causal inference to compare interdisciplinary collaboration between AlphaFold adopters and non-adopters. Contrary to the widespread belief that AI facilitates interdisciplinary collaboration, our findings show that AlphaFold increased structural biology-computer science collaborations by just 0.48%, with no measurable effect on other disciplines. Specifically, AI creates interdisciplinary collaboration demands with specific disciplines due to its technical characteristics, but this demand is weakened by technological democratization and other factors. These findings demonstrate that artificial intelligence (AI) alone has limited efficacy in bridging disciplinary divides or fostering meaningful interdisciplinary collaboration.
[AI-73] Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
[Quick Read]: This paper addresses the memory-bandwidth bottleneck in large language model (LLM) inference caused by frequent key-value (KV) cache accesses, and in particular how to dynamically place the KV cache across heterogeneous memory systems (e.g., high-bandwidth memory (HBM) alongside high-speed off-package DRAM) to maximize aggregated bandwidth utilization under capacity constraints. The key is to formulate KV cache placement as a mathematical optimization problem and derive a theoretical upper bound on performance, revealing substantial headroom for runtime optimization rather than proposing a specific scheduling policy; this constitutes the first rigorous theoretical treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
Link: https://arxiv.org/abs/2508.13231
Authors: Yunhua Fang, Rui Xie, Asad Ul Haq, Linsen Ma, Kaoutar El Maghraoui, Naigang Wang, Meng Wang, Liu Liu, Tong Zhang
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:
Abstract:Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
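One plausible way to write such a placement problem, purely as our illustration (the paper's exact formulation may differ), is a fractional-knapsack-style program where x_i is the fraction of KV block i held in HBM, s_i its size, f_i its access frequency, B_H and B_D the HBM and DRAM bandwidths, and C the HBM capacity:

```latex
% Illustrative formulation, not the paper's: maximize frequency-weighted
% effective bandwidth subject to the HBM capacity constraint.
\begin{align*}
\max_{x \in [0,1]^n} \quad & \sum_{i=1}^{n} f_i \, s_i \left( x_i B_H + (1 - x_i) B_D \right) \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_i \, s_i \le C .
\end{align*}
```

Under this relaxation the optimum is greedy: rank blocks by access frequency f_i and fill HBM until the capacity C is exhausted, which gives one intuition for why substantial headroom can exist over static placements when access frequencies shift at runtime.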
[AI-74] MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols
[Quick Read]: This paper addresses the new security risks introduced when large language models (LLMs) integrate external data sources and tools via the Model Context Protocol (MCP): while MCP, as a universal open standard connecting AI agents to external resources, extends LLM capabilities, it also significantly expands the attack surface. The key to the solution is the first systematic taxonomy of MCP security, identifying 17 attack types across 4 primary attack surfaces, together with MCPSecBench, a modular and extensible benchmark and playground that integrates prompt datasets, MCP servers, MCP clients, and attack scripts to evaluate these attacks across three major MCP providers. Experiments show that over 85% of the identified attacks compromise at least one platform, with core vulnerabilities universally affecting Claude, OpenAI, and Cursor, while prompt-based and tool-centric attacks vary markedly across hosts and models, providing a standardized method and empirical basis for MCP-layer security assessment.
Link: https://arxiv.org/abs/2508.13220
Authors: Yixuan Yang, Daoyuan Wu, Yufan Chen
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This is a technical report from Lingnan University, Hong Kong
Abstract:Large Language Models (LLMs) are increasingly integrated into real-world applications via the Model Context Protocol (MCP), a universal, open standard for connecting AI agents with data sources and external tools. While MCP enhances the capabilities of LLM-based agents, it also introduces new security risks and expands their attack surfaces. In this paper, we present the first systematic taxonomy of MCP security, identifying 17 attack types across 4 primary attack surfaces. We introduce MCPSecBench, a comprehensive security benchmark and playground that integrates prompt datasets, MCP servers, MCP clients, and attack scripts to evaluate these attacks across three major MCP providers. Our benchmark is modular and extensible, allowing researchers to incorporate custom implementations of clients, servers, and transport protocols for systematic security assessment. Experimental results show that over 85% of the identified attacks successfully compromise at least one platform, with core vulnerabilities universally affecting Claude, OpenAI, and Cursor, while prompt-based and tool-centric attacks exhibit considerable variability across different hosts and models. Overall, MCPSecBench standardizes the evaluation of MCP security and enables rigorous testing across all MCP layers.
[AI-75] Deep Graph Neural Point Process For Learning Temporal Interactive Networks
[Quick Read]: This paper addresses the limitation of traditional temporal interaction network (TIN) modeling, which treats learning TINs as a coarse-grained multi-sequence prediction task and ignores the influence of network topology. The key to the solution is the Deep Graph Neural Point Process (DGNPP) model with two core modules: a Node Aggregation Layer that captures static topological structure to produce static user and item representations, and a Self Attentive Layer that dynamically updates embeddings over time. By incorporating both dynamic and static embeddings into the event intensity function and optimizing via maximum likelihood estimation, DGNPP effectively predicts events and their occurrence times, significantly outperforming baseline models on three public datasets.
Link: https://arxiv.org/abs/2508.13219
Authors: Su Chen, Xiaohua Qi, Xixun Lin, Yanmin Shang, Xiaolin Xu, Yangxi Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Learning temporal interaction networks (TINs) has previously been treated as a coarse-grained multi-sequence prediction problem, ignoring the influence of the network's topological structure. This paper addresses this limitation and proposes a Deep Graph Neural Point Process (DGNPP) model for TINs. DGNPP consists of two key modules: the Node Aggregation Layer and the Self Attentive Layer. The Node Aggregation Layer captures topological structures to generate static representations for users and items, while the Self Attentive Layer dynamically updates embeddings over time. By incorporating both dynamic and static embeddings into the event intensity function and optimizing the model via maximum likelihood estimation, DGNPP predicts events and occurrence times effectively. Experimental evaluations on three public datasets demonstrate that DGNPP achieves superior performance in event prediction and time prediction tasks with high efficiency, significantly outperforming baseline models and effectively mitigating the limitations of prior approaches.
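A minimal sketch of the intensity parameterization as described (the dimensions and softplus link are our assumptions): static and dynamic embeddings are concatenated into a positive event intensity, trained with the usual point-process likelihood.

```python
import torch
import torch.nn as nn

class Intensity(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.w = nn.Linear(2 * d, 1)

    def forward(self, static_emb, dynamic_emb):
        # softplus keeps the intensity positive, as a point process requires
        return nn.functional.softplus(self.w(torch.cat([static_emb, dynamic_emb], -1)))

def nll(intensity_at_events, integral_estimate):
    """Point-process negative log-likelihood:
    -sum(log lambda(t_i)) + integral of lambda over the observation window."""
    return -torch.log(intensity_at_events).sum() + integral_estimate

lam = Intensity()
s, dyn = torch.randn(10, 32), torch.randn(10, 32)   # 10 user-item events
vals = lam(s, dyn).squeeze(-1)
print(nll(vals, integral_estimate=vals.mean() * 10.0))  # Monte Carlo-style proxy
```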
[AI-76] Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions
[Quick Read]: This paper probes the robustness of large language models (LLMs) in LLM-as-a-judge applications, in particular their vulnerability to hidden prompt injection attacks. The key is a frustratingly simple yet effective attack setting: basic arithmetic questions (e.g., "What is 3 + 2?") are embedded in PDF files as multiple-choice or true-false problems, with malicious instructions hidden inside the document, to test whether LLMs can be misled. The results show that LLMs are highly susceptible to such hidden prompt injections even on these trivial tasks, exposing serious robustness risks for LLM-as-a-judge applications in education, peer review, and data-quality evaluation.
Link: https://arxiv.org/abs/2508.13214
Authors: Xuyang Guo, Zekai Huang, Zhao Song, Jiahao Zhang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have recently demonstrated strong emergent abilities in complex reasoning and zero-shot generalization, showing unprecedented potential for LLM-as-a-judge applications in education, peer review, and data quality evaluation. However, their robustness under prompt injection attacks, where malicious instructions are embedded into the content to manipulate outputs, remains a significant concern. In this work, we explore a frustratingly simple yet effective attack setting to test whether LLMs can be easily misled. Specifically, we evaluate LLMs on basic arithmetic questions (e.g., “What is 3 + 2?”) presented as either multiple-choice or true-false judgment problems within PDF files, where hidden prompts are injected into the file. Our results reveal that LLMs are indeed vulnerable to such hidden prompt injection attacks, even in these trivial scenarios, highlighting serious robustness risks for LLM-as-a-judge applications.
[AI-77] AI sustains higher strategic tension than humans in chess
[Quick Read]: This paper studies the trade-off between immediate opportunities and long-term objectives in strategic decision-making by comparing human-vs-human and AI-vs-AI chess games. The key to the solution is a network-based piece-to-piece interaction metric that quantifies the ongoing strategic tension on the board, together with an analysis of how it evolves over a game. The results show that the most competitive AI players sustain higher strategic tension for longer than elite human players, and that cumulative tension grows with algorithmic complexity for AI while jumping abruptly with human expertise at roughly 1600 and 2300 Elo; AI tolerates interconnected positions balanced between offensive and defensive tactics over long stretches, whereas humans actively limit tension and game complexity, reflecting cognitive limitations and adaptive strategies.
Link: https://arxiv.org/abs/2508.13213
Authors: Adamo Cerioli, Edward D. Lee, Vito D. P. Servedio
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Strategic decision-making involves managing the tension between immediate opportunities and long-term objectives. We study this trade-off in chess by characterizing and comparing dynamics between human vs human and AI vs AI games. We propose a network-based metric of piece-to-piece interaction to quantify the ongoing strategic tension on the board. Its evolution in games reveals that the most competitive AI players sustain higher levels of strategic tension for longer durations than elite human players. Cumulative tension varies with algorithmic complexity for AI and correspondingly in human-played games increases abruptly with expertise at about 1600 Elo and again at 2300 Elo. The profiles reveal different approaches. Highly competitive AI tolerates interconnected positions balanced between offensive and defensive tactics over long periods. Human play, in contrast, limits tension and game complexity, which may reflect cognitive limitations and adaptive strategies. The difference may have implications for AI usage in complex, strategic environments.
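A simplified proxy for the piece-to-piece interaction metric can be computed with the python-chess library (the paper's exact network construction may differ; counting attack and defense edges onto occupied squares is our approximation):

```python
import chess

def tension(board: chess.Board) -> int:
    """Count directed piece-to-piece edges: piece A 'sees' occupied square B."""
    edges = 0
    for square in chess.SQUARES:
        if board.piece_at(square) is None:
            continue
        # attackers() of each color covers both attacks on enemy pieces and
        # defenses of same-color pieces standing on this square
        edges += len(board.attackers(chess.WHITE, square))
        edges += len(board.attackers(chess.BLACK, square))
    return edges

board = chess.Board()
print(tension(board))                      # interactions in the initial position
board.push_san("e4"); board.push_san("e5")
print(tension(board))                      # tension evolves as lines open
```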
[AI-78] Research on Conversational Recommender System Considering Consumer Types
[Quick Read]: This paper addresses the fact that current conversational recommender systems (CRS) overlook users' heterogeneous decision-making styles and knowledge levels when personalizing, constraining both accuracy and interaction efficiency. The key to the solution is the CT-CRS framework, which integrates consumer type modeling into dialogue recommendation: based on consumer type theory, users are divided into dependent, efficient, cautious, and expert categories along two dimensions, decision-making style (maximizers vs. satisficers) and knowledge level (high vs. low); user types are inferred in real time from interaction histories with a fine-tuned large language model, avoiding static questionnaires. User types are incorporated into the state representation, a type-adaptive policy dynamically adjusts recommendation granularity, diversity, and attribute-query complexity, and inverse reinforcement learning (IRL) further optimizes the dialogue policy so that the agent approximates expert-like strategies conditioned on consumer type, significantly improving recommendation success rates while reducing interaction turns.
Link: https://arxiv.org/abs/2508.13209
Authors: Yaying Luo, Hui Fang, Zhu Sun
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 10 pages
Abstract:Conversational Recommender Systems (CRS) provide personalized services through multi-turn interactions, yet most existing methods overlook users' heterogeneous decision-making styles and knowledge levels, which constrains both accuracy and efficiency. To address this gap, we propose CT-CRS (Consumer Type-Enhanced Conversational Recommender System), a framework that integrates consumer type modeling into dialogue recommendation. Based on consumer type theory, we define four user categories - dependent, efficient, cautious, and expert - derived from two dimensions: decision-making style (maximizers vs. satisficers) and knowledge level (high vs. low). CT-CRS employs interaction histories and fine-tunes the large language model to automatically infer user types in real time, avoiding reliance on static questionnaires. We incorporate user types into state representation and design a type-adaptive policy that dynamically adjusts recommendation granularity, diversity, and attribute query complexity. To further optimize the dialogue policy, we adopt Inverse Reinforcement Learning (IRL), enabling the agent to approximate expert-like strategies conditioned on consumer type. Experiments on LastFM, Amazon-Book, and Yelp show that CT-CRS improves recommendation success rate and reduces interaction turns compared to strong baselines. Ablation studies confirm that both consumer type modeling and IRL contribute significantly to performance gains. These results demonstrate that CT-CRS offers a scalable and interpretable solution for enhancing CRS personalization through the integration of psychological modeling and advanced policy optimization.
[AI-79] QuickMerge: Fast Token Merging with Autoregressive Prior ICML
[Quick Read]: This paper addresses the token-level compute cost that becomes a key bottleneck as generative models scale to larger inputs across language, vision, and video, i.e., how to dynamically reduce the number of tokens processed without significantly hurting downstream performance. The key to the solution is the QuickMerge framework: it dynamically selects semantically salient tokens based on attention norm magnitude, flexibly controls the compression ratio with an entropy-based budget estimator, and introduces a lightweight Transformer prior trained over the merged token sequence to preserve autoregressive (AR) consistency, enabling efficient and accurate next-token prediction. Experiments across modalities show that QuickMerge consistently improves compute-accuracy trade-offs, matching or exceeding learned tokenizers and fixed-patch baselines with substantially fewer tokens.
Link: https://arxiv.org/abs/2508.13204
Authors: Dong Liu, Yanxuan Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: The paper has been accepted to ICML Tokshop at this https URL
Abstract:As generative models scale to larger inputs across language, vision, and video domains, the cost of token-level computation has become a key bottleneck. While prior work suggests that only a subset of tokens significantly influence downstream predictions, most token selection methods are static, modality-specific, or incompatible with autoregressive generation. In this paper, we propose QuickMerge, a lightweight token merging framework designed for efficient next-token prediction. QuickMerge dynamically selects a reduced number of tokens based on attention norm magnitude, guided by an entropy-based budget estimator. To preserve autoregressive compatibility, we introduce a lightweight transformer prior trained over the merged token sequence. By combining semantic salience estimation, flexible token budgets, and AR alignment, QuickMerge enables accurate generation with fewer tokens. We evaluate QuickMerge across multi-modality domains, demonstrating consistent improvements in compute-accuracy tradeoffs. Specifically, QuickMerge reduces token counts substantially while matching or exceeding the performance of learned tokenizers and fixed-patch baselines.
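Our reading of the merging step, as a hedged sketch (the scoring, budget, and merge rules below are inferred from the abstract, not taken from released code):

```python
import torch

def quickmerge(tokens, attn, min_keep=4):
    """tokens: (L, d); attn: (L,) per-token attention norm magnitudes."""
    p = attn / attn.sum()
    entropy = -(p * (p + 1e-9).log()).sum()
    max_entropy = torch.log(torch.tensor(float(len(p))))
    keep = max(min_keep, int(len(p) * (entropy / max_entropy)))  # entropy budget
    kept_idx = attn.topk(keep).indices.sort().values             # salient tokens
    kept = tokens[kept_idx].clone()
    dropped = [i for i in range(len(p)) if i not in set(kept_idx.tolist())]
    for i in dropped:                        # merge each dropped token into its
        sims = tokens[i] @ kept.T            # most similar kept token
        j = sims.argmax()
        kept[j] = 0.5 * (kept[j] + tokens[i])
    return kept, kept_idx

toks = torch.randn(16, 8)
norms = torch.rand(16)
merged, idx = quickmerge(toks, norms)
print(merged.shape, idx.tolist())
```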
[AI-80] Contextual Attention-Based Multimodal Fusion of LLM and CNN for Sentiment Analysis
[Quick Read]: This paper addresses multimodal sentiment analysis of social media data in natural-disaster settings, i.e., how to fuse textual and visual information effectively to improve sentiment classification and thus support crisis-management decisions. The key to the solution is a framework that integrates convolutional neural network (CNN) image analysis with large language model (LLM) text processing, using a Generative Pre-trained Transformer (GPT) and prompt engineering to extract sentiment-relevant features from the CrisisMMD dataset, and a contextual attention mechanism in the fusion stage to model cross-modal interactions and capture complex image-text relationships. The method improves accuracy by 2.43% and F1-score by 5.18% over baselines, clearly outperforming approaches that process text and images separately.
Link: https://arxiv.org/abs/2508.13196
Authors: Meriem Zerkouk, Miloud Mihoubi, Belkacem Chikhaoui
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: The 38th Canadian Conference on Artificial Intelligence (2025)
Abstract:This paper introduces a novel approach for multimodal sentiment analysis on social media, particularly in the context of natural disasters, where understanding public sentiment is crucial for effective crisis management. Unlike conventional methods that process text and image modalities separately, our approach seamlessly integrates Convolutional Neural Network (CNN) based image analysis with Large Language Model (LLM) based text processing, leveraging Generative Pre-trained Transformer (GPT) and prompt engineering to extract sentiment relevant features from the CrisisMMD dataset. To effectively model intermodal relationships, we introduce a contextual attention mechanism within the fusion process. Leveraging contextual-attention layers, this mechanism effectively captures intermodality interactions, enhancing the model's comprehension of complex relationships between textual and visual data. The deep neural network architecture of our model learns from these fused features, leading to improved accuracy compared to existing baselines. Experimental results demonstrate significant advancements in classifying social media data into informative and non-informative categories across various natural disasters. Our model achieves a notable 2.43% increase in accuracy and 5.18% in F1-score, highlighting its efficacy in processing complex multimodal data. Beyond quantitative metrics, our approach provides deeper insight into the sentiments expressed during crises. The practical implications extend to real-time disaster management, where enhanced sentiment analysis can optimize the accuracy of emergency interventions. By bridging the gap between multimodal analysis, LLM-powered text understanding, and disaster response, our work presents a promising direction for Artificial Intelligence (AI) driven crisis management solutions.
[AI-81] Using Artificial Intuition in Distinct Minimalist Classification of Scientific Abstracts for Management of Technology Portfolios
[Quick Read]: This paper addresses the difficulty of automatically classifying scientific abstracts, which is hard to automate because the sparse text offers few contextual clues; existing metadata-based methods often require semi-supervised settings and generate overlapping labels that do not uniquely define an abstract. The key to the solution is a process the authors call artificial intuition, which replicates the expert's approach by using a large language model (LLM) to generate high-quality metadata for a distinctive labeling scheme; the labels are built from publicly available US National Science Foundation (NSF) abstracts and then tested on abstracts from the Chinese National Natural Science Foundation (NSFC) to examine funding trends, demonstrating feasibility for research portfolio management, technology scouting, and other strategic activities.
Link: https://arxiv.org/abs/2508.13182
Authors: Prateek Ranka, Fred Morstatter, Andrea Belz, Alexandra Graddy-Reed
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Classification of scientific abstracts is useful for strategic activities but challenging to automate because the sparse text provides few contextual clues. Metadata associated with the scientific publication can be used to improve performance but still often requires a semi-supervised setting. Moreover, such schemes may generate labels that lack distinction – namely, they overlap and thus do not uniquely define the abstract. In contrast, experts label and sort these texts with ease. Here we describe an application of a process we call artificial intuition to replicate the expert’s approach, using a Large Language Model (LLM) to generate metadata. We use publicly available abstracts from the United States National Science Foundation to create a set of labels, and then we test this on a set of abstracts from the Chinese National Natural Science Foundation to examine funding trends. We demonstrate the feasibility of this method for research portfolio management, technology scouting, and other strategic activities.
[AI-82] Search-Time Data Contamination
[Quick Read]: This paper addresses search-time contamination (STC) in the evaluation of search-based LLM agents: when the retrieval step surfaces a source containing a test question (or a near-duplicate) together with its answer, agents can copy rather than genuinely infer or reason, undermining benchmark integrity and validity. The key lies in identifying and quantifying STC: for roughly 3% of test questions, agents could directly retrieve the datasets with ground-truth labels from HuggingFace, and when millions of queries target the same benchmark, even small repeated leaks accelerate benchmark obsolescence; after blocking HuggingFace access, accuracy on the contaminated subset dropped by about 15%, confirming the contamination. The paper concludes with best practices for benchmark design and result reporting to address this new form of leakage, and publicly releases complete experiment logs to support auditing of evaluation results.
Link: https://arxiv.org/abs/2508.13180
Authors: Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Data contamination refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held-out test sets and compromising test validity. We identify an analogous issue, search-time contamination (STC), in evaluating search-based LLM agents which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity. We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search-based agent logs. Consequently, agents often explicitly acknowledge discovering question-answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks: Humanity's Last Exam (HLE), SimpleQA, and GPQA, we demonstrate that for approximately 3% of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace. When millions of evaluation queries target the same benchmark, even small, repeated leaks can accelerate the benchmark's obsolescence, shortening its intended lifecycle. After HuggingFace is blocked, we observe a drop in accuracy on the contaminated subset of approximately 15%. We further show through ablation experiments that publicly accessible evaluation datasets on HuggingFace may not be the sole source of STC. To this end, we conclude by proposing best practices for benchmark design and result reporting to address this novel form of leakage and ensure trustworthy evaluation of search-based LLM agents. To facilitate the auditing of evaluation results, we also publicly release the complete logs from our experiments.
zh
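为直观理解STC的检测思路,下面给出一个示意性的泄漏过滤草图(函数名与域名列表均为本文假设的简化,并非论文的官方实现):对代理检索到的每个来源,先检查域名是否属于托管评测集的平台,再用近似字符串匹配判断页面是否复述了测试问题。

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# 托管评测数据集的域名黑名单(示例,非论文官方列表)
BLOCKED_HOSTS = {"huggingface.co"}

def is_contaminated(question: str, source_url: str, page_text: str,
                    sim_threshold: float = 0.8) -> bool:
    """判断一次检索结果是否构成搜索时间污染(STC)的简化启发式。"""
    host = urlparse(source_url).netloc.lower()
    if any(h in host for h in BLOCKED_HOSTS):
        return True  # 来源本身就是评测集托管平台
    # 滑动窗口近似匹配:页面中是否出现与测试问题高度相似的片段
    window, step = len(question), max(len(question) // 2, 1)
    for i in range(0, max(len(page_text) - window, 0) + 1, step):
        chunk = page_text[i:i + window]
        if SequenceMatcher(None, question, chunk).ratio() >= sim_threshold:
            return True  # 检索内容近似复述了测试问题
    return False
```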
[AI-83] Toward an African Agenda for AI Safety
【速读】:该论文旨在解决非洲在人工智能(Artificial Intelligence, AI)安全治理中面临的独特风险及其在全球AI安全议程中话语权缺失的问题。这些问题包括由深度伪造引发的选举干预、数据殖民依赖、计算资源匮乏、劳动力市场的大规模扰动,以及气候驱动的环境成本对非洲地区造成的不成比例影响。解决方案的关键在于提出一项五点行动纲领:一是以保护最易受AI负面社会经济影响群体的人权为核心制定政策;二是设立非洲AI安全研究所(African AI Safety Institute),填补大陆在AI安全专业机构上的空白;三是提升公众对AI的认知与素养;四是开发涵盖25种以上非洲语言的早期预警系统及包容性基准测试套件;五是建立年度非洲联盟(AU)层级的AI安全论坛,推动区域协同治理与全球参与。
链接: https://arxiv.org/abs/2508.13179
作者: Samuel T. Segun,Rachel Adams,Ana Florido,Scott Timcke,Jonathan Shock,Leah Junck,Fola Adeleke,Nicolas Grossman,Ayantola Alayande,Jerry John Kponyo,Matthew Smith,Dickson Marfo Fosu,Prince Dawson Tetteh,Juliet Arthur,Stephanie Kasaon,Odilile Ayodele,Laetitia Badolo,Paul Plantinga,Michael Gastrow,Sumaya Nur Adan,Joanna Wiaterek,Cecil Abungu,Kojo Apeagyei,Luise Eder,Tegawende Bissyande
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 28 pages, 2 figures
Abstract:This paper maps Africa’s distinctive AI risk profile, from deepfake-fuelled electoral interference and data colonial dependency to compute scarcity, labour disruption and disproportionate exposure to climate-driven environmental costs. While major benefits are promised to accrue, the availability, development and adoption of AI also mean that African people and countries face particular AI safety risks, from large-scale labour market disruptions to the nefarious use of AI to manipulate public opinion. To date, African perspectives have not been meaningfully integrated into global debates and processes regarding AI safety, leaving African stakeholders with limited influence over the emerging global AI safety governance agenda. While there are Computer Incident Response Teams on the continent, none hosts a dedicated AI Safety Institute or office. We propose a five-point action plan centred on (i) a policy approach that foregrounds the protection of the human rights of those most vulnerable to experiencing the harmful socio-economic effects of AI; (ii) the establishment of an African AI Safety Institute; (iii) the promotion of public AI literacy and awareness; (iv) the development of an early warning system with inclusive benchmark suites for 25+ African languages; and (v) an annual AU-level AI Safety and Security Forum.
zh
[AI-84] A Hardware-oriented Approach for Efficient Active Inference Computation and Deployment
【速读】:该论文旨在解决主动推理(Active Inference, AIF)在资源受限环境中部署时面临的计算复杂度高和内存占用大的问题。其解决方案的关键在于将pymdp框架的灵活性与效率相结合,并设计了一个统一的、稀疏的计算图(computational graph),以支持硬件高效的执行,从而实现超过2倍的延迟降低和最高达35%的内存节省,显著提升了AIF代理在实时和嵌入式应用场景中的可行性。
链接: https://arxiv.org/abs/2508.13177
作者: Nikola Pižurica,Nikola Milović,Igor Jovančević,Conor Heins,Miguel de Prado
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Active Inference (AIF) offers a robust framework for decision-making, yet its computational and memory demands pose challenges for deployment, especially in resource-constrained environments. This work presents a methodology that facilitates AIF’s deployment by integrating pymdp’s flexibility and efficiency with a unified, sparse computational graph tailored for hardware-efficient execution. Our approach reduces latency by over 2x and memory by up to 35%, advancing the deployment of efficient AIF agents for real-time and embedded applications.
zh
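作为背景补充,下面用scipy稀疏矩阵勾勒一次离散主动推理中的状态后验更新(记号沿用pymdp社区的常见约定,数据与维度均为演示用假设,并非论文实现),以体现稀疏计算图节省内存与延迟的思路:

```python
import numpy as np
from scipy import sparse

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# 稀疏似然矩阵 A(观测 × 隐状态)与先验 D(随机示例数据,仅为演示)
A = sparse.random(16, 64, density=0.1, format="csr", random_state=0)
col_sums = np.asarray(A.sum(axis=0)).ravel()
A = sparse.csr_matrix(A.multiply(1.0 / np.maximum(col_sums, 1e-16)))  # 列归一化为条件分布

D = np.full(64, 1.0 / 64)  # 均匀先验

def infer_state(obs_idx: int) -> np.ndarray:
    """一次离散贝叶斯状态推断:q(s) ∝ exp(ln A[o, :] + ln D)。"""
    likelihood = A[obs_idx].toarray().ravel()
    return softmax(np.log(likelihood + 1e-16) + np.log(D))

q_s = infer_state(obs_idx=3)  # 长度为 64 的后验分布
```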
[AI-85] Fitting Ontologies and Constraints to Relational Structures KR2025
【速读】:该论文旨在解决如何从正例和负例(以有限关系结构的形式给出)中拟合本体(ontology)和约束(constraint),具体针对描述逻辑 EL 和 ELI 以及多种元组生成依赖(tuple-generating dependencies, TGDs)类,包括全TGD、守卫TGD、前哨守卫TGD、前哨单TGD和无限制TGD及包含依赖。其核心解决方案在于精确刻画了各类语言下拟合问题的计算复杂度,设计出相应的算法,并分析了所生成本体与TGD的大小。关键发现是:对于 EL、ELI、守卫TGD和包含依赖,存在有限基(finite basis);而对于全TGD、前哨守卫TGD和前哨单TGD,一般不存在有限基,这揭示了不同约束表达能力在可学习性上的本质差异。
链接: https://arxiv.org/abs/2508.13176
作者: Simon Hosemann,Jean Christoph Jung,Carsten Lutz,Sebastian Rudolph
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Accepted at the 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR 2025)
Abstract:We study the problem of fitting ontologies and constraints to positive and negative examples that take the form of a finite relational structure. As ontology and constraint languages, we consider the description logics $\mathcal{EL}$ and $\mathcal{ELI}$ as well as several classes of tuple-generating dependencies (TGDs): full, guarded, frontier-guarded, frontier-one, and unrestricted TGDs as well as inclusion dependencies. We pinpoint the exact computational complexity, design algorithms, and analyze the size of fitting ontologies and TGDs. We also investigate the related problem of constructing a finite basis of concept inclusions / TGDs for a given set of finite structures. While finite bases exist for $\mathcal{EL}$, $\mathcal{ELI}$, guarded TGDs, and inclusion dependencies, they in general do not exist for full, frontier-guarded and frontier-one TGDs.
zh
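以论文所涉约束类中最简单的包含依赖(inclusion dependency)为例,下面给出"拟合检查"这一核心判定的示意草图(结构表示与拟合语义为本文针对单条候选约束的简化假设:所有正例满足、所有负例违反):

```python
from itertools import product

# 有限关系结构:关系名 -> 元组集合(简化表示)
Structure = dict[str, set[tuple]]

def holds(struct: Structure, R: str, i: int, S: str, j: int) -> bool:
    """包含依赖 R[i] ⊆ S[j] 是否在该结构中成立。"""
    proj_R = {t[i] for t in struct.get(R, set())}
    proj_S = {t[j] for t in struct.get(S, set())}
    return proj_R <= proj_S

def fits(candidate, positives, negatives) -> bool:
    """拟合条件(单条约束的简化版):正例全部满足,负例全部违反。"""
    R, i, S, j = candidate
    return (all(holds(p, R, i, S, j) for p in positives)
            and all(not holds(n, R, i, S, j) for n in negatives))

def enumerate_candidates(arities: dict[str, int]):
    """在给定各关系 arity 的前提下枚举全部候选包含依赖(小规模示例)。"""
    for (R, aR), (S, aS) in product(arities.items(), repeat=2):
        for i, j in product(range(aR), range(aS)):
            yield (R, i, S, j)
```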
[AI-86] AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining
【速读】:该论文旨在解决生成式 AI (Generative AI) 在量化投资中用于挖掘预测信号(alpha)时存在的评估难题,包括现有方法在计算效率、多维评估能力以及可复现性方面的局限。传统回测方法虽能全面衡量策略性能但计算成本高且难以并行;相关性指标虽高效却仅关注预测能力,忽略稳定性、鲁棒性、金融逻辑性和多样性等关键属性;同时,多数现有模型为闭源,阻碍了研究的透明度与进展。解决方案的关键在于提出 AlphaEval——一个统一、可并行化且无需回测的评估框架,通过五个互补维度(预测力、稳定性、市场扰动下的鲁棒性、金融逻辑合理性及多样性)对生成的 alpha 进行系统评估,实现了与完整回测相当的一致性评价,同时显著提升效率并提供更全面的洞察,从而推动自动化 alpha 发现的可靠性和可扩展性。
链接: https://arxiv.org/abs/2508.13174
作者: Hongjun Ding,Binqi Chen,Jinsheng Huang,Taian Guo,Zhengyang Mao,Guoyi Shao,Lutong Zou,Luchen Liu,Ming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
备注: 12 pages, 5 figures
Abstract:Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches-such as genetic programming, reinforcement learning, and large language models-have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures. Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters. Correlation-based metrics, though efficient, assess only predictive ability and overlook other crucial properties such as temporal stability, robustness, diversity, and interpretability. Additionally, the closed-source nature of most existing alpha mining models hinders reproducibility and slows progress in this field. To address these issues, we propose AlphaEval, a unified, parallelizable, and backtest-free evaluation framework for automated alpha mining models. AlphaEval assesses the overall quality of generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. Extensive experiments across representative alpha mining algorithms demonstrate that AlphaEval achieves evaluation consistency comparable to comprehensive backtesting, while providing more comprehensive insights and higher efficiency. Furthermore, AlphaEval effectively identifies superior alphas compared to traditional single-metric screening approaches. All implementations and evaluation tools are open-sourced to promote reproducibility and community engagement.
zh
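为说明"免回测、多维并行评估"的思路,下面给出一个极简numpy示例(各维度指标取常见的简化定义,并非AlphaEval的官方实现):

```python
import numpy as np

def daily_ic(alpha: np.ndarray, fwd_ret: np.ndarray) -> np.ndarray:
    """逐日截面相关系数(IC)。alpha/fwd_ret 形状均为 (T 天, N 只股票)。"""
    a = alpha - alpha.mean(axis=1, keepdims=True)
    r = fwd_ret - fwd_ret.mean(axis=1, keepdims=True)
    num = (a * r).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (r ** 2).sum(axis=1)) + 1e-12
    return num / den

def evaluate(alpha, fwd_ret, existing_alphas):
    """沿预测力 / 稳定性 / 鲁棒性 / 多样性四个维度给 alpha 打分。"""
    ic = daily_ic(alpha, fwd_ret)
    return {
        "predictive_power": ic.mean(),                    # 平均 IC:预测力
        "stability": np.corrcoef(ic[:-1], ic[1:])[0, 1],  # IC 时序自相关:稳定性
        "robustness": daily_ic(alpha + 0.01 * np.random.randn(*alpha.shape),
                               fwd_ret).mean(),           # 加噪扰动后的预测力
        "diversity": 1 - max(abs(np.corrcoef(alpha.ravel(), e.ravel())[0, 1])
                             for e in existing_alphas),   # 与已有 alpha 的去相关度
    }
```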
[AI-87] Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures
【速读】:该论文旨在解决大规模深度学习与人工智能模型训练中因计算复杂度急剧上升而导致的能源消耗激增问题,进而引发的可持续性挑战。其核心解决方案在于通过软硬件协同设计(hardware-software co-design)方法,显著提升内存级和核级操作效率,从而优化性能功耗比(performance-per-watt)。关键技术包括对专用张量核心(tensor cores)和矩阵核心(matrix cores)的利用、高级内存优化策略、混合精度计算(mixed-precision arithmetic)、能量感知调度算法(energy-aware scheduling algorithms)以及编译器驱动的内核增强等,这些措施共同实现了显著的能效提升,并通过Meta、Google、Amazon等企业的实际案例验证了其有效性。
链接: https://arxiv.org/abs/2508.13163
作者: Yashasvi Makin,Rahul Maliakkal
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: IEEE CISOSE Industry Track 2025 Conference
Abstract:Large-scale deep learning and artificial intelligence model training consumes substantial computational power and energy, posing serious sustainability issues. The fast rise in model complexity has resulted in exponential increases in energy consumption, increasing the demand for techniques maximizing computational efficiency and lowering environmental impact. This work explores environmentally driven performance optimization methods especially intended for advanced GPU architectures from NVIDIA, AMD, and other emerging GPU architectures. Our main focus is on investigating hardware-software co-design techniques meant to significantly improve the efficiency of memory-level and kernel-level operations, thereby improving performance-per-watt measures. Our thorough research encompasses evaluations of specialized tensor and matrix cores, advanced memory optimization methods, and creative integration approaches that taken together result in notable energy efficiency increases. We also discuss important software-level optimizations that augment hardware capability including mixed-precision arithmetic, advanced energy-aware scheduling algorithms, and compiler-driven kernel enhancements. Moreover, we methodically point out important research gaps and suggest future directions necessary to create truly sustainable artificial intelligence systems. This paper emphasizes how major increases in training efficiency can be obtained by co-design of hardware and software, thereby lowering the environmental impact of artificial intelligence without compromising performance. To back up our analysis, we use real-world case studies from top companies like Meta, Google, Amazon, and others that show how these sustainable AI training methods are used in the real world.
zh
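文中列举的软件侧手段里,混合精度计算最容易落地。下面是PyTorch中标准的混合精度训练步骤(通用写法,与论文无直接对应关系):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # 防止 fp16 梯度下溢

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    # 前向在 fp16 下执行,利用 Tensor Core 降低能耗与延迟
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # 梯度按比例放大后反传
    scaler.step(optimizer)         # 内部先反缩放再更新参数
    scaler.update()
    return loss.item()
```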
[AI-88] Piano: A Multi-Constraint Pin Assignment-Aware Floorplanner
【速读】:该论文旨在解决现代集成电路物理设计中floorplanning阶段对引脚分配(pin assignment)考虑不足的问题,尤其是在固定轮廓(fixed-outline)、白空间(whitespace)移除以及预放置模块(pre-placed modules)等复杂约束下的优化难题。传统floorplanner通常忽略pin assignment与模块布局的协同优化,导致后续详细放置和布线阶段性能下降。解决方案的关键在于提出Piano框架,其核心是构建基于模块几何关系与网表连接的图结构,通过迭代最短路径搜索确定最优pin assignment,并结合白空间移除策略及三种局部优化器,在多约束条件下同步优化模块位置与引脚分配,从而显著改善HPWL、feedthrough wirelength、feedthrough模块数量和未放置引脚数等关键指标。
链接: https://arxiv.org/abs/2508.13161
作者: Zhexuan Xu,Kexin Zhou,Jie Wang,Zijie Geng,Siyuan Xu,Shixiong Kai,Mingxuan Yuan,Feng Wu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Floorplanning is a critical step in VLSI physical design, increasingly complicated by modern constraints such as fixed-outline requirements, whitespace removal, and the presence of pre-placed modules. In addition, the assignment of pins on module boundaries significantly impacts the performance of subsequent stages, including detailed placement and routing. However, traditional floorplanners often overlook pin assignment with modern constraints during the floorplanning stage. In this work, we introduce Piano, a floorplanning framework that simultaneously optimizes module placement and pin assignment under multiple constraints. Specifically, we construct a graph based on the geometric relationships among modules and their netlist connections, then iteratively search for shortest paths to determine pin assignments. This graph-based method also enables accurate evaluation of feedthrough and unplaced pins, thereby guiding overall layout quality. To further improve the design, we adopt a whitespace removal strategy and employ three local optimizers to enhance layout metrics under multi-constraint scenarios. Experimental results on widely used benchmark circuits demonstrate that Piano achieves an average 6.81% reduction in HPWL, a 13.39% decrease in feedthrough wirelength, a 16.36% reduction in the number of feedthrough modules, and a 21.21% drop in unplaced pins, while maintaining zero whitespace.
zh
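下面用networkx勾勒"基于几何关系建图、迭代最短路决定引脚分配"这一思路的骨架(建图方式与拥塞反馈系数均为本文假设的简化,仅保留论文方法的轮廓):

```python
import networkx as nx

def build_pin_graph(candidate_pins):
    """节点为模块边界上的候选引脚坐标,边权取曼哈顿距离(简化示例)。"""
    G = nx.Graph()
    for a in candidate_pins:
        for b in candidate_pins:
            if a != b:
                dist = abs(a[0] - b[0]) + abs(a[1] - b[1])
                G.add_edge(a, b, weight=dist)
    return G

def assign_pins(G, nets):
    """对每条 net 迭代求最短路,路径端点即被选中的引脚位置。"""
    assignment = {}
    for net_id, (src, dst) in nets.items():
        path = nx.shortest_path(G, src, dst, weight="weight")
        assignment[net_id] = (path[0], path[-1])
        # 简化的拥塞反馈:加重已用边的权,促使后续 net 绕开
        for u, v in zip(path, path[1:]):
            G[u][v]["weight"] *= 1.5
    return assignment
```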
[AI-89] EvoVerilog: Large Language Model Assisted Evolution of Verilog Code
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)生成Verilog硬件描述语言代码时存在的两大问题:一是现有方法高度依赖人工干预和定制数据集微调,限制了其在自动化设计流程中的可扩展性;二是现有迭代搜索技术难以探索多样化的电路设计方案,且性能常不如简单重复提示(repeated prompting)策略。解决方案的关键在于提出EvoVerilog框架,该框架将LLMs的推理能力与进化算法(evolutionary algorithms)相结合,采用多目标、种群驱动的搜索策略,在无需人工介入的情况下自动探索广泛的设计空间,从而实现Verilog代码的高效生成与优化。实验表明,EvoVerilog在VerilogEval-Machine和VerilogEval-Human基准上分别取得89.1和80.2的pass@10得分,并能同时生成多种功能正确且资源利用率优化的Verilog实现。
链接: https://arxiv.org/abs/2508.13156
作者: Ping Guo,Yiting Wang,Wanghao Ye,Yexiao He,Ziyao Wang,Xiaopeng Dai,Ang Li,Qingfu Zhang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated great potential in automating the generation of Verilog hardware description language code for hardware design. This automation is critical to reducing human effort in the complex and error-prone process of hardware design. However, existing approaches predominantly rely on human intervention and fine-tuning using curated datasets, limiting their scalability in automated design workflows. Although recent iterative search techniques have emerged, they often fail to explore diverse design solutions and may underperform simpler approaches such as repeated prompting. To address these limitations, we introduce EvoVerilog, a novel framework that combines the reasoning capabilities of LLMs with evolutionary algorithms to automatically generate and refine Verilog code. EvoVerilog utilizes a multiobjective, population-based search strategy to explore a wide range of design possibilities without requiring human intervention. Extensive experiments demonstrate that EvoVerilog achieves state-of-the-art performance, with pass@10 scores of 89.1 and 80.2 on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively. Furthermore, the framework showcases its ability to explore diverse designs by simultaneously generating a variety of functional Verilog code while optimizing resource utilization.
zh
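下面给出"LLM生成 + 进化搜索"主循环的示意骨架(其中 llm_propose、simulate、resource_usage 均为占位实现,实际使用时需接入LLM接口与Verilog仿真器):

```python
import random

# 占位实现:仅为让骨架可运行,实际应替换为 LLM 调用与仿真器
def llm_propose(spec, parent=None):
    return f"// candidate for: {spec}" + ("\n// mutated from parent" if parent else "")

def simulate(code):          # 返回测试通过率(此处以随机数代替)
    return random.random()

def resource_usage(code):    # 返回资源占用估计(此处以代码长度代替)
    return len(code)

def evolve_verilog(spec, pop_size=20, generations=10):
    # 初始化:让 LLM 针对同一规格生成多份候选 Verilog
    population = [llm_propose(spec) for _ in range(pop_size)]
    for _ in range(generations):
        # 多目标评估:先比功能通过率,再比资源占用
        scored = sorted(((simulate(c), resource_usage(c), c) for c in population),
                        key=lambda t: (-t[0], t[1]))
        survivors = [c for _, _, c in scored[: pop_size // 2]]
        # 变异:把父代代码拼入提示词,让 LLM 改写出子代
        children = [llm_propose(spec, parent=random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=simulate)
```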
[AI-90] Preliminary suggestions for rigorous GPAI model evaluations WWW
【速读】:该论文旨在解决生成式 AI (Generative AI, GPAI) 评估实践中存在的内部有效性(internal validity)、外部有效性(external validity)和可复现性(reproducibility)不足的问题。其解决方案的关键在于提出一套结构化的评估实践框架,将评估流程划分为设计、实施、执行和文档四个阶段,并融合机器学习、统计学、心理学、经济学及生物学等多个学科的最佳实践,以提升 GPAI 评估的科学性和严谨性,从而支持监管机构、第三方评估者和研究者开展更可靠、透明且具可比性的评估工作。
链接: https://arxiv.org/abs/2508.00875
作者: Patricia Paskov,Michael J. Byun,Kevin Wei,Toby Webster
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Santa Monica, CA: RAND Corporation, 2025. Published as a RAND expert commentary at: this https URL
Abstract:This document presents a preliminary compilation of general-purpose AI (GPAI) evaluation practices that may promote internal validity, external validity and reproducibility. It includes suggestions for human uplift studies and benchmark evaluations, as well as cross-cutting suggestions that may apply to many different evaluation types. Suggestions are organised across four stages in the evaluation life cycle: design, implementation, execution and documentation. Drawing from established practices in machine learning, statistics, psychology, economics, biology and other fields recognised to have important lessons for AI evaluation, these suggestions seek to contribute to the conversation on the nascent and evolving field of the science of GPAI evaluations. The intended audience of this document includes providers of GPAI models presenting systemic risk (GPAISR), for whom the EU AI Act lays out specific evaluation requirements; third-party evaluators; policymakers assessing the rigour of evaluations; and academic researchers developing or conducting GPAI evaluations.
zh
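该文档强调的统计严谨性实践中,为基准得分附上置信区间是代价最低的一条。下面是与具体评估框架无关的非参数bootstrap示例(通用统计写法,并非该文档给出的代码):

```python
import numpy as np

def bootstrap_ci(per_item_scores, n_boot=10_000, alpha=0.05, seed=0):
    """对逐题得分(0/1 或连续)做非参数 bootstrap,返回均值及其置信区间。"""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_item_scores, dtype=float)
    n = len(scores)
    boot_means = rng.choice(scores, size=(n_boot, n), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

mean, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
print(f"accuracy = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```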
[AI-91] End-to-End Audio-Visual Learning for Cochlear Implant Sound Coding in Noisy Environments
【速读】:该论文旨在解决当前人工耳蜗(Cochlear Implant, CI)在噪声或混响环境下语音理解能力不足的问题。其解决方案的关键在于提出了一种新型的端到端噪声抑制人工耳蜗系统 AVSE-ECS,该系统将音频-视觉语音增强(Audio-Visual Speech Enhancement, AVSE)模型作为预处理模块,与基于深度学习的 ElectrodeNet-CS(ECS)声编码策略联合训练,从而利用视觉线索提升复杂声学环境下的语音清晰度。实验结果表明,该方法在噪声条件下显著优于传统 ECS 策略,验证了深度学习融合多模态信息在改进 CI 声码器性能方面的可行性与潜力。
链接: https://arxiv.org/abs/2508.13576
作者: Meng-Ping Lin,Enoch Hsin-Ho Huang,Shao-Yi Chien,Yu Tsao
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Image and Video Processing (eess.IV)
备注: 6 pages, 4 figures
Abstract:The cochlear implant (CI) is a remarkable biomedical device that successfully enables individuals with severe-to-profound hearing loss to perceive sound by converting speech into electrical stimulation signals. Despite advancements in the performance of recent CI systems, speech comprehension in noisy or reverberant conditions remains a challenge. Recent and ongoing developments in deep learning reveal promising opportunities for enhancing CI sound coding capabilities, not only through replicating traditional signal processing methods with neural networks, but also through integrating visual cues as auxiliary data for multimodal speech processing. Therefore, this paper introduces a novel noise-suppressing CI system, AVSE-ECS, which utilizes an audio-visual speech enhancement (AVSE) model as a pre-processing module for the deep-learning-based ElectrodeNet-CS (ECS) sound coding strategy. Specifically, a joint training approach is applied to model AVSE-ECS, an end-to-end CI system. Experimental results indicate that the proposed method outperforms the previous ECS strategy in noisy conditions, with improved objective speech intelligibility scores. The methods and findings in this study demonstrate the feasibility and potential of using deep learning to integrate the AVSE module into an end-to-end CI system.
zh
[AI-92] Physics-Informed Neural Networks for Programmable Origami Metamaterials with Controlled Deployment
【速读】:该论文旨在解决折纸启发结构(origami-inspired structures)在设计过程中面临的挑战,包括复杂的非线性力学行为、多稳态特性以及部署力的精确控制难题。其核心解决方案是提出一种物理信息神经网络(physics-informed neural network, PINN)框架,无需预先收集训练数据即可实现对锥形Kresling折纸(conical Kresling origami, CKO)的正向预测与逆向设计。该框架通过将机械平衡方程直接嵌入学习过程,高精度预测完整的能量景观并减少非物理解 artifacts;同时,逆向设计模块可指定目标稳定态高度及分离能垒,从而实现对整个能量曲线的自由编程。此方法进一步扩展至分层CKO组装体,通过程序化能垒大小实现逐层顺序部署,经有限元仿真与物理原型实验验证,证明了其在复杂机械能景观编程中的鲁棒性和通用性。
链接: https://arxiv.org/abs/2508.13559
作者: Sukheon Kang,Youngkwon Kim,Jinkyu Yang,Seunghwa Ryu
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Origami-inspired structures provide unprecedented opportunities for creating lightweight, deployable systems with programmable mechanical responses. However, their design remains challenging due to complex nonlinear mechanics, multistability, and the need for precise control of deployment forces. Here, we present a physics-informed neural network (PINN) framework for both forward prediction and inverse design of conical Kresling origami (CKO) without requiring pre-collected training data. By embedding mechanical equilibrium equations directly into the learning process, the model predicts complete energy landscapes with high accuracy while minimizing non-physical artifacts. The inverse design routine specifies both target stable-state heights and separating energy barriers, enabling freeform programming of the entire energy curve. This capability is extended to hierarchical CKO assemblies, where sequential layer-by-layer deployment is achieved through programmed barrier magnitudes. Finite element simulations and experiments on physical prototypes validate the designed deployment sequences and barrier ratios, confirming the robustness of the approach. This work establishes a versatile, data-free route for programming complex mechanical energy landscapes in origami-inspired metamaterials, offering broad potential for deployable aerospace systems, morphing structures, and soft robotic actuators.
zh
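"将力学平衡方程嵌入学习过程"可概括为:令网络给出的能量对构型变量的梯度在平衡点处为零。下面是该物理残差损失的极简PyTorch草图(网络结构与变量名为本文假设的简化):

```python
import torch
import torch.nn as nn

# 输入 (高度, 设计参数) -> 标量能量(结构为演示用假设)
energy_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                           nn.Linear(64, 64), nn.Tanh(),
                           nn.Linear(64, 1))

def equilibrium_residual(h_eq: torch.Tensor, design: torch.Tensor):
    """在目标稳定态高度 h_eq 处,力学平衡要求 dE/dh = 0。"""
    h = h_eq.clone().requires_grad_(True)
    E = energy_net(torch.cat([h, design], dim=-1)).sum()
    dE_dh, = torch.autograd.grad(E, h, create_graph=True)
    return (dE_dh ** 2).mean()  # 物理残差项,加入总损失即可反传

loss = equilibrium_residual(torch.tensor([[1.0]]), torch.tensor([[0.5]]))
loss.backward()  # 残差对网络参数可导,无需预先采集训练数据
```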
[AI-93] AlphaX: An AI-Based Value Investing Strategy for the Brazilian Stock Market
【速读】:该论文试图解决的问题是:当前基于人工智能(AI)的自主交易策略在回测(backtesting)中表现优异,但在实际市场部署时性能显著下降,尤其是风险调整后收益恶化,这主要归因于模型存在前瞻偏差(lookahead bias)及其他形式的偏差,导致过度拟合(overfitting)。解决方案的关键在于提出一种受经典价值投资(Value Investing)理念启发的AI策略——AlphaX,通过在计算模拟中严格控制上述偏差,有效降低过拟合风险;实证结果表明,该策略不仅优于巴西主要市场基准,且在统计上显著优于常用的量化技术指标(如相对强弱指数 RSI 和资金流量指数 MFI),从而为构建稳健的AI驱动价值投资框架提供了可行路径。
链接: https://arxiv.org/abs/2508.13429
作者: Paulo André Lima de Castro
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:Autonomous trading strategies have been a subject of research within the field of artificial intelligence (AI) for a considerable period. Various AI techniques have been explored to develop autonomous agents capable of trading financial assets. These approaches encompass traditional methods such as neural networks, fuzzy logic, and reinforcement learning, as well as more recent advancements, including deep neural networks and deep reinforcement learning. Many developers report success in creating strategies that exhibit strong performance during simulations using historical price data, a process commonly referred to as backtesting. However, when these strategies are deployed in real markets, their performance often deteriorates, particularly in terms of risk-adjusted returns. In this study, we propose an AI-based strategy inspired by a classical investment paradigm: Value Investing. Financial AI models are highly susceptible to lookahead bias and other forms of bias that can significantly inflate performance in backtesting compared to live trading conditions. To address this issue, we conducted a series of computational simulations while controlling for these biases, thereby reducing the risk of overfitting. Our results indicate that the proposed approach outperforms major Brazilian market benchmarks. Moreover, the strategy, named AlphaX, demonstrated superior performance relative to widely used technical indicators such as the Relative Strength Index (RSI) and Money Flow Index (MFI), with statistically significant results. Finally, we discuss several open challenges and highlight emerging technologies in qualitative analysis that may contribute to the development of a comprehensive AI-based Value Investing framework in the future.
zh
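摘要中强调的前瞻偏差(lookahead bias)通常可以用一次shift防范:任何特征都必须只依赖决策时刻之前可得的数据。下面是pandas中的对照示例(通用写法,并非AlphaX的实际代码):

```python
import pandas as pd
import numpy as np

prices = pd.Series(np.random.lognormal(0, 0.02, 500).cumprod(),
                   index=pd.date_range("2020-01-01", periods=500))

# 错误写法:信号包含了当日收盘价,而交易决策在当日开盘前做出
bad_signal = prices.rolling(20).mean() < prices

# 正确写法:特征整体 shift(1),保证只用 T-1 日及更早的信息
feat = (prices.rolling(20).mean() < prices).shift(1)
returns = prices.pct_change()                               # T 日收益
strategy_ret = returns * feat.fillna(False).astype(float)   # 信号滞后一天再生效
```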
[AI-94] Utilizing the RAIN method and Graph SAGE Model to Identify Effective Drug Combinations for Gastric Neoplasm Treatment
【速读】:该论文旨在解决胃部肿瘤(尤其是腺癌)因诊断延迟导致的高死亡率问题,以及现有治疗方案在应对疾病异质性、耐药性和疗效不足方面的局限性。其解决方案的关键在于提出一种名为RAIN的整合方法,该方法结合图卷积神经网络(Graph SAGE)构建药物-基因-蛋白交互图模型,并通过p值加权边进行药物组合推荐;随后利用自然语言处理(NLP)与系统性文献回顾(PubMed、Scopus等数据库)验证候选药物,并进一步采用网络荟萃分析(network meta-analysis)评估不同组合的疗效,从而识别出最优药物组合,如奥沙利铂、氟尿嘧啶和曲妥珠单抗的三联疗法,显著提升治疗效果(p值从0.0229降至0.0069)。
链接: https://arxiv.org/abs/2508.13207
作者: S. Z. Pirasteh,Ali A. Kiaei,Mahnaz Bush,Sabra Moghadam,Raha Aghaei,Behnaz Sadeghigol
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 43 pages
Abstract:Background: Gastric neoplasm, primarily adenocarcinoma, is an aggressive cancer with high mortality, often diagnosed late, leading to complications like metastasis. Effective drug combinations are vital to address disease heterogeneity, enhance efficacy, reduce resistance, and improve patient outcomes. Methods: The RAIN method integrates Graph SAGE to propose drug combinations, using a graph model with p-value-weighted edges connecting drugs, genes, and proteins. NLP and a systematic literature review (PubMed, Scopus, etc.) validated the proposed drugs, followed by a network meta-analysis to assess efficacy, implemented in Python. Results: Oxaliplatin, fluorouracil, and trastuzumab were identified as effective, supported by 61 studies. Fluorouracil alone had a p-value of 0.0229, improving to 0.0099 with trastuzumab, and 0.0069 for the triple combination, indicating superior efficacy. Conclusion: The RAIN method, combining AI and network meta-analysis, effectively identifies optimal drug combinations for gastric neoplasm, offering a promising strategy to enhance treatment outcomes and guide health policy.
zh
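RAIN中"以p值为边权连接药物-基因-蛋白"的图可以用networkx简要表达。下面的节点、p值与打分规则均为虚构的演示数据,仅说明建图与按路径权重排序候选药物的思路:

```python
import math
import networkx as nx

G = nx.Graph()
# 边权取 -log10(p):p 值越小,关联越强(数据为虚构示例)
edges = [("oxaliplatin", "TP53", 0.001),
         ("fluorouracil", "TYMS", 0.0005),
         ("trastuzumab", "ERBB2", 1e-6),
         ("ERBB2", "gastric_neoplasm", 1e-4),
         ("TP53", "gastric_neoplasm", 0.002),
         ("TYMS", "gastric_neoplasm", 0.01)]
for u, v, p in edges:
    G.add_edge(u, v, weight=-math.log10(p))

def top_drugs_for(disease, drugs, k=3):
    """按药物到疾病的最大权重路径对候选药物做简化排序。"""
    scores = {}
    for d in drugs:
        paths = nx.all_simple_paths(G, d, disease, cutoff=3)
        scores[d] = max((sum(G[a][b]["weight"] for a, b in zip(p, p[1:]))
                         for p in paths), default=0.0)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_drugs_for("gastric_neoplasm",
                    ["oxaliplatin", "fluorouracil", "trastuzumab"]))
```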
[AI-95] Benchmarking LLM-based Agents for Single-cell Omics Analysis
【速读】:该论文旨在解决单细胞多组学(single-cell omics)数据分析中传统人工定义分析流程的局限性,尤其是在面对日益增长的多模态数据时,缺乏可适应、可执行、可追溯且能实时融合知识的智能分析方法。其解决方案的关键在于提出一个全新的基准评估体系,该体系包含:兼容多种AI代理框架和大语言模型(LLM)的统一平台;涵盖认知程序合成、协作能力、执行效率、生物信息学知识整合及任务完成质量等多维指标;以及50个涵盖不同组学类型、物种和测序技术的真实世界分析任务。通过此系统,研究发现Grok-3-beta在所测试代理框架中表现最优,多代理架构通过角色分工显著提升协作与执行效率,同时识别出高质量代码生成是任务成功的核心因素,自我反思能力对整体性能影响最大,其次为检索增强生成(RAG)和规划能力,从而为构建计算生物学领域鲁棒的AI代理提供了实证基础与最佳实践。
链接: https://arxiv.org/abs/2508.13201
作者: Yang Liu,Lu Zhou,Ruikun He,Rongbo Shen,Yixue Li
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:The surge in multimodal single-cell omics data exposes limitations in traditional, manually defined analysis workflows. AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real-time knowledge fusion. However, the lack of a comprehensive benchmark critically hinders progress. We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single-cell omics analysis. This system comprises: a unified platform compatible with diverse agent frameworks and LLMs; multidimensional metrics assessing cognitive program synthesis, collaboration, execution efficiency, bioinformatics knowledge integration, and task completion quality; and 50 diverse real-world single-cell omics analysis tasks spanning multi-omics, species, and sequencing technologies. Our evaluation reveals that Grok-3-beta achieves state-of-the-art performance among tested agent frameworks. Multi-agent frameworks significantly enhance collaboration and execution efficiency over single-agent approaches through specialized role division. Attribution analyses of agent capabilities identify that high-quality code generation is crucial for task success, and self-reflection has the most significant overall impact, followed by retrieval-augmented generation (RAG) and planning. This work highlights persistent challenges in code generation, long-context handling, and context-aware knowledge retrieval, providing a critical empirical foundation and best practices for developing robust AI agents in computational biology.
zh
[AI-96] The Rise of Generative AI for Metal-Organic Framework Design and Synthesis
【速读】:该论文旨在解决金属有机框架材料(Metal-Organic Frameworks, MOFs)设计与发现过程中效率低下的问题,传统方法依赖人工枚举候选结构,耗时且难以覆盖广阔的设计空间。解决方案的关键在于引入生成式人工智能(Generative AI)技术,通过深度学习模型(如变分自编码器、扩散模型和基于大语言模型的智能体)从日益丰富的MOF数据中学习结构-性能关系,并自主生成具有目标性质的新拓扑结构。这些生成工具可与高通量计算筛选及自动化实验相结合,构建闭环加速发现流程,从而显著提升对高性能MOF材料的探索效率,推动其在清洁空气与能源领域的应用。
链接: https://arxiv.org/abs/2508.13197
作者: Chenru Duan,Aditya Nandy,Shyam Chand Pal,Xin Yang,Wenhao Gao,Yuanqi Du,Hendrik Kraß,Yeonghun Kang,Varinia Bernales,Zuyang Ye,Tristan Pyle,Ray Yang,Zeqi Gu,Philippe Schwaller,Shengqian Ma,Shijing Sun,Alán Aspuru-Guzik,Seyed Mohamad Moosavi,Robert Wexler,Zhiling Zheng
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Advances in generative artificial intelligence are transforming how metal-organic frameworks (MOFs) are designed and discovered. This Perspective introduces the shift from laborious enumeration of MOF candidates to generative approaches that can autonomously propose and synthesize in the laboratory new porous reticular structures on demand. We outline the progress of employing deep learning models, such as variational autoencoders, diffusion models, and large language model-based agents, that are fueled by the growing amount of available data from the MOF community and suggest novel crystalline materials designs. These generative tools can be combined with high-throughput computational screening and even automated experiments to form accelerated, closed-loop discovery pipelines. The result is a new paradigm for reticular chemistry in which AI algorithms more efficiently direct the search for high-performance MOF materials for clean air and energy applications. Finally, we highlight remaining challenges such as synthetic feasibility, dataset diversity, and the need for further integration of domain knowledge.
zh
[AI-97] Preference Models assume Proportional Hazards of Utilities
【速读】:该论文试图解决如何将经典的Plackett-Luce模型与Cox比例风险模型(Cox Proportional Hazards model)建立联系,从而深化对偏好估计中统计假设的理解。其解决方案的关键在于揭示两者之间的数学等价性或映射关系,进而为基于人类标注数据的偏好建模提供新的理论视角,尤其对当前主流的奖励建模(Reward Modelling)和直接偏好优化(Direct Preference Optimization)方法具有潜在的启发意义。
链接: https://arxiv.org/abs/2508.13189
作者: Chirag Nagpal
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Approaches for estimating preferences from human-annotated data typically involve inducing a distribution over a ranked list of choices, such as the Plackett-Luce model. Indeed, modern AI alignment tools such as Reward Modelling and Direct Preference Optimization are based on the statistical assumptions posed by the Plackett-Luce model. In this paper, I will connect the Plackett-Luce model to another classical and well-known statistical model, the Cox Proportional Hazards model, and attempt to shed some light on the implications of the connection therein.
zh
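两者的联系可以用公式直观呈现:Plackett-Luce对一个完整排序的似然,与以效用为对数风险的Cox部分似然在形式上一致(下式为标准结果的简要重写,记号为本文所加):

```latex
% Plackett-Luce:排序 \sigma 的概率按"逐位从剩余项中选出最优"分解
P(\sigma) \;=\; \prod_{k=1}^{K}
    \frac{\exp\!\big(u_{\sigma(k)}\big)}{\sum_{j \ge k} \exp\!\big(u_{\sigma(j)}\big)}
% 这与 Cox 比例风险模型的部分似然同形:把 \sigma(k) 视作第 k 个"事件",
% 分母的 \{\sigma(j) : j \ge k\} 即风险集,u_i 对应对数风险 x_i^\top \beta
\mathcal{L}_{\mathrm{Cox}}(\beta) \;=\; \prod_{k=1}^{K}
    \frac{\exp\!\big(x_{(k)}^\top \beta\big)}
         {\sum_{j \in R(t_k)} \exp\!\big(x_{j}^\top \beta\big)}
```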
机器学习
[LG-0] Learning from Preferences and Mixed Demonstrations in General Settings
链接: https://arxiv.org/abs/2508.14027
作者: Jason R Brown,Carl Henrik Ek,Robert D Mullins
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won’t scale. We develop a new framing for learning from human data, reward-rational partial orderings over observations, designed to be flexible and scalable. Based on this, we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.
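从偏好数据学习奖励函数,最常见的出发点是Bradley-Terry式的成对比较似然。下面用PyTorch给出这一基线形式的极简草图(LEOPARD将其推广到部分序与混合示范,此处不复现完整算法):

```python
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def preference_loss(traj_a, traj_b, a_preferred: torch.Tensor):
    """traj_*: (B, T, 8) 的轨迹张量;a_preferred: (B,) 中 1 表示偏好 a。
    轨迹奖励取各步奖励之和,偏好概率按 Bradley-Terry 建模。"""
    r_a = reward_net(traj_a).sum(dim=(1, 2))   # (B,)
    r_b = reward_net(traj_b).sum(dim=(1, 2))
    logits = r_a - r_b                          # P(a > b) = sigmoid(r_a - r_b)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, a_preferred.float())

loss = preference_loss(torch.randn(4, 10, 8), torch.randn(4, 10, 8),
                       torch.tensor([1, 0, 1, 1]))
opt.zero_grad(); loss.backward(); opt.step()
```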
[LG-1] BLIPs: Bayesian Learned Interatomic Potentials
链接: https://arxiv.org/abs/2508.14022
作者: Dario Coscia,Pim de Haan,Max Welling
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine Learning Interatomic Potentials (MLIPs) are becoming a central tool in simulation-based chemistry. However, like most deep learning models, MLIPs struggle to make accurate predictions on out-of-distribution data or when trained in a data-scarce regime, both common scenarios in simulation-based chemistry. Moreover, MLIPs do not provide uncertainty estimates by construction, which are fundamental to guide active learning pipelines and to ensure the accuracy of simulation results compared to quantum calculations. To address this shortcoming, we propose BLIPs: Bayesian Learned Interatomic Potentials. BLIP is a scalable, architecture-agnostic variational Bayesian framework for training or fine-tuning MLIPs, built on an adaptive version of Variational Dropout. BLIP delivers well-calibrated uncertainty estimates and minimal computational overhead for energy and forces prediction at inference time, while integrating seamlessly with (equivariant) message-passing architectures. Empirical results on simulation-based computational chemistry tasks demonstrate improved predictive accuracy with respect to standard MLIPs, and trustworthy uncertainty estimates, especially in data-scarce or heavy out-of-distribution regimes. Moreover, fine-tuning pretrained MLIPs with BLIP yields consistent performance gains and calibrated uncertainties.
[LG-2] Typed Topological Structures Of Datasets
链接: https://arxiv.org/abs/2508.14008
作者: Wanjun Hu
类目: Machine Learning (cs.LG); General Topology (math.GN)
*备注: 14 pages 5 figures
Abstract:A dataset X on R^2 is a finite topological space. Current research on datasets focuses on statistical methods and the algebraic topological method \cite{carlsson}. In \cite{hu}, the concept of a typed topological space was introduced and shown to have the potential for studying finite topological spaces, such as a dataset. It is a new method from the general topology perspective. A typed topological space is a topological space whose open sets are assigned types. Topological concepts and methods can be redefined using open sets of certain types. In this article, we develop a special set of types and its related typed topology on a dataset X. Using it, we can investigate the inner structure of X. In particular, R^2 has a natural quotient space, in which X is organized into tracks, and each track is split into components. Those components are ordered and can be represented by an integer sequence. Components crossing tracks form branches, and the relationship can be well represented by a type of pseudotree (called a typed-II pseudotree). Such structures provide a platform for new algorithms for problems such as calculating convex hulls, holes, clustering, and anomaly detection.
[LG-3] GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks
链接: https://arxiv.org/abs/2508.14004
作者: Sergey Salishev,Ian Akhremchik
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Numerical Analysis (math.NA)
*备注: 9 pages, 6 figures, 7 tables
Abstract:Quantized neural networks can be viewed as a chain of noisy channels, where rounding in each layer reduces capacity as bit-width shrinks; the floating-point (FP) checkpoint sets the maximum input rate. We track capacity dynamics as the average bit-width decreases and identify resulting quantization bottlenecks by casting fine-tuning as a smooth, constrained optimization problem. Our approach employs a fully differentiable Straight-Through Estimator (STE) with learnable bit-width, noise scale and clamp bounds, and enforces a target bit-width via an exterior-point penalty; mild metric smoothing (via distillation) stabilizes training. Despite its simplicity, the method attains competitive accuracy down to the extreme W1A1 setting while retaining the efficiency of STE.
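论文的核心构件是带可学习参数的直通估计器(STE)。下面是"可学习缩放与比特宽 + round直通梯度"这一通用模式的最小PyTorch示例(外点罚项与蒸馏平滑等论文细节此处省略):

```python
import torch
import torch.nn as nn

class LearnableFakeQuant(nn.Module):
    """对称均匀假量化:scale 与连续化的 bits 可学习,round 的梯度用 STE 直通。"""
    def __init__(self, bits: float = 4.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(0.1))
        self.bits = nn.Parameter(torch.tensor(bits))

    def forward(self, x):
        qmax = 2.0 ** (self.bits - 1) - 1
        x_clamped = torch.minimum(torch.maximum(x / self.scale, -qmax), qmax)
        x_rounded = torch.round(x_clamped)
        # STE:前向用量化值,反向把 round 当作恒等映射
        x_q = x_clamped + (x_rounded - x_clamped).detach()
        return x_q * self.scale

x = torch.randn(8, requires_grad=True)
y = LearnableFakeQuant()(x).sum()
y.backward()  # x 与 scale 均得到梯度;被截断的位置会把梯度传给 bits
```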
[LG-4] Formal Algorithms for Model Efficiency
链接: https://arxiv.org/abs/2508.14000
作者: Naman Tyagi,Srishti Das,Kunal,Vatsal Gupta
类目: Machine Learning (cs.LG)
*备注: 17 pages, 0 figures
Abstract:We introduce the Knob-Meter-Rule (KMR) framework, a unified formalism for representing and reasoning about model efficiency techniques in deep learning. By abstracting diverse methods, including pruning, quantization, knowledge distillation, and parameter-efficient architectures, into a consistent set of controllable knobs, deterministic rules, and measurable meters, KMR provides a mathematically precise and modular perspective on efficiency optimization. The framework enables systematic composition of multiple techniques, flexible policy-driven application, and iterative budgeted optimization through the Budgeted-KMR algorithm. We demonstrate how well-known efficiency methods can be instantiated as KMR triples and present concise algorithmic templates for each. The framework highlights underlying relationships between methods, facilitates hybrid pipelines, and lays the foundation for future research in automated policy learning, dynamic adaptation, and theoretical analysis of cost-quality trade-offs. Overall, KMR offers both a conceptual and practical tool for unifying and advancing model efficiency research.
[LG-5] Multi-User Contextual Cascading Bandits for Personalized Recommendation
链接: https://arxiv.org/abs/2508.13981
作者: Jiho Park,Huiwen Jia
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 35 pages, 5 figures
Abstract:We introduce a Multi-User Contextual Cascading Bandit (MCCB) model, a new combinatorial bandit framework that captures realistic online advertising scenarios where multiple users interact with sequentially displayed items simultaneously. Unlike classical contextual bandits, MCCB integrates three key structural elements: (i) cascading feedback based on sequential arm exposure, (ii) parallel context sessions enabling selective exploration, and (iii) heterogeneous arm-level rewards. We first propose Upper Confidence Bound with Backward Planning (UCBBP), a UCB-style algorithm tailored to this setting, and prove that it achieves a regret bound of $\widetilde{O}(\sqrt{THN})$ over $T$ episodes, $H$ session steps, and $N$ contexts per episode. Motivated by the fact that many users interact with the system simultaneously, we introduce a second algorithm, termed Active Upper Confidence Bound with Backward Planning (AUCBBP), which shows a strict efficiency improvement in context scaling, i.e., user scaling, with a regret bound of $\widetilde{O}(\sqrt{T+HN})$. We validate our theoretical findings via numerical experiments, demonstrating the empirical effectiveness of both algorithms under various settings.
[LG-6] AutoScale: Linear Scalarization Guided by Multi-Task Optimization Metrics
链接: https://arxiv.org/abs/2508.13979
作者: Yi Yang,Kei Ikemura,Qingwen Zhang,Xiaomeng Zhu,Ci Li,Nazre Batool,Sina Sharif Mansouri,John Folkesson
类目: Machine Learning (cs.LG)
*备注: The first two authors hold equal contribution. 10 pages, 6 figures
Abstract:Recent multi-task learning studies suggest that linear scalarization, when using well-chosen fixed task weights, can achieve comparable to or even better performance than complex multi-task optimization (MTO) methods. It remains unclear why certain weights yield optimal performance and how to determine these weights without relying on exhaustive hyperparameter search. This paper establishes a direct connection between linear scalarization and MTO methods, revealing through extensive experiments that well-performing scalarization weights exhibit specific trends in key MTO metrics, such as high gradient magnitude similarity. Building on this insight, we introduce AutoScale, a simple yet effective two-phase framework that uses these MTO metrics to guide weight selection for linear scalarization, without expensive weight search. AutoScale consistently shows superior performance with high efficiency across diverse datasets including a new large-scale benchmark.
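文中用于指导权重选择的关键MTO指标之一是梯度幅值相似度。下面给出其常见定义及计算方式(通用写法;AutoScale的两阶段权重选择逻辑此处不复现):

```python
import torch

def grad_magnitude_similarity(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """两任务梯度的幅值相似度:2|g1||g2| / (|g1|^2 + |g2|^2),取值 (0, 1]。"""
    n1, n2 = g1.norm(), g2.norm()
    return 2 * n1 * n2 / (n1 ** 2 + n2 ** 2 + 1e-12)

def task_gradient(loss, params):
    """对共享参数求单任务梯度并拼接成一维向量。"""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# 用法示意:g_i = task_gradient(loss_i, shared_params)
# sim = grad_magnitude_similarity(g_1, g_2)  # 该值越高,固定权重线性加权越可能有效
```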
[LG-7] Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem
链接: https://arxiv.org/abs/2508.13963
作者: Soumyajit Guin,Shalabh Bhatnagar
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper we propose two algorithms in the tabular setting and an algorithm for the function approximation setting for the Stochastic Shortest Path (SSP) problem. SSP problems form an important class of problems in Reinforcement Learning (RL), as other types of cost-criteria in RL can be formulated in the setting of SSP. We show asymptotic almost-sure convergence for all our algorithms. We observe superior performance of our tabular algorithms compared to other well-known convergent RL algorithms. We further observe reliable performance of our function approximation algorithm compared to other algorithms in the function approximation setting.
[LG-8] How Usable is Automated Feature Engineering for Tabular Data?
链接: https://arxiv.org/abs/2508.13932
作者: Bastian Schäfer,Lennart Purucker,Maciej Janowski,Frank Hutter
类目: Machine Learning (cs.LG)
*备注: Accepted as a short paper at the non-archival content track of AutoML 2025
Abstract:Tabular data, consisting of rows and columns, is omnipresent across various machine learning applications. Each column represents a feature, and features can be combined or transformed to create new, more informative features. Such feature engineering is essential to achieve peak performance in machine learning. Since manual feature engineering is expensive and time-consuming, a substantial effort has been put into automating it. Yet, existing automated feature engineering (AutoFE) methods have never been investigated regarding their usability for practitioners. Thus, we investigated 53 AutoFE methods. We found that these methods are, in general, hard to use, lack documentation, and have no active communities. Furthermore, no method allows users to set time and memory constraints, which we see as a necessity for usable automation. Our survey highlights the need for future work on usable, well-engineered AutoFE methods.
[LG-9] Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management
链接: https://arxiv.org/abs/2508.13905
作者: Tianheng Ling,Vipin Singh,Chao Qian,Felix Biessmann,Gregor Schiele
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, 1 table, accepted by the 11th IEEE International Smart Cities Conference
Abstract:Extreme weather events, intensified by climate change, increasingly challenge aging combined sewer systems, raising the risk of untreated wastewater overflow. Accurate forecasting of sewer overflow basin filling levels can provide actionable insights for early intervention, helping mitigate uncontrolled discharge. In recent years, AI-based forecasting methods have offered scalable alternatives to traditional physics-based models, but their reliance on cloud computing limits their reliability during communication outages. To address this, we propose an end-to-end forecasting framework that enables energy-efficient inference directly on edge devices. Our solution integrates lightweight Transformer and Long Short-Term Memory (LSTM) models, compressed via integer-only quantization for efficient on-device execution. Moreover, an automated hardware-aware deployment pipeline is used to search for optimal model configurations by jointly minimizing prediction error and energy consumption on an AMD Spartan-7 XC7S15 FPGA. Evaluated on real-world sewer data, the selected 8-bit Transformer model, trained on 24 hours of historical measurements, achieves high accuracy (MSE 0.0376) at an energy cost of 0.370 mJ per inference. In contrast, the optimal 8-bit LSTM model requires significantly less energy (0.009 mJ, over 40x lower) but yields 14.89% worse accuracy (MSE 0.0432) and much longer training time. This trade-off highlights the need to align model selection with deployment priorities, favoring LSTM for ultra-low energy consumption or Transformer for higher predictive accuracy. In general, our work enables local, energy-efficient forecasting, contributing to more resilient combined sewer systems. All code can be found in the GitHub Repository (this https URL).
[LG-10] Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
链接: https://arxiv.org/abs/2508.13904
作者: Thanh Nguyen,Chang D. Yoo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The generative power of diffusion models (DMs) has recently enabled high-performing decision-making algorithms in offline reinforcement learning (RL), achieving state-of-the-art results across standard benchmarks. Among them, Diffusion Q-Learning (DQL) stands out as a leading method for its consistently strong performance. Nevertheless, DQL remains limited in practice due to its reliance on multi-step denoising for action generation during both training and inference. Although one-step denoising is desirable, simply applying it to DQL leads to a drastic performance drop. In this work, we revisit DQL and identify its core limitations. We then propose One-Step Flow Q-Learning (OFQL), a novel framework that enables efficient one-step action generation during both training and inference, without requiring auxiliary models, distillation, or multi-phase training. Specifically, OFQL reformulates DQL within the sample-efficient Flow Matching (FM) framework. While conventional FM induces curved generative trajectories that impede one-step generation, OFQL instead learns an average velocity field that facilitates direct, accurate action generation. Collectively, OFQL eliminates the need for multi-step sampling and recursive gradient updates in DQL, resulting in faster and more robust training and inference. Extensive experiments on the D4RL benchmark demonstrate that OFQL outperforms DQL and other diffusion-based baselines, while substantially reducing both training and inference time compared to DQL.
[LG-11] FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks
链接: https://arxiv.org/abs/2508.13853
作者: Nicolò Romandini,Cristian Borcea,Rebecca Montanari,Luca Foschini
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 7 tables
Abstract:Federated Learning (FL) can be vulnerable to attacks, such as model poisoning, where adversaries send malicious local weights to compromise the global model. Federated Unlearning (FU) is emerging as a solution to address such vulnerabilities by selectively removing the influence of detected malicious contributors on the global model without complete retraining. However, unlike typical FU scenarios where clients are trusted and cooperative, applying FU with malicious and possibly colluding clients is challenging because their collaboration in unlearning their data cannot be assumed. This work presents FedUP, a lightweight FU algorithm designed to efficiently mitigate malicious clients’ influence by pruning specific connections within the attacked model. Our approach achieves efficiency by relying only on clients’ weights from the last training round before unlearning to identify which connections to inhibit. Isolating malicious influence is non-trivial due to overlapping updates from benign and malicious clients. FedUP addresses this by carefully selecting and zeroing the highest magnitude weights that diverge the most between the latest updates from benign and malicious clients while preserving benign information. FedUP is evaluated under a strong adversarial threat model, where just under half of the clients (up to 50% - 1) could be malicious and have full knowledge of the aggregation process. We demonstrate the effectiveness, robustness, and efficiency of our solution through experiments across IID and Non-IID data, under label-flipping and backdoor attacks, and by comparing it with state-of-the-art (SOTA) FU solutions. In all scenarios, FedUP reduces malicious influence, lowering accuracy on malicious data to match that of a model retrained from scratch while preserving performance on benign data. FedUP achieves effective unlearning while consistently being faster and saving storage compared to the SOTA.
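FedUP"按良性与恶意平均更新的发散幅度置零连接"的核心操作可用几行numpy勾勒如下(剪枝比例与聚合细节为本文简化假设):

```python
import numpy as np

def fedup_prune(w_global, w_benign_avg, w_malicious_avg, prune_ratio=0.05):
    """将良性/恶意平均权重差异最大的前 prune_ratio 比例的连接置零。"""
    divergence = np.abs(w_malicious_avg - w_benign_avg).ravel()
    k = max(int(prune_ratio * divergence.size), 1)
    idx = np.argpartition(divergence, -k)[-k:]  # 发散度最大的 k 个位置
    w_pruned = w_global.copy().ravel()
    w_pruned[idx] = 0.0  # 置零以抹除恶意影响最集中的连接,保留其余良性信息
    return w_pruned.reshape(w_global.shape)
```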
[LG-12] Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression
链接: https://arxiv.org/abs/2508.13829
作者: Samuel Stocksieker,Denys pommeret,Arthur Charpentier
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms. Although various approaches address this issue, most are tailored to classification problems, with a limited focus on regression. This paper introduces a novel method to improve learning on tabular data within the Imbalanced Regression (IR) framework, which is a critical problem. We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions. However, VAEs can be inefficient with imbalanced data like other standard approaches. To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space. We evaluate the efficiency of this method through numerical comparisons with competitors on benchmark datasets for IR.
[LG-13] Reinforcement Learning-based Adaptive Path Selection for Programmable Networks
链接: https://arxiv.org/abs/2508.13806
作者: José Eduardo Zerna Torres,Marios Avgeris,Chrysa Papagianni,Gergely Pongrácz,István Gódor,Paola Grosso
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work presents a proof-of-concept implementation of a distributed, in-network reinforcement learning (IN-RL) framework for adaptive path selection in programmable networks. By combining Stochastic Learning Automata (SLA) with real-time telemetry data collected via In-Band Network Telemetry (INT), the proposed system enables local, data-driven forwarding decisions that adapt dynamically to congestion conditions. The system is evaluated on a Mininet-based testbed using P4-programmable BMv2 switches, demonstrating how our SLA-based mechanism converges to effective path selections and adapts to shifting network conditions at line rate.
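随机学习自动机(SLA)的概率更新规则本身非常简洁。下面给出面向路径选择的线性奖励-不作为(L_RI)方案的最小实现(参数为常用取值,并非论文实验配置):

```python
import random

class SLAPathSelector:
    """线性奖励-不作为(L_RI):收到奖励时向所选动作集中概率,惩罚时不更新。"""
    def __init__(self, n_paths: int, lr: float = 0.1):
        self.p = [1.0 / n_paths] * n_paths
        self.lr = lr

    def choose(self) -> int:
        return random.choices(range(len(self.p)), weights=self.p)[0]

    def update(self, chosen: int, rewarded: bool):
        if not rewarded:
            return  # L_RI:惩罚时保持概率不变
        for i in range(len(self.p)):
            if i == chosen:
                self.p[i] += self.lr * (1.0 - self.p[i])
            else:
                self.p[i] -= self.lr * self.p[i]

# 用法示意:sel = SLAPathSelector(3); a = sel.choose()
# 由 INT 遥测判断所选路径时延是否低于阈值,作为 rewarded 传回 sel.update(a, rewarded)
```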
[LG-14] Communication-Efficient Federated Learning with Adaptive Number of Participants
链接: https://arxiv.org/abs/2508.13803
作者: Sergey Skorik,Vladislav Dorofeev,Gleb Molodtsov,Aram Avetisyan,Dmitry Bylinkin,Daniil Medyakov,Aleksandr Beznosikov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Rapid scaling of deep learning models has enabled performance gains across domains, yet it introduced several challenges. Federated Learning (FL) has emerged as a promising framework to address these concerns by enabling decentralized training. Nevertheless, communication efficiency remains a key bottleneck in FL, particularly under heterogeneous and dynamic client participation. Existing methods, such as FedAvg and FedProx, or other approaches, including client selection strategies, attempt to mitigate communication costs. However, the problem of choosing the number of clients in a training round remains extremely underexplored. We introduce Intelligent Selection of Participants (ISP), an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency without compromising model accuracy. We validate the effectiveness of ISP across diverse setups, including vision transformers, real-world ECG classification, and training with gradient compression. Our results show consistent communication savings of up to 30% without losing the final quality. Applying ISP to different real-world ECG classification setups highlighted the selection of the number of clients as a separate task of federated learning.
[LG-15] Order Optimal Regret Bounds for Sharpe Ratio Optimization in the Bandit Setting
链接: https://arxiv.org/abs/2508.13749
作者: Mohammad Taha Shah,Sabrina Khurshid,Gourab Ghatak
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:In this paper, we investigate the problem of sequential decision-making for Sharpe ratio (SR) maximization in a stochastic bandit setting. We focus on the Thompson Sampling (TS) algorithm, a Bayesian approach celebrated for its empirical performance and exploration efficiency, under the assumption of Gaussian rewards with unknown parameters. Unlike conventional bandit objectives focusing on maximizing cumulative reward, Sharpe ratio optimization instead introduces an inherent tradeoff between achieving high returns and controlling risk, demanding careful exploration of both mean and variance. Our theoretical contributions include a novel regret decomposition specifically designed for the Sharpe ratio, highlighting the role of information acquisition about the reward distribution in driving learning efficiency. Then, we establish fundamental performance limits for the proposed algorithm SRTS in terms of an upper bound on regret. We also derive the matching lower bound and show the order-optimality. Our results show that Thompson Sampling achieves logarithmic regret over time, with distribution-dependent factors capturing the difficulty of distinguishing arms based on risk-adjusted performance. Empirical simulations show that our algorithm significantly outperforms existing algorithms.
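在均值与方差均未知的高斯奖励设定下,Thompson采样可从每个臂的Normal-Inverse-Gamma后验联合采样 (μ, σ²),再按采样出的夏普比 μ/σ 选臂。下面是这一思想的numpy草图(先验超参数取常见默认值,并非论文设定):

```python
import numpy as np

rng = np.random.default_rng(0)

class SharpeTS:
    """每个臂维护 Normal-Inverse-Gamma 后验 (mu0, kappa, alpha, beta)。"""
    def __init__(self, n_arms):
        self.mu0 = np.zeros(n_arms)
        self.kappa = np.ones(n_arms)
        self.alpha = np.ones(n_arms) * 2.0
        self.beta = np.ones(n_arms)

    def select(self):
        # 先采 sigma^2 ~ InvGamma(alpha, beta),再采 mu ~ N(mu0, sigma^2 / kappa)
        sigma2 = self.beta / rng.gamma(self.alpha)
        mu = rng.normal(self.mu0, np.sqrt(sigma2 / self.kappa))
        return int(np.argmax(mu / np.sqrt(sigma2)))  # 选采样夏普比最大的臂

    def update(self, arm, reward):
        # 单个观测下的标准共轭更新
        k, m = self.kappa[arm], self.mu0[arm]
        self.mu0[arm] = (k * m + reward) / (k + 1)
        self.kappa[arm] += 1
        self.alpha[arm] += 0.5
        self.beta[arm] += 0.5 * k * (reward - m) ** 2 / (k + 1)

bandit = SharpeTS(n_arms=3)
for t in range(1000):
    a = bandit.select()
    bandit.update(a, rng.normal([0.1, 0.2, 0.15][a], 0.5))
```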
[LG-16] DREAMS: Preserving both Local and Global Structure in Dimensionality Reduction
链接: https://arxiv.org/abs/2508.13747
作者: Noël Kury,Dmitry Kobak,Sebastian Damrich
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dimensionality reduction techniques are widely used for visualizing high-dimensional data in two dimensions. Existing methods are typically designed to preserve either local (e.g. t-SNE, UMAP) or global (e.g. MDS, PCA) structure of the data, but none of the established methods can represent both aspects well. In this paper, we present DREAMS (Dimensionality Reduction Enhanced Across Multiple Scales), a method that combines the local structure preservation of t-SNE with the global structure preservation of PCA via a simple regularization term. Our approach generates a spectrum of embeddings between the locally well-structured t-SNE embedding and the globally well-structured PCA embedding, efficiently balancing both local and global structure preservation. We benchmark DREAMS across seven real-world datasets, including five from single-cell transcriptomics and one from population genetics, showcasing qualitatively and quantitatively its superior ability to preserve structure across multiple scales compared to previous approaches.
[LG-17] Trans-XFed: An Explainable Federated Learning for Supply Chain Credit Assessment
链接: https://arxiv.org/abs/2508.13715
作者: Jie Shi,Arno P. J. M. Siebes,Siamak Mehrkanoon
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by FLTA 2025
Abstract:This paper proposes a Trans-XFed architecture that combines federated learning with explainable AI techniques for supply chain credit assessment. The proposed model aims to address several key challenges, including privacy, information silos, class imbalance, non-identically and independently distributed (Non-IID) data, and model interpretability in supply chain credit assessment. We introduce a performance-based client selection strategy (PBCS) to tackle class imbalance and Non-IID problems. This strategy achieves faster convergence by selecting clients with higher local F1 scores. The FedProx architecture, enhanced with homomorphic encryption, is used as the core model, and further incorporates a transformer encoder. The transformer encoder block provides insights into the learned features. Additionally, we employ the integrated gradient explainable AI technique to offer insights into decision-making. We demonstrate the effectiveness of Trans-XFed through experimental evaluations on real-world supply chain datasets. The obtained results show its ability to deliver accurate credit assessments compared to several baselines, while maintaining transparency and privacy.
[LG-18] Minimizing the Weighted Number of Tardy Jobs: Data-Driven Heuristic for Single-Machine Scheduling
链接: https://arxiv.org/abs/2508.13703
作者: Nikolai Antonov,Přemysl Šůcha,Mikoláš Janota,Jan Hůla
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Manuscript submitted for review to Computers & Operations Research
Abstract:Existing research on single-machine scheduling is largely focused on exact algorithms, which perform well on typical instances but can significantly deteriorate on certain regions of the problem space. In contrast, data-driven approaches provide strong and scalable performance when tailored to the structure of specific datasets. Leveraging this idea, we focus on a single-machine scheduling problem where each job is defined by its weight, duration, due date, and deadline, aiming to minimize the total weight of tardy jobs. We introduce a novel data-driven scheduling heuristic that combines machine learning with problem-specific characteristics, ensuring feasible solutions, which is a common challenge for ML-based algorithms. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art in terms of optimality gap, number of optimal solutions, and adaptability across varied data scenarios, highlighting its flexibility for practical applications. In addition, we conduct a systematic exploration of ML models, addressing a common gap in similar studies by offering a detailed model selection process and providing insights into why the chosen model is the best fit.
[LG-19] Know Me by My Pulse: Toward Practical Continuous Authentication on Wearable Devices via Wrist-Worn PPG NDSS
链接: https://arxiv.org/abs/2508.13690
作者: Wei Shao,Zequan Liang,Ruoyu Zhang,Ruijie Fang,Ning Miao,Ehsan Kourkchi,Setareh Rafatirad,Houman Homayoun,Chongzhou Fang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To be published in Network and Distributed System Security (NDSS) Symposium 2026
Abstract:Biometric authentication using physiological signals offers a promising path toward secure and user-friendly access control in wearable devices. While electrocardiogram (ECG) signals have shown high discriminability, their intrusive sensing requirements and discontinuous acquisition limit practicality. Photoplethysmography (PPG), on the other hand, enables continuous, non-intrusive authentication with seamless integration into wrist-worn wearable devices. However, most prior work relies on high-frequency PPG (e.g., 75 - 500 Hz) and complex deep models, which incur significant energy and computational overhead, impeding deployment in power-constrained real-world systems. In this paper, we present the first real-world implementation and evaluation of a continuous authentication system on a smartwatch, We-Be Band, using low-frequency (25 Hz) multi-channel PPG signals. Our method employs a Bi-LSTM with attention mechanism to extract identity-specific features from short (4 s) windows of 4-channel PPG. Through extensive evaluations on both public datasets (PTTPPG) and our We-Be Dataset (26 subjects), we demonstrate strong classification performance with an average test accuracy of 88.11%, macro F1-score of 0.88, False Acceptance Rate (FAR) of 0.48%, False Rejection Rate (FRR) of 11.77%, and Equal Error Rate (EER) of 2.76%. Our 25 Hz system reduces sensor power consumption by 53% compared to 512 Hz and 19% compared to 128 Hz setups without compromising performance. We find that sampling at 25 Hz preserves authentication accuracy, whereas performance drops sharply at 20 Hz while offering only trivial additional power savings, underscoring 25 Hz as the practical lower bound. Additionally, we find that models trained exclusively on resting data fail under motion, while activity-diverse training improves robustness across physiological states.
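For readers who want a concrete picture of the model family described above, here is a hedged PyTorch sketch of a Bi-LSTM with attention over 4 s windows of 4-channel, 25 Hz PPG (100 samples per window). The layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PPGAuthNet(nn.Module):
    """Sketch: Bi-LSTM encodes a PPG window; a learned attention pooling
    collapses timesteps into one vector used to classify the subject."""
    def __init__(self, n_channels=4, hidden=64, n_subjects=26):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)     # per-timestep attention scores
        self.head = nn.Linear(2 * hidden, n_subjects)

    def forward(self, x):                        # x: (batch, 100, 4)
        h, _ = self.lstm(x)                      # (batch, 100, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        ctx = (w * h).sum(dim=1)                 # attention-pooled context vector
        return self.head(ctx)                    # subject logits

model = PPGAuthNet()
logits = model(torch.randn(8, 100, 4))           # 8 windows of 4 s at 25 Hz
```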
[LG-20] Heavy-tailed Linear Bandits: Adversarial Robustness Best-of-both-worlds and Beyond
链接: https://arxiv.org/abs/2508.13679
作者: Canzhe Zhao,Shinji Ito,Shuai Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Heavy-tailed bandits have been extensively studied since the seminal work of \citet{Bubeck2012BanditsWH}. In particular, heavy-tailed linear bandits, enabling efficient learning with both a large number of arms and heavy-tailed noises, have recently attracted significant attention \citep{ShaoYKL18, XueWWZ20, ZhongHYW21, Wang2025heavy, tajdini2025improved}. However, prior studies focus almost exclusively on stochastic regimes, with few exceptions limited to the special case of heavy-tailed multi-armed bandits (MABs) \citep{Huang0H22, ChengZ024, Chen2024uniINF}. In this work, we propose a general framework for adversarial heavy-tailed bandit problems, which performs follow-the-regularized-leader (FTRL) over the loss estimates shifted by a bonus function. Via a delicate setup of the bonus function, we devise the first FTRL-type best-of-both-worlds (BOBW) algorithm for heavy-tailed MABs, which does not require the truncated non-negativity assumption and achieves an $\widetilde{O}(T^{1/\varepsilon})$ worst-case regret in the adversarial regime as well as an $\widetilde{O}(\log T)$ gap-dependent regret in the stochastic regime. We then extend our framework to the linear case, proposing the first algorithm for adversarial heavy-tailed linear bandits with finite arm sets. This algorithm achieves an $\widetilde{O}(d^{1/2} T^{1/\varepsilon})$ regret, matching the best-known worst-case regret bound in stochastic regimes. Moreover, we propose a general data-dependent learning rate, termed \textit{heavy-tailed noise aware stability-penalty matching} (HT-SPM). We prove that HT-SPM guarantees BOBW regret bounds for general heavy-tailed bandit problems once certain conditions are satisfied. By using HT-SPM and, in particular, a variance-reduced linear loss estimator, we obtain the first BOBW result for heavy-tailed linear bandits.
[LG-21] MACTAS: Self-Attention-Based Module for Inter-Agent Communication in Multi-Agent Reinforcement Learning AAAI2026
链接: https://arxiv.org/abs/2508.13661
作者: Maciej Wojtala,Bogusz Stefańczyk,Dominik Bogucki,Łukasz Lepak,Jakub Strykowski,Paweł Wawrzyński
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Submitted for AAAI 2026
Abstract:Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication module that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward-driven manner. The module can be seamlessly integrated with any action-value function decomposition method and can be viewed as an extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents. Experimental results on the SMAC benchmark demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on several maps.
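A minimal sketch of such a self-attention communication block is given below. The residual/LayerNorm wiring and the dimensions are assumptions, but the sketch illustrates the key property named in the abstract: the parameter count is independent of the number of agents.

```python
import torch
import torch.nn as nn

class AttentionComm(nn.Module):
    """Illustrative inter-agent communication block: each agent's hidden
    state attends over all agents' states to produce its incoming message."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h):                  # h: (batch, n_agents, dim)
        msg, _ = self.attn(h, h, h)        # messages via self-attention
        return self.norm(h + msg)          # residual update of agent states

comm = AttentionComm()
out = comm(torch.randn(2, 5, 64))          # works for any number of agents
```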
[LG-22] Personalized Subgraph Federated Learning with Sheaf Collaboration ECAI2025
链接: https://arxiv.org/abs/2508.13642
作者: Wenfei Liang,Yanan Zhao,Rui She,Yiming Li,Wee Peng Tay
类目: Machine Learning (cs.LG)
*备注: Full version of our ECAI 2025 accepted paper
Abstract:Graph-structured data is prevalent in many applications. In subgraph federated learning (FL), this data is distributed across clients, each with a local subgraph. Personalized subgraph FL aims to develop a customized model for each client to handle diverse data distributions. However, performance variation across clients remains a key issue due to the heterogeneity of local subgraphs. To overcome the challenge, we propose FedSheafHN, a novel framework built on a sheaf collaboration mechanism to unify enhanced client descriptors with efficient personalized model generation. Specifically, FedSheafHN embeds each client’s local subgraph into a server-constructed collaboration graph by leveraging graph-level embeddings and employing sheaf diffusion within the collaboration graph to enrich client representations. Subsequently, FedSheafHN generates customized client models via a server-optimized hypernetwork. Empirical evaluations demonstrate that FedSheafHN outperforms existing personalized subgraph FL methods on various graph datasets. Additionally, it exhibits fast model convergence and effectively generalizes to new clients.
[LG-23] Explainable Learning Rate Regimes for Stochastic Optimization
链接: https://arxiv.org/abs/2508.13639
作者: Zhuang Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern machine learning is trained by stochastic gradient descent (SGD), whose performance critically depends on how the learning rate (LR) is adjusted and decreased over time. Yet existing LR regimes may be intricate, or need to tune one or more additional hyper-parameters manually whose bottlenecks include huge computational expenditure, time and power in practice. This work, in a natural and direct manner, clarifies how LR should be updated automatically only according to the intrinsic variation of stochastic gradients. An explainable LR regime by leveraging stochastic second-order algorithms is developed, behaving a similar pattern to heuristic algorithms but implemented simply without any parameter tuning requirement, where it is of an automatic procedure that LR should increase (decrease) as the norm of stochastic gradients decreases (increases). The resulting LR regime shows its efficiency, robustness, and scalability in different classical stochastic algorithms, containing SGD, SGDM, and SIGNSGD, on machine learning tasks.
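The qualitative rule, namely that LR should increase (decrease) as the norm of the stochastic gradients decreases (increases), can be sketched as below. This toy wrapper only illustrates that rule; the paper derives the update from stochastic second-order information, which is not reproduced here, and the reference-norm scaling is my assumption.

```python
import torch

class GradNormLR:
    """Toy sketch: scale the step size inversely with the current stochastic
    gradient norm, relative to the first observed norm."""
    def __init__(self, params, base_lr=0.1, eps=1e-8):
        self.params = list(params)
        self.base_lr, self.eps = base_lr, eps
        self.ref_norm = None

    def step(self):
        g = torch.sqrt(sum((p.grad ** 2).sum() for p in self.params))
        if self.ref_norm is None:
            self.ref_norm = g.detach()          # remember the initial norm
        lr = self.base_lr * float(self.ref_norm / (g + self.eps))
        with torch.no_grad():
            for p in self.params:
                p -= lr * p.grad                # plain SGD step with adapted LR
        return lr
```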
[LG-24] Text2Weight: Bridging Natural Language and Neural Network Weight Spaces ACM-MM2025
链接: https://arxiv.org/abs/2508.13633
作者: Bowen Tian,Wenshuo Chen,Zexi Li,Songning Lai,Jiemin Wu,Yutao Yue
类目: Machine Learning (cs.LG)
*备注: Accepted By ACM MM 2025 Main Track
Abstract:How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on Cifar100, Caltech256, and TinyImageNet demonstrate T2W’s ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on Github.
[LG-25] Towards safe control parameter tuning in distributed multi-agent systems
链接: https://arxiv.org/abs/2508.13608
作者: Abdullah Tokmak,Thomas B. Schön,Dominik Baumann
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted to CDC 2025
Abstract:Many safety-critical real-world problems, such as autonomous driving and collaborative robots, are of a distributed multi-agent nature. To optimize the performance of these systems while ensuring safety, we can cast them as distributed optimization problems, where each agent aims to optimize their parameters to maximize a coupled reward function subject to coupled constraints. Prior work either studies a centralized setting, does not consider safety, or struggles with sample efficiency. Since we require sample efficiency and work with unknown and nonconvex rewards and constraints, we solve this optimization problem using safe Bayesian optimization with Gaussian process regression. Moreover, we consider nearest-neighbor communication between the agents. To capture the behavior of non-neighboring agents, we reformulate the static global optimization problem as a time-varying local optimization problem for each agent, essentially introducing time as a latent variable. To this end, we propose a custom spatio-temporal kernel to integrate prior knowledge. We show the successful deployment of our algorithm in simulations.
[LG-26] Approximate Bayesian Inference via Bitstring Representations UAI2025
链接: https://arxiv.org/abs/2508.13598
作者: Aleksanteri Sladek,Martin Trapp,Arno Solin
类目: Machine Learning (cs.LG)
*备注: Published at Uncertainty in Artificial Intelligence (UAI 2025)
Abstract:The machine learning community has recently put effort into quantized or low-precision arithmetics to scale large models. This paper proposes performing probabilistic inference in the quantized, discrete parameter space created by these representations, effectively enabling us to learn a continuous distribution using discrete parameters. We consider both 2D densities and quantized neural networks, where we introduce a tractable learning approach using probabilistic circuits. This method offers a scalable solution to manage complex distributions and provides clear insights into model behavior. We validate our approach with various models, demonstrating inference efficiency without sacrificing accuracy. This work advances scalable, interpretable machine learning by utilizing discrete approximations for probabilistic computations.
[LG-27] A Generalized Learning Framework for Self-Supervised Contrastive Learning
链接: https://arxiv.org/abs/2508.13596
作者: Lingyu Si,Jingyao Wang,Wenwen Qiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-supervised contrastive learning (SSCL) has recently demonstrated superiority in multiple downstream tasks. In this paper, we generalize the standard SSCL methods to a Generalized Learning Framework (GLF) consisting of two parts: the aligning part and the constraining part. We analyze three existing SSCL methods: BYOL, Barlow Twins, and SwAV, and show that they can be unified under GLF with different choices of the constraining part. We further propose empirical and theoretical analyses providing two insights into designing the constraining part of GLF: intra-class compactness and inter-class separability, which measure how well the feature space preserves the class information of the inputs. However, since SSCL can not use labels, it is challenging to design a constraining part that satisfies these properties. To address this issue, we consider inducing intra-class compactness and inter-class separability by iteratively capturing the dynamic relationship between anchor and other samples and propose a plug-and-play method called Adaptive Distribution Calibration (ADC) to ensure that samples that are near or far from the anchor point in the original input space are closer or further away from the anchor point in the feature space. Both the theoretical analysis and the empirical evaluation demonstrate the superiority of ADC.
[LG-28] Understanding Distribution Structure on Calibrated Recommendation Systems
链接: https://arxiv.org/abs/2508.13568
作者: Diego Correa da Silva,Denis Robson Dantas Boaventura,Mayki dos Santos Oliveira,Eduardo Ferreira da Silva,Joel Machado Pires,Frederico Araújo Durão
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Traditional recommender systems aim to generate a recommendation list comprising the most relevant or similar items to the user’s profile. These approaches can create recommendation lists that omit item genres from the less prominent areas of a user’s profile, thereby undermining the user’s experience. To solve this problem, the calibrated recommendation system provides a guarantee of including less representative areas in the recommended list. The calibrated context works with three distributions. The first is from the user’s profile, the second is from the candidate items, and the last is from the recommendation list. These distributions are G-dimensional, where G is the total number of genres in the system. This high dimensionality requires a different evaluation method, considering that traditional recommenders operate in a one-dimensional data space. In this sense, we implement fifteen models that help to understand how these distributions are structured. We evaluate the users’ patterns in three datasets from the movie domain. The results indicate that the models of outlier detection provide a better understanding of the structures. The calibrated system creates recommendation lists that act similarly to traditional recommendation lists, allowing users to change their groups of preferences to the same degree.
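To make the three-distribution setup concrete, here is a small sketch of computing G-dimensional genre distributions and a KL-based calibration gap between a user profile and a recommendation list. The `item_genres` schema and the KL form (in the spirit of the calibrated-recommendation literature) are assumptions, not this paper's fifteen models.

```python
import numpy as np

def genre_distribution(items, item_genres, G):
    """Distribution over G genres induced by a list of item ids;
    item_genres maps item id -> list of genre indices (hypothetical schema)."""
    p = np.zeros(G)
    for it in items:
        for g in item_genres[it]:
            p[g] += 1.0 / len(item_genres[it])   # split credit across genres
    return p / max(p.sum(), 1e-12)

def calibration_gap(profile, rec_list, item_genres, G, eps=1e-6):
    """KL(profile distribution || recommendation distribution): one common
    way to score how well a list matches the user's genre profile."""
    p = genre_distribution(profile, item_genres, G) + eps
    q = genre_distribution(rec_list, item_genres, G) + eps
    return float(np.sum(p * np.log(p / q)))
```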
[LG-29] Prediction of Hospital Associated Infections During Continuous Hospital Stays
链接: https://arxiv.org/abs/2508.13561
作者: Rituparna Datta,Methun Kamruzzaman,Eili Y. Klein,Gregory R Madden,Xinwei Deng,Anil Vullikanti,Parantapa Bhattacharya
类目: Machine Learning (cs.LG)
*备注:
Abstract:The US Centers for Disease Control and Prevention (CDC), in 2019, designated Methicillin-resistant Staphylococcus aureus (MRSA) as a serious antimicrobial resistance threat. The risk of acquiring MRSA and suffering life-threatening consequences due to it remains especially high for hospitalized patients due to a unique combination of factors, including co-morbid conditions, immunosuppression, antibiotic use, and risk of contact with contaminated hospital workers and equipment. In this paper, we present a novel generative probabilistic model, GenHAI, for modeling sequences of MRSA test result outcomes for patients during a single hospitalization. This model can be used to answer many important questions from the perspectives of hospital administrators for mitigating the risk of MRSA infections. Our model is based on the probabilistic programming paradigm, and can be used to approximately answer a variety of predictive, causal, and counterfactual questions. We demonstrate the efficacy of our model by comparing it against discriminative and generative machine learning models using two real-world datasets.
[LG-30] CALYPSO: Forecasting and Analyzing MRSA Infection Patterns with Community and Healthcare Transmission Dynamics
链接: https://arxiv.org/abs/2508.13548
作者: Rituparna Datta,Jiaming Cui,Gregory R. Madden,Anil Vullikanti
类目: Machine Learning (cs.LG)
*备注:
Abstract:Methicillin-resistant Staphylococcus aureus (MRSA) is a critical public health threat within hospitals as well as long-term care facilities. Better understanding of MRSA risks, evaluation of interventions and forecasting MRSA rates are important public health problems. Existing forecasting models rely on statistical or neural network approaches, which lack epidemiological interpretability, and have limited performance. Mechanistic epidemic models are difficult to calibrate and limited in incorporating diverse datasets. We present CALYPSO, a hybrid framework that integrates neural networks with mechanistic metapopulation models to capture the spread dynamics of infectious diseases (i.e., MRSA) across healthcare and community settings. Our model leverages patient-level insurance claims, commuting data, and healthcare transfer patterns to learn region- and time-specific parameters governing MRSA spread. This enables accurate, interpretable forecasts at multiple spatial resolutions (county, healthcare facility, region, state) and supports counterfactual analyses of infection control policies and outbreak risks. We also show that CALYPSO improves statewide forecasting performance by over 4.5% compared to machine learning baselines, while also identifying high-risk regions and cost-effective strategies for allocating infection prevention resources.
[LG-31] MuFlex: A Scalable Physics-based Platform for Multi-Building Flexibility Analysis and Coordination
链接: https://arxiv.org/abs/2508.13532
作者: Ziyan Wu,Ivan Korolija,Rui Tang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: The platform will be released open-source on GitHub: this https URL once pre-printed
Abstract:With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which lack the ability to fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for benchmarking and testing control strategies for multi-building flexibility coordination, was developed in this study. MuFlex enables synchronous information exchange across EnergyPlus building models and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform capabilities were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic algorithm with carefully fine-tuned hyperparameters. The results show that aggregating the four buildings’ flexibility reduced total peak demand below a specified threshold while maintaining indoor environmental quality.
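A Gym-style multi-building environment of the kind MuFlex exposes might look roughly like the following skeleton (written against the gymnasium package, my stand-in for "the latest OpenAI Gym interface"). The observation/action shapes and the placeholder reward are illustrative assumptions, not MuFlex's actual API.

```python
import gymnasium as gym
import numpy as np

class MultiBuildingEnv(gym.Env):
    """Skeleton of a multi-building flexibility environment; the real
    platform would drive EnergyPlus models instead of sampling noise."""
    def __init__(self, n_buildings=4, obs_per_building=8):
        self.observation_space = gym.spaces.Box(
            -np.inf, np.inf, (n_buildings * obs_per_building,))
        self.action_space = gym.spaces.Box(-1.0, 1.0, (n_buildings,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()     # stand-in for simulator state
        reward = -float(np.abs(action).sum())     # placeholder peak-demand proxy
        return obs, reward, False, False, {}      # gymnasium 5-tuple
```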
[LG-32] Explainability of Algorithms
链接: https://arxiv.org/abs/2508.13529
作者: Andrés Páez
类目: Machine Learning (cs.LG)
*备注:
Abstract:The opaqueness of many complex machine learning algorithms is often mentioned as one of the main obstacles to the ethical development of artificial intelligence (AI). But what does it mean for an algorithm to be opaque? Highly complex algorithms such as artificial neural networks process enormous volumes of data in parallel along multiple hidden layers of interconnected nodes, rendering their inner workings epistemically inaccessible to any human being, including their designers and developers; they are “black boxes” for all their stakeholders. But opaqueness is not always the inevitable result of technical complexity. Sometimes, the way an algorithm works is intentionally hidden from view for proprietary reasons, especially in commercial automated decision systems, creating an entirely different type of opaqueness. In the first part of the chapter, we will examine these two ways of understanding opacity and the ethical implications that stem from each of them. In the second part, we explore the different explanatory methods that have been developed in computer science to overcome an AI system’s technical opaqueness. As the analysis shows, explainable AI (XAI) still faces numerous challenges.
[LG-33] Uncertainty Tube Visualization of Particle Trajectories
链接: https://arxiv.org/abs/2508.13505
作者: Jixian Li,Timbwaoga Aime Judicael Ouermi,Mengjiao Han,Chris R. Johnson
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Predicting particle trajectories with neural networks (NNs) has substantially enhanced many scientific and engineering domains. However, effectively quantifying and visualizing the inherent uncertainty in predictions remains challenging. Without an understanding of the uncertainty, the reliability of NN models in applications where trustworthiness is paramount is significantly compromised. This paper introduces the uncertainty tube, a novel, computationally efficient visualization method designed to represent this uncertainty in NN-derived particle paths. Our key innovation is the design and implementation of a superelliptical tube that accurately captures and intuitively conveys nonsymmetric uncertainty. By integrating well-established uncertainty quantification techniques, such as Deep Ensembles, Monte Carlo Dropout (MC Dropout), and Stochastic Weight Averaging-Gaussian (SWAG), we demonstrate the practical utility of the uncertainty tube, showcasing its application on both synthetic and simulation datasets.
[LG-34] DyMixOp: Guiding Neural Operator Design for PDEs from a Complex Dynamics Perspective with Local-Global-Mixing
链接: https://arxiv.org/abs/2508.13490
作者: Pengyu Lai,Yixiao Chen,Hui Xu
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:
Abstract:A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) is transforming these systems into a suitable format, especially when dealing with non-linearizable dynamics or the need for infinite-dimensional spaces for linearization. This paper introduces DyMixOp, a novel neural operator framework for PDEs that integrates insights from complex dynamical systems to address this challenge. Grounded in inertial manifold theory, DyMixOp transforms infinite-dimensional nonlinear PDE dynamics into a finite-dimensional latent space, establishing a structured foundation that maintains essential nonlinear interactions and enhances physical interpretability. A key innovation is the Local-Global-Mixing (LGM) transformation, inspired by convection dynamics in turbulence. This transformation effectively captures both fine-scale details and nonlinear interactions, while mitigating spectral bias commonly found in existing neural operators. The framework is further strengthened by a dynamics-informed architecture that connects multiple LGM layers to approximate linear and nonlinear dynamics, reflecting the temporal evolution of dynamical systems. Experimental results across diverse PDE benchmarks demonstrate that DyMixOp achieves state-of-the-art performance, reducing prediction errors by up to 86.7%, particularly in convection-dominated scenarios, while maintaining computational efficiency and scalability.
[LG-35] Classifying Clinical Outcome of Epilepsy Patients with Ictal Chirp Embeddings
链接: https://arxiv.org/abs/2508.13476
作者: Nooshin Bahador,Milad Lankarany
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures
Abstract:This study presents a pipeline leveraging t-Distributed Stochastic Neighbor Embedding (t-SNE) for interpretable visualizations of chirp features across diverse outcome scenarios. The dataset comprises chirp-based temporal, spectral, and frequency metrics. Using t-SNE, local neighborhood relationships were preserved while addressing the crowding problem through Student t-distribution-based similarity optimization. Three classification tasks were formulated on the 2D t-SNE embeddings: (1) distinguishing clinical success from failure/no-resection, (2) separating high-difficulty from low-difficulty cases, and (3) identifying optimal cases, defined as successful outcomes with minimal clinical difficulty. Four classifiers, namely, Random Forests, Support Vector Machines, Logistic Regression, and k-Nearest Neighbors, were trained and evaluated using stratified 5-fold cross-validation. Across tasks, the Random Forest and k-NN classifiers demonstrated superior performance, achieving up to 88.8% accuracy in optimal case detection (successful outcomes with minimal clinical difficulty). Additionally, feature influence sensitivity maps were generated using SHAP explanations applied to a model predicting t-SNE coordinates, revealing spatially localized feature importance within the embedding space. These maps highlighted how specific chirp attributes drive regional clustering and class separation, offering insights into the latent structure of the data. The integrated framework showcases the potential of interpretable embeddings and local feature attribution for clinical stratification and decision support.
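A minimal version of the embed-then-classify pipeline could look like this scikit-learn sketch; the feature matrix, labels, and hyperparameters are placeholders, not the study's data.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Illustrative pipeline: embed chirp features to 2D with t-SNE, then
# classify outcomes on the embedding with stratified 5-fold CV.
X = np.random.rand(120, 12)            # placeholder chirp feature matrix
y = np.random.randint(0, 2, 120)       # placeholder binary outcome labels

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), emb, y, cv=cv)
print(scores.mean())
```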
[LG-36] ASAP: Unsupervised Post-training with Label Distribution Shift Adaptive Learning Rate CIKM2025
链接: https://arxiv.org/abs/2508.13445
作者: Heewon Park,Mugon Joe,Miru Kim,Minhae Kwon
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, accepted for ACM CIKM 2025
Abstract:In real-world applications, machine learning models face online label shift, where label distributions change over time. Effective adaptation requires careful learning rate selection: too low slows adaptation and too high causes instability. We propose ASAP (Adaptive Shift Aware Post-training), which dynamically adjusts the learning rate by computing the cosine distance between current and previous unlabeled outputs and mapping it within a bounded range. ASAP requires no labels, model ensembles, or past inputs, using only the previous softmax output for fast, lightweight adaptation. Experiments across multiple datasets and shift scenarios show ASAP consistently improves accuracy and efficiency, making it practical for unsupervised model adaptation.
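The core update is easy to sketch: map the cosine distance between consecutive batches' softmax outputs into a bounded learning-rate range. The linear mapping below is a plausible reading, not necessarily the paper's exact formula.

```python
import numpy as np

def asap_lr(prev_softmax, curr_softmax, lr_min=1e-4, lr_max=1e-1):
    """Sketch of the ASAP idea: a larger shift between consecutive mean
    softmax outputs yields a larger (but bounded) learning rate."""
    p, q = prev_softmax.mean(axis=0), curr_softmax.mean(axis=0)
    cos = float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))
    dist = 1.0 - cos                    # cosine distance; larger = bigger shift
    return lr_min + (lr_max - lr_min) * min(dist, 1.0)

lr = asap_lr(np.random.dirichlet(np.ones(10), 32),
             np.random.dirichlet(np.ones(10), 32))
```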
[LG-37] MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search
链接: https://arxiv.org/abs/2508.13415
作者: Jeremy Carleton,Debajoy Mukherjee,Srinivas Shakkottai,Dileep Kalathil
类目: Machine Learning (cs.LG)
*备注: 20 pages, 6 figures
Abstract:Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives – such as helpfulness, harmlessness, or humor. Aligning outputs to user-specific preferences in such multi-objective settings typically requires fine-tuning models for each objective or preference configuration, which is computationally expensive and inflexible. We introduce MAVIS – Multi-Objective Alignment via Value-Guided Inference-Time Search – a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model’s weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model’s output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that ensures monotonic improvement of the KL-regularized policy. We show empirically that MAVIS outperforms baselines that fine-tune per-objective models and combine them post hoc, and even approaches the performance of the idealized setting where models are fine-tuned for a user’s exact preferences.
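At decoding time, the tilting idea can be sketched as adding a weighted combination of value-model scores to the base model's next-token logits. The shapes and the additive tilting form here are assumptions for illustration, not MAVIS's specification.

```python
import torch

def tilted_next_token_logits(base_logits, value_scores, weights, beta=1.0):
    """Sketch of value-guided decoding: per-token logits are shifted by a
    user-weighted sum of per-objective value scores.
    base_logits: (vocab,), value_scores: (n_objectives, vocab),
    weights: (n_objectives,)."""
    tilt = torch.einsum("o,ov->v", weights, value_scores)
    return base_logits + beta * tilt    # sample/argmax from the tilted logits
```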
[LG-38] Decentralized Contextual Bandits with Network Adaptivity
链接: https://arxiv.org/abs/2508.13411
作者: Chuyun Deng,Huiwen Jia
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 46 Pages, 9 figures
Abstract:We consider contextual linear bandits over networks, a class of sequential decision-making problems where learning occurs simultaneously across multiple locations and the reward distributions share structural similarities while also exhibiting local differences. While classical contextual bandits assume either fully centralized data or entirely isolated learners, much remains unexplored in networked environments when information is partially shared. In this paper, we address this gap by developing two network-aware Upper Confidence Bound (UCB) algorithms, NetLinUCB and Net-SGD-UCB, which enable adaptive information sharing guided by dynamically updated network weights. Our approach decomposes learning into global and local components and as a result allows agents to benefit from shared structure without full synchronization. Both algorithms incur lighter communication costs compared to a fully centralized setting as agents only share computed summaries regarding the homogeneous features. We establish regret bounds showing that our methods reduce the learning complexity associated with the shared structure from $O(N)$ to sublinear $O(\sqrt{N})$, where $N$ is the size of the network. The two algorithms reveal complementary strengths: NetLinUCB excels in low-noise regimes with fine-grained heterogeneity, while Net-SGD-UCB is robust to high-dimensional, high-variance contexts. We further demonstrate the effectiveness of our methods across simulated pricing environments compared to standard benchmarks.
[LG-39] NovoMolGen: Rethinking Molecular Language Model Pretraining
链接: https://arxiv.org/abs/2508.13408
作者: Kamran Chitsaz,Roshan Balaji,Quentin Fournier,Nirav Pravinbhai Bhatt,Sarath Chandar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.
[LG-40] Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task Inference
链接: https://arxiv.org/abs/2508.13380
作者: Seohyeon Cha,Kevin Chan,Gustavo de Veciana,Haris Vikalo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The growing demand for intelligent services on resource-constrained edge devices has spurred the development of collaborative inference systems that distribute workloads across end devices, edge servers, and the cloud. While most existing frameworks focus on single-task, single-model scenarios, many real-world applications (e.g., autonomous driving and augmented reality) require concurrent execution of diverse tasks including detection, segmentation, and depth estimation. In this work, we propose a unified framework to jointly decide which multi-task models to deploy (onload) at clients and edge servers, and how to route queries across the hierarchy (offload) to maximize overall inference accuracy under memory, compute, and communication constraints. We formulate this as a mixed-integer program and introduce J3O (Joint Optimization of Onloading and Offloading), an alternating algorithm that (i) greedily selects models to onload via Lagrangian-relaxed submodular optimization and (ii) determines optimal offloading via constrained linear programming. We further extend J3O to account for batching at the edge, maintaining scalability under heterogeneous task loads. Experiments show J3O consistently achieves over 97% of the optimal accuracy while incurring less than 15% of the runtime required by the optimal solver across multi-task benchmarks.
[LG-41] OrbitChain: Orchestrating In-orbit Real-time Analytics of Earth Observation Data
链接: https://arxiv.org/abs/2508.13374
作者: Zhouyu Li,Zhijing Yang,Huayue Gu,Xiaojian Wang,Yuchen Liu,Ruozhou Yu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: currently under review
Abstract:Earth observation analytics have the potential to serve many time-sensitive applications. However, due to limited bandwidth and duration of ground-satellite connections, it takes hours or even days to download and analyze data from existing Earth observation satellites, making real-time demands like timely disaster response impossible. Toward real-time analytics, we introduce OrbitChain, a collaborative analytics framework that orchestrates computational resources across multiple satellites in an Earth observation constellation. OrbitChain decomposes analytics applications into microservices and allocates computational resources for time-constrained analysis. A traffic routing algorithm is devised to minimize the inter-satellite communication overhead. OrbitChain adopts a pipeline workflow that completes Earth observation tasks in real time, facilitates time-sensitive applications and inter-constellation collaborations such as tip-and-cue. To evaluate OrbitChain, we implement a hardware-in-the-loop orbital computing testbed. Experiments show that our system can complete up to 60% more analytics workload than the existing Earth observation analytics framework while reducing the communication overhead by up to 72%.
[LG-42] A Risk Manager for Intrusion Tolerant Systems: Enhancing HAL 9000 with New Scoring and Data Sources
链接: https://arxiv.org/abs/2508.13364
作者: Tadeu Freitas,Carlos Novo,Inês Dutra,João Soares,Manuel Correia,Benham Shariati,Rolando Martins
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Intrusion Tolerant Systems (ITSs) have become increasingly critical due to the rise of multi-domain adversaries exploiting diverse attack surfaces. ITS architectures aim to tolerate intrusions, ensuring system compromise is prevented or mitigated even with adversary presence. Existing ITS solutions often employ Risk Managers leveraging public security intelligence to adjust system defenses dynamically against emerging threats. However, these approaches rely heavily on databases like NVD and ExploitDB, which require manual analysis for newly discovered vulnerabilities. This dependency limits the system’s responsiveness to rapidly evolving threats. HAL 9000, an ITS Risk Manager introduced in our prior work, addressed these challenges through machine learning. By analyzing descriptions of known vulnerabilities, HAL 9000 predicts and assesses new vulnerabilities automatically. To calculate the risk of a system, it also incorporates the Exploitability Probability Scoring system to estimate the likelihood of exploitation within 30 days, enhancing proactive defense capabilities. Despite its success, HAL 9000’s reliance on NVD and ExploitDB knowledge is a limitation, considering the availability of other sources of information. This extended work introduces a custom-built scraper that continuously mines diverse threat sources, including security advisories, research forums, and real-time exploit proofs-of-concept. This significantly expands HAL 9000’s intelligence base, enabling earlier detection and assessment of unverified vulnerabilities. Our evaluation demonstrates that integrating scraper-derived intelligence with HAL 9000’s risk management framework substantially improves its ability to address emerging threats. This paper details the scraper’s integration into the architecture, its role in providing additional information on new threats, and the effects on HAL 9000’s management.
[LG-43] Adaptive Conformal Prediction Intervals Over Trajectory Ensembles
链接: https://arxiv.org/abs/2508.13362
作者: Ruipu Li,Daniel Menacho,Alexander Rodríguez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Future trajectories play an important role across domains such as autonomous driving, hurricane forecasting, and epidemic modeling, where practitioners commonly generate ensemble paths by sampling probabilistic models or leveraging multiple autoregressive predictors. While these trajectories reflect inherent uncertainty, they are typically uncalibrated. We propose a unified framework based on conformal prediction that transforms sampled trajectories into calibrated prediction intervals with theoretical coverage guarantees. By introducing a novel online update step and an optimization step that captures inter-step dependencies, our method can produce discontinuous prediction intervals around each trajectory, naturally capture temporal dependencies, and yield sharper, more adaptive uncertainty estimates.
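A static split-conformal baseline for per-step trajectory bands, omitting the paper's online update and inter-step optimization, can be sketched as follows; the array shapes are assumptions.

```python
import numpy as np

def conformal_trajectory_band(cal_preds, cal_truth, test_preds, alpha=0.1):
    """Minimal split-conformal sketch for per-step intervals around an
    ensemble-mean trajectory.
    cal_preds: (n_cal, n_samples, T), cal_truth: (n_cal, T),
    test_preds: (n_samples, T)."""
    mu = cal_preds.mean(axis=1)                        # calibration ensemble means
    scores = np.abs(cal_truth - mu)                    # per-step residual scores
    n = scores.shape[0]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, axis=0)             # per-step conformal radius
    center = test_preds.mean(axis=0)
    return center - q, center + q                      # lower/upper band per step
```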
[LG-44] Dimension lower bounds for linear approaches to function approximation WWW
链接: https://arxiv.org/abs/2508.13346
作者: Daniel Hsu
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: First appeared on author’s homepage in August 2021 this https URL
Abstract:This short note presents a linear algebraic approach to proving dimension lower bounds for linear methods that solve $L^2$ function approximation problems. The basic argument has appeared in the literature before (e.g., Barron, 1993) for establishing lower bounds on Kolmogorov $n$-widths. The argument is applied to give sample size lower bounds for kernel methods.
[LG-45] Decoding Communications with Partial Information
链接: https://arxiv.org/abs/2508.13326
作者: Dylan Cope,Peter McBurney
类目: Machine Learning (cs.LG)
*备注: Proceedings of ALIFE 2025
Abstract:Machine language acquisition is often presented as a problem of imitation learning: there exists a community of language users from which a learner observes speech acts and attempts to decode the mappings between utterances and situations. However, an interesting consideration that is typically unaddressed is partial observability, i.e. the learner is assumed to see all relevant information. This paper explores relaxing this assumption, thereby posing a more challenging setting where such information needs to be inferred from knowledge of the environment, the actions taken, and messages sent. We see several motivating examples of this problem, demonstrate how they can be solved in a toy setting, and formally explore challenges that arise in more general settings. A learning-based algorithm is then presented to perform the decoding of private information to facilitate language acquisition.
[LG-46] Efficient Constraint-Aware Flow Matching via Randomized Exploration
链接: https://arxiv.org/abs/2508.13316
作者: Zhengyan Huan,Jacob Boerma,Li-Ping Liu,Shuchin Aeron
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of generating samples via Flow Matching (FM) with an additional requirement that the generated samples must satisfy given constraints. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at this https URL.
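For case (a), the penalized objective can be sketched as the standard conditional flow-matching loss plus a squared distance penalty on an extrapolated endpoint. Here `v_model`, `dist_fn`, the penalty weight `lam`, and the one-step endpoint extrapolation are illustrative assumptions, not the paper's exact construction.

```python
import torch

def penalized_fm_loss(v_model, x0, x1, dist_fn, lam=1.0):
    """Sketch: conditional flow matching along the linear path from x0 to x1,
    plus a penalty on the distance of a predicted endpoint to the constraint
    set. dist_fn maps a batch of points to nonnegative distances."""
    t = torch.rand(x0.shape[0], 1)             # random times in [0, 1)
    xt = (1 - t) * x0 + t * x1                 # linear interpolation path
    target = x1 - x0                           # conditional velocity field
    v = v_model(xt, t)
    fm = ((v - target) ** 2).mean()            # standard FM regression loss
    x1_hat = xt + (1 - t) * v                  # one-step extrapolated endpoint
    return fm + lam * dist_fn(x1_hat).pow(2).mean()
```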
[LG-47] Towards Human-AI Complementarity in Matching Tasks KDD ECML
链接: https://arxiv.org/abs/2508.13285
作者: Adrian Arnaiz-Rodriguez,Nina Corvelo Benz,Suhas Thejaswi,Nuria Oliver,Manuel Gomez-Rodriguez
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Accepted in Workshop on Hybrid Human-Machine Learning and Decision Making at ECML PKDD
Abstract:Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (comatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions for a matching task like existing systems, it selects only the decisions that it is the most confident in, deferring the rest to the human decision maker. In the process, comatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by comatch outperform those generated by either human participants or by algorithmic matching on their own. The data gathered in our human subject study and an implementation of our system are available as open source at this https URL.
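The deferral logic can be sketched in a few lines: keep the algorithm's most confident matching decisions and defer the rest to the human. In the paper the number of kept decisions is optimized; in this sketch it is passed in as a fixed `budget`.

```python
import numpy as np

def comatch_decisions(match_scores, budget):
    """match_scores: (n_decisions, n_options) array of matching scores.
    Returns the algorithm's decisions for its most confident rows and the
    indices deferred to the human decision maker."""
    conf = np.max(match_scores, axis=1)                  # confidence per decision
    keep = np.argsort(-conf)[:budget]                    # most confident decisions
    algo = {int(i): int(np.argmax(match_scores[i])) for i in keep}
    deferred = [i for i in range(len(conf)) if i not in algo]
    return algo, deferred

algo, deferred = comatch_decisions(np.random.rand(10, 4), budget=6)
```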
[LG-48] Physically Plausible Data Augmentations for Wearable IMU-based Human Activity Recognition Using Physics Simulation
链接: https://arxiv.org/abs/2508.13284
作者: Nobuyuki Oishi,Philip Birch,Daniel Roggen,Paula Lago
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures
Abstract:The scarcity of high-quality labeled data in sensor-based Human Activity Recognition (HAR) hinders model performance and limits generalization across real-world scenarios. Data augmentation is a key strategy to mitigate this issue by enhancing the diversity of training datasets. Signal Transformation-based Data Augmentation (STDA) techniques have been widely used in HAR. However, these methods are often physically implausible, potentially resulting in augmented data that fails to preserve the original meaning of the activity labels. In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA leverages human body movement data from motion capture or video-based pose estimation and incorporates various realistic variabilities through physics simulation, including modifying body movements, sensor placements, and hardware-related effects. We compare the performance of PPDAs with traditional STDAs on three public datasets of daily activities and fitness workouts. First, we evaluate each augmentation method individually, directly comparing PPDAs to their STDA counterparts. Next, we assess how combining multiple PPDAs can reduce the need for initial data collection by varying the number of subjects used for training. Experiments show consistent benefits of PPDAs, improving macro F1 scores by an average of 3.7 pp (up to 13 pp) and achieving competitive performance with up to 60% fewer training subjects than STDAs. As the first systematic study of PPDA in sensor-based HAR, these results highlight the advantages of pursuing physical plausibility in data augmentation and the potential of physics simulation for generating synthetic Inertial Measurement Unit data for training deep learning HAR models. This cost-effective and scalable approach therefore helps address the annotation scarcity challenge in HAR.
[LG-49] Data driven feedback linearization of nonlinear control systems via Lie derivatives and stacked regression approach
链接: https://arxiv.org/abs/2508.13241
作者: Lakshmi Priya P. K.,Andreas Schwung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Discovering the governing equations of a physical system and designing an effective feedback controller remain among the most challenging and intensive areas of ongoing research. This task demands a deep understanding of the system behavior, including the nonlinear factors that influence its dynamics. In this article, we propose a novel methodology for identifying a feedback-linearized physical system based on known prior dynamic behavior. First, the system is identified using a sparse regression algorithm; subsequently, a feedback controller is designed for the discovered system by applying Lie derivatives to the dictionary of output functions to derive an augmented constraint that guarantees no internal dynamics are observed. Unlike prior related works, this article combines a stacked regression algorithm with relative degree conditions to discover and feedback-linearize the true governing equations of a physical model.
[LG-50] A Recurrent Neural Network based Clustering Method for Binary Data Sets in Education
链接: https://arxiv.org/abs/2508.13224
作者: Mizuki Ohira,Toshimichi Saito
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:This paper studies an application of a recurrent neural network to a clustering method for the S-P chart: a binary data set used widely in education. As the number of students increases, the S-P chart becomes hard to handle. In order to classify the large chart into smaller charts, we present a simple clustering method based on the network dynamics. In the method, the network has multiple fixed points, and their basins of attraction give clusters corresponding to small S-P charts. In order to evaluate the clustering performance, we present an important feature quantity: the average caution index, which characterizes the singularity of students' answer patterns. Fundamental experiments confirm the effectiveness of the method.
[LG-51] The Course Difficulty Analysis Cookbook
链接: https://arxiv.org/abs/2508.13218
作者: Frederik Baucks,Robin Schmucker,Laurenz Wiskott
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Curriculum analytics (CA) studies curriculum structure and student data to ensure the quality of educational programs. An essential aspect is studying course properties, which involves assigning each course a representative difficulty value. This is critical for several aspects of CA, such as quality control (e.g., monitoring variations over time), course comparisons (e.g., articulation), and course recommendation (e.g., advising). Measuring course difficulty requires careful consideration of multiple factors: First, when difficulty measures are sensitive to the performance level of enrolled students, it can bias interpretations by overlooking student diversity. By assessing difficulty independently of enrolled students’ performances, we can reduce the risk of bias and enable fair, representative assessments of difficulty. Second, from a measurement theoretic perspective, the measurement must be reliable and valid to provide a robust basis for subsequent analyses. Third, difficulty measures should account for covariates, such as the characteristics of individual students within a diverse populations (e.g., transfer status). In recent years, various notions of difficulty have been proposed. This paper provides the first comprehensive review and comparison of existing approaches for assessing course difficulty based on grade point averages and latent trait modeling. It further offers a hands-on tutorial on model selection, assumption checking, and practical CA applications. These applications include monitoring course difficulty over time and detecting courses with disparate outcomes between distinct groups of students (e.g., dropouts vs. graduates), ultimately aiming to promote high-quality, fair, and equitable learning experiences. To support further research and application, we provide an open-source software package and artificial datasets, facilitating reproducibility and adoption.
[LG-52] Strategies for training point distributions in physics-informed neural networks
链接: https://arxiv.org/abs/2508.13216
作者: Santosh Humagain,Toni Schneidereit
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physics-informed neural networks approach the approximation of differential equations by directly incorporating their structure and given conditions in a loss function. This enables conditions, e.g., invariants, to be easily added during the modelling phase. In addition, the approach can be considered mesh-free and can be utilised to compute solutions on arbitrary grids after the training phase. Therefore, physics-informed neural networks are emerging as a promising alternative to solving differential equations with methods from numerical mathematics. However, their performance highly depends on a large variety of factors. In this paper, we systematically investigate and evaluate a core component of the approach, namely the training point distribution. We test two ordinary and two partial differential equations with five strategies for training data generation and shallow network architectures, with one and two hidden layers. In addition to common distributions, we introduce sine-based training points, which are motivated by the construction of Chebyshev nodes. The results are stress-tested with certain parameter combinations, e.g., random and fixed-seed weight initialisation, for reproducibility. The results show the impact of the training point distributions on the solution accuracy, and we find evidence that they are connected to the characteristics of the differential equation.
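The sine-based construction motivated by Chebyshev nodes can be sketched as follows; this is one standard sine form of Chebyshev-like nodes and may differ in detail from the paper's variant.

```python
import numpy as np

def sine_points(n, a=0.0, b=1.0):
    """n training points on [a, b], clustered toward the interval ends
    via a sine construction equivalent to Chebyshev nodes."""
    k = np.arange(1, n + 1)
    x = np.sin(np.pi * (2 * k - n - 1) / (2 * n))   # Chebyshev-like nodes in [-1, 1]
    return a + (b - a) * (x + 1) / 2                # rescale to [a, b]

print(sine_points(7))   # denser near 0 and 1 than a uniform grid
```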
[LG-53] FedChip: Federated LLM for Artificial Intelligence Accelerator Chip Design
链接: https://arxiv.org/abs/2508.13162
作者: Mahmoud Nazzal,Khoa Nguyen,Deepak Vungarala,Ramtin Zand,Shaahin Angizi,Hai Phan,Abdallah Khreishah
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:AI hardware design is advancing rapidly, driven by the promise of design automation to make chip development faster, more efficient, and more accessible to a wide range of users. Amongst automation tools, Large Language Models (LLMs) offer a promising solution by automating and streamlining parts of the design process. However, their potential is hindered by data privacy concerns and the lack of domain-specific training. To address this, we introduce FedChip, a federated fine-tuning approach that enables multiple chip design parties to collaboratively enhance a shared LLM dedicated to automated hardware design generation while protecting proprietary data. FedChip enables parties to train the model on proprietary local data and improve the shared LLM’s performance. To exemplify FedChip’s deployment, we create and release APTPU-Gen, a dataset of 30k design variations spanning various performance metric values such as power, performance, and area (PPA). To encourage the LLM to generate designs that achieve a balance across multiple quality metrics, we propose a new design evaluation metric, Chip@k, which statistically evaluates the quality of generated designs against predefined acceptance criteria. Experimental results show that FedChip improves design quality by more than 77% over high-end LLMs while maintaining data privacy.
[LG-54] Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory
链接: https://arxiv.org/abs/2508.12681
作者: Johann Licher,Max Bartholdt,Henrik Krauss,Tim-Lukas Habich,Thomas Seel,Moritz Schappler
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 20 pages, 15 figures
Abstract:Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot capture the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator’s length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s².
[LG-55] Machine Learning H-theorem
链接: https://arxiv.org/abs/2508.14003
作者: Ruben Lier
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:
Abstract:The H-theorem provides a microscopic foundation for the Second Law of Thermodynamics and is therefore essential to establishing statistical physics; at the same time, it has been subject to controversy that in part persists to this day. To better understand the H-theorem and its relation to the arrow of time, we study the equilibration of randomly oriented and positioned hard disks with periodic boundary conditions. Using the DeepSets architecture, which imposes permutation invariance of the particle labels, we train a model to capture the irreversibility of the H-functional.
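DeepSets is a well-documented architecture; a minimal PyTorch sketch of a permutation-invariant regressor follows. The per-particle feature layout (position and velocity of each hard disk) and the layer sizes are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Permutation-invariant model: phi is applied per particle, the results
    are summed (order-independent), and rho maps the pooled feature to a
    scalar, e.g. an H-functional estimate."""
    def __init__(self, in_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):            # x: (batch, n_particles, in_dim)
        return self.rho(self.phi(x).sum(dim=1))

model = DeepSets(in_dim=4)           # e.g. (x, y, vx, vy) per hard disk
out = model(torch.randn(8, 100, 4))  # output is invariant to particle order
```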
[LG-56] Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models
链接: https://arxiv.org/abs/2508.13990
作者: Daniel Klötzl,Ozan Tastekin,David Hägele,Marina Evers,Daniel Weiskopf
类目: Machine Learning (stat.ML); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures
Abstract:Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
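The key closure property behind such projections is that a Gaussian mixture pushed through a linear map stays a Gaussian mixture: each component N(mu, Sigma) becomes N(W mu, W Sigma W^T) with unchanged weights. The sketch below shows only this projection step; how the UAPCA-style analysis chooses W from the uncertain data is the paper's contribution and is not reproduced here.

```python
import numpy as np

def project_gmm(means, covs, weights, W):
    """Push a GMM through a linear projection W (k x d): each component
    N(mu, Sigma) maps to N(W mu, W Sigma W^T); mixture weights are kept."""
    proj_means = means @ W.T                              # (m, k)
    proj_covs = np.einsum('ij,mjk,lk->mil', W, covs, W)   # (m, k, k)
    return proj_means, proj_covs, weights

# toy example: project a 5-D, 3-component GMM down to 2-D
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 5))
covs = np.stack([np.eye(5) * s for s in (0.5, 1.0, 2.0)])
W = np.linalg.qr(rng.normal(size=(5, 2)))[0].T            # orthonormal 2 x 5
pm, pc, w = project_gmm(means, covs, np.full(3, 1 / 3), W)
```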
[LG-57] A PC Algorithm for Max-Linear Bayesian Networks
链接: https://arxiv.org/abs/2508.13967
作者: Carlos Améndola,Benjamin Hollering,Francesco Nowell
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Combinatorics (math.CO); Statistics Theory (math.ST)
*备注: 24 pages, 7 figures, 1 table
Abstract:Max-linear Bayesian networks (MLBNs) are a relatively recent class of structural equation models which arise when the random variables involved have heavy-tailed distributions. Unlike most directed graphical models, MLBNs are typically not faithful to d-separation, and thus classical causal discovery algorithms such as the PC algorithm or greedy equivalence search cannot be used to accurately recover the true graph structure. In this paper, we begin the study of constraint-based discovery algorithms for MLBNs given an oracle for testing conditional independence in the true, unknown graph. We show that if the oracle is given by the \ast-separation criterion in the true graph, then the PC algorithm remains consistent despite the presence of additional CI statements implied by \ast-separation. We also introduce a new causal discovery algorithm named “PCstar” which assumes faithfulness to C^\ast-separation and is able to orient additional edges which cannot be oriented with only d- or \ast-separation.
[LG-58] Generalisation and benign over-fitting for linear regression onto random functional covariates
链接: https://arxiv.org/abs/2508.13895
作者: Andrew Jones,Nick Whiteley
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the theoretical predictive performance of ridge and ridge-less least-squares regression when covariate vectors arise from evaluating p random, mean-square continuous functions over a latent metric space at n random and unobserved locations, subject to additive noise. This leads us away from the standard assumption of i.i.d. data to a setting in which the n covariate vectors are exchangeable but not independent in general. Under an assumption of independence across dimensions, a 4th-order moment condition, and other regularity conditions, we obtain probabilistic bounds on a notion of predictive excess risk adapted to our random functional covariate setting, making use of recent results of Barzilai and Shamir. We derive convergence rates in regimes where p grows suitably fast relative to n, illustrating the interplay between ingredients of the model in determining convergence behaviour and the role of additive covariate noise in benign overfitting.
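For orientation, the two estimators under study have simple closed forms; a minimal NumPy sketch follows, where the toy data-generating process is ours rather than the paper's functional-covariate model.

```python
import numpy as np

def ridge_fit(X, y, lam: float = 0.0):
    """Closed-form least squares. For lam = 0 and p > n, the pseudo-inverse
    returns the minimum-norm interpolant studied in benign-overfitting
    analyses; for lam > 0 it is the usual ridge estimator."""
    n, p = X.shape
    if lam == 0.0:
        return np.linalg.pinv(X) @ y                       # min-norm solution
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# over-parameterised regime: p larger than n
rng = np.random.default_rng(1)
n, p = 100, 400
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)
beta = ridge_fit(X, y)            # interpolates the training data exactly
print(np.allclose(X @ beta, y))   # True
```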
[LG-59] Online Conformal Selection with Accept-to-Reject Changes
链接: https://arxiv.org/abs/2508.13838
作者: Kangdao Liu,Huajun Xi,Chi-Man Vong,Hongxin Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Selecting a subset of promising candidates from a large pool is crucial across various scientific and real-world applications. Conformal selection offers a distribution-free and model-agnostic framework for candidate selection with uncertainty quantification. While effective in offline settings, its application to online scenarios, where data arrives sequentially, poses challenges. Notably, conformal selection permits the deselection of previously selected candidates, which is incompatible with applications requiring irreversible selection decisions. This limitation is particularly evident in resource-intensive sequential processes, such as drug discovery, where advancing a compound to subsequent stages renders reversal impractical. To address this issue, we extend conformal selection to an online Accept-to-Reject Changes (ARC) procedure: non-selected data points can be reconsidered for selection later, and once a candidate is selected, the decision is irreversible. Specifically, we propose a novel conformal selection method, Online Conformal Selection with Accept-to-Reject Changes (dubbed OCS-ARC), which incorporates the online Benjamini-Hochberg procedure into the candidate selection process. We provide theoretical guarantees that OCS-ARC controls the false discovery rate (FDR) at or below the nominal level at any timestep under both i.i.d. and exchangeable data assumptions. Additionally, we theoretically show that our approach naturally extends to multivariate response settings. Extensive experiments on synthetic and real-world datasets demonstrate that OCS-ARC significantly improves selection power over the baseline while maintaining valid FDR control across all examined timesteps.
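As a reference point, the sketch below shows the standard offline building blocks that OCS-ARC adapts: conformal p-values from a calibration set and the Benjamini-Hochberg selection rule. The online, accept-to-reject-changes variant is the paper's contribution and is not reproduced; the p-value convention here (larger score = more promising) is an assumption.

```python
import numpy as np

def conformal_pvalues(calib_scores, test_scores):
    """Standard conformal p-values: fraction of calibration scores at least
    as large as each test score, with the usual +1 correction."""
    calib = np.asarray(calib_scores)
    n = len(calib)
    return np.array([(1 + np.sum(calib >= s)) / (n + 1) for s in test_scores])

def benjamini_hochberg(pvals, q: float = 0.1):
    """Classic (offline) BH: indices selected at FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * np.arange(1, m + 1) / m
    passed = np.nonzero(np.sort(pvals) <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    return order[: passed[-1] + 1]
```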
[LG-60] Smooth Flow Matching
链接: https://arxiv.org/abs/2508.13831
作者: Jianbin Tan,Anru R. Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 86 pages, 7 figures
Abstract:Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.
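SFM builds on the flow-matching objective. A minimal sketch of the vanilla conditional flow-matching loss with linear probability paths is shown below, where the model v_theta is assumed to take the concatenation of x_t and t; SFM's semiparametric copula flow for infinite-dimensional functional data is not reproduced here.

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_theta: nn.Module, x0: torch.Tensor, x1: torch.Tensor):
    """Conditional flow matching with linear paths x_t = (1 - t) x0 + t x1,
    whose target velocity is x1 - x0."""
    t = torch.rand(x0.shape[0], 1)                 # one time per sample
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = v_theta(torch.cat([xt, t], dim=-1))
    return ((pred - target) ** 2).mean()

# tiny usage example in 2-D (model input dim = 2 + 1 for the time channel)
v = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
loss = flow_matching_loss(v, torch.randn(16, 2), torch.randn(16, 2))
```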
[LG-61] Optimizing Region of Interest Selection for Effective Embedding in Video Steganography Based on Genetic Algorithms
链接: https://arxiv.org/abs/2508.13710
作者: Nizheen A. Ali,Ramadhan J. Mstafa
类目: Image and Video Processing (eess.IV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 19 Pages, 7 Figures, 4 Tables
Abstract:With the widespread use of the internet, there is an increasing need to ensure the security and privacy of transmitted data. This has led to an intensified focus on video steganography, a technique that hides data within a video cover to avoid detection. The effectiveness of any steganography method depends on its ability to embed data without altering the original video quality while maintaining high efficiency. This paper proposes a new method for video steganography that utilizes a Genetic Algorithm (GA) to identify the Region of Interest (ROI) in the cover video, i.e., the area most suitable for data embedding. The secret data is encrypted using the Advanced Encryption Standard (AES), a widely accepted encryption standard, before being embedded into the cover video, utilizing up to 10% of the cover video. This process ensures the security and confidentiality of the embedded data. The performance metrics for assessing the proposed method are the Peak Signal-to-Noise Ratio (PSNR) and the encoding and decoding time. The results show that the proposed method has a high embedding capacity and efficiency, with a PSNR ranging between 64 and 75 dB, which indicates that the embedded data is almost indistinguishable from the original video. Additionally, the method can encode and decode data quickly, making it efficient for real-time applications.
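The PSNR figures quoted above follow the standard definition, PSNR = 10·log10(MAX²/MSE); a small helper for 8-bit frames:

```python
import numpy as np

def psnr(original: np.ndarray, stego: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between a cover frame and a stego frame."""
    mse = np.mean((original.astype(np.float64) - stego.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')   # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```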
[LG-62] Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation
链接: https://arxiv.org/abs/2508.13313
作者: Taos Transue,Bohan Chen,So Takao,Bao Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations. Recent advances in generative modeling have inspired new approaches to DA in high-dimensional nonlinear settings, especially the ensemble score filter (EnSF). However, these come at a significant computational burden due to slow sampling. In this paper, we introduce a new filtering framework based on flow matching (FM) – called the ensemble flow filter (EnFF) – to accelerate sampling and enable flexible design of probability paths. EnFF – a training-free DA approach – integrates Monte Carlo estimators for the marginal FM vector field (VF) with localized guidance to assimilate observations. EnFF offers faster sampling and more flexibility in VF design compared to existing generative modeling for DA. Theoretically, we show that EnFF encompasses classical filtering methods such as the bootstrap particle filter and the ensemble Kalman filter as special cases. Experiments on high-dimensional filtering benchmarks demonstrate improved cost-accuracy tradeoffs and the ability to leverage larger ensembles than prior methods. Our results highlight the promise of FM as a scalable tool for filtering in high-dimensional applications, enabling the use of large ensembles.
[LG-63] Structural Foundations for Leading Digit Laws: Beyond Probabilistic Mixtures
链接: https://arxiv.org/abs/2508.13237
作者: Vladimir Berman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 57 pp, 12 figures
Abstract:This article presents a modern deterministic framework for the study of leading significant digit distributions in numerical data. Rather than relying on traditional probabilistic or mixture-based explanations, we demonstrate that the observed frequencies of leading digits are determined by the underlying arithmetic, algorithmic, and structural properties of the data-generating process. Our approach centers on a shift-invariant functional equation, whose general solution is given by explicit affine-plus-periodic formulas. This structural formulation explains the diversity of digit distributions encountered in both empirical and mathematical datasets, including cases with pronounced deviations from logarithmic or scale-invariant profiles. We systematically analyze digit distributions in finite and infinite datasets, address deterministic sequences such as prime numbers and recurrence relations, and highlight the emergence of block-structured and fractal features. The article provides critical examination of probabilistic models, explicit examples and counterexamples, and discusses limitations and open problems for further research. Overall, this work establishes a unified mathematical foundation for digital phenomena and offers a versatile toolset for modeling and analyzing digit patterns in applied and theoretical contexts.
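As a concrete reference for the logarithmic profile discussed above, the snippet below compares the empirical leading-digit distribution of a classic conforming sequence (powers of 2) with Benford's log10(1 + 1/d) law; the deterministic sequences studied in the paper can be substituted for `powers`.

```python
import numpy as np

def leading_digit_hist(values) -> np.ndarray:
    """Empirical distribution of the leading significant digits 1-9."""
    v = np.abs(np.asarray(values, dtype=float))
    v = v[v > 0]
    digits = (v / 10 ** np.floor(np.log10(v))).astype(int)  # first digit
    return np.bincount(digits, minlength=10)[1:10] / len(v)

benford = np.log10(1 + 1 / np.arange(1, 10))   # the logarithmic profile
powers = 2.0 ** np.arange(1, 1000)             # a classic conforming sequence
print(np.round(leading_digit_hist(powers), 3))
print(np.round(benford, 3))
```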
[LG-64] Modeling GRNs with a Probabilistic Categorical Framework
链接: https://arxiv.org/abs/2508.13208
作者: Yiyang Jia,Zheng Wei,Zheng Yang,Guohong Peng
类目: Molecular Networks (q-bio.MN); Machine Learning (cs.LG); Category Theory (math.CT)
*备注: 21 pages, 5 figures
Abstract:Understanding the complex and stochastic nature of Gene Regulatory Networks (GRNs) remains a central challenge in systems biology. Existing modeling paradigms often struggle to effectively capture the intricate, multi-factor regulatory logic and to rigorously manage the dual uncertainties of network structure and kinetic parameters. In response, this work introduces the Probabilistic Categorical GRN (PC-GRN) framework, a novel theoretical approach founded on the synergistic integration of three core methodologies. First, category theory provides a formal language for the modularity and composition of regulatory pathways. Second, Bayesian Typed Petri Nets (BTPNs) serve as an interpretable, mechanistic substrate for modeling stochastic cellular processes, with kinetic parameters themselves represented as probability distributions. Third, the central innovation of PC-GRN is its end-to-end generative Bayesian inference engine, which learns a full posterior distribution over BTPN models, P(G, \Theta | D), directly from data. This is achieved by the novel interplay of a GFlowNet, which learns a policy to sample network topologies, and a HyperNetwork, which performs amortized inference to predict their corresponding parameter distributions. The resulting framework provides a mathematically rigorous, biologically interpretable, and uncertainty-aware representation of GRNs, advancing predictive modeling and systems-level analysis.
[LG-65] Sex-Specific Vascular Score: A Novel Perfusion Biomarker from Supervoxel Analysis of 3D pCASL MRI
链接: https://arxiv.org/abs/2508.13173
作者: Sneha Noble,Neelam Sinha,Vaanathi Sundareshan,Thomas Gregor Issac
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures
Abstract:We propose a novel framework that leverages 3D pseudo-continuous arterial spin labeling (3D pCASL) MRI to compute sex-specific vascular scores that quantify cerebrovascular health and potential disease susceptibility. The brain is parcellated into spatially contiguous regions of homogeneous perfusion using supervoxel clustering, capturing both microvascular and macrovascular contributions. Mean cerebral blood flow (CBF) values are extracted from 186 cognitively healthy participants and used to train a custom convolutional neural network, achieving 95 percent accuracy in sex classification. This highlights robust, sex-specific perfusion patterns across the brain. Additionally, regional CBF variations and age-related effects are systematically evaluated within male and female cohorts. The proposed vascular risk-scoring framework enhances understanding of normative brain perfusion and aging, and may facilitate early detection and personalized interventions for neurodegenerative diseases such as Alzheimer’s.
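The abstract names supervoxel clustering without specifying the algorithm; a minimal sketch assuming SLIC-style supervoxels (via scikit-image) over a placeholder CBF volume is shown below, with the segment count and compactness chosen arbitrarily.

```python
import numpy as np
from skimage.segmentation import slic

# placeholder 3D pCASL CBF map; real data would be loaded from the MRI volume
cbf = np.random.rand(64, 64, 32)

# SLIC supervoxels over the grayscale volume (channel_axis=None for 3D scalar data)
labels = slic(cbf, n_segments=200, compactness=0.1, channel_axis=None)

# mean CBF per supervoxel: the regional feature vector fed to the classifier
region_means = np.array([cbf[labels == r].mean() for r in np.unique(labels)])
```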
信息检索
[IR-0] rust and Reputation in Data Sharing: A Survey
链接: https://arxiv.org/abs/2508.14028
作者: Wenbo Wu,George Konstantinidis
类目: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:
Abstract:Data sharing is the fuel of the galloping artificial intelligence economy, providing diverse datasets for training robust models. Trust between data providers and data consumers is widely considered one of the most important factors for enabling data sharing initiatives. Concerns about data sensitivity, privacy breaches, and misuse contribute to reluctance in sharing data across various domains. In recent years, there has been a rise in technological and algorithmic solutions to measure, capture and manage trust, trustworthiness, and reputation in what we collectively refer to as Trust and Reputation Management Systems (TRMSs). Such approaches have been developed and applied to different domains of computer science, such as autonomous vehicles or IoT networks, but there have not been dedicated approaches to data sharing and its unique characteristics. In this survey, we examine TRMSs from a data-sharing perspective, analyzing how they assess the trustworthiness of both data and entities across different environments. We develop novel taxonomies for system designs, trust evaluation frameworks, and evaluation metrics for both data and entities, and we systematically analyze the applicability of existing TRMSs in data sharing. Finally, we identify open challenges and propose future research directions to enhance the explainability, comprehensiveness, and accuracy of TRMSs in large-scale data-sharing ecosystems.
[IR-1] Democratizing News Recommenders: Modeling Multiple Perspectives for News Candidate Generation with VQ-VAE
链接: https://arxiv.org/abs/2508.13978
作者: Hardy,Sebastian Padó,Amelie Wührl,Tanise Ceron
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Current News Recommender Systems based on past clicks are designed for engagement, but come at the cost of limiting diversity in the suggested content. While diversity-aware algorithms exist, they suffer from two major limitations. First, they fail to account for normative diversity, which requires fair access to a broad range of perspectives. Second, they typically apply diversity late in the system’s pipeline, after a lot of content has already been filtered out. Both limitations confine their effectiveness and prevent them from promoting true normative diversity in news recommendations. We propose Aspect-Aware Candidate Generation (A2CG) to address these limitations. Our framework introduces diversity into the earliest pipeline stage and uses a configurable mechanism to align diversity with specific democratic goals. A2CG represents each news article using multiple aspects of perspectives (e.g., sentiment, political leaning, frame) and uses a Vector Quantized Variational Autoencoder (VQ-VAE) to create a discrete, multi-faceted representation. A decoder-only model then learns user preferences over these aspect codes. We then inject diversity directly by reversing the sign on some of the query vector’s aspects during the candidate retrieval process, ensuring a more diverse set of candidates. Our method, evaluated on the MIND dataset, enables a flexible trade-off between personalization and diversity early in the recommendation pipeline. It also generates more novel, diverse, and serendipitous candidates while effectively taking into account aspects that strengthen democratic values. These empirical results make it a promising approach for downstream democratized news recommendation systems.
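The sign-reversal step described above is easy to picture in code. A minimal sketch follows, assuming the query vector concatenates fixed-width aspect sub-vectors; the layout, names, and widths are ours, and A2CG derives the aspects from VQ-VAE codes rather than raw slices.

```python
import numpy as np

def diversify_query(query: np.ndarray, aspect_slices: dict, flip: list) -> np.ndarray:
    """Reverse the sign on chosen aspect sub-vectors of the query before
    nearest-neighbour retrieval, so candidates opposing those aspects
    (e.g. the other political leaning) score highly."""
    q = query.copy()
    for name in flip:
        s = aspect_slices[name]
        q[s] = -q[s]
    return q

# hypothetical layout: three aspects of 64 dims each in a 192-d query vector
slices = {"sentiment": slice(0, 64),
          "political_leaning": slice(64, 128),
          "frame": slice(128, 192)}
q = np.random.randn(192)
q_div = diversify_query(q, slices, flip=["political_leaning"])
# retrieve with the modified query, e.g. scores = candidates @ q_div
```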
[IR-2] CARE: Contextual Adaptation of Recommenders for LLM -based Conversational Recommendation
链接: https://arxiv.org/abs/2508.13889
作者: Chuang Li,Yang Deng,Hengchang Hu,See-Kiong Ng,Min-Yen Kan,Haizhou Li
类目: Information Retrieval (cs.IR)
*备注:
Abstract:We tackle the challenge of integrating large language models (LLMs) with external recommender systems to enhance domain expertise in conversational recommendation (CRS). Current LLM-based CRS approaches primarily rely on zero- or few-shot methods for generating item recommendations based on user queries, but this method faces two significant challenges: (1) without domain-specific adaptation, LLMs frequently recommend items not in the target item space, resulting in low recommendation accuracy; and (2) LLMs largely rely on dialogue context for content-based recommendations, neglecting the collaborative relationships among entities or item sequences. To address these limitations, we introduce the CARE (Contextual Adaptation of Recommenders) framework. CARE customizes LLMs for CRS tasks, and synergizes them with external recommendation systems. CARE (a) integrates external recommender systems as domain experts, producing recommendations through entity-level insights, and (b) enhances those recommendations by leveraging contextual information for more accurate and unbiased final recommendations using LLMs. Our results demonstrate that incorporating external recommender systems with entity-level information significantly enhances the recommendation accuracy of LLM-based CRS, by an average of 54% and 25% on the ReDial and INSPIRED datasets, respectively. The most effective strategy in the CARE framework involves LLMs selecting and reranking candidate items that external recommenders provide based on contextual insights. Our analysis indicates that the CARE framework effectively addresses the identified challenges and mitigates the popularity bias in the external recommender.
[IR-3] Bites of Tomorrow: Personalized Recommendations for a Healthier and Greener Plate
链接: https://arxiv.org/abs/2508.13870
作者: Jiazheng Jing,Yinan Zhang,Chunyan Miao
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The recent emergence of extreme climate events has significantly raised awareness about sustainable living. In addition to developing energy-saving materials and technologies, existing research mainly relies on traditional methods that encourage behavioral shifts towards sustainability, which can be overly demanding or only passively engaging. In this work, we propose to employ recommendation systems to actively nudge users toward more sustainable choices. We introduce Green Recommender Aligned with Personalized Eating (GRAPE), which is designed to prioritize and recommend sustainable food options that align with users’ evolving preferences. We also design two innovative Green Loss functions that cater to green indicators with either uniform or differentiated priorities, thereby enhancing adaptability across a range of scenarios. Extensive experiments on a real-world dataset demonstrate the effectiveness of our GRAPE.
[IR-4] Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation ACM-MM2025
链接: https://arxiv.org/abs/2508.13745
作者: Shouxing Ma,Yawen Zeng,Shiqing Wu,Guandong Xu
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted as a full paper at ACM MM 2025
Abstract:Multi-modal recommender systems focus on utilizing rich modal information (i.e., images and textual descriptions) of items to improve recommendation performance. Current methods have achieved remarkable success with the powerful structure-modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography (i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer two main limitations: 1) simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) the lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay. To address the above limitations, we propose REARM, a novel framework for REfining multi-modAl contRastive learning and hoMography relations. Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal-constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. Extensive experiments on three real-world datasets demonstrate the superiority of REARM over various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available at this https URL.
[IR-5] MUFFIN: Mixture of User-Adaptive Frequency Filtering for Sequential Recommendation CIKM2025
链接: https://arxiv.org/abs/2508.13670
作者: Ilwoong Baek,Mincheol Yoon,Seongmin Park,Jongwuk Lee
类目: Information Retrieval (cs.IR)
*备注: Accepted by CIKM 2025
Abstract:Sequential recommendation (SR) aims to predict users’ subsequent interactions by modeling their sequential behaviors. Recent studies have explored frequency-domain analysis, which effectively models periodic patterns in user sequences. However, existing frequency-domain SR models still face two major drawbacks: (i) limited frequency band coverage, often missing critical behavioral patterns in a specific frequency range, and (ii) lack of personalized frequency filtering, as they apply an identical filter to all users regardless of their distinct frequency characteristics. To address these challenges, we propose a novel frequency-domain model, Mixture of User-adaptive Frequency FIlteriNg (MUFFIN), operating through two complementary modules. (i) The global filtering module (GFM) handles the entire frequency spectrum to capture comprehensive behavioral patterns. (ii) The local filtering module (LFM) selectively emphasizes important frequency bands without excluding information from other ranges. In both modules, a user-adaptive filter (UAF) is adopted to generate user-specific frequency filters tailored to individual characteristics. Finally, by aggregating both modules, MUFFIN captures diverse user behavioral patterns across the full frequency spectrum. Extensive experiments show that MUFFIN consistently outperforms state-of-the-art frequency-domain SR models over five benchmark datasets. The source code is available at this https URL.
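A minimal sketch of the generic frequency-domain filtering step such models share is given below: rFFT along the time axis, multiplication by a learnable complex filter, inverse rFFT. MUFFIN's user-adaptive filters are generated per user and mixed across global and local bands; this sketch uses a single shared filter purely for illustration.

```python
import torch
import torch.nn as nn

class LearnableFrequencyFilter(nn.Module):
    """Filter a user's item-embedding sequence in the frequency domain."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # real/imaginary parts of a complex filter, one weight per (freq, dim)
        self.filt = nn.Parameter(torch.randn(n_freq, dim, 2) * 0.02)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        xf = torch.fft.rfft(x, dim=1)          # (batch, n_freq, dim), complex
        w = torch.view_as_complex(self.filt)   # (n_freq, dim)
        return torch.fft.irfft(xf * w, n=x.shape[1], dim=1)

layer = LearnableFrequencyFilter(seq_len=50, dim=64)
y = layer(torch.randn(8, 50, 64))              # filtered sequence, same shape
```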
[IR-6] ENCODE: Breaking the Trade-Off Between Performance and Efficiency in Long-Term User Behavior Modeling
链接: https://arxiv.org/abs/2508.13567
作者: Wenji Zhou,Yuhang Zheng,Yinfu Feng,Yunan Ye,Rong Xiao,Long Chen,Xiaosong Yang,Jun Xiao
类目: Information Retrieval (cs.IR)
*备注: Accepted to TKDE
Abstract:Long-term user behavior sequences are a goldmine for businesses to explore users’ interests to improve Click-Through Rate. However, it is very challenging to accurately capture users’ long-term interests from their long-term behavior sequences and give quick responses from the online serving systems. To meet such requirements, existing methods “inadvertently” destroy two basic requirements in long-term sequence modeling: R1) make full use of the entire sequence to keep the information as much as possible; R2) extract information from the most relevant behaviors to keep high relevance between learned interests and current target items. The performance of online serving systems is significantly affected by incomplete and inaccurate user interest information obtained by existing methods. To this end, we propose an efficient two-stage long-term sequence modeling approach, named as EfficieNt Clustering based twO-stage interest moDEling (ENCODE), consisting of offline extraction stage and online inference stage. It not only meets the aforementioned two basic requirements but also achieves a desirable balance between online service efficiency and precision. Specifically, in the offline extraction stage, ENCODE clusters the entire behavior sequence and extracts accurate interests. To reduce the overhead of the clustering process, we design a metric learning-based dimension reduction algorithm that preserves the relative pairwise distances of behaviors in the new feature space. While in the online inference stage, ENCODE takes the off-the-shelf user interests to predict the associations with target items. Besides, to further ensure the relevance between user interests and target items, we adopt the same relevance metric throughout the whole pipeline of ENCODE. The extensive experiment and comparison with SOTA have demonstrated the effectiveness and efficiency of our proposed ENCODE.
[IR-7] CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
链接: https://arxiv.org/abs/2508.13394
作者: Lam Thanh Do,Linh Van Nguyen,David Fu,Kevin Chen-Chuan Chang
类目: Information Retrieval (cs.IR)
*备注: 11 Pages. Code: this https URL
Abstract:The exponential growth of scientific literature has made it increasingly difficult for researchers to keep up. In an attempt to alleviate this problem, we propose CASPER, a sparse retrieval model for scientific search that utilizes tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space), enabling it to represent queries and documents with research concepts and match them at both granular and conceptual levels. To overcome the lack of suitable training data, we propose mining training data by leveraging scholarly references (i.e., signals that capture how research concepts of papers are expressed in different settings), including titles, citation contexts, author-assigned keyphrases, and co-citations. CASPER outperforms strong dense and sparse retrieval baselines on eight scientific retrieval benchmarks. Moreover, we demonstrate that through simple post-processing, CASPER can be effectively used for keyphrase generation tasks, achieving competitive performance with the established CopyRNN while producing more diverse keyphrases and being nearly four times faster.
[IR-8] FLAIR: Feedback Learning for Adaptive Information Retrieval CIKM2025
链接: https://arxiv.org/abs/2508.13390
作者: William Zhang,Yiwen Zhu,Yunlei Lu,Mathieu Demarne,Wenjing Wang,Kai Deng,Nutan Sahoo,Katherine Lin,Miso Cilimdzic,Subru Krishnan
类目: Information Retrieval (cs.IR)
*备注: Accepted to CIKM2025
Abstract:Recent advances in Large Language Models (LLMs) have driven the adoption of copilots in complex technical scenarios, underscoring the growing need for specialized information retrieval solutions. In this paper, we introduce FLAIR, a lightweight, feedback learning framework that adapts copilot systems’ retrieval strategies by integrating domain-specific expert feedback. FLAIR operates in two stages: an offline phase obtains indicators from (1) user feedback and (2) questions synthesized from documentation, storing these indicators in a decentralized manner. An online phase then employs a two-track ranking mechanism to combine raw similarity scores with the collected indicators. This iterative setup refines retrieval performance for any query. Extensive real-world evaluations of FLAIR demonstrate significant performance gains on both previously seen and unseen queries, surpassing state-of-the-art approaches. The system has been successfully integrated into Copilot DECO, serving thousands of users at Microsoft, demonstrating its scalability and effectiveness in operational environments.
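The abstract does not give the combination rule for the two-track ranking mechanism; one plausible reading is a convex blend of the raw similarity score with the feedback-derived indicator, sketched below with a tunable weight alpha (our assumption, not FLAIR's documented formula).

```python
def two_track_score(similarity: float, indicator: float, alpha: float = 0.7) -> float:
    """Hypothetical two-track ranker: blend the raw embedding similarity with
    a feedback-derived indicator score; alpha trades off the two tracks."""
    return alpha * similarity + (1 - alpha) * indicator

# rank documents by the combined score: (name, raw similarity, indicator)
docs = [("doc_a", 0.82, 0.10), ("doc_b", 0.75, 0.95), ("doc_c", 0.60, 0.20)]
ranked = sorted(docs, key=lambda d: two_track_score(d[1], d[2]), reverse=True)
print([name for name, *_ in ranked])
```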