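This post lists the latest papers retrieved from arXiv.org on 2025-07-31. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-07-31)
384 papers were updated today, including:
- Natural Language Processing: 51 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 120 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 84 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 105 papers (Machine Learning (cs.LG))
Natural Language Processing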
[NLP-0] Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
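[Quick read]: This paper addresses the unstable predictions and fluctuating accuracy that large language models (LLMs) exhibit in in-context learning (ICL) when the position of the demonstrations (demos) changes. It is the first to identify and systematically study the DEMOS’ POSITION IN PROMPT (DPP) bias. The key to its approach is a systematic evaluation pipeline with two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, which measure how demo position affects the correctness and stability of model outputs. Experiments show that placing demos at the start of the prompt yields the most stable and accurate outputs (gains of up to 6 points), whereas placing them at the end of the user message flips over 30% of predictions without improving correctness on QA tasks, giving empirical guidance for structuring ICL prompts.
Link: https://arxiv.org/abs/2507.22887
Authors: Kwesi Cobbina,Tianyi Zhou
Affiliations: University of Maryland, College Park
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: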
Abstract:In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL’s performance can be sensitive to the choices of demos and their order. This paper investigates an unexplored new positional bias of ICL for the first time: we observe that the predictions and accuracy can drift drastically when the positions of demos, the system prompt, and the user message in LLM input are varied. We refer to this bias as DEMOS’ POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, question answering, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by changes in the demos’ position. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30% of predictions without improving correctness on QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.
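The two metrics named in the abstract can be illustrated with a minimal sketch. The definitions below are an assumption based only on the abstract's wording (net accuracy gain versus fraction of flipped predictions); the paper's exact formulas may differ, and the function names and toy data are hypothetical.

```python
def accuracy_change(baseline_preds, moved_preds, gold):
    """Accuracy with demos at the new position minus baseline accuracy (assumed definition)."""
    def accuracy(preds):
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return accuracy(moved_preds) - accuracy(baseline_preds)


def prediction_change(baseline_preds, moved_preds):
    """Fraction of examples whose prediction flips between the two demo positions."""
    flips = sum(b != m for b, m in zip(baseline_preds, moved_preds))
    return flips / len(baseline_preds)


# Toy data: both configurations get 3 of 4 answers right, but on different examples.
gold = ["A", "B", "A", "C"]
baseline = ["A", "B", "C", "C"]
moved = ["A", "C", "A", "C"]

net_gain = accuracy_change(baseline, moved, gold)
volatility = prediction_change(baseline, moved)
```

In this toy case the net accuracy gain is zero while half of the predictions flip, which is why the two quantities are worth reporting separately.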
[NLP-1] RecGPT Technical Report
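[Quick read]: This paper addresses the limitations of industrial recommender systems that rely heavily on historical co-occurrence patterns and log-fitting objectives: they optimize only for past user interactions without explicitly modeling user intent, so they capture users' latent interests poorly, reinforce filter bubbles and long-tail phenomena, harm user experience, and threaten the sustainability of the recommendation ecosystem. The key to the solution is the RecGPT framework, which places user intent at the center of the recommendation pipeline and integrates large language models (LLMs) into user interest mining, item retrieval, and explanation generation, shifting from log-fitting to an intent-centric paradigm. To align general-purpose LLMs with recommendation tasks at scale, RecGPT uses a multi-stage training paradigm that combines reasoning-enhanced pre-alignment with self-training evolution, guided by a Human-LLM cooperative judge system. After deployment on the Taobao App, it delivered consistent gains for users, merchants, and the platform.
Link: https://arxiv.org/abs/2507.22879
Authors: Chao Yi,Dian Chen,Gaoyang Guo,Jiakai Tang,Jian Wu,Jing Yu,Sunhao Dai,Wen Chen,Wenjun Yang,Yuning Jiang,Zhujin Gao,Bo Zheng,Chi Li,Dimin Wang,Dixuan Wang,Fan Li,Fan Zhang,Haibin Chen,Haozhuang Liu,Jialin Zhu,Jiamang Wang,Jiawei Wu,Jin Cui,Ju Huang,Kai Zhang,Kan Liu,Lang Tian,Liang Rao,Longbin Li,Lulu Zhao,Mao Zhang,Na He,Peiyang Wang,Qiqi Huang,Tao Luo,Wenbo Su,Xiaoxiao He,Xin Tong,Xu Chen,Xunke Xi,Yang Li,Yaxuan Wu,Yeqiu Yang,Yi Hu,Yinnan Song,Yuchen Li,Yujie Luo,Yujin Yuan,Yuliang Yan,Zhengyang Wang,Zhibo Xiao,Zhixin Ma,Zile Zhou
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: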
Abstract:Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users’ evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, while merchants and the platform gain greater exposure and conversions. These comprehensive improvements across all stakeholders validate that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.
[NLP-2] GeoOutageKG: A Multimodal Geospatiotemporal Knowledge Graph for Multiresolution Power Outage Analysis ISWC2025
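[Quick read]: This paper addresses insufficient spatiotemporal resolution in detecting, analyzing, and predicting power outages, in particular the difficulty of capturing localized outage patterns during extreme weather events. County-level outage reports offer good temporal granularity but low spatial resolution, while nighttime light (NTL) satellite imagery offers high spatial resolution but is only available daily, so the two are hard to use together. The key to the solution is GeoOutageKG, a multimodal knowledge graph that aligns NTL imagery, high-resolution spatiotemporal outage maps, and county-level time-series outage reports under a unified semantic representation (the GeoOutageOnto ontology), enabling semantic integration and computable modeling of heterogeneous sources and improving spatiotemporal awareness and predictive accuracy for outage events.
Link: https://arxiv.org/abs/2507.22878
Authors: Ethan Frakes,Yinghui Wu,Roger H. French,Mengjie Li
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted to the 24th International Semantic Web Conference Resource Track (ISWC 2025)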
Abstract:Detecting, analyzing, and predicting power outages is crucial for grid risk assessment and disaster mitigation. Numerous outages occur each year, exacerbated by extreme weather events such as hurricanes. Existing outage data are typically reported at the county level, limiting their spatial resolution and making it difficult to capture localized patterns. However, it offers excellent temporal granularity. In contrast, nighttime light satellite image data provides significantly higher spatial resolution and enables a more comprehensive spatial depiction of outages, enhancing the accuracy of assessing the geographic extent and severity of power loss after disaster events. However, these satellite data are only available on a daily basis. Integrating spatiotemporal visual and time-series data sources into a unified knowledge representation can substantially improve power outage detection, analysis, and predictive reasoning. In this paper, we propose GeoOutageKG, a multimodal knowledge graph that integrates diverse data sources, including nighttime light satellite image data, high-resolution spatiotemporal power outage maps, and county-level timeseries outage reports in the U.S. We describe our method for constructing GeoOutageKG by aligning source data with a developed ontology, GeoOutageOnto. Currently, GeoOutageKG includes over 10.6 million individual outage records spanning from 2014 to 2024, 300,000 NTL images spanning from 2012 to 2024, and 15,000 outage maps. GeoOutageKG is a novel, modular and reusable semantic resource that enables robust multimodal data integration. We demonstrate its use through multiresolution analysis of geospatiotemporal power outages.
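[NLP-3] The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
[Quick read]: This paper addresses the insufficient interdisciplinary integration between artificial intelligence (AI) and psychology, focusing on how large language model (LLM) research cites, applies, and potentially misuses psychological theories and methods. The key to its approach is a systematic analysis of 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, together with the 2,544 psychology publications they cite. It identifies the most frequently referenced psychology domains, how theories are operationalized in practice, and common types of misapplication, and it proposes ways to improve interdisciplinary integration, yielding a comprehensive map of engagement between AI and psychology that supports more rigorous design and understanding of AI systems.
Link: https://arxiv.org/abs/2507.22847
Authors: Han Jiang,Pengda Wang,Xiaoyuan Yi,Xing Xie,Ziang Xiao
Affiliations: Johns Hopkins University; Microsoft Research Asia; Rice University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: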
Abstract:Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored. We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems.
[NLP-4] Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization
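[Quick read]: This paper addresses query-focused table summarization, where natural language (NL) plans are ambiguous and lack structure, which makes them hard to convert into executable programs and limits scalability, especially in multi-table settings. The key to the solution is a paradigm shift to structured representations: it introduces a structured plan called TaSoF (Table Summary Plan Format) and a framework, SPaGe, that formalizes reasoning in three phases: 1) Structured Planning, which generates TaSoF from the query; 2) Graph-based Execution, which converts plan steps into SQL and models dependencies with a directed cyclic graph to support parallel execution; and 3) Summary Generation, which produces the query-focused summary. By explicitly capturing complex dependencies, the method improves reliability and scalability and outperforms prior models on multiple public benchmarks.
Link: https://arxiv.org/abs/2507.22829
Authors: Weijia Zhang,Songgaojun Deng,Evangelos Kanoulas
Affiliations: IRLab, University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, and 5 tables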
Abstract:Query-focused table summarization requires complex reasoning, often approached through step-by-step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi-table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph-based Execution to convert plan steps into SQL and model dependencies via a directed cyclic graph for parallel execution, and 3) Summary Generation to produce query-focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings, demonstrating the advantages of structured representations for robust and scalable summarization.
[NLP-5] DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph
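[Quick read]: This paper addresses entity linking for the DBLP knowledge graph (KG), in particular how to map text mentions to KG entities efficiently and accurately now that the 2025 version introduces a new entity type, dblp:Stream (publication venues). In contrast to the earlier supervised approach based on KG embeddings and re-ranking models, it proposes a zero-shot entity linker whose key novelty is to exploit LLM outputs: candidate entities are re-ranked by the log-probabilities of the “yes” token at the penultimate layer of the LLM, so that entity linking works without training data.
Link: https://arxiv.org/abs/2507.22811
Authors: Debayan Banerjee,Tilahun Abedissa Taffa,Ricardo Usbeck
Affiliations: Leuphana University of Lüneburg; University of Hamburg
Categories: Computation and Language (cs.CL)
Comments: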
Abstract:In this work we present an entity linker for DBLP’s 2025 version of RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker using LLMs using a novel method, where we re-rank candidate entities based on the log-probabilities of the “yes” token output at the penultimate layer of the LLM.
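The re-ranking idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the log-probabilities are normalized, so the sketch takes a two-way softmax over hypothetical "yes" and "no" logits that some LLM call is assumed to have produced for each candidate. The entity identifiers and logit values are made up.

```python
import math


def yes_log_prob(yes_logit, no_logit):
    """Log-probability of "yes" under a two-way softmax over {yes, no} (assumed scoring)."""
    m = max(yes_logit, no_logit)
    log_z = m + math.log(math.exp(yes_logit - m) + math.exp(no_logit - m))
    return yes_logit - log_z


def rerank(candidates):
    """candidates: list of (entity_id, yes_logit, no_logit); best candidate first."""
    scored = [(eid, yes_log_prob(y, n)) for eid, y, n in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for one mention, with made-up logits.
candidates = [
    ("candidate-a", 0.1, 1.2),
    ("candidate-b", 2.0, 0.5),
    ("candidate-c", 1.0, 1.0),
]
ranking = rerank(candidates)
```

Because scoring needs only a forward pass per candidate, no linker-specific training data is required, which is what makes the approach zero-shot.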
[NLP-6] MASCA: LLM based-Multi Agents System for Credit Assessment ACL
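[Quick read]: This paper addresses credit assessment, a financial problem that has traditionally relied on rule-based methods and statistical models, where the core challenge is improving both accuracy and fairness. The key to the solution is MASCA, a layered multi-agent system based on large language models (LLMs) that mirrors real-world decision-making: multiple specialized LLM agents collaborate on sub-tasks, and contrastive learning is integrated to optimize risk and reward assessment. The paper also analyzes agent interactions in hierarchical systems from a signaling game theory perspective, providing theoretical grounding for the design, and includes a detailed bias analysis to address fairness. Experiments show that the method outperforms baseline models on credit scoring.
Link: https://arxiv.org/abs/2507.22758
Authors: Gautam Jajoo,Pranjal A Chitale,Saksham Agarwal
Affiliations: Kairos AI; Microsoft Research
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted at ACL REALM Workshop. Work in Progress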
Abstract:Recent advancements in financial problem-solving have leveraged LLMs and agent-based systems, with a primary focus on trading and financial modeling. However, credit assessment remains an underexplored challenge, traditionally dependent on rule-based methods and statistical models. In this paper, we introduce MASCA, an LLM-driven multi-agent system designed to enhance credit evaluation by mirroring real-world decision-making processes. The framework employs a layered architecture where specialized LLM-based agents collaboratively tackle sub-tasks. Additionally, we integrate contrastive learning for risk and reward assessment to optimize decision-making. We further present a signaling game theory perspective on hierarchical multi-agent systems, offering theoretical insights into their structure and interactions. Our paper also includes a detailed bias analysis in credit assessment, addressing fairness concerns. Experimental results demonstrate that MASCA outperforms baseline approaches, highlighting the effectiveness of hierarchical LLM-based multi-agent systems in financial applications, particularly in credit scoring.
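[NLP-7] Opportunities and Challenges of LLMs in Education: An NLP Perspective
[Quick read]: This paper addresses how to understand systematically the role and impact of large language models (LLMs) on natural language processing (NLP) for education. The key to its approach is a structured analysis of two main application scenarios, assistance and assessment, along four dimensions (reading, writing, speaking, and tutoring). It then distills the new research directions enabled by LLMs and the key challenges to address, offering a conceptual framework and practical guidance for future language-focused, NLP-enabled educational applications.
Link: https://arxiv.org/abs/2507.22753
Authors: Sowmya Vajjala,Bashar Alhafni,Stefano Bannò,Kaushal Kumar Maurya,Ekaterina Kochmar
Affiliations: National Research Council, Canada; MBZUAI; University of Cambridge
Categories: Computation and Language (cs.CL)
Comments: Pre-print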
Abstract:Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment. In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios: assistance and assessment, grounding them along the four dimensions – reading, writing, speaking, and tutoring. We then present the new directions enabled by LLMs, and the key challenges to address. We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future.
[NLP-8] CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
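[Quick read]: This paper addresses the weak regional knowledge of current large language models (LLMs) and the poor agreement between automatic evaluation metrics and human judgment in open-ended multimodal question answering. The key to the solution is a regional QA benchmark covering textual and visual modalities, manually curated by native speakers from Czechia, Slovakia, and Ukraine, with English translations. State-of-the-art LLMs are evaluated through prompting and complemented with human judgments, which are used to analyze the reliability of existing automatic metrics. The results reveal a significant gap in regional knowledge among current LLMs and, apart from LLM-based evaluation, minimal correlation between automated metrics and human judgment, underscoring the need for more reliable evaluation.
Link: https://arxiv.org/abs/2507.22752
Authors: Jindřich Libovický,Jindřich Helcl,Andrei Manea,Gianluca Vico
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: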
Abstract:We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.
[NLP-9] Next Tokens Denoising for Speech Synthesis
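[Quick read]: This paper addresses the respective limitations of autoregressive (AR) and diffusion models in text-to-speech (TTS) generation: AR models rely on causal attention, so they cannot exploit future context and generate slowly, while diffusion models struggle to use key-value caching (KV-cache). The key to the solution is Dragon-FM, an architecture that unifies AR modeling with flow-matching (FM). It processes 48 kHz audio codec tokens at a compact rate of 12.5 tokens per second, applies parallel flow-matching within each chunk for fast iterative denoising, and keeps AR modeling across chunks for global coherence. The model can therefore use the KV-cache across chunks and future context within each chunk, combining efficient generation with high-quality output, particularly for long-form content.
Link: https://arxiv.org/abs/2507.22746
Authors: Yanqing Liu,Ruiqing Xue,Chong Zhang,Yufei Liu,Gang Wang,Bohan Li,Yao Qian,Lei He,Shujie Liu,Sheng Zhao
Affiliations: Microsoft
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: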
Abstract:While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact 12.5 tokens per second rate. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can utilize KV-cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also make the proposed model particularly effective for generating extended content. Experiments and demos of our work on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
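To make the stated rates concrete, the arithmetic implied by 48 kHz audio at 12.5 tokens per second can be worked out directly. The chunk size used below is a hypothetical value for illustration only; the abstract does not state how many tokens form a chunk.

```python
SAMPLE_RATE_HZ = 48_000    # from the abstract
TOKENS_PER_SECOND = 12.5   # from the abstract


def samples_per_token():
    """Audio samples represented by one codec token."""
    return SAMPLE_RATE_HZ / TOKENS_PER_SECOND


def tokens_for(duration_seconds):
    """Number of codec tokens needed for a clip of the given length."""
    return int(duration_seconds * TOKENS_PER_SECOND)


def chunks_for(duration_seconds, chunk_size):
    """Number of autoregressive steps if each step emits one chunk (ceiling division)."""
    tokens = tokens_for(duration_seconds)
    return -(-tokens // chunk_size)


per_token = samples_per_token()
one_minute_tokens = tokens_for(60)
one_minute_chunks = chunks_for(60, chunk_size=8)  # chunk_size=8 is an assumption
```

Each token covers 3,840 samples, so a one-minute clip needs only 750 tokens, and the number of sequential AR steps shrinks further by the chunk size. This is the sense in which a low token rate plus chunked generation helps with long-form audio.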
[NLP-10] Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index
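[Quick read]: This paper addresses entity hallucination in abstractive summarization, where the model introduces named entities that are absent from or inaccurate with respect to the source, undermining factual accuracy. The key to the solution is a reward-driven fine-tuning framework that uses the Entity Hallucination Index (EHI) as the reinforcement learning reward signal, directly optimizing the correctness and grounding of entities in generated text without relying on human-written factuality annotations, which makes hallucination-aware training scalable and automatic.
Link: https://arxiv.org/abs/2507.22744
Authors: Praveenkumar Katwe,Rakesh Chandra,Balabantaray Kali,Prasad Vittala
Affiliations: International Institute of Information Technology, Bhubaneshwar, INDIA; Informatica Business Solutions, Bengaluru, INDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8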
Abstract:Reducing hallucinations in abstractive summarization remains a critical challenge for deploying language models (LMs) in real-world settings. In this work, we introduce a reward-driven fine-tuning framework that explicitly optimizes for Entity Hallucination Index (EHI), a metric designed to quantify the presence, correctness, and grounding of named entities in generated summaries. Given a corpus of meeting transcripts, we first generate baseline summaries using a pre-trained LM and compute EHI scores via automatic entity extraction and matching. We then apply reinforcement learning to fine-tune the model parameters, using EHI as a reward signal to bias generation toward entity-faithful outputs. Our approach does not rely on human-written factuality annotations, enabling scalable fine-tuning. Experiments demonstrate consistent improvements in EHI across datasets, with qualitative analysis revealing a significant reduction in entity-level hallucinations without degradation in fluency or informativeness. We release a reproducible Colab pipeline, facilitating further research on hallucination-aware model fine-tuning using lightweight hallucination metrics like EHI.
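The abstract does not give the EHI formula, so the sketch below is not EHI itself. It is a simple stand-in that captures the general idea of an entity-level reward: the fraction of entities in a summary that also appear in the source. Entity extraction is assumed to have happened upstream, and the function name and example entities are hypothetical.

```python
def entity_grounding_score(summary_entities, source_entities):
    """Share of summary entities found in the source (a proxy reward, not the paper's EHI)."""
    summary = {e.lower() for e in summary_entities}
    source = {e.lower() for e in source_entities}
    if not summary:
        # Design choice for this sketch: a summary that names no entities
        # cannot hallucinate one, so it receives the maximum score.
        return 1.0
    return len(summary & source) / len(summary)


# "Paris" does not occur in the transcript entities, so it counts as ungrounded.
reward = entity_grounding_score(
    summary_entities=["Alice", "Bob", "Paris"],
    source_entities=["alice", "bob", "London"],
)
```

A scalar like this can be computed automatically for every sampled summary, which is what allows it to serve as a reinforcement learning reward without human factuality labels.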
[NLP-11] Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning
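[Quick read]: This paper addresses the loss of crucial information when the rich token-level semantic representations of pre-trained decoder-only large language models (LLMs) are pooled into sentence- or document-level text embeddings, which hurts non-generative downstream tasks such as clustering, classification, and retrieval. The key to the solution is to combine three strategies: (i) several aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) lightweight contrastive fine-tuning. Contrastive fine-tuning on synthetically generated positive pairs improves embedding quality and shifts attention toward semantically relevant words, yielding more effective compression of meaning.
Link: https://arxiv.org/abs/2507.22729
Authors: Benedikt Roth,Stephan Rappensperger,Tianming Qiu,Hamza Imamović,Julian Wörmann,Hao Shen
Affiliations: fortiss GmbH; School of Computation and Information Technology, Technical University of Munich
Categories: Computation and Language (cs.CL)
Comments: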
Abstract:Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling these vectors into a text embedding discards crucial information. Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings. We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning. Combining these components yields state-of-the-art performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB). An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs.
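Two of the ingredients mentioned in the abstract, token aggregation and contrastive fine-tuning, can be sketched in a few lines. The pooling variants shown are common choices rather than a list taken from the paper, and the loss is the widely used InfoNCE form, which is an assumption here since the abstract does not name its objective. The vectors and temperature are toy values.

```python
import math


def mean_pool(token_vectors):
    """Average the token embeddings dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def last_token_pool(token_vectors):
    """Use the final token's hidden state as the text embedding."""
    return list(token_vectors[-1])


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss: low when the anchor is closer to its positive than to the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = mean_pool(tokens)
loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

Minimizing such a loss over (text, paraphrase) pairs pulls the pooled embeddings of related texts together, which is the role the synthetic positive pairs play in the paper's setup.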
[NLP-12] Investigating Hallucination in Conversations for Low Resource Languages
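[Quick read]: This paper addresses hallucination in large language models (LLMs), that is, generated statements that are factually incorrect, which undermines reliability and usefulness. Using a multilingual conversational dataset, it systematically evaluates GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3 in Mandarin, Hindi, and Farsi, and finds very few hallucinations in Mandarin but significantly more in Hindi and Farsi. The key to its approach is cross-lingual comparative analysis, which reveals how hallucination patterns differ across languages and provides empirical grounding for targeted improvements.
Link: https://arxiv.org/abs/2507.22720
Authors: Amit Das,Md. Najib Hasan,Souvika Sarkar,Zheng Zhang,Fatemeh Jamshidi,Tathagata Bhattacharya,Nilanjana Raychawdhury,Dongji Feng,Vinija Jain,Aman Chadha
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: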
Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as ‘hallucination’. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
[NLP-13] From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs
[Quick read]: This paper addresses a limitation of current reinforcement learning-based retrieval-augmented generation (RAG) methods: they rely only on final-answer rewards and overlook the quality of intermediate reasoning, which leads to three typical failure patterns: information insufficiency, faulty reasoning, and answer-reasoning inconsistency. The key to the solution is the TIRESRAG-R1 framework, which introduces a think-retrieve-reflect process and a multi-dimensional reward system: a sufficiency reward to encourage thorough retrieval, a reasoning quality reward to assess the rationality and accuracy of the reasoning chain, and a reflection reward to detect and revise errors. A difficulty-aware reweighting strategy and training sample filtering further improve performance on complex tasks.
Link: https://arxiv.org/abs/2507.22716
Authors: Jie He,Victor Gutierrez Basulto,Jeff Z. Pan
Affiliations: University of Edinburgh; Cardiff University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: this https URL.
[NLP-14] Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment ACM-MM2025
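[Quick read]: This paper addresses the one-sidedness and subjectivity of traditional interview assessment, which stem from relying on a single modality, with the aim of more holistic and fair evaluation. The key to the solution is the “365” framework (three modalities × six responses × five evaluation dimensions), which integrates video, audio, and text. Modality-specific feature extractors encode the heterogeneous data streams, and a Shared Compression Multilayer Perceptron maps the multimodal embeddings into a unified latent space to facilitate efficient feature interaction. A two-level ensemble learning strategy then has independent regression heads predict scores for each response and aggregates them across responses by mean pooling, producing final scores for the five key dimensions and improving prediction robustness.
Link: https://arxiv.org/abs/2507.22676
Authors: Jia Li,Yang Wang,Wenhao Qian,Zhenzhen Hu,Richang Hong,Meng Wang
Affiliations: Hefei University of Technology
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 8 pages, 4 figures, ACM MM 2025. GitHub: this https URL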
Abstract:Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores “365” aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at this https URL.
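The second level of the ensemble, mean pooling of per-response predictions followed by an MSE evaluation, can be sketched as below. The regression heads themselves are assumed to exist upstream; the toy example uses two responses and two dimensions instead of six and five, and all numbers are made up.

```python
def aggregate_scores(per_response_scores):
    """Mean-pool predictions across responses: one final score per evaluation dimension."""
    n = len(per_response_scores)
    return [sum(dimension) / n for dimension in zip(*per_response_scores)]


def mse(predicted, target):
    """Mean squared error averaged over the evaluation dimensions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Rows are responses, columns are evaluation dimensions (hypothetical values).
per_response = [
    [1.0, 2.0],
    [3.0, 4.0],
]
final_scores = aggregate_scores(per_response)
error = mse(final_scores, [2.0, 4.0])
```

Averaging across responses smooths out noise in any single answer, which is one plausible reading of why the abstract credits the ensemble with improved robustness.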
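[NLP-15] Multilingual Political Views of Large Language Models: Identification and Steering
[Quick read]: This paper addresses the limits of current evaluations of political bias in large language models (LLMs): existing studies generalize poorly across model architectures, scales, and multilingual settings, and few test empirically whether the bias can be controlled. The key to its approach is a large-scale, systematic evaluation across languages and model families that administers the Political Compass Test in 14 languages with 11 semantically equivalent paraphrases per statement, together with a simple center-of-mass activation intervention, which shows that a lightweight technique can steer a model's political stance and reveals both the malleability of LLM political leanings and the variation across languages and model families.
Link: https://arxiv.org/abs/2507.22623
Authors: Daniil Gurgurov,Katharina Trinley,Ivan Vykopal,Josef van Genabith,Simon Ostermann,Roberto Zamparelli
Affiliations: Saarland University; University of Trento; Brno University of Technology; Kempelen Institute of Intelligent Technologies; German Research Center for AI (DFKI)
Categories: Computation and Language (cs.CL)
Comments: pre-print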
Abstract:Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases–frequently skewing toward liberal or progressive positions–key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at this https URL.
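One common reading of a center-of-mass activation intervention is a difference-of-means steering vector: average the hidden activations for two groups of prompts, subtract, and add the result to a hidden state at inference time. The sketch below follows that reading as an assumption; the abstract does not spell out the procedure, and the activations, function names, and strength value are toy placeholders.

```python
def center_of_mass(activations):
    """Mean activation vector over a set of examples."""
    n = len(activations)
    return [sum(column) / n for column in zip(*activations)]


def steering_vector(source_activations, target_activations):
    """Direction pointing from the source group's mean to the target group's mean."""
    source = center_of_mass(source_activations)
    target = center_of_mass(target_activations)
    return [t - s for s, t in zip(source, target)]


def steer(hidden_state, vector, strength=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]


# Toy 2-dimensional activations for two hypothetical stance groups.
source_group = [[0.0, 1.0], [2.0, 1.0]]
target_group = [[3.0, 0.0], [5.0, 2.0]]

direction = steering_vector(source_group, target_group)
steered = steer([1.0, 1.0], direction, strength=0.5)
```

The intervention needs no gradient updates, only stored activations and a vector addition, which is consistent with the abstract calling the technique simple.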
[NLP-16] Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
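[Quick read]: This paper addresses the unclear neural mechanisms behind language-specific processing in large language models (LLMs), in particular how to identify and manipulate the neurons that control language behavior. The key to its approach is to apply the Language Activation Probability Entropy (LAPE) method, which identifies language-controlling neurons that cluster in deeper layers and are more specialized for non-Latin scripts, and to use language arithmetics, that is, systematic addition and multiplication of neuron activations, to activate desired languages and deactivate unwanted ones. This steers model behavior across multilingual tasks, outperforms simpler replacement approaches, and shows how language similarity and resource availability affect the intervention.
Link: https://arxiv.org/abs/2507.22608
Authors: Daniil Gurgurov,Katharina Trinley,Yusser Al Ghussin,Tanja Baeumel,Josef van Genabith,Simon Ostermann
Affiliations: Saarland University; German Research Center for AI (DFKI); Centre for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL)
Comments: preprint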
Abstract:Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse (8B and 32B) across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at this https URL.
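As a rough illustration of the two ideas named in the abstract, the sketch below scores neurons with an entropy over per-language activation probabilities (the usual formulation of LAPE, where low entropy means language-specific) and applies additive or multiplicative edits to selected neurons. The input matrix, the selection of a single neuron, and the `steer` helper are assumptions for illustration, not the authors' code.

```python
import numpy as np

def lape_scores(act_prob):
    """act_prob[l, n]: fraction of tokens in language l on which neuron n fires (assumed input)."""
    act_prob = np.asarray(act_prob, dtype=float)
    # Normalize each neuron's activation probabilities into a distribution over languages.
    totals = act_prob.sum(axis=0, keepdims=True)
    dist = act_prob / np.clip(totals, 1e-12, None)
    # Shannon entropy per neuron; terms with zero probability contribute zero.
    safe = np.where(dist > 0, dist, 1.0)
    return -(dist * np.log(safe)).sum(axis=0)

def steer(hidden, neuron_ids, add=0.0, scale=1.0):
    """Activation arithmetic on selected neurons: h <- scale * h + add."""
    out = np.array(hidden, dtype=float, copy=True)
    out[..., neuron_ids] = scale * out[..., neuron_ids] + add
    return out

probs = np.array([[0.90, 0.30],   # language A
                  [0.05, 0.30],   # language B
                  [0.05, 0.30]])  # language C
scores = lape_scores(probs)
specific = np.argsort(scores)[:1]   # lowest entropy = most language-specific
steered = steer(np.ones((2, 2)), [0], add=1.0, scale=2.0)
```

Setting `scale=0.0` would deactivate a neuron, while `scale > 1` or `add > 0` would amplify it.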
[NLP-17] VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
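[Quick Read]: This paper addresses the unstable performance of current multimodal reasoning models on complex tasks that vary widely in semantic content and problem formulation. Existing models fluctuate across domains and difficulty levels and struggle to reason robustly across domains. The core of the solution is VL-Cogito, a model trained with a multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework built on two key innovations: an online difficulty soft weighting mechanism that dynamically adjusts task difficulty across successive RL training stages, and a dynamic length reward mechanism that encourages the model to adapt its reasoning path length to task complexity, balancing reasoning efficiency against correctness. Experiments show that VL-Cogito consistently matches or surpasses existing reasoning-oriented models on mainstream multimodal benchmarks covering mathematics, science, logic, and general understanding.
Link: https://arxiv.org/abs/2507.22607
Authors: Ruifeng Yuan,Chenghao Xiao,Sicong Leng,Jianyu Wang,Long Li,Weiwen Xu,Hou Pong Chan,Deli Zhao,Tingyang Xu,Zhongyu Wei,Hao Zhang,Yu Rong
Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 5 figures, 6 tables. Work in progress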
Abstract:Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
[NLP-18] BALSAM: A Platform for Benchmarking Arabic Large Language Models
[Quick Read]: This paper addresses the lag in Arabic large language models (LLMs). The core challenges are Arabic data scarcity, linguistic diversity (including many dialects), morphological complexity, and the limited quality of existing benchmarks, which rely on static public data, lack comprehensive task coverage, and offer no blind-test mechanism, creating a risk of data contamination. The key to the solution is BALSAM, a comprehensive, community-driven benchmark platform covering 78 natural language processing (NLP) tasks in 14 categories, with 37K test and 15K development examples, together with a centralized, transparent blind-evaluation environment that sets common standards and promotes collaborative research on Arabic LLM capabilities.
Link: https://arxiv.org/abs/2507.22603
Authors: Rawan Al-Matham,Kareem Darwish,Raghad Al-Rasheed,Waad Alshammari,Muneera Alhoshan,Amal Almazrua,Asma Al Wazrah,Mais Alheraki,Firoj Alam,Preslav Nakov,Norah Alzahrani,Eman alBilali,Nizar Habash,Abdelrahman El-Sheikh,Muhammad Elmallah,Haonan Li,Hamdy Mubarak,Mohamed Anwar,Zaid Alyafeai,Ahmed Abdelali,Nora Altwairesh,Maram Hasanain,Abdulmohsen Al Thubaity,Shady Shehata,Bashar Alhafni,Injy Hamed,Go Inoue,Khalid Elmadani,Ossama Obeid,Fatima Haouari,Tamer Elsayed,Emad Alghamdi,Khalid Almubarak,Saied Alshahrani,Ola Aljarrah,Safa Alajlan,Areej Alshaqarawi,Maryam Alshihri,Sultana Alghurabi,Atikah Alzeghayer,Afrah Altamimi,Abdullah Alfaifi,Abdulrahman AlOsaimy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
[NLP-19] Unveiling the Influence of Amplifying Language-Specific Neurons
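[Quick Read]: This paper addresses the unclear role of language-specific neurons in the cross-lingual behavior of multilingual LLMs, in particular their effect under amplification. The key to the solution is systematically amplifying language-specific neurons across 18 languages (including low-resource ones) and introducing a new metric, the Language Steering Shift (LSS), to quantify how well each amplification factor steers output toward the target language. Experiments show that the optimal amplification factors steer output toward nearly all tested languages and can be beneficial, especially for low-resource languages, but generally degrade cross-language task performance, revealing a trade-off between sharpening language focus and supporting cross-lingual transfer.
Link: https://arxiv.org/abs/2507.22581
Authors: Inaya Rahmanisa,Lyzander Marciano Andrylie,Krisna Mahardika Ihsani,Alfan Farizki Wicaksono,Haryo Akbarianto Wibowo,Alham Fikri Aji
Affiliations: Universitas Indonesia; MBZUAI
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Our code and dataset are made available at this https URL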
Abstract:Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
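[NLP-20] Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning
[Quick Read]: This paper addresses the trade-off between differential privacy (DP) and model utility, in particular how to improve performance when training large language models (LLMs) under privacy guarantees. Existing methods such as differentially private stochastic gradient descent (DP-SGD) ensure privacy by clipping gradients and injecting noise, but this global, static mechanism significantly reduces sample efficiency and final accuracy. The paper proposes RLDP, described as the first framework to cast DP optimization as a closed-loop control problem and apply deep reinforcement learning (RL) to it: a soft actor-critic (SAC) hyper-policy learns online to adjust per-parameter gradient-clipping thresholds and noise magnitude, allocating the privacy budget at a fine granularity. Its key innovation is to sense rich statistics of the learning dynamics in real time and decide adaptively where and when to spend privacy budget, which speeds up convergence and improves downstream performance while honoring the same (ε, δ)-DP guarantee and showing equal or lower susceptibility to membership-inference and canary-extraction attacks.
Link: https://arxiv.org/abs/2507.22565
Authors: Afshin Khadangi,Amir Sartipi,Igor Tchappi,Ramin Bahmani,Gilbert Fridgen
Affiliations: SnT, University of Luxembourg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: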
Abstract:The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora including healthcare. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline’s final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same (ε, δ)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
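For context, the static baseline the abstract criticizes can be sketched as a single DP-SGD aggregation step: clip each per-sample gradient to a fixed L2 norm, sum, add Gaussian noise, and average. This is a minimal NumPy sketch of standard DP-SGD, not of RLDP. According to the abstract, RLDP would replace the fixed `clip_norm` and `noise_multiplier` below with values chosen online by the SAC policy. All names here are illustrative assumptions.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng):
    """One privatized gradient: clip per sample, sum, add Gaussian noise, average."""
    g = np.asarray(per_sample_grads, dtype=float)           # shape (batch, dim)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm; leave the rest untouched.
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * factors
    # Noise standard deviation is proportional to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    return (clipped.sum(axis=0) + noise) / g.shape[0]

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])   # L2 norms 5.0 and 0.5
noiseless = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

With the noise switched off, the first gradient is rescaled to norm 1 and the second passes through unchanged, which makes the clipping effect easy to inspect.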
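[NLP-21] Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
[Quick Read]: This paper addresses the vulnerability of LLM safety mechanisms to adversarial attacks, especially attacks that exploit cognitive biases. Prior approaches rely mainly on prompt engineering or algorithmic manipulation, whereas this work argues that interactions among multiple biases are an overlooked but highly damaging attack vector. The key to the solution is CognitiveAttack, a red-teaming framework that combines supervised fine-tuning with reinforcement learning to systematically generate prompts embedding optimized bias combinations, bypassing safety protocols while keeping attack success rates high. Experiments across 30 diverse LLMs reveal significant vulnerabilities, particularly in open-source models; the attack success rate is far higher than that of the state-of-the-art black-box method PAP (60.1% vs. 31.6%), exposing major limitations in current defenses and opening an interdisciplinary research direction that links cognitive science and LLM safety.
Link: https://arxiv.org/abs/2507.22564
Authors: Xikang Yang,Biyu Zhou,Xuehai Tang,Jizhong Han,Songlin Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: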
Abstract:Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases – systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
[NLP-22] ControlMed: Adding Reasoning Control to Medical Language Model
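[Quick Read]: This paper addresses the problem that reasoning LLMs in the medical domain often produce overly long reasoning, causing high computational overhead and response latency that limit deployment in real clinical settings. The key to the solution is ControlMed, a medical language model with fine-grained control markers that let users adjust the length of the reasoning process at inference time, trading off reasoning accuracy against computational efficiency. The model is trained in three stages: pre-training on large-scale synthetic medical instruction data, supervised fine-tuning on multi-length reasoning data with explicit length-control markers, and reinforcement learning with model-based reward signals. On multiple English and Korean medical benchmarks it performs on par with or better than state-of-the-art models.
Link: https://arxiv.org/abs/2507.22545
Authors: Sung-Min Lee,Siyoon Lee,Juyeon Kim,Kyungmin Roh
Affiliations: Agentic AI Lab, KT
Subjects: Computation and Language (cs.CL)
Comments: 13 pages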
Abstract:Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.
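[NLP-23] Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
[Quick Read]: This paper addresses the lack of a principled basis for choosing vocabulary size in natural language processing (NLP) and other sequence modeling domains, where existing practice relies on heuristics or dataset-specific choices. The key to the solution is to analyze token frequency distributions through Zipf's law and pick the vocabulary size whose token distribution most closely follows power-law behavior. Experiments show that downstream efficiency and effectiveness both peak when the token distribution is closest to Zipfian scaling, establishing Zipfian alignment as a robust, generalizable criterion for vocabulary size selection.
Link: https://arxiv.org/abs/2507.22543
Authors: Yanjin He,Qingkai Zeng,Meng Jiang
Affiliations: Peking University; University of Notre Dame; Nankai University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: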
Abstract:Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf’s law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf’s law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
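One simple way to quantify "how closely token distributions follow power-law behavior" is a least-squares line fit in log-log rank-frequency space, reporting the fitted exponent and the goodness of fit. The abstract does not say which fit statistic the authors use, so the R² criterion and the `zipf_fit` helper below are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def zipf_fit(tokens):
    """Fit log(freq) = intercept - s * log(rank); return (s, r_squared)."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    x, y = np.log(ranks), np.log(freqs)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - residuals.var() / y.var()
    return -slope, r2

# Synthetic corpus in which the rank-r token appears about 1200 / r times.
corpus = [f"tok{r}" for r in range(1, 21) for _ in range(round(1200 / r))]
s, r2 = zipf_fit(corpus)
```

Under this reading, one would tokenize the same corpus at several candidate vocabulary sizes, run `zipf_fit` on each token stream, and prefer the size with the best fit.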
[NLP-24] A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support
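[Quick Read]: This paper addresses the lack of high-quality, domain-specific benchmarks for evaluating Vietnamese large language models (ViLLMs) on customer-support question answering (QA). Although lightweight open-source ViLLMs offer advantages in accuracy, efficiency, and privacy, enterprises struggle to choose a suitable model because no benchmark reflects real customer conversations. The key to the solution is the Customer Support Conversations Dataset (CSConDa), a public benchmark of over 9,000 QA pairs drawn from real conversations with human advisors at a large Vietnamese software company, covering topics such as pricing, product availability, and technical troubleshooting. The paper also presents an evaluation framework that benchmarks 11 lightweight ViLLMs using automatic metrics combined with syntactic analysis, revealing performance differences, behavioral patterns, and areas for improvement to inform both ViLLM development and enterprise model selection.
Link: https://arxiv.org/abs/2507.22542
Authors: Long S. T. Nguyen,Truong P. Hua,Thanh M. Nguyen,Toan Q. Pham,Nam K. Ngo,An X. Nguyen,Nghi D. M. Pham,Nghia H. Nguyen,Tho T. Quan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Under review at ICCCI 2025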
Abstract:With the rapid growth of Artificial Intelligence, Large Language Models (LLMs) have become essential for Question Answering (QA) systems, improving efficiency and reducing human workload in customer service. The emergence of Vietnamese LLMs (ViLLMs) highlights lightweight open-source models as a practical choice for their accuracy, efficiency, and privacy benefits. However, domain-specific evaluations remain limited, and the absence of benchmark datasets reflecting real customer interactions makes it difficult for enterprises to select suitable models for support applications. To address this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a curated benchmark of over 9,000 QA pairs drawn from real interactions with human advisors at a large Vietnamese software company. Covering diverse topics such as pricing, product availability, and technical troubleshooting, CSConDa provides a representative basis for evaluating ViLLMs in practical scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with both automatic metrics and syntactic analysis to reveal model strengths, weaknesses, and linguistic patterns. This study offers insights into model behavior, explains performance differences, and identifies key areas for improvement, supporting the development of next-generation ViLLMs. By establishing a robust benchmark and systematic evaluation, our work enables informed model selection for customer service QA and advances research on Vietnamese LLMs. The dataset is publicly available at this https URL.
[NLP-25] CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records
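[Quick Read]: This paper addresses three challenges LLMs face in clinical decision support over longitudinal cancer electronic health records (EHRs): difficulty processing long, multilingual records for accurate temporal analysis; a heightened risk of clinical hallucination, because conventional grounding methods such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder validation of AI systems in oncology. The key to the solution is the CliCARE framework, which transforms unstructured longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, then aligns these real patient trajectories with a normative guideline knowledge graph to provide evidence-grounded decision support in the form of a high-fidelity clinical summary and an actionable recommendation.
Link: https://arxiv.org/abs/2507.22533
Authors: Dongchen Li(1),Jitao Liang(1),Wei Li(1),Xiaoyu Wang(2),Longbing Cao(3),Kun Yu(4) ((1) College of Computer Science and Engineering, Northeastern University, Shenyang, China, (2) Liaoning Cancer Hospital and Institute, Shenyang, China, (3) Macquarie University, Sydney, Australia, (4) College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China)
Affiliations: Northeastern University, Shenyang; Liaoning Cancer Hospital and Institute; Macquarie University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: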
Abstract:Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists.
[NLP-26] SLM-SQL: An Exploration of Small Language Models for Text-to-SQL
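[Quick Read]: This paper addresses the weak performance of small language models (SLMs) on Text-to-SQL, which stems from their limited logical reasoning ability. SLMs offer advantages in inference speed and edge deployment, yet they currently fall well short of large language models (LLMs) at generating accurate SQL. The key to the solution is to apply recent post-training techniques, namely supervised fine-tuning and reinforcement learning-based post-training, and to run inference with a corrective self-consistency approach, which markedly improves execution accuracy (EX). Using two derived datasets (SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision), the authors validate the effectiveness and generalizability of their method, SLM-SQL: on the BIRD development set the five evaluated models improve by 31.4 points on average, with the 0.5B model reaching 56.87% EX and the 1.5B model 67.08% EX.
Link: https://arxiv.org/abs/2507.22478
Authors: Lei Sheng,Shuai-Shuai Xu
Affiliations: Wuhan University of Technology; University of Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures, work in progress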
Abstract:Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX. We will release our dataset, model, and code to github: this https URL.
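The abstract does not describe how the "corrective" part of the self-consistency step works. The sketch below therefore shows only plain execution-based self-consistency, under the assumption that several sampled SQL candidates are executed and the one whose result set is most common wins. The `vote_by_execution` helper and the toy table are hypothetical.

```python
import sqlite3
from collections import Counter

def vote_by_execution(conn, candidates):
    """Return the candidate SQL whose execution result is most common; None if all fail."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # discard candidates that do not execute
        # Order-insensitive signature of the result set.
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winner, _ = Counter(res for _, res in executed).most_common(1)[0]
    # First candidate, in input order, that produced the winning result.
    return next(sql for sql, res in executed if res == winner)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
candidates = [
    "SELECT COUNT(*) FROM t WHERE x > 1",
    "SELECT COUNT(*) FROM t WHERE x >= 2",
    "SELECT COUNT(*) FROM t",
    "SELECT COUNT(*) FROM missing_table",
]
best = vote_by_execution(conn, candidates)
```

A corrective variant would presumably feed failing or minority candidates back to a revision model (the paper's merge-revision dataset suggests this), but that step is not reproduced here.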
[NLP-27] IFEvalCode: Controlled Code Generation
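[Quick Read]: This paper addresses the difficulty code large language models (Code LLMs) have in strictly following detailed instructions in practice, such as coding style, line-count limits, and structural constraints, beyond merely producing correct code. The key to the solution is a two-way constraint generation scheme, forward constraints generation and backward constraints generation, intended to strengthen instruction following in controlled code generation so that outputs align more closely with human-defined guidelines. The authors also build IFEvalCode, a multilingual benchmark that decouples evaluation into two independent metrics, correctness (Corr.) and instruction-following (Instr.), to measure more precisely how well models meet specific user requirements.
Link: https://arxiv.org/abs/2507.22462
Authors: Jian Yang,Wei Zhang,Shukai Liu,Linzheng Chai,Yingshui Tan,Jiaheng Liu,Ge Zhang,Wangchunshu Zhou,Guanglin Niu,Zhoujun Li,Binyuan Hui,Junyang Lin
Affiliations: Beihang University; M-A-P; OPPO
Subjects: Computation and Language (cs.CL)
Comments: 10 pages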
Abstract:Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models’ ability to generate correct code versus code that precisely follows instructions.
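The decoupled scoring can be illustrated with a toy evaluator that reports correctness and instruction-following as two separate rates. This is a sketch under stated assumptions, not the IFEvalCode harness: the sample schema (`code`, `tests`, `max_lines`) is hypothetical, and a line-count limit stands in for the benchmark's richer constraints.

```python
def evaluate(samples):
    """Return (Corr., Instr.) as fractions over the samples.

    Each sample is a dict with assumed keys: 'code' (str),
    'tests' (callable that receives the executed namespace), 'max_lines' (int).
    """
    corr = instr = 0
    for s in samples:
        namespace = {}
        try:
            exec(s["code"], namespace)   # run untrusted code only inside a sandbox
            s["tests"](namespace)
            corr += 1
        except Exception:
            pass                         # any failure counts as incorrect
        n_lines = len([ln for ln in s["code"].splitlines() if ln.strip()])
        instr += n_lines <= s["max_lines"]
    n = len(samples)
    return corr / n, instr / n

def check_add(ns):
    assert ns["add"](2, 3) == 5

samples = [
    {"code": "def add(a, b):\n    return a + b\n",
     "tests": check_add, "max_lines": 2},
    {"code": "def add(a, b):\n    total = a\n    total += b\n    return total\n",
     "tests": check_add, "max_lines": 2},
]
corr, instr = evaluate(samples)
```

Both solutions are functionally correct, but only the first respects the two-line limit, so the two rates diverge. That divergence is the kind of gap the abstract reports between correct code and code that follows instructions.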
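[NLP-28] What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models CONLL2025
[Quick Read]: This paper addresses the debate over whether large language models (LLMs) are “abstract reasoners”, in particular the criticism based on their poor zero-shot performance. Revisiting the relevant experiments, it finds that although LLMs do perform poorly zero-shot, fine-tuning only a small subset of parameters for input encoding can enable near-perfect performance; however, this fine-tuning does not necessarily transfer across datasets. The authors take these findings as an invitation to reconsider what “abstract reasoning” means and why it matters whether LLMs qualify.
Link: https://arxiv.org/abs/2507.22457
Authors: Tian Yun,Chen Sun,Ellie Pavlick
Affiliations: Brown University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: CONLL 2025. Project webpage: this https URL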
Abstract:Recent work has argued that large language models (LLMs) are not “abstract reasoners”, citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an “abstract reasoner”, and why it matters whether LLMs fit the bill.
[NLP-29] Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
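[Quick Read]: This paper addresses the difficulty of balancing performance and efficiency in current large language models (LLMs), particularly bottlenecks in long-context processing, parameter utilization, and training-data efficiency. The key to the solution is Falcon-H1, a family of models with a hybrid architecture whose core innovation is combining Transformer attention with State Space Models (SSMs) in parallel, aiming to retain strong reasoning ability while improving computational efficiency and long-range modeling. After systematically revisiting model design, data strategy, and training dynamics, the authors report that the flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B and Llama3.3-70B, while using fewer parameters and less training data.
Link: https://arxiv.org/abs/2507.22448
Authors: Jingwei Zuo,Maksim Velikanov,Ilyas Chahed,Younes Belkada,Dhia Eddine Rhayem,Guillaume Kunsch,Hakim Hacid,Hamza Yous,Brahim Farhat,Ibrahim Khadraoui,Mugariya Farooq,Giulia Campesan,Ruxandra Cojocaru,Yasser Djilali,Shi Hu,Iheb Chaabane,Puneesh Khanna,Mohamed El Amine Seddik,Ngoc Dung Huynh,Phuc Le Khac,Leen AlQadi,Billel Mokeddem,Mohamed Chami,Abdalgader Abubaker,Mikhail Lubinets,Kacper Piskorski,Slim Frikha
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Technical report of Falcon-H1 model series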
Abstract:In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.
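[NLP-30] AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini
[Quick Read]: This paper asks whether generative AI trained mainly on Anglo-American texts can produce stories that are culturally relevant to other nationalities. The authors sent the same prompt, “Write a 1500 word potential demonym story”, to OpenAI's gpt-4o-mini to generate 11,800 stories covering 236 countries, then systematically analyzed their narrative structure and cultural representation. Although the stories include surface-level national symbols and themes, they overwhelmingly follow a single plot: a protagonist returns to a small town, resolves a minor conflict by reconnecting with tradition, and organizes community events. Real-world conflicts are sanitized, romance is almost absent, and nostalgia and reconciliation are favored. The authors identify this “narrative standardisation” as a distinct form of AI bias, separate from the more familiar representational bias.
Link: https://arxiv.org/abs/2507.22445
Authors: Jill Walker Rettberg,Hermann Wigers
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 101142306. The project is also supported by the Center for Digital Narrative, which is funded by the Research Council of Norway through its Centres of Excellence scheme, project number 332643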
Abstract:Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories - 50 for each of 236 countries - by sending the prompt “Write a 1500 word potential demonym story” to OpenAI’s model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.
[NLP-31] NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models
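[Quick read]: This paper addresses the overestimation of the true long-context (LC) understanding of large language models (LLMs) by current evaluations. The standard Needle-in-a-Haystack (NIAH) benchmark fills the context with query-irrelevant passages, which can mislead judgments of how well a model actually understands its input. The key contribution is NeedleChain, a benchmark whose context consists entirely of query-relevant sentences, so the model must take in the whole input to answer correctly. This gives a more accurate measure of reasoning and information integration over long inputs. The authors also propose ROPE Contraction, a simple but effective strategy for improving long-context understanding.
Link: https://arxiv.org/abs/2507.22411
Authors: Hyeonseok Moon, Heuiseok Lim
Affiliation: Korea University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages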
Abstract:The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models’ (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to incorporate intact a given context made up solely of ten query-relevant sentences. In response, we introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at this https URL
[NLP-32] Question Generation for Assessing Early Literacy Reading Comprehension
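[Quick read]: This paper addresses reading-comprehension assessment for K-2 English learners: how to generate diverse comprehension questions, at multiple difficulty levels, that cover the text and match the individual learner’s proficiency. The key contribution is a question-generation method that ensures complete coverage of the source material, adapts to the learner’s specific proficiencies, and supports many question types for a thorough evaluation. The authors evaluate several language models within this framework on the FairytaleQA dataset and point to its potential as a component of autonomous AI-driven English instruction.
Link: https://arxiv.org/abs/2507.22410
Authors: Xiaocheng Yang, Sumuk Shashidhar, Dilek Hakkani-Tur
Affiliation: University of Illinois
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2 pages, 1 figure, accepted by SLaTE 2025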
Abstract:Assessment of reading comprehension through content-based interactions plays an important role in the reading acquisition process. In this paper, we propose a novel approach for generating comprehension questions geared to K-2 English learners. Our method ensures complete coverage of the underlying material and adaptation to the learner’s specific proficiencies, and can generate a large diversity of question types at various difficulty levels to ensure a thorough evaluation. We evaluate the performance of various language models in this framework using the FairytaleQA dataset as the source material. Eventually, the proposed approach has the potential to become an important part of autonomous AI-driven English instructors.
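[NLP-33] PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs
[Quick read]: This paper targets the tedious, inefficient process of patent drafting by using large language models (LLMs) to automate patent abstract generation. The key contribution is PATENTWRITER, the first unified benchmarking framework for evaluating LLMs on this task. It covers zero-shot, few-shot, and chain-of-thought prompting, and it evaluates along several dimensions: standard NLP metrics (e.g., BLEU, ROUGE, BERTScore), robustness under input perturbations, and usefulness in downstream patent classification and retrieval tasks. A stylistic analysis checks length, readability, and tone. Experiments show that modern LLMs can generate high-fidelity, stylistically appropriate patent abstracts, often surpassing domain-specific baselines.
Link: https://arxiv.org/abs/2507.22387
Authors: Homaira Huda Shomee, Suman Kalyan Maity, Sourav Medya
Affiliations: University of Illinois Chicago; Missouri University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: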
Abstract:Large language models (LLMs) have emerged as transformative approaches in several important fields. This paper aims for a paradigm shift for patent writing by leveraging LLMs to overcome the tedious patent-filing process. In this work, we present PATENTWRITER, the first unified benchmarking framework for evaluating LLMs in patent abstract generation. Given the first claim of a patent, we evaluate six leading LLMs – including GPT-4 and LLaMA-3 – under a consistent setup spanning zero-shot, few-shot, and chain-of-thought prompting strategies to generate the abstract of the patent. Our benchmark PATENTWRITER goes beyond surface-level evaluation: we systematically assess the output quality using a comprehensive suite of metrics – standard NLP measures (e.g., BLEU, ROUGE, BERTScore), robustness under three types of input perturbations, and applicability in two downstream patent classification and retrieval tasks. We also conduct stylistic analysis to assess length, readability, and tone. Experimental results show that modern LLMs can generate high-fidelity and stylistically appropriate patent abstracts, often surpassing domain-specific baselines. Our code and dataset are open-sourced to support reproducibility and future research.
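[NLP-34] Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors ACM-MM2025
[Quick read]: This paper addresses two difficulties in multimodal modelling of personality traits: traditional shallow features capture personality semantics poorly, and signals from different modalities (text, audio, visual) are asynchronous in time, which hinders cross-modal understanding. The proposed framework, Traits Run Deep, has two core ideas: (1) psychology-informed prompts that lead large language models (LLMs) to extract high-level, personality-relevant semantic representations; and (2) a Text-Centric Trait Fusion Network that aligns and fuses asynchronous multimodal signals through a Chunk-Wise Projector, a Cross-Modal Connector, and a Text Feature Enhancer, combined with an ensemble regression head that improves generalization when data are scarce. Experiments on the AVI validation set show a reduction of roughly 45% in mean squared error (MSE), and the method ranked first in the Personality Assessment track of the AVI Challenge 2025.
Link: https://arxiv.org/abs/2507.22367
Authors: Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang
Affiliation: Hefei University of Technology
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 8 pages, 3 figures, ACM MM 2025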
Abstract:Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. It was hard to model personality semantics with traditional superficial features and seemed impossible to achieve effective cross-modal understanding. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. Besides, it devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. To be specific, such fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method’s superiority, ranking first in the Personality Assessment track. The source code will be made available at this https URL.
[NLP-35] LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
[Quick read]: This paper addresses three problems in current evaluation of large language models (LLMs): data contamination, black-box operation, and subjective preference, all of which make it hard to measure an LLM’s true capabilities comprehensively and objectively. The proposed solution is LLM-Crowdsourced, a benchmark-free evaluation paradigm in which LLMs generate the questions, answer them independently, and evaluate one another. The method combines four criteria (dynamic, transparent, objective, and professional) that existing evaluation methods cannot satisfy at the same time.
Link: https://arxiv.org/abs/2507.22359
Authors: Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. These issues make it difficult to evaluate the LLMs’ true capabilities comprehensively. To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It utilizes LLMs to generate questions, answer independently, and evaluate mutually. This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously. Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance. Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the highest original and professional question-design capabilities among others; (2) Some LLMs exhibit “memorization-based answering” by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness).
[NLP-36] A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers
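[Quick read]: This paper addresses the weak performance of neural information retrieval models on queries that contain negation. Dense neural models learn contextualised embeddings, yet they still underperform when a query is negated. The work has three parts: first, a taxonomy of negation derived from philosophical, linguistic, and logical definitions; second, two benchmark datasets for evaluating neural retrieval models on negation and for fine-tuning them; third, a logic-based classification mechanism for analysing how well existing datasets cover the different negation types and which factors may affect the generalization of fine-tuned models. The taxonomy yields a balanced distribution over negation types, which gives a better training setup and faster convergence on the NevIR dataset.
Link: https://arxiv.org/abs/2507.22337
Authors: Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas
Affiliation: University of Amsterdam
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: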
Abstract:Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.
[NLP-37] Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations SIGDIAL2025
[Quick read]: This paper addresses the dependence on annotated data in intent recognition and out-of-scope (OOS) detection for task-oriented dialogue systems (TODS), which is especially limiting in zero-shot and few-shot settings. The proposed hybrid approach combines the generalization power of large language models (LLMs) with the computational efficiency of BERT: information from BERT’s outputs is shared with the LLM to support intent recognition and OOS detection. On multi-party conversation corpora, this sharing improves system performance.
Link: https://arxiv.org/abs/2507.22289
Authors: Galo Castillo-López, Gaël de Chalendar, Nasredine Semmar
Affiliations: Université Paris-Saclay; CEA; List
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication at SIGDIAL 2025
Abstract:Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amounts of annotated data. In this work we propose a hybrid approach to combine BERT and LLMs in zero and few-shot settings to recognize intents and detect OOS utterances. Our approach leverages LLMs’ generalization power and BERT’s computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs to LLMs leads to system performance improvement.
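[NLP-38] Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs
[Quick read]: This paper asks whether the internal representations of large language models (LLMs) reflect the “function-infused gradience” proposed by the usage-based constructionist (UCx) approach, that is, whether constructions are learned and represented in a graded, probabilistic form. The authors analyse the neural representations of the two English dative constructions (Double Object and Prepositional Object) in Pythia-1.4B, using 5000 sentence pairs that vary systematically in human-rated preference strength. A macro-level geometric analysis finds that the separability between the two constructions, measured by Energy Distance or Jensen-Shannon Divergence, changes systematically with preference strength: more prototypical exemplars of each construction occupy more distinct regions of activation space. The results indicate that LLMs learn meaning-infused, graded representations of constructions and support the use of geometric measures for constructionist principles in LLMs.
Link: https://arxiv.org/abs/2507.22286
Authors: Supantho Rakshit, Adele Goldberg
Affiliation: Princeton University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, Accepted for publication at the Second International Workshop on Construction Grammars and NLP at the 16th International Conference for Computational Semantics (IWCS) 2025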
Abstract:The usage-based constructionist (UCx) approach posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze the neural representations of the English dative constructions (Double Object and Prepositional Object) in Pythia-1.4B, using a dataset of 5000 sentence pairs systematically varied for human-rated preference strength. A macro-level geometric analysis finds that the separability between construction representations, as measured by Energy Distance or Jensen-Shannon Divergence, is systematically modulated by gradient preference strength. More prototypical exemplars of each construction occupy more distinct regions in the activation space of LLMs. These results provide strong evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures of basic constructionist principles in LLMs.
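The two separability measures named in the abstract can be sketched in a few lines. This is only an illustration of the standard definitions (squared energy distance between two point sets, and Jensen-Shannon divergence between two discrete distributions); how the authors extract activations or turn them into distributions is not stated here, so the toy 2-D points below are hypothetical stand-ins for activation vectors.

```python
import math
from itertools import product


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def squared_energy_distance(xs, ys):
    """Squared energy distance: 2*E|X-Y| - E|X-X'| - E|Y-Y'| (plug-in estimate)."""
    return (2 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical "activation" clouds for the two constructions.
double_object = [(0.0, 0.0), (1.0, 0.0)]
prep_object_near = [(0.5, 0.0), (1.5, 0.0)]
prep_object_far = [(5.0, 0.0), (6.0, 0.0)]

near = squared_energy_distance(double_object, prep_object_near)
far = squared_energy_distance(double_object, prep_object_far)
```

Under this sketch, clouds that sit further apart give a larger energy distance, which is the sense in which "more distinct regions" would show up as higher separability.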
[NLP-39] CoEx – Co-evolving World-model and Exploration
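[Quick read]: This paper addresses a weakness of current large language model (LLM) agents: planning relies on the LLM’s static internal world model, which drifts progressively out of line with the true state of the environment and leads to divergent, erroneous plans. The proposed solution is CoEx, a hierarchical agent architecture in which hierarchical state abstraction lets LLM planning co-evolve with a dynamically updated world model. CoEx uses LLM reasoning to orchestrate dynamic plans made of subgoals, and it continuously folds subgoal experience into a neurosymbolic belief state, made up of textual inferences and code-based symbolic memory, which serves as a persistent world model that evolves with interaction.
Link: https://arxiv.org/abs/2507.22281
Authors: Minsoo Kim, Seung-won Hwang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: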
Abstract:Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.
[NLP-40] RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
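[Quick read]: This paper addresses two limits of preference-learning methods for machine translation (MT) such as Direct Preference Optimization (DPO): heavy dependence on static triplet datasets, and weak generalization beyond the tuning domain. The proposed framework, Reinforcement Learning from Teacher-Model Refinement (RLfR), drops static triplets and instead uses continuous, high-quality feedback from an external teacher model (GPT-4o). Each translation step is treated as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded according to how close its output is to the teacher’s refinement. Two complementary signals guide learning: (i) negative edit distance, which promotes lexical and structural fidelity, and (ii) COMET score, which ensures semantic adequacy. On the FLORES-200 benchmark, RLfR consistently outperforms MT-SFT and preference-based baselines, with notable gains in semantic adequacy (COMET) and entity preservation (M-ETA).
Link: https://arxiv.org/abs/2507.22219
Authors: Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
Affiliation: Zoom Communications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: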
Abstract:Preference-learning methods for machine translation (MT)–such as Direct Preference Optimization (DPO)–have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher’s refinement. Guided by two complementary signals–(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy–the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
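The reward described above (closeness of the actor's hypothesis to the teacher's refinement) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the token-level Levenshtein distance, the length normalization, the mixing weight `alpha`, and the `semantic_score` callable (a stand-in for a COMET scorer) are all assumptions made for the example.

```python
from typing import Callable, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def refinement_reward(
    hypothesis: str,
    refinement: str,
    semantic_score: Callable[[str, str], float],
    alpha: float = 0.5,  # hypothetical weight between the two signals
) -> float:
    """Combine negative normalized edit distance with a semantic score."""
    hyp, ref = hypothesis.split(), refinement.split()
    norm = max(len(hyp), len(ref), 1)
    neg_edit = -edit_distance(hyp, ref) / norm
    return alpha * neg_edit + (1 - alpha) * semantic_score(hypothesis, refinement)


def toy_overlap(hyp: str, ref: str) -> float:
    """Crude token-overlap score, used here only in place of COMET."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(r), 1)


reward = refinement_reward("the cat sat", "the cat sat down", toy_overlap)
```

A hypothesis that already matches the teacher's refinement gets the highest reward under this sketch, so the actor is pushed towards outputs the teacher would leave unchanged.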
[NLP-41] How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?
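[Quick read]: This paper addresses a bias in how contextual entropy is estimated. The usual shortcut computes entropy from a language model’s probability distribution over a word’s first subword token, which underestimates true word entropy and may distort it. The authors instead use Monte Carlo (MC) sampling to estimate word entropy while allowing words to span a variable number of subword tokens. In regression experiments on reading times, first-token entropy and MC word entropy give divergent results, which calls for caution when using the first-token approximation.
Link: https://arxiv.org/abs/2507.22209
Authors: Christian Clark, Byung-Doh Oh, William Schuler
Affiliations: The Ohio State University; New York University
Categories: Computation and Language (cs.CL)
Comments: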
Abstract:Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
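The gap between the two estimates can be seen on a toy example. The sketch below assumes a hypothetical hand-written subword model (`TOY_LM`, with `</w>` marking the end of a word) in place of a real language model, and a plain sample-mean estimator of word entropy; the authors' actual sampling procedure may differ.

```python
import math
import random

END = "</w>"  # hypothetical end-of-word marker

# Hypothetical next-token distributions, keyed by the subword prefix so far.
TOY_LM = {
    (): {"un": 0.25, "cat": 0.75},
    ("un",): {"do": 0.5, "tie": 0.5},
    ("un", "do"): {END: 1.0},
    ("un", "tie"): {END: 1.0},
    ("cat",): {END: 0.5, "s": 0.5},
    ("cat", "s"): {END: 1.0},
}


def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def first_token_entropy(lm):
    """Entropy over the first subword token only (the usual shortcut)."""
    return entropy_bits(lm[()])


def sample_word_logprob(lm, rng):
    """Sample one whole word token by token; return its log2 probability."""
    prefix, logp = (), 0.0
    while True:
        dist = lm[prefix]
        tokens = list(dist)
        tok = rng.choices(tokens, weights=[dist[t] for t in tokens])[0]
        logp += math.log2(dist[tok])
        if tok == END:
            return logp
        prefix += (tok,)


def mc_word_entropy(lm, n_samples=5000, seed=0):
    """Monte Carlo estimate of word entropy: mean of -log2 p(word)."""
    rng = random.Random(seed)
    total = sum(sample_word_logprob(lm, rng) for _ in range(n_samples))
    return -total / n_samples


first_tok = first_token_entropy(TOY_LM)
mc_word = mc_word_entropy(TOY_LM)
```

In this toy model the four words have probabilities 0.125, 0.125, 0.375, and 0.375, so the exact word entropy is about 1.81 bits while the first-token entropy is about 0.81 bits. By the chain rule, first-token entropy can never exceed word entropy, which is why the shortcut underestimates.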
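[NLP-42] The role of media memorability in facilitating startups’ access to venture capital funding
[Quick read]: This paper argues that prior research focuses too narrowly on general media exposure, which limits our understanding of how media actually shape startup funding outcomes. The key move is to introduce “media memorability”: the media’s ability to imprint a startup’s name in the memory of relevant investors. The empirical analysis shows that venture capitalists rely on fine-grained cues such as a startup’s distinctiveness and its connectivity within news semantic networks. This contributes to research on entrepreneurial finance and media legitimation, and it suggests that startups should work on brand memorability and not only on frequent media mentions.
Link: https://arxiv.org/abs/2507.22201
Authors: L. Toschi, S. Torrisi, A. Fronzetti Colladon
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments: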
Abstract:Media reputation plays an important role in attracting venture capital investment. However, prior research has focused too narrowly on general media exposure, limiting our understanding of how media truly influences funding decisions. As informed decision-makers, venture capitalists respond to more nuanced aspects of media content. We introduce the concept of media memorability - the media’s ability to imprint a startup’s name in the memory of relevant investors. Using data from 197 UK startups in the micro and nanotechnology sector (funded between 1995 and 2004), we show that media memorability significantly influences investment outcomes. Our findings suggest that venture capitalists rely on detailed cues such as a startup’s distinctiveness and connectivity within news semantic networks. This contributes to research on entrepreneurial finance and media legitimation. In practice, startups should go beyond frequent media mentions to strengthen brand memorability through more targeted, meaningful coverage highlighting their uniqueness and relevance within the broader industry conversation.
[NLP-43] Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence
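[Quick read]: This paper asks what standard of systematicity AI models should be held to, that is, how systematic AI must be to count as a rational, authoritative, and scientific thinker. It argues that the prevailing understanding of systematicity is too narrow: it is confined to the “systematicity challenge” raised by Fodor and colleagues and overlooks a more demanding ideal of thought that is consistent, coherent, comprehensive, and parsimoniously principled. The author offers a conceptual framework that distinguishes four senses of “the systematicity of thought” and uses it to show that connectionism and systematicity are not fundamentally at odds. The paper then argues that the demand for systematization must itself be regulated by the rationales for systematization, which yields a dynamic standard that says how systematic AI models need to be and when.
Link: https://arxiv.org/abs/2507.22197
Authors: Matthieu Queloz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 39 pages; final, published version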
Abstract:This paper argues that explainability is only one facet of a broader ideal that shapes our expectations towards artificial intelligence (AI). Fundamentally, the issue is to what extent AI exhibits systematicity–not merely in being sensitive to how thoughts are composed of recombinable constituents, but in striving towards an integrated body of thought that is consistent, coherent, comprehensive, and parsimoniously principled. This richer conception of systematicity has been obscured by the long shadow of the “systematicity challenge” to connectionism, according to which network architectures are fundamentally at odds with what Fodor and colleagues termed “the systematicity of thought.” I offer a conceptual framework for thinking about “the systematicity of thought” that distinguishes four senses of the phrase. I use these distinctions to defuse the perceived tension between systematicity and connectionism and show that the conception of systematicity that historically shaped our sense of what makes thought rational, authoritative, and scientific is more demanding than the Fodorian notion. To determine whether we have reason to hold AI models to this ideal of systematicity, I then argue, we must look to the rationales for systematization and explore to what extent they transfer to AI models. I identify five such rationales and apply them to AI. This brings into view the “hard systematicity challenge.” However, the demand for systematization itself needs to be regulated by the rationales for systematization. This yields a dynamic understanding of the need to systematize thought, which tells us how systematic we need AI models to be and when.
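[NLP-44] A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models
[Quick read]: This paper addresses the limits of existing tools for estimating Verb Frame Frequencies (VFFs), which fall short in scale, accuracy, or accessibility, while manual parsing remains the gold standard but is very resource-intensive. The proposed solution is an automated pipeline: large language models (LLMs) first generate a corpus of sentences containing 476 English verbs, and an LLM instructed to behave like an expert linguist then analyses the syntactic structure of each sentence. The pipeline outperforms two widely used syntactic parsers on multiple evaluation datasets and needs far fewer resources than manual parsing, which enables rapid, scalable VFF estimation.
Link: https://arxiv.org/abs/2507.22187
Authors: Adam M. Morgan, Adeen Flinker
Affiliations: NYU Grossman School of Medicine; NYU Tandon School of Engineering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: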
Abstract:We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.
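Once each sentence has been assigned a frame, the frequency estimate itself is a simple count. The sketch below covers only that final aggregation step; the `(verb, frame)` pairs and the frame labels are hypothetical examples, and the LLM generation and parsing stages of the pipeline are not reproduced.

```python
from collections import Counter, defaultdict


def verb_frame_frequencies(labeled):
    """Turn (verb, frame) labels into per-verb relative frame frequencies."""
    counts = defaultdict(Counter)
    for verb, frame in labeled:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(c.values()) for frame, n in c.items()}
        for verb, c in counts.items()
    }


# Hypothetical parser output: one (verb, frame) pair per sentence.
labeled_sentences = [
    ("give", "double_object"),
    ("give", "double_object"),
    ("give", "prepositional_object"),
    ("sleep", "intransitive"),
]

vff = verb_frame_frequencies(labeled_sentences)
```

Relative frequencies of structural alternates, such as the double-object versus prepositional-object split for "give" above, fall out directly from the per-verb dictionaries.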
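[NLP-45] Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
[Quick read]: This paper addresses the lack of writing-style diversity in current benchmarks for large language models (LLMs), which may leave models brittle when they meet non-standard input. The approach is to rewrite evaluation prompts with persona-based LLM prompting, a low-cost way to emulate diverse writing styles, and to measure how style and formatting alone change estimated performance when the semantic content is identical. The study finds that certain writing styles consistently trigger high or low performance across models and tasks, regardless of model family, size, and recency. The method offers a scalable way to augment existing benchmarks and to improve the external validity of LLM evaluations under linguistic variation.
Link: https://arxiv.org/abs/2507.22168
Authors: Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu
Affiliations: Carnegie Mellon University; Amazon AWS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: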
Abstract:Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.
[NLP-46] Strategic Deflection: Defending LLMs from Logit Manipulation
[Quick read]: This paper addresses the security of large language models (LLMs) against advanced jailbreaking attacks such as logit-level attacks, which bypass traditional refusal-based defences by directly manipulating token selection during generation and so induce harmful output. The proposed defence, Strategic Deflection (SDeflection), does not simply refuse a malicious request. It steers the model towards a response that is semantically adjacent to the request but stripped of the harmful intent, which lowers the Attack Success Rate (ASR) while maintaining performance on benign queries.
Link: https://arxiv.org/abs/2507.22160
Authors: Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni
Affiliations: International Artificial Intelligence Center of Morocco, Mohammed VI Polytechnic University; AgriEdge; Sorbonne University, LIP6 - UMR 7606 CNRS, France
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages
Abstract:With the growing adoption of Large Language Models (LLMs) in critical areas, ensuring their security against jailbreaking attacks is paramount. While traditional defenses primarily rely on refusing malicious prompts, recent logit-level attacks have demonstrated the ability to bypass these safeguards by directly manipulating the token-selection process during generation. We introduce Strategic Deflection (SDeflection), a defense that redefines the LLM’s response to such advanced attacks. Instead of outright refusal, the model produces an answer that is semantically adjacent to the user’s request yet strips away the harmful intent, thereby neutralizing the attacker’s harmful intent. Our experiments demonstrate that SDeflection significantly lowers Attack Success Rate (ASR) while maintaining model performance on benign queries. This work presents a critical shift in defensive strategies, moving from simple refusal to strategic content redirection to neutralize advanced threats.
[NLP-47] IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
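[Quick read]: This paper addresses the severe underrepresentation of Indonesian in preference research for large language models (LLMs), and the loss of cultural and linguistic authenticity in multilingual datasets that are mostly derived from English translations. The key contribution is IndoPref, the first fully human-authored, multi-domain Indonesian preference dataset. All annotations are written natively in Indonesian, and inter-annotator agreement is assessed with Krippendorff’s alpha. The dataset is also used to benchmark multiple LLMs and to assess the naturalness and quality of the text each model generates.
Link: https://arxiv.org/abs/2507.22159
Authors: Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint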
Abstract:Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset specifically designed to evaluate the naturalness and quality of LLM-generated text. All annotations are natively written in Indonesian and evaluated using Krippendorff’s alpha, demonstrating strong inter-annotator agreement. Additionally, we benchmark the dataset across multiple LLMs and assess the output quality of each model.
[NLP-48] Prompt Optimization and Evaluation for LLM Automated Red Teaming
[Quick read]: This paper addresses how to evaluate and optimize the attack generator in Automated Red Teaming of LLM applications, so that generator prompts probe target-system vulnerabilities more effectively. The key idea is a “discoverability” measure: each attack is repeated multiple times against a randomly seeded target, and the expectation of that individual attack’s success quantifies how effective the attack is. The measure exposes exploitable patterns that give prompt optimization a data-driven basis and support more robust evaluation and iterative refinement of attack generators.
Link: https://arxiv.org/abs/2507.22133
Authors: Michael Freenor, Lauren Alvarez, Milton Leal, Lily Smith, Joel Garrett, Yelyzaveta Husieva, Madeline Woodruff, Ryan Miller, Erich Kummerfeld, Rafael Medeiros, Sander Schulhoff
Affiliations: Fuel iX Applied Research; North Carolina State University; University of Minnesota; TELUS Digital; Learn Prompting
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 9 pages, 5 Figures, and 1 Appendix item
Abstract:Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR), the sample mean calculated over the judgment of success for each attack. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack’s discoverability, the expectation of the individual attack success. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.
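A minimal sketch of the two quantities described in this abstract, assuming a toy target and judge. `toy_target`, `toy_judge`, and the `LEAK_PROB` table are hypothetical stand-ins, not part of the paper's system; only the idea of repeating one attack against a reseeded target and averaging the success judgments comes from the abstract.

```python
import random

# Hypothetical leak probabilities for two toy attacks (illustration only).
LEAK_PROB = {"polite request": 0.0, "role-play jailbreak": 1.0}

def toy_target(attack, seed):
    """Stand-in for a randomly seeded target system (hypothetical)."""
    draw = random.Random(seed).random()
    return "LEAK" if draw < LEAK_PROB.get(attack, 0.0) else "REFUSED"

def toy_judge(attack, response):
    """Stand-in for the success judgment (hypothetical)."""
    return response == "LEAK"

def discoverability(attack, target, judge, trials=20, seed=0):
    """Fraction of successes when one attack is repeated against reseeded targets."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        response = target(attack, rng.randrange(2**32))
        if judge(attack, response):
            successes += 1
    return successes / trials

def attack_success_rate(scores):
    """Sample mean over per-attack success estimates."""
    return sum(scores) / len(scores)
```

With a real target, the per-attack estimate is noisy, so the number of trials controls how tightly it approximates the underlying success probability.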
[NLP-49] CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback
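Quick read: This paper addresses the scarcity of high-quality instruction-code pairs for training code-generation large language models (LLMs) and the uncontrolled quality of synthetic data. Existing methods are either limited to augmenting code or depend on predefined heuristics, and they lack rigorous validation, so the synthetic data tends to be ungrounded, repetitive, or overly simple. The key to the solution is the CodeEvo framework, which synthesizes code data adaptively through iterative interaction between two LLM agents: a Coder, which generates candidate code and test cases, and a Reviewer, which provides feedback and new instructions. It also introduces a hybrid feedback mechanism that combines compiler determinism with generative flexibility, giving automatic quality control throughout synthesis.
Link: https://arxiv.org/abs/2507.22080
Authors: Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan
Affiliations: Shanghai AI Laboratory; The University of Hong Kong; New York University; Carnegie Mellon University; Shanghai Innovation Institute
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress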
Abstract:Acquiring high-quality instruction-code pairs is essential for training Large Language Models (LLMs) for code generation. Manually curated data is expensive and inherently limited in scale, motivating the development of code-centric synthesis methods. Yet, current approaches either focus on augmenting existing code or rely on predefined heuristics, both lacking rigorous data validation, which results in synthetic data that is ungrounded, repetitive, or overly simplistic. Inspired by collaborative programming practices, we propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents: a Coder, which generates candidate code and test cases based on given instructions, and a Reviewer, which guides the synthesis process by producing new instructions and feedback. We further introduce a hybrid feedback mechanism that combines compiler determinism with the generative flexibility of agents, enabling automatic quality control throughout synthesis. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks with various difficulties. In-depth analyses further provide insights from multiple perspectives into effective code-centric data synthesis.
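The Coder and Reviewer loop with hybrid feedback can be sketched roughly as follows. Everything here is an assumption for illustration: `stub_coder` and `stub_reviewer` are hypothetical placeholders for LLM agents, and executing Python with `compile`/`exec` stands in for whatever compiler-based check the paper uses.

```python
def run_candidate(code, tests):
    """Deterministic check: compile the candidate, then execute its test cases."""
    try:
        namespace = {}
        exec(compile(code, "<candidate>", "exec"), namespace)
        exec(compile(tests, "<tests>", "exec"), namespace)
    except Exception as err:
        return False, f"{type(err).__name__}: {err}"
    return True, "all tests passed"

def synthesize(instruction, coder, reviewer, max_rounds=3):
    """Keep an instruction-code pair only if both checks agree it is sound."""
    accepted, feedback = [], ""
    for _ in range(max_rounds):
        code, tests = coder(instruction, feedback)
        ok, report = run_candidate(code, tests)        # deterministic signal
        verdict = reviewer(instruction, code, report)  # generative signal
        if ok and verdict["accept"]:
            accepted.append((instruction, code))
            instruction, feedback = verdict["next_instruction"], ""
        else:
            feedback = verdict["feedback"] or report
    return accepted

def stub_coder(instruction, feedback):
    """Hypothetical Coder: buggy on the first try, fixed once it gets feedback."""
    body = "return a + b" if feedback else "return a - b"
    return f"def add(a, b):\n    {body}\n", "assert add(2, 3) == 5\n"

def stub_reviewer(instruction, code, report):
    """Hypothetical Reviewer: accepts passing code and proposes a new instruction."""
    passed = report == "all tests passed"
    return {"accept": passed,
            "feedback": "" if passed else f"fix: {report}",
            "next_instruction": instruction + " (harder variant)"}
```

Running untrusted generated code with `exec` is unsafe outside a sandbox; a real pipeline would isolate this step.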
[NLP-50] CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs
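Quick read: This paper addresses the limitations of large multimodal models on complex, multi-step cross-modal instructions, such as weak logical reasoning, difficulty integrating dynamic feedback, and the absence of an iterative self-correction mechanism. The key to the solution is CIMR (Contextualized Iterative Multimodal Reasoning), a framework that introduces a context-aware iterative reasoning and self-correction module: after initial reasoning and response generation, it refines the output over multiple rounds using parsed multimodal feedback. A dynamic fusion module deeply integrates textual, visual, and contextual features at every step, markedly improving accuracy and robustness on complex tasks.
Link: https://arxiv.org/abs/2507.22074
Authors: Yangshu Yuan, Heng Chen, Xinyi Jiang, Christian Ng, Kexin Qiu
Affiliations: Singapore Institute of Management
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: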
Abstract:The rapid advancement of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) has enhanced our ability to process and generate human language and visual information. However, these models often struggle with complex, multi-step multi-modal instructions that require logical reasoning, dynamic feedback integration, and iterative self-correction. To address this, we propose CIMR: Contextualized Iterative Multimodal Reasoning, a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR operates in two stages: initial reasoning and response generation, followed by iterative refinement using parsed multi-modal feedback. A dynamic fusion module deeply integrates textual, visual, and contextual features at each step. We fine-tune LLaVA-1.5-7B on the Visual Instruction Tuning (VIT) dataset and evaluate CIMR on the newly introduced Multi-modal Action Planning (MAP) dataset. CIMR achieves 91.5% accuracy, outperforming state-of-the-art models such as GPT-4V (89.2%), LLaVA-1.5 (78.5%), MiniGPT-4 (75.3%), and InstructBLIP (72.8%), demonstrating the efficacy of its iterative reasoning and self-correction capabilities in complex tasks.
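Computer Vision
[CV-0] Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation ICCV2025
Quick read: This paper addresses insufficient multimodal fusion, shallow understanding of audio content, and weak complex reasoning in current referring audio-visual segmentation (RAVS). The key to the solution is the OmniAVS dataset and the Omnimodal Instructed Segmentation Assistant (OISA) model. OmniAVS introduces 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues, and it emphasizes deep understanding of audio semantics and complex reasoning driven by world knowledge. OISA builds on a Multimodal Large Language Model (MLLM) to understand complex multimodal cues and perform reasoning-based segmentation, substantially improving performance on OmniAVS and remaining competitive on related tasks.
Link: https://arxiv.org/abs/2507.22886
Authors: Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project Page: this https URL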
Abstract:Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting their presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce Omnimodal Instructed Segmentation Assistant (OISA), to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
[CV-1] Viser: Imperative Web-based 3D Visualization in Python
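Quick read: This paper addresses the lack of easy-to-use, extensible 3D visualization tools in computer vision and robotics. Existing tools are often complicated to configure, have unintuitive APIs, and fit poorly into modern programming paradigms and development workflows. The key to the solution is Viser, a 3D visualization library for Python. Its core design choices are an imperative-style API, which speeds up development, and a viewer built on web technology, which improves compatibility with modern software engineering practice.
Link: https://arxiv.org/abs/2507.22885
Authors: Brent Yi, Chung Min Kim, Justin Kerr, Gina Wu, Rebecca Feng, Anthony Zhang, Jonas Kulhanek, Hongsuk Choi, Yi Ma, Matthew Tancik, Angjoo Kanazawa
Affiliations: UC Berkeley; CTU in Prague; ETH Zurich; HKU; Luma AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Code and docs: this https URL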
Abstract:We present Viser, a 3D visualization library for computer vision and robotics. Viser aims to bring easy and extensible 3D visualization to Python: we provide a comprehensive set of 3D scene and 2D GUI primitives, which can be used independently with minimal setup or composed to build specialized interfaces. This technical report describes Viser’s features, interface, and implementation. Key design choices include an imperative-style API and a web-based viewer, which improve compatibility with modern programming patterns and workflows.
[CV-2] LCS: An AI-based Low-Complexity Scaler for Power-Efficient Super-Resolution of Game Content
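Quick read: This paper addresses the sharp growth in GPU workload caused by increasingly complex content rendering in modern games. The key to the solution is an AI-based Low-Complexity Scaler (LCS), inspired by state-of-the-art Efficient Super-Resolution (ESR) models, that moves the costly image upscaling task from the GPU to a low-power device such as a neural processing unit (NPU). LCS is trained on pairs of game images natively rendered at low and high resolution, uses adversarial training to improve reconstruction of perceptually important details, and applies reparameterization and quantization to reduce model complexity and size so that it runs efficiently on resource-constrained devices. Across five evaluation metrics it shows better perceptual quality than AMD's hardware-based Edge Adaptive Scaling Function (EASF) and FidelityFX Super Resolution 1 (FSR1).
Link: https://arxiv.org/abs/2507.22873
Authors: Simon Pochinda, Momen K. Tageldeen, Mark Thompson, Tony Rinaldi, Troy Giorshev, Keith Lee, Jie Zhou, Frederick Walls
Affiliations: Advanced Micro Devices, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures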
Abstract:The increasing complexity of content rendering in modern games has led to a problematic growth in the workload of the GPU. In this paper, we propose an AI-based low-complexity scaler (LCS) inspired by state-of-the-art efficient super-resolution (ESR) models which could offload the workload on the GPU to a low-power device such as a neural processing unit (NPU). The LCS is trained on GameIR image pairs natively rendered at low and high resolution. We utilize adversarial training to encourage reconstruction of perceptually important details, and apply reparameterization and quantization techniques to reduce model complexity and size. In our comparative analysis we evaluate the LCS alongside the publicly available AMD hardware-based Edge Adaptive Scaling Function (EASF) and AMD FidelityFX Super Resolution 1 (FSR1) on five different metrics, and find that the LCS achieves better perceptual quality, demonstrating the potential of ESR models for upscaling on resource-constrained devices.
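The reparameterization idea mentioned in this abstract can be illustrated with a one-dimensional toy case. This is a generic structural reparameterization sketch, assuming a training-time block with a 3-tap branch, a 1-tap branch, and an identity skip; the abstract does not describe the actual LCS block layout, so the branch structure here is hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; the kernel length must be odd."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def train_time_block(signal, k3, k1):
    """Multi-branch form used during training: 3-tap + 1-tap + identity."""
    a = conv1d_same(signal, k3)
    b = conv1d_same(signal, k1)
    return [x + y + s for x, y, s in zip(a, b, signal)]

def merge_branches(k3, k1):
    """Fold all three branches into a single 3-tap kernel for inference."""
    return [k3[0], k3[1] + k1[0] + 1.0, k3[2]]
```

Because every branch is linear, the merged kernel produces the same output with a single convolution, which is what makes the inference-time model cheaper than the training-time one.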
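[CV-3] TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning ICCV2025
Quick read: This paper addresses the high cost of fine-tuning large pre-trained models for vision tasks despite their strong performance, and in particular the suboptimal efficiency and accuracy of existing Parameter-Efficient Fine-Tuning (PEFT) methods that are not task-specific. The key to the solution is a task-driven framework, Task-Relevant Parameter and Token Selection (TR-PTS), which jointly optimizes along the parameter and token dimensions. On the parameter side, it uses the Fisher Information Matrix (FIM) for layer-wise selection and updates only the most informative parameters. On the token side, it dynamically keeps the key tokens and merges redundant ones, which substantially reduces computation and focuses the model on task-discriminative features. Experiments show that TR-PTS surpasses full fine-tuning by 3.40% on FGVC and 10.35% on VTAB-1k, reaching state-of-the-art performance.
Link: https://arxiv.org/abs/2507.22872
Authors: Siqi Luo, Haoran Yang, Yi Xin, Mingyang Yi, Guangyang Wu, Guangtao Zhai, Xiaohong Liu
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Renmin University of China; Suzhou Key Laboratory of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025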
Abstract:Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmarks, including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively. The code is available at this https URL.
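A rough sketch of layer-wise parameter selection with a diagonal empirical Fisher estimate. The approximation (mean squared per-sample gradient), the fixed `keep_ratio`, and the input layout are assumptions for illustration; the abstract does not give the exact scoring or budget rule used by TR-PTS.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean squared gradient of each parameter."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def select_per_layer(grads_by_layer, keep_ratio=0.25):
    """Return a trainable mask per layer; True marks a parameter to fine-tune."""
    masks = {}
    for layer, grads in grads_by_layer.items():
        fisher = diagonal_fisher(grads)
        k = max(1, round(keep_ratio * len(fisher)))  # keep at least one per layer
        top = set(sorted(range(len(fisher)),
                         key=lambda i: fisher[i], reverse=True)[:k])
        masks[layer] = [i in top for i in range(len(fisher))]
    return masks
```

Parameters whose mask entry is False would stay frozen during fine-tuning, so only the selected subset receives gradient updates.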
[CV-4] Mesh based segmentation for automated margin line generation on incisors receiving crown treatment
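Quick read: This paper addresses the non-repeatable and inconsistent manual definition of the margin line during dental crown design. The traditional approach relies on a dental technician drawing the margin line by hand on scan data, which is slow and subjective. The key to the solution is a deep-learning framework for automatic margin line detection. First, an incisor dataset from a collaborating dental laboratory is used to train a modified mesh-based neural network that segments the prepared tooth into two regions, locating the boundary faces that contain the margin line. Next, several models are trained with k-fold cross-validation and combined with a voting classifier to make the segmentation more robust. Finally, graph cut optimization and spline fitting smooth and refine the boundary to predict the margin line. Experiments show that the ensemble model succeeds on 7 of 13 test cases under a 200 μm error threshold, and that the better the preparation quality, the smaller the deviation between predicted and ground-truth margin lines (Spearman correlation coefficient of -0.683).
Link: https://arxiv.org/abs/2507.22859
Authors: Ammar Alsheghri, Ying Zhang, Farnoosh Ghadiri, Julia Keren, Farida Cheriet, Francois Guibault
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: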
Abstract:Dental crowns are essential dental treatments for restoring damaged or missing teeth of patients. Recent design approaches of dental crowns are carried out using commercial dental design software. Once a scan of a preparation is uploaded to the software, a dental technician needs to manually define a precise margin line on the preparation surface, which constitutes a non-repeatable and inconsistent procedure. This work proposes a new framework to determine margin lines automatically and accurately using deep learning. A dataset of incisor teeth was provided by a collaborating dental laboratory to train a deep learning segmentation model. A mesh-based neural network was modified by changing its input channels and used to segment the prepared tooth into two regions such that the margin line is contained within the boundary faces separating the two regions. Next, k-fold cross-validation was used to train 5 models, and a voting classifier technique was used to combine their results to enhance the segmentation. After that, boundary smoothing and optimization using the graph cut method were applied to refine the segmentation results. Then, boundary faces separating the two regions were selected to represent the margin line faces. A spline was approximated to best fit the centers of the boundary faces to predict the margin line. Our results show that an ensemble model combined with maximum probability predicted the highest number of successful test cases (7 out of 13) based on a maximum distance threshold of 200 μm (representing human error) between the predicted and ground truth point clouds. It was also demonstrated that the better the quality of the preparation, the smaller the divergence between the predicted and ground truth margin lines (Spearman’s rank correlation coefficient of -0.683). We provide the train and test datasets for the community.
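Two steps from this abstract can be sketched in a few lines: combining the five models' per-face outputs, and checking a prediction against the 200 μm threshold. Both are assumptions about details the abstract leaves open: "maximum probability" is read here as taking the label of the most confident model, and the distance is taken as the largest nearest-neighbor distance from predicted points to ground-truth points.

```python
import math

def majority_vote(face_probs):
    """Per face: label 1 if more than half of the models predict region 1."""
    return [1 if sum(p >= 0.5 for p in probs) * 2 > len(probs) else 0
            for probs in face_probs]

def max_probability_vote(face_probs):
    """Per face: take the label of the single most confident model."""
    labels = []
    for probs in face_probs:  # probs holds P(region 1) from each model
        most_confident = max(probs, key=lambda p: abs(p - 0.5))
        labels.append(1 if most_confident >= 0.5 else 0)
    return labels

def max_nearest_distance(predicted, truth):
    """Largest distance from a predicted point to its closest ground-truth point."""
    return max(min(math.dist(p, q) for q in truth) for p in predicted)

def is_success(predicted, truth, threshold_um=200.0):
    """Success if no predicted point strays more than the threshold (in μm)."""
    return max_nearest_distance(predicted, truth) <= threshold_um
```

The two voting rules can disagree when one model is very confident and the others are only mildly confident in the opposite direction, which is why the choice of combination rule matters.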
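[CV-5] Tapping into the Black Box: Uncovering Aligned Representations in Pretrained Neural Networks
Quick read: This paper addresses the limited interpretability of deep neural networks, specifically how to extract from a trained ReLU network feature representations that have clear semantics and align closely with the input and the target. The key to the solution is a gradient computation called excitation pullbacks: a simple modification of the backward pass maps the decision boundary of the linear model that the network implicitly learns back to the input space, revealing high-resolution, target-specific interpretable features. The method suggests that even complex architectures rely on recoverable, perceptually aligned, interpretable patterns.
Link: https://arxiv.org/abs/2507.22832
Authors: Maciej Satkiewicz
Affiliations: 314 Foundation
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 15 pages, 4 figures, preprint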
Abstract:In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into. We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment on a number of popular ImageNet-pretrained deep architectures. This strongly suggests that neural networks do, in fact, rely on learned interpretable patterns that can be recovered after training. Thus, our findings may have profound implications for knowledge discovery and the development of dependable artificial systems.
[CV-6] CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models
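Quick read: This paper addresses the privacy risk of semantic leakage from intermediate features when Vision-Language Models (VLMs) are deployed in split-DNN architectures. Existing image reconstruction methods usually produce blurry, semantically ambiguous results and therefore do not capture semantic leakage directly. The core solution is CapRecover, a cross-modality inversion framework that recovers high-level semantic content (such as labels or captions) directly from intermediate features without reconstructing the original image. The key points are: 1) end-to-end training that extracts semantic information from intermediate features accurately; 2) the finding that deeper convolutional layers encode more semantic information than shallow ones, which guides protection strategies; 3) a lightweight noise mechanism, adding random noise at each layer and removing it in the next, that blocks semantic leakage with no additional training cost.
Link: https://arxiv.org/abs/2507.22828
Authors: Kedong Xiu, Saiqian Zhang
Affiliations: New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, accepted by the 2025 ACM Multimedia Conference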
Abstract:As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations, with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud, there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs. Related DOI: https://doi.org/10.1145/3746027.3755203
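The protection method at the end of this abstract (add noise at one layer, remove it at the next) can be sketched as below. The shared-seed mechanism is an assumption made so the example is self-contained; the abstract does not say how the noise is generated or how the next layer knows what to subtract.

```python
import random

def make_noise(seed, size, scale=1.0):
    """Regenerate the same Gaussian noise vector from a shared seed (assumed)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(size)]

def protect(features, seed, scale=1.0):
    """Add noise before the features leave the current layer."""
    noise = make_noise(seed, len(features), scale)
    return [f + n for f, n in zip(features, noise)]

def recover(noisy, seed, scale=1.0):
    """Subtract the same noise at the next layer before continuing inference."""
    noise = make_noise(seed, len(noisy), scale)
    return [x - n for x, n in zip(noisy, noise)]
```

An observer who sees only the noisy features, without the seed, gets a perturbed signal, while the legitimate next layer restores the original values up to floating-point rounding.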
[CV-7] ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
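Quick read: This paper addresses a key challenge in automatically converting user interface (UI) designs into front-end code: existing methods based on natural language prompts struggle to capture spatial layout and visual design intent, which limits the structural soundness and visual fidelity of the generated code. The core of the solution is a modular multi-agent framework with three interpretable stages. In the grounding stage, a vision-language model (VLM) detects and labels UI components. In the planning stage, front-end engineering priors are used to build a hierarchical layout. In the generation stage, HTML/CSS code is synthesized through adaptive prompting. The architecture improves robustness, interpretability, and code quality, and is further extended into a scalable data engine that automatically builds large-scale image-code pairs for fine-tuning a VLM, reaching state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
Link: https://arxiv.org/abs/2507.22827
Authors: Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
Affiliations: CUHK MMLab; CUHK ARISE Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: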
Abstract:Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at this https URL.
[CV-8] DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion ICCV2025
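Quick read: This paper addresses how to use depth information effectively in single-view scene reconstruction to improve reconstruction quality and generalization. Existing methods usually use the depth map only at inference time for object layout estimation and fail to exploit the rich geometric prior it carries. The key to the solution is the DepR framework, which uses a depth-guided diffusion model and applies depth throughout both training and inference. Depth-guided conditioning encodes shape priors into the diffusion model, and at inference depth guides DDIM sampling and layout optimization, improving consistency between the reconstruction and the input image. With this design the model achieves strong reconstruction performance and good cross-domain generalization even when trained on limited synthetic data.
Link: https://arxiv.org/abs/2507.22825
Authors: Qingcheng Zhao, Xiang Zhang, Haiyang Xu, Zeyuan Chen, Jianwen Xie, Yuan Gao, Zhuowen Tu
Affiliations: ShanghaiTech University; UC San Diego; Lambda, Inc.; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025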
Abstract:We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. Instead of reconstructing the entire scene holistically, DepR generates individual objects and subsequently composes them into a coherent 3D layout. Unlike previous methods that use depth solely for object layout estimation during inference and therefore fail to fully exploit its rich geometric information, DepR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into diffusion models. During inference, depth further guides DDIM sampling and layout optimization, enhancing alignment between the reconstruction and the input image. Despite being trained on limited synthetic data, DepR achieves state-of-the-art performance and demonstrates strong generalization in single-view scene reconstruction, as shown through evaluations on both synthetic and real-world datasets.
[CV-9] Bi-Level Optimization for Self-Supervised AI-Generated Face Detection
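Quick read: This paper addresses the poor generalization of supervised AI-generated face detectors to emerging generative techniques: existing methods are usually trained on images synthesized by specific generators and adapt poorly to unseen ones. The key to the solution is a self-supervised learning framework based on bi-level optimization. The inner loop trains a vision encoder with linearly weighted pretext tasks (classification of EXIF tags, ranking of EXIF tags, and detection of artificial face manipulations). The outer loop optimizes the weights of these pretext tasks to improve coarse-grained detection of manipulated faces, which serves as a proxy task for identifying AI-generated faces and brings pretraining closer to the final detection goal. After pretraining the encoder is frozen, and detection is done efficiently either with a Gaussian mixture model or with a lightweight two-layer perceptron. Experiments show that the method significantly outperforms existing approaches in both one-class and binary classification settings and generalizes strongly.
Link: https://arxiv.org/abs/2507.22824
Authors: Mian Zou, Nan Zhong, Baosheng Yu, Yibing Zhan, Kede Ma
Affiliations: Jiangxi University of Finance and Economics; City University of Hong Kong; Nanyang Technological University; Yunnan United Vision Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: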
Abstract:AI-generated face detectors trained via supervised learning typically rely on synthesized images from specific generators, limiting their generalization to emerging generative techniques. To overcome this limitation, we introduce a self-supervised method based on bi-level optimization. In the inner loop, we pretrain a vision encoder only on photographic face images using a set of linearly weighted pretext tasks: classification of categorical exchangeable image file format (EXIF) tags, ranking of ordinal EXIF tags, and detection of artificial face manipulations. The outer loop then optimizes the relative weights of these pretext tasks to enhance the coarse-grained detection of manipulated faces, serving as a proxy task for identifying AI-generated faces. In doing so, it aligns self-supervised learning more closely with the ultimate goal of AI-generated face detection. Once pretrained, the encoder remains fixed, and AI-generated faces are detected either as anomalies under a Gaussian mixture model fitted to photographic face features or by a lightweight two-layer perceptron serving as a binary classifier. Extensive experiments demonstrate that our detectors significantly outperform existing approaches in both one-class and binary classification settings, exhibiting strong generalization to unseen generators.
[CV-10] Wall Shear Stress Estimation in Abdominal Aortic Aneurysms: Towards Generalisable Neural Surrogate Models
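Quick read: This paper addresses the computational cost of estimating hemodynamic parameters in abdominal aortic aneurysms (AAAs). Traditional methods based on Computational Fluid Dynamics (CFD) are accurate but slow, which makes fast clinical use difficult. The key to the solution is an E(3)-equivariant geometric deep learning model that uses novel robust geometric descriptors and projective geometric algebra to predict transient Wall Shear Stress (WSS) directly from 3D vessel geometry obtained from CT scans, with reference CFD simulations under varying boundary conditions included in training to improve generalization. The method predicts in seconds and is robust to geometry remodelling, changes in boundary conditions, newly added branches, and differences in mesh resolution, making hemodynamic parameter estimation considerably more practical for clinical use.
Link: https://arxiv.org/abs/2507.22817
Authors: Patryk Rygiel, Julian Suk, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: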
Abstract:Abdominal aortic aneurysms (AAAs) are pathologic dilatations of the abdominal aorta posing a high fatality risk upon rupture. Studying AAA progression and rupture risk often involves in-silico blood flow modelling with computational fluid dynamics (CFD) and extraction of hemodynamic factors like time-averaged wall shear stress (TAWSS) or oscillatory shear index (OSI). However, CFD simulations are known to be computationally demanding. Hence, in recent years, geometric deep learning methods, operating directly on 3D shapes, have been proposed as compelling surrogates, estimating hemodynamic parameters in just a few seconds. In this work, we propose a geometric deep learning approach to estimating hemodynamics in AAA patients, and study its generalisability to common factors of real-world variation. We propose an E(3)-equivariant deep learning model utilising novel robust geometrical descriptors and projective geometric algebra. Our model is trained to estimate transient WSS using a dataset of CT scans of 100 AAA patients, from which lumen geometries are extracted and reference CFD simulations with varying boundary conditions are obtained. Results show that the model generalizes well within the distribution, as well as to the external test set. Moreover, the model can accurately estimate hemodynamics across geometry remodelling and changes in boundary conditions. Furthermore, we find that a trained model can be applied to different artery tree topologies, where new and unseen branches are added during inference. Finally, we find that the model is to a large extent agnostic to mesh resolution. These results show the accuracy and generalisation of the proposed model, and highlight its potential to contribute to hemodynamic parameter estimation in clinical practice.
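For readers unfamiliar with the two hemodynamic factors named in this abstract, their commonly used definitions are given below. These are the standard textbook forms, stated here as an assumption; the abstract itself does not spell them out. The symbols are illustrative: \tau_w(x, t) is the wall shear stress vector at surface point x and time t, and T is the length of one cardiac cycle.

```latex
\mathrm{TAWSS}(x) = \frac{1}{T} \int_0^T \left\lVert \tau_w(x, t) \right\rVert \, dt

\mathrm{OSI}(x) = \frac{1}{2} \left( 1 -
  \frac{\left\lVert \int_0^T \tau_w(x, t) \, dt \right\rVert}
       {\int_0^T \left\lVert \tau_w(x, t) \right\rVert \, dt} \right)
```

Both quantities are derived from the transient WSS field, so a surrogate that predicts transient WSS also gives access to TAWSS and OSI. OSI ranges from 0 (shear always in the same direction) to 0.5 (fully oscillatory shear).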
[CV-11] DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion ICCV2025
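Quick read: This paper addresses Trojan (backdoor) attacks on deep neural networks in real-world use, where a malicious trigger implanted during training causes wrong predictions on specific inputs and is hard to detect. Existing trigger inversion methods usually rely on strong assumptions about trigger appearance or search the full pixel space, and cannot guarantee that the reconstructed trigger is more than an ordinary adversarial perturbation. The paper proposes DISTIL, a data-free, zero-shot trigger inversion strategy. Its key idea is a diffusion-based generator whose iterative generation is guided by the target classifier, which restricts the search to a latent space tied to the model's internal representations. The resulting candidate triggers reflect the feature patterns the model relies on for its backdoor behavior, giving efficient and reliable trigger reconstruction.
Link: https://arxiv.org/abs/2507.22813
Authors: Hossein Mirzaei, Zeinab Taghavi, Sepehr Rezaee, Masoud Hadi, Moein Madadi, Mackenzie W. Mathis
Affiliations: École Polytechnique Fédérale de Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025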
Abstract:Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to Trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion – reconstructing malicious “shortcut” patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus Trojaned models. DISTIL surpasses alternative methods by high margins, achieving up to 7.1% higher accuracy on the BackdoorBench dataset and a 9.4% improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense without reliance on extensive data or strong prior assumptions about triggers. The code is available at this https URL.
[CV-12] MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
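Quick read: This paper addresses the high training and inference cost of Vision Large Language Models (VLLMs) on complex, fine-grained visual information, together with insufficient visual feature extraction and difficult cross-modal alignment. The key to the solution is MoCHA, a visual framework that integrates four vision backbones (CLIP, SigLIP, DINOv2, and ConvNeXt) to extract complementary visual features and introduces a sparse Mixture of Experts Connectors (MoECs) module that dynamically selects experts for different visual dimensions. It also designs Hierarchical Group Attention (HGA) with an adaptive gating strategy to mitigate redundant or insufficient use of the visual information encoded by the MoECs module, improving both performance and robustness.
Link: https://arxiv.org/abs/2507.22805
Authors: Yuqi Pang, Bowen Yang, Yun Cao, Fan Rong, Xiaoyu Li, Chen He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: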
Abstract:Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.
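The sparse expert selection idea behind the MoECs module can be illustrated with a generic top-k routing sketch. This is standard sparse mixture-of-experts gating under assumed shapes and names (`gate_weights`, `experts`, `top_k` are all hypothetical); the abstract does not specify MoCHA's actual router or connector design.

```python
import math

def softmax(values):
    """Numerically stable softmax over a short list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(token, gate_weights, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    # One gate logit per expert: dot product of a gate row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    chosen = sorted(range(len(logits)),
                    key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])  # renormalize over the chosen
    output = [0.0] * len(token)
    for weight, index in zip(weights, chosen):
        expert_out = experts[index](token)
        output = [o + weight * v for o, v in zip(output, expert_out)]
    return output, chosen
```

Only the chosen experts are evaluated for each token, which is where the compute saving of a sparse mixture comes from.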
[CV-13] Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings MICCAI2025
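Quick read: This paper addresses the difficulty of obtaining high-quality ultrasound images for fetal biometric measurements (such as abdominal circumference) in low-income countries, where trained sonographers are scarce. The core solution takes FetalCLIP, a vision-language model pretrained on 210,000 fetal ultrasound image-caption pairs, and fine-tunes it with Low-Rank Adaptation (LoRA) into a dedicated Image Quality Assessment (IQA) model, FetalCLIP_CLS. Repurposing an adapted segmentation model for classification improves performance further, reaching an F1 score of 0.771 on the ACOUSLIC-AI dataset. The results support the technical feasibility of parameter-efficient fine-tuning of fetal ultrasound foundation models for advancing prenatal care in resource-limited settings.
Link: https://arxiv.org/abs/2507.22802
Authors: Dongli He, Hu Wang, Mohammad Yaqub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the MICCAI 2025 MIRASOL Workshop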
Abstract:Accurate fetal biometric measurements, such as abdominal circumference, play a vital role in prenatal care. However, obtaining high-quality ultrasound images for these measurements heavily depends on the expertise of sonographers, posing a significant challenge in low-income countries due to the scarcity of trained personnel. To address this issue, we leverage FetalCLIP, a vision-language model pretrained on a curated dataset of over 210,000 fetal ultrasound image-caption pairs, to perform automated fetal ultrasound image quality assessment (IQA) on blind-sweep ultrasound data. We introduce FetalCLIP_CLS, an IQA model adapted from FetalCLIP using Low-Rank Adaptation (LoRA), and evaluate it on the ACOUSLIC-AI dataset against six CNN and Transformer baselines. FetalCLIP_CLS achieves the highest F1 score of 0.757. Moreover, we show that an adapted segmentation model, when repurposed for classification, further improves performance, achieving an F1 score of 0.771. Our work demonstrates how parameter-efficient fine-tuning of fetal ultrasound foundation models can enable task-specific adaptations, advancing prenatal care in resource-limited settings. The experimental code is available at: this https URL.
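Low-Rank Adaptation, the fine-tuning technique named in this abstract, can be sketched in plain Python as a frozen weight matrix plus a scaled low-rank update. The matrix shapes, the `alpha` scaling, and the helper names are the usual LoRA conventions assumed for illustration; which FetalCLIP layers were adapted and with what rank is not stated in the abstract.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_forward(x, W, A, B, alpha=1.0):
    """Compute x @ W + (alpha / r) * (x @ A) @ B with W kept frozen.

    Shapes: x is (n, d_in), W is (d_in, d_out), A is (d_in, r), B is (r, d_out).
    Only A and B would be trained; r is the adapter rank.
    """
    rank = len(B)
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)]
```

Initializing B to zeros makes the adapted model start out identical to the pretrained one, and the number of trainable values is r * (d_in + d_out) instead of d_in * d_out.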
[CV-14] Segment Anything for Video: A Comprehensive Review of Video Object Segmentation and Tracking from Past to Future
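[Quick Read]: This survey addresses long-standing challenges in Video Object Segmentation and Tracking (VOST): weak domain generalization, poor temporal consistency, and low computational efficiency. It reviews methods built on foundation models such as the Segment Anything Model (SAM) and its successor SAM2, which enable prompt-driven segmentation, and organizes them along three temporal dimensions: the past (strategies that retain and update historical information), the present (extracting and optimizing discriminative features from the current frame), and the future (motion prediction and trajectory estimation that anticipate object dynamics). It highlights recent innovations such as motion-aware memory selection and trajectory-guided prompting, which aim to improve both accuracy and efficiency, and traces the evolution from early memory-based architectures to the streaming memory and real-time segmentation capabilities of SAM2.
Link: https://arxiv.org/abs/2507.22792
Authors: Guoping Xu, Jayaram K. Udupa, Yajun Yu, Hua-Chieh Shao, Songlin Zhao, Wei Liu, You Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages, 21 figures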
Abstract:Video Object Segmentation and Tracking (VOST) presents a complex yet critical challenge in computer vision, requiring robust integration of segmentation and tracking across temporally dynamic frames. Traditional methods have struggled with domain generalization, temporal consistency, and computational efficiency. The emergence of foundation models like the Segment Anything Model (SAM) and its successor, SAM2, has introduced a paradigm shift, enabling prompt-driven segmentation with strong generalization capabilities. Building upon these advances, this survey provides a comprehensive review of SAM/SAM2-based methods for VOST, structured along three temporal dimensions: past, present, and future. We examine strategies for retaining and updating historical information (past), approaches for extracting and optimizing discriminative features from the current frame (present), and motion prediction and trajectory estimation mechanisms for anticipating object dynamics in subsequent frames (future). In doing so, we highlight the evolution from early memory-based architectures to the streaming memory and real-time segmentation capabilities of SAM2. We also discuss recent innovations such as motion-aware memory selection and trajectory-guided prompting, which aim to enhance both accuracy and efficiency. Finally, we identify remaining challenges including memory redundancy, error accumulation, and prompt inefficiency, and suggest promising directions for future research. This survey offers a timely and structured overview of the field, aiming to guide researchers and practitioners in advancing the state of VOST through the lens of foundation models.
[CV-15] Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques
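[Quick Read]: This survey addresses the central challenge of modality-based feature matching: establishing robust and accurate feature correspondences across data modalities such as RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional handcrafted methods (e.g., SIFT, ORB) are robust under moderate intra-modality variation but struggle with large modality gaps, whereas modern deep learning methods (e.g., the CNN-based SuperPoint and the Transformer-based LoFTR) substantially improve robustness and adaptability across modalities. The survey emphasizes modality-aware mechanisms, including geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced networks for LiDAR scans, and specialized solutions such as the MIND descriptor for medical image matching.
Link: https://arxiv.org/abs/2507.22791
Authors: Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin
Affiliations: Nanyang Technological University; Cardiff University; Lancaster University; University of Electronic Science and Technology of China; Agency for Science, Technology and Research (A*STAR)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: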
Abstract:Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.
[CV-16] HOLA: Enhancing Audio-visual Deepfake Detection via Hierarchical Contextual Aggregations and Efficient Pre-training
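[Quick Read]: This paper addresses video-level deepfake detection, which advances in Generative AI have made increasingly difficult and which exposes the limits of current detectors. The key to the solution is HOLA, a unified two-stage framework that scales audio-visual self-supervised pre-training on a self-built dataset of 1.81M samples. HOLA combines an iterative-aware cross-modal learning module for selective audio-visual interaction, hierarchical contextual modeling with gated aggregation under a local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancement, together with a pseudo-supervised signal injection strategy that further boosts performance. HOLA ranks 1st on the TestA set of the 2025 1M-Deepfakes Detection Challenge, outperforming the second-place entry by 0.0476 AUC.
Link: https://arxiv.org/abs/2507.22781
Authors: Xuecheng Wu, Danlei Huang, Heli Sun, Xinyi Yin, Yifan Wang, Hao Wang, Jia Zhang, Fei Wang, Peihao Guo, Suyu Xing, Junxiao Xue, Liang He
Affiliations: Xi’an Jiaotong University; Zhengzhou University; University of Science and Technology of China; Dalian University of Technology; Zhejiang Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: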
Abstract:Advances in Generative AI have made video-level deepfake detection increasingly challenging, exposing the limitations of current detection techniques. In this paper, we present HOLA, our solution to the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge. Inspired by the success of large-scale pre-training in the general domain, we first scale audio-visual self-supervised pre-training for multimodal video-level deepfake detection, leveraging our self-built dataset of 1.81M samples, which leads to a unified two-stage framework. Specifically, HOLA features an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations under the local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancements. Moreover, we propose a pseudo-supervised signal injection strategy to further boost model performance. Extensive experiments across expert models and MLLMs demonstrate the effectiveness of the proposed HOLA. We also conduct a series of ablation studies to explore the crucial design factors of the introduced components. Remarkably, HOLA ranks 1st, outperforming the second-place entry by 0.0476 AUC on the TestA set.
[CV-17] Social-Pose: Enhancing Trajectory Prediction with Human Body Pose
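[Quick Read]: This paper addresses the accuracy of human trajectory prediction for autonomous driving, where existing models often fail to fully exploit the visual cues that people subconsciously convey as they move through space. The key to the solution is "Social-pose", an attention-based pose encoder that captures the body poses of everyone in a scene together with their social relations. It can be integrated into a range of trajectory prediction architectures (LSTM, GAN, MLP, and Transformer) and improves all of them in the reported experiments.
Link: https://arxiv.org/abs/2507.22742
Authors: Yang Gao, Saeed Saadatnejad, Alexandre Alahi
Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); Sportradar; European Union’s Horizon 2020 research, innovation programme
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)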
Abstract:Accurate human trajectory prediction is one of the most crucial tasks for autonomous driving, ensuring its safety. Yet, existing models often fail to fully leverage the visual cues that humans subconsciously communicate when navigating the space. In this work, we study the benefits of predicting human trajectories using human body poses instead of solely their Cartesian space locations in time. We propose 'Social-pose', an attention-based pose encoder that effectively captures the poses of all humans in a scene and their social relations. Our method can be integrated into various trajectory prediction architectures. We have conducted extensive experiments on state-of-the-art models (based on LSTM, GAN, MLP, and Transformer), and showed improvements over all of them on synthetic (Joint Track Auto) and real (Human3.6M, Pedestrians and Cyclists in Road Traffic, and JRDB) datasets. We also explored the advantages of using 2D versus 3D poses, as well as the effect of noisy poses and the application of our pose-based predictor in robot navigation scenarios.
[CV-18] A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
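[Quick Read]: This paper addresses the joint estimation of structure (3D point positions) and linear motion (linear velocity) from 2D point correspondences with arbitrary timestamps, in particular for sensors without synchronized capture such as rolling shutter cameras and event cameras. Classical methods such as the 5-point and 8-point algorithms only handle correspondences between a pair of views that each capture the scene at a single instant, so they cannot process asynchronous data. The key to the solution is to formulate the problem in terms of first-order dynamics under a constant-velocity motion model, which yields a novel linear point incidence relation that efficiently recovers both the linear velocity and the 3D points, with predictable degeneracies and solution multiplicities. The formulation is general: it handles global shutter, rolling shutter, and event cameras, and can combine correspondences from different collocated sensors.
Link: https://arxiv.org/abs/2507.22733
Authors: Hang Su, Yunlong Feng, Daniel Gehrig, Panfeng Jiang, Ling Gao, Xavier Lagorce, Laurent Kneip
Affiliations: ShanghaiTech University; University of Pennsylvania; Amap, Alibaba Group; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: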
Abstract:Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at this https URL.
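The following sketch shows why a constant-velocity model can make the problem linear. It is a simplified illustration under stated assumptions, not the paper's solver: the camera is assumed not to rotate and to sit at position v·t, so a unit bearing f observed at time t for point P satisfies f × (P − v·t) = 0, which is linear and homogeneous in (P, v). The function names and the synthetic scene are hypothetical.

```python
import numpy as np

def skew(f):
    """Cross-product matrix, so that skew(f) @ p == np.cross(f, p)."""
    return np.array([[0.0, -f[2], f[1]],
                     [f[2], 0.0, -f[0]],
                     [-f[1], f[0], 0.0]])

def solve_points_and_velocity(tracks):
    """tracks[j] is a list of (t, f): timestamp and unit bearing of point j.

    Stacks skew(f) @ (P_j - v * t) = 0 over all observations and returns the
    null vector, i.e. the points and velocity up to one shared signed scale.
    """
    m = len(tracks)
    rows = []
    for j, track in enumerate(tracks):
        for t, f in track:
            block = np.zeros((3, 3 * m + 3))
            block[:, 3 * j:3 * j + 3] = skew(f)   # coefficients of P_j
            block[:, 3 * m:] = -t * skew(f)       # coefficients of v
            rows.append(block)
    A = np.vstack(rows)
    # The right singular vector of the smallest singular value spans the null space.
    x = np.linalg.svd(A)[2][-1]
    return x[:3 * m].reshape(m, 3), x[3 * m:]

# Synthetic check: 3 points, each seen 4 times at its own arbitrary timestamps.
rng = np.random.default_rng(0)
v_true = np.array([0.3, -0.1, 0.2])
P_true = rng.uniform(-1.0, 1.0, size=(3, 3)) + np.array([0.0, 0.0, 5.0])
tracks = []
for P in P_true:
    track = []
    for t in rng.uniform(0.0, 1.0, size=4):
        ray = P - v_true * t
        track.append((t, ray / np.linalg.norm(ray)))
    tracks.append(track)

P_est, v_est = solve_points_and_velocity(tracks)
scale = (v_est @ v_true) / (v_est @ v_est)   # fix the unobservable global scale
```

The global scale cannot be recovered from bearings alone, so the check rescales the estimate with the known ground truth; with real data that scale would have to come from another source.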
[CV-19] Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints ICCV2025
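[Quick Read]: This paper addresses two weaknesses of Shape-from-Template (SfT): traditional methods degrade sharply under severe occlusion because point correspondences become unavailable, and correspondence-free methods based on deep neural networks need large amounts of data for supervision. The key to the solution is an unsupervised SfT method that relies only on image observations (color features, gradients, and silhouettes) together with a mesh inextensibility constraint. It reconstructs 400× faster than the best-performing unsupervised SfT method and outperforms existing methods by a large margin in recovering fine details and handling severe occlusions.
Link: https://arxiv.org/abs/2507.22699
Authors: Thuy Tran, Ruochen Chen, Shaifali Parashar
Affiliations: CNRS; École Centrale de Lyon; INSA Lyon; Université Claude Bernard Lyon 1; LIRIS, UMR5205
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025. Total 13 pages, 9 figures, 9 tables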
Abstract:Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance severely degrades when faced with severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform a 3D template to match input images. In this paper, we propose an unsupervised SfT method that uses only image observations (color features, gradients and silhouettes) along with a mesh inextensibility constraint to reconstruct at a 400× faster pace than the best-performing unsupervised SfT. Moreover, when it comes to generating finer details and handling severe occlusions, our method outperforms the existing methodologies by a large margin. Code is available at this https URL.
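A mesh inextensibility constraint can be pictured as a penalty on changes in edge length relative to the template. The sketch below is one common way to write such a penalty and is only an assumption about its form; the abstract does not give the paper's exact formulation, and the tiny square mesh is hypothetical.

```python
import math

def inextensibility_penalty(vertices, edges, rest_lengths):
    """Sum of squared deviations of edge lengths from their template lengths."""
    penalty = 0.0
    for (i, j), rest in zip(edges, rest_lengths):
        penalty += (math.dist(vertices[i], vertices[j]) - rest) ** 2
    return penalty

# Hypothetical template: a unit square split into two triangles along edge (0, 2).
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rest_lengths = [math.dist(template[i], template[j]) for i, j in edges]

# Folding vertex 3 by 90 degrees about the diagonal keeps every edge length.
folded = [template[0], template[1], template[2], (0.5, 0.5, math.sqrt(0.5))]
# Stretching along x changes edge lengths and should be penalised.
stretched = [(1.2 * x, y, z) for x, y, z in template]

fold_penalty = inextensibility_penalty(folded, edges, rest_lengths)
stretch_penalty = inextensibility_penalty(stretched, edges, rest_lengths)
```

The fold leaves the penalty at essentially zero while the stretch does not, which is the behaviour expected of a surface that bends but does not extend.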
[CV-20] Zero-Shot Image Anomaly Detection Using Generative Foundation Models ICCV2025
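[Quick Read]: This paper addresses the detection of Out-of-Distribution (OOD) inputs by vision systems in open-world environments, with the aim of safer deployment. The key to the solution is to treat Denoising Diffusion Models (DDMs) not as generators but as universal perceptual templates: their denoising trajectories carry rich texture and semantic information, and analyzing Stein score errors, amplified through the Structural Similarity Index Metric (SSIM), identifies anomalous samples without re-training on each target dataset. A single model trained only on CelebA reaches near-perfect performance on some benchmarks, with notable headroom on others.
Link: https://arxiv.org/abs/2507.22692
Authors: Lemar Abdi, Amaan Valiuddin, Francisco Caetano, Christiaan Viviers, Fons van der Sommen
Affiliations: Eindhoven University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the workshop of Anomaly Detection with Foundation Models, ICCV 2025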
Abstract:Detecting out-of-distribution (OOD) inputs is pivotal for deploying safe vision systems in open-world environments. We revisit diffusion models, not as generators, but as universal perceptual templates for OOD detection. This research explores the use of score-based generative models as foundational tools for semantic anomaly detection across unseen datasets. Specifically, we leverage the denoising trajectories of Denoising Diffusion Models (DDMs) as a rich source of texture and semantic information. By analyzing Stein score errors, amplified through the Structural Similarity Index Metric (SSIM), we introduce a novel method for identifying anomalous samples without requiring re-training on each target dataset. Our approach improves over state-of-the-art and relies on training a single model on one dataset – CelebA – which we find to be an effective base distribution, even outperforming more commonly used datasets like ImageNet in several settings. Experimental results show near-perfect performance on some benchmarks, with notable headroom on others, highlighting both the strength and future potential of generative foundation models in anomaly detection.
[CV-21] Hydra-Bench: A Benchmark for Multi-Modal Leaf Wetness Sensing
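[Quick Read]: This paper addresses the limited robustness, accuracy, and environmental resilience of leaf wetness detection on natural leaves, which restricts how well existing sensing systems work in real agricultural settings. The key to the solution is a multi-modal dataset of synchronized mmWave raw data, Synthetic Aperture Radar (SAR) images, and RGB images, collected over six months from five plant species in both controlled and outdoor field environments. The dataset comes with benchmarks based on the Hydra model, including comparisons against single-modality baselines, multiple fusion strategies, and varying scan distances, and it can also serve as a benchmark for optimizing future SAR imaging algorithms.
Link: https://arxiv.org/abs/2507.22685
Authors: Yimeng Liu, Maolin Gan, Yidong Ren, Gen Li, Jingkai Lin, Younsuk Dong, Zhichao Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: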
Abstract:Leaf wetness detection is a crucial task in agricultural monitoring, as it directly impacts the prediction and protection of plant diseases. However, existing sensing systems suffer from limitations in robustness, accuracy, and environmental resilience when applied to natural leaves under dynamic real-world conditions. To address these challenges, we introduce a new multi-modal dataset specifically designed for evaluating and advancing machine learning algorithms in leaf wetness detection. Our dataset comprises synchronized mmWave raw data, Synthetic Aperture Radar (SAR) images, and RGB images collected over six months from five diverse plant species in both controlled and outdoor field environments. We provide detailed benchmarks using the Hydra model, including comparisons against single modality baselines and multiple fusion strategies, as well as performance under varying scan distances. Additionally, our dataset can serve as a benchmark for future SAR imaging algorithm optimization, enabling a systematic evaluation of detection accuracy under diverse conditions.
[CV-22] MergeSAM: Unsupervised change detection of remote sensing images based on the Segment Anything Model
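[Quick Read]: This paper addresses change detection in high-resolution remote sensing imagery, in particular the object splitting, merging, and other structurally complex changes that are common in real scenes. The key to the solution is MergeSAM, an unsupervised change detection method based on the Segment Anything Model (SAM). Two strategies, MaskMatching and MaskSplitting, use SAM's object segmentation capability to build multitemporal masks, which embeds the spatial structure of land cover into the change detection process.
Link: https://arxiv.org/abs/2507.22675
Authors: Meiqi Hu, Lingzhi Lu, Chengxi Han, Xiaoping Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages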
Abstract:Recently, large foundation models trained on vast datasets have demonstrated exceptional capabilities in feature extraction and general feature representation. The ongoing advancements in deep learning-driven large models have shown great promise in accelerating unsupervised change detection methods, thereby enhancing the practical applicability of change detection technologies. Building on this progress, this paper introduces MergeSAM, an innovative unsupervised change detection method for high-resolution remote sensing imagery, based on the Segment Anything Model (SAM). Two novel strategies, MaskMatching and MaskSplitting, are designed to address real-world complexities such as object splitting, merging, and other intricate changes. The proposed method fully leverages SAM’s object segmentation capabilities to construct multitemporal masks that capture complex changes, embedding the spatial structure of land cover into the change detection process.
[CV-23] Graph-Guided Dual-Level Augmentation for 3D Scene Segmentation
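[Quick Read]: This paper addresses 3D point cloud segmentation under scarce large-scale annotation. Existing data augmentation methods mostly apply local transformations or semantic recomposition and do not model global structural dependencies within a scene. The key to the solution is a graph-guided data augmentation framework with dual-level constraints: guiding graphs are built from object relationship statistics learned from real data, local-level constraints enforce geometric plausibility and semantic consistency between objects, and global-level constraints preserve scene topology by aligning the generated layout with the guiding graph. The resulting scenes are diverse and of high quality, and they yield consistent segmentation improvements across models.
Link: https://arxiv.org/abs/2507.22668
Authors: Hongbin Lin, Yifan Jiang, Juangui Xu, Jesse Jiaxi Xu, Yi Lu, Zhengyu Hu, Ying-Cong Chen, Hao Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 11 figures, to be published in ACMMM 2025 Conference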
Abstract:3D point cloud segmentation aims to assign semantic labels to individual points in a scene for fine-grained spatial understanding. Existing methods typically adopt data augmentation to alleviate the burden of large-scale annotation. However, most augmentation strategies only focus on local transformations or semantic recomposition, lacking the consideration of global structural dependencies within scenes. To address this limitation, we propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis. Our method learns object relationship statistics from real-world data to construct guiding graphs for scene generation. Local-level constraints enforce geometric plausibility and semantic consistency between objects, while global-level constraints maintain the topological structure of the scene by aligning the generated layout with the guiding graph. Extensive experiments on indoor and outdoor datasets demonstrate that our framework generates diverse and high-quality augmented scenes, leading to consistent improvements in point cloud segmentation performance across various models.
[CV-24] SpectraSentinel: LightWeight Dual-Stream Real-Time Drone Detection Tracking and Payload Identification
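[Quick Read]: This paper addresses the security risks posed by the growing number of drones in civilian airspace, specifically accurate real-time drone detection, tracking, and payload identification under difficult environmental conditions. The key to the solution is a dual-stream monitoring framework that runs separate lightweight YOLOv11n detectors on the infrared (thermal) and visible (RGB) streams and deliberately avoids early fusion, so that each model can be optimized for its own modality. Per-domain preprocessing and augmentation (for example, limited color jitter for infrared imagery) and tuned training hyperparameters improve detection under heavy noise, low light, and motion blur. The resulting models distinguish drones from birds and classify payload types while maintaining real-time performance.
Link: https://arxiv.org/abs/2507.22650
Authors: Shahriar Kabir, Istiak Ahmmed Rifti, H.M. Shadman Tabib, Mushfiqur Rahman, Sadatul Islam Sadi, Hasnaen Adil, Ahmed Mahir Sultan Rumi, Ch Md Rakin Haider
Affiliations: Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: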
Abstract:The proliferation of drones in civilian airspace has raised urgent security concerns, necessitating robust real-time surveillance systems. In response to the 2025 VIP Cup challenge tasks - drone detection, tracking, and payload identification - we propose a dual-stream drone monitoring framework. Our approach deploys independent You Only Look Once v11-nano (YOLOv11n) object detectors on parallel infrared (thermal) and visible (RGB) data streams, deliberately avoiding early fusion. This separation allows each model to be specifically optimized for the distinct characteristics of its input modality, addressing the unique challenges posed by small aerial objects in diverse environmental conditions. We customize data preprocessing and augmentation strategies per domain - such as limiting color jitter for IR imagery - and fine-tune training hyperparameters to enhance detection performance under conditions of heavy noise, low light, and motion blur. The resulting lightweight YOLOv11n models demonstrate high accuracy in distinguishing drones from birds and in classifying payload types, all while maintaining real-time performance. This report details the rationale for a dual-modality design, the specialized training pipelines, and the architectural optimizations that collectively enable efficient and accurate drone surveillance across RGB and IR channels.
[CV-25] LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing ICCV25
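[Quick Read]: This paper addresses how to combine localized sketches with text descriptions for fine-grained design control in fashion image generation. The key to the solution is LOcalized Text and Sketch for fashion image generation (LOTS): a Modularized Pair-Centric representation first encodes sketches and text into a shared latent space while preserving independent localized features, and a step-based merging strategy, the Diffusion Pair Guidance phase, then integrates local and global conditioning through attention-based guidance within the diffusion model's multi-step denoising process. The authors also release Sketchy, a dataset built on Fashionpedia with multiple text-sketch pairs per image, and report state-of-the-art results on both global and localized metrics.
Link: https://arxiv.org/abs/2507.22627
Authors: Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, Marco Cristani
Affiliations: University of Verona; Fondazione Bruno Kessler; Polytechnic Institute of Turin; University of Reykjavik
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCV25 (Oral). Project page: this https URL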
Abstract:Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
[CV-26] Bridging the Gap in Missing Modalities: Leveraging Knowledge Distillation and Style Matching for Brain Tumor Segmentation
[Quick Read]: This paper addresses brain tumor segmentation when key imaging modalities are missing, where earlier work leaves tumor boundary insensitivity and inefficient feature transfer unresolved. The core of the solution is MST-KDNet, which has three components: Multi-Scale Transformer Knowledge Distillation, which captures attention weights at multiple resolutions; Dual-Mode Logit Distillation, which improves knowledge transfer; and a Global Style Matching Module, which combines feature matching with adversarial learning. On the BraTS and FeTS 2024 datasets it surpasses leading methods in both Dice and HD95, particularly under substantial modality loss.
Link: https://arxiv.org/abs/2507.22626
Authors: Shenghao Zhu, Yifei Chen, Weihong Chen, Yuanhan Wang, Chang Liu, Shuo Jiang, Feiwei Qin, Changmiao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures
Abstract:Accurate and reliable brain tumor segmentation, particularly when dealing with missing modalities, remains a critical challenge in medical image analysis. Previous studies have not fully resolved the challenges of tumor boundary segmentation insensitivity and feature transfer in the absence of key imaging modalities. In this study, we introduce MST-KDNet, aimed at addressing these critical issues. Our model features Multi-Scale Transformer Knowledge Distillation to effectively capture attention weights at various resolutions, Dual-Mode Logit Distillation to improve the transfer of knowledge, and a Global Style Matching Module that integrates feature matching with adversarial learning. Comprehensive experiments conducted on the BraTS and FeTS 2024 datasets demonstrate that MST-KDNet surpasses current leading methods in both Dice and HD95 scores, particularly in conditions with substantial modality loss. Our approach shows exceptional robustness and generalization potential, making it a promising candidate for real-world clinical applications. Our source code is available at this https URL.
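For readers unfamiliar with logit distillation, the sketch below shows the standard temperature-scaled form, in which a student is trained to match a teacher's softened class distribution. This is the generic technique, not the paper's Dual-Mode Logit Distillation, whose details the abstract does not give; the logits and the teacher/student roles are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # shift by the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher = [2.0, 0.5, -1.0]   # hypothetical per-class logits from a teacher model
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([-1.0, 0.5, 2.0], teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge; the T² factor is the usual correction that keeps gradient magnitudes comparable across temperatures.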
[CV-27] Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions ICCV2025
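[Quick Read]: This paper addresses the misuse of text-to-image diffusion models to generate "hateful illusions", optical illusions that embed hate messages in seemingly harmless scenes and can thereby evade current content moderation models. The authors generate 1,860 optical illusions with Stable Diffusion and ControlNet, conditioned on 62 hate messages; 1,571 of them successfully embed the message and form the Hateful Illusion dataset. Detection accuracy falls below 0.245 for six moderation classifiers and below 0.102 for nine vision language models (VLMs). The main weakness lies in the vision encoders, which focus on surface-level image details and overlook the hidden secondary layer of information. The paper explores preliminary mitigations based on image transformations and training-level strategies.
Link: https://arxiv.org/abs/2507.22617
Authors: Yiting Qu, Ziqing Yang, Yihan Ma, Michael Backes, Savvas Zannettou, Yang Zhang
Affiliations: CISPA Helmholtz Center for Information Security; TU Delft
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025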
Abstract:Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions–visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address this risk, we explore preliminary mitigation measures and identify the most effective approaches from the perspectives of image transformations and training-level strategies.
[CV-28] Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model ICCV2025
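[Quick Read]: This paper addresses the poor performance of data-driven trajectory prediction models on rarely observed long-tail scenarios. Prior work mostly modifies the model architecture (for example, with hypernetworks); this paper instead refines the training process and leaves the model structure unchanged. Generative Active Learning for Trajectory prediction (GALTraj) actively identifies rare tail samples on which the model fails and augments them with a controllable diffusion model during training. Its tail-aware generation method applies tailored diffusion guidance so that generated trajectories capture rare behaviors while respecting traffic rules. Experiments on WOMD and Argoverse2 with QCNet and MTR backbones show significant gains on tail samples and improved accuracy on head samples.
Link: https://arxiv.org/abs/2507.22615
Authors: Daehee Park, Monu Surana, Pranav Desai, Ashish Mehta, Reuben MV John, Kuk-Jin Yoon
Affiliations: DGIST; Qualcomm Research; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025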
Abstract:While data-driven trajectory prediction has enhanced the reliability of autonomous driving systems, it still struggles with rarely observed long-tail scenarios. Prior works addressed this by modifying model architectures, such as using hypernetworks. In contrast, we propose refining the training process to unlock each model’s potential without altering its structure. We introduce Generative Active Learning for Trajectory prediction (GALTraj), the first method to successfully deploy generative active learning into trajectory prediction. It actively identifies rare tail samples where the model fails and augments these samples with a controllable diffusion model during training. In our framework, generating scenarios that are diverse, realistic, and preserve tail-case characteristics is paramount. Accordingly, we design a tail-aware generation method that applies tailored diffusion guidance to generate trajectories that both capture rare behaviors and respect traffic rules. Unlike prior simulation methods focused solely on scenario diversity, GALTraj is the first to show how simulator-driven augmentation benefits long-tail learning in trajectory prediction. Experiments on multiple trajectory datasets (WOMD, Argoverse2) with popular backbones (QCNet, MTR) confirm that our method significantly boosts performance on tail samples and also enhances accuracy on head samples.
[CV-29] ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning ICCV2025
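[Quick Read]: This paper addresses backpropagation-based fine-tuning of diffusion models, where the long denoising chain brings high computational cost and a risk of gradient explosion, so that complete gradient backpropagation is hard to achieve and alignment suffers. The key to the solution is Shortcut-based Fine-Tuning (ShortFT), an efficient strategy that uses a recently studied trajectory-preserving few-step diffusion model to build a shorter denoising chain as a shortcut over the original one, which improves both the efficiency and the effectiveness of fine-tuning.
Link: https://arxiv.org/abs/2507.22604
Authors: Xiefan Guo, Miaomiao Cui, Liefeng Bo, Di Huang
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025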
Abstract:Backpropagation-based approaches aim to align diffusion models with reward functions through end-to-end backpropagation of the reward gradient within the denoising chain, offering a promising perspective. However, due to the computational costs and the risk of gradient explosion associated with the lengthy denoising chain, existing approaches struggle to achieve complete gradient backpropagation, leading to suboptimal results. In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. More specifically, we employ the recently researched trajectory-preserving few-step diffusion model, which enables a shortcut over the original denoising chain, and construct a shortcut-based denoising chain of shorter length. The optimization on this chain notably enhances the efficiency and effectiveness of fine-tuning the foundational model. Our method has been rigorously tested and can be effectively applied to various reward functions, significantly improving alignment performance and surpassing state-of-the-art alternatives.
[CV-30] Robust Deepfake Detection for Electronic Know Your Customer Systems Using Registered Images
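[Quick Read]: This paper addresses the reliability of electronic Know Your Customer (eKYC) systems under deepfake attacks, specifically the two common attack types of face swapping and face reenactment, as well as the effect of image degradation on detection performance. The solution rests on three contributions. First, it judges a video's authenticity by analyzing temporal inconsistencies in the identity vectors extracted by face recognition models, which covers both attack types. Second, it uses a registered image (assumed genuine) as a reference and computes the identity discrepancy between the input video and that image, significantly improving detection accuracy. Third, it adopts a face feature extractor trained on a larger dataset, which improves robustness to unseen image degradation.
Link: https://arxiv.org/abs/2507.22601
Authors: Takuma Amada,Kazuya Kakizaki,Taiki Miyagawa,Akinori F. Ebihara,Kaede Shiohara,Toshihiko Yamasaki
Affiliations: NEC Corporation(日本电气公司); The University of Tokyo(东京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 19th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025)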
Abstract:In this paper, we present a deepfake detection algorithm specifically designed for electronic Know Your Customer (eKYC) systems. To ensure the reliability of eKYC systems against deepfake attacks, it is essential to develop a robust deepfake detector capable of identifying both face swapping and face reenactment, while also being robust to image degradation. We address these challenges through three key contributions: (1) Our approach evaluates the video’s authenticity by detecting temporal inconsistencies in identity vectors extracted by face recognition models, leading to comprehensive detection of both face swapping and face reenactment. (2) In addition to processing video input, the algorithm utilizes a registered image (assumed to be genuine) to calculate identity discrepancies between the input video and the registered image, significantly improving detection accuracy. (3) We find that employing a face feature extractor trained on a larger dataset enhances both detection performance and robustness against image degradation. Our experimental results show that our proposed method accurately detects both face swapping and face reenactment and is robust against various forms of unseen image degradation. Our source code is publicly available at this https URL.
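The two identity cues described in this abstract can be illustrated with a minimal sketch. It assumes each frame and the registered image have already been mapped to an identity vector by some face recognition model, and it uses mean cosine distance as the score. The function names and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two identity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def deepfake_scores(frame_ids, registered_id):
    """Return (temporal_inconsistency, identity_discrepancy).

    Hypothetical scoring rule: both values are mean cosine distances,
    so higher means more suspicious. Needs at least two frames.
    """
    # Temporal cue: how much the identity drifts between consecutive frames.
    temporal = [1.0 - cosine(a, b) for a, b in zip(frame_ids, frame_ids[1:])]
    # Reference cue: how far each frame is from the registered image.
    reference = [1.0 - cosine(f, registered_id) for f in frame_ids]
    return sum(temporal) / len(temporal), sum(reference) / len(reference)
```

A real detector would threshold or learn from these scores; the sketch only shows that the registered image adds a second signal beyond temporal drift.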
[CV-31] COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP ICCV
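[Quick Read]: This paper addresses a performance bottleneck of out-of-distribution (OOD) detection in image recognition systems: existing methods are constrained by how well a single classifier performs on in-distribution (ID) data, which makes robust and accurate OOD detection difficult in complex scenarios. The key to the solution is a heterogeneous ensemble, COOkeD, which combines three classifiers built on different mechanisms: a closed-world classifier trained end-to-end, a CLIP-based zero-shot classifier, and a linear probe classifier trained on CLIP image features. This modular, post-hoc ensemble adds little overhead compared to training a single standard classifier and improves the accuracy and robustness of OOD detection, particularly under realistic challenges such as label noise, covariate shift, and zero-shot shift.
Link: https://arxiv.org/abs/2507.22576
Authors: Galadrielle Humblot-Renaux,Gianni Franchi,Sergio Escalera,Thomas B. Moeslund
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: accepted at ICCVW’25 - Systematic Trust in AI Models: Ensuring Fairness, Reliability, Explainability, and Accountability in Machine Learning Frameworks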
Abstract:Out-of-distribution (OOD) detection is an important building block in trustworthy image recognition systems as unknown classes may arise at test-time. OOD detection methods typically revolve around a single classifier, leading to a split in the research field between the classical supervised setting (e.g. ResNet18 classifier trained on CIFAR100) vs. the zero-shot setting (class names fed as prompts to CLIP). In both cases, an overarching challenge is that the OOD detection performance is implicitly constrained by the classifier’s capabilities on in-distribution (ID) data. In this work, we show that given a little open-mindedness from both ends, remarkable OOD detection can be achieved by instead creating a heterogeneous ensemble - COOkeD combines the predictions of a closed-world classifier trained end-to-end on a specific dataset, a zero-shot CLIP classifier, and a linear probe classifier trained on CLIP image features. While bulky at first sight, this approach is modular, post-hoc and leverages the availability of pre-trained VLMs, thus introduces little overhead compared to training a single standard classifier. We evaluate COOkeD on popular CIFAR100 and ImageNet benchmarks, but also consider more challenging, realistic settings ranging from training-time label noise, to test-time covariate shift, to zero-shot shift which has been previously overlooked. Despite its simplicity, COOkeD achieves state-of-the-art performance and greater robustness compared to both classical and CLIP-based OOD detection methods. Code is available at this https URL
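As a rough illustration of a heterogeneous ensemble for OOD detection, the sketch below averages the softmax outputs of several classifiers and scores a sample by one minus the maximum averaged probability. The abstract only says that COOkeD "combines the predictions"; the averaging rule and the max-probability score used here are assumptions chosen for simplicity, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_ood_score(logits_per_model):
    """Average class probabilities across models (assumed combination rule).

    Returns (ood_score, mean_probs); a higher score suggests the sample
    is more likely out-of-distribution.
    """
    probs = [softmax(logits) for logits in logits_per_model]
    n_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return 1.0 - max(mean_probs), mean_probs
```

When the three classifiers agree confidently the score is near zero; when they disagree the averaged distribution flattens and the score rises.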
[CV-32] Subtyping Breast Lesions via Generative Augmentation based Long-tailed Recognition in Ultrasound MICCAI2025
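[Quick Read]: This paper addresses the drop in automated recognition performance for breast lesion subtype classification caused by a skewed long-tailed data distribution. The core challenge is that rare classes have too few samples for a model to learn discriminative features. The key to the solution is a dual-phase framework. First, a class-controllable synthetic network performs high-fidelity data synthesis; it includes a sketch-grounded perception branch that uses anatomical priors to preserve class-specific features and enable annotation-free inference. Second, a reinforcement learning-driven adaptive sampler uses a multi-agent strategy to dynamically adjust the ratio of synthetic to real data, mitigating class imbalance while avoiding the overall performance degradation that overusing synthetic data causes. The method is evaluated on an in-house long-tailed dataset and a public imbalanced breast ultrasound (US) dataset, where the authors report promising performance compared with state-of-the-art approaches.
Link: https://arxiv.org/abs/2507.22568
Authors: Shijing Chen,Xinrui Zhou,Yuhao Wang,Yuhao Huang,Ao Chang,Dong Ni,Ruobing Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI2025 Early Accept. 11 pages, 3 figures, 2 tables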
Abstract:Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long-tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify data distribution. Inspired by this, we propose a dual-phase framework for long-tailed classification that mitigates distributional bias through high-fidelity data synthesis while avoiding overuse that corrupts holistic performance. The framework incorporates a reinforcement learning-driven adaptive sampler, dynamically calibrating synthetic-real data ratios by training a strategic multi-agent to compensate for scarcities of real data while ensuring stable discriminative capability. Furthermore, our class-controllable synthetic network integrates a sketch-grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation-free inference. Extensive experiments on an in-house long-tailed and a public imbalanced breast US datasets demonstrate that our method achieves promising performance compared to state-of-the-art approaches. More synthetic images can be found at this https URL.
[CV-33] RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning ICCV2025
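[Quick Read]: This paper addresses insufficient integration of task-specific knowledge in prompt-based continual learning. Existing methods either use fixed learned prompts or generate prompts from an entangled task-shared space, which limits the representational diversity of the integrated prompt and makes it hard to meet the demands of complex sequential tasks. The key to the solution is a novel prompt-evolving mechanism that adaptively aggregates base prompts (i.e., task-specific prompts) into a unified prompt while preserving diversity. It also introduces a learnable probabilistic gate that adaptively determines which layers to activate during evolution, so that accumulated knowledge is continuously evolved and transferred to support learning new tasks.
Link: https://arxiv.org/abs/2507.22553
Authors: Kiseong Hong,Gyeong-hyeon Kim,Eunwoo Kim
Affiliations: Chung-Ang University (中央大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the 2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025)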
Abstract:Prompt-based continual learning provides a rehearsal-free solution by tuning small sets of parameters while keeping pre-trained models frozen. To meet the complex demands of sequential tasks, it is crucial to integrate task-specific knowledge within prompts effectively. However, existing works rely on either fixed learned prompts (i.e., prompts whose representations remain unchanged during new task learning) or on prompts generated from an entangled task-shared space, limiting the representational diversity of the integrated prompt. To address this issue, we propose a novel prompt-evolving mechanism to adaptively aggregate base prompts (i.e., task-specific prompts) into a unified prompt while ensuring diversity. By transforming and aligning base prompts, both previously learned and newly introduced, our approach continuously evolves accumulated knowledge to facilitate learning new tasks. We further introduce a learnable probabilistic gate that adaptively determines which layers to activate during the evolution process. We validate our method on image classification and video action recognition tasks in class-incremental learning, achieving average gains of 9.07% and 7.40% over existing methods across all scenarios.
[CV-34] HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors
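[Quick Read]: This paper addresses accurate segmentation of the hepatic vasculature in surgical videos of hepatectomy, a task that has been little studied because of the lack of a high-quality annotated dataset and its inherent complexity. The key to the solution is a high-resolution video vasculature segmentation network called HRVVS. First, the authors build a high-quality frame-by-frame annotated dataset containing 35 long videos and 11442 high-resolution frames. Second, they embed a pretrained Visual Autoregressive Modeling (VAR) model into different layers of the hierarchical encoder as prior information, reducing the information degradation caused by downsampling. In addition, they design a dynamic memory decoder on a multi-view segmentation network that limits the transmission of redundant information between frames while preserving more detail, and report significantly better results than state-of-the-art methods.
Link: https://arxiv.org/abs/2507.22530
Authors: Xincheng Yao,Yijun Yang,Kangwei Guo,Ruiqiang Xiao,Haipeng Zhou,Haisu Tao,Jian Yang,Lei Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: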
Abstract:The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the dearth of an appropriate dataset and the inherently complex task characteristics, few studies have been reported in this domain. To address this issue, we first introduce a high quality frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS. We innovatively embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we designed a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at this https URL.
[CV-35] FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression ICML2025
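[Quick Read]: This paper addresses the difficulty of deploying Deep Neural Networks (DNNs) on edge devices due to their large parameter counts and heavy computational load. The core solution is a Fractional Gaussian Filter (FGF) framework that combines fractional-order differential calculus with the Gaussian function, paired with Adaptive Unstructured Pruning (AUP) for efficient compression. The key innovation is to approximate the fractional-order differential equation with Grünwald-Letnikov fractional derivatives, which reduces computational complexity, and to cut the number of parameters per kernel to only seven; AUP then raises the compression ratio further. For example, on CIFAR-10, ResNet-20 shrinks by 85.2% with an accuracy drop of only 1.52%.
Link: https://arxiv.org/abs/2507.22527
Authors: Kuan-Ting Tu,Po-Hsien Yu,Yu-Syuan Tseng,Shao-Yi Chien
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, 4 tables, Accepted by ICML 2025 Workshop (TTODLer-FM)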
Abstract:Network compression techniques have become increasingly important in recent years because the loads of Deep Neural Networks (DNNs) are heavy for edge devices in real-world applications. While many methods compress neural network parameters, deploying these models on edge devices remains challenging. To address this, we propose the fractional Gaussian filter and pruning (FGFP) framework, which integrates fractional-order differential calculus and Gaussian function to construct fractional Gaussian filters (FGFs). To reduce the computational complexity of fractional-order differential operations, we introduce Grünwald-Letnikov fractional derivatives to approximate the fractional-order differential equation. The number of parameters for each kernel in FGF is minimized to only seven. Beyond the architecture of Fractional Gaussian Filters, our FGFP framework also incorporates Adaptive Unstructured Pruning (AUP) to achieve higher compression ratios. Experiments on various architectures and benchmarks show that our FGFP framework outperforms recent methods in accuracy and compression. On CIFAR-10, ResNet-20 achieves only a 1.52% drop in accuracy while reducing the model size by 85.2%. On ImageNet2012, ResNet-50 achieves only a 1.63% drop in accuracy while reducing the model size by 69.1%.
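For readers unfamiliar with the Grünwald-Letnikov approximation mentioned in this abstract, here is a minimal sketch of the textbook finite-difference form, D^alpha f(x) ≈ h^(-alpha) * sum_k w_k f(x - k h), with weights w_0 = 1 and w_k = w_(k-1) * (1 - (alpha + 1) / k). It shows only the generic approximation; how the paper builds its seven-parameter fractional Gaussian filters is not described in the abstract and is not reproduced here. The function names are illustrative.

```python
def gl_weights(alpha, n):
    """First n Grünwald-Letnikov weights, w_k = (-1)^k * binom(alpha, k)."""
    weights = [1.0]
    for k in range(1, n):
        # Recurrence avoids computing generalized binomials directly.
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights

def gl_derivative(samples, alpha, h):
    """Approximate the order-alpha derivative at x.

    samples[k] must hold f(x - k*h), i.e. samples[0] is f(x) itself.
    """
    weights = gl_weights(alpha, len(samples))
    return sum(w * f for w, f in zip(weights, samples)) / h ** alpha
```

For integer orders the weights reduce to ordinary finite differences (alpha = 1 gives 1, -1, 0, ...), while fractional orders give an infinite, slowly decaying sequence that is truncated in practice.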
[CV-36] Recognizing Actions from Robotic View for Natural Human-Robot Interaction ICCV2025
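[Quick Read]: This paper addresses the challenge of recognizing human actions from a robot's dynamic viewpoint in Natural Human-Robot Interaction (N-HRI). The task is more complex than conventional action recognition because it involves varying distances, environments, and subjects, as well as the robot's own motion. Existing benchmarks are limited in data scale, modality diversity, task categories, and scene complexity, and cannot support N-HRI research. To this end, the authors present the ACTIVE dataset, which contains 30 composite action categories, 80 participants, and 46,868 annotated video instances, covers both RGB and point cloud modalities, and simulates real robot perception with distances from 3 m to 50 m and a moving camera platform. The key to the solution is the ACTIVE-PC method, which uses Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions to perceive human actions accurately at long distances.
Link: https://arxiv.org/abs/2507.22522
Authors: Ziyi Wang,Peiming Li,Hong Liu,Zhichao Deng,Can Wang,Jun Liu,Junsong Yuan,Mengyuan Liu
Affiliations: Peking University (北京大学); Sun Yat-sen University (中山大学); Kiel University (基尔大学); Lancaster University (兰卡斯特大学); State University of New York at Buffalo (纽约州立大学布法罗分校)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 4 figures, Accepted to ICCV2025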
Abstract:Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: this https URL.
[CV-37] AlphaDent: A dataset for automated tooth pathology detection
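[Quick Read]: This paper addresses instance segmentation in dental photographs for automated tooth pathology detection. The key to the solution is AlphaDent, an open, annotated dataset of over 1200 DSLR camera photographs of the teeth of 295 patients, labeled for instance segmentation with 9 classes. The authors also train neural network models on the dataset and report high-quality predictions, giving dental research a reproducible, openly licensed baseline.
Link: https://arxiv.org/abs/2507.22512
Authors: Evgeniy I. Sosnin,Yuriy L. Vasilev,Roman A. Solovyev,Aleksandr L. Stempkovskiy,Dmitry V. Telpukhov,Artem A. Vasilev,Aleksandr A. Amerikanov,Aleksandr Y. Romanov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: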
Abstract:In this article, we present a new unique dataset for dental research - AlphaDent. This dataset is based on the DSLR camera photographs of the teeth of 295 patients and contains over 1200 images. The dataset is labeled for solving the instance segmentation problem and is divided into 9 classes. The article provides a detailed description of the dataset and the labeling format. The article also provides the details of the experiment on neural network training for the Instance Segmentation problem using this dataset. The results obtained show high quality of predictions. The dataset is published under an open license; and the training/inference code and model weights are also available under open licenses.
[CV-38] DACA-Net: A Degradation-Aware Conditional Diffusion Network for Underwater Image Enhancement ACM-MM2025
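[Quick Read]: This paper addresses the colour distortion, low visibility, and reduced structural clarity that complex optical effects such as scattering and absorption cause in underwater images, which degrade visual quality and limit downstream visual perception tasks. The key to the solution is a degradation-aware conditional diffusion model. A lightweight dual-stream convolutional network first predicts the degradation level of the input image and produces a continuous degradation score as semantic guidance. Based on this score, a conditional diffusion restoration network with a Swin UNet backbone performs adaptive noise scheduling and hierarchical feature refinement. A degradation-guided adaptive feature fusion module and a hybrid loss function combining perceptual consistency, histogram matching, and feature-level contrast incorporate underwater-specific physical priors. The authors report improved colour fidelity, perceptual quality, and structural detail on benchmark datasets.
Link: https://arxiv.org/abs/2507.22501
Authors: Chang Huang,Jiahang Cao,Jun Ma,Kieren Yu,Cong Li,Huayong Yang,Kaishun Wu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: accepted by ACM MM 2025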
Abstract:Underwater images typically suffer from severe colour distortions, low visibility, and reduced structural clarity due to complex optical effects such as scattering and absorption, which greatly degrade their visual quality and limit the performance of downstream visual perception tasks. Existing enhancement methods often struggle to adaptively handle diverse degradation conditions and fail to leverage underwater-specific physical priors effectively. In this paper, we propose a degradation-aware conditional diffusion model to enhance underwater images adaptively and robustly. Given a degraded underwater image as input, we first predict its degradation level using a lightweight dual-stream convolutional network, generating a continuous degradation score as semantic guidance. Based on this score, we introduce a novel conditional diffusion-based restoration network with a Swin UNet backbone, enabling adaptive noise scheduling and hierarchical feature refinement. To incorporate underwater-specific physical priors, we further propose a degradation-guided adaptive feature fusion module and a hybrid loss function that combines perceptual consistency, histogram matching, and feature-level contrast. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with SOTA approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments.
[CV-39] Robust Adverse Weather Removal via Spectral-based Spatial Grouping ICCV25
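[Quick Read]: This paper addresses the diverse degradation patterns and strongly varying local characteristics that images exhibit under complex adverse weather. Existing All-in-One (AiO) models rely on global filtering, such as direct operations in the frequency domain, and struggle with highly non-uniform degradations. The key to the solution is the Spectral-based Spatial Grouping Transformer (SSGformer). It decomposes an image into high-frequency edge features (via conventional edge detection) and low-frequency information (via Singular Value Decomposition) and models their relationship with multi-head linear attention. It then builds a spatial grouping mask that clusters image regions by spatial similarity and texture, and applies group-wise attention for robust restoration of localized degradations across weather conditions. The proposed Spatial Grouping Transformer Block combines channel attention and spatial attention to balance feature-wise relationships and spatial dependencies.
Link: https://arxiv.org/abs/2507.22498
Authors: Yuhwan Jeong,Yunseo Yang,Youngjo Yoon,Kuk-Jin Yoon
Affiliations: KAIST(韩国科学技术院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by ICCV25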
Abstract:Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address this issue, we propose Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping-mask that clusters regions based on the spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling the varied and intricate adverse weather degradations.
[CV-40] Estimating 2D Camera Motion with Hybrid Motion Basis ICCV2025
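[Quick Read]: This paper addresses 2D camera motion estimation, i.e., accurately modeling the projection of 3D camera movement onto the 2D image plane. Existing methods are limited: homography-based methods apply only to planar scenes, while meshflow techniques handle local transformations through grid-based local homographies but struggle with complex non-linear motion. The key to the solution is the CamFlow framework, which represents camera motion with hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. It also uses a hybrid probabilistic loss function based on the Laplace distribution to improve training robustness, and builds a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show stronger robustness and generalization in zero-shot settings.
Link: https://arxiv.org/abs/2507.22480
Authors: Haipeng Li,Tianhao Zhou,Zhanglei Yang,Yi Wu,Yan Chen,Zijing Mao,Shen Cheng,Bing Zeng,Shuaicheng Liu
Affiliations: University of Electronic Science and Technology of China (电子科技大学); Xiaomi Corporation (小米公司); Dexmal (德克斯玛)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025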
Abstract:Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. A key insight of our work is that combining flow fields from different homographies creates motion patterns that cannot be represented by any single homography. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: this https URL.
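Two ideas from this abstract lend themselves to a small sketch: expressing a flow field in terms of motion bases, and a loss derived from the Laplace distribution. The code below assumes a plain weighted sum of basis flow fields and the standard Laplace negative log-likelihood, log(2b) + |target - pred| / b. Both are simplifications chosen for illustration; the paper's actual basis construction and hybrid loss are not specified in the abstract, and the function names are hypothetical.

```python
import math

def laplace_nll(pred, target, scale):
    """Negative log-likelihood of `target` under Laplace(pred, scale)."""
    return math.log(2.0 * scale) + abs(target - pred) / scale

def combine_bases(bases, weights):
    """Weighted sum of basis flow fields (assumed linear combination).

    Each basis is a list of (dx, dy) displacements, one per pixel.
    """
    n_pixels = len(bases[0])
    flow = []
    for p in range(n_pixels):
        dx = sum(w * basis[p][0] for w, basis in zip(weights, bases))
        dy = sum(w * basis[p][1] for w, basis in zip(weights, bases))
        flow.append((dx, dy))
    return flow
```

Because the Laplace penalty grows linearly with the error rather than quadratically, large outlier residuals (for example from unmasked moving objects) pull on the fit less than they would under a Gaussian loss.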
[CV-41] LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks
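[Quick Read]: This paper addresses the high computational cost of pixel-level crack segmentation on multimodal data and the limited ability of existing methods to adaptively perceive and fuse cross-modal features. The core solution is a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR). Its key innovations are: (1) a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS), which adaptively models crack cues from different modalities through a mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS); (2) a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF), which combines an Adaptive Frequency Domain Perceptron (AFDP) with a dual-pooling fusion strategy to capture spatial and frequency-domain cues across modalities; and (3) a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK), which replaces most conventional convolutions and perceives complex morphological structures at much lower computational cost.
Link: https://arxiv.org/abs/2507.22477
Authors: Hui Liu,Chen Jia,Fan Shi,Xu Cheng,Mengfei Shi,Xia Xie,Shengyong Chen
Affiliations: Tianjin University of Technology (天津理工大学); Hainan University (海南大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: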
Abstract:Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at this https URL.
[CV-42] Visual Language Models as Zero-Shot Deepfake Detectors ICML2025
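[Quick Read]: This paper addresses the reliance of existing deepfake detection methods on specialized classifiers and their lack of robustness. The key to the solution is to exploit the zero-shot capability of Vision Language Models (VLMs). On a high-quality dataset of 60,000 deepfake images, the author shows that zero-shot VLMs, without task-specific fine-tuning, outperform almost all existing methods. The paper then compares the InstructBLIP architecture on the DFDC-P dataset in two scenarios, zero-shot and in-domain fine-tuning, and reports that VLMs outperform traditional classifiers, highlighting their advantage in cross-domain generalization and robustness.
Link: https://arxiv.org/abs/2507.22469
Authors: Viacheslav Pirogov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models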
Abstract:The contemporary phenomenon of deepfakes, utilizing GAN or diffusion models for face swapping, presents a substantial and evolving threat in digital media, identity verification, and a multitude of other systems. The majority of existing methods for detecting deepfakes rely on training specialized classifiers to distinguish between genuine and manipulated images, focusing only on the image domain without incorporating any auxiliary tasks that could enhance robustness. In this paper, inspired by the zero-shot capabilities of Vision Language Models, we propose a novel VLM-based approach to image classification and then evaluate it for deepfake detection. Specifically, we utilize a new high-quality deepfake dataset comprising 60,000 images, on which our zero-shot models demonstrate superior performance to almost all existing methods. Subsequently, we compare the performance of the best-performing architecture, InstructBLIP, on the popular deepfake dataset DFDC-P against traditional methods in two scenarios: zero-shot and in-domain fine-tuning. Our results demonstrate the superiority of VLMs over traditional classifiers.
[CV-43] Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation ACM-MM’25
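[Quick Read]: This paper addresses the lack of fine-grained information in Unsupervised Video Object Segmentation (UVOS), which stems from the absence of pixel-level prior knowledge and limits the gains of existing memory-based methods. The core problem is that current methods over-rely on memorizing high-level semantic features and neglect the fine spatial detail carried by shallow features, which leads to imprecise segmentation. The key to the solution is a hierarchical memory architecture that incorporates both shallow pixel-level features and high-level semantic features, together with a heterogeneous interaction mechanism that balances the use of the two. Specifically, a Pixel-guided Local Alignment Module (PLAM) and a Semantic-guided Global Integration Module (SGIM) integrate the fine-grained details in shallow-level memory with the semantic representations in high-level memory, improving segmentation accuracy and robustness on UVOS.
Link: https://arxiv.org/abs/2507.22465
Authors: Zheng Xiangyu,He Songcheng,Li Wanyun,Li Xiaoqiang,Zhang Wei
Affiliations: Fudan University (复旦大学); Shanghai University (上海大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM MM’25: The 33rd ACM International Conference on Multimedia Proceedings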
Abstract:Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently suffers from the deficiency of lacking fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory design relying solely on high-level features, which predominantly capture abstract semantic cues, is insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture to incorporate both shallow- and high-level features for memory, which leverages the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of Pixel-guided Local Alignment Module (PLAM) and Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: this https URL.
[CV-44] Exploiting Diffusion Prior for Task-driven Image Restoration ICCV2025
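[Quick Read]: This paper addresses the performance drop in high-level vision tasks caused by low-quality (LQ) inputs in Task-driven Image Restoration (TDIR) under multiple complex degradations. Existing TDIR methods handle multiple complex degradation factors poorly and struggle to recover the details that matter for downstream tasks from LQ images that contain few clues. The key to the solution is to exploit the diffusion prior effectively by injecting useful clues from the LQ image directly into the diffusion process: generation starts from a pixel-error-based pre-restored LQ image with mild noise added, and the number of denoising steps is kept small so that redundant details do not dilute crucial task-related information. This strategy improves both task performance and visual quality across TDIR tasks with multiple complex degradations.
Link: https://arxiv.org/abs/2507.22459
Authors: Jaeha Kim,Junghun Oh,Kyoung Mu Lee
Affiliations: Seoul National University (首尔国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025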
Abstract:Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This motivates us to leverage the diffusion prior, one of the most powerful natural image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with recent TDIR methods. To address this, we propose EDTR, which effectively harnesses the power of diffusion prior to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pixel-error-based pre-restored LQ images with mild noise added. Moreover, we employ a small number of denoising steps to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes diffusion prior for TDIR, significantly enhancing task performance and visual quality across diverse tasks with multiple complex degradations.
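The idea of starting generation from a pre-restored image with mild noise can be sketched with the standard DDPM forward step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, applied at a small timestep t. The snippet assumes a DDPM-style noise schedule and treats the image as a flat list of pixel values; the schedule, the helper names, and the choice of t are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def alpha_bar(t, betas):
    """Cumulative product of (1 - beta) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod

def noisy_start(pre_restored, t, betas, seed=0):
    """Diffuse a pre-restored image to step t (small t means mild noise).

    The reverse process would then run only t denoising steps from here,
    instead of starting from pure noise.
    """
    rng = random.Random(seed)
    ab = alpha_bar(t, betas)
    signal, noise = math.sqrt(ab), math.sqrt(1.0 - ab)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in pre_restored]
```

Keeping t small preserves most of the pre-restored content, so the few denoising steps that follow refine the image rather than regenerating it.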
[CV-45] TopoLiDM: Topology-Aware LiDAR Diffusion Models for Interpretable and Realistic LiDAR Point Cloud Generation IROS2025
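[Quick Read]: This paper addresses the shortcomings of existing LiDAR scene generation methods in geometric realism and global topological consistency. In particular, LiDAR Diffusion Models (LiDMs) that embed points into a latent space struggle to model detailed geometric structures and preserve global topology. The key to the solution is the TopoLiDM framework, which combines Graph Neural Networks (GNNs) with diffusion models under topological regularization. It first trains a topology-preserving Variational Autoencoder (VAE) to extract latent graph representations, then freezes the VAE and generates new latent topological graphs with a latent diffusion model. It further introduces 0-dimensional persistent homology (PH) constraints so that generated scenes adhere to real-world global topological structures. This design improves the fidelity and interpretability of generated LiDAR point clouds while keeping generation fast (1.68 samples/s on average), making it suitable for real autonomous driving scenarios.
Link: https://arxiv.org/abs/2507.22454
Authors: Jiuming Liu,Zheng Huang,Mengmeng Liu,Tianchen Deng,Francesco Nex,Hao Cheng,Hesheng Wang
Affiliations: Shanghai Jiao Tong University (上海交通大学); University of Twente (特温特大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by IROS 2025. Code: this https URL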
Abstract:LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their interpretable ability to model detailed geometric structures and preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topological-preserving VAE to extract latent graph representations by graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through the latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM’s superiority over state-of-the-art methods, achieving improvements of 22.6% lower Frechet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation speed with an average inference time of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related codes at this https URL.
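A minimal sketch of what 0-dimensional persistent homology measures, not the authors' implementation: every point starts as its own connected component, and a component dies at the edge length where it merges into another one. The function name `zero_dim_persistence` and the use of raw pairwise Euclidean distances are assumptions made for illustration; the abstract does not say how TopoLiDM builds its graph or turns PH into a training constraint.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return the death lengths of 0-dim features (all are born at 0).

    Hypothetical helper: edges are added in order of increasing length and a
    union-find structure records each merge of two components.
    """
    parent = list(range(len(points)))

    def find(i):
        # Path-halving find for the union-find structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j   # two components merge: one of them dies
            deaths.append(length)
    return deaths                      # one component survives forever


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(zero_dim_persistence(points))  # [1.0, 4.0]
```

Comparing such death lengths between generated and real scenes is one way a topological regularizer could be formed; that step is omitted because the source does not specify it.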
[CV-46] RCR-AF: Enhancing Model Generalization via Rademacher Complexity Reduction Activation Function
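【Quick Read】: This paper addresses the high vulnerability of deep neural networks to adversarial attacks in safety-sensitive applications. The key to the solution is a new activation function, the Rademacher Complexity Reduction Activation Function (RCR-AF), which combines the smoothness, gradient stability, and negative-information retention of GELU (Gaussian Error Linear Unit) with the monotonicity of ReLU (Rectified Linear Unit), and introduces two hyperparameters, $\alpha$ and $\gamma$, that control model sparsity and capacity. This lowers the model's Rademacher complexity and, in theory, improves generalization and adversarial robustness. Empirically, RCR-AF outperforms mainstream activations such as ReLU, GELU, and Swish in both clean accuracy under standard training and robustness under adversarial training.
Link: https://arxiv.org/abs/2507.22446
Authors: Yunrui Yu, Kafeng Wang, Hang Su, Jun Zhu
Institutions: Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: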
Abstract:Despite their widespread success, deep neural networks remain critically vulnerable to adversarial attacks, posing significant risks in safety-sensitive applications. This paper investigates activation functions as a crucial yet underexplored component for enhancing model robustness. We propose a Rademacher Complexity Reduction Activation Function (RCR-AF), a novel activation function designed to improve both generalization and adversarial resilience. RCR-AF uniquely combines the advantages of GELU (including smoothness, gradient stability, and negative information retention) with ReLU’s desirable monotonicity, while simultaneously controlling both model sparsity and capacity through built-in clipping mechanisms governed by two hyperparameters, $\alpha$ and $\gamma$. Our theoretical analysis, grounded in Rademacher complexity, demonstrates that these parameters directly modulate the model’s Rademacher complexity, offering a principled approach to enhance robustness. Comprehensive empirical evaluations show that RCR-AF consistently outperforms widely-used alternatives (ReLU, GELU, and Swish) in both clean accuracy under standard training and in adversarial robustness within adversarial training paradigms.
[CV-47] From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
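【Quick Read】: This paper addresses the marked drop in human pose estimation performance under motion blur. The core challenge is that training data usually consists of sharp images, whereas real scenarios such as low light or fast motion produce blurred images, leaving a substantial domain gap between the source and target domains. The key to the solution is to use an event camera to obtain motion information at high temporal resolution and to generate motion-aware blurred images from the event data, bridging the domain gap without paired annotations. The paper also proposes a student-teacher framework that iteratively refines pseudo-labels, using mutual uncertainty masking to discard incorrect labels and improve generalization on the target domain.
Link: https://arxiv.org/abs/2507.22438
Authors: Youngho Kim, Hoonhee Cho, Kuk-Jin Yoon
Institutions: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: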
Abstract:Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at this https URL.
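[CV-48] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
【Quick Read】: This paper tackles the problem that large-scale but noisy image-text pair data limits the performance of vision-language models (VLMs); the core question is how to improve data quality continuously through a self-reinforcing cycle. The key to the solution is a VLM-driven data refinement pipeline that uses a VLM to interpret each image together with its raw alt-text and generates four complementary kinds of text: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. On this basis it builds VLM-150M, a high-quality dataset with multi-grained annotations, and designs an extended contrastive training paradigm that adds negative descriptions and short tags as extra supervision, significantly improving zero-shot classification, cross-modal retrieval, and fine-grained visual understanding.
Link: https://arxiv.org/abs/2507.22431
Authors: Zhixiang Wei, Guangting Wang, Xiaoxiao Ma, Ke Mei, Huaian Chen, Yi Jin, Fengyun Rao
Institutions: University of Science and Technology of China; WeChat Vision, Tencent Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: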
Abstract:Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10$\times$ more training data than ours. All code, data, and models are available at this https URL.
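One plausible way to add negative descriptions to a contrastive objective is sketched below. This is an assumption, not HQ-CLIP's published loss: the abstract only says that negative descriptions and short tags serve as additional supervision. Here each image is scored against every positive description in the batch plus its own negative description, and the function name `image_to_text_loss`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_to_text_loss(img_embs, pos_embs, neg_embs, temperature=0.07):
    """Mean InfoNCE-style loss with one extra hard negative per image.

    Hypothetical sketch: embeddings are plain lists assumed to be L2-normalised.
    """
    total = 0.0
    for i, img in enumerate(img_embs):
        logits = [dot(img, text) / temperature for text in pos_embs]
        logits.append(dot(img, neg_embs[i]) / temperature)  # own negative description
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[i]  # cross-entropy with the matching text as target
    return total / len(img_embs)


images = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
easy_negatives = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to their own image
hard_negatives = [[1.0, 0.0], [0.0, 1.0]]   # indistinguishable from the positive
print(image_to_text_loss(images, positives, easy_negatives))
print(image_to_text_loss(images, positives, hard_negatives))
```

The second loss is larger than the first because the image embedding cannot separate the negative description from the positive one, which is the kind of signal an extra negative term is meant to provide.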
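[CV-49] Theoretical Analysis of Relative Errors in Gradient Computations for Adversarial Attacks with CE Loss
【Quick Read】: This paper addresses the overestimation problem in gradient-based adversarial attacks caused by floating-point computation errors: with the cross-entropy (CE) loss, relative errors in floating-point arithmetic significantly degrade gradient precision, which weakens attacks and makes robustness evaluation less accurate. The key to the solution is the Theoretical MIFPE (T-MIFPE) loss, whose core idea is an optimal scaling factor $T = t^*$ that suppresses the effect of floating-point underflow and rounding errors on gradient computation, improving the stability and accuracy of the computed gradients. Experiments on MNIST, CIFAR-10, and CIFAR-100 show that T-MIFPE outperforms existing losses, including CE, C&W, DLR, and the original MIFPE, in attack strength and robustness evaluation accuracy.
Link: https://arxiv.org/abs/2507.22428
Authors: Yunrui Yu, Hang Su, Cheng-zhong Xu, Zhizhong Su, Jun Zhu
Institutions: Tsinghua University; University of Macau; Horizon Robotics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: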
Abstract:Gradient-based adversarial attacks using the Cross-Entropy (CE) loss often suffer from overestimation due to relative errors in gradient computation induced by floating-point arithmetic. This paper provides a rigorous theoretical analysis of these errors, conducting the first comprehensive study of floating-point computation errors in gradient-based attacks across four distinct scenarios: (i) unsuccessful untargeted attacks, (ii) successful untargeted attacks, (iii) unsuccessful targeted attacks, and (iv) successful targeted attacks. We establish theoretical foundations characterizing the behavior of relative numerical errors under different attack conditions, revealing previously unknown patterns in gradient computation instability, and identify floating-point underflow and rounding as key contributors. Building on this insight, we propose the Theoretical MIFPE (T-MIFPE) loss function, which incorporates an optimal scaling factor $T = t^*$ to minimize the impact of floating-point errors, thereby enhancing the accuracy of gradient computation in adversarial attacks. Extensive experiments on the MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that T-MIFPE outperforms existing loss functions, including CE, C&W, DLR, and MIFPE, in terms of attack potency and robustness evaluation accuracy.
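The underflow effect described above can be reproduced with a small sketch. It is illustrative only and makes several assumptions: it uses Python's 64-bit floats (the same effect appears at smaller logit gaps in 32-bit arithmetic), the helper name `ce_grad_wrt_logits` is hypothetical, and the divisor `scale=500.0` is an arbitrary stand-in because the abstract does not give the paper's optimal factor $t^*$.

```python
import math


def ce_grad_wrt_logits(logits, label, scale=1.0):
    """Gradient of CE(softmax(logits / scale), label) with respect to the logits.

    Analytically this equals (softmax(logits / scale) - one_hot(label)) / scale.
    """
    scaled = [v / scale for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]   # exp(-1000) underflows to 0.0
    norm = sum(exps)
    probs = [e / norm for e in exps]
    return [
        (p - (1.0 if k == label else 0.0)) / scale
        for k, p in enumerate(probs)
    ]


# A confidently and correctly classified sample: the attack has not yet succeeded.
logits = [1000.0, 0.0, 0.0]

plain = ce_grad_wrt_logits(logits, label=0)
rescaled = ce_grad_wrt_logits(logits, label=0, scale=500.0)

print(plain)     # every entry is zero, so a gradient step cannot move the input
print(rescaled)  # non-zero entries: the attack direction is recovered
```

With an all-zero gradient the attack stalls and the model looks more robust than it is, which is the overestimation the abstract refers to; rescaling the logits before the softmax keeps the exponentials inside the representable range.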
[CV-50] Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking
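【Quick Read】: This paper addresses how to process spatial and temporal information efficiently in real-time video analysis while keeping computation low, in particular the difficulty of balancing accuracy and speed in resource-constrained environments. The key to the solution is a unified framework that incorporates advanced spatial-temporal modeling and a novel hierarchical attention mechanism, which adaptively focuses on relevant spatial regions across temporal sequences and so improves joint action recognition and object tracking. Experiments on UCF-101, HMDB-51, and MOT17 report a 3.2% gain in action recognition accuracy, a 2.8% gain in tracking precision, and 40% faster inference than existing methods.
Link: https://arxiv.org/abs/2507.22421
Authors: Shahla John
Institutions: Kabul University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: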
Abstract:Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance accuracy and speed, particularly in resource-constrained environments. In this work, we present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our approach builds upon recent advances in parallel sequence modeling and introduces a novel hierarchical attention mechanism that adaptively focuses on relevant spatial regions across temporal sequences. We demonstrate that our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds. Extensive experiments on UCF-101, HMDB-51, and MOT17 datasets show improvements of 3.2% in action recognition accuracy and 2.8% in tracking precision compared to existing methods, with 40% faster inference time.
[CV-51] Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching
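【Quick Read】: This paper addresses the difficulty of accurately quantifying aleatoric uncertainty in medical image segmentation, which reflects the natural variability among expert annotators. Conventional methods model the segmentation distribution with generative models but are limited in expressive power; diffusion-based methods approximate the data distribution well, yet their inherently stochastic sampling and inability to model exact densities make their uncertainty estimates inaccurate. The key to the solution is conditional flow matching, a simulation-free flow-based model that learns an exact probability density. By conditioning the flow model on the input image and sampling multiple times, the method produces segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution, capturing uncertainty especially in regions with ambiguous boundaries and yielding a robust quantification consistent with inter-annotator differences.
Link: https://arxiv.org/abs/2507.22418
Authors: Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: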
Abstract:Quantifying aleatoric uncertainty in medical image segmentation is critical since it is a reflection of the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution using the generative model, but current methods limit the expression ability of generative models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at this https URL
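The uncertainty map mentioned in the abstract is the pixel-wise variance over several sampled segmentations. A minimal sketch of that last step follows; the sampler itself is not shown, and the hand-written masks stand in for draws from a conditional flow model, so they and the helper name `pixelwise_variance` are assumptions for illustration.

```python
def pixelwise_variance(samples):
    """Population variance at each pixel across a list of equally sized masks."""
    count = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    variance = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            values = [mask[r][c] for mask in samples]
            mean = sum(values) / count
            variance[r][c] = sum((v - mean) ** 2 for v in values) / count
    return variance


# Four hypothetical binary masks of a 2x2 image (1 = foreground).
samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
print(pixelwise_variance(samples))  # [[0.0, 0.25], [0.0, 0.0]]
```

Pixels on which every sample agrees get variance 0, while the pixel that is foreground in only half of the samples gets 0.25, the maximum for a binary label; this is how ambiguous boundaries show up in the map.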
[CV-52] UAVScenes: A Multi-Modal Dataset for UAVs ICCV2025
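【Quick Read】: This paper addresses the limitations of current multi-modal unmanned aerial vehicle (UAV) datasets for high-level scene understanding: existing datasets focus mainly on localization and 3D reconstruction and lack frame-wise semantic annotations for both images and LiDAR point clouds, so they cannot support tasks such as segmentation, depth estimation, 6-degree-of-freedom (6-DoF) localization, place recognition, and novel view synthesis (NVS). The key to the solution is to enhance the existing multi-modal UAV dataset MARS-LVIG with manually labeled frame-wise semantic annotations for images and LiDAR point clouds plus accurate 6-DoF poses, yielding UAVScenes, a large-scale benchmark for 2D and 3D multi-modal perception tasks.
Link: https://arxiv.org/abs/2507.22412
Authors: Sijie Wang, Siqi Li, Yawei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, Shrivarshann Srijeyan, Aditya Sivadas, Toshan Aggarwal, Heyuan Liu, Hongming Zhang, Chujie Chen, Junyu Jiang, Lihua Xie, Wee Peng Tay
Institutions: Nanyang Technological University; Northeastern University; Beihang University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025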
Abstract:Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs’ surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at this https URL
[CV-53] Moiré Zero: An Efficient and High-Performance Neural Architecture for Moiré Removal
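【Quick Read】: This paper addresses moiré patterns, which arise from frequency aliasing between the camera sensor's sampling process and fine repetitive structures and cause significant interference in practical applications such as consumer photography and industrial defect inspection. Existing methods based on convolutional neural networks (CNNs) have limited receptive fields and handle the diversity of moiré patterns in scale, orientation, and color poorly, so removal quality suffers. The paper proposes MZNet, a U-shaped network built on three components: a Multi-Scale Dual Attention Block (MSDAB) that extracts and refines multi-scale features, a Multi-Shape Large Kernel Convolution Block (MSLKB) that captures moiré structures of different shapes, and a feature-fusion-based skip connection that improves information flow. Together these improve local texture restoration and large-scale artifact suppression, reaching state-of-the-art results on high-resolution datasets at low computational cost.
Link: https://arxiv.org/abs/2507.22407
Authors: Seungryong Lee, Woojeong Baek, Younghyun Kim, Eunwoo Kim, Haru Moon, Donggon Yoo, Eunbyung Park
Institutions: Seoul National University; Korea Advanced Institute of Science and Technology; Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Project page: this https URL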
Abstract:Moiré patterns, caused by frequency aliasing between fine repetitive structures and a camera sensor’s sampling process, have been a significant obstacle in various real-world applications, such as consumer photography and industrial defect inspection. With the advancements in deep learning algorithms, numerous studies, predominantly based on convolutional neural networks, have suggested various solutions to address this issue. Despite these efforts, existing approaches still struggle to effectively eliminate artifacts due to the diverse scales, orientations, and color shifts of moiré patterns, primarily because the constrained receptive field of CNN-based architectures limits their ability to capture the complex characteristics of moiré patterns. In this paper, we propose MZNet, a U-shaped network designed to bring images closer to a ‘Moire-Zero’ state by effectively removing moiré patterns. It integrates three specialized components: Multi-Scale Dual Attention Block (MSDAB) for extracting and refining multi-scale features, Multi-Shape Large Kernel Convolution Block (MSLKB) for capturing diverse moiré structures, and Feature Fusion-Based Skip Connection for enhancing information flow. Together, these components enhance local texture restoration and large-scale artifact suppression. Experiments on benchmark datasets demonstrate that MZNet achieves state-of-the-art performance on high-resolution datasets and delivers competitive results on lower-resolution dataset, while maintaining a low computational cost, suggesting that it is an efficient and practical solution for real-world applications. Project page: this https URL
[CV-54] MINR: Implicit Neural Representations with Masked Image Modelling ICCV2023
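【Quick Read】: This paper addresses the fact that self-supervised methods such as masked autoencoders (MAE) are highly sensitive to the masking strategy used in training and degrade markedly on out-of-distribution data. The key to the solution is the masked implicit neural representations (MINR) framework, which combines implicit neural representations (INRs) with masked image modeling and learns a continuous function to represent an image, giving robust, generalizable reconstruction that does not depend on the masking strategy. This improves performance in both in-domain and out-of-distribution settings while reducing model complexity.
Link: https://arxiv.org/abs/2507.22404
Authors: Sua Lee, Joonhun Lee, Myungjoo Kang
Institutions: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the ICCV 2023 workshop on Out-of-Distribution Generalization in Computer Vision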
Abstract:Self-supervised learning methods like masked autoencoders (MAE) have shown significant promise in learning robust feature representations, particularly in image reconstruction-based pretraining task. However, their performance is often strongly dependent on the masking strategies used during training and can degrade when applied to out-of-distribution data. To address these limitations, we introduce the masked implicit neural representations (MINR) framework that synergizes implicit neural representations with masked image modeling. MINR learns a continuous function to represent images, enabling more robust and generalizable reconstructions irrespective of masking strategies. Our experiments demonstrate that MINR not only outperforms MAE in in-domain scenarios but also in out-of-distribution settings, while reducing model complexity. The versatility of MINR extends to various self-supervised learning applications, confirming its utility as a robust and efficient alternative to existing frameworks.
[CV-55] On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
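【Quick Read】: This paper addresses the fragility of vision-language models (VLMs) under subtle, structured perturbations in the frequency domain, in particular their unreliability in image authenticity (DeepFake) detection and automated image captioning. The key to the solution is a targeted, frequency-domain image transformation that systematically adjusts an image's spatial-frequency components to alter VLM outputs while remaining visually imperceptible, showing that VLM judgments are sensitive to frequency cues and may not fully follow semantic content. Experiments show the method generalizes across five mainstream VLMs (Qwen2/2.5 and BLIP models of different parameter counts) and ten datasets of real and generated images, confirming notable perceptual fragility under black-box conditions and underscoring the need for robust multimodal perception systems.
Link: https://arxiv.org/abs/2507.22398
Authors: Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian
Institutions: University of Western Australia; University of Melbourne; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Keywords: Vision-Language Models, Frequency-Domain Perturbations, Adversarial Robustness, Image Authenticity, Reliability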
Abstract:Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs when exposed to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations, operating in the frequency domain to systematically adjust VLM outputs when exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs which includes different-parameter Qwen2/2.5 and BLIP models. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.
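A toy sketch of a spatial-frequency transformation in the spirit of the abstract, under loud assumptions: the paper's actual perturbation is not specified here, so the band limits, the gain, and the helper names `dft2` and `perturb_band` are hypothetical, and a naive discrete Fourier transform on an 8x8 image is used to keep the example dependency-free.

```python
import cmath
import math


def dft2(grid, inverse=False):
    """Naive 2D discrete Fourier transform of a square list of lists."""
    n = len(grid)
    sign = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    angle = sign * 2 * math.pi * (u * x + v * y) / n
                    total += grid[x][y] * cmath.exp(1j * angle)
            out[u][v] = total / (n * n) if inverse else total
    return out


def perturb_band(image, radius_lo, radius_hi, gain):
    """Multiply frequencies whose radius lies in [radius_lo, radius_hi] by gain."""
    n = len(image)
    spectrum = dft2(image)
    for u in range(n):
        for v in range(n):
            # Index n - k represents frequency -k, so fold it back to k.
            radius = math.hypot(min(u, n - u), min(v, n - v))
            if radius_lo <= radius <= radius_hi:
                spectrum[u][v] *= gain
    restored = dft2(spectrum, inverse=True)
    # The mask is symmetric and the gain is real, so the result stays real.
    return [[value.real for value in row] for row in restored]


n = 8
image = [[0.5 + 0.25 * math.cos(2 * math.pi * 3 * x / n) for _ in range(n)]
         for x in range(n)]
perturbed = perturb_band(image, radius_lo=2, radius_hi=4, gain=1.2)
print(image[0][0], round(perturbed[0][0], 6))  # 0.75 0.8
```

Only the band containing the cosine is amplified, so its amplitude grows from 0.25 to 0.3 while the mean brightness is untouched. A real attack would choose the band and gain so that the change stays invisible to a human viewer yet shifts the model's output.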
[CV-56] Gems: Group Emotion Profiling Through Multimodal Situational Understanding
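【Quick Read】: This paper addresses the complexity of understanding emotion at the individual, group, and event levels in multi-person social scenes, with a focus on incorporating contextual information to predict emotion from fine to coarse granularity. The core challenge is that existing benchmarks mainly cover atomic interactions and group-level emotion perception and do not model individual, group, and event-level emotion jointly. The key to the solution is GEMS, a framework built on a multimodal Swin-Transformer and an S3Attention mechanism that jointly processes the input scene, group members, and context to predict emotion at all three levels, including basic discrete emotions and continuous dimensions such as valence and arousal. The VGAF dataset is extended into the VGAF-GEMS benchmark to link individual, group, and situational emotional responses, and quantitative and qualitative comparisons show gains over adapted state-of-the-art models.
Link: https://arxiv.org/abs/2507.22393
Authors: Anubhav Kataria, Surbhi Madan, Shreya Ghosh, Tom Gedeon, Abhinav Dhall
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: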
Abstract:Understanding individual, group and event level emotions along with contextual information is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting fine-grained individual emotion to coarse grained group and event level emotion. We introduce GEMS that leverages a multimodal swin-transformer and S3Attention based architecture, which processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion related benchmarks mainly focus on atomic interactions primarily based on emotion perception over time and group level. To this end, we extend and propose VGAF-GEMS to provide more fine grained and holistic analysis on top of existing group level annotation of VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group and event level perceived emotions. Our benchmarking effort links individual, group and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way of further research. The code and data is available at: this https URL
[CV-57] Exploring the Application of Visual Question Answering (VQA) for Classroom Activity Monitoring
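【Quick Read】: This paper addresses the difficulty of automatically analyzing complex teacher-student interactions in classroom behavior monitoring, with the aim of improving student engagement and learning outcomes. The key to the solution is to apply state-of-the-art Visual Question Answering (VQA) models to behavior-related questions about real classroom videos: the authors introduce the BAV-Classroom-VQA dataset and benchmark the open-source models LLaMA2, LLaMA3, QWEN3, and NVILA on it, finding promising performance that suggests a feasible route toward future classroom analytics and intervention systems.
Link: https://arxiv.org/abs/2507.22369
Authors: Sinh Trong Vu, Hieu Trung Pham, Dung Manh Nguyen, Hieu Minh Hoang, Nhu Hoang Le, Thu Ha Pham, Tai Tan Mai
Institutions: Banking Academy of Vietnam; Dublin City University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: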
Abstract:Classroom behavior monitoring is a critical aspect of educational research, with significant implications for student engagement and learning outcomes. Recent advancements in Visual Question Answering (VQA) models offer promising tools for automatically analyzing complex classroom interactions from video recordings. In this paper, we investigate the applicability of several state-of-the-art open-source VQA models, including LLaMA2, LLaMA3, QWEN3, and NVILA, in the context of classroom behavior analysis. To facilitate rigorous evaluation, we introduce our BAV-Classroom-VQA dataset derived from real-world classroom video recordings at the Banking Academy of Vietnam. We present the methodology for data collection, annotation, and benchmark the performance of the selected VQA models on this dataset. Our initial experimental results demonstrate that all four models achieve promising performance levels in answering behavior-related visual questions, showcasing their potential in future classroom analytics and intervention systems.
[CV-58] Object Recognition Datasets and Challenges: A Review
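【Quick Read】: This paper addresses two needs in object recognition research: a clear picture of the diversity, scale, and quality of the available datasets, and systematic means of comparing algorithm performance quantitatively. The key to the solution is a statistical analysis and description of more than 160 commonly used public datasets, an overview of prominent object recognition benchmarks and competitions, and a summary of the evaluation metrics widely adopted in the computer vision community, giving data-driven and machine learning researchers a reference for data resources and for fair performance evaluation.
Link: https://arxiv.org/abs/2507.22361
Authors: Aria Salari, Abtin Djavadifar, Xiangrui Liu, Homayoun Najjaran
Institutions: University of British Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: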
Abstract:Object recognition is among the fundamental tasks in the computer vision applications, paving the path for all other image understanding operations. In every stage of progress in object recognition research, efforts have been made to collect and annotate new datasets to match the capacity of the state-of-the-art algorithms. In recent years, the importance of the size and quality of datasets has been intensified as the utility of the emerging deep network techniques heavily relies on training data. Furthermore, datasets lay a fair benchmarking means for competitions and have proved instrumental to the advancements of object recognition research by providing quantifiable benchmarks for the developed models. Taking a closer look at the characteristics of commonly-used public datasets seems to be an important first step for data-driven and machine learning researchers. In this survey, we provide a detailed analysis of datasets in the highly investigated object recognition areas. More than 160 datasets have been scrutinized through statistics and descriptions. Additionally, we present an overview of the prominent object recognition benchmarks and competitions, along with a description of the metrics widely adopted for evaluation purposes in the computer vision community. All introduced datasets and challenges can be found online at this http URL.
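The abstract does not list the metrics it surveys. As one widely used example from object detection benchmarks, intersection over union (IoU) between a predicted and a ground-truth box can be sketched as follows; the `(x1, y1, x2, y2)` corner convention and the function name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.142857
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 for disjoint boxes
```

Detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, and then aggregate those decisions into precision-based scores.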
[CV-59] GVD: Guiding Video Diffusion Model for Scalable Video Distillation
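【Quick Read】: This paper addresses the heavy computation and storage costs of training on large video datasets. Video dataset distillation extracts the key spatial and temporal information from the original data to build a much smaller distilled dataset whose training performance approaches that of the full data. The key to the solution is GVD (Guiding Video Diffusion), the first diffusion-based video distillation method, which jointly distills spatial and temporal features, preserving action diversity while capturing motion information for high-fidelity video generation. On MiniUCF, GVD reaches 78.29% of the original dataset's performance using only 1.98% of the frames, and on HMDB51 it reaches 73.83% with 3.30% of the frames, significantly outperforming prior state-of-the-art methods; it also supports higher resolutions and higher Instances Per Class (IPC) without significantly increasing computational cost.
Link: https://arxiv.org/abs/2507.22360
Authors: Kunyang Li, Jeffrey A Chan Santiago, Sarinda Dhanesh Samarasinghe, Gaowen Liu, Mubarak Shah
Institutions: Center for Research in Computer Vision, University of Central Florida; Cisco Research, San Jose, California, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: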
Abstract:To address the larger computation and storage requirements associated with large video datasets, video dataset distillation aims to capture spatial and temporal information in a significantly smaller dataset, such that training on the distilled data has comparable performance to training on all of the data. We propose GVD: Guiding Video Diffusion, the first diffusion-based video distillation method. GVD jointly distills spatial and temporal features, ensuring high-fidelity video generation across diverse actions while capturing essential motion information. Our method’s diverse yet representative distillations significantly outperform previous state-of-the-art approaches on the MiniUCF and HMDB51 datasets across 5, 10, and 20 Instances Per Class (IPC). Specifically, our method achieves 78.29 percent of the original dataset’s performance using only 1.98 percent of the total number of frames in MiniUCF. Additionally, it reaches 73.83 percent of the performance with just 3.30 percent of the frames in HMDB51. Experimental results across benchmark video datasets demonstrate that GVD not only achieves state-of-the-art performance but can also generate higher resolution videos and higher IPC without significantly increasing computational cost.
[CV-60] FaceGCD: Generalized Face Discovery via Dynamic Prefix Generation BMVC2025
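【Quick Read】: This paper addresses the open-world problem of recognizing known identities while discovering unknown ones, termed generalized face discovery (GFD). GFD requires recognizing both labeled and unlabeled known identities and also discovering previously unseen identities from the data; the difficulty is that face IDs have high cardinality and are fine-grained, so existing generalized category discovery (GCD) methods are ineffective. The key to the solution is FaceGCD, which dynamically builds instance-specific feature extractors from lightweight, layer-wise prefixes: a HyperNetwork adaptively outputs a set of prefix generators conditioned on each input image, capturing subtle identity-specific cues without relying on a high-capacity static model.
Link: https://arxiv.org/abs/2507.22353
Authors: Yunseok Oh, Dong-Wan Choi
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025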
Abstract:Recognizing and differentiating among both familiar and unfamiliar faces is a critical capability for face recognition systems and a key step toward artificial general intelligence (AGI). Motivated by this ability, this paper introduces generalized face discovery (GFD), a novel open-world face recognition task that unifies traditional face identification with generalized category discovery (GCD). GFD requires recognizing both labeled and unlabeled known identities (IDs) while simultaneously discovering new, previously unseen IDs. Unlike typical GCD settings, GFD poses unique challenges due to the high cardinality and fine-grained nature of face IDs, rendering existing GCD approaches ineffective. To tackle this problem, we propose FaceGCD, a method that dynamically constructs instance-specific feature extractors using lightweight, layer-wise prefixes. These prefixes are generated on the fly by a HyperNetwork, which adaptively outputs a set of prefix generators conditioned on each input image. This dynamic design enables FaceGCD to capture subtle identity-specific cues without relying on high-capacity static models. Extensive experiments demonstrate that FaceGCD significantly outperforms existing GCD methods and a strong face recognition baseline, ArcFace, achieving state-of-the-art results on the GFD task and advancing toward open-world face recognition.
[CV-61] DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception
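[Quick read]: This paper addresses the lack of support for interactive, instruction-driven analysis of land-cover changes in multi-temporal remote sensing imagery. Existing methods typically provide only one-shot change masks or static captions, which does not meet users' needs for dynamic exploration and precise question answering. The key to the solution is a new paradigm, remote sensing image change analysis (RSICA), together with ChangeChat-105k, a large-scale instruction-following dataset covering six interaction types. On this basis the authors design the DeltaVLM architecture, whose core innovations are: (1) a fine-tuned bi-temporal vision encoder that captures temporal differences; (2) a visual difference perception module based on a cross-semantic relation measuring (CSRM) mechanism that interprets changes; and (3) an instruction-guided Q-former that extracts the visual difference information relevant to the textual instruction and aligns the two. With the large language model frozen and only the vision and alignment modules adapted, DeltaVLM performs end-to-end interactive change analysis and achieves state-of-the-art performance on both single-turn and multi-turn tasks.
Link: https://arxiv.org/abs/2507.22346
Authors: Pei Deng, Wenqian Zhou, Hanlin Wu
Affiliations: Beijing Foreign Studies University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures. Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS). Code and dataset are available at this https URL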
Abstract:Accurate interpretation of land-cover changes in multi-temporal satellite imagery is critical for real-world scenarios. However, existing methods typically provide only one-shot change masks or static captions, limiting their ability to support interactive, query-driven analysis. In this work, we introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering to enable multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images. To support this task, we construct ChangeChat-105k, a large-scale instruction-following dataset, generated through a hybrid rule-based and GPT-assisted process, covering six interaction types: change captioning, classification, quantification, localization, open-ended question answering, and multi-turn dialogues. Building on this dataset, we propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information from visual changes, aligning them with textual instructions. We train DeltaVLM on ChangeChat-105k using a frozen large language model, adapting only the vision and alignment modules to optimize efficiency. Extensive experiments and ablation studies demonstrate that DeltaVLM achieves state-of-the-art performance on both single-turn captioning and multi-turn interactive change analysis, outperforming existing multimodal large language models and remote sensing vision-language models. Code, dataset and pre-trained weights are available at this https URL.
[CV-62] UFV-Splatter: Pose-Free Feed-Forward 3D Gaussian Splatting Adapted to Unfavorable Views
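[Quick read]: This paper addresses the limitations of current feed-forward 3D Gaussian Splatting (3DGS) methods when the input views are unfavorable. Existing training setups usually place the object at the world origin and render it from cameras pointed toward the origin, that is, they rely on favorable views, which limits applicability to real-world scenes with unknown or varying camera poses. The key to the solution is a new adaptation framework: recentered input images are fed into a pretrained pose-free 3DGS model augmented with Low-Rank Adaptation (LoRA) layers, so that priors learned from favorable images can be reused. A Gaussian adapter module improves the geometric consistency of the Gaussians derived from the recentered inputs, and a Gaussian alignment method renders accurate target views for training, extending the model's generalization to unfavorable views.
Link: https://arxiv.org/abs/2507.22342
Authors: Yuki Fujimura, Takahiro Kushida, Kazuya Kitano, Takuya Funatomi, Yasuhiro Mukaigawa
Affiliations: Nara Institute of Science and Technology; Ritsumeikan University; Kyoto University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL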
Abstract:This paper presents a pose-free, feed-forward 3D Gaussian Splatting (3DGS) framework designed to handle unfavorable input views. A common rendering setup for training feed-forward approaches places a 3D object at the world origin and renders it from cameras pointed toward the origin – i.e., from favorable views, limiting the applicability of these models to real-world scenarios involving varying and unknown camera poses. To overcome this limitation, we introduce a novel adaptation framework that enables pretrained pose-free feed-forward 3DGS models to handle unfavorable views. We leverage priors learned from favorable images by feeding recentered images into a pretrained model augmented with low-rank adaptation (LoRA) layers. We further propose a Gaussian adapter module to enhance the geometric consistency of the Gaussians derived from the recentered inputs, along with a Gaussian alignment method to render accurate target views for training. Additionally, we introduce a new training strategy that utilizes an off-the-shelf dataset composed solely of favorable images. Experimental results on both synthetic images from the Google Scanned Objects dataset and real images from the OmniObject3D dataset validate the effectiveness of our method in handling unfavorable input views.
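For readers unfamiliar with the low-rank adaptation (LoRA) layers mentioned in the abstract, the following is a minimal sketch of the general LoRA idea, not the paper's implementation. It assumes a single linear layer with a frozen weight `W` and two small trainable matrices `A` and `B`; the function names, shapes, and the `alpha / r` scaling follow the common LoRA formulation and are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_linear(x, W, A, B, alpha):
    """Apply a linear layer with a low-rank update (hypothetical helper).

    x: (n, d_in) inputs, W: (d_in, d_out) frozen pretrained weight,
    A: (d_in, r) and B: (r, d_out) trainable low-rank factors.
    Output is x @ W + (alpha / r) * (x @ A) @ B.
    """
    r = len(B)  # rank of the update
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)]


x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [1.0]]

# With B initialised to zero, the adapted layer reproduces the frozen layer.
unchanged = lora_linear(x, W, A, [[0.0, 0.0]], alpha=2.0)
# A non-zero B shifts the output through the rank-1 path only.
adapted = lora_linear(x, W, A, [[1.0, -1.0]], alpha=2.0)
```

Only `A` and `B` would be trained in such a setup, which is why adapters of this kind are a cheap way to specialise a pretrained model to a new input distribution.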
[CV-63] Learning from Heterogeneous Structural MRI via Collaborative Domain Adaptation for Late-Life Depression Assessment
[Quick read]: This paper addresses the difficulty of accurately identifying late-life depression (LLD) from structural brain MRI when limited sample sizes make model training hard and generalization poor. In cross-domain settings in particular, substantial differences in imaging protocols, scanner hardware, and population demographics prevent existing methods from transferring reliably. The key to the solution is a Collaborative Domain Adaptation (CDA) framework that combines a Vision Transformer (ViT), which captures global anatomical context, with a Convolutional Neural Network (CNN), which extracts local structural features. It proceeds in three stages: (a) supervised training on labeled source-domain data; (b) self-supervised adaptation of target-domain features by minimizing the discrepancy between the outputs of the two branch classifiers, which sharpens the class boundary; and (c) collaborative training on unlabeled target-domain data using pseudo-labels and strong and weak augmentation, improving robustness and generalization across domains.
Link: https://arxiv.org/abs/2507.22321
Authors: Yuzhen Gao, Qianqian Wang, Yongheng Sun, Cui Wang, Yongquan Liang, Mingxia Liu
Affiliations: University of North Carolina at Chapel Hill; Shandong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate identification of late-life depression (LLD) using structural brain MRI is essential for monitoring disease progression and facilitating timely intervention. However, existing learning-based approaches for LLD detection are often constrained by limited sample sizes (e.g., tens), which poses significant challenges for reliable model training and generalization. Although incorporating auxiliary datasets can expand the training set, substantial domain heterogeneity, such as differences in imaging protocols, scanner hardware, and population demographics, often undermines cross-domain transferability. To address this issue, we propose a Collaborative Domain Adaptation (CDA) framework for LLD detection using T1-weighted MRIs. The CDA leverages a Vision Transformer (ViT) to capture global anatomical context and a Convolutional Neural Network (CNN) to extract local structural features, with each branch comprising an encoder and a classifier. The CDA framework consists of three stages: (a) supervised training on labeled source data, (b) self-supervised target feature adaptation and (c) collaborative training on unlabeled target data. We first train ViT and CNN on source data, followed by self-supervised target feature adaptation by minimizing the discrepancy between classifier outputs from two branches to make the categorical boundary clearer. The collaborative training stage employs pseudo-labeled and augmented target-domain MRIs, enforcing prediction consistency under strong and weak augmentation to enhance domain robustness and generalization. Extensive experiments conducted on multi-site T1-weighted MRI data demonstrate that the CDA consistently outperforms state-of-the-art unsupervised domain adaptation methods.
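The adaptation stage described above compares the outputs of the two branch classifiers. The abstract does not say which discrepancy measure is used, so the sketch below assumes a mean absolute difference between softmax probabilities, a common choice in classifier-discrepancy methods. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_discrepancy(logits_a, logits_b):
    """Mean absolute difference between two classifiers' class probabilities.

    logits_a, logits_b: lists of per-sample logit lists from the two branches
    (for example a ViT branch and a CNN branch) on the same unlabeled batch.
    The L1 form is an assumption, not taken from the paper.
    """
    total = 0.0
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        total += sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)
    return total / len(logits_a)


# Two branches that agree give zero discrepancy; opposite confident
# predictions on a two-class problem give a value close to one.
agree = classifier_discrepancy([[2.0, -1.0]], [[2.0, -1.0]])
disagree = classifier_discrepancy([[10.0, -10.0]], [[-10.0, 10.0]])
```

In a training loop this quantity would serve as a loss term on target-domain batches, pushing the two branches toward consistent predictions.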
[CV-64] LAMA-Net: A Convergent Network Architecture for Dual-Domain Reconstruction
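[Quick read]: This paper addresses image reconstruction in Sparse-View Computed Tomography, where the core challenge is recovering high-quality images from limited, noisy measurements. The key to the solution is a learnable variational model that jointly exploits complementary information from the image domain and the measurement domain. Building on their earlier Learned Alternating Minimization Algorithm (LAMA), the authors construct an interpretable neural network architecture, LAMA-Net, and give a complete and rigorous convergence proof for LAMA: all accumulation points of a specified subsequence are Clarke stationary points of the problem. The authors show that this convergence property gives LAMA-Net strong stability and robustness. Adding a properly designed network that generates suitable initial values yields iLAMA-Net, which is evaluated against several state-of-the-art methods on popular benchmark datasets.
Link: https://arxiv.org/abs/2507.22316
Authors: Chi Ding, Qingchao Zhang, Ge Wang, Xiaojing Ye, Yunmei Chen
Affiliations: University of Florida; Rensselaer Polytechnic Institute; Georgia State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.21111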
Abstract:We propose a learnable variational model that learns the features and leverages complementary information from both image and measurement domains for image reconstruction. In particular, we introduce a learned alternating minimization algorithm (LAMA) from our prior work, which tackles two-block nonconvex and nonsmooth optimization problems by incorporating a residual learning architecture in a proximal alternating framework. In this work, our goal is to provide a complete and rigorous convergence proof of LAMA and show that all accumulation points of a specified subsequence of LAMA must be Clarke stationary points of the problem. LAMA directly yields a highly interpretable neural network architecture called LAMA-Net. Notably, in addition to the results shown in our prior work, we demonstrate that the convergence property of LAMA yields outstanding stability and robustness of LAMA-Net in this work. We also show that the performance of LAMA-Net can be further improved by integrating a properly designed network that generates suitable initials, which we call iLAMA-Net. To evaluate LAMA-Net/iLAMA-Net, we conduct several experiments and compare them with several state-of-the-art methods on popular benchmark datasets for Sparse-View Computed Tomography.
[CV-65] AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data
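[Quick read]: This paper addresses the scarcity of high-quality labels in Earth observation data, a bottleneck for building accurate mapping and monitoring systems. Labels depend on physical measurements and observations that take considerable effort, which has led to bespoke modeling efforts for turning sparse labels into maps. The key to the solution is AlphaEarth Foundations, an embedding field model that assimilates spatial, temporal, and measurement contexts across multiple data sources to produce a highly general geospatial representation. Without re-training, its embeddings consistently outperform all previously tested featurization approaches on a diverse set of mapping evaluations, enabling accurate and efficient map production from local to global scales.
Link: https://arxiv.org/abs/2507.22291
Authors: Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, Pushmeet Kohli
Affiliations: Google DeepMind; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: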
Abstract:Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general, geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only to consistently outperform all previous featurization approaches tested on a diverse set of mapping evaluations without re-training. We will release a dataset of global, annual, analysis-ready embedding field layers from 2017 through 2024.
[CV-66] HOG-CNN: Integrating Histogram of Oriented Gradients with Convolutional Neural Networks for Retinal Image Classification
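[Quick read]: This paper addresses the fact that conventional diagnostic workflows for fundus images depend on manual interpretation and are time- and resource-intensive, aiming at automated and interpretable decision support for common retinal diseases such as Diabetic Retinopathy (DR), Glaucoma, and Age-related Macular Degeneration (AMD). The key to the solution is a clinical decision support framework based on a hybrid feature extraction model, HOG-CNN, which fuses handcrafted Histogram of Oriented Gradients (HOG) features with deep Convolutional Neural Network (CNN) representations. The fusion captures both local texture patterns and high-level semantic features in fundus images. The model reports consistently high performance on three public benchmark datasets, and its lightweight, interpretable design suits deployment in resource-constrained clinical environments.
Link: https://arxiv.org/abs/2507.22274
Authors: Faisal Ahmed
Affiliations: Embry-Riddle Aeronautical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages; 5 figures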
Abstract:The analysis of fundus images is critical for the early detection and diagnosis of retinal diseases such as Diabetic Retinopathy (DR), Glaucoma, and Age-related Macular Degeneration (AMD). Traditional diagnostic workflows, however, often depend on manual interpretation and are both time- and resource-intensive. To address these limitations, we propose an automated and interpretable clinical decision support framework based on a hybrid feature extraction model called HOG-CNN. Our key contribution lies in the integration of handcrafted Histogram of Oriented Gradients (HOG) features with deep convolutional neural network (CNN) representations. This fusion enables our model to capture both local texture patterns and high-level semantic features from retinal fundus images. We evaluated our model on three public benchmark datasets: APTOS 2019 (for binary and multiclass DR classification), ORIGA (for Glaucoma detection), and IC-AMD (for AMD diagnosis); HOG-CNN demonstrates consistently high performance. It achieves 98.5% accuracy and 99.2 AUC for binary DR classification, and 94.2 AUC for five-class DR classification. On the IC-AMD dataset, it attains 92.8% accuracy, 94.8% precision, and 94.5 AUC, outperforming several state-of-the-art models. For Glaucoma detection on ORIGA, our model achieves 83.9% accuracy and 87.2 AUC, showing competitive performance despite dataset limitations. We show, through comprehensive appendix studies, the complementary strength of combining HOG and CNN features. The model’s lightweight and interpretable design makes it particularly suitable for deployment in resource-constrained clinical environments. These results position HOG-CNN as a robust and scalable tool for automated retinal disease screening.
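To make the HOG and CNN fusion idea concrete, here is a deliberately simplified sketch. It computes an unsigned gradient-orientation histogram for a single grayscale cell and concatenates it with a placeholder CNN feature vector. Real HOG implementations add block normalization and bin interpolation, and the abstract does not state how the two feature sets are combined, so the concatenation step, the function names, and the sample values are all assumptions.

```python
import math


def hog_cell_histogram(cell, n_bins=9):
    """Orientation histogram for one grayscale cell given as a list of rows.

    Gradients use central differences on interior pixels; each pixel votes
    for its unsigned orientation bin (0 to 180 degrees) weighted by gradient
    magnitude. The histogram is L2-normalized.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) + 1e-12
    return [v / norm for v in hist]


def fuse_features(hog_vec, cnn_vec):
    """Concatenate handcrafted and learned features (assumed fusion rule)."""
    return list(hog_vec) + list(cnn_vec)


# A 4x4 cell with a vertical edge: all gradient energy is horizontal,
# so it lands in the first orientation bin.
edge_cell = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
hog_vec = hog_cell_histogram(edge_cell)
fused = fuse_features(hog_vec, [0.1, 0.2])  # placeholder CNN features
```

A classifier trained on `fused` would see both the explicit edge statistics and whatever the CNN has learned, which is the complementarity the abstract describes.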
[CV-67] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees CVPR2025
[Quick read]: This paper addresses information misalignment and entangled representations in Contrastive Language-Image Pre-training (CLIP) models. In datasets such as MSCOCO, short captions may cover only parts of an image, leaving the model unsure which visual features to retain or discard, while directly aligning long captions with images retains entangled details and prevents the model from learning disentangled, atomic concepts, which limits generalization on downstream tasks with short prompts. The key to the solution is a set of theoretical conditions that allow flexible alignment of cross-modal representations at varying granularity, together with a new method, SmartCLIP, which identifies and aligns the most relevant visual and textual representations in a modular manner, preserving cross-modal semantic information while disentangling fine-grained concepts.
Link: https://arxiv.org/abs/2507.22264
Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang
Affiliations: Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence; The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR2025
Abstract:Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts – ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only preserve cross-modal semantic information in its entirety but also disentangle visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at this https URL.
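[CV-68] Temporally Consistent Unsupervised Segmentation for Mobile Robot Perception
[Quick read]: This paper addresses two challenges that mobile robots face when performing unsupervised semantic segmentation in unrehearsed, unstructured terrain: supervised methods that depend on costly data collection and manual labeling adapt poorly to new scenes, and existing zero-shot unsupervised segmentation methods mostly process single frames and lack temporal consistency, which hurts perception robustness. The key to the solution is Frontier-Seg, which clusters superpixel-level features extracted from foundation model backbones (specifically DINOv2) and enforces temporal consistency across frames of the video stream. This identifies persistent terrain boundaries (frontiers) without human labels and gives temporally consistent unsupervised segmentation of unstructured environments.
Link: https://arxiv.org/abs/2507.22194
Authors: Christian Ellis, Maggie Wigness, Craig Lennon, Lance Fiondella
Affiliations: Oden Institute for Computational Engineering & Sciences, University of Texas at Austin; DEVCOM Army Research Laboratory, Adelphi, MD, United States; Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: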
Abstract:Rapid progress in terrain-aware autonomous ground navigation has been driven by advances in supervised semantic segmentation. However, these methods rely on costly data collection and labor-intensive ground truth labeling to train deep models. Furthermore, autonomous systems are increasingly deployed in unrehearsed, unstructured environments where no labeled data exists and semantic categories may be ambiguous or domain-specific. Recent zero-shot approaches to unsupervised segmentation have shown promise in such settings but typically operate on individual frames, lacking temporal consistency-a critical property for robust perception in unstructured environments. To address this gap we introduce Frontier-Seg, a method for temporally consistent unsupervised segmentation of terrain from mobile robot video streams. Frontier-Seg clusters superpixel-level features extracted from foundation model backbones-specifically DINOv2-and enforces temporal consistency across frames to identify persistent terrain boundaries or frontiers without human supervision. We evaluate Frontier-Seg on a diverse set of benchmark datasets-including RUGD and RELLIS-3D-demonstrating its ability to perform unsupervised segmentation across unstructured off-road environments.
[CV-69] Enhancing efficiency in paediatric brain tumour segmentation using a pathologically diverse single-center clinical dataset
[Quick read]: This paper addresses the diagnostic and therapeutic challenges that subtype diversity and differing MRI protocols pose for the segmentation of Paediatric Brain Tumours (PBTs), in particular how robust Deep Learning (DL) segmentation is across tumour subtypes and imaging sequences. The key to the solution is a 3D nnU-Net model that automatically segments multiple PBT subtypes (high- and low-grade gliomas, medulloblastomas, ependymomas, and others), with performance compared against human intra- and inter-rater variability. The model performs well on the whole tumour (WT) and T2-hyperintensity (T2H) regions (mean Dice similarity coefficient, DSC, of 0.85), close to human annotator variability (DSC of 0.86), and T1, T1 post-contrast, and T2 sequences alone give results nearly equivalent to the full MRI protocol, which points to simpler clinical scan protocols and automated volumetric assessment.
Link: https://arxiv.org/abs/2507.22152
Authors: A. Piffer(1),J. A. Buchner(2,3,4),A. G. Gennari(5,6),P. Grehten(7),S. Sirin(7),E. Ross(8),I. Ezhov(9,10),M. Rosier(11,12),J. C. Peeken(2),M. Piraud(11),B. Menze(12),A. Guerreiro Stücklin(1),A. Jakab(6,13),F. Kofler(10,11,12,14) ((1) Division of Oncology and Children’s Research Center, University Children’s Hospital Zurich, Zurich, Switzerland, (2) Department of Radiation Oncology, TUM School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany, (3) Institute of Radiation Medicine (IRM), Helmholtz Zentrum, Oberschleißheim, Germany, (4) Partner Site Munich, German Consortium for Translational Cancer Research (DKTK), Munich, Germany, (5) Department of Neuropaediatrics, University Children’s Hospital Zurich, Switzerland, (6) Center for MR- Research, University Children’s Hospital Zurich, Zurich, Switzerland, (7) Department of Diagnostic Imaging, University Children’s Hospital Zurich, Zurich, Switzerland, (8) Georgia Institute of Technology, Geisel School of Medicine at Dartmouth, (9) Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany, (10) TranslaTUM - Central Institute for Translational Cancer Research, Technical University of Munich, Munich, (11) Helmholtz AI, Helmholtz Zentrum Munich, Munich, Germany, (12) Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland, (13) Faculty of Medicine, University of Zürich, Switzerland, (14) Department of Diagnostic and Interventional Neuroradiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany)
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: A. Jakab and F. Kofler have shared last authorship
Abstract:Background Brain tumours are the most common solid malignancies in children, encompassing diverse histological, molecular subtypes and imaging features and outcomes. Paediatric brain tumours (PBTs), including high- and low-grade gliomas (HGG, LGG), medulloblastomas (MB), ependymomas, and rarer forms, pose diagnostic and therapeutic challenges. Deep learning (DL)-based segmentation offers promising tools for tumour delineation, yet its performance across heterogeneous PBT subtypes and MRI protocols remains uncertain. Methods A retrospective single-centre cohort of 174 paediatric patients with HGG, LGG, medulloblastomas (MB), ependymomas, and other rarer subtypes was used. MRI sequences included T1, T1 post-contrast (T1-C), T2, and FLAIR. Manual annotations were provided for four tumour subregions: whole tumour (WT), T2-hyperintensity (T2H), enhancing tumour (ET), and cystic component (CC). A 3D nnU-Net model was trained and tested (121/53 split), with segmentation performance assessed using the Dice similarity coefficient (DSC) and compared against intra- and inter-rater variability. Results The model achieved robust performance for WT and T2H (mean DSC: 0.85), comparable to human annotator variability (mean DSC: 0.86). ET segmentation was moderately accurate (mean DSC: 0.75), while CC performance was poor. Segmentation accuracy varied by tumour type, MRI sequence combination, and location. Notably, T1, T1-C, and T2 alone produced results nearly equivalent to the full protocol. Conclusions DL is feasible for PBTs, particularly for T2H and WT. Challenges remain for ET and CC segmentation, highlighting the need for further refinement. These findings support the potential for protocol simplification and automation to enhance volumetric assessment and streamline paediatric neuro-oncology workflows.
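The Dice similarity coefficient (DSC) used to score the segmentations is defined as twice the overlap between the predicted and reference masks divided by the sum of their sizes. The sketch below is a minimal illustration on flattened binary masks, not the evaluation code from the study. The function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two flat binary masks.

    DSC = 2 * |pred AND truth| / (|pred| + |truth|).
    Returns 1.0 when both masks are empty (an assumed convention).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        return 1.0
    return 2.0 * overlap / size_sum


# Toy example: each mask marks two voxels and they share one of them.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
score = dice_coefficient(predicted, reference)
```

A score of 1.0 means perfect overlap and 0.0 means none, so the reported mean DSC of 0.85 for WT and T2H can be read against the 0.86 obtained between human annotators.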
[CV-70] Color as the Impetus: Transforming Few-Shot Learner
[Quick read]: This paper addresses the under-use of color information in conventional few-shot learning, where existing methods mostly rely on abstract features to separate categories and overlook color perception, an intuitive and important mechanism of the human visual system. The key to the solution is a bio-inspired meta-learning framework, ColorSense Learner, which uses inter-channel feature extraction and interactive learning to emphasize discriminative information in different color channels, filtering irrelevant features while strengthening intra-class commonality and inter-class differences. The authors also introduce a knowledge-distillation-based meta-distiller, ColorSense Distiller, which uses prior teacher knowledge to strengthen the student network's meta-learning capacity. The paper reports improved generalization, robustness, and transferability across coarse-grained, fine-grained, and cross-domain tasks.
Link: https://arxiv.org/abs/2507.22136
Authors: Chaofei Qi, Zhitai Liu, Jianbin Qiu
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we pioneer an innovative viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network’s meta-learning capacity. We’ve conducted comprehensive coarse/fine-grained and cross-domain experiments on eleven few-shot benchmarks for validation. Numerous experiments reveal that our methods have extremely strong generalization ability, robustness, and transferability, and effortlessly handle few-shot classification from the perspective of color perception.
[CV-71] AI in Agriculture: A Survey of Deep Learning Techniques for Crops Fisheries and Livestock
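[Quick read]: This paper addresses the challenges that crops, fisheries, and livestock face from climate variability, resource limitations, and the need for sustainable management, which call for efficient, accurate, and scalable technological solutions. Its core contribution is a systematic review of more than 200 research works covering conventional machine learning, advanced deep learning techniques (such as vision transformers), and recent vision-language foundation models (such as CLIP), applied to tasks including crop disease detection, livestock health management, and aquatic species monitoring. The survey also covers implementation challenges such as data variability and experimental aspects (datasets, evaluation metrics, and geographical focus), and outlines open research directions: multimodal data integration, efficient edge-device deployment, and domain-adaptable AI models for diverse farming environments.
Link: https://arxiv.org/abs/2507.22101
Authors: Umair Nawaz, Muhammad Zaigham Zaheer, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Affiliations: MBZ University of AI; Australian National University; Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: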
Abstract:Crops, fisheries and livestock form the backbone of global food production, essential to feed the ever-growing global population. However, these sectors face considerable challenges, including climate variability, resource limitations, and the need for sustainable management. Addressing these issues requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey presents a systematic and thorough review of more than 200 research works covering conventional machine learning approaches, advanced deep learning techniques (e.g., vision transformers), and recent vision-language foundation models (e.g., CLIP) in the agriculture domain, focusing on diverse tasks such as crop disease detection, livestock health management, and aquatic species monitoring. We further cover major implementation challenges such as data variability and experimental aspects: datasets, performance evaluation metrics, and geographical focus. We finish the survey by discussing potential open research directions emphasizing the need for multimodal data integration, efficient edge-device deployment, and domain-adaptable AI models for diverse farming environments. Rapid growth of evolving developments in this field can be actively tracked on our project page: this https URL
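[CV-72] Trade-offs in Image Generation: How Do Different Dimensions Interact? ICCV2025
[Quick read]: This paper addresses the lack of systematic, quantitative analysis of the complex trade-offs that generative models show in text-to-image (T2I) and image-to-image (I2I) tasks across dimensions such as realism, originality, aesthetics, content, relation, style, knowledge, ambiguity, toxicity, and bias. The key to the solution is the TRIG-Bench benchmark and the TRIGScore metric. TRIG-Bench spans 10 dimensions, contains 40,200 samples, and covers 132 pairwise dimensional subsets, supporting fine-grained evaluation. TRIGScore is a VLM-as-judge metric that adapts automatically to different dimensions. Combined with a Relation Recognition System, these produce a Dimension Trade-off Map (DTM) that visualizes each model's trade-offs between dimensions, and fine-tuning guided by the DTM can mitigate dimension-specific weaknesses and improve overall performance.
Link: https://arxiv.org/abs/2507.22100
Authors: Sicheng Zhang, Binzhu Xie, Zhonghao Yan, Yuli Zhang, Donghao Zhou, Xiaofei Chen, Shi Qiu, Jiaqi Liu, Guoyang Xie, Zhichao Lu
Affiliations: Khalifa University; The Chinese University of Hong Kong; Queen Mary University of London; Xi’an Jiaotong-Liverpool University; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICCV 2025, Codebase: this https URL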
Abstract:Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models’ complex trade-offs among these dimensions have rarely been explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To bridge this gap, we introduce TRIG-Bench (Trade-offs in Image Generation), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity, and Bias), contains 40,200 samples, and covers 132 pairwise dimensional subsets. Furthermore, we develop TRIGScore, a VLM-as-judge metric that automatically adapts to various dimensions. Based on TRIG-Bench and TRIGScore, we evaluate 14 models across T2I and I2I tasks. In addition, we propose the Relation Recognition System to generate the Dimension Trade-off Map (DTM) that visualizes the trade-offs among model-specific capabilities. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, we show that the model’s dimension-specific weaknesses can be mitigated through fine-tuning on DTM to enhance overall performance. Code is available at: this https URL
[CV-73] Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go?
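[Quick read]: This paper addresses "physics failures" in software built on Physics Engines (PEs), that is, deviations from expected physical behavior that can compromise software reliability, degrade user experience, and cause serious consequences in safety-critical systems such as autonomous vehicles or medical robotics. Current testing approaches are limited: they typically require white-box access and focus on crash detection, so they miss semantically complex physics failures. The paper's contribution is the first large-scale empirical study of the problem: it builds a taxonomy of physics failure manifestations, evaluates detection techniques including deep learning, prompt-based techniques, and large multimodal models, and draws actionable insights from developer experience for improving detection approaches.
Link: https://arxiv.org/abs/2507.22099
Authors: Shuqing Li, Qiang Chen, Xiaoxue Ren, Michael R. Lyu
Affiliations: The Chinese University of Hong Kong; State Key Laboratory of Blockchain and Data Security, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Software Engineering (cs.SE)
Comments: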
Abstract:Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. Despite their importance, PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles or medical robotics. Current testing approaches for PE-based software are inadequate, typically requiring white-box access and focusing on crash detection rather than semantically complex physics failures. This paper presents the first large-scale empirical study characterizing physics failures in PE-based software. We investigate three research questions addressing the manifestations of physics failures, the effectiveness of detection techniques, and developer perceptions of current detection practices. Our contributions include: (1) a taxonomy of physics failure manifestations; (2) a comprehensive evaluation of detection methods including deep learning, prompt-based techniques, and large multimodal models; and (3) actionable insights from developer experiences for improving detection approaches. To support future research, we release PhysiXFails, code, and other materials at this https URL.
[CV-74] A Dual-Feature Extractor Framework for Accurate Back Depth and Spine Morphology Estimation from Monocular RGB Images
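[Quick read]: This paper addresses the radiation exposure and limited accessibility in remote areas that come with X-ray-based assessment of Adolescent Idiopathic Scoliosis (AIS), as well as the weak stability and generalization of RGB-image models under changing lighting conditions. The key to the solution is a new pipeline that combines depth estimation with spine morphology estimation. The authors design the Grid-Aware Multiscale Adaptive Network (GAMA-Net), which uses dual encoders to extract patch-level and global features, lets them interact through a Patch-Based Hybrid Attention (PBHA) module, and dynamically fuses multiscale information in the decoder through an Adaptive Multiscale Feature Fusion (AMFF) module to estimate the depth of the unclothed back surface accurately. Depth and surface information are then integrated to generate the spine curve, improving the accuracy of spine morphology estimation, with reported performance of up to 97%.
Link: https://arxiv.org/abs/2507.22691
Authors: Yuxin Wei, Yue Zhang, Moxin Zhao, Chang Shi, Jason P.Y. Cheung, Teng Zhang, Nan Meng
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: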
Abstract:Scoliosis is a prevalent condition that impacts both physical health and appearance, with adolescent idiopathic scoliosis (AIS) being the most common form. Currently, the main AIS assessment tool, X-rays, poses significant limitations, including radiation exposure and limited accessibility in poor and remote areas. To address this problem, the current solutions are using RGB images to analyze spine morphology. However, RGB images are highly susceptible to environmental factors, such as lighting conditions, compromising model stability and generalizability. Therefore, in this study, we propose a novel pipeline to accurately estimate the depth information of the unclothed back, compensating for the limitations of 2D information, and then estimate spine morphology by integrating both depth and surface information. To capture the subtle depth variations of the back surface with precision, we design an adaptive multiscale feature learning network named Grid-Aware Multiscale Adaptive Network (GAMA-Net). This model uses dual encoders to extract both patch-level and global features, which are then interacted by the Patch-Based Hybrid Attention (PBHA) module. The Adaptive Multiscale Feature Fusion (AMFF) module is used to dynamically fuse information in the decoder. As a result, our depth estimation model achieves remarkable accuracy across three different evaluation metrics, with scores of nearly 78.2%, 93.6%, and 97.5%, respectively. To further validate the effectiveness of the predicted depth, we integrate both surface and depth information for spine morphology estimation. This integrated approach enhances the accuracy of spine curve generation, achieving an impressive performance of up to 97%.
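[CV-75] trAIce3D: A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images
[Quick read]: This paper addresses the difficulty of accurately segmenting microglia morphology, in particular separating the soma from its branches, in large-scale 3D microscopy images. Existing methods handle overlapping cells and noisy images poorly and usually need substantial manual intervention or hyperparameter tuning for each new dataset. The key to the solution is trAIce3D, a two-stage deep learning architecture. In the first stage, a 3D U-Net with a vision transformer encoder, applied with a sliding-window strategy, performs self-supervised soma segmentation. In the second stage, the same architecture, with cross-attention blocks added to the skip connections, takes the soma coordinates as a prompt and a 3D window around the target cell as input to refine each soma and its branches. Training proceeds in phases and reuses the pre-trained weights from the first phase, which improves segmentation accuracy and generalization and makes the method suitable for large-scale analysis of complex cell morphologies.
Link: https://arxiv.org/abs/2507.22635
Authors: MohammadAmin Alamalhoda,Arsalan Firoozi,Alessandro Venturino,Sandra Siegert
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 10 pages, 2 figures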
Abstract:The shape of a cell contains essential information about its function within the biological system. Segmenting these structures from large-scale 3D microscopy images is challenging, limiting clinical insights especially for microglia, immune-associated cells involved in neurodegenerative diseases. Existing segmentation methods mainly focus on cell bodies, struggle with overlapping structures, perform poorly on noisy images, require hyperparameter tuning for each new dataset, or rely on tedious semi-automated approaches. We introduce trAIce3D, a deep-learning architecture designed for precise microglia segmentation, capturing both somas and branches. It employs a two-stage approach: first, a 3D U-Net with vision transformers in the encoder detects somas using a sliding-window technique to cover the entire image. Then, the same architecture, enhanced with cross-attention blocks in skip connections, refines each soma and its branches by using soma coordinates as a prompt and a 3D window around the target cell as input. Training occurs in two phases: self-supervised Soma Segmentation, followed by prompt-based Branch Segmentation, leveraging pre-trained weights from the first phase. Trained and evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly improves segmentation accuracy and generalization, enabling scalable analysis of complex cellular morphologies. While optimized for microglia, its architecture can extend to other intricate cell types, such as neurons and astrocytes, broadening its impact on neurobiological research.
[CV-76] Exploration of Low-Cost but Accurate Radar-Based Human Motion Direction Determination
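[Quick read]: This paper addresses the difficulty of achieving feature augmentation and motion direction determination at the same time in radar-based human motion direction determination (HMDD); existing Doppler-Time Map (DTM)-based methods still fall short on both. The key to the solution is to first generate DTMs of human gait from radar and augment their features with a feature linking model, and then carry out HMDD with a lightweight, fast Vision Transformer-Convolutional Neural Network (ViT-CNN) hybrid structure, keeping accuracy while improving computational efficiency.
Link: https://arxiv.org/abs/2507.22567
Authors: Weicheng Gao
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Notes: 5 pages, 5 figures, 2 tables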
Abstract:This work is completed on a whim after discussions with my junior colleague. The motion direction angle affects the micro-Doppler spectrum width, thus determining the human motion direction can provide important prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and motion determination simultaneously. In response, a low-cost but accurate radar-based human motion direction determination (HMDD) method is explored in this paper. In detail, the radar-based human gait DTMs are first generated, and then the feature augmentation is achieved using feature linking model. Subsequently, the HMDD is implemented through a lightweight and fast Vision Transformer-Convolutional Neural Network hybrid model structure. The effectiveness of the proposed method is verified through open-source dataset. The open-source code of this work is released at: this https URL.
[CV-77] Learned Off-aperture Encoding for Wide Field-of-view RGBD Imaging
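[Quick read]: This paper addresses the difficulty end-to-end (E2E) imaging systems have in maintaining high image fidelity at a wide field of view (FoV); the core challenges are high computational complexity and accurately modeling off-axis wave propagation together with off-axis aberrations. Conventional designs place the encoding element in the aperture or pupil plane, which gives only global control of the wavefront and limits image quality. The key to the solution is an additional design choice: positioning the diffractive optical element (DOE) off-aperture, which spatially unmixes the degrees of freedom and gives local control of the wavefront across the image plane. The approach also links differentiable ray and wave optics modeling to build hybrid refractive-diffractive optical systems, optimizing depth imaging quality and demonstrating system versatility. In experiments the method improves PSNR by more than 5 dB at a FoV of about 45°, recovers color and depth at a FoV of nearly 28°, and is validated with physical prototypes.
Link: https://arxiv.org/abs/2507.22523
Authors: Haoyu Wei,Xin Liu,Yuhui Liu,Qiang Fu,Wolfgang Heidrich,Edmund Y. Lam,Yifan Peng
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: To be published in IEEE Transactions on Pattern Analysis and Machine Intelligence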
Abstract:End-to-end (E2E) designed imaging systems integrate coded optical designs with decoding algorithms to enhance imaging fidelity for diverse visual tasks. However, existing E2E designs encounter significant challenges in maintaining high image fidelity at wide fields of view, due to high computational complexity, as well as difficulties in modeling off-axis wave propagation while accounting for off-axis aberrations. In particular, the common approach of placing the encoding element into the aperture or pupil plane results in only a global control of the wavefront. To overcome these limitations, this work explores an additional design choice by positioning a DOE off-aperture, enabling a spatial unmixing of the degrees of freedom and providing local control over the wavefront over the image plane. Our approach further leverages hybrid refractive-diffractive optical systems by linking differentiable ray and wave optics modeling, thereby optimizing depth imaging quality and demonstrating system versatility. Experimental results reveal that the off-aperture DOE enhances the imaging quality by over 5 dB in PSNR at a FoV of approximately 45° when paired with a simple thin lens, outperforming traditional on-aperture systems. Furthermore, we successfully recover color and depth information at nearly 28° FoV using off-aperture DOE configurations with compound optics. Physical prototypes for both applications validate the effectiveness and versatility of the proposed method.
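For readers less familiar with the PSNR figure quoted above, here is a minimal sketch of the standard PSNR definition and of what a gain of a few dB means in terms of mean squared error. It is not the authors' evaluation code; the function names, the flattened-pixel representation, and the peak value of 1.0 are assumptions made for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumes pixel values are scaled so that `peak` is the maximum possible value.
    """
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (gain_db / 10.0)


# Toy example (hypothetical pixel values): a uniform error of 0.1 on a [0, 1] scale.
psnr_value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
ratio_5db = mse_ratio_for_gain(5.0)
```

Under this definition, a 5 dB PSNR gain at a fixed peak value corresponds to the mean squared error shrinking by a factor of about 3.16.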
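[CV-78] Towards Blind Bitstream-corrupted Video Recovery via a Visual Foundation Model-driven Framework
[Quick read]: This paper addresses recovery of bitstream-corrupted video: in multimedia communication and storage systems, slight corruption in the bitstream domain causes severe degradation in the pixel domain. Existing methods depend on time-consuming, labor-intensive manual annotation of the corrupted regions in every frame, which is impractical, and local residual information can mislead feature completion and subsequent content recovery, making high-quality reconstruction difficult. The key to the solution is the first blind bitstream-corrupted video recovery framework, built on two modules. The Detect Any Corruption (DAC) model uses the priors of a visual foundation model (VFM) together with bitstream and corruption knowledge to localize corruption and enable blind recovery without manually labeled masks. The Corruption-aware Feature Completion (CFC) module adaptively processes residual contributions based on high-level corruption understanding, combining VFM-guided hierarchical feature augmentation with a mixture-of-residual-experts (MoRE) structure to suppress artifacts while enhancing informative residuals.
Link: https://arxiv.org/abs/2507.22481
Authors: Tianyi Liu,Kejun Wu,Chen Cai,Yi Wang,Kim-Hui Yap,Lap-Pui Chau
Affiliations: Nanyang Technological University (南洋理工大学); Huazhong University of Science and Technology (华中科技大学); The Hong Kong Polytechnic University (香港理工大学)
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: 10 pages, 5 figures, accepted by ACMMM 2025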
Abstract:Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult as part of the local residual information in corrupted frames may mislead feature completion and successive content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.
[CV-79] Eyepiece-free pupil-optimized holographic near-eye displays
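[Quick read]: This paper addresses the degraded image quality and missing physiological depth cues that holographic near-eye displays (NEDs) suffer in practice because the pupil aperture is finite and varies dynamically. The key to the solution is an eyepiece-free, pupil-optimized method: a customized spherical phase modulation strategy generates multiple viewpoints within the pupil, and the amplitude and phase distributions across these viewpoints are jointly optimized. This markedly mitigates the image degradation caused by finite pupil sampling and resolves the inapparent depth cues induced by the spherical phase, a step toward compact, lightweight, and flexible holographic NED systems.
Link: https://arxiv.org/abs/2507.22420
Authors: Jie Zhou,Shuyang Xie,Yang Wu,Lei Jiang,Yimou Luo,Jun Wang
Affiliations: Unknown
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Notes: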
Abstract:Computer-generated holography (CGH) represents a transformative visualization approach for next-generation immersive virtual and augmented reality (VR/AR) displays, enabling precise wavefront modulation and naturally providing comprehensive physiological depth cues without the need for bulky optical assemblies. Despite significant advancements in computational algorithms enhancing image quality and achieving real-time generation, practical implementations of holographic near-eye displays (NEDs) continue to face substantial challenges arising from finite and dynamically varying pupil apertures, which degrade image quality and compromise user experience. In this study, we introduce an eyepiece-free pupil-optimized holographic NED. Our proposed method employs a customized spherical phase modulation strategy to generate multiple viewpoints within the pupil, entirely eliminating the dependence on conventional optical eyepieces. Through the joint optimization of amplitude and phase distributions across these viewpoints, the method markedly mitigates image degradation due to finite pupil sampling and resolves inapparent depth cues induced by the spherical phase. The demonstrated method signifies a substantial advancement toward the realization of compact, lightweight, and flexible holographic NED systems, fulfilling stringent requirements for future VR/AR display technologies.
[CV-80] Whole-brain Transferable Representations from Large-Scale fMRI Data Improve Task-Evoked Brain Activity Decoding
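[Quick read]: This paper addresses the difficulty of decoding task-evoked brain activity from functional magnetic resonance imaging (fMRI) data, which is limited by high dimensionality, low signal-to-noise ratio, and scarce within-subject data. The key to the solution is STDA-SwiFT, a transformer-based model that learns transferable representations from large-scale fMRI datasets through spatial-temporal divided attention and self-supervised contrastive learning. Using voxel-wise representations pretrained on 995 subjects from the Human Connectome Project (HCP), it substantially improves downstream decoding across multiple sensory and cognitive tasks, even with minimal data preprocessing, showing that transfer learning is a viable way to overcome the challenges of fMRI decoding.
Link: https://arxiv.org/abs/2507.22378
Authors: Yueh-Po Peng,Vincent K.M. Cheung,Li Su
Affiliations: Gamania Digital Entertainment Co., Ltd. (Gamania数字娱乐有限公司); Institute of Information Science, Academia Sinica (中央研究院资讯科学研究所); Sony Computer Science Laboratories, Inc. (索尼计算机科学实验室公司)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: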
Abstract:A fundamental challenge in neuroscience is to decode mental states from brain activity. While functional magnetic resonance imaging (fMRI) offers a non-invasive approach to capture brain-wide neural dynamics with high spatial precision, decoding from fMRI data – particularly from task-evoked activity – remains challenging due to its high dimensionality, low signal-to-noise ratio, and limited within-subject data. Here, we leverage recent advances in computer vision and propose STDA-SwiFT, a transformer-based model that learns transferable representations from large-scale fMRI datasets via spatial-temporal divided attention and self-supervised contrastive learning. Using pretrained voxel-wise representations from 995 subjects in the Human Connectome Project (HCP), we show that our model substantially improves downstream decoding performance of task-evoked activity across multiple sensory and cognitive domains, even with minimal data preprocessing. We demonstrate performance gains from larger receptive fields afforded by our memory-efficient attention mechanism, as well as the impact of functional relevance in pretraining data when fine-tuning on small samples. Our work showcases transfer learning as a viable approach to harness large-scale datasets to overcome challenges in decoding brain activity from fMRI data.
[CV-81] A Segmentation Framework for Accurate Diagnosis of Amyloid Positivity without Structural Images
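[Quick read]: This paper addresses automated brain-region segmentation and amyloid positivity classification from positron emission tomography (PET) images alone, without structural magnetic resonance imaging (MRI) or computed tomography (CT). The key to the solution is a deep learning framework based on a 3D U-Net, trained and validated on 200 F18-florbetapir amyloid-PET scans, that segments 30 brain regions and quantifies regional uptake to determine amyloid positivity. Dice similarity coefficients for segmentation range from 0.45 to 0.88, the normalized root mean square error of uptake in clinically relevant regions such as the precuneus and prefrontal cortex is as low as 0.0011, and classification reaches 98% accuracy with an area under the ROC curve (AUC) of 0.99, indicating that the method can support reliable, scalable clinical and research use without structural imaging.
Link: https://arxiv.org/abs/2507.22336
Authors: Penghan Zhu,Shurui Mei,Shushan Chen,Xiaobo Chu,Shanbo He,Ziyi Liu
Affiliations: Northwest Agricultural and Forestry Science and Technology University (西北农林科技大学); Liaoning Province Shiyan High School (辽宁省实验中学); Northeast Yucai School (东北育才学校); Northeast Yucai Foreign Language School (东北育才外国语学校)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: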
Abstract:This study proposes a deep learning-based framework for automated segmentation of brain regions and classification of amyloid positivity using positron emission tomography (PET) images alone, without the need for structural MRI or CT. A 3D U-Net architecture with four layers of depth was trained and validated on a dataset of 200 F18-florbetapir amyloid-PET scans, with a 130/20/50 train/validation/test split. Segmentation performance was evaluated using Dice similarity coefficients across 30 brain regions, with scores ranging from 0.45 to 0.88, demonstrating high anatomical accuracy, particularly in subcortical structures. Quantitative fidelity of PET uptake within clinically relevant regions (precuneus, prefrontal cortex, gyrus rectus, and lateral temporal cortex) was assessed using normalized root mean square error, achieving values as low as 0.0011. Furthermore, the model achieved a classification accuracy of 0.98 for amyloid positivity based on regional uptake quantification, with an area under the ROC curve (AUC) of 0.99. These results highlight the model’s potential for integration into PET-only diagnostic pipelines, particularly in settings where structural imaging is not available. This approach reduces dependence on coregistration and manual delineation, enabling scalable, reliable, and reproducible analysis in clinical and research applications. Future work will focus on clinical validation and extension to diverse PET tracers including C11 PiB and other F18-labeled compounds.
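As a reading aid for the Dice scores reported above, the following is a minimal sketch of how a per-region Dice similarity coefficient is commonly computed from two label maps. It is not the authors' evaluation code; the function name, the flattened-list representation, the toy labels, and the convention for a label absent from both maps are assumptions made for illustration.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int], label: int) -> float:
    """Dice similarity coefficient for one label over two flattened label maps."""
    if len(pred) != len(truth):
        raise ValueError("label maps must have the same number of voxels")
    pred_count = sum(1 for p in pred if p == label)
    truth_count = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_count + truth_count == 0:
        # Assumed convention: a label absent from both maps counts as perfect agreement.
        return 1.0
    return 2.0 * overlap / (pred_count + truth_count)


# Toy flattened label maps (hypothetical values): 0 = background, 1 and 2 = two regions.
pred = [0, 1, 1, 2, 2, 2, 0, 1]
truth = [0, 1, 1, 1, 2, 2, 0, 0]
dice_region_1 = dice_coefficient(pred, truth, label=1)
dice_region_2 = dice_coefficient(pred, truth, label=2)
```

A score of 1.0 means the predicted and reference regions coincide exactly, and 0.0 means they do not overlap at all, so the reported range of 0.45 to 0.88 spans moderate to high overlap depending on the region.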
[CV-82] Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss MICCAI
[Quick read]: This paper addresses scanner bias in computational pathology (CPath): irrelevant details introduced by different scanners change model outputs for the same tissue, which undermines clinicians' trust in CPath-based tools and hinders real-world deployment. Recent pathology Foundation Models (FMs) are expected to generalize better across domains, but the paper shows on a multi-scanner dataset that FMs still suffer from scanner bias. The key to the solution is ScanGen, a contrastive loss function applied during task-specific fine-tuning that makes the model more robust to scanner variation while retaining or improving performance on Epidermal Growth Factor Receptor (EGFR) mutation prediction.
Link: https://arxiv.org/abs/2507.22092
Authors: Gianluca Carloni,Biagio Brattoli,Seongho Keum,Jongchan Park,Taebum Lee,Chang Ho Ahn,Sergio Pereira
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)
Notes: Accepted (Oral) in MedAGI 2025 International Workshop at MICCAI Conference
Abstract:Computational pathology (CPath) has shown great potential in mining actionable insights from Whole Slide Images (WSIs). Deep Learning (DL) has been at the center of modern CPath, and while it delivers unprecedented performance, it is also known that DL may be affected by irrelevant details, such as those introduced during scanning by different commercially available scanners. This may lead to scanner bias, where the model outputs for the same tissue acquired by different scanners may vary. In turn, it hinders the trust of clinicians in CPath-based tools and their deployment in real-world clinical practices. Recent pathology Foundation Models (FMs) promise to provide better domain generalization capabilities. In this paper, we benchmark FMs using a multi-scanner dataset and show that FMs still suffer from scanner bias. Following this observation, we propose ScanGen, a contrastive loss function applied during task-specific fine-tuning that mitigates scanner bias, thereby enhancing the models’ robustness to scanner variations. Our approach is applied to the Multiple Instance Learning task of Epidermal Growth Factor Receptor (EGFR) mutation prediction from H&E-stained WSIs in lung cancer. We observe that ScanGen notably enhances the ability to generalize across scanners, while retaining or improving the performance of EGFR mutation prediction.
Artificial Intelligence
[AI-0] Automatically discovering heuristics in a complex SAT solver with large language models
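[Quick read]: This paper addresses the difficulty of optimizing modern satisfiability (SAT) solvers in real-world settings: solver architectures are intricate, and conventional automatic configuration frameworks rely on manually constrained search spaces and deliver limited gains. The key to the solution is a new paradigm based on Large Language Models (LLMs), resting on three techniques: (1) guidelines for a modularized, LLM-friendly solver that emphasize code simplification, information sharing, and bug reduction; (2) an unsupervised automatic prompt optimization method that increases the diversity of LLM output; and (3) an efficient search that combines a presearch strategy with an evolutionary algorithm (EA) to discover high-quality heuristics. In experiments the resulting tool, AutoModSAT, improves on the baseline solver by 50%, outperforms state-of-the-art (SOTA) solvers by 30%, and is on average 20% faster than parameter-tuned SOTA alternatives.
Link: https://arxiv.org/abs/2507.22876
Authors: Yiwen Sun,Furong Ye,Zhihan Chen,Ke Wei,Shaowei Cai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Notes: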
Abstract:The satisfiability problem (SAT) is a cornerstone of computational complexity with broad industrial applications, and it remains challenging to optimize modern SAT solvers in real-world settings due to their intricate architectures. While automatic configuration frameworks have been developed, they rely on manually constrained search spaces and yield limited performance gains. This work introduces a novel paradigm which effectively optimizes complex SAT solvers via Large Language Models (LLMs), and a tool called AutoModSAT is developed. Three fundamental challenges are addressed in order to achieve superior performance: (1) LLM-friendly solver: Systematic guidelines are proposed for developing a modularized solver to meet LLMs’ compatibility, emphasizing code simplification, information sharing and bug reduction; (2) Automatic prompt optimization: An unsupervised automatic prompt optimization method is introduced to advance the diversity of LLMs’ output; (3) Efficient search strategy: We design a presearch strategy and an evolutionary algorithm (EA) for the final efficient and effective discovery of heuristics. Extensive experiments across a wide range of datasets demonstrate that AutoModSAT achieves 50% performance improvement over the baseline solver and achieves 30% superiority against the state-of-the-art (SOTA) solvers. Moreover, AutoModSAT attains a 20% speedup on average compared to parameter-tuned alternatives of the SOTA solvers, showcasing the enhanced capability in handling complex problem instances. This work bridges the gap between AI-driven heuristics discovery and mission-critical system optimization, and provides both methodological advancements and empirically validated results for next-generation complex solver development.
[AI-1] A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model
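[Quick read]: This paper addresses online learning in finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs), where the core challenge is to learn efficiently while keeping cumulative regret low. The key to the solution is a hybrid exploration-generative reinforcement learning (RL) model in which the agent can, from time to time, sample the environment freely through a "simulator", which avoids strategies such as "optimism in the face of uncertainty" and "posterior sampling". Within this model the paper uses known classical and new quantum algorithms to approximate optimal policies and applies those policies directly, yielding better regret bounds. For finite-horizon MDPs, the quantum algorithms achieve regret that depends only logarithmically on the number of time steps $ T $, breaking the classical $ O(\sqrt{T}) $ barrier. For infinite-horizon MDPs, the classical and quantum bounds keep the $ O(\sqrt{T}) $ dependence but improve the dependence on the state space size $ S $ and action space size $ A $; the paper also defines a new regret measure under which the quantum algorithms achieve $ \operatorname{poly}\log T $ regret, exponentially better than classical algorithms.
Link: https://arxiv.org/abs/2507.22854
Authors: Andris Ambainis,Joao F. Doriguello,Debbie Lim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Notes: 57 pages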
Abstract:We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning (RL) model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a “simulator”. By employing known classical and new quantum algorithms for approximating optimal policies under a generative model within our learning algorithms, we show that it is possible to avoid several paradigms from RL like “optimism in the face of uncertainty” and “posterior sampling” and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works. For finite-horizon MDPs, our quantum algorithms obtain regret bounds which only depend logarithmically on the number of time steps $ T $, thus breaking the $ O(\sqrt{T}) $ classical barrier. This matches the time dependence of the prior quantum works of Ganguly et al. (arXiv’23) and Zhong et al. (ICML’24), but with improved dependence on other parameters like state space size $ S $ and action space size $ A $. For infinite-horizon MDPs, our classical and quantum bounds still maintain the $ O(\sqrt{T}) $ dependence but with better $ S $ and $ A $ factors. Nonetheless, we propose a novel measure of regret for infinite-horizon MDPs with respect to which our quantum algorithms have $ \operatorname{poly}\log T $ regret, exponentially better compared to classical algorithms. Finally, we generalise all of our results to compact state spaces.
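To make the comparison between square-root and polylogarithmic regret concrete, the sketch below writes out one common textbook definition of cumulative regret for an episodic finite-horizon MDP, together with the per-step regret each rate implies. The symbols $K$ (number of episodes), $H$ (horizon), $V_1^{*}$, $\pi_k$ and $s_1^{k}$ are assumptions introduced here for illustration; the paper's own definitions, in particular its new infinite-horizon regret measure, may differ.

```latex
\[
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \Bigr),
\qquad T = KH,
\]
\[
\mathrm{Regret}(T) = O(\sqrt{T}) \;\Longrightarrow\; \frac{\mathrm{Regret}(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right),
\qquad
\mathrm{Regret}(T) = O(\operatorname{poly}\log T) \;\Longrightarrow\; \frac{\mathrm{Regret}(T)}{T} = O\!\left(\frac{\operatorname{poly}\log T}{T}\right).
\]
```

In both cases the average per-step regret vanishes as $T$ grows, but it does so much faster under a polylogarithmic bound.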
[AI-2] Repair-R1: Better Test Before Repair
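[Quick read]: This paper addresses two problems in existing Automated Program Repair (APR) methods based on Large Language Models (LLMs): test cases are used only at inference time and do not contribute to training, and repair is performed before testing, ignoring the option of testing first to locate the defect. The proposed Repair-R1 framework brings test cases into the training phase and reorders the workflow so that the model first generates discriminative test cases and then repairs the program based on them. This helps the model locate defects and understand their causes, improving repair effectiveness. Experiments on several benchmarks show that, compared with vanilla models, Repair-R1 raises repair success rate by 2.68% to 48.29% and test generation success rate by 16.38% to 53.28%.
Link: https://arxiv.org/abs/2507.22853
Authors: Haichuan Hu,Xiaochen Xie,Quanjun Zhang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: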
Abstract:APR (Automated Program Repair) aims to automatically locate program defects, generate patches and validate the repairs. Existing techniques for APR are often combined with LLMs (Large Language Models), leveraging the code-related knowledge of LLMs to improve repair effectiveness. Current LLM-based APR methods typically utilize test cases only during the inference stage, adopting an iterative approach that performs repair first and validates it through test execution afterward. This conventional paradigm neglects two important aspects: the potential contribution of test cases in the training phase, and the possibility of leveraging testing prior to repair. To address this, we propose Repair-R1, which introduces test cases into the model’s training phase and shifts test generation to precede repair. The model is required to first generate discriminative test cases that can distinguish defective behaviors, and then perform repair based on these tests. This enables the model to better locate defects and understand the underlying causes of defects, thereby improving repair effectiveness. We implement Repair-R1 with three different backbone models, using RL (reinforcement learning) to co-optimize test generation and bug repair. Experimental results on four widely adopted benchmarks demonstrate the superiority of Repair-R1. Specifically, compared to vanilla models, Repair-R1 improves repair success rate by 2.68% to 48.29%, test generation success rate by 16.38% to 53.28%, and test coverage by 0.78% to 53.96%. We publish the code and weights at this https URL and this https URL.
[AI-3] RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
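[Quick read]: This paper addresses "inefficient exploration" in reinforcement learning (RL) for complex, long-horizon tasks: RL methods that optimize only for final task success tend to reinforce flawed or inefficient reasoning paths, leaving agents brittle and poor at generalizing. The key to the solution is RLVMR, a framework that brings dense, process-level supervision into end-to-end RL by rewarding verifiable meta-reasoning behaviors. The agent explicitly tags its cognitive steps (such as planning, exploration, and reflection) and receives programmatic, rule-based rewards for them. These process-centric rewards are combined with the final outcome signal and optimized with a critic-free policy gradient method, which improves success rate and reasoning efficiency on the ALFWorld and ScienceWorld benchmarks.
Link: https://arxiv.org/abs/2507.22844
Authors: Zijing Zhang,Ziyang Chen,Mingxiao Li,Zhaopeng Tu,Xiaolong Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: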
Abstract:The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results, with our 7B model reaching an 83.6% success rate on the most difficult unseen task split. Our analysis confirms these gains stem from improved reasoning quality, including significant reductions in redundant actions and enhanced error recovery, leading to more robust, efficient, and interpretable agents.
[AI-4] G-Core: A Simple Scalable and Balanced RLHF Trainer
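[Quick read]: This paper addresses the challenges that current Reinforcement Learning from Human Feedback (RLHF) training systems face when scaling to multi-modal and diffusion workflows: limited controller scalability, inflexible resource placement, and inefficient orchestration of complex RLHF pipelines, especially with dynamic sampling or generative reward modeling. The key to the solution is the G-Core framework. It introduces a parallel controller programming model that avoids the bottleneck of a single centralized controller and orchestrates complex RLHF workflows flexibly and efficiently. It also uses a dynamic placement scheme that adaptively partitions resources and schedules workloads as training load changes, reducing hardware idle time and improving utilization. The paper reports effectiveness and robustness in real-world, large-scale deployment.
Link: https://arxiv.org/abs/2507.22789
Authors: Junyu Wu,Weiming Chang,Xiaotao Liu,Guanyou He,Haoqiang Hong,Boqi Liu,Hongtao Tian,Tao Yang,Yunsheng Shi,Feng Lin,Ting Yao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: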
Abstract:Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present G-Core, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.
[AI-5] Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies
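[Quick read]: This paper addresses inefficient cooperation, exponential growth of the joint-action space, and homogeneous agent roles in Multi-Agent Reinforcement Learning (MARL). The key to the solution is the Team-Attention-Actor-Critic (TAAC) algorithm. It adopts a Centralized Training/Centralized Execution scheme and places multi-headed attention mechanisms in both the actor and the critic, so that agents can dynamically query their teammates and communicate and collaborate efficiently. It also adds a penalized loss function that encourages diverse yet complementary roles among agents, improving overall team performance.
Link: https://arxiv.org/abs/2507.22782
Authors: Hugo Garrido-Lestache,Jeremy Kedziora
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 8 pages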
Abstract:This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).
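[AI-6] ASP-FZN: A Translation-based Constraint Answer Set Solver
[Quick Read]: This paper addresses limited solving efficiency and flexibility in Constraint Answer Set Programming (CASP), particularly for logic programs with linear constraints. The key to its solution is a solver named asp-fzn, which automatically translates CASP programs into the solver-independent FlatZinc language so that several mature Constraint Programming (CP) and Integer Programming (IP) backend solvers can be used. The approach supports a rich language of linear constraints, including some common global constraints. On benchmarks from past ASP competitions it is competitive with state-of-the-art ASP solvers, and on some CASP benchmarks it outperforms clingcon, a prominent CASP solver.
Link: https://arxiv.org/abs/2507.22774
Authors: Thomas Eiter, Tobias Geibinger, Tobias Kaminski, Nysret Musliu, Johannes Oetsch
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at the 41st International Conference on Logic Programming (ICLP 2025)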
Abstract:We present the solver asp-fzn for Constraint Answer Set Programming (CASP), which extends ASP with linear constraints. Our approach is based on translating CASP programs into the solver-independent FlatZinc language that supports several Constraint Programming and Integer Programming backend solvers. Our solver supports a rich language of linear constraints, including some common global constraints. As for evaluation, we show that asp-fzn is competitive with state-of-the-art ASP solvers on benchmarks taken from past ASP competitions. Furthermore, we evaluate it on several CASP problems from the literature and compare its performance with clingcon, which is a prominent CASP solver that supports most of the asp-fzn language. The performance of asp-fzn is very promising as it is already competitive on plain ASP and even outperforms clingcon on some CASP benchmarks.
[AI-7] Empirical Evaluation of Concept Drift in ML-Based Android Malware Detection
[Quick Read]: This paper addresses the performance degradation of machine learning-based Android malware detection models under concept drift, where rapidly evolving malware characteristics weaken model effectiveness. Its key contribution is a systematic evaluation of how feature types (static, dynamic, hybrid, semantic, and image-based), data environments, and detection methods influence concept drift. It finds that balancing algorithms help with class imbalance but do not fully address concept drift, which stems primarily from the dynamic nature of the malware landscape. Large Language Models (LLMs) with few-shot learning show promising detection performance but do not fully mitigate concept drift either, which points to the need for further investigation.
Link: https://arxiv.org/abs/2507.22772
Authors: Ahmed Sabbah, Radi Jarrar, Samer Zein, David Mohaisen
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 12 tables, 14 figures, paper under review
Abstract:Despite outstanding results, machine learning-based Android malware detection models struggle with concept drift, where rapidly evolving malware characteristics degrade model effectiveness. This study examines the impact of concept drift on Android malware detection, evaluating two datasets and nine machine learning and deep learning algorithms, as well as Large Language Models (LLMs). Various feature types (static, dynamic, hybrid, semantic, and image-based) were considered. The results showed that concept drift is widespread and significantly affects model performance. Factors influencing the drift include feature types, data environments, and detection methods. Balancing algorithms helped with class imbalance but did not fully address concept drift, which primarily stems from the dynamic nature of the malware landscape. No strong link was found between the type of algorithm used and concept drift; the impact was relatively minor compared to other variables, since hyperparameters were not fine-tuned and the default algorithm configurations were used. While LLMs using few-shot learning demonstrated promising detection performance, they did not fully mitigate concept drift, highlighting the need for further investigation.
[AI-8] Teaching the Teacher: Improving Neural Network Distillability for Symbolic Regression via Jacobian Regularization
[Quick Read]: This paper addresses the difficulty of extracting high-fidelity symbolic formulas from complex neural networks for trustworthy, interpretable AI. In conventional distillation, the complex functions learned by the teacher network are poor targets for symbolic discovery, so the student model fits poorly. The key to its solution is a Jacobian-based regularizer that, while the teacher is being trained, actively steers it toward smoother functions that are easier to distill, which substantially improves the final symbolic model. On a suite of real-world regression benchmarks, with the regularization strength optimized per problem, the method improves the R^2 score by an average of 120% (relative) while maintaining the teacher's predictive accuracy.
Link: https://arxiv.org/abs/2507.22767
Authors: Soumyadeep Dhar, Kei Sen Fong, Mehul Motani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Distilling large neural networks into simple, human-readable symbolic formulas is a promising path toward trustworthy and interpretable AI. However, this process is often brittle, as the complex functions learned by standard networks are poor targets for symbolic discovery, resulting in low-fidelity student models. In this work, we propose a novel training paradigm to address this challenge. Instead of passively distilling a pre-trained network, we introduce a Jacobian-based regularizer that actively encourages the "teacher" network to learn functions that are not only accurate but also inherently smoother and more amenable to distillation. We demonstrate through extensive experiments on a suite of real-world regression benchmarks that our method is highly effective. By optimizing the regularization strength for each problem, we improve the R^2 score of the final distilled symbolic model by an average of 120% (relative) compared to the standard distillation pipeline, all while maintaining the teacher’s predictive accuracy. Our work presents a practical and principled method for significantly improving the fidelity of interpretable models extracted from complex neural networks.
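To make the idea of a Jacobian-based regularizer concrete, here is a minimal sketch. The abstract does not state the exact form of the penalty, so this example assumes the squared Frobenius norm of the input-output Jacobian, estimated with central finite differences. The names `jacobian_penalty`, `regularized_loss`, `lam`, and the toy `teacher` function are all hypothetical.

```python
import math

def jacobian_penalty(f, x, eps=1e-5):
    """Squared Frobenius norm of the Jacobian of f at x (central differences)."""
    total = 0.0
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        for a, b in zip(f(hi), f(lo)):
            total += ((a - b) / (2 * eps)) ** 2
    return total

def regularized_loss(f, xs, ys, lam):
    """Mean squared error plus lam times the mean Jacobian penalty.

    A larger lam pushes the teacher toward smoother functions
    (assumed form of the regularized objective).
    """
    n = len(xs)
    mse = sum((f(x)[0] - y) ** 2 for x, y in zip(xs, ys)) / n
    penalty = sum(jacobian_penalty(f, x) for x in xs) / n
    return mse + lam * penalty

def teacher(x):
    # Hypothetical stand-in for a trained network with a single output.
    return [2.0 * x[0] + math.sin(x[1])]

# At the origin the gradient is (2, cos 0) = (2, 1), so the penalty is about 5.
penalty_at_origin = jacobian_penalty(teacher, [0.0, 0.0])
```

In a real training loop the penalty would be computed with automatic differentiation rather than finite differences; the finite-difference version is used here only to keep the sketch dependency-free.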
[AI-9] Bayesian Optimization of Process Parameters of a Sensor-Based Sorting System using Gaussian Processes as Surrogate Models
[Quick Read]: This paper addresses the difficulty of optimizing and continuously re-adjusting the process parameters of a sensor-based sorting system when material stream properties and process requirements change. The key to its solution is an approach based on Bayesian Optimization that uses Gaussian process regression models as surrogate models of the system's behavior. It minimizes the number of experiments needed while considering two optimization targets, derived from the requirements for both material output streams, and it accounts for uncertainties in the determined sorting accuracies within the model calculation.
Link: https://arxiv.org/abs/2507.22766
Authors: Felix Kronenwett, Georg Maier, Thomas Laengle
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted at the 30th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)
Abstract:Sensor-based sorting systems enable the physical separation of a material stream into two fractions. The sorting decision is based on the image data evaluation of the sensors used and is carried out using actuators. Various process parameters must be set depending on the properties of the material stream, the dimensioning of the system, and the required sorting accuracy. However, continuous verification and re-adjustment are necessary due to changing requirements and material stream compositions. In this paper, we introduce an approach for optimizing, recurrently monitoring and adjusting the process parameters of a sensor-based sorting system. Based on Bayesian Optimization, Gaussian process regression models are used as surrogate models to achieve specific requirements for system behavior with the uncertainties contained therein. This method minimizes the number of necessary experiments while simultaneously considering two possible optimization targets based on the requirements for both material output streams. In addition, uncertainties are considered when determining sorting accuracies in the model calculation. We evaluated the method with three example process parameters.
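[AI-10] Of Good Demons and Bad Angels: Guaranteeing Safe Control under Finite Precision
[Quick Read]: This paper addresses the difficulty of guaranteeing infinite-time horizon safety for safety-critical neural network-controlled cyber-physical systems (NNCSs) once they are implemented with finite precision. Existing verification approaches based on differential dynamic logic (dL) rely on idealized, real-valued neural network semantics and ignore the roundoff errors that finite precision introduces in sensing, actuation, and computation, so the theoretical guarantees may not carry over to the deployed system. The key to its solution is to model robustness under finite-precision perturbations as a hybrid game between a good Demon, responsible for control actions, and a bad Angel, introducing perturbations. This enables formal proofs of robustness with respect to a given bounded perturbation. Combined with state-of-the-art mixed-precision fixed-point tuners, it yields neural network implementations that are both sound and efficient, providing an end-to-end infinite-time horizon safety guarantee.
Link: https://arxiv.org/abs/2507.22760
Authors: Samuel Teuber, Debasmita Lohar, Bernhard Beckert
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 15 pages, 3 figures, 1 table; Accepted at FMCAD 2025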
Abstract:As neural networks (NNs) become increasingly prevalent in safety-critical neural network-controlled cyber-physical systems (NNCSs), formally guaranteeing their safety becomes crucial. For these systems, safety must be ensured throughout their entire operation, necessitating infinite-time horizon verification. To verify the infinite-time horizon safety of NNCSs, recent approaches leverage Differential Dynamic Logic (dL). However, these dL-based guarantees rely on idealized, real-valued NN semantics and fail to account for roundoff errors introduced by finite-precision implementations. This paper bridges the gap between theoretical guarantees and real-world implementations by incorporating robustness under finite-precision perturbations – in sensing, actuation, and computation – into the safety verification. We model the problem as a hybrid game between a good Demon, responsible for control actions, and a bad Angel, introducing perturbations. This formulation enables formal proofs of robustness w.r.t. a given (bounded) perturbation. Leveraging this bound, we employ state-of-the-art mixed-precision fixed-point tuners to synthesize sound and efficient implementations, thus providing a complete end-to-end solution. We evaluate our approach on case studies from the automotive and aeronautics domains, producing efficient NN implementations with rigorous infinite-time horizon safety guarantees.
[AI-11] OFCnetLLM: Large Language Model for Network Monitoring and Alertness
[Quick Read]: This paper addresses the high cost of managing large-scale network monitoring data, inefficient query finding, and limited automation of anomaly detection and root-cause analysis. Its core solution uses Large Language Models (LLMs) to make network monitoring more intelligent: a multi-agent approach supports anomaly detection, automated root-cause analysis, and incident analysis, reducing the need for manual intervention. Using the authors' own OFCnetLLM model as an example, the paper demonstrates practical applications in the OFC conference network and presents early results.
Link: https://arxiv.org/abs/2507.22711
Authors: Hong-Jun Yoon, Mariam Kiran, Danial Ebling, Joe Breen
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid evolution of network infrastructure is bringing new challenges and opportunities for efficient network management, optimization, and security. With very large monitoring databases becoming expensive to explore, the use of AI and Generative AI can help reduce costs of managing these datasets. This paper explores the use of Large Language Models (LLMs) to revolutionize network monitoring management by addressing the limitations of query finding and pattern analysis. We leverage LLMs to enhance anomaly detection, automate root-cause analysis, and automate incident analysis to build a well-monitored network management team using AI. Through a real-world example of developing our own OFCnetLLM, based on an open-source LLM, we demonstrate practical applications of OFCnetLLM in the OFC conference network. Our model is developed as a multi-agent approach and is still evolving, and we present early results here.
[AI-12] Bifröst: Spatial Networking with Bigraphs
[Quick Read]: This paper addresses the lack of a coherent spatial representation in modern networked environments, which leaves tasks such as enforcing spatial access policies fragile and manual. Its core solution is a unifying representation based on bigraphs that captures spatial, social, and communication relationships within a single formalism, together with user-facing tools that generate bigraphs from physical environments. It also presents a hierarchical agent architecture for distributed spatial reasoning, with runtimes that let agentic processes interact with the spatial representation and a context-aware execution model that scopes reasoning to the smallest viable subspace. Together these enable private, reliable, and low-latency spatial networking that can safely interact with agentic workflows.
Link: https://arxiv.org/abs/2507.22687
Authors: Josh Millar, Ryan Gibb, Roy Ang, Anil Madhavapeddy, Hamed Haddadi
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: Submitted to HotNets 2025
Abstract:Modern networked environments increasingly rely on spatial reasoning, but lack a coherent representation for coordinating physical space. Consequently, tasks such as enforcing spatial access policies remain fragile and manual. We first propose a unifying representation based on bigraphs, capturing spatial, social, and communication relationships within a single formalism, with user-facing tools to generate bigraphs from physical environments. Second, we present a hierarchical agent architecture for distributed spatial reasoning, with runtimes for agentic processes to interact with the spatial representation, and a context-aware execution model that scopes reasoning to the smallest viable subspace. Together, these enable private, reliable, and low-latency spatial networking that can safely interact with agentic workflows.
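[AI-13] Designing for Self-Regulation in Informal Programming Learning: Insights from a Storytelling-Centric Approach
[Quick Read]: This paper addresses the isolation, information overload, and limited guidance that people face when they learn programming on their own from online resources. The core challenge is supporting learners' self-regulation in unstructured online learning. The key to its solution is a system consisting of a web platform and browser extensions that integrates resource curation, reflective practice, and narrative construction into "learning stories" with AI-generated feedback. The design adds learner-defined structure to otherwise unstructured experiences and builds on self-regulation practices that learners already use on social media, such as marking achievements and setting milestones.
Link: https://arxiv.org/abs/2507.22671
Authors: Sami Saeed Alghamdi, Christopher Bull, Ahmed Kharrufa
Affiliations: Open Lab, School of Computing, Newcastle University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: 10 pages, 9 figures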
Abstract:Many people learn programming independently from online resources and often report struggles in achieving their personal learning goals. Learners frequently describe their experiences as isolating and frustrating, challenged by abundant uncertainties, information overload, and distraction, compounded by limited guidance. At the same time, social media serves as a personal space where many engage in diverse self-regulation practices, including help-seeking, using external memory aids (e.g., self-notes), self-reflection, emotion regulation, and self-motivation. For instance, learners often mark achievements and set milestones through their posts. In response, we developed a system consisting of a web platform and browser extensions to support self-regulation online. The design aims to add learner-defined structure to otherwise unstructured experiences and bring meaning to curation and reflection activities by translating them into learning stories with AI-generated feedback. We position storytelling as an integrative approach to design that connects resource curation, reflective and sensemaking practice, and narrative practices learners already use across social platforms. We recruited 15 informal programming learners who are regular social media users to engage with the system in a self-paced manner; participation concluded upon submitting a learning story and survey. We used three quantitative scales and a qualitative survey to examine users’ characteristics and perceptions of the system’s support for their self-regulation. User feedback suggests the system’s viability as a self-regulation aid. Learners particularly valued in-situ reflection, automated story feedback, and video annotation, while other features received mixed views. We highlight perceived benefits, friction points, and design opportunities for future AI-augmented self-regulation tools.
[AI-14] RobEthiChor: Automated Context-aware Ethics-based Negotiation for Autonomous Robots
[Quick Read]: This paper addresses the inability of autonomous systems to take users' individual ethical preferences into account when making decisions, which reduces user trust and prevents the systems from behaving according to their end-users' moral beliefs. When multiple systems with differing ethical preferences interact, they must also negotiate to reach an agreement that satisfies the ethical beliefs of all parties. The key to its solution is RobEthiChor, an approach with a domain-agnostic reference architecture for ethics-based negotiation that lets autonomous systems incorporate user ethical preferences and contextual factors into their decision-making. The approach is implemented within the Robot Operating System (ROS) as RobEthiChor-Ros, and experiments on real robots demonstrate its feasibility, effectiveness, and scalability.
Link: https://arxiv.org/abs/2507.22664
Authors: Mashal Afzal Memon, Gianluca Filippone, Gian Luca Scoccia, Marco Autili, Paola Inverardi
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:The presence of autonomous systems is growing at a fast pace and it is impacting many aspects of our lives. Designed to learn and act independently, these systems operate and perform decision-making without human intervention. However, they lack the ability to incorporate users’ ethical preferences, which are unique for each individual in society and are required to personalize the decision-making processes. This reduces user trust and prevents autonomous systems from behaving according to the moral beliefs of their end-users. When multiple systems interact with differing ethical preferences, they must negotiate to reach an agreement that satisfies the ethical beliefs of all the parties involved and adjust their behavior accordingly. To address this challenge, this paper proposes RobEthiChor, an approach that enables autonomous systems to incorporate user ethical preferences and contextual factors into their decision-making through ethics-based negotiation. RobEthiChor features a domain-agnostic reference architecture for designing autonomous systems capable of ethics-based negotiation. The paper also presents RobEthiChor-Ros, an implementation of RobEthiChor within the Robot Operating System (ROS), which can be deployed on robots to provide them with ethics-based negotiation capabilities. To evaluate our approach, we deployed RobEthiChor-Ros on real robots and ran scenarios where a pair of robots negotiate upon resource contention. Experimental results demonstrate the feasibility and effectiveness of the system in realizing ethics-based negotiation. RobEthiChor allowed robots to reach an agreement in more than 73% of the scenarios with an acceptable negotiation time (0.67s on average). Experiments also demonstrate that the negotiation approach implemented in RobEthiChor is scalable.
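[AI-15] A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models
[Quick Read]: This paper addresses the fragmentation of current research on software vulnerability detection with Large Language Models (LLMs): differences in system designs, dataset usage, and similar factors make studies hard to compare and categorize, which obscures the state of the art. The key to its solution is a comprehensive systematic literature review (SLR) that analyzes 227 studies published between January 2020 and June 2025. It categorizes them by task formulation, input representation, system architecture, and adaptation techniques, and examines the characteristics, vulnerability coverage, and diversity of the datasets used. The review presents a fine-grained taxonomy of vulnerability detection approaches, identifies key limitations, and outlines actionable future research opportunities, improving transparency and serving as a practical guide for researchers and practitioners.
Link: https://arxiv.org/abs/2507.22659
Authors: Sabrina Kaniewski, Fabian Schmidt, Markus Enzweiler, Michael Menth, Tobias Heer
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 36 pages + 17 pages references, 6 tables, 10 figures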
Abstract:The increasing adoption of Large Language Models (LLMs) in software engineering has sparked interest in their use for software vulnerability detection. However, the rapid development of this field has resulted in a fragmented research landscape, with diverse studies that are difficult to compare due to differences in, e.g., system designs and dataset usage. This fragmentation makes it difficult to obtain a clear overview of the state-of-the-art or compare and categorize studies meaningfully. In this work, we present a comprehensive systematic literature review (SLR) of LLM-based software vulnerability detection. We analyze 227 studies published between January 2020 and June 2025, categorizing them by task formulation, input representation, system architecture, and adaptation techniques. Further, we analyze the datasets used, including their characteristics, vulnerability coverage, and diversity. We present a fine-grained taxonomy of vulnerability detection approaches, identify key limitations, and outline actionable future research opportunities. By providing a structured overview of the field, this review improves transparency and serves as a practical guide for researchers and practitioners aiming to conduct more comparable and reproducible research. We publicly release all artifacts and maintain a living repository of LLM-based software vulnerability detection studies.
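[AI-16] Safe Deployment of Offline Reinforcement Learning via Input Convex Action Correction
[Quick Read]: This paper addresses the development of safe and efficient control policies for chemical processes, specifically by applying offline reinforcement learning (offline RL) to historical data in order to avoid the risks and costs of online experimentation. The core challenge is that offline agents show steady-state offsets and degraded performance near setpoints. The key to its solution is a deployment-time safety layer that performs gradient-based action correction using input convex neural networks (PICNNs) as learned cost models. It corrects policy actions in real time and in a differentiable way, without retraining or environment interaction, improving control performance while maintaining stability.
Link: https://arxiv.org/abs/2507.22640
Authors: Alex Durkin, Jasper Stolte, Matthew Jones, Raghuraman Pitchumani, Bei Li, Christian Michler, Mehmet Mercangöz
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: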
Abstract:Offline reinforcement learning (offline RL) offers a promising framework for developing control strategies in chemical process systems using historical data, without the risks or costs of online experimentation. This work investigates the application of offline RL to the safe and efficient control of an exothermic polymerisation continuous stirred-tank reactor. We introduce a Gymnasium-compatible simulation environment that captures the reactor’s nonlinear dynamics, including reaction kinetics, energy balances, and operational constraints. The environment supports three industrially relevant scenarios: startup, grade change down, and grade change up. It also includes reproducible offline datasets generated from proportional-integral controllers with randomised tunings, providing a benchmark for evaluating offline RL algorithms in realistic process control tasks. We assess behaviour cloning and implicit Q-learning as baseline algorithms, highlighting the challenges offline agents face, including steady-state offsets and degraded performance near setpoints. To address these issues, we propose a novel deployment-time safety layer that performs gradient-based action correction using input convex neural networks (PICNNs) as learned cost models. The PICNN enables real-time, differentiable correction of policy actions by descending a convex, state-conditioned cost surface, without requiring retraining or environment interaction. Experimental results show that offline RL, particularly when combined with convex action correction, can outperform traditional control approaches and maintain stability across all scenarios. These findings demonstrate the feasibility of integrating offline RL with interpretable and safety-aware corrections for high-stakes chemical process control, and lay the groundwork for more reliable data-driven automation in industrial systems. 
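The deployment-time correction described above can be sketched as projected gradient descent on a cost that is convex in the action. This is only an illustration under stated assumptions: a hand-written quadratic stands in for the trained PICNN, and the names `cost`, `cost_grad`, `correct_action`, the step size, and the action bounds are all hypothetical.

```python
def cost(state, action):
    """Hypothetical learned cost, convex in `action` for any fixed `state`."""
    target = 0.5 * state  # assumed state-conditioned preferred action
    return (action - target) ** 2 + 0.1 * action ** 2

def cost_grad(state, action):
    """Derivative of `cost` with respect to `action`."""
    target = 0.5 * state
    return 2.0 * (action - target) + 0.2 * action

def correct_action(state, action, steps=50, lr=0.1, low=-1.0, high=1.0):
    """Descend the convex cost surface, keeping the action within bounds."""
    for _ in range(steps):
        action -= lr * cost_grad(state, action)
        action = min(high, max(low, action))
    return action

policy_action = 1.0                       # action proposed by the offline policy
safe_action = correct_action(1.0, policy_action)
```

Since the cost is convex in the action, gradient descent cannot get stuck in a poor local minimum, which is the property that makes a convex cost model attractive for a safety layer. No retraining or environment step is involved; only the proposed action changes.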
[AI-17] H2Tune: Federated Foundation Model Fine-Tuning with Hybrid Heterogeneity
[Quick Read]: This paper studies the double heterogeneity that arises in hybrid heterogeneous federated fine-tuning (HHFFT), where clients differ in both model architectures and downstream tasks. The specific challenges are: 1) heterogeneous matrix aggregation, where clients adopting different foundation models cause dimensional mismatches during LoRA parameter aggregation; and 2) multi-task knowledge interference, where local shared parameters contain both task-shared and task-specific knowledge, so it cannot be ensured that only task-shared knowledge is transferred. To address these, the paper proposes the H2Tune framework with three key components: (i) sparsified triple matrix decomposition, which aligns hidden dimensions across clients by constructing rank-consistent middle matrices, with adaptive sparsification based on client resources; (ii) relation-guided matrix layer alignment, which handles heterogeneous layer structures and representation capabilities; and (iii) an alternating task-knowledge disentanglement mechanism, which separates shared and specific knowledge in local parameters through alternating optimization. Theoretical analysis proves a convergence rate of O(1/\sqrt{T}), and experiments show up to 15.4% accuracy improvement over state-of-the-art baselines.
Link: https://arxiv.org/abs/2507.22633
Authors: Wei Guo, Siyuan Lu, Yiqi Tong, Zhaojun Hu, Fuzhen Zhuang, Xiao Zhang, Tao Fan, Jin Dong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Different from existing federated fine-tuning (FFT) methods for foundation models, hybrid heterogeneous federated fine-tuning (HHFFT) is an under-explored scenario where clients exhibit double heterogeneity in model architectures and downstream tasks. This hybrid heterogeneity introduces two significant challenges: 1) heterogeneous matrix aggregation, where clients adopt different large-scale foundation models based on their task requirements and resource limitations, leading to dimensional mismatches during LoRA parameter aggregation; and 2) multi-task knowledge interference, where local shared parameters, trained with both task-shared and task-specific knowledge, cannot ensure only task-shared knowledge is transferred between clients. To address these challenges, we propose H2Tune, a federated foundation model fine-tuning with hybrid heterogeneity. Our framework H2Tune consists of three key components: (i) sparsified triple matrix decomposition to align hidden dimensions across clients through constructing rank-consistent middle matrices, with adaptive sparsification based on client resources; (ii) relation-guided matrix layer alignment to handle heterogeneous layer structures and representation capabilities; and (iii) alternating task-knowledge disentanglement mechanism to decouple shared and specific knowledge of local model parameters through alternating optimization. Theoretical analysis proves a convergence rate of O(1/\sqrt{T}). Extensive experiments show our method achieves up to 15.4% accuracy improvement compared to state-of-the-art baselines. Our code is available at this https URL.
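The dimensional mismatch and the role of rank-consistent middle matrices can be illustrated with a small sketch. It assumes each client factors its update as B·M·A, where only the r×r middle matrix M has the same shape on every client, and that the server simply averages the middle matrices. The averaging rule, the toy numbers, and the helper names are assumptions for illustration; the sparsification and layer alignment steps of H2Tune are not shown.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def average_middle(middles):
    """Element-wise mean of the clients' same-shape r x r middle matrices."""
    n, r = len(middles), len(middles[0])
    return [[sum(m[i][j] for m in middles) / n for j in range(r)]
            for i in range(r)]

def local_update(client, middle):
    """Rebuild a client's full-size weight update from a middle matrix."""
    return matmul(matmul(client["B"], middle), client["A"])

# Two hypothetical clients with different hidden sizes (3 and 2), rank r = 2.
client_a = {"B": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 3 x 2
            "M": [[1.0, 0.0], [0.0, 1.0]],              # 2 x 2
            "A": [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]}    # 2 x 3
client_b = {"B": [[2.0, 0.0], [0.0, 2.0]],              # 2 x 2
            "M": [[3.0, 0.0], [0.0, 1.0]],              # 2 x 2
            "A": [[1.0, 0.0], [0.0, 1.0]]}              # 2 x 2

# The full updates (3 x 3 and 2 x 2) cannot be averaged directly,
# but the middle matrices share one shape and can.
shared_m = average_middle([client_a["M"], client_b["M"]])
delta_a = local_update(client_a, shared_m)  # 3 x 3
delta_b = local_update(client_b, shared_m)  # 2 x 2
```

Each client keeps its own outer factors B and A, so the aggregated middle matrix is the only object that has to agree in shape across heterogeneous models.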
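[AI-18] Enhancing Manufacturing Knowledge Access with LLMs and Context-aware Prompting ECAI
[Quick Read]: This paper addresses the difficulty that non-expert users in manufacturing face when retrieving information from knowledge graphs (KGs). The core challenge is improving the ability of Large Language Models (LLMs) to translate natural language queries accurately into structured SPARQL queries. The key to its solution is context-aware prompting that supplies LLMs with the relevant schema information of the domain KG, which helps them focus on the relevant parts of the ontology, reduces the risk of hallucination, and significantly improves the correctness and completeness of the generated queries.
Link: https://arxiv.org/abs/2507.22619
Authors: Sebastian Monka, Irlan Grangel-González, Stefan Schmid, Lavdim Halilaj, Marc Rickart, Oliver Rudolph, Rui Dias
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: European Conference on Artificial Intelligence (ECAI) 2024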
Abstract:Knowledge graphs (KGs) have transformed data management within the manufacturing industry, offering effective means for integrating disparate data sources through shared and structured conceptual schemas. However, harnessing the power of KGs can be daunting for non-experts, as it often requires formulating complex SPARQL queries to retrieve specific information. With the advent of Large Language Models (LLMs), there is a growing potential to automatically translate natural language queries into the SPARQL format, thus bridging the gap between user-friendly interfaces and the sophisticated architecture of KGs. The challenge remains in adequately informing LLMs about the relevant context and structure of domain-specific KGs, e.g., in manufacturing, to improve the accuracy of generated queries. In this paper, we evaluate multiple strategies that use LLMs as mediators to facilitate information retrieval from KGs. We focus on the manufacturing domain, particularly on the Bosch Line Information System KG and the I40 Core Information Model. In our evaluation, we compare various approaches for feeding relevant context from the KG to the LLM and analyze their proficiency in transforming real-world questions into SPARQL queries. Our findings show that LLMs can significantly improve their performance on generating correct and complete queries when provided only the adequate context of the KG schema. Such context-aware prompting techniques help LLMs to focus on the relevant parts of the ontology and reduce the risk of hallucination. We anticipate that the proposed techniques help LLMs to democratize access to complex data repositories and empower informed decision-making in manufacturing settings.
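One way to picture context-aware prompting is a helper that puts only the schema terms relevant to a question into the prompt. The sketch below is an assumption-laden illustration, not the strategy evaluated in the paper: the `ex:` schema terms are invented, the keyword-overlap selection is a deliberately simple stand-in, and no LLM is called.

```python
# Hypothetical ontology excerpt; the terms are illustrative and not taken
# from the Bosch Line Information System KG or the I40 Core Information Model.
SCHEMA = {
    "ex:Machine": "A physical machine on a production line.",
    "ex:ProductionLine": "An ordered group of machines.",
    "ex:partOf": "Links a machine to its production line.",
    "ex:Operator": "A person who operates a machine.",
}

def select_context(question, schema, min_len=4):
    """Keep schema terms whose local name overlaps a word in the question."""
    words = [w.strip("?.,").lower() for w in question.split()]
    words = [w for w in words if len(w) >= min_len]
    picked = {}
    for term, description in schema.items():
        local = term.split(":", 1)[1].lower()
        if any(w in local or local in w for w in words):
            picked[term] = description
    return picked

def build_prompt(question, schema):
    """Assemble a prompt that exposes only the selected schema terms."""
    context = select_context(question, schema)
    lines = ["Translate the question into a SPARQL query.",
             "Use only these schema terms:"]
    lines += [f"- {term}: {desc}" for term, desc in context.items()]
    lines += ["Question: " + question, "SPARQL:"]
    return "\n".join(lines)

prompt = build_prompt("Which machines are part of line 3?", SCHEMA)
```

Restricting the prompt to relevant terms is meant to keep the model from inventing classes or properties that the KG does not define, which is the hallucination risk the abstract mentions.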
[AI-19] Adaptive Duration Model for Text Speech Alignment
[Quick Read]: This paper addresses the stability of speech-to-text alignment in neural text-to-speech (TTS) models. Alignments learned through attention in autoregressive models generalize poorly to long utterances and out-of-domain text, leading to missing or repeated words, while most non-autoregressive end-to-end TTS models rely on durations extracted from external sources. The key to its solution is a novel duration prediction framework that produces a phoneme-level duration distribution for the given text and can adapt to conditions, which gives more precise predictions and makes zero-shot TTS models more robust to the mismatch between prompt audio and input audio. In the experiments, alignment accuracy improves by roughly 11.3 percent over baseline models.
Link: https://arxiv.org/abs/2507.22612
Authors: Junjie Cao
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 3 figures, 2 tables
Abstract:Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources, using additional duration models for alignment. In this paper, we propose a novel duration prediction framework that can give a promising phoneme-level duration distribution for a given text. In our experiments, the proposed duration model has more precise prediction and condition adaptation ability compared to previous baseline models. Numerically, it has roughly an 11.3 percent improvement in alignment accuracy, and makes the performance of zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
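[AI-20] Metamorphic Testing of Deep Code Models: A Systematic Literature Review
[Quick Read]: This paper addresses the insufficient robustness of deep code models, whose outputs can be unstable under input perturbations such as semantic-preserving transformations (for example, variable renaming). Its key contribution is a systematic review of metamorphic testing approaches for deep code models. By studying the transformations, techniques, and evaluation methods in 45 primary papers, it summarizes which models, programming tasks, datasets, target languages, and evaluation metrics are most frequently evaluated, and highlights key challenges and future directions.
Link: https://arxiv.org/abs/2507.22610
Authors: Ali Asgari, Milan de Koning, Pouria Derakhshanfar, Annibale Panichella
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: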
Abstract:Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep-code models as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate models’ robustness by applying semantic-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field.
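A metamorphic test of the kind described above can be sketched with Python's standard `ast` module: apply a semantic-preserving transformation (variable renaming) and check that the model's output does not change. The two "models" here are toy stand-ins invented for this example, not real deep code models, and `ast.unparse` requires Python 3.9 or later.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Semantic-preserving transformation: rename identifiers per `mapping`."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def rename(source, mapping):
    """Return `source` with variables and parameters renamed."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)

def robust_model(source):
    """Toy 'defect detector' that flags any division, whatever the names."""
    return any(isinstance(n, ast.Div) for n in ast.walk(ast.parse(source)))

def brittle_model(source):
    """Toy detector that wrongly keys on a specific identifier."""
    return "total" in source

def metamorphic_check(model, source, mapping):
    """Metamorphic relation: the output must be stable under renaming."""
    return model(source) == model(rename(source, mapping))

program = "def mean(total, count):\n    return total / count\n"
mapping = {"total": "a", "count": "b"}
renamed = rename(program, mapping)
```

With these definitions, `metamorphic_check(robust_model, program, mapping)` holds while `metamorphic_check(brittle_model, program, mapping)` fails, which is exactly the kind of robustness gap such tests are meant to expose. No ground-truth label is needed, only the relation between the two outputs.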
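[AI-21] MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines ICML2025
[Quick Read]: This paper addresses limitations in current multi-agent system design: human-designed frameworks cover only a small set of pre-defined scenarios, while automated design methods lack tool integration, depend on external training data, and have rigid communication structures. The key to its solution is MetaAgent, a framework based on finite state machines (FSMs) that automatically constructs a multi-agent system from a task description and polishes it through an optimization algorithm. At deployment, the finite state machine controls the agents' actions and the state transitions. Experiments indicate that the generated systems surpass other auto-designed methods and achieve performance comparable to human-designed systems optimized for the specific tasks.
Link: https://arxiv.org/abs/2507.22606
Authors: Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICML 2025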
Abstract:Large Language Models (LLMs) have demonstrated the ability to solve a wide range of practical tasks within multi-agent systems. However, existing human-designed multi-agent frameworks are typically limited to a small set of pre-defined scenarios, while current automated design methods suffer from several limitations, such as the lack of tool integration, dependence on external training data, and rigid communication structures. In this paper, we propose MetaAgent, a finite state machine based framework that can automatically generate a multi-agent system. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm. When the multi-agent system is deployed, the finite state machine will control the agent’s actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve a comparable performance with the human-designed multi-agent system, which is optimized for those specific tasks.
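How a finite state machine can control agent actions and state transitions is sketched below. This is a hand-written toy, not MetaAgent's generated system: the `coder` and `reviewer` agents are plain functions standing in for LLM-backed agents, and the transition table is written by hand rather than produced by an optimizer.

```python
def run_fsm(agents, transitions, start, task, max_steps=10):
    """Drive agents with a finite state machine.

    agents:      state name -> callable(data) returning (data, condition)
    transitions: (state, condition) -> next state, or None to stop
    Returns the final data and the sequence of visited states.
    """
    state, data, trace = start, task, []
    for _ in range(max_steps):
        trace.append(state)
        data, condition = agents[state](data)
        state = transitions.get((state, condition))
        if state is None:
            break
    return data, trace

def coder(text):
    # Hypothetical agent: appends a new code version to the working text.
    version = text.count("[code") + 1
    return f"{text} [code v{version}]", "done"

def reviewer(text):
    # Hypothetical agent: approves only after a second revision exists.
    verdict = "pass" if text.count("[code") >= 2 else "fail"
    return text, verdict

agents = {"coder": coder, "reviewer": reviewer}
transitions = {
    ("coder", "done"): "reviewer",
    ("reviewer", "fail"): "coder",    # loop back for another revision
    ("reviewer", "pass"): None,       # terminal transition
}

result, trace = run_fsm(agents, transitions, "coder", "sort a list")
```

Here the run visits coder, reviewer, coder, reviewer before stopping. Because transitions depend on the condition each agent reports, the same machine can loop back or terminate without a fixed, linear communication structure.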
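[AI-22] RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment
[Quick Read]: This paper addresses overfitting in patches produced by Automated Program Repair (APR), where a patch passes the test cases without fixing the underlying defect. The authors propose RePaCA, a static patch correctness assessment method. Its key idea is to use Large Language Models (LLMs) fine-tuned with reinforcement learning so that, given the buggy and fixed code snippets, the model generates a Chain of Thought that analyzes the code differences, reasons about how the patch addresses the root cause, and outputs a binary classification (correct or overfitting). The approach reaches 83.1% accuracy and an 84.8% F1-score, generalizes better than prior techniques, and improves explainability.
Link: https://arxiv.org/abs/2507.22580
Authors: Marcos Fuster-Pena,David de-Fitero-Dominguez,Antonio Garcia-Cabot,Eva Garcia-Lopez
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: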
Abstract:Automated Program Repair (APR) seeks to automatically correct software bugs without requiring human intervention. However, existing tools tend to generate patches that satisfy test cases without fixing the underlying bug, those are known as overfitting patches. To address this issue, Automated Patch Correctness Assessment (APCA) attempts to identify overfitting patches generated by APR tools. It can be solved as a static approach, meaning that no additional information is needed beyond the original and fixed code snippets. Current static techniques often struggle with reliability, flexibility and transparency. To address these issues, we introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks. Our model is prompted with both buggy and fixed code snippets and guided to generate a Chain of Thought that analyses code differences, reasons about how the patch addresses the root cause, and ultimately provides a binary classification: correct or overfitting. To enhance these reasoning capabilities for the APCA task specifically, the LLM is finetuned using Reinforcement Learning with the Group Relative Policy Optimization algorithm. When evaluated on a standard Defects4J-derived test, our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score. Furthermore, our model demonstrates superior generalization capabilities when trained on different datasets, outperforming the leading technique. This reasoning capability also provides enhanced explainability for the patch assessment. These findings underscore the considerable promise of finetuned, reasoning LLMs to advance static APCA by enhancing accuracy, generalization, and explainability.
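The abstract names Group Relative Policy Optimization as the fine-tuning algorithm. A minimal sketch of the group-relative advantage at its core follows: several responses are sampled for one prompt, and each reward is standardized against its own group. The 0/1 reward for a correct verdict is an assumption for illustration; the paper's actual reward design is not given in this listing.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards: 1.0 if the sampled verdict matches the label, else 0.0.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group average get a positive advantage and the rest get a negative one, so no separate value network is needed to form the baseline.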
[AI-23] Explaining Deep Network Classification of Matrices: A Case Study on Monotonicity
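[Quick Read]: This paper addresses the classification of monotone matrices: how to extract a simple, practical criterion from the matrix entries or derived parameters that separates monotone from non-monotone matrices. A monotone matrix is one whose inverse is entrywise nonnegative, yet no explicit characterization in terms of the matrix entries has been known. The key to the solution is combining deep neural networks with explainable AI (XAI) techniques: after training a model, saliency methods such as integrated gradients identify the most discriminative features. The absolute values of the two lowest-order coefficients of the characteristic polynomial, $|c_0|$ and $|c_1|$, suffice to classify with 95% accuracy. A further data-driven analysis shows that, for random $7\times 7$ matrices, monotone matrices satisfy $|c_0/c_1| \le 0.18$ with probability 99.98%, which is equivalent to the simple bound $\mathrm{tr}(A^{-1}) \ge 5.7$.
Link: https://arxiv.org/abs/2507.22570
Authors: Leandro Farina,Sergey Korotov
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 22 pages, 11 figures. To be submitted to a journal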
Abstract:This work demonstrates a methodology for using deep learning to discover simple, practical criteria for classifying matrices based on abstract algebraic properties. By combining a high-performance neural network with explainable AI (XAI) techniques, we can distill a model’s learned strategy into human-interpretable rules. We apply this approach to the challenging case of monotone matrices, defined by the condition that their inverses are entrywise nonnegative. Despite their simple definition, an easy characterization in terms of the matrix elements or the derived parameters is not known. Here, we present, to the best of our knowledge, the first systematic machine-learning approach for deriving a practical criterion that distinguishes monotone from non-monotone matrices. After establishing a labelled dataset by randomly generated monotone and non-monotone matrices uniformly on $(-1,1)$, we employ deep neural network algorithms for classifying the matrices as monotone or non-monotone, using both their entries and a comprehensive set of matrix features. By saliency methods, such as integrated gradients, we identify among all features, two matrix parameters which alone provide sufficient information for the matrix classification, with 95% accuracy, namely the absolute values of the two lowest-order coefficients, $c_0$ and $c_1$, of the matrix’s characteristic polynomial. A data-driven study of 18,000 random $7\times 7$ matrices shows that the monotone class obeys $\lvert c_0/c_1\rvert \le 0.18$ with probability 99.98%; because $\lvert c_0/c_1\rvert = 1/\mathrm{tr}(A^{-1})$ for monotone $A$, this is equivalent to the simple bound $\mathrm{tr}(A^{-1}) \ge 5.7$.
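The identity between the coefficient ratio and the trace of the inverse can be checked exactly on a small example. The sketch below uses the Faddeev-LeVerrier recurrence, which yields both the coefficients of $\det(xI - A)$ and $A^{-1}$; the $3\times 3$ test matrix and all function names are assumptions chosen for illustration, not taken from the paper.

```python
from fractions import Fraction

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def char_poly_and_inverse(a):
    """Faddeev-LeVerrier: coefficients [c_0, ..., c_n] of det(x*I - A), and A^{-1}."""
    n = len(a)
    a = [[Fraction(x) for x in row] for row in a]
    coeffs = [Fraction(0)] * (n + 1)
    coeffs[n] = Fraction(1)
    m = [[Fraction(0)] * n for _ in range(n)]
    for k in range(1, n + 1):
        am = matmul(a, m)
        # M_k = A * M_{k-1} + c_{n-k+1} * I
        m = [[am[i][j] + (coeffs[n - k + 1] if i == j else 0) for j in range(n)]
             for i in range(n)]
        coeffs[n - k] = -trace(matmul(a, m)) / k
    if coeffs[0] == 0:
        raise ValueError("matrix is singular")
    inverse = [[-m[i][j] / coeffs[0] for j in range(n)] for i in range(n)]
    return coeffs, inverse

# Illustrative monotone matrix (a tridiagonal M-matrix), not from the paper.
A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
coeffs, inverse = char_poly_and_inverse(A)
is_monotone = all(x >= 0 for row in inverse for x in row)
ratio = abs(coeffs[0] / coeffs[1])        # |c_0 / c_1|
inv_trace = trace(inverse)                # tr(A^{-1})
```

Exact rational arithmetic avoids floating-point noise, so the equality `ratio == 1 / inv_trace` holds exactly for this matrix.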
[AI-24] A surrogate model for topology optimisation of elastic structures via parametric autoencoders
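[Quick Read]: This paper addresses the computational cost of topology optimisation for linear elastic structures under parametric loads and boundary conditions. Conventional methods typically need many iterations to converge and must rerun the full optimisation for each new configuration, which is expensive. The key to the solution is a surrogate model for the entire optimisation pipeline. First, a feed-forward neural network learns the mapping from the system parameters to a low-dimensional latent space and predicts a quasi-optimal initial topology. Then the predicted topology serves as an educated initial guess for an efficient algorithm that penalises intermediate values of the design variable, refining the design locally, keeping the result physically consistent, and correcting surrogate errors. This two-stage strategy reduces the average number of iterations by 53% while keeping the discrepancy in the objective functional below 4%, and it generalises well even when extrapolating beyond the training domain.
Link: https://arxiv.org/abs/2507.22539
Authors: Matteo Giacomini,Antonio Huerta
Affiliation: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 39 pages, 13 figures, 7 tables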
Abstract:A surrogate-based topology optimisation algorithm for linear elastic structures under parametric loads and boundary conditions is proposed. Instead of learning the parametric solution of the state (and adjoint) problems or the optimisation trajectory as a function of the iterations, the proposed approach devises a surrogate version of the entire optimisation pipeline. First, the method predicts a quasi-optimal topology for a given problem configuration as a surrogate model of high-fidelity topologies optimised with the homogenisation method. This is achieved by means of a feed-forward net learning the mapping between the input parameters characterising the system setup and a latent space determined by encoder/decoder blocks reducing the dimensionality of the parametric topology optimisation problem and reconstructing a high-dimensional representation of the topology. Then, the predicted topology is used as an educated initial guess for a computationally efficient algorithm penalising the intermediate values of the design variable, while enforcing the governing equations of the system. This step allows the method to correct potential errors introduced by the surrogate model, eliminate artifacts, and refine the design in order to produce topologies consistent with the underlying physics. Different architectures are proposed and the approximation and generalisation capabilities of the resulting models are numerically evaluated. The quasi-optimal topologies allow to outperform the high-fidelity optimiser by reducing the average number of optimisation iterations by 53% while achieving discrepancies below 4% in the optimal value of the objective functional, even in the challenging scenario of testing the model to extrapolate beyond the training and validation domain.
[AI-25] Accident-Driven Congestion Prediction and Simulation: An Explainable Framework Using Advanced Clustering and Bayesian Networks
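[Quick Read]: This paper addresses urban traffic congestion caused by uncertainties such as accidents, which lead to longer delays, increased emissions, and greater safety risks. The key to the solution is a robust prediction framework combining Automated Machine Learning (AutoML)-enhanced Deep Embedding Clustering (DEC) with a Bayesian Network (BN): AutoML-enhanced DEC clusters the accident data and assigns congestion labels, and the BN then predicts congestion probability. Experiments show that the BN model reaches an overall accuracy of 95.6% and that its predicted congestion states closely match those of Simulation of Urban Mobility (SUMO) under evidence-based scenarios, supporting its ability to model complex accident-congestion relationships and its reliability.
Link: https://arxiv.org/abs/2507.22529
Authors: Kranthi Kumar Talluri,Galia Weidl,Vaishnavi Kasuluru
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: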
Abstract:Traffic congestion due to uncertainties, such as accidents, is a significant issue in urban areas, as the ripple effect of accidents causes longer delays, increased emissions, and safety concerns. To address this issue, we propose a robust framework for predicting the impact of accidents on congestion. We implement Automated Machine Learning (AutoML)-enhanced Deep Embedding Clustering (DEC) to assign congestion labels to accident data and predict congestion probability using a Bayesian Network (BN). The Simulation of Urban Mobility (SUMO) simulation is utilized to evaluate the correctness of BN predictions using evidence-based scenarios. Results demonstrate that the AutoML-enhanced DEC has outperformed traditional clustering approaches. The performance of the proposed BN model achieved an overall accuracy of 95.6%, indicating its ability to understand the complex relationship of accidents causing congestion. Validation in SUMO with evidence-based scenarios demonstrated that the BN model’s prediction of congestion states closely matches those of SUMO, indicating the high reliability of the proposed BN model in ensuring smooth urban mobility.
[AI-26] Collaborative Medical Triage under Uncertainty: A Multi-Agent Dynamic Matching Approach
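[Quick Read]: This paper addresses three core problems of current AI-based emergency triage systems: insufficient medical specialization, which causes hallucination-induced misclassification; heterogeneous department structures across healthcare institutions; and inefficient detail-oriented questioning that slows rapid triage decisions. The key to the solution is a multi-agent interactive triage system built from three specialized agents, RecipientAgent, InquirerAgent, and DepartmentAgent, which collaborate through structured inquiry mechanisms and department-specific guidance rules to turn unstructured patient symptoms into accurate department recommendations. Validated on a dataset of 3,360 real-world cases, the system reaches 89.2% accuracy for primary departments and 73.9% for secondary departments after four rounds of interaction, and its pattern-matching-based guidance mechanism adapts efficiently to diverse hospital configurations while maintaining high triage accuracy.
Link: https://arxiv.org/abs/2507.22504
Authors: Hongyan Cheng,Chengzhang Yu,Yanshu Shi,Chiyue Wang,Cong Liu,Zhanpeng Jin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures, 2 tables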
Abstract:The post-pandemic surge in healthcare demand, coupled with critical nursing shortages, has placed unprecedented pressure on emergency department triage systems, necessitating innovative AI-driven solutions. We present a multi-agent interactive intelligent system for medical triage that addresses three fundamental challenges in current AI-based triage systems: insufficient medical specialization leading to hallucination-induced misclassifications, heterogeneous department structures across healthcare institutions, and inefficient detail-oriented questioning that impedes rapid triage decisions. Our system employs three specialized agents - RecipientAgent, InquirerAgent, and DepartmentAgent - that collaborate through structured inquiry mechanisms and department-specific guidance rules to transform unstructured patient symptoms into accurate department recommendations. To ensure robust evaluation, we constructed a comprehensive Chinese medical triage dataset from a medical website, comprising 3,360 real-world cases spanning 9 primary departments and 62 secondary departments. Through systematic data imputation using large language models, we address the prevalent issue of incomplete medical records in real-world data. Experimental results demonstrate that our multi-agent system achieves 89.2% accuracy in primary department classification and 73.9% accuracy in secondary department classification after four rounds of patient interaction. The system’s pattern-matching-based guidance mechanisms enable efficient adaptation to diverse hospital configurations while maintaining high triage accuracy. Our work provides a scalable framework for deploying AI-assisted triage systems that can accommodate the organizational heterogeneity of healthcare institutions while ensuring clinically sound decision-making.
[AI-27] LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning
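[Quick Read]: This paper addresses a weakness of current machine unlearning (MU) methods when the data to be forgotten varies in difficulty: existing methods usually give all forget data the same weight, so data that is harder to unlearn is not effectively removed. The key to the solution is Loss-based Reweighting Unlearning (LoReUn), which uses each sample's loss as an implicit signal of its unlearning difficulty and dynamically adjusts sample weights during unlearning. This significantly narrows the gap between existing MU methods and exact unlearning at minimal additional computational cost.
Link: https://arxiv.org/abs/2507.22499
Authors: Xiang Li,Qianli Shen,Haonan Wang,Kenji Kawaguchi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages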
Abstract:Recent generative models face significant risks of producing harmful content, which has underscored the importance of machine unlearning (MU) as a critical technique for eliminating the influence of undesired data. However, existing MU methods typically assign the same weight to all data to be forgotten, which makes it difficult to effectively forget certain data that is harder to unlearn than others. In this paper, we empirically demonstrate that the loss of data itself can implicitly reflect its varying difficulty. Building on this insight, we introduce Loss-based Reweighting Unlearning (LoReUn), a simple yet effective plug-and-play strategy that dynamically reweights data during the unlearning process with minimal additional computational overhead. Our approach significantly reduces the gap between existing MU methods and exact unlearning in both image classification and generation tasks, effectively enhancing the prevention of harmful content generation in text-to-image diffusion models.
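One way to picture loss-based reweighting is a softmax over per-sample losses on the forget set. The sketch below is only that, a picture: the exponential form, the `temperature` parameter, and the direction (lower current loss, meaning the sample is still well remembered, receives more weight) are all assumptions, since the abstract states only that the loss reflects unlearning difficulty.

```python
import math

def loss_based_weights(losses, temperature=1.0):
    """Softmax over negative losses, rescaled so the weights average to 1."""
    scaled = [-l / temperature for l in losses]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    n = len(losses)
    return [n * e / total for e in exps]

def reweighted_forget_loss(losses, temperature=1.0):
    """Weighted mean of the per-sample losses using the weights above."""
    weights = loss_based_weights(losses, temperature)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Hypothetical per-sample losses on the forget set.
losses = [0.1, 1.5, 3.0]
weights = loss_based_weights(losses)
```

Because the weights are recomputed from the current losses at every step, the emphasis shifts as samples are gradually forgotten, which is what makes the scheme plug-and-play on top of an existing unlearning objective.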
[AI-28] Proto-EVFL: Enhanced Vertical Federated Learning via Dual Prototype with Extremely Unaligned Data
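[Quick Read]: This paper studies class imbalance in Vertical Federated Learning (VFL) caused by unaligned samples across parties, specifically intra-party class imbalance and inter-party class imbalance, which lead to local model bias and feature contribution inconsistency, respectively, and limit model performance. The key to the solution is the Proto-EVFL framework, whose core contributions are: (1) a dual prototype mechanism in which class prototypes model the relationships between classes in the latent space to support prediction of unseen classes; (2) a probabilistic dual prototype learning scheme that dynamically selects unaligned samples by conditional optimal transport cost, guided by a mixed prior module that combines local and global class prior probabilities; (3) an adaptive gated feature aggregation strategy that dynamically weights and fuses local features from different parties to mitigate feature contribution inconsistency. The framework is the first bi-level optimization approach in VFL and is proved to have a convergence rate of $O(1/\sqrt{T})$.
Link: https://arxiv.org/abs/2507.22488
Authors: Wei Guo,Yiyang Duan,Zhaojun Hu,Yiqi Tong,Fuzhen Zhuang,Xiao Zhang,Jin Dong,Ruofan Wu,Tengfei Liu,Yifan Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: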
Abstract:In vertical federated learning (VFL), multiple enterprises address aligned sample scarcity by leveraging massive locally unaligned samples to facilitate collaborative learning. However, unaligned samples across different parties in VFL can be extremely class-imbalanced, leading to insufficient feature representation and limited model prediction space. Specifically, class-imbalanced problems consist of intra-party class imbalance and inter-party class imbalance, which can further cause local model bias and feature contribution inconsistency issues, respectively. To address the above challenges, we propose Proto-EVFL, an enhanced VFL framework via dual prototypes. We first introduce class prototypes for each party to learn relationships between classes in the latent space, allowing the active party to predict unseen classes. We further design a probabilistic dual prototype learning scheme to dynamically select unaligned samples by conditional optimal transport cost with class prior probability. Moreover, a mixed prior guided module guides this selection process by combining local and global class prior probabilities. Finally, we adopt an adaptive gated feature aggregation strategy to mitigate feature contribution inconsistency by dynamically weighting and aggregating local features across different parties. We proved that Proto-EVFL, as the first bi-level optimization framework in VFL, has a convergence rate of $1/\sqrt{T}$. Extensive experiments on various datasets validate the superiority of our Proto-EVFL. Even in a zero-shot scenario with one unseen class, it outperforms baselines by at least 6.97%.
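[AI-29] Towards Simulating Social Influence Dynamics with LLM-based Multi-agents
[Quick Read]: This paper asks how Large Language Models (LLMs) can be used to simulate core social dynamics of human online interaction, focusing on whether conformity, group polarization, and fragmentation can be reproduced. The key to the solution is a structured multi-agent simulation framework that compares LLMs of different scales and reasoning capabilities. It finds that smaller models are more prone to conformity, while models with stronger reasoning capabilities are more resistant to social influence, linking model capability to how these social dynamics evolve.
Link: https://arxiv.org/abs/2507.22467
Authors: Hsien-Tsung Lin,Pei-Cing Huang,Chan-Tung Ku,Chan Hsu,Pei-Xuan Shieh,Yihuang Kang
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: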
Abstract:Recent advancements in Large Language Models offer promising capabilities to simulate complex human social interactions. We investigate whether LLM-based multi-agent simulations can reproduce core human social dynamics observed in online forums. We evaluate conformity dynamics, group polarization, and fragmentation across different model scales and reasoning capabilities using a structured simulation framework. Our findings indicate that smaller models exhibit higher conformity rates, whereas models optimized for reasoning are more resistant to social influence.
[AI-30] Towards Interpretable Renal Health Decline Forecasting via Multi-LMM Collaborative Reasoning Framework
[Quick Read]: This paper addresses the insufficient performance and poor interpretability of open-source Large Multimodal Models (LMMs) on estimated glomerular filtration rate (eGFR) forecasting, along with concerns about deployment cost, data privacy, and model reliability. The key to the solution is a collaborative framework that incorporates visual knowledge transfer, abductive reasoning, and a short-term memory mechanism, improving prediction accuracy while generating clinically meaningful explanations, so that predictive performance and clinical interpretability improve together.
Link: https://arxiv.org/abs/2507.22464
Authors: Peng-Yi Wu,Pei-Cing Huang,Ting-Yu Chen,Chantung Ku,Ming-Yen Lin,Yihuang Kang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Applications (stat.AP)
Comments:
Abstract:Accurate and interpretable prediction of estimated glomerular filtration rate (eGFR) is essential for managing chronic kidney disease (CKD) and supporting clinical decisions. Recent advances in Large Multimodal Models (LMMs) have shown strong potential in clinical prediction tasks due to their ability to process visual and textual information. However, challenges related to deployment cost, data privacy, and model reliability hinder their adoption. In this study, we propose a collaborative framework that enhances the performance of open-source LMMs for eGFR forecasting while generating clinically meaningful explanations. The framework incorporates visual knowledge transfer, abductive reasoning, and a short-term memory mechanism to enhance prediction accuracy and interpretability. Experimental results show that the proposed framework achieves predictive performance and interpretability comparable to proprietary models. It also provides plausible clinical reasoning processes behind each prediction. Our method sheds new light on building AI systems for healthcare that combine predictive accuracy with clinically grounded interpretability.
[AI-31] Nearest-Better Network for Visualizing and Analyzing Combinatorial Optimization Problems: A Unified Tool
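[Quick Read]: This paper addresses two core problems: the low computational efficiency of the Nearest-Better Network (NBN), and the challenge of extending NBN to combinatorial optimization problems in order to analyze algorithm behavior. For the first, it proposes an efficient NBN computation method with log-linear time complexity. For the second, it shows by theoretical derivation that the NBN is essentially the maximum probability transition network of an algorithm, and applies it to OneMax and the Traveling Salesman Problem (TSP), revealing for the first time that the OneMax landscape exhibits neutrality, ruggedness, and modality, and that the main challenges of TSP are ruggedness, modality, and deception. The study also finds that the state-of-the-art TSP algorithms EAX and LKH have limitations in handling modality and deception, respectively, which offers insight for improving algorithm design.
Link: https://arxiv.org/abs/2507.22440
Authors: Yiya Diao,Changhe Li,Sanyou Zeng,Xinye Cai,Wenjian Luo,Shengxiang Yang,Carlos A. Coello Coello
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: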
Abstract:The Nearest-Better Network (NBN) is a powerful method to visualize sampled data for continuous optimization problems while preserving multiple landscape features. However, the calculation of NBN is very time-consuming, and the extension of the method to combinatorial optimization problems is challenging but very important for analyzing the algorithm’s behavior. This paper provides a straightforward theoretical derivation showing that the NBN network essentially functions as the maximum probability transition network for algorithms. This paper also presents an efficient NBN computation method with logarithmic linear time complexity to address the time-consuming issue. By applying this efficient NBN algorithm to the OneMax problem and the Traveling Salesman Problem (TSP), we have made several remarkable discoveries for the first time: The fitness landscape of OneMax exhibits neutrality, ruggedness, and modality features. The primary challenges of TSP problems are ruggedness, modality, and deception. Two state-of-the-art TSP algorithms (i.e., EAX and LKH) have limitations when addressing challenges related to modality and deception, respectively. LKH, based on local search operators, fails when there are deceptive solutions near global optima. EAX, which is based on a single population, can efficiently maintain diversity. However, when multiple attraction basins exist, EAX retains individuals within multiple basins simultaneously, reducing inter-basin interaction efficiency and leading to algorithm’s stagnation.
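For readers new to the idea, a nearest-better network links each sampled solution to the closest solution with strictly better fitness. The sketch below is the naive quadratic construction on a 4-bit OneMax instance with Hamming distance, not the log-linear method the paper proposes; the function names and the arbitrary tie-breaking are assumptions.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_better_edges(solutions, fitness, distance):
    """For each solution, link to the closest solution with strictly better fitness."""
    edges = {}
    for s in solutions:
        better = [t for t in solutions if fitness(t) > fitness(s)]
        if better:  # the best solution has no outgoing edge
            edges[s] = min(better, key=lambda t: distance(s, t))  # ties: first found
    return edges

# OneMax: fitness is the number of ones, so sum() serves as the fitness function.
solutions = list(product((0, 1), repeat=4))
edges = nearest_better_edges(solutions, fitness=sum, distance=hamming)
```

On OneMax every non-optimal solution has a better neighbour one bit flip away, so all edges have length 1 and lead toward the all-ones string.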
[AI-32] Cross-Border Legal Adaptation of Autonomous Vehicle Design based on Logic and Non-monotonic Reasoning
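[Quick Read]: This paper addresses the legal compliance challenges autonomous vehicles face in transnational use, with a focus on bringing legal reasoning into the design stage to handle normative differences between jurisdictions. The key to the solution is a logic grounded in argumentation theory that represents the basic properties of argument-based practical (normative) reasoning, combined with partially ordered sets of natural numbers to express priorities among legal rules. Case analysis shows that the reasoning system helps designers adapt their designs more flexibly and understand the legal implications of their decisions more clearly.
Link: https://arxiv.org/abs/2507.22432
Authors: Zhe Yu,Yiwei Lu,Burkhard Schafer,Zhe Lin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to appear in Proceedings of the 20th International Conference on Artificial Intelligence and Law (ICAIL 2025)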
Abstract:This paper focuses on the legal compliance challenges of autonomous vehicles in a transnational context. We choose the perspective of designers and try to provide supporting legal reasoning in the design process. Based on argumentation theory, we introduce a logic to represent the basic properties of argument-based practical (normative) reasoning, combined with partial order sets of natural numbers to express priority. Finally, through case analysis of legal texts, we show how the reasoning system we provide can help designers to adapt their design solutions more flexibly in the cross-border application of autonomous vehicles and to more easily understand the legal implications of their decisions.
[AI-33] Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance
[Quick Read]: This paper addresses the high computational cost of Vision-Language-Action (VLA) models, which stems from the large parameter count of Visual Language Models (VLMs) and their autoregressive (AR) decoding. Speculative Decoding (SD) accelerates Large Language Models (LLMs), but applying it directly to VLA models yields limited gains because of the difficulty of action prediction and the greedy decoding of VLA models. The key to the solution is the Spec-VLA framework, which relaxes acceptance based on the relative distances between action tokens. This raises the acceptance length by 44% and delivers a 1.42× speedup over the OpenVLA baseline without reducing the success rate.
Link: https://arxiv.org/abs/2507.22424
Authors: Songsheng Wang,Rucheng Yu,Zhihang Yuan,Chao Yu,Feng Gao,Yu Wang,Derek F. Wong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, under review
Abstract:Vision-Language-Action (VLA) models have made substantial progress by leveraging the robust capabilities of Visual Language Models (VLMs). However, VLMs’ significant parameter size and autoregressive (AR) decoding nature impose considerable computational demands on VLA models. While Speculative Decoding (SD) has shown efficacy in accelerating Large Language Models (LLMs) by incorporating efficient drafting and parallel verification, allowing multiple tokens to be generated in one forward pass, its application to VLA models remains unexplored. This work introduces Spec-VLA, an SD framework designed to accelerate VLA models. Due to the difficulty of the action prediction task and the greedy decoding mechanism of the VLA models, the direct application of the advanced SD framework to the VLA prediction task yields a minor speed improvement. To boost the generation speed, we propose an effective mechanism to relax acceptance utilizing the relative distances represented by the action tokens of the VLA model. Empirical results across diverse test scenarios affirm the effectiveness of the Spec-VLA framework, and further analysis substantiates the impact of our proposed strategies, which enhance the acceptance length by 44%, achieving 1.42 times speedup compared with the OpenVLA baseline, without compromising the success rate. The success of the Spec-VLA framework highlights the potential for broader application of speculative execution in VLA prediction scenarios.
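The effect of relaxing acceptance can be illustrated with a toy verifier. The sketch below assumes action tokens are integer bin indices of discretized continuous actions, so the index difference is a meaningful distance; the `max_distance` threshold and the token values are hypothetical. A real speculative decoder would also recompute target tokens conditioned on the accepted drafts, which this sketch leaves out.

```python
def acceptance_length(draft_tokens, target_tokens, max_distance=0):
    """Count the leading draft tokens accepted by the verifier.

    max_distance=0 is strict matching; larger values relax acceptance.
    """
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if abs(d - t) > max_distance:
            break                 # first rejection ends the accepted prefix
        accepted += 1
    return accepted

# Hypothetical action-bin tokens from a draft model and the target model.
draft = [120, 87, 45, 200, 13]
target = [120, 88, 45, 197, 13]
strict = acceptance_length(draft, target)                    # exact match only
relaxed = acceptance_length(draft, target, max_distance=2)   # nearby bins accepted
```

Under strict matching the off-by-one token in the second position ends the accepted prefix, whereas the relaxed rule lets the draft continue until the bins differ by more than the threshold.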
[AI-34] On the Definition of Intelligence
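[Quick Read]: This paper asks how to define and evaluate the core capability of artificial general intelligence (AGI) without tying it to a particular species or task. Its key proposal is a general criterion based on sample fidelity, $\epsilon$-category intelligence: a system is intelligent with respect to a category if no admissible distinguisher can separate its generated samples from the original samples beyond tolerance $\epsilon$. The framework treats intelligence as the ability to generate samples of a category given samples from that category. This provides a unified evaluation criterion across paradigms such as reinforcement learning, generative models, classification, analogical reasoning, and goal-directed decision-making, and a theoretical basis for studying evaluation, safety, and generalization.
Link: https://arxiv.org/abs/2507.22423
Authors: Kei-Sing Ng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AGI-25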
Abstract:To engineer AGI, we should first capture the essence of intelligence in a species-agnostic form that can be evaluated, while being sufficiently general to encompass diverse paradigms of intelligent behavior, including reinforcement learning, generative models, classification, analogical reasoning, and goal-directed decision-making. We propose a general criterion based on sample fidelity: intelligence is the ability, given sample(s) from a category, to generate sample(s) from the same category. We formalise this intuition as $\epsilon$-category intelligence: it is $\epsilon$-intelligent with respect to a category if no chosen admissible distinguisher can separate generated from original samples beyond tolerance $\epsilon$. We present the formal framework, outline empirical protocols, and discuss implications for evaluation, safety, and generalization.
[AI-35] Systematic Evaluation of Knowledge Graph Repair with Large Language Models
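[Quick Read]: This paper addresses the lack of systematic, general evaluation of knowledge graph repair quality: existing methods rely on ad hoc datasets, which hinders rigorous analysis of repair systems in broader settings. The key to the solution is a systematic evaluation framework based on the Shapes Constraint Language (SHACL) that introduces violation-inducing operations (VIOs), a mechanism for generating SHACL constraint violations in a controlled way so that repair systems can be tested thoroughly. Within this framework, the authors build several repair systems with large language models and compare prompting strategies, finding that concise prompts containing the relevant violated SHACL constraints and key contextual information from the knowledge graph perform best.
Link: https://arxiv.org/abs/2507.22419
Authors: Tung-Wei Lin,Gabe Fierro,Han Li,Tianzhen Hong,Pierluigi Nuzzo,Alberto Sangiovanni-Vincentelli
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: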
Abstract:We present a systematic approach for evaluating the quality of knowledge graph repairs with respect to constraint violations defined in shapes constraint language (SHACL). Current evaluation methods rely on ad hoc datasets, which limits the rigorous analysis of repair systems in more general settings. Our method addresses this gap by systematically generating violations using a novel mechanism, termed violation-inducing operations (VIOs). We use the proposed evaluation framework to assess a range of repair systems which we build using large language models. We analyze the performance of these systems across different prompting strategies. Results indicate that concise prompts containing both the relevant violated SHACL constraints and key contextual information from the knowledge graph yield the best performance.
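[AI-36] SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection
[Quick Read]: This paper addresses two problems in smart contract vulnerability detection: traditional static analysis struggles in complex scenarios, and methods based on specialized pre-trained models perform well on specific datasets but generalize poorly. General-purpose Large Language Models (LLMs) adapt to new vulnerability patterns but still underperform specialized models on specific vulnerability types. The key to the solution is the SAEL framework, whose core ideas are: (1) targeted prompts that guide LLMs to identify vulnerabilities and generate fine-grained explanations, which serve as prediction features; (2) prompt-tuning of CodeT5 and T5 to improve task-specific handling of code and explanation features; (3) an Adaptive Mixture-of-Experts architecture in which a Gating Network adjusts feature weights dynamically using TopK filtering and Softmax normalization, a Multi-Head Self-Attention mechanism strengthens cross-feature relationships, and a joint loss function optimizes both the independent performance of each feature module and the overall weighted prediction.
Link: https://arxiv.org/abs/2507.22371
Authors: Lei Yu,Shiqi Cheng,Zhirong Huang,Jingyuan Zhang,Chenjie Shen,Junyi Lu,Li Yang,Fengjun Zhang,Jiajia Ma
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted to ICSME 2025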
Abstract:With the increasing security issues in blockchain, smart contract vulnerability detection has become a research focus. Existing vulnerability detection methods have their limitations: 1) Static analysis methods struggle with complex scenarios. 2) Methods based on specialized pre-trained models perform well on specific datasets but have limited generalization capabilities. In contrast, general-purpose Large Language Models (LLMs) demonstrate impressive ability in adapting to new vulnerability patterns. However, they often underperform on specific vulnerability types compared to methods based on specialized pre-trained models. We also observe that explanations generated by general-purpose LLMs can provide fine-grained code understanding information, contributing to improved detection performance. Inspired by these observations, we propose SAEL, an LLM-based framework for smart contract vulnerability detection. We first design targeted prompts to guide LLMs in identifying vulnerabilities and generating explanations, which serve as prediction features. Next, we apply prompt-tuning on CodeT5 and T5 to process contract code and explanations, enhancing task-specific performance. To combine the strengths of each approach, we introduce an Adaptive Mixture-of-Experts architecture. This dynamically adjusts feature weights via a Gating Network, which selects relevant features using TopK filtering and Softmax normalization, and incorporates a Multi-Head Self-Attention mechanism to enhance cross-feature relationships. This design enables effective integration of LLM predictions, explanation features, and code features through gradient optimization. The loss function jointly considers both independent feature performance and overall weighted predictions. Experiments show that SAEL outperforms existing methods across various vulnerabilities. 
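The gating step described above, TopK filtering followed by Softmax normalization, can be sketched in a few lines. The scores, the value of `k`, and the scalar expert outputs below are hypothetical; SAEL's actual gating network is learned and operates on feature vectors.

```python
import math

def topk_softmax_gate(scores, k):
    """Keep the k largest gate scores, softmax over them, zero out the rest."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)              # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

def combine(expert_outputs, weights):
    """Weighted sum of per-expert predictions."""
    return sum(w * o for w, o in zip(weights, expert_outputs))

# Hypothetical gate scores for: LLM prediction, explanation feature, code feature.
gate_scores = [2.0, 0.5, 1.0]
weights = topk_softmax_gate(gate_scores, k=2)
prediction = combine([0.9, 0.2, 0.6], weights)
```

Zeroing the unselected experts before normalization keeps a weak feature from diluting the prediction, while the softmax keeps the surviving weights differentiable.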
[AI-37] Beyond Accuracy: How AI Metacognitive Sensitivity improves AI-assisted Decision Making
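[Quick Read]: This paper addresses how to evaluate and optimize the effect of AI assistance on human decision quality in human-AI decision making. Conventional approaches look only at the AI's predictive accuracy and overlook the reliability of its confidence estimates, that is, its metacognitive sensitivity: the ability to assign confidence scores that distinguish correct from incorrect predictions. The paper proposes a theoretical framework for analyzing the joint effect of predictive accuracy and metacognitive sensitivity on human decisions and finds that, under certain conditions, an AI with lower predictive accuracy but higher metacognitive sensitivity can still improve overall human decision accuracy. A behavioral experiment confirms that greater metacognitive sensitivity improves human decision performance. The key to the solution is therefore to include metacognitive sensitivity in the evaluation of AI assistance and to optimize it together with predictive accuracy.
Link: https://arxiv.org/abs/2507.22365
Authors: ZhaoBin Li,Mark Steyvers
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 5 figures, submitted to Decision Analysis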
Abstract:In settings where human decision-making relies on AI input, both the predictive accuracy of the AI system and the reliability of its confidence estimates influence decision quality. We highlight the role of AI metacognitive sensitivity – its ability to assign confidence scores that accurately distinguish correct from incorrect predictions – and introduce a theoretical framework for assessing the joint impact of AI’s predictive accuracy and metacognitive sensitivity in hybrid decision-making settings. Our analysis identifies conditions under which an AI with lower predictive accuracy but higher metacognitive sensitivity can enhance the overall accuracy of human decision making. Finally, a behavioral experiment confirms that greater AI metacognitive sensitivity improves human decision performance. Together, these findings underscore the importance of evaluating AI assistance not only by accuracy but also by metacognitive sensitivity, and of optimizing both to achieve superior decision outcomes.
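One common way to quantify how well confidence separates correct from incorrect predictions is an AUROC computed over the confidence scores. The sketch below uses that measure as an assumption; the metric the paper itself uses is not stated in this listing, and the confidence values are made up.

```python
def metacognitive_auroc(confidences, correct):
    """Probability that a random correct prediction gets higher confidence
    than a random incorrect one (ties count as 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect predictions")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two hypothetical AIs with the same 60% accuracy but different confidence behavior.
correct = [True, True, False, True, False]
sensitive = metacognitive_auroc([0.9, 0.8, 0.3, 0.7, 0.4], correct)
insensitive = metacognitive_auroc([0.7, 0.7, 0.7, 0.7, 0.7], correct)
```

Both assistants are right equally often, but only the first tells a human when to trust it, which is the distinction the abstract draws between accuracy and metacognitive sensitivity.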
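[AI-38] Magentic-UI: Towards Human-in-the-loop Agentic Systems
[Quick Read]: This paper addresses two issues: AI agents powered by large language models still fall short of human-level performance on complex multi-step tasks, and their interaction with the outside world introduces safety and controllability risks. The key to the solution is Magentic-UI, a human-in-the-loop system that combines human oversight with AI efficiency through six low-cost interaction mechanisms (such as co-planning, co-tasking, and action guards), improving task completion while preserving safety. Built on a flexible multi-agent architecture, it can be extended with diverse tools via the Model Context Protocol (MCP), and it has been evaluated along several dimensions, offering a scalable path toward efficient and safe human-agent collaboration.
Link: https://arxiv.org/abs/2507.22358
Authors: Hussein Mozannar,Gagan Bansal,Cheng Tan,Adam Fourney,Victor Dibia,Jingya Chen,Jack Gerrits,Tyler Payne,Matheus Kunzler Maldaner,Madeleine Grunde-McLaughlin,Eric Zhu,Griffin Bassman,Jacob Alber,Peter Chang,Ricky Loynd,Friederike Niedtner,Ece Kamar,Maya Murad,Rafah Hosn,Saleema Amershi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: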
Abstract:AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in most domains including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world also introduce safety and security risks, including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement: co-planning, co-tasking, multi-tasking, action guards, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI’s potential to advance safe and efficient human-agent collaboration.
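[AI-39] An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem
[Quick read]: This paper addresses the difficulty that current large language model (LLM)-based agents have in bridging virtual-world and real-world services within the Metaverse service ecosystem, focusing on character data fusion, character knowledge association, and ethical safety. The key to the solution is an explainable emotion alignment framework that systematically integrates factual factors into the decision-making loop of LLM-based agents to achieve deeper relational fact alignment, making the agents' social behavior in complex scenarios more realistic and credible.
Link: https://arxiv.org/abs/2507.22326
Authors: Qun Ma, Xiao Xue, Ming Zhang, Yifan Shen, Zihan Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: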
Abstract:Metaverse service is a product of the convergence between Metaverse and service systems, designed to address service-related challenges concerning digital avatars, digital twins, and digital natives within Metaverse. With the rise of large language models (LLMs), agents now play a pivotal role in Metaverse service ecosystem, serving dual functions: as digital avatars representing users in the virtual realm and as service assistants (or NPCs) providing personalized support. However, during the modeling of Metaverse service ecosystems, existing LLM-based agents face significant challenges in bridging virtual-world services with real-world services, particularly regarding issues such as character data fusion, character knowledge association, and ethical safety concerns. This paper proposes an explainable emotion alignment framework for LLM-based agents in Metaverse Service Ecosystem. It aims to integrate factual factors into the decision-making loop of LLM-based agents, systematically demonstrating how to achieve more relational fact alignment for these agents. Finally, a simulation experiment in the Offline-to-Offline food delivery scenario is conducted to evaluate the effectiveness of this framework, obtaining more realistic social emergence.
[AI-40] From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications
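[Quick read]: This paper addresses the high cost of maintaining software packages, including dependency management, bug fixes, and versioning. The key to the solution is to treat the rich method descriptions in scientific publications as standalone specifications for modern large language models (LLMs), enabling on-demand code generation in place of human-maintained libraries. Experiments indicate that current state-of-the-art LLMs can reliably reproduce the functionality of core algorithms with performance indistinguishable from conventional libraries, pointing to a shift from static, human-maintained packages toward flexible on-demand generation with reduced maintenance overhead.
Link: https://arxiv.org/abs/2507.22324
Authors: Cameron S. Movassaghi, Amanda Momenzadeh, Jesse G. Meyer
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: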
Abstract:Maintaining software packages imposes significant costs due to dependency management, bug fixes, and versioning. We show that rich method descriptions in scientific publications can serve as standalone specifications for modern large language models (LLMs), enabling on-demand code generation that could supplant human-maintained libraries. We benchmark state-of-the-art models (GPT-o4-mini-high, Gemini Pro 2.5, Claude Sonnet 4) by tasking them with implementing a diverse set of core algorithms drawn from original publications. Our results demonstrate that current LLMs can reliably reproduce package functionality with performance indistinguishable from conventional libraries. These findings foreshadow a paradigm shift toward flexible, on-demand code generation and away from static, human-maintained packages, which will result in reduced maintenance overhead by leveraging published articles as sufficient context for the automated implementation of analytical workflows.
[AI-41] AdapSCA-PSO: An Adaptive Localization Algorithm with AI-Based Hybrid SCA-PSO for IoT WSNs
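[Quick read]: This paper addresses the accurate localization of sensor nodes in the Internet of Things (IoT), a basic requirement for practical IoT applications. The core solution is a hybrid metaheuristic localization algorithm. Its key idea is to combine the Sine Cosine Algorithm (SCA), which is effective at global search, with Particle Swarm Optimization (PSO), which excels at local search, and to add an adaptive switching module that dynamically selects between the two. The initialization, fitness evaluation, and parameter settings are also redesigned for the node localization problem, which improves localization accuracy and reduces the number of iterations.
Link: https://arxiv.org/abs/2507.22317
Authors: Ze Zhang, Qian Dong, Wenhan Wang
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: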
Abstract:The accurate localization of sensor nodes is a fundamental requirement for the practical application of the Internet of Things (IoT). To enable robust localization across diverse environments, this paper proposes a hybrid meta-heuristic localization algorithm. Specifically, the algorithm integrates the Sine Cosine Algorithm (SCA), which is effective in global search, with Particle Swarm Optimization (PSO), which excels at local search. An adaptive switching module is introduced to dynamically select between the two algorithms. Furthermore, the initialization, fitness evaluation, and parameter settings of the algorithm have been specifically redesigned and optimized to address the characteristics of the node localization problem. Simulation results across varying numbers of sensor nodes demonstrate that, compared to standalone PSO and the unoptimized SCAPSO algorithm, the proposed method significantly reduces the number of required iterations and achieves an average localization error reduction of 84.97%.
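The hybrid idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' algorithm: the SCA and PSO updates follow their textbook forms, while the fitness function `range_error`, the switching rule (use SCA when the global best stalls, PSO otherwise), the search box, and every parameter value are hypothetical choices made here, since the abstract does not specify them.

```python
import math
import random


def range_error(pos, anchors, dists):
    """Mean squared mismatch between measured and implied anchor distances (assumed fitness)."""
    return sum((math.dist(pos, a) - d) ** 2 for a, d in zip(anchors, dists)) / len(anchors)


def sca_step(x, best, amplitude, rng):
    """Sine Cosine Algorithm move: oscillate around the best-known position."""
    new_x = []
    for xi, bi in zip(x, best):
        r2 = rng.uniform(0.0, 2.0 * math.pi)
        r3 = rng.uniform(0.0, 2.0)
        wave = math.sin(r2) if rng.random() < 0.5 else math.cos(r2)
        new_x.append(xi + amplitude * wave * abs(r3 * bi - xi))
    return new_x


def pso_step(x, v, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """Particle Swarm Optimization move: inertia plus pulls toward personal and global bests."""
    new_v = [w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
             for xi, vi, pi, gi in zip(x, v, pbest, gbest)]
    return [xi + vi for xi, vi in zip(x, new_v)], new_v


def localize(anchors, dists, iters=100, swarm=20, seed=0):
    rng = random.Random(seed)
    xs = [[rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)] for _ in range(swarm)]
    vs = [[0.0, 0.0] for _ in range(swarm)]
    pbest = [x[:] for x in xs]
    pcost = [range_error(x, anchors, dists) for x in xs]
    g = min(range(swarm), key=pcost.__getitem__)
    gbest, gcost = pbest[g][:], pcost[g]
    history, stalled = [gcost], False
    for t in range(iters):
        amplitude = 2.0 * (1.0 - t / iters)  # textbook SCA decay from 2 toward 0
        before = gcost
        for i in range(swarm):
            if stalled:  # hypothetical rule: explore with SCA when progress stalls
                xs[i] = sca_step(xs[i], gbest, amplitude, rng)
            else:        # otherwise refine with PSO
                xs[i], vs[i] = pso_step(xs[i], vs[i], pbest[i], gbest, rng)
            cost = range_error(xs[i], anchors, dists)
            if cost < pcost[i]:
                pbest[i], pcost[i] = xs[i][:], cost
            if cost < gcost:
                gbest, gcost = xs[i][:], cost
        stalled = gcost >= before
        history.append(gcost)
    return gbest, history


anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
dists = [math.dist(true_pos, a) for a in anchors]  # noise-free ranges for the demo
estimate, history = localize(anchors, dists)
print(estimate, history[-1])
```

`history` records the best fitness after each iteration, so it never increases; plotting it for different switching rules is one simple way to compare them.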
[AI-42] Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items
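[Quick read]: This paper addresses inaccurate modeling of substitutable and complementary items in existing recommender systems, caused by noisy user behavior data and data sparsity arising from long-tailed distributions. The core solution is MMSC, a self-supervised multi-modal relational item representation learning framework. Its key components are: (1) a multi-modal foundation model that learns high-quality multi-modal item representations from item metadata; (2) a self-supervised behavior-based representation learning module that denoises user behavior data and strengthens the learned representations; and (3) a hierarchical representation aggregation mechanism that fuses item representations at the semantic and task levels. Training data augmented by large language models (LLMs) further improves the denoising process, leading to better modeling of substitutes and complements.
Link: https://arxiv.org/abs/2507.22268
Authors: Junting Wang, Chenghuan Guo, Jiao Yang, Yanhui Guo, Yan Gao, Hari Sundaram
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: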
Abstract:We introduce a novel self-supervised multi-modal relational item representation learning framework designed to infer substitutable and complementary items. Existing approaches primarily focus on modeling item-item associations deduced from user behaviors using graph neural networks (GNNs) or leveraging item content information. However, these methods often overlook critical challenges, such as noisy user behavior data and data sparsity due to the long-tailed distribution of these behaviors. In this paper, we propose MMSC, a self-supervised multi-modal relational item representation learning framework to address these challenges. Specifically, MMSC consists of three main components: (1) a multi-modal item representation learning module that leverages a multi-modal foundational model and learns from item metadata, (2) a self-supervised behavior-based representation learning module that denoises and learns from user behavior data, and (3) a hierarchical representation aggregation mechanism that integrates item representations at both the semantic and task levels. Additionally, we leverage LLMs to generate augmented training data, further enhancing the denoising process during training. We conduct extensive experiments on five real-world datasets, showing that MMSC outperforms existing baselines by 26.1% for substitutable recommendation and 39.2% for complementary recommendation. In addition, we empirically show that MMSC is effective in modeling cold-start items.
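[AI-43] Promoting Online Safety by Simulating Unsafe Conversations with LLMs
[Quick read]: This paper addresses the misuse of generative AI, particularly large language models (LLMs), to produce unsafe online conversations such as scams; because LLMs generate human-like text, they lower the barrier to entry for bad actors. The key to the solution is to simulate realistic scam conversations, with two LLMs playing the scammer and the target, in order to strengthen users' ability to recognize and respond to them. Users of the system give feedback to the target LLM, which applies the learning-sciences finding that feedback on hypothetical actions promotes learning and aims to improve users' judgment and safety awareness in real unsafe online interactions.
Link: https://arxiv.org/abs/2507.22267
Authors: Owen Hoffman, Kangze Peng, Zehua You, Sajid Kamal, Sukrit Venkatagiri
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: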
Abstract:Generative AI, including large language models (LLMs), has the potential – and is already being used – to increase the speed, scale, and types of unsafe conversations online. LLMs lower the barrier to entry for bad actors to create unsafe conversations, in particular because of their ability to generate persuasive and human-like text. In our current work, we explore ways to promote online safety by teaching people about unsafe conversations that can occur online with and without LLMs. We build on prior work that shows that LLMs can successfully simulate scam conversations. We also leverage research in the learning sciences that shows that providing feedback on one’s hypothetical actions can promote learning. In particular, we focus on simulating scam conversations using LLMs. Our work incorporates two LLMs, a scammer LLM and a target LLM, that converse with each other to simulate realistic, unsafe conversations that people may encounter online, and users of our system are asked to provide feedback to the target LLM.
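[AI-44] Agent-centric learning: from external reward maximization to internal knowledge curation
[Quick read]: This paper addresses the limited adaptability that results from the heavy reliance on external objectives in current general-intelligence research: conventional approaches focus on an agent's control over its environment or its mastery of specific tasks, which tends to produce specialized agents that generalize poorly. The key to the solution is representational empowerment, a new perspective that shifts the focus of learning from the external environment to the agent's internal representations. By measuring the agent's ability to controllably maintain and diversify its own knowledge structures, it aims to improve the agent's preparedness and offers a framework for designing more adaptable intelligent systems.
Link: https://arxiv.org/abs/2507.22255
Authors: Hanqi Zhou, Fryderyk Mantiuk, David G. Nagy, Charley M. Wu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: RLC Finding the Frame Workshop 2025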
Abstract:The pursuit of general intelligence has traditionally centered on external objectives: an agent’s control over its environments or mastery of specific tasks. This external focus, however, can produce specialized agents that lack adaptability. We propose representational empowerment, a new perspective towards a truly agent-centric learning paradigm by moving the locus of control inward. This objective measures an agent’s ability to controllably maintain and diversify its own knowledge structures. We posit that the capacity – to shape one’s own understanding – is an element for achieving better "preparedness" distinct from direct environmental influence. Focusing on internal representations as the main substrate for computing empowerment offers a new lens through which to design adaptable intelligent systems.
[AI-45] Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training
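[Quick read]: This paper addresses how to optimize the construction of domain-specific datasets for foundation model training, in particular how to estimate the quality of different data sources (such as synthetic or filtered web data) efficiently under limited resources and make sound data acquisition decisions when specializing a generalist pre-trained model to a specific domain. The key to the solution is to extend the usual point-estimate approach (micro-annealing) to the estimation of scaling laws, by performing multiple annealing runs with varying compute spent on data curation and training. This overcomes the limitation that single point estimates can mislead because data source rankings are not invariant across compute scales, and it supports more cost-effective data source selection and resource allocation based on performance gains relative to acquisition costs.
Link: https://arxiv.org/abs/2507.22250
Authors: Oleksiy Ostapenko, Charles Guille-Escuret, Luke Kumar, Max Tian, Denis Kocetkov, Gopeshh Subbaraj, Raymond Li, Joel Lamy-Poirier, Sebastien Paquet, Torsten Scholak
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: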
Abstract:We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web data, etc.) in order to make optimal decisions about resource allocation for data sourcing from these sources for the stage two pre-training phase, aka annealing, with the goal of specializing a generalist pre-trained model to specific domains. Our approach extends the usual point estimate approaches, aka micro-annealing, to estimating scaling laws by performing multiple annealing runs of varying compute spent on data curation and training. This addresses a key limitation in prior work, where reliance on point estimates for data scaling decisions can be misleading due to the lack of rank invariance across compute scales – a phenomenon we confirm in our experiments. By systematically analyzing performance gains relative to acquisition costs, we find that scaling curves can be estimated for different data sources. Such scaling laws can inform cost effective resource allocation across different data acquisition methods (e.g. synthetic data), data sources (e.g. user or web data) and available compute resources. We validate our approach through experiments on a pre-trained model with 7 billion parameters. We adapt it to: a domain well-represented in the pre-training data – the medical domain, and a domain underrepresented in the pretraining corpora – the math domain. We show that one can efficiently estimate the scaling behaviors of a data source by running multiple annealing runs, which can lead to different conclusions, had one used point estimates using the usual micro-annealing technique instead. This enables data-driven decision-making for selecting and optimizing data sources.
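A small Python sketch shows why a fitted curve can disagree with a point estimate. It assumes a pure power law, loss = a * compute^(-b), fitted by least squares in log-log space; the abstract does not state the functional form the authors use, and the two data sources and all loss values below are invented for illustration.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), -slope


def predict(params, compute):
    """Loss predicted by a fitted (a, b) pair at a given compute budget."""
    a, b = params
    return a * compute ** (-b)


# Hypothetical annealing runs at four small compute budgets (arbitrary units).
budgets = [1.0, 2.0, 4.0, 8.0]
loss_web = [2.0 * c ** -0.1 for c in budgets]    # made-up "filtered web" source
loss_synth = [3.0 * c ** -0.2 for c in budgets]  # made-up "synthetic" source

web = fit_power_law(budgets, loss_web)
synth = fit_power_law(budgets, loss_synth)
for c in (8.0, 1000.0):
    better = "web" if predict(web, c) < predict(synth, c) else "synthetic"
    print(f"compute={c:g}: lower predicted loss from the {better} source")
```

Every measured budget favors the web source, yet the fitted curves cross and the synthetic source wins at the larger budget. That rank flip is the kind of effect a single point estimate cannot reveal.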
[AI-46] Large Language Model-Based Framework for Explainable Cyberattack Detection in Automatic Generation Control Systems
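[Quick read]: This paper addresses new cybersecurity vulnerabilities introduced by the digitization of smart grids, in particular false data injection attacks (FDIAs) against automatic generation control (AGC) systems, which can undermine stable grid operation. Conventional machine learning (ML) and deep learning (DL) models detect such attacks well, but their opaque decision-making limits operator trust and practical deployment. The key to the solution is a hybrid framework: a lightweight ML classifier (such as LightGBM) performs real-time attack detection with high accuracy (up to 95.13%) and low latency (0.004 s), and once an attack is detected, large language models (LLMs, such as GPT-4o mini) generate human-readable natural language explanations that identify the attack target and estimate the attack magnitude and onset time, improving the actionability and trustworthiness of the AI system.
Link: https://arxiv.org/abs/2507.22239
Authors: Muhammad Sharshar, Ahmad Mohammad Saber, Davor Svetinovic, Amr M. Youssef, Deepa Kundur, Ehab F. El-Saadany
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted Publication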
Abstract:The increasing digitization of smart grids has improved operational efficiency but also introduced new cybersecurity vulnerabilities, such as False Data Injection Attacks (FDIAs) targeting Automatic Generation Control (AGC) systems. While machine learning (ML) and deep learning (DL) models have shown promise in detecting such attacks, their opaque decision-making limits operator trust and real-world applicability. This paper proposes a hybrid framework that integrates lightweight ML-based attack detection with natural language explanations generated by Large Language Models (LLMs). Classifiers such as LightGBM achieve up to 95.13% attack detection accuracy with only 0.004 s inference latency. Upon detecting a cyberattack, the system invokes LLMs, including GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o mini, to generate a human-readable explanation of the event. Evaluated on 100 test samples, GPT-4o mini with 20-shot prompting achieved 93% accuracy in identifying the attack target, a mean absolute error (MAE) of 0.075 pu in estimating attack magnitude, and an MAE of 2.19 seconds in estimating attack onset. These results demonstrate that the proposed framework effectively balances real-time detection with interpretable, high-fidelity explanations, addressing a critical need for actionable AI in smart grid cybersecurity.
[AI-47] Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics
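[Quick read]: This paper addresses the efficient erasure of individual voice signatures from voice biometric systems, as required by privacy regulations such as GDPR's right to be forgotten and India's DPDP Act. Existing unlearning methods designed for image data handle the sequential, temporal, and high-dimensional nature of audio signals poorly, so speaker-level and accent-level erasure is ineffective or incomplete. The key to the solution is QPAudioEraser, a quantum-inspired audio unlearning framework whose core mechanisms are: weight initialization using destructive interference to nullify target features, superposition-based label transformations that obscure class identity, an uncertainty-maximizing quantum loss function that promotes forgetting, and an entanglement-inspired weight mixing strategy that retains model knowledge. Across several audio datasets and model architectures, the method achieves complete erasure of the target data (0% forget accuracy) with minimal impact on retained data (degradation as low as 0.05%) and outperforms conventional baselines.
Link: https://arxiv.org/abs/2507.22208
Authors: Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 2 figures, 5 tables, Accepted at IJCB 2025 (Osaka, Japan)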
Abstract:The widespread adoption of voice-enabled authentication and audio biometric systems has significantly increased privacy vulnerabilities associated with sensitive speech data. Compliance with privacy regulations such as GDPR’s right to be forgotten and India’s DPDP Act necessitates targeted and efficient erasure of individual-specific voice signatures from already-trained biometric models. Existing unlearning methods designed for visual data inadequately handle the sequential, temporal, and high-dimensional nature of audio signals, leading to ineffective or incomplete speaker and accent erasure. To address this, we introduce QPAudioEraser, a quantum-inspired audio unlearning framework. Our four-phase approach involves: (1) weight initialization using destructive interference to nullify target features, (2) superposition-based label transformations that obscure class identity, (3) an uncertainty-maximizing quantum loss function, and (4) entanglement-inspired mixing of correlated weights to retain model knowledge. Comprehensive evaluations with ResNet18, ViT, and CNN architectures across AudioMNIST, Speech Commands, LibriSpeech, and Speech Accent Archive datasets validate QPAudioEraser’s superior performance. The framework achieves complete erasure of target data (0% Forget Accuracy) while incurring minimal impact on model utility, with a performance degradation on retained data as low as 0.05%. QPAudioEraser consistently surpasses conventional baselines across single-class, multi-class, sequential, and accent-level erasure scenarios, establishing the proposed approach as a robust privacy-preserving solution.
[AI-48] Measuring Time-Series Dataset Similarity using Wasserstein Distance
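[Quick read]: This paper addresses how to measure the similarity of time-series datasets in support of foundation model tasks such as model selection, finetuning, and visualization. The key to the solution is to model each time-series dataset as an empirical sample from an underlying multivariate normal distribution (MVN) and to quantify the similarity of two datasets as the Wasserstein distance between their corresponding MVNs. Experiments show that the approach is effective, particularly for estimating inference performance in out-of-distribution and transfer learning evaluation, where the authors report high correlation between the measure and inference loss (0.60).
Link: https://arxiv.org/abs/2507.22189
Authors: Hongjie Chen, Akshay Mehra, Josh Kimball, Ryan A. Rossi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: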
Abstract:The emergence of time-series foundation model research elevates the growing need to measure the (dis)similarity of time-series datasets. A time-series dataset similarity measure aids research in multiple ways, including model selection, finetuning, and visualization. In this paper, we propose a distribution-based method to measure time-series dataset similarity by leveraging the Wasserstein distance. We consider a time-series dataset an empirical instantiation of an underlying multivariate normal distribution (MVN). The similarity between two time-series datasets is thus computed as the Wasserstein distance between their corresponding MVNs. Comprehensive experiments and visualization show the effectiveness of our approach. Specifically, we show how the Wasserstein distance helps identify similar time-series datasets and facilitates inference performance estimation of foundation models in both out-of-distribution and transfer learning evaluation, with high correlations between our proposed measure and the inference loss (0.60).
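For Gaussians the 2-Wasserstein distance has a closed form: W2^2 = ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S2^(1/2) S1 S2^(1/2))^(1/2)), with means m and covariances S. The Python sketch below implements only the special case of diagonal covariances, where the trace term reduces to the squared difference of per-coordinate standard deviations. This simplification, and the choice to treat each equal-length series as one vector, are assumptions made here to keep the example dependency-free; the abstract does not say how the authors build their MVNs.

```python
import math
import statistics


def fit_diagonal_gaussian(dataset):
    """Per-coordinate mean and standard deviation of a list of equal-length series."""
    columns = list(zip(*dataset))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def w2_diagonal(gauss_a, gauss_b):
    """2-Wasserstein distance between two Gaussians with diagonal covariances."""
    (mean_a, std_a), (mean_b, std_b) = gauss_a, gauss_b
    mean_term = sum((p - q) ** 2 for p, q in zip(mean_a, mean_b))
    spread_term = sum((p - q) ** 2 for p, q in zip(std_a, std_b))
    return math.sqrt(mean_term + spread_term)


# Two tiny made-up datasets, each holding two series of length two.
dataset_a = [[0.0, 0.0], [2.0, 2.0]]
dataset_b = [[3.0, 1.0], [5.0, 3.0]]
distance = w2_diagonal(fit_diagonal_gaussian(dataset_a), fit_diagonal_gaussian(dataset_b))
print(distance)
```

Here both datasets have unit spread in every coordinate, so the distance comes entirely from the shift in means, (3, 1), and equals sqrt(10). A full-covariance version would need a matrix square root, for example from a linear algebra library.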
[AI-49] SourceSplice: Source Selection for Machine Learning Tasks
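[Quick read]: This paper addresses how to efficiently select the best subset of data sources for building a high-quality training dataset when many sources are available, so as to improve the predictive performance of downstream machine learning (ML) tasks. The core challenge is that existing data discovery methods focus on metadata matching or semantic similarity and ignore the effect of source quality on ML model performance. The key to the solution is two frameworks, SourceGrasp and SourceSplice: the former performs a heuristic search based on a greediness criterion and randomization, while the latter draws on gene splicing, imitating the recombination of fragments in a biological process to combine data sources so as to maximize task utility. Experiments show that SourceSplice identifies source combinations that improve ML task performance more effectively while exploring significantly fewer subsets.
Link: https://arxiv.org/abs/2507.22186
Authors: Ambarish Singh, Romila Pradhan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: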
Abstract:Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks - a challenge amplified by the deluge of data sources available in modern [...]. [...] work in data discovery largely focus on metadata matching, semantic similarity or identifying tables that should be joined to answer a particular query, but do not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML task. Both the algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility, and must be judiciously [...]. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired from gene splicing - a core concept used in protein [...]. We empirically evaluate our algorithms on three real-world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task [...]. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.
[AI-50] Spatial-Temporal Reinforcement Learning for Network Routing with Non-Markovian Traffic
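[Quick read]: This paper addresses two challenges that conventional reinforcement learning (RL) faces in routing optimization for communication networks: first, standard RL algorithms rest on the Markov Decision Process (MDP) assumption, which often fails in practice and prevents them from reaching optimal solutions; second, conventional RL methods commonly use function approximation (for example, neural networks) that does not explicitly model the spatial relationships in complex network topologies. The key to the solution is a spatial-temporal RL framework that integrates Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs) to capture, respectively, the spatial dynamics of the network topology and the temporal evolution of traffic patterns, improving the performance and robustness of routing decisions.
Link: https://arxiv.org/abs/2507.22174
Authors: Molly Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: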
Abstract:Reinforcement Learning (RL) has become a well-established approach for optimizing packet routing in communication networks. Standard RL algorithms typically are based on the Markov Decision Process (MDP), which assumes that the current state of the environment provides all the necessary information for system evolution and decision-making. However, this Markovian assumption is invalid in many practical scenarios, making the MDP and RL frameworks inadequate to produce the optimal solutions. Additionally, traditional RL algorithms often employ function approximations (e.g., by neural networks) that do not explicitly capture the spatial relationships inherent in environments with complex network topologies. Communication networks are characterized by dynamic traffic patterns and arbitrary numbers of nodes and links, which further complicate the decision-making process. To address these challenges, we propose a spatial-temporal RL approach that integrates Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs) to adequately capture the spatial dynamics regarding network topology and temporal traffic patterns, respectively, to enhance routing decisions. Our evaluation demonstrates that the proposed method outperforms and is more robust to changes in the network topology when compared with traditional RL techniques.
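[AI-51] Enhancing Jailbreak Attacks on LLMs via Persona Prompts
[Quick read]: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks, in particular the lack of systematic study of how persona prompts can induce models to generate harmful content. The key to the solution is a genetic algorithm-based method that automatically evolves persona prompts capable of bypassing LLM safety mechanisms. Experiments show that the method reduces refusal rates by 50-70% and, when combined with existing attack methods, raises success rates by a further 10-20%.
Link: https://arxiv.org/abs/2507.22171
Authors: Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: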
Abstract:Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLM’s safety mechanisms. Our experiments reveal that: (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at this https URL.
[AI-52] When Truthful Representations Flip Under Deceptive Instructions?
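[Quick read]: This paper addresses how large language models (LLMs) change their internal representations to produce misleading outputs when given maliciously crafted deceptive instructions. Existing work concentrates on output-level analysis, and the mechanisms behind the changes in internal representations remain poorly understood. The key to the solution is a systematic analysis of representational dynamics under different instruction conditions using linear probes and Sparse Autoencoders (SAEs). It finds that deceptive instructions substantially shift internal representations in early-to-mid layers, that the shift is concentrated in specific SAE features, and that truthful and deceptive representational subspaces can be distinguished. These findings expose feature-level and layer-level signatures of deception and provide a basis for detecting and controlling instructed dishonesty in LLMs.
Link: https://arxiv.org/abs/2507.22149
Authors: Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, Pan Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: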
Abstract:Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLM compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations "flip", such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find the model’s instructed True/False output is predictable via linear probes across all conditions based on the internal representation. Further, we use Sparse Autoencoders (SAEs) to show that the Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are similar), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instruction and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces. Our findings expose feature- and layer-level signatures of deception, offering new insights for detecting and mitigating instructed dishonesty in LLMs.
[AI-53] Hybrid activation functions for deep neural networks: S3 and S4 – a novel approach to gradient flow optimization
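[Quick read]: This paper addresses the gradient propagation problems of conventional activation functions in deep neural networks, such as dead neurons with ReLU and vanishing gradients with sigmoid and tanh. The key to the solution is two new hybrid activation functions, S3 (Sigmoid-Softsign) and its improved version S4. S3 applies sigmoid to negative inputs and softsign to positive inputs, and S4 adds a smooth transition between the two controlled by a tunable steepness parameter k. S4 keeps gradients within a reported range of [0.24, 0.59], whereas ReLU shows 18% dead neurons in deep networks, and it is reported to converge faster (19% faster than ReLU), with stable training and adaptability to different tasks and network depths.
Link: https://arxiv.org/abs/2507.22090
Authors: Sergii Kavun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Networking and Internet Architecture (cs.NI)
Comments: 15 pages, 2 figures, 5 tables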
Abstract:Activation functions are critical components in deep neural networks, directly influencing gradient flow, training stability, and model performance. Traditional functions like ReLU suffer from dead neuron problems, while sigmoid and tanh exhibit vanishing gradient issues. We introduce two novel hybrid activation functions: S3 (Sigmoid-Softsign) and its improved version S4 (smoothed S3). S3 combines sigmoid for negative inputs with softsign for positive inputs, while S4 employs a smooth transition mechanism controlled by a steepness parameter k. We conducted comprehensive experiments across binary classification, multi-class classification, and regression tasks using three different neural network architectures. S4 demonstrated superior performance compared to nine baseline activation functions, achieving 97.4% accuracy on MNIST, 96.0% on Iris classification, and 18.7 MSE on Boston Housing regression. The function exhibited faster convergence (-19 for ReLU) and maintained stable gradient flow across network depths. Comparative analysis revealed S4’s gradient range of [0.24, 0.59] compared to ReLU’s 18% dead neurons in deep networks. The S4 activation function addresses key limitations of existing functions through its hybrid design and smooth transition mechanism. The tunable parameter k allows adaptation to different tasks and network depths, making S4 a versatile choice for deep learning applications. These findings suggest that hybrid activation functions represent a promising direction for improving neural network training dynamics.
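The two functions can be sketched in Python as follows. S3 follows the abstract's description directly, with the assumption that x = 0 is handled by the softsign branch. The S4 formula is an assumption: the abstract only says the transition is smooth and controlled by k, so this sketch blends the two branches with a logistic weight sigmoid(k * x), and the default k = 5.0 is arbitrary.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softsign(x):
    return x / (1.0 + abs(x))


def s3(x):
    """S3 as described: sigmoid for negative inputs, softsign otherwise (x = 0 assumed softsign)."""
    return softsign(x) if x >= 0 else sigmoid(x)


def s4(x, k=5.0):
    """Assumed S4: logistic blend of the two branches; larger k gives a sharper switch."""
    alpha = sigmoid(k * x)
    return alpha * softsign(x) + (1.0 - alpha) * sigmoid(x)


for x in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(x, round(s3(x), 4), round(s4(x), 4))
```

Under this reading, S3 jumps at the origin (sigmoid approaches 0.5 from the left while softsign is 0 at zero), which is presumably what the smooth transition in S4 is meant to remove. Far from zero the blend weight saturates and S4 matches the corresponding S3 branch.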
[AI-54] Principled Curriculum Learning using Parameter Continuation Methods
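[Quick read]: This paper addresses local-optimum traps and convergence difficulties in the optimization of deep neural networks, especially the challenge of achieving good generalization with complex non-convex objectives. The key to the solution is a parameter continuation method, which constructs a continuous path that gradually deforms a simple initial problem into the final complex one, guiding optimization from easy to hard. The strategy is closely related in theory to homotopies and curriculum learning, and in practice it shows better generalization than state-of-the-art optimization techniques such as ADAM on supervised and unsupervised learning tasks.
Link: https://arxiv.org/abs/2507.22089
Authors: Harsh Nilesh Pathak, Randy Paffenroth
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: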
Abstract:In this work, we propose a parameter continuation method for the optimization of neural networks. There is a close connection between parameter continuation, homotopies, and curriculum learning. The methods we propose here are theoretically justified and practically effective for several problems in deep neural networks. In particular, we demonstrate better generalization performance than state-of-the-art optimization techniques such as ADAM for supervised and unsupervised learning tasks.
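The continuation idea is easiest to see on a one-dimensional toy problem. The Python sketch below is not the authors' method for neural networks: it applies plain natural-parameter continuation to a homotopy h(x, lam) = (1 - lam) * g(x) + lam * f(x), where the convex starting problem g, the non-convex target f, the step size, and the number of stages are all choices made here for illustration.

```python
def hard_grad(x):
    """Derivative of the non-convex target f(x) = x**4 - 3*x**2 + x."""
    return 4 * x ** 3 - 6 * x + 1


def easy_grad(x):
    """Derivative of the convex starting problem g(x) = (x + 2)**2."""
    return 2 * (x + 2)


def descend(grad, x, steps=200, lr=0.01):
    """Plain gradient descent from x using the supplied gradient function."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def continuation(x, stages=10):
    """Minimize (1 - lam) * g + lam * f for lam = 0, 1/stages, ..., 1, warm-starting each stage."""
    for i in range(stages + 1):
        lam = i / stages
        x = descend(lambda z: (1 - lam) * easy_grad(z) + lam * hard_grad(z), x)
    return x


def f(x):
    return x ** 4 - 3 * x ** 2 + x


x_direct = descend(hard_grad, 2.0)  # gradient descent on the hard problem only
x_path = continuation(2.0)          # same start, but following the homotopy path
print(round(x_direct, 3), round(f(x_direct), 3))
print(round(x_path, 3), round(f(x_path), 3))
```

From the same starting point x = 2, direct descent settles in the shallower minimum near x = 1.13, while the continuation path ends in the deeper one near x = -1.30. The outcome depends on the choice of the easy problem; a different g could steer the path elsewhere.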
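[AI-55] TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories
[Quick read]: This paper addresses the long-standing challenge of type inference for dynamic languages such as Python, and in particular the fact that the type inference capabilities of large language models (LLMs) remain underexplored despite their promise in code understanding. The key to the solution is TypyBench, a new benchmark for evaluating LLM type inference across entire Python repositories, together with two new metrics: TypeSim, which measures the semantic similarity between predicted and ground-truth types, and TypeCheck, which assesses type consistency across a codebase. Experiments show that although LLMs achieve decent type similarity, they struggle with complex nested types and make significant type consistency errors, suggesting that future research should shift from improving type similarity to addressing repository-level consistency.
Link: https://arxiv.org/abs/2507.22086
Authors: Honghua Dong, Jiacheng Yang, Xun Deng, Yuhe Jiang, Gennady Pekhimenko, Fan Long, Xujie Si
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: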
Abstract:Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs’ type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at this https URL.
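[AI-56] Shape Invariant 3D-Variational Autoencoder: Super Resolution in Turbulence flow
Quick read: This paper addresses how to combine multiscale information with deep learning architectures in turbulence modeling, and how to use generative AI for super-resolution reconstruction. Its approach has two key parts: first, combining multiscale turbulence models with deep neural networks to better represent complex flow structures; second, applying deep generative models (such as variational autoencoders or generative adversarial networks) to reconstruct high-resolution turbulence fields from low-resolution data, improving simulation accuracy and physical consistency.
Link: https://arxiv.org/abs/2507.22082
Authors: Anuraj Maurya
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Comments: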
Abstract:Deep learning provides a versatile suite of methods for extracting structured information from complex datasets, enabling deeper understanding of underlying fluid dynamic phenomena. The field of turbulence modeling, in particular, benefits from the growing availability of high-dimensional data obtained through experiments, field observations, and large-scale simulations spanning multiple spatio-temporal scales. This report presents a concise overview of both classical and deep learning-based approaches to turbulence modeling. It further investigates two specific challenges at the intersection of fluid dynamics and machine learning: the integration of multiscale turbulence models with deep learning architectures, and the application of deep generative models for super-resolution reconstruction.
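[AI-57] From Cloud-Native to Trust-Native: A Protocol for Verifiable Multi-Agent Systems
Quick read: This paper addresses the verifiability problem facing autonomous agents driven by large language models (LLMs) in high-stakes domains such as pharmaceutical R&D and legal workflow automation: how to ensure that agent behavior conforms to established policies and compliance requirements, rather than merely being intelligent. The key to the solution is the TrustTrack protocol, which embeds structural guarantees (verifiable identity, policy commitments, and tamper-resistant behavioral logs) directly into agent infrastructure. This establishes a new systems paradigm, "trust-native autonomy", which turns compliance from post-hoc oversight into a design constraint and enables trustworthy collaboration across organizations and jurisdictions.
Link: https://arxiv.org/abs/2507.22077
Authors: Muyang Li
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 14 pages, 2 figures. Vision paper and protocol blueprint. No prior submission or publication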
Abstract:As autonomous agents powered by large language models (LLMs) proliferate in high-stakes domains – from pharmaceuticals to legal workflows – the challenge is no longer just intelligence, but verifiability. We introduce TrustTrack, a protocol that embeds structural guarantees – verifiable identity, policy commitments, and tamper-resistant behavioral logs – directly into agent infrastructure. This enables a new systems paradigm: trust-native autonomy. By treating compliance as a design constraint rather than post-hoc oversight, TrustTrack reframes how intelligent agents operate across organizations and jurisdictions. We present the protocol design, system requirements, and use cases in regulated domains such as pharmaceutical R&D, legal automation, and AI-native collaboration. We argue that the Cloud → AI → Agent → Trust transition represents the next architectural layer for autonomous systems.
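The abstract does not describe how TrustTrack implements its behavioral logs, so the following is only a generic illustration of one common way to make a log tamper-evident: a hash chain, where each entry commits to the hash of the previous one. All names here (`append_entry`, `verify_log`, the entry fields) are hypothetical and are not taken from the paper.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (assumption)


def _digest(prev_hash, agent_id, action):
    """SHA-256 over a canonical JSON encoding of one log entry."""
    payload = json.dumps(
        {"prev": prev_hash, "agent": agent_id, "action": action},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, agent_id, action):
    """Append an entry whose hash commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "prev": prev_hash,
        "agent": agent_id,
        "action": action,
        "hash": _digest(prev_hash, agent_id, action),
    })


def verify_log(log):
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["agent"], entry["action"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only makes tampering detectable; a real deployment would also need signed entries tied to a verifiable agent identity, which this sketch leaves out.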
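[AI-58] A Compute-Matched Re-Evaluation of TroVE on MATH
Quick read: This paper asks whether the recently proposed TroVE method, which outperforms the PRIMITIVE baseline on the MATH benchmark, really owes its claimed advantage to generating and reusing tools for mathematical problem solving. The crux is to verify whether the gains come from the "toolbox" mechanism or from other factors such as a larger compute budget or self-consistency. The approach is to re-evaluate TroVE with strictly matched compute and to correct a flaw in the selection mechanism of the original implementation. After matching compute budgets, TroVE yields only a marginal 1% improvement, which indicates that tool reuse does not significantly improve performance; the main contribution is showing that the reported advantage stems from compute allocation rather than structured tool use.
Link: https://arxiv.org/abs/2507.22069
Authors: Tobias Sesterhenn, Ian Berlot-Attwell, Janis Zenkner, Christian Bartelt
Affiliation: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: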
Abstract:Reusing established theorems and formulas is central to mathematical problem solving, serving as essential building blocks for tackling increasingly complex challenges. Recent work, TroVE, argues that code-generating Large Language Models (LLMs) can benefit similarly on the MATH benchmark by inducing and reusing higher-level toolboxes. By allocating computational budget across an ensemble of three modes – directly generating code, creating tools, and reusing tools – TroVE claims to outperform a PRIMITIVE baseline that only performs direct generation. However, recent analysis (Berlot-Attwell et al., 2024) casts doubt on these gains, noting that the tools created are often trivial or rarely reused, suggesting that improvements may stem from self-consistency or self-correction. In this work, we re-evaluate TroVE on MATH, analyze the impact of each of its modes, and show that its benefit does not come from these mechanisms, but simply from a higher computational budget spent for TroVE compared to PRIMITIVE. To this end, we also perform a small correction in the original implementation of TroVE’s selection mechanism, boosting TroVE’s performance on MATH by 3% in accuracy. After matching for compute, the benefit of TroVE reduces to a marginal improvement of 1%, suggesting that this toolbox approach does not provide a significant benefit on MATH.
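To see why matching compute matters, here is a minimal sketch of self-consistency (majority voting over sampled answers) evaluated under a fixed per-problem sample budget. It is not the TroVE or PRIMITIVE implementation; the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def accuracy(samples_per_problem, gold, budget):
    """Accuracy when each problem may use only its first `budget` samples."""
    correct = 0
    for samples, target in zip(samples_per_problem, gold):
        if majority_vote(samples[:budget]) == target:
            correct += 1
    return correct / len(gold)


# Toy data: the same sampler looks better simply because it gets more samples.
samples = [["4", "5", "4", "4", "5", "5", "5"], ["7", "7", "8"]]
gold = ["5", "7"]
print(accuracy(samples, gold, budget=3))  # small budget
print(accuracy(samples, gold, budget=7))  # larger budget
```

Comparing a method that draws seven samples against a baseline that draws three would credit the method with a gain that comes only from the budget, which is the kind of confound a compute-matched evaluation removes.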
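[AI-59] Fuzzing: Randomness? Reasoning! Efficient Directed Fuzzing via Large Language Models
Quick read: This paper addresses the negative effect of randomness on fuzzing efficiency, especially in directed fuzzing: although directed fuzzing reduces randomness by steering test cases toward target bug locations, it still struggles with the randomness introduced by seeds and mutators. The key to the solution is using large language models (LLMs) to improve seed quality and reduce mutator randomness. On one side, LLMs generate reachable, targeted seeds based on function call chains or functionality; on the other, LLMs analyze bug causes and mutation suggestions to construct mutators tailored to specific bugs, which speeds up bug exposure. Experiments show that, with RandLuzz-generated seeds, four state-of-the-art directed fuzzers achieve average speedups of 2.1× to 4.8× over widely used initial seeds, and that RandLuzz exposes 8 bugs within 60 seconds.
Link: https://arxiv.org/abs/2507.22065
Authors: Xiaotao Feng, Xiaogang Zhu, Kun Hu, Jincheng Wang, Yingjie Cao, Guang Gong, Jianfeng Pan
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Programming Languages (cs.PL)
Comments: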
Abstract:Fuzzing is highly effective in detecting bugs due to the key contribution of randomness. However, randomness significantly reduces the efficiency of fuzzing, causing it to cost days or weeks to expose bugs. Even though directed fuzzing reduces randomness by guiding fuzzing towards target buggy locations, the dilemma of randomness still challenges directed fuzzers. Two critical components, which are seeds and mutators, contain randomness and are closely tied to the conditions required for triggering bugs. Therefore, to address the challenge of randomness, we propose to use large language models (LLMs) to remove the randomness in seeds and reduce the randomness in mutators. With their strong reasoning and code generation capabilities, LLMs can be used to generate reachable seeds that target pre-determined locations and to construct bug-specific mutators tailored for specific bugs. We propose RandLuzz, which integrates LLMs and directed fuzzing, to improve the quality of seeds and mutators, resulting in efficient bug exposure. RandLuzz analyzes function call chain or functionality to guide LLMs in generating reachable seeds. To construct bug-specific mutators, RandLuzz uses LLMs to perform bug analysis, obtaining information such as bug causes and mutation suggestions, which further help generate code that performs bug-specific mutations. We evaluate RandLuzz by comparing it with four state-of-the-art directed fuzzers, AFLGo, Beacon, WindRanger, and SelectFuzz. With RandLuzz-generated seeds, the fuzzers achieve an average speedup ranging from 2.1× to 4.8× compared to using widely-used initial seeds. Additionally, when evaluated on individual bugs, RandLuzz achieves up to a 2.7× speedup compared to the second-fastest exposure. On 8 bugs, RandLuzz can even expose them within 60 seconds.
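[AI-60] Machine Learning Experiences: A story of learning AI for use in enterprise software testing that can be used by anyone
Quick read: This paper addresses how to apply machine learning (ML) effectively in software testing, with a focus on systematizing the ML process so that projects are easier to carry out successfully. The key to the solution is a structured ML workflow, similar to the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework, that consists of data collection, data cleaning, feature engineering, splitting the data into training and test sets, model selection, model training, model testing, and performance evaluation, so that any project can follow it to apply ML.
Link: https://arxiv.org/abs/2507.22064
Authors: Michael Cohoon, Debbie Furman
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: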
Abstract:This paper details the machine learning (ML) journey of a group of people focused on software testing. It tells the story of how this group progressed through an ML workflow (similar to the CRISP-DM process). This workflow consists of the following steps and can be used by anyone applying ML techniques to a project: gather the data; clean the data; perform feature engineering on the data; split the data into two sets, one for training and one for testing; choose a machine learning model; train the model; test the model and evaluate the model performance. By following this workflow, anyone can effectively apply ML to any project that they are doing.
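The workflow steps listed in the abstract can be sketched end to end with the Python standard library. Everything concrete below is an assumption made for illustration: the synthetic "test run" rows (duration, retries, failed flag), the drop-missing cleaning rule, and the nearest-centroid model are not taken from the paper.

```python
import random


def make_rows(n, seed=0):
    """Gather: synthetic (duration_s, retries, failed) rows; some durations missing."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        failed = i % 2
        duration = rng.gauss(30 + 40 * failed, 5)
        retries = rng.randint(0, 1) + 2 * failed
        rows.append((None if i % 10 == 0 else duration, retries, failed))
    return rows


def clean(rows):
    """Clean: drop rows with a missing duration."""
    return [r for r in rows if r[0] is not None]


def featurize(row):
    """Feature engineering: rescale so both features have similar ranges."""
    duration, retries, _ = row
    return (duration / 100.0, retries / 3.0)


def split(rows, test_fraction=0.25, seed=0):
    """Split into a training set and a held-out test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def train(rows):
    """Choose and train: nearest-centroid model, one mean feature vector per label."""
    sums, counts = {}, {}
    for row in rows:
        x, y = featurize(row), row[2]
        sx, sy = sums.get(y, (0.0, 0.0))
        sums[y] = (sx + x[0], sy + x[1])
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}


def predict(model, row):
    """Predict the label whose centroid is closest in squared distance."""
    x = featurize(row)
    return min(model, key=lambda y: (x[0] - model[y][0]) ** 2 + (x[1] - model[y][1]) ** 2)


def evaluate(model, rows):
    """Test and evaluate: accuracy on the held-out set."""
    return sum(predict(model, r) == r[2] for r in rows) / len(rows)


rows = clean(make_rows(200))
train_rows, test_rows = split(rows)
model = train(train_rows)
print(f"held-out accuracy: {evaluate(model, test_rows):.2f}")
```

Keeping the test rows out of `train` is the point of the split step: the reported accuracy then estimates performance on data the model has not seen.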
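[AI-61] RedCoder: Automated Multi-Turn Red Teaming for Code LLMs
Quick read: This paper addresses security weaknesses of generative AI in code generation, specifically the tendency of large language models (LLMs) to produce vulnerable or even malicious code under adversarial conditions. Existing red-teaming methods rely on extensive human effort, scale poorly, and overlook the multi-turn nature of real programming interactions. The key to the solution is RedCoder, a red-teaming agent built through a multi-agent gaming process: simulated multi-turn adversarial interactions yield prototype conversations and reusable attack strategies, which are then used to steer a target Code LLM toward vulnerable outputs. The method makes red teaming more automated and more effective, and it outperforms prior single-turn and multi-turn methods across multiple Code LLMs.
Link: https://arxiv.org/abs/2507.22063
Authors: Wenjie Jacky Mo, Qin Liu, Xiaofei Wen, Dongwon Jung, Hadi Askari, Wenxuan Zhou, Zhe Zhao, Muhao Chen
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: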
Abstract:Large Language Models (LLMs) for code generation (i.e., Code LLMs) have demonstrated impressive capabilities in AI-assisted software development and testing. However, recent studies have shown that these models are prone to generating vulnerable or even malicious code under adversarial settings. Existing red-teaming approaches rely on extensive human effort, limiting their scalability and practicality, and generally overlook the interactive nature of real-world AI-assisted programming, which often unfolds over multiple turns. To bridge these gaps, we present RedCoder, a red-teaming agent that engages victim models in multi-turn conversation to elicit vulnerable code. The pipeline to construct RedCoder begins with a multi-agent gaming process that simulates adversarial interactions, yielding a set of prototype conversations and an arsenal of reusable attack strategies. We then fine-tune an LLM on these prototype conversations to serve as the backbone of RedCoder. Once deployed, RedCoder autonomously engages Code LLMs in multi-turn conversations, dynamically retrieving relevant strategies from the arsenal to steer the dialogue toward vulnerability-inducing outputs. Experiments across multiple Code LLMs show that our approach outperforms prior single-turn and multi-turn red-team methods in inducing vulnerabilities in code generation, offering a scalable and effective tool for evaluating the security boundaries of modern code-generation systems.
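[AI-62] GABRIL: Gaze-Based Regularization for Mitigating Causal Confusion in Imitation Learning (IROS 2025)
Quick read: This paper addresses the performance degradation that causal confusion causes in imitation learning (IL): the agent learns spurious correlations instead of causal relationships in the training environment and therefore performs poorly under distribution shift at test time. The key to the solution is GABRIL (GAze-Based Regularization in Imitation Learning), a regularization method based on human gaze data. Expert gaze is recorded during data collection and used to build a regularization loss that steers the model toward the causally relevant features identified by the expert's gaze, which suppresses the influence of confounding variables and improves both out-of-distribution generalization and explainability.
Link: https://arxiv.org/abs/2507.19647
Authors: Amin Banayeeanzade, Fatemeh Bahrani, Yutai Zhou, Erdem Bıyık
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IROS 2025 camera-ready version. First two authors contributed equally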
Abstract:Imitation Learning (IL) is a widely adopted approach which enables agents to learn from human expert demonstrations by framing the task as a supervised learning problem. However, IL often suffers from causal confusion, where agents misinterpret spurious correlations as causal relationships, leading to poor performance in testing environments with distribution shift. To address this issue, we introduce GAze-Based Regularization in Imitation Learning (GABRIL), a novel method that leverages the human gaze data gathered during the data collection phase to guide the representation learning in IL. GABRIL utilizes a regularization loss which encourages the model to focus on causally relevant features identified through expert gaze and consequently mitigates the effects of confounding variables. We validate our approach in Atari environments and the Bench2Drive benchmark in CARLA by collecting human gaze datasets and applying our method in both domains. Experimental results show that the improvement of GABRIL over behavior cloning is around 179% more than the same number for other baselines in the Atari and 76% in the CARLA setup. Finally, we show that our method provides extra explainability when compared to regular IL agents.
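[AI-63] RecPS: Privacy Risk Scoring for Recommender Systems
Quick read: This paper addresses the difficulty of quantifying the privacy risk of sensitive user interactions in recommender system (RecSys) training data, with the goal of enabling privacy-aware model development and deployment. The core challenge is that users have no way to tell which interactions are more privacy-sensitive and so cannot make an informed choice. The key to the solution is RecPS, a privacy scoring method based on membership-inference attacks (MIA) that defines interaction-level and user-level privacy risk scores derived from differential privacy. Its critical component, the interaction-level MIA method RecLiRA, provides high-quality membership estimation, which makes the risk assessment fine-grained and actionable and also supports evaluating recommender model unlearning.
Link: https://arxiv.org/abs/2507.18365
Authors: Jiajie He, Yuechun Gu, Keke Chen
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: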
Abstract:Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user-item interaction data. While privacy-enhancing techniques are actively studied in the research community, the real-world model development still depends on minimal privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, there is no method allowing the user to know which interactions are more sensitive than others. Thus, quantifying the privacy risk of RecSys training data is a critical step to enabling privacy-aware RecSys model development and deployment. We propose a membership-inference attack (MIA)-based privacy scoring method, RecPS, to measure privacy risks at both the interaction and user levels. The RecPS interaction-level score definition is motivated and derived from differential privacy, which is then extended to the user-level scoring method. A critical component is the interaction-level MIA method RecLiRA, which gives high-quality membership estimation. We have conducted extensive experiments on well-known benchmark datasets and RecSys models to show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning.
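The abstract does not give the RecLiRA procedure. As a rough illustration of the general likelihood-ratio idea behind membership inference, the sketch below fits one Gaussian to a statistic observed on shadow models trained with an interaction (IN) and one to models trained without it (OUT), then scores the target model's statistic by the log-likelihood ratio. The function names, the choice of statistic, and the toy numbers are all assumptions.

```python
import math
from statistics import mean, stdev


def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def membership_score(observed, in_stats, out_stats):
    """log p(observed | IN) - log p(observed | OUT); positive suggests membership."""
    log_p_in = gaussian_log_pdf(observed, mean(in_stats), stdev(in_stats))
    log_p_out = gaussian_log_pdf(observed, mean(out_stats), stdev(out_stats))
    return log_p_in - log_p_out


# Toy statistic (e.g. a model's score for one user-item pair) on shadow models.
in_stats = [2.9, 3.1, 3.0, 3.2, 2.8]    # shadow models trained WITH the interaction
out_stats = [0.9, 1.1, 1.0, 1.2, 0.8]   # shadow models trained WITHOUT it
print(membership_score(3.0, in_stats, out_stats) > 0)
```

An interaction whose score separates IN from OUT this clearly is easy to infer from the trained model, which is the intuition for treating it as a higher privacy risk.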
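[AI-64] Spatial-Temporal Data Mining for Ocean Science: Data Methodologies and Opportunities
Quick read: This paper addresses the lack of a systematic survey of spatial-temporal data mining (STDM) for ocean data, a gap that hinders research at the intersection of computer science and ocean science. The survey first reviews widely used ST ocean datasets and their distinctive characteristics (such as diverse regionality and high sparsity), then covers typical data quality enhancement techniques. It next classifies existing STDM studies into four task types (prediction, event detection, pattern mining, and anomaly detection) and analyzes the techniques for each, and finally discusses promising research directions. The survey helps researchers from both fields understand the fundamental concepts, key techniques, and open challenges of STDM in ocean science.
Link: https://arxiv.org/abs/2307.10803
Authors: Hanchen Yang, Wengen Li, Shuyu Wang, Hui Li, Jihong Guan, Shuigeng Zhou, Jiannong Cao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: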
Abstract:With the rapid amassing of spatial-temporal (ST) ocean data, many spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, including climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated but with unique characteristics, e.g., diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models on ST ocean data. To the best of our knowledge, a comprehensive survey of existing studies remains missing in the literature, which hinders not only computer scientists from identifying the research issues in ocean data mining but also ocean scientists from applying advanced STDM techniques. In this paper, we provide a comprehensive survey of existing STDM studies for ocean science. Concretely, we first review the widely-used ST ocean datasets and highlight their unique characteristics. Then, typical ST ocean data quality enhancement techniques are explored. Next, we classify existing STDM studies in ocean science into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are discussed. This survey can help scientists from both computer science and ocean science better understand the fundamental concepts, key techniques, and open challenges of STDM for ocean science.
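[AI-65] A Mean-Field Theory of Θ-Expectations
Quick read: This paper addresses a limitation of the classical theory of sublinear expectations in handling non-convex uncertainty models, which matter in practical finance and decision modeling but lack a rigorous mathematical framework. The key to the solution is a new class of stochastic differential equations: fully coupled Mean-Field Forward-Backward Stochastic Differential Equations (FBSDEs) whose backward driver is defined by pointwise maximization over a law-dependent, non-convex set. For tractability, the author assumes uniform strong concavity of the driver in the control variable, which guarantees that the optimization has a unique and stable solution. The paper then establishes Lipschitz stability of the optimizer, the foundation of the whole well-posedness theory, and proves local and global well-posedness theorems for the FBSDE system. The resulting valuation functional, the Θ-Expectation, is dynamically consistent yet violates both sub-additivity and translation invariance, which marks a fundamental departure from the classical convex paradigm and provides a rigorous basis for stochastic calculus under a class of endogenous, non-convex ambiguity.
Link: https://arxiv.org/abs/2507.22577
Authors: Qian Qi
Affiliation: unknown
Categories: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: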
Abstract:The canonical theory of sublinear expectations, a foundation of stochastic calculus under ambiguity, is insensitive to the non-convex geometry of primitive uncertainty models. This paper develops a new stochastic calculus for a structured class of such non-convex models. We introduce a class of fully coupled Mean-Field Forward-Backward Stochastic Differential Equations where the BSDE driver is defined by a pointwise maximization over a law-dependent, non-convex set. Mathematical tractability is achieved via a uniform strong concavity assumption on the driver with respect to the control variable, which ensures the optimization admits a unique and stable solution. A central contribution is to establish the Lipschitz stability of this optimizer from primitive geometric and regularity conditions, which underpins the entire well-posedness theory. We prove local and global well-posedness theorems for the FBSDE system. The resulting valuation functional, the Θ-Expectation, is shown to be dynamically consistent and, most critically, to violate the axiom of sub-additivity. This, along with its failure to be translation invariant, demonstrates its fundamental departure from the convex paradigm. This work provides a rigorous foundation for stochastic calculus under a class of non-convex, endogenous ambiguity.
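[AI-66] aLLoyM: A large language model for alloy phase diagram prediction
Quick read: This paper addresses the inefficiency and reliance on expert knowledge of traditional phase diagram calculation in materials design, in particular the challenge of quickly retrieving and generating phase information for alloy systems. The key to the solution is aLLoyM, a specialized fine-tuned large language model (LLM) trained on question-and-answer pairs for binary and ternary phase diagrams, built from the open-source Computational Phase Diagram Database (CPDDB) and CALPHAD-based assessments, and fine-tuned in two formats (multiple-choice and short-answer). Experiments show that fine-tuning substantially improves accuracy on multiple-choice questions, and that the short-answer model can generate novel phase diagrams from the components alone, which shows potential to accelerate the discovery of new materials systems.
Link: https://arxiv.org/abs/2507.22558
Authors: Yuna Oikawa, Guillaume Deffrennes, Taichi Abe, Ryo Tamura, Koji Tsuda
Affiliation: unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures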
Abstract:Large Language Models (LLMs) are general-purpose tools with wide-ranging applications, including in materials science. In this work, we introduce aLLoyM, a fine-tuned LLM specifically trained on alloy compositions, temperatures, and their corresponding phase information. To develop aLLoyM, we curated question-and-answer (QA) pairs for binary and ternary phase diagrams using the open-source Computational Phase Diagram Database (CPDDB) and assessments based on CALPHAD (CALculation of PHAse Diagrams). We fine-tuned Mistral, an open-source pre-trained LLM, for two distinct QA formats: multiple-choice and short-answer. Benchmark evaluations demonstrate that fine-tuning substantially enhances performance on multiple-choice phase diagram questions. Moreover, the short-answer model of aLLoyM exhibits the ability to generate novel phase diagrams from its components alone, underscoring its potential to accelerate the discovery of previously unexplored materials systems. To promote further research and adoption, we have publicly released the short-answer fine-tuned version of aLLoyM, along with the complete benchmarking QA dataset, on Hugging Face.
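[AI-67] LVM-GP: Uncertainty-Aware PDE Solver via coupling latent variable model and Gaussian process
Quick read: This paper addresses uncertainty quantification when solving forward and inverse partial differential equations (PDEs) with noisy data. Existing methods such as Bayesian physics-informed neural networks (B-PINNs) and deep ensembles are limited in capturing complex functional dependencies and in giving reliable uncertainty estimates. The key to the solution is LVM-GP, a new probabilistic framework that builds a stochastic mapping from the input to a high-dimensional latent representation: a confidence-aware encoder interpolates between a learnable deterministic feature and a Gaussian process prior, with the interpolation strength controlled adaptively by a confidence function learned from data. The solution field is modeled by a conditional Gaussian distribution whose mean is predicted by a neural operator applied to the latent representation, which allows flexible function-to-function mappings. Physical laws enter the loss function as soft constraints so that outputs respect the PDE structure. Together these give efficient, robust uncertainty quantification with competitive predictive accuracy.
Link: https://arxiv.org/abs/2507.22493
Authors: Xiaodong Feng, Ling Guo, Xiaoliang Wan, Hao Wu, Tao Zhou, Wenwen Zhou
Affiliation: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: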
Abstract:We propose a novel probabilistic framework, termed LVM-GP, for uncertainty quantification in solving forward and inverse partial differential equations (PDEs) with noisy data. The core idea is to construct a stochastic mapping from the input to a high-dimensional latent representation, enabling uncertainty-aware prediction of the solution. Specifically, the architecture consists of a confidence-aware encoder and a probabilistic decoder. The encoder implements a high-dimensional latent variable model based on a Gaussian process (LVM-GP), where the latent representation is constructed by interpolating between a learnable deterministic feature and a Gaussian process prior, with the interpolation strength adaptively controlled by a confidence function learned from data. The decoder defines a conditional Gaussian distribution over the solution field, where the mean is predicted by a neural operator applied to the latent representation, allowing the model to learn flexible function-to-function mapping. Moreover, physical laws are enforced as soft constraints in the loss function to ensure consistency with the underlying PDE structure. Compared to existing approaches such as Bayesian physics-informed neural networks (B-PINNs) and deep ensembles, the proposed framework can efficiently capture functional dependencies via merging a latent Gaussian process and neural operator, resulting in competitive predictive accuracy and robust uncertainty quantification. Numerical experiments demonstrate the effectiveness and reliability of the method.
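[AI-68] Physics-constrained generative machine learning-based high-resolution downscaling of Greenland's surface mass balance and surface temperature
Quick read: This paper addresses shortcomings of current high-resolution projections of the Greenland ice sheet's surface mass balance (SMB) and surface temperature, which are either computationally expensive or limited to coarse spatial resolution. The key to the solution is a physics-constrained generative modeling framework based on a consistency model (CM) that, in a few sampling steps, downscales low-resolution SMB and surface temperature fields from 160 km to 5 km grid spacing. A hard conservation constraint applied during inference approximately preserves SMB and temperature sums on the coarse scale and supports robust generalization to extreme climate states without retraining. The method outperforms interpolation-based downscaling, reaching a continuous ranked probability score (CRPS) of 6.31 mmWE for SMB and 0.1 K for surface temperature on the test set, and it reproduces variability across spatial scales, providing fast, realistic high-resolution climate forcing for ice-sheet simulations.
Link: https://arxiv.org/abs/2507.22485
Authors: Nils Bochow, Philipp Hess, Alexander Robinson
Affiliation: unknown
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
Comments: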
Abstract:Accurate, high-resolution projections of the Greenland ice sheet’s surface mass balance (SMB) and surface temperature are essential for understanding future sea-level rise, yet current approaches are either computationally demanding or limited to coarse spatial scales. Here, we introduce a novel physics-constrained generative modeling framework based on a consistency model (CM) to downscale low-resolution SMB and surface temperature fields by a factor of up to 32 (from 160 km to 5 km grid spacing) in a few sampling steps. The CM is trained on monthly outputs of the regional climate model MARv3.12 and conditioned on ice-sheet topography and insolation. By enforcing a hard conservation constraint during inference, we ensure approximate preservation of SMB and temperature sums on the coarse spatial scale as well as robust generalization to extreme climate states without retraining. On the test set, our constrained CM achieves a continuous ranked probability score (CRPS) of 6.31 mmWE for the SMB and 0.1 K for the surface temperature, outperforming interpolation-based downscaling. Together with spatial power-spectral analysis, we demonstrate that the CM faithfully reproduces variability across spatial scales. We further apply bias-corrected outputs of the NorESM2 Earth System Model as inputs to our CM, to demonstrate the potential of our model to directly downscale ESM fields. Our approach delivers realistic, high-resolution climate forcing for ice-sheet simulations with fast inference and can be readily integrated into Earth-system and ice-sheet model workflows to improve projections of the future contribution to sea-level rise from Greenland and potentially other ice sheets and glaciers too.
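The abstract does not say how the conservation constraint is implemented. One simple way to impose such a constraint, shown here purely as an assumption, is an additive correction: shift every fine-grid block so that its mean equals the corresponding coarse-grid value (for equal-area cells, preserving block means is equivalent to preserving block sums). The function name and the 4×4 toy field are hypothetical.

```python
def enforce_block_means(fine, coarse, factor):
    """Shift each factor x factor block of `fine` so its mean equals `coarse`."""
    out = [row[:] for row in fine]
    for bi, coarse_row in enumerate(coarse):
        for bj, target in enumerate(coarse_row):
            rows = range(bi * factor, (bi + 1) * factor)
            cols = range(bj * factor, (bj + 1) * factor)
            block_mean = sum(fine[i][j] for i in rows for j in cols) / factor ** 2
            shift = target - block_mean
            for i in rows:
                for j in cols:
                    out[i][j] += shift
    return out


# Toy example: a 4x4 generated field constrained by a 2x2 coarse field.
fine = [[1.0, 2.0, 0.0, 0.0],
        [3.0, 4.0, 0.0, 4.0],
        [5.0, 5.0, 1.0, 1.0],
        [5.0, 5.0, 1.0, 3.0]]
coarse = [[3.0, 2.0], [4.0, 1.0]]
constrained = enforce_block_means(fine, coarse, factor=2)
```

In the setting of the paper the factor would be 32 (160 km / 5 km) rather than 2; the additive shift keeps the fine-scale pattern inside each block while matching the coarse value.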
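[AI-69] Tiny Noise-Robust Voice Activity Detector for Voice Assistants
Quick read: This paper addresses the drop in voice activity detection (VAD) accuracy under background noise, especially on resource-constrained AIoT devices, where existing lightweight VAD models struggle with low signal-to-noise ratios (SNR) and diverse acoustic environments. The key to the solution is adding data pre-processing and post-processing modules around a lightweight VAD to improve noise robustness without a larger model or additional fine-tuning, which improves detection both in noisy conditions and on clean speech while keeping the model lightweight.
Link: https://arxiv.org/abs/2507.22157
Authors: Hamed Jafarzadeh Asl, Mahsa Ghazvini Nejad, Amin Edraki, Masoud Asgharian, Vahid Partovi Nia
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Hamed Jafarzadeh Asl and Mahsa Ghazvini Nejad contributed equally to this work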
Abstract:Voice Activity Detection (VAD) in the presence of background noise remains a challenging problem in speech processing. Accurate VAD is essential in automatic speech recognition, voice-to-text, conversational agents, etc., where noise can severely degrade the performance. A modern application includes the voice assistant, especially when mounted on Artificial Intelligence of Things (AIoT) devices such as cell phones, smart glasses, earbuds, etc., where the voice signal includes background noise. Therefore, VAD modules must remain light-weight due to their practical on-device limitation. The existing models often struggle with low signal-to-noise ratios across diverse acoustic environments. A simple VAD often detects human voice in a clean environment, but struggles to detect the human voice in noisy conditions. We propose a noise-robust VAD that comprises a light-weight VAD, with data pre-processing and post-processing added modules to handle the background noise. This approach significantly enhances the VAD accuracy in noisy environments and requires neither a larger model, nor fine-tuning. Experimental results demonstrate that our approach achieves a notable improvement compared to baselines, particularly in environments with high background noise interference. This modified VAD additionally improves clean speech detection.
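[AI-70] Scaling and Distilling Transformer Models for sEMG
Quick read: This paper addresses the difficulty of scaling model size for surface electromyography (sEMG) in human-computer interface (HCI) applications, where training data is limited and compute at deployment is constrained. Its key contribution is showing that vanilla transformer models can be scaled effectively on sEMG data up to 110M parameters, well beyond the roughly 10M-parameter regime typical of earlier work, and that 100M-parameter models can be distilled into models 50x smaller with only a 1.5% absolute loss in performance, giving efficient and expressive models for complex real-time sEMG tasks in real-world environments.
Link: https://arxiv.org/abs/2507.22094
Authors: Nicholas Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H. Miller, Shagun Sodhani
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted at TMLR 2025 ( this https URL ), 11 pages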
Abstract:Surface electromyography (sEMG) signals offer a promising avenue for developing innovative human-computer interfaces by providing insights into muscular activity. However, the limited volume of training data and computational constraints during deployment have restricted the investigation of scaling up the model size for solving sEMG tasks. In this paper, we demonstrate that vanilla transformer models can be effectively scaled up on sEMG data and yield improved cross-user performance up to 110M parameters, surpassing the model size regime investigated in other sEMG research (usually 10M parameters). We show that 100M-parameter models can be effectively distilled into models 50x smaller with minimal loss of performance (1.5% absolute). This results in efficient and expressive models suitable for complex real-time sEMG tasks in real-world environments.
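The abstract does not state which distillation objective is used. A common choice, shown here as an assumption rather than the authors' method, is the temperature-softened KL divergence between teacher and student output distributions. The function names and default temperature are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# The loss is zero when the student reproduces the teacher and grows as they diverge.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

For scale, a 50x reduction of a 100M-parameter teacher gives a student of about 2M parameters.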
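[AI-71] Dimensions of Vulnerability in Visual Working Memory: An AI-Driven Approach to Perceptual Comparison
Quick read: This paper addresses the systematic memory distortions that perceptual comparison induces in visual working memory (VWM) for real-world objects, and in particular the lack of knowledge about which visual features drive memory vulnerability. The key to the solution is a novel AI-driven framework that generates behaviorally relevant, naturalistic visual stimuli (image wheels and dimension wheels) and systematically manipulates the visual and semantic dimensions of objects to reveal similarity-induced memory biases. Experiments show that visual dimensions are more prone to distortion than semantic dimensions, which indicates that the object dimensions of naturalistic stimuli play a central role in memory vulnerability.
Link: https://arxiv.org/abs/2507.22067
Authors: Yuang Cao, Jiachen Zou, Chen Wei, Quanying Liu
Affiliation: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, experimental results presented in the paper, accepted for virtual poster presentation at CogSci 2025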
Abstract:Human memory exhibits significant vulnerability in cognitive tasks and daily life. Comparisons between visual working memory and new perceptual input (e.g., during cognitive tasks) can lead to unintended memory distortions. Previous studies have reported systematic memory distortions after perceptual comparison, but understanding how perceptual comparison affects memory distortions in real-world objects remains a challenge. Furthermore, identifying what visual features contribute to memory vulnerability presents a novel research question. Here, we propose a novel AI-driven framework that generates naturalistic visual stimuli grounded in behaviorally relevant object dimensions to elicit similarity-induced memory biases. We use two types of stimuli – image wheels created through dimension editing and dimension wheels generated by dimension activation values – in three visual working memory (VWM) experiments. These experiments assess memory distortions under three conditions: no perceptual comparison, perceptual comparison with image wheels, and perceptual comparison with dimension wheels. The results show that similar dimensions, like similar images, can also induce memory distortions. Specifically, visual dimensions are more prone to distortion than semantic dimensions, indicating that the object dimensions of naturalistic visual stimuli play a significant role in the vulnerability of memory.
Machine Learning
[LG-0] Decentralized Differentially Private Power Method
Link: https://arxiv.org/abs/2507.22849
Authors: Andrew Campbell, Anna Scaglione, Sean Peisert
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We propose a novel Decentralized Differentially Private Power Method (D-DP-PM) for performing Principal Component Analysis (PCA) in networked multi-agent settings. Unlike conventional decentralized PCA approaches where each agent accesses the full n-dimensional sample space, we address the challenging scenario where each agent observes only a subset of dimensions through row-wise data partitioning. Our method ensures (ε, δ)-Differential Privacy (DP) while enabling collaborative estimation of global eigenvectors across the network without requiring a central aggregator. We achieve this by having agents share only local embeddings of the current eigenvector iterate, leveraging both the inherent privacy from random initialization and carefully calibrated Gaussian noise additions. We prove that our algorithm satisfies the prescribed (ε, δ)-DP guarantee and establish convergence rates that explicitly characterize the impact of the network topology. Our theoretical analysis, based on linear dynamics and high-dimensional probability theory, provides tight bounds on both privacy and utility. Experiments on real-world datasets demonstrate that D-DP-PM achieves superior privacy-utility tradeoffs compared to naive local DP approaches, with particularly strong performance in moderate privacy regimes (ε ∈ [2, 5]). The method converges rapidly, allowing practitioners to trade iterations for enhanced privacy while maintaining competitive utility.
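As background for the abstract, here is a minimal sketch of a centralized power iteration with Gaussian noise added to every matrix-vector product. It is not D-DP-PM: there is no network, no row-wise partitioning, and `noise_std` is a free parameter rather than a value calibrated to an (ε, δ) budget. All names are assumptions.

```python
import math
import random


def noisy_power_method(matrix, iterations=50, noise_std=0.0, seed=0):
    """Power iteration with Gaussian noise added to each matrix-vector product."""
    rng = random.Random(seed)
    n = len(matrix)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random initialization
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [x + rng.gauss(0.0, noise_std) for x in w]  # noise standing in for the DP mechanism
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Top eigenvector of [[2, 1], [1, 2]] is (1, 1) / sqrt(2), up to sign.
cov = [[2.0, 1.0], [1.0, 2.0]]
v_clean = noisy_power_method(cov)
v_noisy = noisy_power_method(cov, noise_std=0.01)
```

Each iteration releases a noisy vector, so fewer iterations mean fewer noisy releases, which is one way to read the abstract's remark about trading iterations for privacy.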
[LG-1] PAF-Net: Phase-Aligned Frequency Decoupling Network for Multi-Process Manufacturing Quality Prediction
Link: https://arxiv.org/abs/2507.22840
Authors: Yang Luo, Haoyang Luan, Haoyun Pan, Yongquan Jia, Xiaofeng Gao, Guihai Chen
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 5 figures
Abstract:Accurate quality prediction in multi-process manufacturing is critical for industrial efficiency but hindered by three core challenges: time-lagged process interactions, overlapping operations with mixed periodicity, and inter-process dependencies in shared frequency bands. To address these, we propose PAF-Net, a frequency decoupled time series prediction framework with three key innovations: (1) A phase-correlation alignment method guided by frequency domain energy to synchronize time-lagged quality series, resolving temporal misalignment. (2) A frequency independent patch attention mechanism paired with Discrete Cosine Transform (DCT) decomposition to capture heterogeneous operational features within individual series. (3) A frequency decoupled cross attention module that suppresses noise from irrelevant frequencies, focusing exclusively on meaningful dependencies within shared bands. Experiments on 4 real-world datasets demonstrate PAF-Net’s superiority. It outperforms 10 well-acknowledged baselines by 7.06% lower MSE and 3.88% lower MAE. Our code is available at this https URL.
[LG-2] Quantifying surprise in clinical care: Detecting highly informative events in electronic health records with foundation models
Link: https://arxiv.org/abs/2507.22798
Authors: Michael C. Burkhart, Bashar Ramadan, Luke Solo, William F. Parker, Brett K. Beaulieu-Jones
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We present a foundation model-derived method to identify highly informative tokens and events in electronic health records. Our approach considers incoming data in the entire context of a patient’s hospitalization and so can flag anomalous events that rule-based approaches would consider within a normal range. We demonstrate that the events our model flags are significant for predicting downstream patient outcomes and that a fraction of events identified as carrying little information can safely be dropped. Additionally, we show how informativeness can help interpret the predictions of prognostic models trained on foundation model-derived representations.
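One common way to score how informative an event is, given a model's predicted probability for it, is Shannon surprisal. The abstract does not state the exact formula the authors use, so the sketch below is only an illustration of the general idea; the event names and probabilities are hypothetical values, not model outputs.

```python
import math


def surprisal_bits(prob):
    """Shannon information content, in bits, of an event with probability `prob`."""
    return -math.log2(prob)


# Hypothetical probabilities a sequence model might assign to the next event
# given the patient's history so far.
predicted = {"heart_rate_normal": 0.90, "lactate_high": 0.02}
scores = {name: surprisal_bits(p) for name, p in predicted.items()}

# Events the model found unlikely carry more bits and would be flagged first.
most_surprising = max(scores, key=scores.get)
```

Under this scoring, an event with probability 0.5 carries exactly one bit, and rarer events carry more.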
[LG-3] DO-EM: Density Operator Expectation Maximization
Link: https://arxiv.org/abs/2507.22786
Authors: Adit Vishnu, Abhay Shastry, Dhruva Kashyap, Chiranjib Bhattacharyya
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: Main text: 9 pages 1 Figure. Total: 23 pages 3 Figures
Abstract:Density operators, quantum generalizations of probability distributions, are gaining prominence in machine learning due to their foundational role in quantum computing. Generative modeling based on density operator models (DOMs) is an emerging field, but existing training algorithms – such as those for the Quantum Boltzmann Machine – do not scale to real-world data, such as the MNIST dataset. The Expectation-Maximization algorithm has played a fundamental role in enabling scalable training of probabilistic latent variable models on real-world datasets. In this paper, we develop an Expectation-Maximization framework to learn latent variable models defined through DOMs on classical hardware, with resources comparable to those used for probabilistic models, while scaling to real-world data. However, designing such an algorithm is nontrivial due to the absence of a well-defined quantum analogue to conditional probability, which complicates the Expectation step. To overcome this, we reformulate the Expectation step as a quantum information projection (QIP) problem and show that the Petz Recovery Map provides a solution under sufficient conditions. Using this formulation, we introduce the Density Operator Expectation Maximization (DO-EM) algorithm – an iterative Minorant-Maximization procedure that optimizes a quantum evidence lower bound. We show that the DO-EM algorithm ensures non-decreasing log-likelihood across iterations for a broad class of models. Finally, we present Quantum Interleaved Deep Boltzmann Machines (QiDBMs), a DOM that can be trained with the same resources as a DBM. When trained with DO-EM under Contrastive Divergence, a QiDBM outperforms larger classical DBMs in image generation on the MNIST dataset, achieving a 40–60% reduction in the Fréchet Inception Distance.
[LG-4] Label-free estimation of clinically relevant performance metrics under distribution shifts MICCAI
Link: https://arxiv.org/abs/2507.22776
Authors: Tim Flühmann, Alceu Bissoto, Trung-Dung Hoang, Lisa M. Koch
Subjects: Machine Learning (cs.LG)
Comments: Accepted oral at UNSURE 2025 @ MICCAI
Abstract:Performance monitoring is essential for safe clinical deployment of image classification models. However, because ground-truth labels are typically unavailable in the target dataset, direct assessment of real-world model performance is infeasible. State-of-the-art performance estimation methods address this by leveraging confidence scores to estimate the target accuracy. Despite being a promising direction, the established methods mainly estimate the model’s accuracy and are rarely evaluated in a clinical domain, where strong class imbalances and dataset shifts are common. Our contributions are twofold: First, we introduce generalisations of existing performance prediction methods that directly estimate the full confusion matrix. Then, we benchmark their performance on chest x-ray data in real-world distribution shifts as well as simulated covariate and prevalence shifts. The proposed confusion matrix estimation methods reliably predicted clinically relevant counting metrics on medical images under distribution shifts. However, our simulated shift scenarios exposed important failure modes of current performance estimation techniques, calling for a better understanding of real-world deployment contexts when implementing these performance monitoring techniques for postmarket surveillance of medical AI models.
[LG-5] Enhanced Prediction of CAR T-Cell Cytotoxicity with Quantum-Kernel Methods
Link: https://arxiv.org/abs/2507.22710
Authors: Filippo Utro, Meltem Tolunay, Kahn Rhrissorrakrai, Tanvi P. Gujarati, Jie Shi, Sara Capponi, Mirko Amico, Nate Earnest-Noble, Laxmi Parida
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Quantum Physics (quant-ph)
Comments:
Abstract:Chimeric antigen receptor (CAR) T-cells are T-cells engineered to recognize and kill specific tumor cells. Through their extracellular domains, CAR T-cells bind tumor cell antigens which triggers CAR T activation and proliferation. These processes are regulated by co-stimulatory domains present in the intracellular region of the CAR T-cell. Through integrating novel signaling components into the co-stimulatory domains, it is possible to modify CAR T-cell phenotype. Identifying and experimentally testing new CAR constructs based on libraries of co-stimulatory domains is nontrivial given the vast combinatorial space defined by such libraries. This leads to a highly data constrained, poorly explored combinatorial problem, where the experiments undersample all possible combinations. We propose a quantum approach using a Projected Quantum Kernel (PQK) to address this challenge. PQK operates by embedding classical data into a high dimensional Hilbert space and employs a kernel method to measure sample similarity. Using 61 qubits on a gate-based quantum computer, we demonstrate the largest PQK application to date and an enhancement in the classification performance over purely classical machine learning methods for CAR T cytotoxicity prediction. Importantly, we show improved learning for specific signaling domains and domain positions, particularly where there was lower information highlighting the potential for quantum computing in data-constrained problems.
[LG-6] Cluster-Based Random Forest Visualization and Interpretation
Link: https://arxiv.org/abs/2507.22665
Authors: Max Sondag, Christofer Meinecke, Dennis Collaris, Tatiana von Landesberger, Stef van den Elzen
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Random forests are a machine learning method used to automatically classify datasets and consist of a multitude of decision trees. While these random forests often have higher performance and generalize better than a single decision tree, they are also harder to interpret. This paper presents a visualization method and system to increase interpretability of random forests. We cluster similar trees which enables users to interpret how the model performs in general without needing to analyze each individual decision tree in detail, or interpret an oversimplified summary of the full forest. To meaningfully cluster the decision trees, we introduce a new distance metric that takes into account both the decision rules as well as the predictions of a pair of decision trees. We also propose two new visualization methods that visualize both clustered and individual decision trees: (1) The Feature Plot, which visualizes the topological position of features in the decision trees, and (2) the Rule Plot, which visualizes the decision rules of the decision trees. We demonstrate the efficacy of our approach through a case study on the “Glass” dataset, which is a relatively complex standard machine learning dataset, as well as a small user study.
[LG-7] Transductive Model Selection under Prior Probability Shift
Link: https://arxiv.org/abs/2507.22647
Authors: Lorenzo Volpi, Alejandro Moreo, Fabrizio Sebastiani
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transductive learning is a supervised machine learning task in which, unlike in traditional inductive learning, the unlabelled data that require labelling are a finite set and are available at training time. Similarly to inductive learning contexts, transductive learning contexts may be affected by dataset shift, i.e., may be such that the IID assumption does not hold. We here propose a method, tailored to transductive classification contexts, for performing model selection (i.e., hyperparameter optimisation) when the data exhibit prior probability shift, an important type of dataset shift typical of anti-causal learning problems. In our proposed method the hyperparameters can be optimised directly on the unlabelled data to which the trained classifier must be applied; this is unlike traditional model selection methods, that are based on performing cross-validation on the labelled training data. We provide experimental results that show the benefits brought about by our method.
[LG-8] Deep learning of geometrical cell division rules
Link: https://arxiv.org/abs/2507.22587
Authors: Alexandre Durrmeyer, Jean-Christophe Palauqui, Philippe Andrey
Subjects: Machine Learning (cs.LG); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)
Comments: 44 pages, 6 figures, 1 supplementary table, 15 supplementary figures
Abstract:The positioning of new cellular walls during cell division plays a key role in shaping plant tissue organization. The influence of cell geometry on the positioning of division planes has been previously captured into various geometrical rules. Accordingly, linking cell shape to division orientation has relied on the comparison between observed division patterns and predictions under specific rules. The need to define a priori the tested rules is a fundamental limitation of this hypothesis-driven approach. As an alternative, we introduce a data-based approach to investigate the relation between cell geometry and division plane positioning, exploiting the ability of deep neural network to learn complex relationships across multidimensional spaces. Adopting an image-based cell representation, we show how division patterns can be learned and predicted from mother cell geometry using a UNet architecture modified to operate on cell masks. Using synthetic data and A. thaliana embryo cells, we evaluate the model performances on a wide range of diverse cell shapes and division patterns. We find that the trained model accounted for embryo division patterns that were previously irreconcilable under existing geometrical rules. Our work shows the potential of deep networks to understand cell division patterns and to generate new hypotheses on the control of cell division positioning.
[LG-9] VAR: Visual Analysis for Rashomon Set of Machine Learning Models Performance
Link: https://arxiv.org/abs/2507.22556
Authors: Yuanzhe Jin
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Evaluating the performance of closely matched machine learning (ML) models under specific conditions has long been a focus of researchers in the field of machine learning. The Rashomon set is a collection of closely matched ML models, encompassing a wide range of models with similar accuracies but different structures. Traditionally, the analysis of these sets has focused on vertical structural analysis, which involves comparing the corresponding features at various levels within the ML models. However, there has been a lack of effective visualization methods for horizontally comparing multiple models with specific features. We propose the VAR visualization solution. VAR uses visualization to perform comparisons of ML models within the Rashomon set. This solution combines heatmaps and scatter plots to facilitate the comparison. With the help of VAR, ML model developers can identify the optimal model under specific conditions and better understand the Rashomon set’s overall characteristics.
[LG-10] DeepC4: Deep Conditional Census-Constrained Clustering for Large-scale Multitask Spatial Disaggregation of Urban Morphology
Link: https://arxiv.org/abs/2507.22554
Authors: Joshua Dimasaka, Christian Geiß, Emily So
Subjects: Machine Learning (cs.LG)
Comments: Non-peer-reviewed Preprint | Keywords: urban morphology, building exposure, physical vulnerability, spatial disaggregation, deep clustering | Data: this https URL | Code: this https URL
Abstract:To understand our global progress for sustainable development and disaster risk reduction in many developing economies, two recent major initiatives - the Uniform African Exposure Dataset of the Global Earthquake Model (GEM) Foundation and the Modelling Exposure through Earth Observation Routines (METEOR) Project - implemented classical spatial disaggregation techniques to generate large-scale mapping of urban morphology using the information from various satellite imagery and its derivatives, geospatial datasets of the built environment, and subnational census statistics. However, the local discrepancy with well-validated census statistics and the propagated model uncertainties remain a challenge in such coarse-to-fine-grained mapping problems, specifically constrained by weak and conditional label supervision. Therefore, we present Deep Conditional Census-Constrained Clustering (DeepC4), a novel deep learning-based spatial disaggregation approach that incorporates local census statistics as cluster-level constraints while considering multiple conditional label relationships in a joint multitask learning of the patterns of satellite imagery. To demonstrate, compared to GEM and METEOR, we enhanced the quality of Rwandan maps of urban morphology, specifically building exposure and physical vulnerability, at the third-level administrative unit from the 2022 census. As the world approaches the conclusion of our global frameworks in 2030, our work has offered a new deep learning-based mapping technique towards a spatial auditing of our existing coarse-grained derived information at large scales.
[LG-11] Thermodynamics-Inspired Computing with Oscillatory Neural Networks for Inverse Matrix Computation
Link: https://arxiv.org/abs/2507.22544
Authors: George Tsormpatzoglou, Filip Sabo, Aida Todri-Sanial
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: 9 pages, 8 figures
Abstract:We describe a thermodynamic-inspired computing paradigm based on oscillatory neural networks (ONNs). While ONNs have been widely studied as Ising machines for tackling complex combinatorial optimization problems, this work investigates their feasibility in solving linear algebra problems, specifically the inverse matrix. Grounded in thermodynamic principles, we analytically demonstrate that the linear approximation of the coupled Kuramoto oscillator model leads to the inverse matrix solution. Numerical simulations validate the theoretical framework, and we examine the parameter regimes in which the computation has the highest accuracy.
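The general idea that a linear dynamical system can settle at a matrix inverse can be illustrated with a simple relaxation. This is not the Kuramoto model or the authors' derivation: it is a generic sketch that integrates dx/dt = b - A x with forward Euler, whose fixed point is x = A^{-1} b when A is symmetric positive definite. The matrix, step size `dt`, and step count are hypothetical choices.

```python
def relax_to_solution(A, b, dt=0.01, steps=5000):
    """Integrate dx/dt = b - A x with forward Euler, starting from x = 0.

    For a symmetric positive-definite A and a small enough dt, x converges
    to the solution of A x = b.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        residual = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * residual[i] for i in range(n)]
    return x


# Hypothetical 2x2 example; the exact inverse is [[0.6, -0.2], [-0.2, 0.4]].
A = [[2.0, 1.0],
     [1.0, 3.0]]
# Relaxing against each unit vector recovers one column of the inverse.
col0 = relax_to_solution(A, [1.0, 0.0])
col1 = relax_to_solution(A, [0.0, 1.0])
```

Convergence speed depends on the smallest eigenvalue of A and stability on the largest, a simple analogue of the accuracy-versus-parameter-regime question the abstract raises.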
[LG-12] HGCN(O): A Self-Tuning GCN HyperModel Toolkit for Outcome Prediction in Event-Sequence Data
Link: https://arxiv.org/abs/2507.22524
Authors: Fang Wang, Paolo Ceravolo, Ernesto Damiani
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 2 figures, submitted to Knowledge-Base Systems
Abstract:We propose HGCN(O), a self-tuning toolkit using Graph Convolutional Network (GCN) models for event sequence prediction. Featuring four GCN architectures (O-GCN, T-GCN, TP-GCN, TE-GCN) across the GCNConv and GraphConv layers, our toolkit integrates multiple graph representations of event sequences with different choices of node- and graph-level attributes and in temporal dependencies via edge weights, optimising prediction accuracy and stability for balanced and unbalanced datasets. Extensive experiments show that GCNConv models excel on unbalanced data, while all models perform consistently on balanced data. Experiments also confirm the superior performance of HGCN(O) over traditional approaches. Applications include Predictive Business Process Monitoring (PBPM), which predicts future events or states of a business process based on event logs.
[LG-13] SmilesT5: Domain-specific pretraining for molecular language models
Link: https://arxiv.org/abs/2507.22514
Authors: Philip Spence, Brooks Paige, Anne Osbourn
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance in six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that data and computational efficiency can be improved by using these domain-specific pretraining tasks. Finally, the pretrained embeddings from the model can be used as fixed inputs into a downstream machine learning classifier and yield comparable performance to finetuning but with much lower computational overhead.
[LG-14] Geometry of nonlinear forecast reconciliation
Link: https://arxiv.org/abs/2507.22500
Authors: Lorenzo Nespoli, Anubhab Biswas, Vasco Medici
Subjects: Machine Learning (cs.LG); Computational Geometry (cs.CG)
Comments:
Abstract:Forecast reconciliation, an ex-post technique applied to forecasts that must satisfy constraints, has been a prominent topic in the forecasting literature over the past two decades. Recently, several efforts have sought to extend reconciliation methods to the probabilistic settings. Nevertheless, formal theorems demonstrating error reduction in nonlinear contexts, analogous to those presented in Panagiotelis et al. (2021), are still lacking. This paper addresses that gap by establishing such theorems for various classes of nonlinear hypersurfaces and vector-valued functions. Specifically, we derive an exact analog of Theorem 3.1 from Panagiotelis et al. (2021) for hypersurfaces with constant-sign curvature. Additionally, we provide probabilistic guarantees for the broader case of hypersurfaces with non-constant-sign curvature and for general vector-valued functions. To support reproducibility and practical adoption, we release a JAX-based Python package, to be released upon publication, implementing the presented theorems and reconciliation procedures.
[LG-15] Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection
Link: https://arxiv.org/abs/2507.22447
Authors: Zhihong Liang, Xin Wang, Zhenhuang Hu, Liangliang Song, Lin Chen, Jingjing Guo, Yanbin Wang, Ye Tian
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:With the rapid expansion of web-based applications and cloud services, malicious JavaScript code continues to pose significant threats to user privacy, system integrity, and enterprise security. But, detecting such threats remains challenging due to sophisticated code obfuscation techniques and JavaScript’s inherent language characteristics, particularly its nested closure structures and syntactic flexibility. In this work, we propose DeCoda, a hybrid defense framework that combines large language model (LLM)-based deobfuscation with code graph learning: (1) We first construct a sophisticated prompt-learning pipeline with multi-stage refinement, where the LLM progressively reconstructs the original code structure from obfuscated inputs and then generates normalized Abstract Syntax Tree (AST) representations; (2) In JavaScript ASTs, dynamic typing scatters semantically similar nodes while deeply nested functions fracture scope capturing, introducing structural noise and semantic ambiguity. To address these challenges, we then propose to learn hierarchical code graph representations via a Cluster-wise Graph that synergistically integrates graph transformer network, node clustering, and node-to-cluster attention to simultaneously capture both local node-level semantics and global cluster-induced structural relationships from AST graph. Experimental results demonstrate that our method achieves F1-scores of 94.64% and 97.71% on two benchmark datasets, demonstrating absolute improvements of 10.74% and 13.85% over state-of-the-art baselines. In false-positive control evaluation at fixed FPR levels (0.0001, 0.001, 0.01), our approach delivers 4.82, 5.91, and 2.53 higher TPR respectively compared to the best-performing baseline. These results highlight the effectiveness of LLM-based deobfuscation and underscore the importance of modeling cluster-level relationships in detecting malicious code.
[LG-16] RANA: Robust Active Learning for Noisy Network Alignment
Link: https://arxiv.org/abs/2507.22434
Authors: Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao, Yanan Cao
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To address these problems, we propose RANA, a Robust Active learning framework for noisy Network Alignment. RANA effectively tackles both structure noise and label noise while addressing the sparsity of anchor link annotations, which can improve the robustness of network alignment models. Specifically, RANA introduces the proposed Noise-aware Selection Module and the Label Denoising Module to address structural noise and labeling noise, respectively. In the first module, we design a noise-aware maximization objective to select node pairs, incorporating a cleanliness score to address structural noise. In the second module, we propose a novel multi-source fusion denoising strategy that leverages model and twin node pairs labeling to provide more accurate labels for node pairs. Empirical results on three real-world datasets demonstrate that RANA outperforms state-of-the-art active learning-based methods in alignment accuracy. Our code is available at this https URL.
[LG-17] Comparing Normalizing Flows with Kernel Density Estimation in Estimating Risk of Automated Driving Systems
Link: https://arxiv.org/abs/2507.22429
Authors: Erwin de Gelder, Maren Buermann, Olaf Op den Camp
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Accepted for publication in proceedings of the 2025 IEEE International Automated Vehicle Validation Conference
Abstract:The development of safety validation methods is essential for the safe deployment and operation of Automated Driving Systems (ADSs). One of the goals of safety validation is to prospectively evaluate the risk of an ADS dealing with real-world traffic. Scenario-based assessment is a widely-used approach, where test cases are derived from real-world driving data. To allow for a quantitative analysis of the system performance, the exposure of the scenarios must be accurately estimated. The exposure of scenarios at parameter level is expressed using a Probability Density Function (PDF). However, assumptions about the PDF, such as parameter independence, can introduce errors, while avoiding assumptions often leads to oversimplified models with limited parameters to mitigate the curse of dimensionality. This paper considers the use of Normalizing Flows (NF) for estimating the PDF of the parameters. NF are a class of generative models that transform a simple base distribution into a complex one using a sequence of invertible and differentiable mappings, enabling flexible, high-dimensional density estimation without restrictive assumptions on the PDF’s shape. We demonstrate the effectiveness of NF in quantifying risk and risk uncertainty of an ADS, comparing its performance with Kernel Density Estimation (KDE), a traditional method for non-parametric PDF estimation. While NF require more computational resources compared to KDE, NF is less sensitive to the curse of dimensionality. As a result, NF can improve risk uncertainty estimation, offering a more precise assessment of an ADS’s safety. This work illustrates the potential of NF in scenario-based safety. Future work involves experimenting more with using NF for scenario generation and optimizing the NF architecture, transformation types, and training hyperparameters to further enhance their applicability. 
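As a reminder of what the KDE baseline in this abstract computes, here is a minimal one-dimensional Gaussian kernel density estimate written from scratch. The paper works with multi-dimensional scenario parameters; the samples, the variable meaning, and the fixed `bandwidth` below are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kde_1d(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE built from `samples`."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        # Sum of Gaussian bumps, one centered on each observed sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


# Hypothetical observations of a single scenario parameter.
samples = [0.8, 0.9, 1.0, 1.1, 1.3]
pdf = gaussian_kde_1d(samples, bandwidth=0.2)

# Sanity check: a density should integrate to about 1 (Riemann sum on [-5, 7]).
area = sum(pdf(-5.0 + 0.01 * k) * 0.01 for k in range(1201))
```

In higher dimensions the number of samples needed for a fixed-accuracy KDE grows quickly, which is the curse of dimensionality the abstract contrasts with normalizing flows.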
[LG-18] Multimodal Late Fusion Model for Problem-Solving Strategy Classification in a Machine Learning Game
Link: https://arxiv.org/abs/2507.22426
Authors: Clemens Witt, Thiemo Leonhardt, Nadine Bergner, Mareen Grillenberger
Subjects: Machine Learning (cs.LG)
Comments: This is the author’s version of a paper accepted for publication at the 2025 European Conference on Technology Enhanced Learning (EC-TEL 2025). The final authenticated version will be published in the Lecture Notes in Computer Science (LNCS) series by Springer and will be available via SpringerLink
Abstract:Machine learning models are widely used to support stealth assessment in digital learning environments. Existing approaches typically rely on abstracted gameplay log data, which may overlook subtle behavioral cues linked to learners’ cognitive strategies. This paper proposes a multimodal late fusion model that integrates screencast-based visual data and structured in-game action sequences to classify students’ problem-solving strategies. In a pilot study with secondary school students (N=149) playing a multitouch educational game, the fusion model outperformed unimodal baseline models, increasing classification accuracy by over 15%. Results highlight the potential of multimodal ML for strategy-sensitive assessment and adaptive support in interactive learning contexts.
[LG-19] Improving Generalization Ability of Robotic Imitation Learning by Resolving Causal Confusion in Observations
Link: https://arxiv.org/abs/2507.22380
Authors: Yifei Chen, Yuzhe Zhang, Giovanni D’urso, Nicholas Lawrance, Brendan Tidd
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 13 pages
Abstract:Recent developments in imitation learning have considerably advanced robotic manipulation. However, current techniques in imitation learning can suffer from poor generalization, limiting performance even under relatively minor domain shifts. In this work, we aim to enhance the generalization capabilities of complex imitation learning algorithms to handle unpredictable changes from the training environments to deployment environments. To avoid confusion caused by observations that are not relevant to the target task, we propose to explicitly learn the causal relationship between observation components and expert actions, employing a framework similar to [6], where a causal structural function is learned by intervention on the imitation learning policy. Because disentangling the feature representation from image input, as required in [6], is hard to achieve in the complex imitation learning process of robotic manipulation, we theoretically clarify that this requirement is not necessary for causal relationship learning. Therefore, we propose a simple causal structure learning framework that can be easily embedded in recent imitation learning architectures, such as the Action Chunking Transformer [31]. We demonstrate our approach using a simulation of the ALOHA [31] bimanual robot arms in Mujoco, and show that the method can considerably mitigate the generalization problem of existing complex imitation learning algorithms.
[LG-20] Prediction of acoustic field in 1-D uniform duct with varying mean flow and temperature using neural networks
Link: https://arxiv.org/abs/2507.22370
Authors: D. Veerababu, Prasanta K. Ghosh
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 22 pages
Abstract:Neural networks constrained by the physical laws emerged as an alternate numerical tool. In this paper, the governing equation that represents the propagation of sound inside a one-dimensional duct carrying a heterogeneous medium is derived. The problem is converted into an unconstrained optimization problem and solved using neural networks. Both the acoustic state variables: acoustic pressure and particle velocity are predicted and validated with the traditional Runge-Kutta solver. The effect of the temperature gradient on the acoustic field is studied. Utilization of machine learning techniques such as transfer learning and automatic differentiation for acoustic applications is demonstrated.
[LG-21] MSQ: Memory-Efficient Bit Sparsification Quantization
Link: https://arxiv.org/abs/2507.22349
Authors: Seokho Han, Seoyeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, Jong Hwan Ko
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies utilizing bit-level sparsity have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer to enable differentiable computation of the least significant bits (LSBs) from model weights. It further employs regularization to induce sparsity in these LSBs, enabling effective precision reduction without explicit bit-level parameter splitting. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ achieves up to 8.00x reduction in trainable parameters and up to 86% reduction in training time compared to previous bit-level quantization, while maintaining competitive accuracy and compression rates. This makes it a practical solution for training efficient DNNs on resource-constrained devices.
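The two building blocks named in this abstract, a round-clamp quantizer and the least significant bits of the resulting code, can be pictured with plain integer arithmetic. The sketch below is not MSQ itself: it omits the differentiable formulation, the sparsity regularizer, and the Hessian-guided pruning, and it assumes weights already scaled to [0, 1]. All function names are hypothetical.

```python
def round_clamp_quantize(w, n_bits):
    """Map a weight (assumed scaled to [0, 1]) to an integer code of n_bits bits.

    The value is rounded to the nearest level, then clamped to the valid range.
    """
    levels = 2 ** n_bits - 1
    code = round(w * levels)
    return max(0, min(levels, code))


def split_lsbs(code, k):
    """Split an integer code into (upper bits, k least significant bits)."""
    return code >> k, code & ((1 << k) - 1)


code = round_clamp_quantize(0.4, n_bits=8)  # 0.4 * 255 = 102
upper, lsbs = split_lsbs(code, k=2)         # 102 = 0b01100110

# If the k LSBs of every weight in a layer are zero, the layer can be stored
# with n_bits - k bits without changing its values.
can_drop_bits = lsbs == 0
```

In this example the two LSBs are nonzero, so dropping them would change the stored value; a regularizer that pushes LSBs toward zero is what would make such a precision reduction lossless.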
[LG-22] A Semi-Supervised Federated Learning Framework with Hierarchical Clustering Aggregation for Heterogeneous Satellite Networks
Link: https://arxiv.org/abs/2507.22339
Authors: Zhuocheng Liu, Zhishu Shen, Qiushi Zheng, Tiehua Zhang, Zheng Lei, Jiong Jin
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Low Earth Orbit (LEO) satellites are emerging as key components of 6G networks, with many already deployed to support large-scale Earth observation and sensing related tasks. Federated Learning (FL) presents a promising paradigm for enabling distributed intelligence in these resource-constrained and dynamic environments. However, achieving reliable convergence, while minimizing both processing time and energy consumption, remains a substantial challenge, particularly in heterogeneous and partially unlabeled satellite networks. To address this challenge, we propose a novel semi-supervised federated learning framework tailored for LEO satellite networks with hierarchical clustering aggregation. To further reduce communication overhead, we integrate sparsification and adaptive weight quantization techniques. In addition, we divide the FL clustering into two stages: a satellite cluster aggregation stage and a Ground Station (GS) aggregation stage. The supervised learning at GSs guides selected Parameter Server (PS) satellites, which in turn support fully unlabeled satellites during the federated training process. Extensive experiments conducted on a satellite network testbed demonstrate that our proposal can significantly reduce processing time (up to 3x) and energy consumption (up to 4x) compared to baseline methods while maintaining model accuracy.
[LG-23] Parametrized Multi-Agent Routing via Deep Attention Models AAAI2026
Link: https://arxiv.org/abs/2507.22338
Authors: Salar Basiri, Dhananjay Tiwari, Srinivasa M. Salapaka
Categories: Machine Learning (cs.LG)
Comments: This work is under submission to AAAI 2026. Please cite the arXiv version until the final version is published
Abstract:We propose a scalable deep learning framework for parametrized sequential decision-making (ParaSDM), where multiple agents jointly optimize discrete action policies and shared continuous parameters. A key subclass of this setting arises in Facility-Location and Path Optimization (FLPO), where multi-agent systems must simultaneously determine optimal routes and facility locations, aiming to minimize the cumulative transportation cost within the network. FLPO problems are NP-hard due to their mixed discrete-continuous structure and highly non-convex objective. To address this, we integrate the Maximum Entropy Principle (MEP) with a neural policy model called the Shortest Path Network (SPN), a permutation-invariant encoder-decoder that approximates the MEP solution while enabling efficient gradient-based optimization over shared parameters. The SPN achieves up to $100\times$ speedup in policy inference and gradient computation compared to MEP baselines, with an average optimality gap of approximately 6% across a wide range of problem sizes. Our FLPO approach yields over $10\times$ lower cost than metaheuristic baselines while running significantly faster, and matches Gurobi’s optimal cost with annealing at a $1500\times$ speedup, establishing a new state of the art for ParaSDM problems. These results highlight the power of structured deep models for solving large-scale mixed-integer optimization tasks.
[LG-24] Hypernetworks for Model-Heterogeneous Personalized Federated Learning
Link: https://arxiv.org/abs/2507.22330
Authors: Chen Zhang, Husheng Li, Xiang Liu, Linshan Jiang, Danxin Wang
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Recent advances in personalized federated learning have focused on addressing client model heterogeneity. However, most existing methods still require external data, rely on model decoupling, or adopt partial learning strategies, which can limit their practicality and scalability. In this paper, we revisit hypernetwork-based methods and leverage their strong generalization capabilities to design a simple yet effective framework for heterogeneous personalized federated learning. Specifically, we propose MH-pFedHN, which leverages a server-side hypernetwork that takes client-specific embedding vectors as input and outputs personalized parameters tailored to each client’s heterogeneous model. To promote knowledge sharing and reduce computation, we introduce a multi-head structure within the hypernetwork, allowing clients with similar model sizes to share heads. We further propose MH-pFedHNGD, which integrates an optional lightweight global model to improve generalization. Our framework does not rely on external datasets and does not require disclosure of client model architectures, thereby offering enhanced privacy and flexibility. Extensive experiments on multiple benchmarks and model settings demonstrate that our approach achieves competitive accuracy and strong generalization, and serves as a robust baseline for future research in model-heterogeneous personalized federated learning.
[LG-25] CS-SHRED: Enhancing SHRED for Robust Recovery of Spatiotemporal Dynamics
Link: https://arxiv.org/abs/2507.22303
Authors: Romulo B. da Silva, Cássio M. Oishi, Diego Passos, J. Nathan Kutz
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 7 figures, 13 tables. Code: this https URL
Abstract:We present CS-SHRED, a novel deep learning architecture that integrates Compressed Sensing (CS) into a Shallow Recurrent Decoder (SHRED) to reconstruct spatiotemporal dynamics from incomplete, compressed, or corrupted data. Our approach introduces two key innovations. First, by incorporating CS techniques into the SHRED architecture, our method leverages a batch-based forward framework with $\ell_1$ regularization to robustly recover signals even in scenarios with sparse sensor placements, noisy measurements, and incomplete sensor acquisitions. Second, an adaptive loss function dynamically combines Mean Squared Error (MSE) and Mean Absolute Error (MAE) terms with a piecewise Signal-to-Noise Ratio (SNR) regularization, which suppresses noise and outliers in low-SNR regions while preserving fine-scale features in high-SNR regions. We validate CS-SHRED on challenging problems including viscoelastic fluid flows, maximum specific humidity fields, sea surface temperature distributions, and rotating turbulent flows. Compared to the traditional SHRED approach, CS-SHRED achieves significantly higher reconstruction fidelity, as demonstrated by improved SSIM and PSNR values, lower normalized errors, and enhanced LPIPS scores, thereby providing superior preservation of small-scale structures and increased robustness against noise and outliers. Our results underscore the advantages of the jointly trained CS and SHRED design architecture, which includes an LSTM sequence model for characterizing the temporal evolution with a shallow decoder network (SDN) for modeling the high-dimensional state space. The SNR-guided adaptive loss function for the spatiotemporal data recovery establishes CS-SHRED as a promising tool for a wide range of applications in environmental, climatic, and scientific data analyses.
MSC classes: 68T07, 35Q35, 94A12; ACM classes: I.2.6; I.5.4; I.6.3; J.2
[LG-26] Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation
Link: https://arxiv.org/abs/2507.22299
Authors: Afonso Martini Spezia, Mariana Recamonde-Mendoza
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Cross-validation plays a fundamental role in Machine Learning, enabling robust evaluation of model performance and preventing overestimation on training and validation data. However, one of its drawbacks is the potential to create data subsets (folds) that do not adequately represent the diversity of the original dataset, which can lead to biased performance estimates. The objective of this work is to deepen the investigation of cluster-based cross-validation strategies by analyzing the performance of different clustering algorithms through experimental comparison. Additionally, a new cross-validation technique that combines Mini Batch K-Means with class stratification is proposed. Experiments were conducted on 20 datasets (both balanced and imbalanced) using four supervised learning algorithms, comparing cross-validation strategies in terms of bias, variance, and computational cost. The technique that uses Mini Batch K-Means with class stratification outperformed others in terms of bias and variance on balanced datasets, though it did not significantly reduce computational cost. On imbalanced datasets, traditional stratified cross-validation consistently performed better, showing lower bias, variance, and computational cost, making it a safe choice for performance evaluation in scenarios with class imbalance. In the comparison of different clustering algorithms, no single algorithm consistently stood out as superior. Overall, this work contributes to improving predictive model evaluation strategies by providing a deeper understanding of the potential of cluster-based data splitting techniques and reaffirming the effectiveness of well-established strategies like stratified cross-validation. Moreover, it highlights perspectives for increasing the robustness and reliability of model evaluations, especially in datasets with clustering characteristics.
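As an illustration of how cluster membership and class labels can be combined when building folds, here is a minimal sketch. It is not the authors' procedure: it assumes the cluster ids have already been computed (for example by Mini Batch K-Means) and uses a simple round-robin assignment over each (cluster, class) group. The function name `cluster_stratified_folds` is hypothetical.

```python
from collections import defaultdict


def cluster_stratified_folds(cluster_ids, class_labels, n_folds):
    """Spread the samples of every (cluster, class) group across the folds."""
    groups = defaultdict(list)
    for idx, key in enumerate(zip(cluster_ids, class_labels)):
        groups[key].append(idx)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    # The fold pointer keeps rotating across groups, so fold sizes stay
    # within one sample of each other.
    for key in sorted(groups):
        for idx in groups[key]:
            folds[next_fold].append(idx)
            next_fold = (next_fold + 1) % n_folds
    return folds


# Toy data: two clusters, two classes, twelve samples.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
class_labels = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
folds = cluster_stratified_folds(cluster_ids, class_labels, n_folds=2)
```

Each fold then serves once as the validation set while the remaining folds form the training set, as in ordinary k-fold cross-validation.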
[LG-27] Weighted Conditional Flow Matching
Link: https://arxiv.org/abs/2507.22270
Authors: Sergio Calvo-Ordonez, Matthieu Meunier, Alvaro Cartea, Christoph Reisinger, Yarin Gal, Jose Miguel Hernandez-Lobato
Categories: Machine Learning (cs.LG)
Comments: Working paper
Abstract:Conditional flow matching (CFM) has emerged as a powerful framework for training continuous normalizing flows due to its computational efficiency and effectiveness. However, standard CFM often produces paths that deviate significantly from straight-line interpolations between prior and target distributions, making generation slower and less accurate due to the need for fine discretization at inference. Recent methods enhance CFM performance by inducing shorter and straighter trajectories but typically rely on computationally expensive mini-batch optimal transport (OT). Drawing insights from entropic optimal transport (EOT), we propose Weighted Conditional Flow Matching (W-CFM), a novel approach that modifies the classical CFM loss by weighting each training pair (x, y) with a Gibbs kernel. We show that this weighting recovers the entropic OT coupling up to some bias in the marginals, and we provide the conditions under which the marginals remain nearly unchanged. Moreover, we establish an equivalence between W-CFM and the minibatch OT method in the large-batch limit, showing how our method overcomes computational and performance bottlenecks linked to batch size. Empirically, we test our method on unconditional generation on various synthetic and real datasets, confirming that W-CFM achieves comparable or superior sample quality, fidelity, and diversity to other alternative baselines while maintaining the computational efficiency of vanilla CFM.
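The weighting idea can be sketched in a few lines. The code below is an illustrative reading of the abstract rather than the paper's implementation: it assumes a squared Euclidean cost, a Gibbs kernel of the form `exp(-cost / eps)`, a straight-line interpolation path, and a plain average over the batch. The names `gibbs_weight`, `weighted_cfm_loss`, and `velocity_fn` are hypothetical.

```python
import math
import random


def gibbs_weight(x, y, eps):
    """Gibbs kernel exp(-c(x, y) / eps) with an assumed squared Euclidean cost."""
    cost = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-cost / eps)


def weighted_cfm_loss(velocity_fn, pairs, eps, rng):
    """Average of per-pair flow matching errors, each scaled by its Gibbs weight."""
    total = 0.0
    for x, y in pairs:
        t = rng.random()                                   # time in [0, 1)
        x_t = [(1 - t) * a + t * b for a, b in zip(x, y)]  # straight-line path
        target = [b - a for a, b in zip(x, y)]             # its constant velocity
        pred = velocity_fn(x_t, t)
        sq_err = sum((p - u) ** 2 for p, u in zip(pred, target))
        total += gibbs_weight(x, y, eps) * sq_err
    return total / len(pairs)


# Toy usage with a velocity model that always predicts zero.
pairs = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 0.0], [3.0, 0.0])]
loss = weighted_cfm_loss(lambda x_t, t: [0.0, 0.0], pairs, eps=2.0, rng=random.Random(0))
```

Pairs that are far apart receive exponentially small weight, so independently sampled pairs can be used without solving a mini-batch transport problem.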
[LG-28] Understanding Concept Drift with Deprecated Permissions in Android Malware Detection
Link: https://arxiv.org/abs/2507.22231
Authors: Ahmed Sabbah, Radi Jarrar, Samer Zein, David Mohaisen
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, 5 tables, under review
Abstract:Permission analysis is a widely used method for Android malware detection. It involves examining the permissions requested by an application to access sensitive data or perform potentially malicious actions. In recent years, various machine learning (ML) algorithms have been applied to Android malware detection using permission-based features and feature selection techniques, often achieving high accuracy. However, these studies have largely overlooked important factors such as protection levels and the deprecation or restriction of permissions due to updates in the Android OS – factors that can contribute to concept drift. In this study, we investigate the impact of deprecated and restricted permissions on the performance of machine learning models. A large dataset containing 166 permissions was used, encompassing more than 70,000 malware and benign applications. Various machine learning and deep learning algorithms were employed as classifiers, along with different concept drift detection strategies. The results suggest that Android permissions are highly effective features for malware detection, with the exclusion of deprecated and restricted permissions having only a marginal impact on model performance. In some cases, such as with CNN, accuracy improved. Excluding these permissions also enhanced the detection of concept drift using a year-to-year analysis strategy. Dataset balancing further improved model performance, reduced low-accuracy instances, and enhanced concept drift detection via the Kolmogorov-Smirnov test.
[LG-29] TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
Link: https://arxiv.org/abs/2507.22229
Authors: Stéphane d’Ascoli, Jérémy Rapin, Yohann Benchetrit, Hubert Banville, Jean-Rémi King
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at this https URL.
[LG-30] Explainability-Driven Feature Engineering for Mid-Term Electricity Load Forecasting in ERCOT’s SCENT Region
Link: https://arxiv.org/abs/2507.22220
Authors: Abhiram Bhupatiraju, Sung Bum Ahn
Categories: Machine Learning (cs.LG)
Comments: 12 pages
Abstract:Accurate load forecasting is essential to the operation of modern electric power systems. Given the sensitivity of electricity demand to weather variability and temporal dynamics, capturing non-linear patterns is essential for long-term planning. This paper presents a comparative analysis of machine learning models, Linear Regression, XGBoost, LightGBM, and Long Short-Term Memory (LSTM), for forecasting system-wide electricity load up to one year in advance. Mid-term forecasting has been shown to be crucial for maintenance scheduling, resource allocation, financial forecasting, and market participation. The paper focuses on the use of “Shapley Additive Explanations” (SHAP) to improve model explainability. SHAP enables the quantification of feature contributions, guiding informed feature engineering and improving both model transparency and forecasting accuracy.
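To show what "quantifying feature contributions" means, the sketch below computes exact Shapley values for a tiny model by averaging marginal contributions over all feature orderings, replacing absent features with a single baseline value. This is the quantity that SHAP-style methods approximate efficiently; it is not the paper's pipeline. The model `toy_load_model`, its coefficients, and the feature names are made up for illustration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to its real value
            new = model(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]


def toy_load_model(features):
    """Hypothetical load model with one interaction term."""
    temperature, hour_weight, is_weekend = features
    return (50.0 + 2.0 * temperature + 10.0 * hour_weight
            - 5.0 * is_weekend + 0.5 * temperature * hour_weight)


x = [30.0, 1.0, 0.0]
baseline = [20.0, 0.5, 0.0]
phi = shapley_values(toy_load_model, x, baseline)
```

The contributions add up to the difference between the prediction at `x` and at the baseline, and a feature whose value equals its baseline receives zero attribution. Enumeration costs factorial time, so it is only practical for a handful of features.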
[LG-31] Intent-Aware Neural Query Reformulation for Behavior-Aligned Product Search SIGIR
Link: https://arxiv.org/abs/2507.22213
Authors: Jayanth Yetukuri, Ishita Khan
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at SIGIR eCom’25. this https URL
Abstract:Understanding and modeling buyer intent is a foundational challenge in optimizing search query reformulation within the dynamic landscape of e-commerce search systems. This work introduces a robust data pipeline designed to mine and analyze large-scale buyer query logs, with a focus on extracting fine-grained intent signals from both explicit interactions and implicit behavioral cues. Leveraging advanced sequence mining techniques and supervised learning models, the pipeline systematically captures patterns indicative of latent purchase intent, enabling the construction of a high-fidelity, intent-rich dataset. The proposed framework facilitates the development of adaptive query rewrite strategies by grounding reformulations in inferred user intent rather than surface-level lexical signals. This alignment between query rewriting and underlying user objectives enhances both retrieval relevance and downstream engagement metrics. Empirical evaluations across multiple product verticals demonstrate measurable gains in precision-oriented relevance metrics, underscoring the efficacy of intent-aware reformulation. Our findings highlight the value of intent-centric modeling in bridging the gap between sparse user inputs and complex product discovery goals, and establish a scalable foundation for future research in user-aligned neural retrieval and ranking systems.
[LG-32] CTG-Insight: A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification
Link: https://arxiv.org/abs/2507.22205
Authors: Black Sun, Die (Delia) Hu
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Remote fetal monitoring technologies are becoming increasingly common. Yet, most current systems offer limited interpretability, leaving expectant parents with raw cardiotocography (CTG) data that is difficult to understand. In this work, we present CTG-Insight, a multi-agent LLM system that provides structured interpretations of fetal heart rate (FHR) and uterine contraction (UC) signals. Drawing from established medical guidelines, CTG-Insight decomposes each CTG trace into five medically defined features: baseline, variability, accelerations, decelerations, and sinusoidal pattern, each analyzed by a dedicated agent. A final aggregation agent synthesizes the outputs to deliver a holistic classification of fetal health, accompanied by a natural language explanation. We evaluate CTG-Insight on the NeuroFetalNet Dataset and compare it against deep learning models and the single-agent LLM baseline. Results show that CTG-Insight achieves state-of-the-art accuracy (96.4%) and F1-score (97.8%) while producing transparent and interpretable outputs. This work contributes an interpretable and extensible CTG analysis framework.
[LG-33] Multi-fidelity Bayesian Data-Driven Design of Energy Absorbing Spinodoid Cellular Structures
Link: https://arxiv.org/abs/2507.22079
Authors: Leo Guo, Hirak Kansara, Siamak F. Khosroshahi, GuoQi Zhang, Wei Tan
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: data-driven design, multi-fidelity, Bayesian optimization, cellular structures, energy absorption
Abstract:Finite element (FE) simulations of structures and materials are becoming increasingly accurate, but also more computationally expensive as a result. This development happens in parallel with a growing demand for data-driven design. To reconcile the two, a robust and data-efficient optimization method called Bayesian optimization (BO) has previously been established as a technique for optimizing expensive objective functions. In parallel, the mesh width of an FE model can be exploited to evaluate an objective at a lower or higher fidelity (cost and accuracy) level. The multi-fidelity setting applied to BO, called multi-fidelity BO (MFBO), has also seen previous success. However, BO and MFBO have not been directly compared on a real-life engineering problem, such as metamaterial design for deformation and absorption qualities. Moreover, sampling quality and the assessment of design parameter sensitivity are often underrepresented in data-driven design. This paper aims to address these shortcomings by employing Sobol’ samples with variance-based sensitivity analysis in order to reduce design problem complexity. Furthermore, this work describes, implements, applies and compares the performance of BO with that of MFBO when maximizing the energy absorption (EA) of spinodoid cellular structures. The findings show that MFBO is an effective way to maximize the EA of a spinodoid structure and is able to outperform BO by up to 11% across various hyperparameter settings. The results, which are made open-source, serve to support the utility of multi-fidelity techniques across expensive data-driven design problems.
[LG-34] Test-time Prompt Refinement for Text-to-Image Models ICCV2025
Link: https://arxiv.org/abs/2507.22076
Authors: Mohammad Abdul Hafeez Khan, Yash Jain, Siddhartha Bhattacharyya, Vibhav Vineet
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025, MARS2 Workshop. Total 14 pages, 12 figures and 3 tables
Abstract:Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model, termed TIR. In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user’s prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined and physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
[LG-35] Prototype-Guided Pseudo-Labeling with Neighborhood-Aware Consistency for Unsupervised Adaptation
Link: https://arxiv.org/abs/2507.22075
Authors: Eman Ali, Chetan Arora, Muhammad Haris Khan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In unsupervised adaptation for vision-language models such as CLIP, pseudo-labels derived from zero-shot predictions often exhibit significant noise, particularly under domain shifts or in visually complex scenarios. Conventional pseudo-label filtering approaches, which rely on fixed confidence thresholds, tend to be unreliable in fully unsupervised settings. In this work, we propose a novel adaptive pseudo-labeling framework that enhances CLIP’s adaptation performance by integrating prototype consistency and neighborhood-based consistency. The proposed method comprises two key components: PICS, which assesses pseudo-label accuracy based on in-class feature compactness and cross-class feature separation; and NALR, which exploits semantic similarities among neighboring samples to refine pseudo-labels dynamically. Additionally, we introduce an adaptive weighting mechanism that adjusts the influence of pseudo-labeled samples during training according to their estimated correctness. Extensive experiments on 11 benchmark datasets demonstrate that our method achieves state-of-the-art performance in unsupervised adaptation scenarios, delivering more accurate pseudo-labels while maintaining computational efficiency.
[LG-36] Consistency of Feature Attribution in Deep Learning Architectures for Multi-Omics
Link: https://arxiv.org/abs/2507.22877
Authors: Daniel Claborne, Javier Flores, Samantha Erwin, Luke Durell, Rachel Richardson, Ruby Fore, Lisa Bramer
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Machine and deep learning have grown in popularity and use in biological research over the last decade but still present challenges in interpretability of the fitted model. The development and use of metrics to determine features driving predictions and increase model interpretability continues to be an open area of research. We investigate the use of Shapley Additive Explanations (SHAP) on a multi-view deep learning model applied to multi-omics data for the purposes of identifying biomolecules of interest. Rankings of features via these attribution methods are compared across various architectures to evaluate consistency of the method. We perform multiple computational experiments to assess the robustness of SHAP and investigate modeling approaches and diagnostics to increase and measure the reliability of the identification of important features. Accuracy of a random-forest model fit on subsets of features selected as being most influential as well as clustering quality using only these features are used as a measure of effectiveness of the attribution method. Our findings indicate that the rankings of features resulting from SHAP are sensitive to the choice of architecture as well as different random initializations of weights, suggesting caution when using attribution methods on multi-view deep learning models applied to multi-omics data. We present an alternative, simple method to assess the robustness of identification of important biomolecules.
[LG-37] Synchronization of mean-field models on the circle
Link: https://arxiv.org/abs/2507.22857
Authors: Yury Polyanskiy, Philippe Rigollet, Andrew Yao
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC)
Comments:
Abstract:This paper considers a mean-field model of $n$ interacting particles whose state space is the unit circle, a generalization of the classical Kuramoto model. Global synchronization is said to occur if, after starting from almost any initial state, all particles coalesce to a common point on the circle. We propose a general synchronization criterion in terms of the $L_1$-norm of the third derivative of the particle interaction function. As an application we resolve a conjecture for the so-called self-attention dynamics (a stylized model of transformers), by showing synchronization for all $\beta \ge -0.16$, which significantly extends the previous bound of $0 \le \beta \le 1$ from Criscitiello, Rebjock, McRae, and Boumal (2024). We also show that global synchronization does not occur when $\beta < -2/3$.
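A small simulation helps picture synchronization on the circle. The sketch below is an assumption-laden illustration, not the paper's model definition: it uses forward Euler integration and an interaction of the form `exp(beta * cos(d)) * sin(d)`, which reduces to the classical Kuramoto coupling with identical oscillators when `beta = 0`. The paper's exact normalization may differ. The function names are hypothetical.

```python
import cmath
import math
import random


def order_parameter(thetas):
    """Length of the mean of exp(i * theta); equals 1 when all particles coincide."""
    return abs(sum(cmath.exp(1j * t) for t in thetas)) / len(thetas)


def simulate(thetas, beta, dt=0.1, steps=2000):
    """Forward Euler integration of the mean-field dynamics on the circle."""
    thetas = list(thetas)
    n = len(thetas)
    for _ in range(steps):
        drift = []
        for ti in thetas:
            total = 0.0
            for tj in thetas:
                d = tj - ti
                total += math.exp(beta * math.cos(d)) * math.sin(d)
            drift.append(total / n)
        thetas = [t + dt * v for t, v in zip(thetas, drift)]
    return thetas


rng = random.Random(0)
initial = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(12)]
r_start = order_parameter(initial)
r_end = order_parameter(simulate(initial, beta=0.0))
```

An order parameter close to 1 at the end of the run indicates that the particles have collapsed to a common point. Changing `beta` lets one explore how the interaction strength affects this behavior numerically.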
[LG-38] Federated Learning on Riemannian Manifolds: A Gradient-Free Projection-Based Approach
Link: https://arxiv.org/abs/2507.22855
Authors: Hongye Wang, Zhaoye Pan, Chang He, Jiaxiang Li, Bo Jiang
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Federated learning (FL) has emerged as a powerful paradigm for collaborative model training across distributed clients while preserving data privacy. However, existing FL algorithms predominantly focus on unconstrained optimization problems with exact gradient information, limiting their applicability in scenarios where only noisy function evaluations are accessible or where model parameters are constrained. To address these challenges, we propose a novel zeroth-order projection-based algorithm on Riemannian manifolds for FL. By leveraging the projection operator, we introduce a computationally efficient zeroth-order Riemannian gradient estimator. Unlike existing estimators, ours requires only a simple Euclidean random perturbation, eliminating the need to sample random vectors in the tangent space, thus reducing computational cost. Theoretically, we first prove the approximation properties of the estimator and then establish the sublinear convergence of the proposed algorithm, matching the rate of its first-order counterpart. Numerically, we first assess the efficiency of our estimator using kernel principal component analysis. Furthermore, we apply the proposed algorithm to two real-world scenarios, zeroth-order attacks on deep neural networks and low-rank neural network training, to validate the theoretical findings.
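The flavor of a projection-based zeroth-order step can be shown on the unit sphere. This is a single-client sketch under stated assumptions, not the paper's estimator or its federated algorithm: the manifold is the unit sphere in three dimensions, the projection is normalization, the perturbation is a standard Gaussian vector in the ambient space, and the step uses a one-sided finite difference. The objective and all function names are hypothetical.

```python
import math
import random


def project_to_sphere(v):
    """Projection onto the unit sphere: rescale to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def objective(x):
    """Quadratic form with minimum value 1 on the sphere, attained at (+/-1, 0, 0)."""
    return 1.0 * x[0] ** 2 + 2.0 * x[1] ** 2 + 3.0 * x[2] ** 2


def zeroth_order_step(x, rng, mu=1e-4, lr=0.02):
    """One update that uses only function values and projections."""
    u = [rng.gauss(0.0, 1.0) for _ in x]                    # Euclidean perturbation
    probe = project_to_sphere([xi + mu * ui for xi, ui in zip(x, u)])
    slope = (objective(probe) - objective(x)) / mu          # finite-difference slope
    return project_to_sphere([xi - lr * slope * ui for xi, ui in zip(x, u)])


rng = random.Random(0)
x = project_to_sphere([1.0, 1.0, 1.0])
f_start = objective(x)
for _ in range(3000):
    x = zeroth_order_step(x, rng)
f_end = objective(x)
```

No tangent-space sampling is needed: the perturbation lives in the ambient space and the projection brings both the probe point and the updated iterate back onto the manifold.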
[LG-39] Subgrid BoostCNN: Efficient Boosting of Convolutional Networks via Gradient-Guided Feature Selection
Link: https://arxiv.org/abs/2507.22842
Authors: Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. Experimental results reported on CIFAR-10, SVHN, and ImageNetSub datasets. arXiv admin note: substantial text overlap with arXiv:2203.00761
Abstract:Convolutional Neural Networks (CNNs) have achieved remarkable success across a wide range of machine learning tasks by leveraging hierarchical feature learning through deep architectures. However, the large number of layers and millions of parameters often make CNNs computationally expensive to train, requiring extensive time and manual tuning to discover optimal architectures. In this paper, we introduce a novel framework for boosting CNN performance that integrates dynamic feature selection with the principles of BoostCNN. Our approach incorporates two key strategies: subgrid selection and importance sampling, to guide training toward informative regions of the feature space. We further develop a family of algorithms that embed boosting weights directly into the network training process using a least squares loss formulation. This integration not only alleviates the burden of manual architecture design but also enhances accuracy and efficiency. Experimental results across several fine-grained classification benchmarks demonstrate that our boosted CNN variants consistently outperform conventional CNNs in both predictive performance and training speed.
[LG-40] Amorphous Solid Model of Vectorial Hopfield Neural Networks
Link: https://arxiv.org/abs/2507.22787
Authors: F. Gallavotti, A. Zaccone
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:We present a vectorial extension of the Hopfield associative memory model inspired by the theory of amorphous solids, where binary neural states are replaced by unit vectors $\mathbf{s}_i \in \mathbb{R}^3$ on the sphere $S^2$. The generalized Hebbian learning rule creates a block-structured weight matrix through outer products of stored pattern vectors, analogous to the Hessian matrix structure in amorphous solids. We demonstrate that this model exhibits quantifiable structural properties characteristic of disordered materials: energy landscapes with deep minima for stored patterns versus random configurations (energy gaps $\sim 7$ units), strongly anisotropic correlations encoded in the weight matrix (anisotropy ratios $\sim 10^2$), and order-disorder transitions controlled by the pattern density $\gamma = P/(N \cdot d)$. The enhanced memory capacity ($\gamma_c \approx 0.55$ for a fully-connected network) compared to binary networks ($\gamma_c \approx 0.138$) and the emergence of orientational correlations establish connections between associative memory mechanisms and amorphous solid physics, particularly in systems with continuous orientational degrees of freedom. We also unveil the scaling with the coordination number $Z$ of the memory capacity: $\gamma_c \sim (Z-6)$ from the isostatic point $Z_c = 6$ of the 3D elastic network, which closely mirrors the scaling of the shear modulus $G \sim (Z-6)$ in 3D central-force spring networks.
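The block-structured weight matrix and the energy comparison can be sketched directly. The code below is an illustration under assumptions that the abstract does not spell out: a fully connected network, 3x3 blocks given by `(1/N)` times the sum of outer products of stored unit vectors, no self-coupling blocks, and the energy `-1/2 * sum_ij s_i^T W_ij s_j`. All names are hypothetical and the numbers are not meant to reproduce the reported energy gaps.

```python
import math
import random


def random_unit_vector(rng):
    """Uniform random point on the sphere S^2."""
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def random_state(n, rng):
    return [random_unit_vector(rng) for _ in range(n)]


def hebbian_blocks(patterns):
    """W[i][j] is a 3x3 block built from outer products of stored pattern vectors."""
    n = len(patterns[0])
    W = [[[[0.0] * 3 for _ in range(3)] for _ in range(n)] for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue  # assumed: no self-coupling
                for a in range(3):
                    for b in range(3):
                        W[i][j][a][b] += p[i][a] * p[j][b] / n
    return W


def energy(W, state):
    """E = -1/2 * sum over i, j of s_i^T W_ij s_j."""
    total = 0.0
    for i, si in enumerate(state):
        for j, sj in enumerate(state):
            block = W[i][j]
            total += sum(si[a] * block[a][b] * sj[b] for a in range(3) for b in range(3))
    return -0.5 * total


rng = random.Random(0)
N, P, d = 30, 3, 3
patterns = [random_state(N, rng) for _ in range(P)]
W = hebbian_blocks(patterns)
gamma = P / (N * d)                        # pattern density as defined in the abstract
e_stored = energy(W, patterns[0])
e_random = energy(W, random_state(N, rng))
```

At this low pattern density the stored configuration sits far below a random one in energy, which is the qualitative picture of deep minima for stored patterns.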
[LG-41] A Unified Analysis of Generalization and Sample Complexity for Semi-Supervised Domain Adaptation
Link: https://arxiv.org/abs/2507.22632
Authors: Elif Vural, Huseyin Karaca
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Domain adaptation seeks to leverage the abundant label information in a source domain to improve classification performance in a target domain with limited labels. While the field has seen extensive methodological development, its theoretical foundations remain relatively underexplored. Most existing theoretical analyses focus on simplified settings where the source and target domains share the same input space and relate target-domain performance to measures of domain discrepancy. Although insightful, these analyses may not fully capture the behavior of modern approaches that align domains into a shared space via feature transformations. In this paper, we present a comprehensive theoretical study of domain adaptation algorithms based on domain alignment. We consider the joint learning of domain-aligning feature transformations and a shared classifier in a semi-supervised setting. We first derive generalization bounds in a broad setting, in terms of covering numbers of the relevant function classes. We then extend our analysis to characterize the sample complexity of domain-adaptive neural networks employing maximum mean discrepancy (MMD) or adversarial objectives. Our results rely on a rigorous analysis of the covering numbers of these architectures. We show that, for both MMD-based and adversarial models, the sample complexity admits an upper bound that scales quadratically with network depth and width. Furthermore, our analysis suggests that in semi-supervised settings, robustness to limited labeled target data can be achieved by scaling the target loss proportionally to the square root of the number of labeled target samples. Experimental evaluation in both shallow and deep settings lends support to our theoretical findings.
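For readers unfamiliar with the maximum mean discrepancy mentioned above, the following sketch computes the standard biased estimate of the squared MMD between two samples with a Gaussian (RBF) kernel. It is a generic illustration, not the paper's objective: the kernel choice, the bandwidth, and the idea of feeding it source and target features are assumptions made here.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD: mean k(X,X) + mean k(Y,Y) - 2 mean k(X,Y)."""
    return float(
        rbf_kernel(X, X, bandwidth).mean()
        + rbf_kernel(Y, Y, bandwidth).mean()
        - 2.0 * rbf_kernel(X, Y, bandwidth).mean()
    )

rng = np.random.default_rng(0)
source = rng.normal(size=(200, 2))                # stand-in for source features
target_same = rng.normal(size=(200, 2))           # same distribution as source
target_shifted = rng.normal(size=(200, 2)) + 2.0  # mean-shifted distribution

mmd_same = mmd2_biased(source, target_same)
mmd_shifted = mmd2_biased(source, target_shifted)
print(f"MMD^2 same: {mmd_same:.4f}  MMD^2 shifted: {mmd_shifted:.4f}")
```

An alignment-based method would typically minimize a quantity of this kind over the feature transformations, so that the transformed source and target samples become hard to tell apart.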
[LG-42] Set Invariance with Probability One for Controlled Diffusion: Score-based Approach
Link: https://arxiv.org/abs/2507.22385
Authors: Wenqing Wang, Alexis M.H. Teter, Murat Arcak, Abhishek Halder
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Probability (math.PR); Methodology (stat.ME)
Comments:
Abstract:Given a controlled diffusion and a connected, bounded, Lipschitz set, when is it possible to guarantee controlled set invariance with probability one? In this work, we answer this question by deriving necessary and sufficient conditions in terms of gradients of certain log-likelihoods – a.k.a. score vector fields – for two cases: a given finite time horizon and an infinite time horizon. The deduced conditions comprise a score-based test that provably certifies or falsifies the existence of Markovian controllers for the given controlled set invariance problem data. Our results are constructive in the sense that, when the problem data passes the proposed test, we characterize all controllers guaranteeing the desired set invariance. When the problem data fails the proposed test, there does not exist a controller that can accomplish the desired set invariance with probability one. The computation in the proposed tests involves solving certain Dirichlet boundary value problems and, in the finite horizon case, can also account for the additional constraint of hitting a target subset at the terminal time. We illustrate the results using several semi-analytical and numerical examples.
[LG-43] Robust Filtering and Learning in State-Space Models: Skewness and Heavy Tails Via Asymmetric Laplace Distribution
Link: https://arxiv.org/abs/2507.22343
Authors: Yifan Yu, Shengjie Xiu, Daniel P. Palomar
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:State-space models are pivotal for dynamic system analysis but often struggle with outlier data that deviates from Gaussian distributions, frequently exhibiting skewness and heavy tails. This paper introduces a robust extension utilizing the asymmetric Laplace distribution, specifically tailored to capture these complex characteristics. We propose an efficient variational Bayes algorithm and a novel single-loop parameter estimation strategy, significantly enhancing the efficiency of the filtering, smoothing, and parameter estimation processes. Our comprehensive experiments demonstrate that our methods provide consistently robust performance across various noise settings without the need for manual hyperparameter adjustments. In stark contrast, existing models generally rely on specific noise conditions and necessitate extensive manual tuning. Moreover, our approach uses far fewer computational resources, thereby validating the model’s effectiveness and underscoring its potential for practical applications in fields such as robust control and financial modeling.
[LG-44] Decoding Neural Signatures of Semantic Evaluations in Depression and Suicidality
Link: https://arxiv.org/abs/2507.22313
Authors: Woojae Jeong, Aditya Kommineni, Kleanthis Avramidis, Colin McDaniel, Donald Berry, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Dani Byrd, Assal Habibi, B. Rael Cahn, Idan A. Blank, Kristina Lerman, Dimitrios Pantazis, Sudarsana R. Kadiri, Takfarinas Medani, Shrikanth Narayanan, Richard M. Leahy
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Depression and suicidality profoundly impact cognition and emotion, yet objective neurophysiological biomarkers remain elusive. We investigated the spatiotemporal neural dynamics underlying affective semantic processing in individuals with varying levels of clinical severity of depression and suicidality using multivariate decoding of electroencephalography (EEG) data. Participants (N=137) completed a sentence evaluation task involving emotionally charged self-referential statements while EEG was recorded. We identified robust neural signatures of semantic processing, with peak decoding accuracy between 300 and 600 ms – a window associated with automatic semantic evaluation and conflict monitoring. Compared to healthy controls, individuals with depression and suicidality showed earlier onset, longer duration, and greater amplitude decoding responses, along with broader cross-temporal generalization and increased activation of frontocentral and parietotemporal components. These findings suggest altered sensitivity and impaired disengagement from emotionally salient content in the clinical groups, advancing our understanding of the neurocognitive basis of mental health and providing a principled basis for developing reliable EEG-based biomarkers of depression and suicidality.
[LG-45] An Asynchronous Decentralised Optimisation Algorithm for Nonconvex Problems
Link: https://arxiv.org/abs/2507.22311
Authors: Behnam Mafakheri, Jonathan H. Manton, Iman Shames
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:In this paper, we consider nonconvex decentralised optimisation and learning over a network of distributed agents. We develop an ADMM algorithm based on the Randomised Block Coordinate Douglas-Rachford splitting method, which enables agents in the network to compute a set of first-order stationary solutions of the problem in a distributed and asynchronous manner. To the best of our knowledge, this is the first decentralised and asynchronous algorithm for solving nonconvex optimisation problems with a convergence proof. The numerical examples demonstrate the efficiency of the proposed algorithm for distributed Phase Retrieval and sparse Principal Component Analysis problems.
[LG-46] Representation biases: will we achieve complete understanding by analyzing representations?
Link: https://arxiv.org/abs/2507.22216
Authors: Andrew Kyle Lampinen, Stephanie C. Y. Chan, Yuxuan Li, Katherine Hermann
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:
Abstract:A common approach in neuroscience is to study neural representations as a means to understand a system – increasingly, by relating the neural representations to the internal representations learned by computational models. However, recent work in machine learning (Lampinen, 2024) shows that learned feature representations may be biased to over-represent certain features, and represent others more weakly and less consistently. For example, simple (linear) features may be more strongly and more consistently represented than complex (highly nonlinear) features. These biases could pose challenges for achieving full understanding of a system through representational analysis. In this perspective, we illustrate these challenges – showing how feature representation biases can lead to strongly biased inferences from common analyses like PCA, regression, and RSA. We also present homomorphic encryption as a simple case study of the potential for strong dissociation between patterns of representation and computation. We discuss the implications of these results for representational comparisons between systems, and for neuroscience more generally.
[LG-47] Better Together: Cross and Joint Covariances Enhance Signal Detectability in Undersampled Data
Link: https://arxiv.org/abs/2507.22207
Authors: Arabind Swain, Sean Alexander Ridout, Ilya Nemenman
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments:
Abstract:Many data-science applications involve detecting a shared signal between two high-dimensional variables. Using random matrix theory methods, we determine when such a signal can be detected and reconstructed from sample correlations, despite the background of correlations induced by sampling noise. We consider three different covariance matrices constructed from two high-dimensional variables: their individual self covariance, their cross covariance, and the self covariance of the concatenated (joint) variable, which incorporates the self and the cross correlation blocks. We observe the expected Baik, Ben Arous, and Péché detectability phase transition in all these covariance matrices, and we show that joint and cross covariance matrices always reconstruct the shared signal earlier than the self covariances. Whether the joint or the cross approach is better depends on the mismatch of dimensionalities between the variables. We discuss what these observations mean for choosing the right method for detecting linear correlations in data and how these findings may generalize to nonlinear statistical dependencies.
[LG-48] Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration
Link: https://arxiv.org/abs/2507.22170
Authors: Tavor Z. Baharav, Phillip B. Nicol, Rafael A. Irizarry, Rong Ma
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:
Abstract:Modern data analysis increasingly requires identifying shared latent structure across multiple high-dimensional datasets. A commonly used model assumes that the data matrices are noisy observations of low-rank matrices with a shared singular subspace. In this case, two primary methods have emerged for estimating this shared structure, which vary in how they integrate information across datasets. The first approach, termed Stack-SVD, concatenates all the datasets, and then performs a singular value decomposition (SVD). The second approach, termed SVD-Stack, first performs an SVD separately for each dataset, then aggregates the top singular vectors across these datasets, and finally computes a consensus amongst them. While these methods are widely used, they have not been rigorously studied in the proportional asymptotic regime, which is of great practical relevance in today’s world of increasing data size and dimensionality. This lack of theoretical understanding has led to uncertainty about which method to choose and limited the ability to fully exploit their potential. To address these challenges, we derive exact expressions for the asymptotic performance and phase transitions of these two methods and develop optimal weighting schemes to further improve both methods. Our analysis reveals that while neither method uniformly dominates the other in the unweighted case, optimally weighted Stack-SVD dominates optimally weighted SVD-Stack. We extend our analysis to accommodate multiple shared components, and provide practical algorithms for estimating optimal weights from data, offering theoretical guidance for method selection in practical data integration problems. Extensive numerical simulations and semi-synthetic experiments on genomic data corroborate our theoretical findings.
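The two estimators described in this abstract can be sketched in a few lines of NumPy. This is an unweighted, rank-one illustration under assumptions made here: the datasets share their row dimension (so the shared subspace is the left singular subspace), they are concatenated column-wise, and the consensus step in SVD-Stack is taken to be another SVD of the collected singular vectors. The function names and the simulation parameters are hypothetical, and the optimal weighting schemes from the paper are not implemented.

```python
import numpy as np

def stack_svd(datasets, rank=1):
    """Stack-SVD: concatenate all datasets column-wise, then take one SVD."""
    stacked = np.concatenate(datasets, axis=1)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :rank]

def svd_stack(datasets, rank=1):
    """SVD-Stack: SVD each dataset, collect top left singular vectors,
    then take an SVD of the collection as the consensus (assumed step)."""
    tops = []
    for X in datasets:
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        tops.append(U[:, :rank])
    U, _, _ = np.linalg.svd(np.concatenate(tops, axis=1), full_matrices=False)
    return U[:, :rank]

# Simulated data: M noisy rank-one matrices sharing the left singular vector u.
rng = np.random.default_rng(0)
d, n, M, theta = 100, 200, 3, 60.0
u = rng.normal(size=d)
u /= np.linalg.norm(u)
datasets = []
for _ in range(M):
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    datasets.append(theta * np.outer(u, v) + rng.normal(size=(d, n)))

u_stack = stack_svd(datasets)
u_sep = svd_stack(datasets)
align_stack = abs(float(u @ u_stack[:, 0]))    # |cosine| with the true direction
align_svd_stack = abs(float(u @ u_sep[:, 0]))
print(f"Stack-SVD alignment: {align_stack:.3f}  SVD-Stack alignment: {align_svd_stack:.3f}")
```

With a signal this strong both estimators recover the shared direction well. The regime the abstract is concerned with is the one where signal strength and noise are comparable, which is where the two methods can differ and where the choice of weights matters.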
[LG-49] Simulating Posterior Bayesian Neural Networks with Dependent Weights
Link: https://arxiv.org/abs/2507.22095
Authors: Nicola Apollonio, Giovanni Franzina, Giovanni Luca Torrisi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:In this paper we consider posterior Bayesian fully connected and feedforward deep neural networks with dependent weights. Particularly, if the likelihood is Gaussian, we identify the distribution of the wide width limit and provide an algorithm to sample from the network. In the shallow case we explicitly compute the distribution of the output, proving that it is a Gaussian mixture. All the theoretical results are numerically validated.
Information Retrieval
[IR-0] AUV-Fusion: Cross-Modal Adversarial Fusion of User Interactions and Visual Perturbations Against VARS
Link: https://arxiv.org/abs/2507.22880
Authors: Hai Ling, Tianchi Wang, Xiaohao Liu, Zhulin Tao, Lifang Yang, Xianglin Huang
Categories: Information Retrieval (cs.IR)
Comments: 14 pages, 6 figures
Abstract:Modern Visual-Aware Recommender Systems (VARS) exploit the integration of user interaction data and visual features to deliver personalized recommendations with high precision. However, their robustness against adversarial attacks remains largely underexplored, posing significant risks to system reliability and security. Existing attack strategies suffer from notable limitations: shilling attacks are costly and detectable, and visual-only perturbations often fail to align with user preferences. To address these challenges, we propose AUV-Fusion, a cross-modal adversarial attack framework that adopts high-order user preference modeling and cross-modal adversary generation. Specifically, we obtain robust user embeddings through multi-hop user-item interactions and transform them via an MLP into semantically aligned perturbations. These perturbations are injected into the latent space of a pre-trained VAE within the diffusion model. By synergistically integrating genuine user interaction data with visually plausible perturbations, AUV-Fusion eliminates the need for injecting fake user profiles and effectively mitigates the challenge of insufficient user preference extraction inherent in traditional visual-only attacks. Comprehensive evaluations on diverse VARS architectures and real-world datasets demonstrate that AUV-Fusion significantly enhances the exposure of target (cold-start) items compared to conventional baseline methods. Moreover, AUV-Fusion maintains exceptional stealth under rigorous scrutiny.
[IR-1] Sustainability Evaluation Metrics for Recommender Systems
Link: https://arxiv.org/abs/2507.22520
Authors: Alexander Felfernig, Damian Garber, Viet-Man Le, Sebastian Lubos, Thi Ngoc Trang Tran
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Sustainability-oriented evaluation metrics can help to assess the quality of recommender systems beyond widespread metrics such as accuracy, precision, recall, and satisfaction. Following the United Nations' sustainable development goals (SDGs), such metrics can help to analyse the impact of recommender systems on environmental, social, and economic aspects. We discuss different basic sustainability evaluation metrics for recommender systems and analyze their applications.
[IR-2] Generative Recommendation with Semantic IDs: A Practitioner's Handbook
Link: https://arxiv.org/abs/2507.22224
Authors: Clark Mingxuan Ju, Liam Collins, Leonardo Neves, Bhuvesh Kumar, Louis Yufeng Wang, Tong Zhao, Neil Shah
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Generative recommendation (GR) has gained increasing attention for its promising performance compared to traditional models. A key factor contributing to the success of GR is the semantic ID (SID), which converts continuous semantic representations (e.g., from large language models) into discrete ID sequences. This enables GR models with SIDs to both incorporate semantic information and learn collaborative filtering signals, while retaining the benefits of discrete decoding. However, varied modeling techniques, hyper-parameters, and experimental setups in existing literature make direct comparisons between GR proposals challenging. Furthermore, the absence of an open-source, unified framework hinders systematic benchmarking and extension, slowing model iteration. To address this challenge, our work introduces and open-sources a framework for Generative Recommendation with semantic ID, namely GRID, specifically designed for modularity to facilitate easy component swapping and accelerate idea iteration. Using GRID, we systematically experiment with and ablate different components of GR models with SIDs on public benchmarks. Our comprehensive experiments with GRID reveal that many overlooked architectural components in GR models with SIDs substantially impact performance. This offers both novel insights and validates the utility of an open-source platform for robust benchmarking and GR research advancement. GRID is open-sourced at this https URL.
[IR-3] CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search
Link: https://arxiv.org/abs/2507.19802
Authors: Ziyu Zhang, Yuanhao Wei, Joshua Engels, Julian Shun
Categories: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments:
Abstract:Approximate nearest neighbor search (ANNS) has become a quintessential algorithmic problem underlying many foundational data tasks in AI workloads. Graph-based ANNS indexes have superb empirical trade-offs in indexing cost, query efficiency, and query approximation quality. Most existing graph-based indexes are designed for the static scenario, where there are no updates to the data after the index is constructed. However, full dynamism (insertions, deletions, and searches) is crucial to providing up-to-date responses in applications using vector databases. It is desirable that the index efficiently supports updates and search queries concurrently. Existing dynamic graph-based indexes suffer from at least one of the following problems: (1) the query quality degrades as updates happen; and (2) the graph structure updates used to maintain the index quality upon updates are global and thus expensive. To solve these problems, we propose the CleANN system, which consists of three main components: (1) workload-aware linking of diverse search tree descendants to combat distribution shift; (2) query-adaptive on-the-fly neighborhood consolidation to efficiently handle deleted nodes; and (3) semi-lazy memory cleaning to clean up stale information in the data structure and reduce the work spent by the first two components. We evaluate CleANN on 7 diverse datasets on fully dynamic workloads and find that CleANN has query quality at least as good as if the index had been built statically using the corresponding data. In the in-memory setting using 56 hyper-threads, with all types of queries running concurrently, at the same recall level, CleANN achieves 7-1200x throughput improvement on million-scale real-world datasets. To the best of our knowledge, CleANN is the first concurrent ANNS index to achieve such efficiency while maintaining quality under full dynamism.