本篇博文主要内容为 2025-07-17 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-07-17)
今日共更新469篇论文,其中:
- 自然语言处理共59篇(Computation and Language (cs.CL))
- 人工智能共125篇(Artificial Intelligence (cs.AI))
- 计算机视觉共115篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共131篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Language Models Improve When Pretraining Data Matches Target Tasks
【速读】: 该论文试图解决预训练数据选择方法中目标不明确的问题,即现有的数据选择策略通常通过基准测试驱动的迭代过程隐式形成。为了解决这一问题,作者提出了一种名为基准目标排序(BETR)的显式优化方法,其关键在于通过将基准测试示例与预训练文档嵌入到共享空间中,并基于相似性对文档进行评分,进而训练一个轻量级分类器来预测整个语料库的得分,从而实现更精准的数据选择。
链接: https://arxiv.org/abs/2507.12466
作者: David Mizrahi,Anders Boesen Lindbo Larsen,Jesse Allardice,Suzie Petryk,Yuri Gorokhov,Jeffrey Li,Alex Fang,Josh Gardner,Tom Gunter,Afshin Dehghan
机构: Apple(苹果); University of Washington(华盛顿大学); Stanford(斯坦福大学); Anthropic
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 44 pages, 25 figures, 13 tables
点击查看摘要
Abstract:Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning 10^19 to 10^22 FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
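下面用一个极简的 Python 草图说明 BETR 的三步流程(共享空间嵌入 → 按与基准的相似度给抽样文档打分 → 训练轻量模型外推到全语料)。其中以 TF-IDF 代替论文使用的嵌入模型,文档、保留比例等均为假设示例,并非论文官方实现。

```python
# 极简示意:BETR 的三步流程。数据与阈值均为假设,仅说明思路。
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

benchmark_examples = ["What is 2 + 3?", "Who wrote Hamlet?"]          # 基准训练示例(假设)
doc_sample = ["3 plus 2 equals 5.", "Hamlet is a play by Shakespeare.",
              "Buy cheap watches online now!!!"]                      # 预训练文档抽样(假设)
full_corpus = doc_sample + ["Shakespeare wrote many tragedies.",
                            "Click here to win a prize."]             # 待打分的完整语料(假设)

# 1) 在共享空间中嵌入,并按与基准示例的最大相似度给抽样文档打分
vec = TfidfVectorizer().fit(benchmark_examples + full_corpus)
scores = cosine_similarity(vec.transform(doc_sample),
                           vec.transform(benchmark_examples)).max(axis=1)

# 2) 训练一个轻量回归器,把得分外推到完整语料
ranker = Ridge().fit(vec.transform(doc_sample), scores)
corpus_scores = ranker.predict(vec.transform(full_corpus))

# 3) 保留得分最高的一部分文档作为预训练数据(保留比例为假设值)
keep = np.argsort(corpus_scores)[::-1][: int(0.6 * len(full_corpus))]
print([full_corpus[i] for i in keep])
```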
zh
[NLP-1] S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling ACL2025
【速读】: 该论文试图解决变分自编码器基础的神经主题模型(VAE-NTMs)中常见的后验崩溃问题,即在目标函数中的KL散度项显著减小,导致潜在表示无效。解决方案的关键在于提出一种基于球面切片Wasserstein距离的自编码器——Spherical Sliced Wasserstein Autoencoder for Topic Modeling (S2WTM),该方法采用单位超球面上的先验分布,并利用球面切片Wasserstein距离将聚合后验分布与先验对齐,从而有效缓解后验崩溃问题。
链接: https://arxiv.org/abs/2507.12451
作者: Suman Adhya,Debarshi Kumar Sanyal
机构: Indian Association for the Cultivation of Science(印度科学培育协会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted as a long paper for ACL 2025 main conference
点击查看摘要
Abstract:Modeling latent representations in a hyperspherical space has proven effective for capturing directional similarities in high-dimensional text data, benefiting topic modeling. Variational autoencoder-based neural topic models (VAE-NTMs) commonly adopt the von Mises-Fisher prior to encode hyperspherical structure. However, VAE-NTMs often suffer from posterior collapse, where the KL divergence term in the objective function highly diminishes, leading to ineffective latent representations. To mitigate this issue while modeling hyperspherical structure in the latent space, we propose the Spherical Sliced Wasserstein Autoencoder for Topic Modeling (S2WTM). S2WTM employs a prior distribution supported on the unit hypersphere and leverages the Spherical Sliced-Wasserstein distance to align the aggregated posterior distribution with the prior. Experimental results demonstrate that S2WTM outperforms state-of-the-art topic models, generating more coherent and diverse topics while improving performance on downstream tasks.
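以下是一个极简草图,演示“用切片 Wasserstein 距离把聚合后验与先验对齐”这一核心思想:通过随机一维投影估计两组样本之间的 Sliced-Wasserstein 距离。论文实际使用的是单位超球面上的球面切片变体与超球面先验,此处仅给出欧氏投影的简化版本,数据为随机生成。

```python
# 极简示意:随机投影估计 Sliced-Wasserstein 距离(样本数相同时的 W1 形式)。
import numpy as np

def sliced_wasserstein(x, y, n_proj=200, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)           # 随机单位方向
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py))        # 一维 W1 距离
    return total / n_proj

rng = np.random.default_rng(1)
posterior = rng.normal(size=(512, 16))
posterior /= np.linalg.norm(posterior, axis=1, keepdims=True)   # 模拟落在单位球面上的聚合后验
prior = rng.normal(size=(512, 16))
prior /= np.linalg.norm(prior, axis=1, keepdims=True)           # 球面上的先验样本(假设为均匀)
print("SW distance:", sliced_wasserstein(posterior, prior))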
zh
[NLP-2] Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
【速读】: 该论文试图解决生成式语言模型在生成长链式思维(CoT)过程中可能出现的对齐风险问题,特别是有害内容在CoT和最终输出中同时出现的风险。其解决方案的关键在于利用CoT的模型潜在表示(即CoT激活值)而非文本内容进行最终响应的对齐性预测。研究发现,基于CoT激活值训练的简单线性探测器能够显著优于基于文本的方法,并且能够在推理未完成时提前做出准确预测,从而为实时安全监控和早期干预提供了可行途径。
链接: https://arxiv.org/abs/2507.12428
作者: Yik Siu Chan,Zheng-Xin Yong,Stephen H. Bach
机构: Brown University (布朗大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Open-weights reasoning language models generate long chains-of-thought (CoTs) before producing a final response, which improves performance but introduces additional alignment risks, with harmful content often appearing in both the CoTs and the final outputs. In this work, we investigate if we can use CoTs to predict final response misalignment. We evaluate a range of monitoring approaches, including humans, highly-capable large language models, and text classifiers, using either CoT text or activations. First, we find that a simple linear probe trained on CoT activations can significantly outperform all text-based methods in predicting whether a final response will be safe or unsafe. CoT texts are often unfaithful and can mislead humans and classifiers, while model latents (i.e., CoT activations) offer a more reliable predictive signal. Second, the probe makes accurate predictions before reasoning completes, achieving strong performance even when applied to early CoT segments. These findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
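下面给出一个极简草图,说明“在 CoT 激活上训练线性探测器预测最终回复是否安全”的做法。激活与标签均为合成数据(假设每条 CoT 取某层激活的均值向量),仅演示流程,并非论文的实验设置。

```python
# 极简示意:线性探测器(logistic 回归)在隐藏激活上预测安全/不安全。
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 4096))            # 假设:每条 CoT 的某层激活均值向量
w_true = rng.normal(size=4096)
labels = (acts @ w_true + 0.5 * rng.normal(size=1000) > 0).astype(int)  # 合成的安全/不安全标签

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```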
zh
[NLP-3] Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
【速读】: 该论文试图解决企业在使用大型语言模型(Large Language Models, LLMs)进行决策时面临的挑战,包括静态预训练限制、短上下文窗口以及处理异构数据格式的困难。其解决方案的关键在于提出一种先进的检索增强生成(Retrieval-Augmented Generation, RAG)框架,该框架结合了基于密集嵌入(all-mpnet-base-v2)和BM25的混合检索策略,通过SpaCy命名实体识别(NER)进行元数据感知过滤,并利用交叉编码器重新排序。此外,该框架还引入语义分块以保持文本连贯性,保留表格数据结构以确保行列完整性,以及量化索引优化检索效率,从而显著提升了企业数据场景下的检索与生成性能。
链接: https://arxiv.org/abs/2507.12425
作者: Chandana Cheerla
机构: IIT Roorkee (印度理工学院鲁尔基分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percent (90 versus 75), Recall@5 by 13 percent (87 versus 74), and Mean Reciprocal Rank by 16 percent (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework’s effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at this https URL.
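下面的草图演示混合检索中“稀疏得分与稠密得分归一化后加权融合”的逻辑。为保证离线可运行,用 TF-IDF 近似 BM25、用 LSA 向量近似 all-mpnet-base-v2 稠密嵌入,并省略了论文中的 NER 元数据过滤与交叉编码器重排序;文档、查询与融合权重均为假设。

```python
# 极简示意:混合检索的打分融合(稀疏 + 稠密,min-max 归一化后加权)。
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Employee leave policy: 20 paid vacation days per year.",
        "Quarterly revenue table for the sales department.",
        "Cafeteria menu for next week."]
query = "How many paid vacation days do employees get?"

tfidf = TfidfVectorizer().fit(docs + [query])
D, q = tfidf.transform(docs), tfidf.transform([query])
sparse_scores = cosine_similarity(q, D).ravel()                  # 稀疏(词面)得分

svd = TruncatedSVD(n_components=2, random_state=0).fit(D)
dense_scores = cosine_similarity(svd.transform(q), svd.transform(D)).ravel()  # 稠密(语义)得分

def minmax(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-9)

hybrid = 0.5 * minmax(sparse_scores) + 0.5 * minmax(dense_scores)  # 权重为假设值
print(docs[int(np.argmax(hybrid))])
```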
zh
[NLP-4] Probing for Arithmetic Errors in Language Models
【速读】: 该论文试图解决语言模型在执行算术任务时产生的错误检测问题,旨在通过分析模型的内部激活来识别算术错误。解决方案的关键在于利用简单的探测器(probes)从隐藏状态中准确解码模型的预测输出和正确答案,无论模型输出是否正确,并基于此训练轻量级的错误检测器,实现对模型正确性的高精度预测。
链接: https://arxiv.org/abs/2507.12379
作者: Yucheng Sun,Alessandro Stolfo,Mrinmaya Sachan
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model’s predicted output and the correct answer from hidden states, regardless of whether the model’s output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
zh
[NLP-5] Developing Visual Augmented QA System using Scalable Vision Embedding Retrieval Late Interaction Re-ranker SIGIR
【速读】: 该论文试图解决多模态问答(multi-modal QA)系统中视觉检索过程的可扩展性和效率问题,特别是在使用基于检索增强生成(RAG)框架时所面临的挑战。其关键解决方案是采用多步骤自定义实现,结合广泛使用的混合搜索(metadata embedding)和最先进的晚期交互重排序器(late interaction re-ranker),以高效地检索最佳匹配页面,并利用多模态大语言模型(MLLM)作为读者生成答案。
链接: https://arxiv.org/abs/2507.12378
作者: Rachna Saxena,Abhijeet Kumar,Suresh Shanmugam
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Presented at NLP@IR workshop at SIGIR conference
点击查看摘要
Abstract:Traditional information extraction systems face challenges with text only language models as it does not consider infographics (visual elements of information) such as tables, charts, images etc. often used to convey complex information to readers. Multimodal LLM (MLLM) face challenges of finding needle in the haystack problem i.e., either longer context length or substantial number of documents as search space. Late interaction mechanism over visual language models has shown state of the art performance in retrieval-based vision augmented QA tasks. There are yet few challenges using it for RAG based multi-modal QA. Firstly, many popular and widely adopted vector databases do not support native multi-vector retrieval. Secondly, late interaction requires computation which inflates space footprint and can hinder enterprise adoption. Lastly, the current state of late interaction mechanism does not leverage the approximate neighbor search indexing methods for large speed ups in retrieval process. This paper explores a pragmatic approach to make vision retrieval process scalable and efficient without compromising on performance quality. We propose multi-step custom implementation utilizing widely adopted hybrid search (metadata embedding) and state of the art late interaction re-ranker to retrieve best matching pages. Finally, MLLM are prompted as reader to generate answers from contextualized best matching pages. Through experiments, we observe that the proposed design is scalable (significant speed up) and stable (without degrading performance quality), hence can be used as production systems at enterprises.
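下面用随机向量给出晚期交互(late interaction)MaxSim 打分的极简草图:查询的每个 token 向量与页面的所有向量取最大相似度后求和。论文在此之上还叠加了混合搜索召回与近似最近邻索引,这里只演示打分本身,向量维度与数量均为假设。

```python
# 极简示意:ColBERT 风格的 MaxSim 晚期交互打分。
import numpy as np

def maxsim_score(query_vecs, page_vecs):
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    sim = q @ p.T                      # [查询token数, 页面向量数] 的余弦相似度矩阵
    return sim.max(axis=1).sum()       # 每个查询 token 取最大相似度,再求和

rng = np.random.default_rng(0)
query = rng.normal(size=(12, 128))                       # 假设:12 个查询 token 向量
pages = [rng.normal(size=(n, 128)) for n in (300, 280)]  # 假设:两页文档的多向量表示
scores = [maxsim_score(query, p) for p in pages]
print("best page:", int(np.argmax(scores)))
```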
zh
[NLP-6] Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics
【速读】: 该论文试图解决的问题是:具备网络浏览能力的大语言模型(Large Language Models, LLMs)能否仅凭用户名推断社交媒体用户的人口统计属性。解决方案的关键在于利用LLMs的实时信息检索和多步骤推理能力,通过访问并分析社交媒体内容来预测用户的人口统计属性;实验结果表明该方法能够达到合理的准确率。
链接: https://arxiv.org/abs/2507.12372
作者: Meysam Alizadeh,Fabrizio Gilardi,Zeynab Samei,Mohsen Mosleh
机构: University of Zurich(苏黎世大学); IPM(伊朗基础科学研究院); University of Oxford(牛津大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web browsing capabilities, enabling real time information retrieval and multi step reasoning over live web content. While prior studies have demonstrated LLMs ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post API era, it also raises risks of misuse particularly in information operations and targeted advertising underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public facing applications, while preserving controlled access for verified research purposes.
zh
[NLP-7] Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理用户请求时面临的歧义问题,旨在提升模型对复杂系统中模糊指令的理解与响应能力。其解决方案的关键在于引入一种多智能体辩论框架,通过多个LLM架构(如Llama3-8B、Gemma2-9B和Mistral-7B变体)之间的协作与辩论,增强模型检测和解决歧义的能力。实验结果表明,该框架显著提升了模型性能,尤其在复杂歧义和达成共识方面表现突出。
链接: https://arxiv.org/abs/2507.12370
作者: Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted at the 2025 SICE Festival with Annual Conference (SICE FES)
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they face challenges such as ambiguity in user requests processed by LLMs. To address these challenges, this paper introduces and evaluates a multi-agent debate framework designed to enhance detection and resolution capabilities beyond single models. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging varying model responses to collaborative strategies, these findings underscore the debate framework’s value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.
zh
[NLP-8] Exploring Gender Bias in Alzheimer's Disease Detection: Insights from Mandarin and Greek Speech Perception
【速读】: 该论文试图解决阿尔茨海默病(Alzheimer’s Disease, AD)语音感知中的性别偏见问题,即男性语音更容易被误判为AD的现象。研究通过实验和声学分析发现,男性语音的振幅微扰(shimmer)与AD感知显著相关,而语音占比则与AD识别呈显著负相关。解决方案的关键在于识别并量化这些声学特征对性别偏见的影响,从而在构建AD检测模型时采取措施减少此类偏见,提升模型在不同语言环境下的可靠性。
链接: https://arxiv.org/abs/2507.12356
作者: Liu He,Yuanchao Li,Rui Feng,XinRan Han,Yin-Long Liu,Yuwei Yang,Zude Zhu,Jiahong Yuan
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD)
备注: 12 pages, 5 figures, conference or other essential info
点击查看摘要
Abstract:Gender bias has been widely observed in speech perception tasks, influenced by the fundamental voicing differences between genders. This study reveals a gender bias in the perception of Alzheimer’s Disease (AD) speech. In a perception experiment involving 16 Chinese listeners evaluating both Chinese and Greek speech, we identified that male speech was more frequently identified as AD, with this bias being particularly pronounced in Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while speech portion exhibited a significant negative correlation with AD identification. Although language did not have a significant impact on AD perception, our findings underscore the critical role of gender bias in AD speech perception. This work highlights the necessity of addressing gender bias when developing AD detection models and calls for further research to validate model performance across different linguistic contexts.
zh
[NLP-9] Nonlinear Concept Erasure: a Density Matching Approach ECAI2025
【速读】: 该论文试图解决在实际应用中,神经网络模型可能从文本表示中推断出敏感信息(如性别或种族等人口统计属性)的问题,尤其是在关注公平性时。解决方案的关键在于概念擦除(concept erasure),即通过学习嵌入空间中的正交投影,将特定概念的信息从分布式表示中移除,同时尽可能保留其余语义信息。该方法通过调整投影器的秩来控制信息移除的程度,而其正交性保证了嵌入局部结构的严格保持。
链接: https://arxiv.org/abs/2507.12341
作者: Antoine Saillenfest,Pirmin Lemberger
机构: onepoint
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 17 pages, 10 figures, accepted for publication in ECAI 2025 (28th European Conference on Artificial Intelligence)
点击查看摘要
Abstract:Ensuring that neural models used in real-world applications cannot infer sensitive information, such as demographic attributes like gender or race, from text representations is a critical challenge when fairness is a concern. We address this issue through concept erasure, a process that removes information related to a specific concept from distributed representations while preserving as much of the remaining semantic information as possible. Our approach involves learning an orthogonal projection in the embedding space, designed to make the class-conditional feature distributions of the discrete concept to erase indistinguishable after projection. By adjusting the rank of the projector, we control the extent of information removal, while its orthogonality ensures strict preservation of the local structure of the embeddings. Our method, termed $\overline{\mathrm{L}}$EOPARD, achieves state-of-the-art performance in nonlinear erasure of a discrete attribute on classic natural language processing benchmarks. Furthermore, we demonstrate that $\overline{\mathrm{L}}$EOPARD effectively mitigates bias in deep nonlinear classifiers, thereby promoting fairness.
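下面给出线性概念擦除的一个最简化草图:把嵌入投影到“类别均值差方向”的正交补上,从而移除该方向携带的概念信息。论文的 $\overline{\mathrm{L}}$EOPARD 学到的是秩可调、通过匹配类条件分布得到的正交投影,这里仅示意“正交投影擦除”这一共同思想,数据为合成数据。

```python
# 极简示意:用正交投影去掉"类别均值差方向"上的概念信息。
import numpy as np

rng = np.random.default_rng(0)
d = 64
concept = rng.integers(0, 2, size=500)                                   # 敏感概念标签(假设)
X = rng.normal(size=(500, d)) + np.outer(concept, rng.normal(size=d))    # 嵌入中混入概念信息

u = X[concept == 1].mean(0) - X[concept == 0].mean(0)
u /= np.linalg.norm(u)                                                   # 携带概念信息的方向
P = np.eye(d) - np.outer(u, u)                                           # 秩 d-1 的正交投影矩阵
X_erased = X @ P

gap_before = np.linalg.norm(X[concept == 1].mean(0) - X[concept == 0].mean(0))
gap_after = np.linalg.norm(X_erased[concept == 1].mean(0) - X_erased[concept == 0].mean(0))
print(f"类均值间距: 擦除前 {gap_before:.3f} -> 擦除后 {gap_after:.3f}")
```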
zh
[NLP-10] Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
【速读】: 该论文旨在解决现有代码大语言模型(Code LLMs)在硬件描述语言(HDL)如VHDL的代码生成和摘要任务中表现不佳的问题。其关键解决方案是提出一种名为Chain-of-Descriptions (CoDes)的新方法,通过生成基于问题陈述或VHDL代码的中间描述步骤,并将这些步骤与原始输入提示结合,作为输入提供给LLMs以生成最终输出,从而显著提升模型在VHDL相关任务上的性能。
链接: https://arxiv.org/abs/2507.12308
作者: Prashanth Vijayaraghavan,Apoorva Nitsure,Charles Mackin,Luyao Shi,Stefano Ambrogio,Arvind Haran,Viresh Paruthi,Ali Elzein,Dan Coops,David Beymer,Tyler Baldwin,Ehsan Degan
机构: IBM Research; IBM Austin; IBM Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 10 pages (6 content pages + 4 supplementary), 5 figures, Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD. 2024 (MLCAD’24)
点击查看摘要
Abstract:Large Language Models (LLMs) have become widely used across diverse NLP tasks and domains, demonstrating their adaptability and effectiveness. In the realm of Electronic Design Automation (EDA), LLMs show promise for tasks like Register-Transfer Level (RTL) code generation and summarization. However, despite the proliferation of LLMs for general code-related tasks, there’s a dearth of research focused on evaluating and refining these models for hardware description languages (HDLs), notably VHDL. In this study, we evaluate the performance of existing code LLMs for VHDL code generation and summarization using various metrics and two datasets – VHDL-Eval and VHDL-Xform. The latter, an in-house dataset, aims to gauge LLMs’ understanding of functionally equivalent code. Our findings reveal consistent underperformance of these models across different metrics, underscoring a significant gap in their suitability for this domain. To address this challenge, we propose Chain-of-Descriptions (CoDes), a novel approach to enhance the performance of LLMs for VHDL code generation and summarization tasks. CoDes involves generating a series of intermediate descriptive steps based on: (i) the problem statement for code generation, and (ii) the VHDL code for summarization. These steps are then integrated with the original input prompt (problem statement or code) and provided as input to the LLMs to generate the final output. Our experiments demonstrate that the CoDes approach significantly surpasses the standard prompting strategy across various metrics on both datasets. This method not only improves the quality of VHDL code generation and summarization but also serves as a framework for future research aimed at enhancing code LLMs for VHDL.
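下面的草图演示 CoDes 的两段式提示组装:先让模型生成中间描述步骤,再把步骤与原始问题拼接后生成最终 VHDL 代码。其中 llm_generate 是假设的占位函数,提示词措辞也只是示意,并非论文使用的模板。

```python
# 极简示意:Chain-of-Descriptions(CoDes)的两段式提示组装。
def llm_generate(prompt: str) -> str:
    return "(模型输出占位)"   # 占位实现,便于脚本直接运行;实际应替换为真实的 LLM 调用

def codes_generate(problem_statement: str) -> str:
    # 第 1 步:基于问题陈述生成一系列中间描述步骤
    plan_prompt = ("List the intermediate descriptive steps needed to implement "
                   f"the following VHDL design:\n{problem_statement}")
    steps = llm_generate(plan_prompt)

    # 第 2 步:将描述步骤与原始问题一起作为输入,生成最终代码
    final_prompt = (f"Problem:\n{problem_statement}\n\n"
                    f"Intermediate steps:\n{steps}\n\n"
                    "Now write the complete VHDL code.")
    return llm_generate(final_prompt)

print(codes_generate("Design a 4-bit synchronous up-counter with asynchronous reset."))
```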
zh
[NLP-11] Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding
【速读】: 该论文旨在解决文本异常检测领域缺乏标准化和全面基准的问题,从而限制了现有检测方法的严谨比较与创新方法的发展。其解决方案的关键在于构建一个综合性文本异常检测基准,利用来自多种预训练语言模型的嵌入表示,并在多个领域文本数据集上进行系统评估。研究发现,嵌入质量显著影响异常检测效果,且基于深度学习的方法在使用LLM衍生嵌入时并未表现出优于传统浅层算法(如KNN、孤立森林)的性能;此外,跨模型性能矩阵表现出强烈的低秩特性,这为实际应用中的快速模型评估与选择提供了高效策略。
链接: https://arxiv.org/abs/2507.12295
作者: Feng Xiao,Jicong Fan
机构: The Chinese University of Hong Kong, Shenzhen(香港中文大学深圳校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection and content moderation, etc. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating the existing anomaly detection methods on text data limits rigorous comparison and development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMa-2, LLama-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strongly low-rank characteristics in cross-model performance matrices, which enables an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit that includes all embeddings from different models and code at this https URL, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
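下面给出该基准评测流程的极简草图:同一组文本嵌入分别交给 KNN 与 Isolation Forest 两类浅层检测器,再用 AUROC/AUPRC 比较效果。嵌入用两种分布的随机向量模拟,并非基准中的真实 LLM 嵌入。

```python
# 极简示意:基于嵌入的文本异常检测评测(KNN / Isolation Forest + AUROC / AUPRC)。
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(900, 256))     # 模拟正常文本的嵌入
anomal = rng.normal(3, 1, size=(100, 256))     # 模拟异常文本的嵌入
X = np.vstack([normal, anomal])
y = np.array([0] * 900 + [1] * 100)            # 1 表示异常

# KNN:用到第 k 近邻的距离作为异常分数
nn = NearestNeighbors(n_neighbors=5).fit(X)
knn_score = nn.kneighbors(X)[0][:, -1]

# Isolation Forest:score_samples 越小越异常,取负号作为异常分数
iso_score = -IsolationForest(random_state=0).fit(X).score_samples(X)

for name, s in [("KNN", knn_score), ("IsolationForest", iso_score)]:
    print(name, "AUROC=%.3f" % roc_auc_score(y, s), "AUPRC=%.3f" % average_precision_score(y, s))
```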
zh
[NLP-12] MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
【速读】: 该论文试图解决当前大型语言模型(LLM)在软件工程任务中的评估主要集中在自然语言任务上,而忽视了代码质量的问题。现有基准测试更关注高层次的推理能力而非可执行代码和实际性能,导致对模型在生产环境中真实能力和风险的理解存在空白。解决方案的关键是提出MERA Code,这是MERA基准系列的新成员,专门针对最新代码生成LLM在俄语中的表现进行评估。该基准包含11个涵盖8种编程语言的评估任务,并引入了一种分类法,以明确模型完成这些任务所需的实际编码技能。此外,MERA Code提供了一个开源代码库、兼容多种编程环境的评分系统以及具备排行榜和提交系统的平台,旨在全面评估模型在非英语语言中的实际编码能力。
链接: https://arxiv.org/abs/2507.12284
作者: Artem Chervyakov,Alexander Kharitonov,Pavel Zadorozhny,Adamenko Pavel,Rodion Levichev,Dmitrii Vorobev,Dmitrii Salikhov,Aidar Valeev,Alena Pestova,Maria Dziuba,Ilseyar Alimova,Artem Zavgorodnev,Aleksandr Medvedev,Stanislav Moiseev,Elena Bruches,Daniil Grebenkin,Roman Derunets,Vikulov Vladimir,Anton Emelyanov,Dmitrii Babaev,Vladimir V. Ivanov,Valentin Malykh,Alena Fenogenova
机构: SberAI; ITMO University; MWS AI; T-Technologies; Rostelecom; Siberian Neuronets; Skoltech; Innopolis University
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
zh
[NLP-13] Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes ATC WWW EMNLP2025
【速读】: 该论文试图解决从非结构化的临床文本自动转换为结构化FHIR资源时存在的泛化能力有限和结构不一致的问题。解决方案的关键在于提出一个端到端的框架,该框架利用大语言模型(LLM)代理、代码执行和医疗术语数据库工具,以确保生成的FHIR资源符合FHIR文档模式,并在预测性能上接近人类基准。
链接: https://arxiv.org/abs/2507.12261
作者: Johann Frei,Nils Feldhus,Lisa Raithel,Roland Roller,Alexander Meyer,Frank Kramer
机构: University of Augsburg (奥格斯堡大学); BIFOLD – Berlin Institute for the Foundations of Learning and Data (柏林学习与数据基础研究所); Technische Universität Berlin (柏林工业大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心); DHZC Medical Data Science, Charité - Universitätsmedizin Berlin (DHZC 医疗数据科学,夏里特-柏林大学医学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to EMNLP 2025 System Demonstrations | Code: this https URL | Video: this https URL | Demo: this https URL | HuggingFace Spaces: this https URL
点击查看摘要
Abstract:For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources rely on modular, rule-based systems or LLMs with instruction tuning and constrained decoding. Since they frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions.
zh
[NLP-14] Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
【速读】: 该论文试图解决翻译腔(translationese)的量化评估问题,即如何对翻译文本中表现出的源语言特征进行可分级和可泛化的度量。解决方案的关键在于提出一种基于对比微调语言模型(LMs)的似然比计算的量化指标——翻译腔指数(T-index),该指标能够有效捕捉真实场景中的翻译腔现象,并与人类标注的翻译腔程度具有良好的相关性。
链接: https://arxiv.org/abs/2507.12260
作者: Yikang Liu,Wanyang Zhang,Yiming Wang,Jialong Tang,Pei Zhang,Baosong Yang,Fei Huang,Rui Wang,Hai Hu
机构: Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学); Tongyi Lab (通义实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this paper, we propose the first quantitative measure for translationese – the translationese-index (T-index) for graded and generalizable measurement of translationese, computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use a synthesized dataset and a dataset with translations in the wild to evaluate T-index’s generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index is both robust and efficient. T-index scored by two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can well capture translationese in the wild. We find that the relative differences in T-indices between translations can well predict pairwise translationese annotations obtained from human annotators; and the absolute values of T-indices correlate well with human ratings of degrees of translationese (Pearson’s r = 0.568 ). Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
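下面的草图演示 T-index 的计算方式:同一句子分别在“翻译文本上微调的 LM”与“原创文本上微调的 LM”上计算平均对数似然,取两者之差。为了能直接运行,两个模型都用公开的 gpt2 充当占位(此时差值恒为 0),实际应替换为论文中对比微调得到的两个 0.5B 检查点;需要安装 transformers 与 torch。

```python
# 极简示意:T-index = 翻译模型平均对数似然 - 原创模型平均对数似然。
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def mean_log_likelihood(model, text: str) -> float:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()                     # loss 是平均负对数似然,取负即平均对数似然

lm_translated = AutoModelForCausalLM.from_pretrained("gpt2")   # 占位:应为翻译文本上微调的 LM
lm_original = AutoModelForCausalLM.from_pretrained("gpt2")     # 占位:应为原创文本上微调的 LM

sentence = "He made his way to the door in a manner that was not lacking in haste."
t_index = mean_log_likelihood(lm_translated, sentence) - mean_log_likelihood(lm_original, sentence)
print("T-index:", t_index)   # 用占位模型时恒为 0,换成两个对比微调的模型后才有区分度
```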
zh
[NLP-15] Improving Contextual ASR via Multi-grained Fusion with Large Language Models
【速读】: 该论文试图解决端到端自动语音识别(ASR)模型在识别上下文相关关键词(如专有名词或用户特定实体)时表现不佳的问题。现有方法通过在文本模态中引入关键词字典,采用令牌级融合或短语级融合来提升关键词识别效果,但这些方法在粒度上存在差异且各有局限。论文提出的解决方案的关键在于一种多粒度融合方法,该方法结合了令牌级和短语级融合的优势,并利用大型语言模型(LLM)的丰富上下文知识,通过后期融合策略将ASR的声学信息与LLM的上下文理解相结合,从而在保持非关键词文本高准确率的同时,显著提升关键词相关指标的表现。
链接: https://arxiv.org/abs/2507.12252
作者: Shilin Zhou,Zhenghua Li
机构: Soochow University (苏州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR’s acoustic information with LLM’s rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework. The code and models will be publicly available at this https URL.
zh
[NLP-16] Towards few-shot isolated word reading assessment
【速读】: 该论文试图解决在低资源环境下孤立单词阅读评估的问题,特别是针对儿童语音的识别任务。其解决方案的关键在于采用一种无需自动语音识别(ASR)的方法,通过将儿童语音与少量成人提供的参考模板进行比较,利用大规模自监督学习(SSL)模型的中间层对输入和模板进行编码,从而实现分类。研究还探讨了SSL特征离散化和模板的巴氏中心平均等设计选项,但结果表明,在少样本分类系统中,SSL表示在处理儿童数据时存在明显局限性。
链接: https://arxiv.org/abs/2507.12217
作者: Reuben Smit,Retief Louw,Herman Kamper
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to SLaTE 2025
点击查看摘要
Abstract:We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.
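下面用随机向量给出少样本模板匹配的极简草图:把儿童语音的 SSL 表示与每个候选词的成人模板逐一计算余弦相似度,取相似度最高的词作为识别结果。词表与向量均为假设,论文中还讨论了特征离散化与模板重心平均等设计变体。

```python
# 极简示意:少样本模板匹配的判别方式(余弦相似度 + 最近模板)。
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
words = ["kat", "hond", "huis"]                                       # 假设的候选词表
templates = {w: rng.normal(size=(3, 512)) for w in words}             # 每个词 3 条成人模板向量(假设)
child_utt = templates["hond"].mean(0) + 0.1 * rng.normal(size=512)    # 模拟儿童读 "hond" 的 SSL 表示

scores = {w: max(cosine(child_utt, t) for t in temps) for w, temps in templates.items()}
print("predicted word:", max(scores, key=scores.get))
```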
zh
[NLP-17] Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect Behaviour and Cognition in Human Translation Production
【速读】: 该论文试图解决人类翻译过程中行为模式与认知状态之间的关系问题,旨在构建一个能够描述行为翻译模式的结构化框架。解决方案的关键在于提出一种层次化的行为翻译风格空间(Behavioural Translation Style Space, BTSS),通过分析键入记录和注视数据,将可观测的行为模式(如眼动和手指运动)与更高阶的认知过程和情感状态联系起来,从而为计算翻译代理提供模拟人类翻译生产中情感、自动化行为和认知时间动态的基础。
链接: https://arxiv.org/abs/2507.12208
作者: Michael Carl,Takanori Mizowaki,Aishvarya Ray,Masaru Yamada,Devi Sri Bandaru,Xinyue Ren
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The paper introduces a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns. The suggested BTSS is organized as a hierarchical structure that entails various embedded processing layers. We posit that observable translation behaviour - i.e., eye and finger movements - is fundamental when executing the physical act of translation but it is caused and shaped by higher-order cognitive processes and affective translation states. We analyse records of keystrokes and gaze data as indicators of the hidden mental processing structure and organize the behavioural patterns as a multi-layered embedded BTSS. The BTSS serves as the basis for a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour and cognition during human translation production.
zh
[NLP-18] RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription and Mistake Detection
【速读】: 该论文试图解决音乐表演分析中的多个任务,包括乐谱到演奏的对齐、基于乐谱的转录以及错误检测,这些问题传统上是分别处理的。解决方案的关键在于提出RUMAA框架,该框架通过预训练的乐谱和音频编码器以及一种新颖的三流解码器,以接近端到端的方式整合这些任务,从而捕捉任务间的相互依赖关系,并通过代理任务实现有效集成。
链接: https://arxiv.org/abs/2507.12175
作者: Sungkyun Chang,Simon Dixon,Emmanouil Benetos
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted to WASPAA 2025
点击查看摘要
Abstract:This study introduces RUMAA, a transformer-based framework for music performance analysis that unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner. Unlike prior methods addressing these tasks separately, RUMAA integrates them using pre-trained score and audio encoders and a novel tri-stream decoder capturing task interdependencies through proxy tasks. It aligns human-readable MusicXML scores with repeat symbols to full-length performance audio, overcoming traditional MIDI-based methods that rely on manually unfolded score-MIDI data with pre-specified repeat structures. RUMAA matches state-of-the-art alignment methods on non-repeated scores and outperforms them on scores with repeats in a public piano music dataset, while also delivering promising transcription and mistake detection results.
zh
[NLP-19] Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators
【速读】: 该论文试图解决如何评估生成式语言模型(Generative Language Models)在理解给定文本并生成有意义内容方面的能力问题。其解决方案的关键在于通过“Sensemaking”这一共享任务,构建一个结构化的评估框架,该框架包括三个步骤:教师系统生成问题、学生系统回答问题以及评估者系统对答案进行评分,所有步骤均严格基于给定的输入材料。该研究通过多语言测试材料和多个团队的参与,验证了这一评估框架的有效性,并探讨了生成式AI在不同任务中的表现及存在的问题。
链接: https://arxiv.org/abs/2507.12143
作者: Pavel Šindelář,Ondřej Bojar
机构: Charles University (查理大学)
类目: Computation and Language (cs.CL)
备注: 30 pages, 7 figures, CLEF 2025 Conference and Labs of the Evaluation Forum
点击查看摘要
Abstract:ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models ``make sense out of a given text’’ in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student submissions, and 2 Evaluator submissions. We added baselines for Teacher and Student using commercial large language model systems. We devised a fully automatic evaluation procedure, which we compare to a minimalistic manual evaluation. We were able to make some interesting observations. For the first task, the creation of questions, better evaluation strategies will still have to be devised because it is difficult to discern the quality of the various candidate question sets. In the second task, question answering, the LLMs examined overall perform acceptably, but restricting their answers to the given input texts remains problematic. In the third task, evaluation of question answers, our adversarial tests reveal that systems using the LLM-as-a-Judge paradigm erroneously rate both garbled question-answer pairs and answers to mixed-up questions as acceptable.
zh
[NLP-20] RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在大规模语言模型(Large Language Models, LLMs)参数高效微调中面临的两个关键挑战:最优初始化策略的寻找以及低秩矩阵分解中的过参数化问题。其解决方案的关键在于将一组固定秩的LoRA矩阵视为一个光滑流形,并将适配器视为该流形上的元素,从而消除过参数化;同时,在流形上确定损失函数下降最快的方向以实现有效的初始化。此外,通过结合数值线性代数和黎曼优化的最佳实践,确保了方法的数值稳定性和计算效率。
链接: https://arxiv.org/abs/2507.12142
作者: Vladimir Bogachev,Vladimir Aletov,Alexander Molozhavenko,Denis Bobkov,Vera Soboleva,Aibek Alanov,Maxim Rakhuba
机构: HSE University(高等经济大学); MIPT, ISPRAS(莫斯科物理技术研究所,信息系统与自动化问题研究所); AIRI(人工智能研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Differential Geometry (math.DG); Numerical Analysis (math.NA)
备注:
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we propose a novel approach that addresses both of the challenges simultaneously within a unified framework. Our method treats a set of fixed-rank LoRA matrices as a smooth manifold. Considering adapters as elements on this manifold removes overparametrization, while determining the direction of the fastest loss decrease along the manifold provides initialization. Special care is taken to obtain numerically stable and computationally efficient implementation of our method, using best practices from numerical linear algebra and Riemannian optimization. Experimental results on LLM and diffusion model architectures demonstrate that RiemannLoRA consistently improves both convergence speed and final performance over standard LoRA and its state-of-the-art modifications.
zh
[NLP-21] Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis
【速读】: 该论文试图解决自然语言处理(NLP)中数据稀疏性问题,尤其是在低资源环境下,有限样本阻碍了有效的语义建模。现有文本数据增强技术在大规模或迭代生成过程中缺乏确保语义保留的机制,导致冗余和不稳定。解决方案的关键在于提出一种基于大语言模型(LLM)的文本增强原则性评估框架,包含两个核心组件:(1) 可扩展性分析,用于衡量随着增强量增加时的语义一致性;(2) 带有摘要精炼的迭代增强(IASR),用于评估递归改写过程中的语义漂移。该框架有效平衡了语义保真度、多样性与生成效率,并在实际任务中显著提升了主题建模的粒度和准确性。
链接: https://arxiv.org/abs/2507.12126
作者: Payal Bhattad,Sai Manoj Pudukotai Dinakarrao,Anju Gupta
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Text data augmentation is a widely used strategy for mitigating data sparsity in natural language processing (NLP), particularly in low-resource settings where limited samples hinder effective semantic modeling. While augmentation can improve input diversity and downstream interpretability, existing techniques often lack mechanisms to ensure semantic preservation during large-scale or iterative generation, leading to redundancy and instability. This work introduces a principled evaluation framework for large language model (LLM) based text augmentation, comprising two components: (1) Scalability Analysis, which measures semantic consistency as augmentation volume increases, and (2) Iterative Augmentation with Summarization Refinement (IASR), which evaluates semantic drift across recursive paraphrasing cycles. Empirical evaluations across state-of-the-art LLMs show that GPT-3.5 Turbo achieved the best balance of semantic fidelity, diversity, and generation efficiency. Applied to a real-world topic modeling task using BERTopic with GPT-enhanced few-shot labeling, the proposed approach results in a 400% increase in topic granularity and complete elimination of topic overlaps. These findings validated the utility of the proposed frameworks for structured evaluation of LLM-based augmentation in practical NLP pipelines.
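下面的草图演示 IASR 中“递归改写的语义漂移”度量循环:每轮把上一轮输出再改写一次,并计算与原句嵌入的余弦距离,观察随轮数的变化。其中 paraphrase 与 embed 均为占位函数(实际应分别调用 LLM 改写接口与句向量模型),仅示意评估流程。

```python
# 极简示意:递归改写多轮后相对原句的语义漂移(1 - 余弦相似度)。
import numpy as np

def embed(text: str) -> np.ndarray:
    # 占位:用字符哈希生成稳定的伪嵌入,真实场景应使用句向量模型
    state = np.random.default_rng(abs(hash(text)) % (2**32))
    return state.normal(size=128)

def paraphrase(text: str, round_id: int) -> str:
    return f"{text} (paraphrased v{round_id})"      # 占位:实际应调用 LLM 进行改写

original = "The survey respondents reported low satisfaction with remote work tools."
ref = embed(original)
current = original
for i in range(1, 6):
    current = paraphrase(current, i)
    v = embed(current)
    drift = 1 - float(ref @ v / (np.linalg.norm(ref) * np.linalg.norm(v)))
    print(f"round {i}: semantic drift = {drift:.3f}")
```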
zh
[NLP-22] Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning
【速读】: 该论文试图解决大学生在数学学习中面临的困难问题,这些问题往往源于教学方法的不足,导致学生对数学相关学科产生回避倾向。论文提出的解决方案是结合苏格拉底式教学法、思维链(Chain of Thought, CoT)推理、简化游戏化和形成性反馈的综合方法,称为通过人工智能大语言模型进行数学解释(Mathematics Explanations through Games by AI LLMs, MEGA)。该方法的关键在于通过游戏化元素和反馈机制增强学习体验,从而提升学生对复杂数学问题的理解与掌握。
链接: https://arxiv.org/abs/2507.12079
作者: Tosin Adewumi,Foteini Simistira Liwicki,Marcus Liwicki,Viktor Gardelli,Lama Alkhaled,Hamam Mokayed
机构: Michigan State University (密歇根州立大学)
类目: Computation and Language (cs.CL)
备注: This paper was accepted for the special issue AI for Education by the IEEE Signal Processing Magazine journal
点击查看摘要
Abstract:This paper presents an intervention study on the effects of the combined methods of (1) the Socratic method, (2) Chain of Thought (CoT) reasoning, (3) simplified gamification and (4) formative feedback on university students’ Maths learning driven by large language models (LLMs). We call our approach Mathematics Explanations through Games by AI LLMs (MEGA). Some students struggle with Maths and as a result avoid Math-related discipline or subjects despite the importance of Maths across many fields, including signal processing. Oftentimes, students’ Maths difficulties stem from suboptimal pedagogy. We compared the MEGA method to the traditional step-by-step (CoT) method to ascertain which is better by using a within-group design after randomly assigning questions for the participants, who are university students. Samples (n=60) were randomly drawn from each of the two test sets of the Grade School Math 8K (GSM8K) and Mathematics Aptitude Test of Heuristics (MATH) datasets, based on the error margin of 11%, the confidence level of 90%, and a manageable number of samples for the student evaluators. These samples were used to evaluate two capable LLMs at length (Generative Pretrained Transformer 4o (GPT4o) and Claude 3.5 Sonnet) out of the initial six that were tested for capability. The results showed that students agree in more instances that the MEGA method is experienced as better for learning for both datasets. It is even much better than the CoT (47.5% compared to 26.67%) in the more difficult MATH dataset, indicating that MEGA is better at explaining difficult Maths problems.
zh
[NLP-23] BOOKCOREF: Coreference Resolution at Book Scale ACL2025
【速读】: 该论文试图解决现有共指消解(Coreference Resolution)评估基准在处理长文本时的不足,特别是在书籍规模(book scale)下,现有基准如LitBank的长度有限,无法充分评估系统在数万至数十万token级别的共指提及能力。解决方案的关键在于提出一种新颖的自动化管道,用于在完整叙事文本上生成高质量的共指消解标注,并基于此创建首个书籍规模的共指消解基准BOOKCOREF,其平均文档长度超过200,000 tokens。该基准的建立不仅验证了自动标注流程的鲁棒性,还揭示了当前模型在处理长文档时性能显著下降的问题。
链接: https://arxiv.org/abs/2507.12075
作者: Giuliano Martinelli,Tommaso Bonomo,Pere-Lluís Huguet Cabot,Roberto Navigli
机构: Sapienza NLP Group, Sapienza University of Rome (Sapienza NLP 组,罗马大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Main Conference. 19 pages
点击查看摘要
Abstract:Coreference Resolution systems are typically evaluated on benchmarks containing small- to medium-scale documents. When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. We carry out a series of experiments showing the robustness of our automatic procedure and demonstrating the value of our resource, which enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Moreover, we report on the new challenges introduced by this unprecedented book-scale setting, highlighting that current models fail to deliver the same performance they achieve on smaller documents. We release our data and code to encourage research and development of new book-scale Coreference Resolution systems at this https URL.
zh
[NLP-24] StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
【速读】: 该论文试图解决二分类AI生成文本检测的问题,其核心在于通过一种模块化的风格分析流水线来区分机器生成文本与人类撰写的文本。解决方案的关键在于利用公开的spaCy模型进行文本预处理并提取数千个特征(包括n-gram频率等语言学标注信息),随后采用轻量级梯度提升机作为分类器,并基于超过50万条机器生成文本构建大规模训练集以提升分类器性能。该方法遵循非神经网络、计算成本低但具有可解释性的策略,旨在提高检测的有效性和透明度。
链接: https://arxiv.org/abs/2507.12064
作者: Jeremi K. Ochab,Mateusz Matias,Tymoteusz Boba,Tomasz Walkowiak
机构: Institute of Theoretical Physics, Jagiellonian University, Kraków, Poland; M. Kac Center for Complex Systems Research, Jagiellonian University, Kraków, Poland; Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Kraków, Poland; Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This submission to the binary AI detection task is based on a modular stylometric pipeline, where: public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and extracting several thousand features (frequencies of n-grams of the above linguistic annotations); light-gradient boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier’s training. We explore several parameter options to increase the classifier’s capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.
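下面给出该流水线的极简草图:用 spaCy 提取词性(POS)bigram 频率作为文体特征(论文中还包含 NER、依存关系、形态等数千维 n-gram 频率),分类器以 sklearn 的 HistGradientBoostingClassifier 代替 LightGBM。需要先安装 spaCy 并下载 en_core_web_sm 模型;训练样本为少量假设数据。

```python
# 极简示意:频率型文体特征 + 梯度提升树的 AI 文本检测流水线。
import spacy
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

def pos_sequence(text: str) -> str:
    return " ".join(tok.pos_ for tok in nlp(text))   # 把文本转成词性序列

texts = ["The cat sat quietly on the warm windowsill.",
         "Furthermore, the aforementioned methodology demonstrates significant improvements.",
         "I grabbed a coffee and ran to catch the bus.",
         "In conclusion, the results indicate a robust and scalable framework."]
labels = [0, 1, 0, 1]          # 假设:0 = 人类撰写,1 = 机器生成

vec = CountVectorizer(ngram_range=(2, 2))            # 词性 bigram 频率特征
X = vec.fit_transform([pos_sequence(t) for t in texts]).toarray()
clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, labels)
print(clf.predict(X))
```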
zh
[NLP-25] Evaluating the Ability of Large Language Models to Reason about Cardinal Directions Revisited IJCAI
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理关于基数方向(cardinal directions, CDs)推理任务时的性能问题,即评估这些模型在给定特定情境下是否能够准确判断正确的方向。解决方案的关键在于构建一个基于模板的基准测试框架,该框架允许多种变量的引入,如参与代理的移动方式以及情境的视角(第一人称、第二人称或第三人称),从而全面测试LLMs在不同情境下的方向推理能力。
链接: https://arxiv.org/abs/2507.12059
作者: Anthony G Cohn,Robert E Blackwell
机构: University of Leeds(利兹大学); Alan Turing Institute(艾伦·图灵研究所)
类目: Computation and Language (cs.CL)
备注: 8 pages, 5 figures. Accepted at QR 2025 : 38th International Workshop on Qualitative Reasoning at IJCAI
点击查看摘要
Abstract:We investigate the abilities of 28 Large language Models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM’s ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation such as means of locomotion of the agent involved, and whether set in the first, second or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.
zh
[NLP-26] A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans
【速读】: 该论文试图解决评估人类与大型语言模型(Large Language Models, LLMs)在语言创造性方面表现差异的问题。其解决方案的关键是设计了一个综合性语言创造力测试,通过词法生成(派生和复合)和隐喻语言使用等任务,评估被试生成新单词和短语的能力,并利用OCSAI工具对原创性、详尽性和灵活性三个标准进行自动化评价,从而量化比较人类与LLMs的创造力表现。
链接: https://arxiv.org/abs/2507.12039
作者: Anca Dinu,Andra-Maria Florescu,Alina Resceanu
机构: University of Bucharest(布加勒斯特大学); School of Computing and Information(计算与信息学院); University of Craiova(克拉约瓦大学)
类目: Computation and Language (cs.CL)
备注: Accepted for presentation at KES 2025. To appear in Procedia Computer Science (Elsevier)
点击查看摘要
Abstract:The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans in all the assessed criteria, but did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.
zh
[NLP-27] Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis
【速读】: 该论文旨在解决神经语言模型中的数据和参数效率问题,特别是针对模型的表示分析与优化技术进行研究。其解决方案的关键在于通过分析语言表示的性质与动态特性,引入基于表示平滑性的正则化策略,利用雅可比矩阵和海森矩阵来稳定训练过程并降低对输入扰动的敏感性;同时结合主动学习与参数高效微调方法,提出受表示平滑性启发的早停技术,以减少对标注验证集的依赖,并有效降低标注成本和计算资源消耗。此外,还探索了结合上下文学习的弱监督技术,以更有效地利用未标注数据,提升模型在低资源环境和动态数据场景下的泛化能力与鲁棒性。
链接: https://arxiv.org/abs/2507.12004
作者: Josip Jukić
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This thesis addresses challenges related to data and parameter efficiency in neural language models, with a focus on representation analysis and the introduction of new optimization techniques. The first part examines the properties and dynamics of language representations within neural models, emphasizing their significance in enhancing robustness and generalization. It proposes innovative approaches based on representation smoothness, including regularization strategies that utilize Jacobian and Hessian matrices to stabilize training and mitigate sensitivity to input perturbations. The second part focuses on methods to significantly enhance data and parameter efficiency by integrating active learning strategies with parameter-efficient fine-tuning, guided by insights from representation smoothness analysis. It presents smoothness-informed early-stopping techniques designed to eliminate the need for labeled validation sets and proposes innovative combinations of active learning and parameter-efficient fine-tuning to reduce labeling efforts and computational resources. Extensive experimental evaluations across various NLP tasks demonstrate that these combined approaches substantially outperform traditional methods in terms of performance, stability, and efficiency. The third part explores weak supervision techniques enhanced by in-context learning to effectively utilize unlabeled data, further reducing dependence on extensive labeling. It shows that using in-context learning as a mechanism for weak supervision enables models to better generalize from limited labeled data by leveraging unlabeled examples more effectively during training. Comprehensive empirical evaluations confirm significant gains in model accuracy, adaptability, and robustness, especially in low-resource settings and dynamic data environments.
zh
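摘要提到利用雅可比/海森矩阵约束表示的平滑性。下面是一个 PyTorch 的最小示意:用随机投影(Hutchinson 估计)近似输出对输入嵌入的雅可比 Frobenius 范数,并作为正则项加到任务损失上。其中 model 假定直接接受嵌入并返回 logits,正则系数等超参数均为假设,并非该论文的完整方法。

```python
import torch

def jacobian_penalty(model, inputs_embeds, n_proj=1):
    """用 E_v[||J^T v||^2](v ~ N(0, I))近似 ||d logits / d embeds||_F^2。"""
    inputs_embeds = inputs_embeds.detach().clone().requires_grad_(True)
    logits = model(inputs_embeds)                      # 假设:模型直接接受嵌入输入
    penalty = 0.0
    for _ in range(n_proj):
        v = torch.randn_like(logits)
        (grad,) = torch.autograd.grad(
            logits, inputs_embeds, grad_outputs=v,
            create_graph=True, retain_graph=True)
        penalty = penalty + grad.pow(2).flatten(1).sum(-1).mean()
    return penalty / n_proj

# 训练循环中(lam 为假设的超参数):
# loss = task_loss + lam * jacobian_penalty(model, embeds)
```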
[NLP-28] Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
【速读】: 该论文试图解决生成式 AI (Generative AI) 在处理同音异义词(homonyms)时,因简化定义而导致语义不完整的问题,进而可能引发用户误解。其关键解决方案是通过使用直接偏好优化(Direct Preference Optimization)对模型进行微调,以提升不同目标群体(如普通用户、简化需求用户和ELI5用户)在面对同音异义词时的定义质量。
链接: https://arxiv.org/abs/2507.11981
作者: Lukas Ellinger,Miriam Anschütz,Georg Groh
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注: Accepted by RANLP 2025
点击查看摘要
Abstract:Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms, words with multiple meanings, where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.
zh
[NLP-29] Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness
【速读】: 该论文试图解决在由大型语言模型(Large Language Models, LLMs)代理构成的人工社会中,价值观相似性是否如同在人类社会中一样对关系建立起到关键作用的问题。其解决方案的关键在于通过设计有效的模型和提示(prompt)以控制LLM代理的价值观,并在此基础上生成具有特定价值观的代理对,通过对话后对其信任度和人际亲密程度进行分析,从而验证价值观相似性对关系构建的影响。
链接: https://arxiv.org/abs/2507.11979
作者: Yuki Sakamoto,Takahisa Uchida,Hiroshi Ishiguro
机构: The University of Osaka, Graduate School of Engineering Science(大阪大学工学研究科)
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have emerged as powerful tools for simulating complex social phenomena using human-like agents with specific traits. In human societies, value similarity is important for building trust and close relationships; however, it remains unexplored whether this principle holds true in artificial societies comprising LLM agents. Therefore, this study investigates the influence of value similarity on relationship-building among LLM agents through two experiments. First, in a preliminary experiment, we evaluated the controllability of values in LLMs to identify the most effective model and prompt design for controlling the values. Subsequently, in the main experiment, we generated pairs of LLM agents imbued with specific values and analyzed their mutual evaluations of trust and interpersonal closeness following a dialogue. The experiments were conducted in English and Japanese to investigate language dependence. The results confirmed that pairs of agents with higher value similarity exhibited greater mutual trust and interpersonal closeness. Our findings demonstrate that the LLM agent simulation serves as a valid testbed for social science theories, contributes to elucidating the mechanisms by which values influence relationship building, and provides a foundation for inspiring new theories and insights into the social sciences.
zh
[NLP-30] Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker
【速读】: 该论文试图解决如何比较人类与大型语言模型(Large Language Models, LLMs)在不同语境下对语言的理解差异,并将其应用于推理、情感解读和信息检索等任务的问题。其解决方案的关键在于利用LLM构建一个基于语义意义和问题导向提示的图结构文本表示,将文本中的词语分组为节点和边,进而分析眼动数据在关键节点和边上的分布,从而揭示LLMs在语言理解上与人类认知的一致性。
链接: https://arxiv.org/abs/2507.11972
作者: Yuhong Zhang,Jialu Li,Shilai Yang,Yuchen Xu,Gert Cauwenberghs,Tzyy-Ping Jung
机构: University of California, San Diego (加州大学圣地亚哥分校); Hong Kong University of Science and Technology (香港科技大学); Brown University (布朗大学)
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Reading comprehension is a fundamental skill in human cognitive development. With the advancement of Large Language Models (LLMs), there is a growing need to compare how humans and LLMs understand language across different contexts and apply this understanding to functional tasks such as inference, emotion interpretation, and information retrieval. Our previous work used LLMs and human biomarkers to study the reading comprehension process. The results showed that the biomarkers corresponding to words with high and low relevance to the inference target, as labeled by the LLMs, exhibited distinct patterns, particularly when validated using eye-tracking data. However, focusing solely on individual words limited the depth of understanding, which made the conclusions somewhat simplistic despite their potential significance. This study used an LLM-based AI agent to group words from a reading passage into nodes and edges, forming a graph-based text representation based on semantic meaning and question-oriented prompts. We then compare the distribution of eye fixations on important nodes and edges. Our findings indicate that LLMs exhibit high consistency in language understanding at the level of graph topological structure. These results build on our previous findings and offer insights into effective human-AI co-learning strategies.
zh
[NLP-31] Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation
【速读】: 该论文试图解决在低资源语言对中进行有害内容翻译时,标准翻译系统难以保留本地俚语、混语现象及文化嵌入的有害言语标记的问题。其解决方案的关键在于提出一个可复现的两阶段框架:第一阶段通过人工验证的少量示例提示工程,迭代地筛选和排序标注者选择的Singlish目标示例以捕捉微妙的俚语、语气和毒性;第二阶段通过直接翻译和反向翻译评估语义相似性,优化模型与提示的组合。该方法有效提升了翻译质量,并增强了多文化大语言模型的安全性。
链接: https://arxiv.org/abs/2507.11966
作者: Ziyu Ge,Gabriel Chua,Leanne Tan,Roy Ka-Wei Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech. Translating toxic content between low-resource language pairs poses additional challenges due to scarce parallel data and safety filters that sanitize offensive expressions. In this work, we propose a reproducible, two-stage framework for toxicity-preserving translation, demonstrated on a code-mixed Singlish safety corpus. First, we perform human-verified few-shot prompt engineering: we iteratively curate and rank annotator-selected Singlish-target examples to capture nuanced slang, tone, and toxicity. Second, we optimize model-prompt pairs by benchmarking several large language models using semantic similarity via direct and back-translation. Quantitative human evaluation confirms the effectiveness and efficiency of our pipeline. Beyond improving translation quality, our framework contributes to the safety of multicultural LLMs by supporting culturally sensitive moderation and benchmarking in low-resource contexts. By positioning Singlish as a testbed for inclusive NLP, we underscore the importance of preserving sociolinguistic nuance in real-world applications such as content moderation and regional platform governance.
zh
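对应第二阶段“用直接翻译与回译的语义相似度为模型×提示组合打分”,下面给出一个打分函数的示意;其中 sentence-transformers 的模型名仅作举例,正向翻译与回译的调用由外部提供,不在示意范围内。

```python
from sentence_transformers import SentenceTransformer, util

# 多语言句向量模型,模型名仅作示例
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def back_translation_score(source: str, back_translated: str) -> float:
    """源句与回译句的余弦相似度,作为某个 模型×提示 组合的翻译保真度代理分数。"""
    emb = embedder.encode([source, back_translated], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# 用法示意:对每个 (model, prompt) 组合,取一批句子的平均分作为排序依据
```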
[NLP-32] PoTPTQ: A Two-step Power-of-Two Post-training for LLMs ECAI2025
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在部署时因计算资源需求高而面临的挑战,特别是针对Power-of-two (PoT)量化方法在GPU上效率较低的问题。其解决方案的关键在于提出一种新的PoT量化框架,通过优化量化权重的解量化过程,实现更高效的推理速度,同时保持模型的准确性。该框架采用两步后训练算法,首先以稳健的起点初始化量化尺度,然后利用最小校准集对这些尺度进行优化,从而在极低精度格式下取得优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2507.11959
作者: Xinyu Wang,Vahid Partovi Nia,Peng Lu,Jerry Huang,Xiao-Wen Chang,Boxing Chen,Yufei Cui
机构: McGill University, Canada; Huawei Noah’s Ark Lab, Canada; Université de Montréal, Canada; Mila – Quebec AI Institute, Canada
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ECAI 2025 (European Conference on Artificial Intelligence)
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although previous works showed that PoT quantization can be efficiently dequantized on CPUs using fixed-point addition, it is less effective on GPUs. The reason is entanglement of the sign bit and sequential bit manipulations needed for dequantization. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. The performance of our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for floating point inference and leads to a 3.67× speed-up on an NVIDIA V100 and a 1.63× speed-up on an NVIDIA RTX 4090, compared to uniform integer dequantization.
zh
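下面用 PyTorch 给出幂次(Power-of-two)量化与“两步后训练”思路的极简示意:第一步用权重幅值初始化量化尺度,第二步在少量校准数据上最小化线性层输出误差来精调尺度。位宽、指数范围与优化细节都是笔者假设,并非 PoTPTQ 的官方实现。

```python
import torch

def pot_quantize(w, bits=3, scale=None):
    """把权重近似为 scale * sign(w) * 2^e(e 为受限的整数指数)。"""
    if scale is None:
        scale = w.abs().max()                       # 第一步:稳健的尺度初始化
    levels = 2 ** (bits - 1) - 1
    e = torch.clamp(torch.round(torch.log2(w.abs() / scale + 1e-12)), -levels, 0)
    return scale * torch.sign(w) * torch.pow(2.0, e)

def refine_scale(w, x_calib, bits=3, steps=100, lr=1e-2):
    """第二步:在校准输入 x_calib 上精调 scale,使量化前后线性层输出尽量一致。"""
    scale = torch.nn.Parameter(w.abs().max().clone())
    opt = torch.optim.Adam([scale], lr=lr)
    for _ in range(steps):
        w_q = pot_quantize(w, bits, scale)
        loss = ((x_calib @ w.T) - (x_calib @ w_q.T)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scale.detach()
```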
[NLP-33] The benefits of query-based KGQA systems for complex and temporal questions in LLM era
【速读】: 该论文试图解决大型语言模型在多跳推理和时间性问题上的不足,提出了一种基于查询的多阶段知识图谱问答(KGQA)框架,以提高在复杂多跳和时间性基准测试中的性能。解决方案的关键在于引入一种利用思维链(Chain of Thought, CoT)推理的实体链接与谓词匹配方法,并通过多阶段架构增强模型对多跳和时间性问题的处理能力。
链接: https://arxiv.org/abs/2507.11954
作者: Artem Alekseev,Mikhail Chaichuk,Miron Butko,Alexander Panchenko,Elena Tutubalina,Oleg Somov
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 3 figures, 7 tables
点击查看摘要
Abstract:Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of a query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: this https URL
zh
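基于查询的 KGQA 最终要输出可执行查询。下面示意在实体链接(得到 QID)与谓词匹配(得到 PID)之后,如何拼出 WikiData 的 SPARQL,并给出一个用 pq:P585(时间点)限定的时间性查询变体;这种模板写法是常见做法,并非论文给出的具体模板,示例中的 QID/PID 也仅用于说明格式。

```python
def build_sparql(entity_qid, predicate_pid, year=None):
    """由链接到的实体与匹配到的谓词生成 SPARQL;year 不为空时走时间限定分支。"""
    if year is None:
        return f"SELECT ?ans WHERE {{ wd:{entity_qid} wdt:{predicate_pid} ?ans . }}"
    return (
        f"SELECT ?ans WHERE {{ "
        f"wd:{entity_qid} p:{predicate_pid} ?st . "
        f"?st ps:{predicate_pid} ?ans ; pq:P585 ?t . "
        f"FILTER(YEAR(?t) = {year}) }}"
    )

# 例(仅示意查询格式):Q30 的 P1082(人口),按 2015 年的时间点过滤
print(build_sparql("Q30", "P1082", year=2015))
```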
[NLP-34] IAM: Efficient Inference through Attention Mapping between Different-scale LLMs ACL2025
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在处理长上下文时面临的资源消耗问题,尤其是注意力计算和键值缓存(KV cache)使用效率低下的问题。解决方案的关键在于利用不同规模LLMs之间注意力矩阵的高相似性,提出IAM框架,通过在小型和大型LLMs之间进行注意力映射,实现注意力计算加速和KV缓存占用降低的双重优势。
链接: https://arxiv.org/abs/2507.11953
作者: Yi Zhao,Zuchao Li,Hai Zhao
机构: Shanghai Jiao Tong University (上海交通大学); Wuhan University (武汉大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2025
点击查看摘要
Abstract:LLMs encounter significant challenges in resource consumption nowadays, especially with long contexts. Despite extensive efforts dedicated to enhancing inference efficiency, these methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping layers, and whether the mapping is consistent. Based on these insights, we introduce the IAM framework, which achieves dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance. Experiments on different series of models show the generalizability of IAM. Importantly, it is also orthogonal to many existing KV cache optimization methods, making it a versatile addition to the current toolkit for enhancing LLM efficiency.
zh
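IAM 的出发点是不同规模 LLM 的注意力矩阵高度相似。下面的函数示意如何对同一输入计算小模型与大模型各层注意力的余弦相似度矩阵,用来挑选映射层;张量形状以及对注意力头取平均的做法都是简化假设,并非论文的度量方案本身。

```python
import torch
import torch.nn.functional as F

def layer_attention_similarity(attn_small, attn_large):
    """attn_*: [层数, 头数, 序列长, 序列长],要求来自同一输入(序列长一致)。
    返回 [L_small, L_large] 的层间余弦相似度矩阵。"""
    s = F.normalize(attn_small.mean(dim=1).flatten(1), dim=-1)
    l = F.normalize(attn_large.mean(dim=1).flatten(1), dim=-1)
    return s @ l.T

# 用法示意:相似度最高的 (小模型层, 大模型层) 配对可作为候选映射层
```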
[NLP-35] DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression ACL2025
【速读】: 该论文试图解决任务无关的提示压缩问题,旨在通过利用自然语言中的冗余性来降低计算开销并提高提示的信息密度,特别是在长上下文场景中。现有方法主要依赖信息熵作为压缩词法单元的度量标准,以实现最小的信息损失,但忽略了两个关键因素:(i) 算法层面注意力关键标记的重要性,以及 (ii) 压缩过程中信息熵的变化。解决方案的关键在于提出一种动态注意力感知的任务无关提示压缩方法(DAC),该方法有效整合了熵和注意力信息,并在压缩过程中动态感知熵的变化,从而实现细粒度的提示压缩。
链接: https://arxiv.org/abs/2507.11942
作者: Yi Zhao,Zuchao Li,Hai Zhao,Baoyuan Qi,Guoming Liu
机构: Shanghai Jiao Tong University (上海交通大学); Wuhan University (武汉大学); Xiaomi (小米)
类目: Computation and Language (cs.CL)
备注: ACL 2025
点击查看摘要
Abstract:Task-agnostic prompt compression leverages the redundancy in natural language to reduce computational overhead and enhance information density within prompts, especially in long-context scenarios. Existing methods predominantly rely on information entropy as the metric to compress lexical units, aiming to achieve minimal information loss. However, these approaches overlook two critical aspects: (i) the importance of attention-critical tokens at the algorithmic level, and (ii) shifts in information entropy during the compression process. Motivated by these challenges, we propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements across a diverse range of tasks and LLMs, offering compelling evidence of its efficacy.
zh
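下面用一个静态化的打分函数示意“信息量(熵)与注意力信号相结合”的 token 级压缩思路:信息量用 -logP 代理,注意力用该 token 被关注的总量代理,按加权分数保留 top-k 并维持原顺序。论文中的动态熵感知与迭代压缩在此省略,alpha、keep_ratio 均为假设超参数。

```python
import torch

def zscore(x):
    return (x - x.mean()) / (x.std() + 1e-6)

def compress_prompt(tokens, token_logprobs, attn_received, keep_ratio=0.5, alpha=0.5):
    """tokens: token 字符串列表;token_logprobs / attn_received: 同长度的一维张量。"""
    info = -token_logprobs                                  # 信息量(自信息)代理
    score = alpha * zscore(info) + (1 - alpha) * zscore(attn_received)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = torch.topk(score, k).indices.sort().values       # 保留原有相对顺序
    return [tokens[i] for i in keep.tolist()]
```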
[NLP-36] BlockBPE: Parallel BPE Tokenization ICML2025
【速读】: 该论文试图解决传统分词器在大规模语言模型流水线中效率低下的问题,尤其是在GPU上进行批量推理时,现有实现仍受限于CPU性能且效率不足。其解决方案的关键在于提出BlockBPE,这是一种并行化的字节对编码(Byte-Pair Encoding, BPE)GPU实现,通过消除正则表达式预分词步骤,实现了高度并行化的词素合并,从而将整体复杂度从O(n log n)降低至O(nd)(其中d≪n),显著提升了高吞吐量批量推理的性能。
链接: https://arxiv.org/abs/2507.11941
作者: Amos You
机构: 未知
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models (ICML 2025)
点击查看摘要
Abstract:Tokenization is a critical preprocessing step in large language model pipelines, yet widely-used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput, batch inference. Unlike existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI’s tiktoken, whose runtimes are dominated by Regex pre-tokenization and exhibit O(n log n) runtime, BlockBPE eliminates the Regex pre-tokenization, which leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to O(nd) where d ≪ n. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over HuggingFace Tokenizers.
zh
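作为对照,下面给出标准 BPE 在单个词内做贪心合并的串行参考实现;BlockBPE 的核心在于去掉正则预分词并把这些合并放到 GPU 线程块内并行执行,此处不涉及 GPU 内核本身,合并表也是假设的小例子。

```python
def bpe_encode(word, merge_ranks):
    """merge_ranks: {(左片段, 右片段): 优先级},数值越小越先合并。"""
    tokens = list(word)
    while len(tokens) > 1:
        pairs = [(merge_ranks.get((tokens[i], tokens[i + 1]), float("inf")), i)
                 for i in range(len(tokens) - 1)]
        rank, i = min(pairs)
        if rank == float("inf"):          # 没有可用合并规则时停止
            break
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

ranks = {("l", "o"): 0, ("lo", "w"): 1}   # 假设的合并表
print(bpe_encode("lower", ranks))         # ['low', 'e', 'r']
```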
[NLP-37] PolyChartQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
【速读】: 该论文旨在解决现有图表理解基准普遍以英语为中心,限制了其在全球受众中的可访问性和适用性的问题。其解决方案的关键在于构建PolyChartQA,这是一个涵盖22,606张图表和26,151个问答对的多语言图表问答基准,覆盖10种不同的语言。该基准通过解耦的流程将图表数据与渲染代码分离,使得多语言图表可以通过简单翻译数据并复用代码来灵活生成,并利用先进的基于大语言模型(LLM)的翻译技术以及严格的质量控制流程,确保生成的多语言图表在语言和语义上的一致性。
链接: https://arxiv.org/abs/2507.11939
作者: Yichen Xu,Liangyu Chen,Liang Zhang,Wenxuan Wang,Qin Jin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Work in Progress
点击查看摘要
Abstract:Charts are a universally adopted medium for interpreting and communicating data. However, existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. In this paper, we present PolyChartQA, the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. PolyChartQA is built using a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be flexibly generated by simply translating the data and reusing the code. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts. PolyChartQA facilitates systematic evaluation of multilingual chart understanding. Experiments on both open- and closed-source large vision-language models reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. This benchmark lays a foundation for advancing globally inclusive vision-language models.
zh
[NLP-38] A Survey of Deep Learning for Geometry Problem Solving
【速读】: 该论文旨在探讨深度学习在几何问题求解中的应用,以促进数学推理能力的提升及人工智能多模态能力评估的发展。其解决方案的关键在于系统性地总结几何问题求解的相关任务,深入回顾相关的深度学习方法,详细分析评估指标与方法,并批判性讨论当前面临的挑战与未来研究方向。
链接: https://arxiv.org/abs/2507.11936
作者: Jianzhe Ma,Wenxuan Wang,Qin Jin
机构: Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work in progress
点击查看摘要
Abstract:Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: this https URL.
zh
[NLP-39] Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models ACL2025
【速读】: 该论文旨在解决现有指令跟随评估数据集(如IFEval)在多语言场景下的适用性受限问题,这些问题主要源于数据集以英语为主或仅通过机器翻译扩展至其他语言,未能充分考虑语言和文化的本地化需求。其解决方案的关键在于构建一个名为Marco-Bench-MIF的本地化多语言基准,通过结合翻译与验证的混合流程,针对语言约束(如中文的大小写调整)和文化引用(如替换特定地区的公司名称)进行优化,从而提升多语言指令跟随能力的评估准确性。
链接: https://arxiv.org/abs/2507.11882
作者: Bo Zeng,Chenyang Lyu,Sinuo Liu,Mingyan Zeng,Minghao Wu,Xuanfan Ni,Tianqi Shi,Yu Zhao,Yefeng Liu,Chenyu Zhu,Ruizhe Li,Jiahui Geng,Qing Li,Yu Tong,Longyue Wang,Weihua Luo,Kaifu Zhang
机构: Alibaba International Digital Commerce (阿里巴巴国际数字商务); University of Aberdeen (阿伯丁大学); MBZUAI (MBZUAI)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Main Conference paper
点击查看摘要
Abstract:Instruction-following capability has become a major ability to be evaluated for Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine translated to other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on our Marco-Bench-MIF, we found that: (1) there is a 25-35% accuracy gap between high/low-resource languages, (2) model scale largely impacts performance by 45-60%, yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% versus localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Our Marco-Bench-MIF is available at this https URL.
zh
[NLP-40] LLMs Encode Harmfulness and Refusal Separately
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对有害指令时仅通过拒绝行为来体现安全机制,而缺乏对有害性的真正内在理解的问题。其解决方案的关键在于识别出一个与拒绝方向(refusal direction)不同的有害性方向(harmfulness direction),并证明通过操纵这一方向可以使模型将无害指令误判为有害,而操纵拒绝方向则直接引发拒绝响应但不改变模型对有害性的判断。这一发现揭示了模型内部对有害性的表征比其拒绝决策更为稳健,并据此提出了一种基于潜在有害性表示的内在安全防护机制(Latent Guard)。
链接: https://arxiv.org/abs/2507.11878
作者: Jiachen Zhao,Jing Huang,Zhengxuan Wu,David Bau,Weiyan Shi
机构: Northeastern University (东北大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs’ refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model’s judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model’s internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model’s internal belief of harmfulness. These insights lead to a practical safety application: The model’s latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs’ internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety
zh
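“沿某个方向操纵(steering)”通常通过在指定层的隐状态上加一个方向向量来实现。下面是一个 PyTorch 前向钩子的通用示意;方向向量的来源(例如两类样本激活差的均值)、层的选取和系数大小均为假设,这里只展示机制本身,并非论文的具体实现。

```python
import torch

def add_direction_hook(layer, direction, coeff=8.0):
    """在 layer 的输出隐状态上加 coeff * 单位方向向量;返回 handle,可用 handle.remove() 撤销。"""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden

    return layer.register_forward_hook(hook)

# 用法示意(层路径因模型而异,属假设):
# handle = add_direction_hook(model.model.layers[15], some_direction)
# ... 生成并观察行为变化 ...
# handle.remove()
```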
[NLP-41] DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation CCL2025
【速读】: 该论文试图解决自动填空题中干扰项生成的问题,旨在提升生成的干扰项在质量和相关性上的表现。其解决方案的关键在于提出了一种名为DualReward的强化学习框架,该框架采用具有自适应缩放的双重奖励结构,以区分人工创建的标准干扰项与模型生成的候选干扰项,并根据模型性能和置信度动态调整奖励信号强度。
链接: https://arxiv.org/abs/2507.11875
作者: Tianyou Huang,Xinglu Chen,Jingshen Zhang,Xinying Qiu,Ruiying Niu
机构: Guangdong University of Foreign Studies (广东外语外贸大学)
类目: Computation and Language (cs.CL)
备注: Accepted to CCL 2025
点击查看摘要
Abstract:This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.
zh
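下面用一个很小的函数示意“双重奖励 + 自适应缩放”的形态:人工金标干扰项与模型生成候选使用不同的基准奖励,并依据模型当前置信度调整信号强度。具体函数形式、质量分数来源均为笔者假设,并非 DualReward 的官方公式。

```python
def dual_reward(is_gold, quality, model_confidence,
                base_gold=1.0, base_generated=0.5):
    """is_gold: 候选是否为人工金标干扰项;quality ∈ [0,1]:外部质量打分;
    model_confidence ∈ [0,1]:模型对当前样本的置信度。"""
    base = base_gold if is_gold else base_generated
    scale = 1.0 + (1.0 - model_confidence)      # 模型越不自信,奖励信号越强(假设)
    return base * quality * scale

print(dual_reward(is_gold=False, quality=0.8, model_confidence=0.4))  # 0.5 * 0.8 * 1.6 = 0.64
```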
[NLP-42] COLA-GEC: A Bidirectional Framework for Enhancing Grammatical Acceptability and Error Correction
【速读】: 该论文试图解决语法错误修正(Grammatical Error Correction, GEC)与语法可接受性判断(COLA)任务之间缺乏协同优化的问题,尽管两者共享基础的语法知识。解决方案的关键在于提出一种双向框架COLA-GEC,通过相互知识迁移同时提升两个任务的性能:一方面利用GEC数据集增强COLA模型;另一方面通过动态损失函数将COLA信号融入GEC模型训练,从而引导修正结果趋向语法上可接受的输出。
链接: https://arxiv.org/abs/2507.11867
作者: Xiangyu Yang,Xinying Qiu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to CLNLP 2025
点击查看摘要
Abstract:Grammatical Error Correction (GEC) and grammatical acceptability judgment (COLA) are core tasks in natural language processing, sharing foundational grammatical knowledge yet typically evolving independently. This paper introduces COLA-GEC, a novel bidirectional framework that enhances both tasks through mutual knowledge transfer. First, we augment grammatical acceptability models using GEC datasets, significantly improving their performance across multiple languages. Second, we integrate grammatical acceptability signals into GEC model training via a dynamic loss function, effectively guiding corrections toward grammatically acceptable outputs. Our approach achieves state-of-the-art results on several multilingual benchmarks. Comprehensive error analysis highlights remaining challenges, particularly in punctuation error correction, providing insights for future improvements in grammatical modeling.
zh
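摘要中“通过动态损失把可接受性信号注入 GEC 训练”可以用如下最小形式示意:在交叉熵之外加一项由(假设已训练好的)可接受性判别器给出的惩罚,权重随训练步数线性升温。权重调度与惩罚形式均为假设,非论文原式。

```python
import torch

def gec_total_loss(ce_loss, acceptability_logits, step, warmup_steps=1000, max_weight=0.3):
    """ce_loss: 标准 GEC 交叉熵;acceptability_logits: 判别器对修正输出的打分(越大越可接受)。"""
    p_acceptable = torch.sigmoid(acceptability_logits)
    weight = max_weight * min(1.0, step / warmup_steps)     # 动态权重:线性升温
    return ce_loss + weight * (1.0 - p_acceptable).mean()
```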
[NLP-43] Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition
【速读】: 该论文试图解决在自动化文本匿名化过程中准确识别 personally identifiable information (PII) 的问题。其解决方案的关键在于探索跨领域模型迁移、多领域数据融合以及样本高效学习在 PII 识别中的有效性。研究通过在医疗(I2B2)、法律(TAB)和传记(Wikipedia)领域的标注语料上评估模型,分析了模型在域内性能、跨域迁移能力、数据融合效果及小样本学习方面的表现。
链接: https://arxiv.org/abs/2507.11862
作者: Junhong Ye,Xu Yuan,Xinying Qiu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to CLNLP 2025
点击查看摘要
Abstract:Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), we evaluate models across four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Results show legal-domain data transfers well to biographical texts, while medical domains resist incoming transfer. Fusion benefits are domain-specific, and high-quality recognition is achievable with only 10% of training data in low-specialization domains.
zh
[NLP-44] Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
【速读】: 该论文旨在解决自回归语言模型(Autoregressive Language Models)在生成过程中因逐token顺序生成而导致的推理速度和并行性受限的问题。其关键解决方案是通过引入一种新颖框架,利用原始模型对后续token的内在知识,结合多项技术实现多个后续token的同时预测,包括基于掩码输入的联合预测机制、门控LoRA结构、轻量级可学习采样模块、辅助训练损失以及推测生成策略,从而显著提升生成效率而不牺牲质量。
链接: https://arxiv.org/abs/2507.11851
作者: Mohammad Samragh,Arnav Kundu,David Harrison,Kumari Nishu,Devang Naik,Minsik Cho,Mehrdad Farajtabar
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and semantics of text are relatively certain. In this work, we propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens, combining techniques to realize this potential and enable simultaneous prediction of multiple subsequent tokens. Our approach introduces several key innovations: (1) a masked-input formulation where multiple future tokens are jointly predicted from a common prefix; (2) a gated LoRA formulation that preserves the original LLM’s functionality, while equipping it for multi-token prediction; (3) a lightweight, learnable sampler module that generates coherent sequences from the predicted future tokens; (4) a set of auxiliary training losses, including a consistency loss, to enhance the coherence and accuracy of jointly generated tokens; and (5) a speculative generation strategy that expands tokens quadratically in the future while maintaining high fidelity. Our method achieves significant speedups through supervised fine-tuning on pretrained models. For example, it generates code and math nearly 5x faster, and improves general chat and knowledge tasks by almost 2.5x. These gains come without any loss in quality.
zh
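摘要第 (1) 点“掩码输入、从同一前缀联合预测多个未来 token”可以用下面的推理侧示意来理解:在前缀后接 k 个占位 token,一次前向同时读出 k 个位置的预测。这里假设词表中存在可用的占位 token,且模型已按论文方式(gated LoRA、采样模块等)适配;未适配的原始模型直接这样用通常效果有限。

```python
import torch

def predict_next_k(model, tokenizer, prefix, k=4, mask_token="<mask>"):
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    mask_id = tokenizer.convert_tokens_to_ids(mask_token)      # 假设:词表含占位 token
    ids = torch.cat([ids, torch.full((1, k), mask_id)], dim=1)
    with torch.no_grad():
        logits = model(ids).logits                             # [1, T+k, vocab]
    # 位置 T-1 预测第 1 个未来 token,其后每个占位位置预测再下一个
    preds = logits[0, -k - 1:-1].argmax(dim=-1)
    return tokenizer.decode(preds)
```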
[NLP-45] ILID: Native Script Language Identification for Indian Languages
【速读】: 该论文试图解决在噪声、短文本和代码混合环境下进行语言识别(Language Identification)的问题,尤其是在印度多种语言之间存在词汇和语音相似性但又具有显著差异的情况下。其解决方案的关键在于发布了一个包含230,000条句子的数据集,涵盖英语和所有22种官方印度语言,并为每条数据标注了语言标识符;同时,开发并发布了基于机器学习和深度学习的鲁棒基线模型,这些模型在语言识别任务中表现与当前最先进模型相当。
链接: https://arxiv.org/abs/2507.11832
作者: Yash Ingle,Pruthwik Mishra
机构: Sardar Vallabhbhai National Institute of Technology (萨达尔·瓦拉巴伊国立技术学院)
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure, 7 tables, Paper accepted in RANLP 2025
点击查看摘要
Abstract:The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question and answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in case of diverse Indian languages that exhibit lexical and phonetic similarities, but have distinct differences. Many Indian languages share the same script making the task even more challenging. In this paper, we release a dataset of 230K sentences consisting of English and all 22 official Indian languages labeled with their language identifiers where data in most languages are newly created. We also develop and release robust baseline models using state-of-the-art approaches in machine learning and deep learning that can aid the research in this field. Our baseline models are comparable to the state-of-the-art models for the language identification task.
zh
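摘要提到的“机器学习基线”在语言识别任务上通常可以用字符 n-gram 特征加线性分类器来实现。下面给出一个 scikit-learn 的示意流水线,并非论文发布的基线本身;标签格式亦为假设。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 字符 n-gram + 逻辑回归的语言识别基线示意
lang_id = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4), min_df=2),
    LogisticRegression(max_iter=1000),
)

# lang_id.fit(train_sentences, train_labels)         # labels 形如 "hin", "tam", "eng"(假设)
# print(lang_id.predict(["यह एक उदाहरण वाक्य है।"]))   # 期望输出印地语标签
```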
[NLP-46] Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理竞争性事实与反事实信息时的机制问题,特别是关注注意力头(attention heads)在这一过程中的作用。其解决方案的关键在于通过机械可解释性工具,研究注意力头强度与事实性输出比例之间的关系,评估关于注意力头抑制机制的竞争假设,并探讨这些注意力模式的领域特异性。研究发现,促进事实性输出的注意力头主要通过通用的复制抑制机制实现,而非选择性地抑制反事实信息,同时表明注意力头的行为具有领域依赖性。
链接: https://arxiv.org/abs/2507.11809
作者: Dante Campregher,Yanxu Chen,Sander Hoffman,Maria Heuss
机构: University of Amsterdam(阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 Pages, 13 figures
点击查看摘要
Abstract:This paper presents a reproducibility study examining how Large Language Models (LLMs) manage competing factual and counterfactual information, focusing on the role of attention heads in this process. We attempt to reproduce and reconcile findings from three recent studies by Ortu et al., Yu, Merullo, and Pavlick and McDougall et al. that investigate the competition between model-learned facts and contradictory context information through Mechanistic Interpretability tools. Our study specifically examines the relationship between attention head strength and factual output ratios, evaluates competing hypotheses about attention heads’ suppression mechanisms, and investigates the domain specificity of these attention patterns. Our findings suggest that attention heads promoting factual output do so via general copy suppression rather than selective counterfactual suppression, as strengthening them can also inhibit correct facts. Additionally, we show that attention head behavior is domain-dependent, with larger models exhibiting more specialized and category-sensitive patterns.
zh
[NLP-47] Simulated Language Acquisition in a Biologically Realistic Model of the Brain
【速读】: 该论文试图解决神经科学中一个核心问题,即如何从神经元的尖峰活动精确地解释高级认知现象,如计划和语言。其解决方案的关键在于提出了一种简洁的数学形式化,整合了六项被广泛接受的神经科学基本原理:兴奋性神经元、脑区、随机突触、赫布可塑性、局部抑制和区际抑制。基于这一形式化,研究者构建了一个模拟类脑系统,该系统能够通过少量具身句子的学习,实现语言的基本习得,包括词汇语义、句法角色及词序,并具备生成新句子的能力。
链接: https://arxiv.org/abs/2507.11788
作者: Daniel Mitropolsky,Christos Papadimitriou
机构: MIT(麻省理工学院); Columbia University(哥伦比亚大学)
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)
备注: 13 pages, 6 figures
点击查看摘要
Abstract:Despite tremendous progress in neuroscience, we do not have a compelling narrative for the precise way whereby the spiking of neurons in our brain results in high-level cognitive phenomena such as planning and language. We introduce a simple mathematical formulation of six basic and broadly accepted principles of neuroscience: excitatory neurons, brain areas, random synapses, Hebbian plasticity, local inhibition, and inter-area inhibition. We implement a simulated neuromorphic system based on this formalism, which is capable of basic language acquisition: Starting from a tabula rasa, the system learns, in any language, the semantics of words, their syntactic role (verb versus noun), and the word order of the language, including the ability to generate novel sentences, through the exposure to a modest number of grounded sentences in the same language. We discuss several possible extensions and implications of this result.
zh
[NLP-48] AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles
【速读】: 该论文旨在解决新闻文章中句子的主观性检测问题,即在单语、多语和零样本设置下对句子进行主观/客观分类。其关键解决方案是通过将情感评分(由辅助模型生成)与句子表示相结合,增强基于Transformer的分类器性能,从而超越标准微调方法。此外,为应对不同语言中普遍存在的类别不平衡问题,采用了在开发集上优化的决策阈值校准方法。
链接: https://arxiv.org/abs/2507.11764
作者: Matteo Fasulo,Luca Babboni,Luca Tedeschini
机构: University of Bologna(博洛尼亚大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 14 pages, 6 figures, accepted at CLEF 2025 CheckThat! Lab
点击查看摘要
Abstract:This paper presents AI Wizards’ participation in the CLEF 2025 CheckThat! Lab Task 1: Subjectivity Detection in News Articles, classifying sentences as subjective/objective in monolingual, multilingual, and zero-shot settings. Training/development datasets were provided for Arabic, German, English, Italian, and Bulgarian; final evaluation included additional unseen languages (e.g., Greek, Romanian, Polish, Ukrainian) to assess generalization. Our primary strategy enhanced transformer-based classifiers by integrating sentiment scores, derived from an auxiliary model, with sentence representations, aiming to improve upon standard fine-tuning. We explored this sentiment-augmented architecture with mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B. To address class imbalance, prevalent across languages, we employed decision threshold calibration optimized on the development set. Our experiments show sentiment feature integration significantly boosts performance, especially subjective F1 score. This framework led to high rankings, notably 1st for Greek (Macro F1 = 0.51).
zh
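核心做法“把辅助模型的情感分数与句子表示拼接后再分类”可以写成如下的最小模块;encoder 假定为 Hugging Face 风格、返回 last_hidden_state 的编码器,情感分数维度与类别数均为假设,并非参赛系统的原始实现。

```python
import torch
import torch.nn as nn

class SentimentAugmentedClassifier(nn.Module):
    def __init__(self, encoder, hidden_size, n_sentiment=3, n_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size + n_sentiment, n_classes)

    def forward(self, input_ids, attention_mask, sentiment_scores):
        # sentiment_scores: [B, n_sentiment],由辅助情感模型离线计算
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                         # 取 [CLS] 位置的句子表示
        return self.head(torch.cat([cls, sentiment_scores], dim=-1))
```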
[NLP-49] CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks
【速读】: 该论文试图解决在不重新执行的情况下理解和解析数据科学和机器学习Python笔记本的问题,特别是在处理数据和软件依赖关系时的挑战。其关键解决方案是提出了一种名为Capture and Resolve Assisted Bounding Strategy (CRABS) 的方法,该方法结合了有限的语法分析与大型语言模型(LLM)的全貌理解能力,通过浅层语法解析和抽象语法树(AST)分析来捕捉笔记本中单元格间输入输出集合的上下界估计,随后利用LLM进行逐单元零样本学习以解决剩余的模糊性,从而准确识别每个单元格的真实数据输入和输出。
链接: https://arxiv.org/abs/2507.11742
作者: Meng Li,Timothy M. McPhillips,Dingmin Wang,Shin-Rong Tsai,Bertram Ludäscher
机构: School of Information Sciences, University of Illinois Urbana-Champaign (信息科学学院,伊利诺伊大学厄巴纳-香槟分校); Department of Computer Science, University of Oxford (计算机科学系,牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Accepted to COLM 2025
点击查看摘要
Abstract:Recognizing the information flows and operations comprising data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting notebooks for new tasks. Investigating a notebook via re-execution often is impractical due to the challenges of resolving data and software dependencies. While Large Language Models (LLMs) pre-trained on large codebases have demonstrated effectiveness in understanding code without running it, we observe that they fail to understand some realistic notebooks due to hallucinations and long-context challenges. To address these issues, we propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook, and demonstrate the effectiveness of a pincer strategy that uses limited syntactic analysis to assist full comprehension of the notebook using an LLM. Our Capture and Resolve Assisted Bounding Strategy (CRABS) employs shallow syntactic parsing and analysis of the abstract syntax tree (AST) to capture the correct interpretation of a notebook between lower and upper estimates of the inter-cell I/O sets, then uses an LLM to resolve remaining ambiguities via cell-by-cell zero-shot learning, thereby identifying the true data inputs and outputs of each cell. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks that together represent 3454 actual cell inputs and outputs. The LLM correctly resolves 1397 of 1425 (98%) ambiguities left by analyzing the syntactic structure of these notebooks. Across 50 notebooks, CRABS achieves average F1 scores of 98% identifying cell-to-cell information flows and 99% identifying transitive cell execution dependencies.
zh
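CRABS 中“有限的语法分析”一侧,大致可以用 Python 标准库 ast 对每个 cell 估计读写变量集合来理解;下面的函数只是这种估计的粗略示意(真实系统还需处理属性访问、导入、魔法命令等),剩余歧义再交给 LLM 逐 cell 消解。

```python
import ast

def cell_io_estimate(cell_source: str):
    """返回 (可能的输入变量, 本 cell 定义/写入的变量) 的粗略估计。"""
    tree = ast.parse(cell_source)
    loads, stores = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (loads if isinstance(node.ctx, ast.Load) else stores).add(node.id)
    inputs = loads - stores            # 只读不写的名字,视为来自其他 cell 的输入候选
    return inputs, stores

print(cell_io_estimate("df2 = df[df['price'] > 0]\nprint(df2.shape)"))
# 输出类似 ({'df', 'print'}, {'df2'})
```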
[NLP-50] ExpliCIT-QA: Explainable Code-Based Image Table Question Answering
【速读】: 该论文试图解决端到端表格视觉问答(TableVQA)系统中解释性不足的问题,特别是在需要审计结果的敏感领域如金融和医疗中,缺乏透明度和可解释性。其解决方案的关键在于提出ExpliCIT-QA系统,该系统采用模块化设计,包括多模态表格理解、基于语言的推理、自动代码生成、代码执行以及自然语言解释,从而实现对答案计算过程的全面透明化和可审计性。
链接: https://arxiv.org/abs/2507.11694
作者: Maximiliano Hormazábal Lagos,Álvaro Bueno Sáez,Pedro Alonso Doval,Jorge Alcalde Vesteiro,Héctor Cerezo-Costas
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has been accepted for presentation at the 24nd Portuguese Conference on Artificial Intelligence (EPIA 2025) and will be published in the proceedings by Springer in the Lecture Notes in Computer Science (LNCS) series. Please cite the published version when available
点击查看摘要
Abstract:We present ExpliCIT-QA, a system that extends our previous MRT approach for tabular question answering into a multimodal pipeline capable of handling complex table images and providing explainable answers. ExpliCIT-QA follows a modular design, consisting of: (1) Multimodal Table Understanding, which uses a Chain-of-Thought approach to extract and transform content from table images; (2) Language-based Reasoning, where a step-by-step explanation in natural language is generated to solve the problem; (3) Automatic Code Generation, where Python/Pandas scripts are created based on the reasoning steps, with feedback for handling errors; (4) Code Execution to compute the final answer; and (5) Natural Language Explanation that describes how the answer was computed. The system is built for transparency and auditability: all intermediate outputs, parsed tables, reasoning steps, generated code, and final answers are available for inspection. This strategy works towards closing the explainability gap in end-to-end TableVQA systems. We evaluated ExpliCIT-QA on the TableVQA-Bench benchmark, comparing it with existing baselines. We demonstrated improvements in interpretability and transparency, which open the door for applications in sensitive domains like finance and healthcare where auditing results are critical.
zh
[NLP-51] MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization
【速读】: 该论文试图解决大型语言模型在代码质量分析中的不足,特别是其受限于静态训练数据且难以适应不断演变的最佳实践问题。解决方案的关键在于提出MetaLint框架,该框架将代码质量分析任务转化为基于高层规范检测和修复有问题的语义代码片段或代码惯用法。与传统方法不同,MetaLint通过在合成linter生成的数据上进行指令微调,实现了从简单到复杂的泛化能力,使模型能够在不重新训练的情况下适应新颖或复杂的代码模式。
链接: https://arxiv.org/abs/2507.11687
作者: Atharva Naik,Lawanya Baghel,Dhakshin Govindarajan,Darsh Agrawal,Daniel Fried,Carolyn Rose
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models, though successful in code generation, struggle with code quality analysis because they are limited by static training data and can’t easily adapt to evolving best practices. We introduce MetaLint, a new instruction-following framework that formulates code quality analysis as the task of detecting and fixing problematic semantic code fragments or code idioms based on high-level specifications. Unlike conventional approaches that train models on static, rule-based data, MetaLint employs instruction tuning on synthetic linter-generated data to support easy-to-hard generalization, enabling models to adapt to novel or complex code patterns without retraining. To evaluate this, we construct a benchmark of challenging idioms inspired by real-world coding standards such as Python Enhancement Proposals (PEPs) and assess whether MetaLint-trained models reason adaptively or simply memorize. Our results show that MetaLint improves generalization to unseen PEP idioms, achieving a 70.37% F-score on idiom detection with the highest recall (70.43%) among all evaluated models. It also achieves 26.73% on localization, competitive for its 4B parameter size and comparable to larger state-of-the-art models like o3-mini, highlighting its potential for future-proof code quality analysis.
zh
[NLP-52] Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
【速读】: 该论文试图解决在缺乏明确成功标准的领域(如计算机使用)中,如何有效利用验证器(verifiers)来评估智能体行为的问题。传统验证器在数学和棋类游戏等明确目标的领域表现优异,但在需要人类直觉判断的场景中难以转化为可扩展的规则。论文提出的解决方案是利用多模态大语言模型(MLLMs)作为验证器,但发现其存在一种关键限制——共识偏差(agreement bias),即MLLMs倾向于依赖上下文窗口内的信息,从而生成合理化错误行为的思维链。为解决这一问题,论文提出Self-Grounded Verification (SGV),其关键在于通过MLLM自身的采样机制,先独立于待评估数据获取任务完成的广泛先验知识,再基于这些自生成的先验知识对候选轨迹进行推理与评估,从而提升验证效果。
链接: https://arxiv.org/abs/2507.11662
作者: Moises Andrade,Joonhyuk Cha,Brandon Ho,Vriksha Srihari,Karmesh Yadav,Zsolt Kira
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: Our code and data are publicly available at this https URL
点击查看摘要
Abstract:Verifiers – functions assigning rewards to agent behavior – have been key for AI progress in domains like math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators (e.g., data filtering). Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs’ knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is elicited to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena – setting a new state of the art on the benchmark, surpassing the previous best by 48%.
zh
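SGV 的两步流程可以用两次调用来示意:第一步不给任何待评数据,让模型写出“完成该任务应当是什么样”的一般性先验;第二步把自生成的先验与候选轨迹一起送入做判断。这里的 llm(prompt) 假定为一个返回字符串的调用封装,提示词措辞也是假设,并非论文的原始提示。

```python
def self_grounded_verification(llm, task_description, trajectory):
    # 第一步:仅基于任务描述获取先验,不看待评估的轨迹
    priors = llm(
        "In general terms, describe what a successful attempt at the following "
        f"task looks like, and list concrete success criteria.\nTask: {task_description}"
    )
    # 第二步:以自生成先验为条件,评估候选轨迹
    verdict = llm(
        f"Task: {task_description}\n"
        f"Success criteria (written earlier, independently of any attempt):\n{priors}\n"
        f"Candidate trajectory:\n{trajectory}\n"
        "Does the trajectory satisfy the criteria? Answer yes or no, then justify briefly."
    )
    return priors, verdict
```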
[NLP-53] Partitioner Guided Modal Learning Framework
【速读】: 该论文旨在解决多模态学习中如何有效融合和区分单模态与跨模态特征的问题,以提升模型在多种下游任务中的性能。其解决方案的关键在于提出一种基于分割器引导的多模态学习框架PgM,该框架通过模态分割器将学习到的模态表示划分为单模态和配对模态特征,并分别由单模态学习器和配对模态学习器进行处理,最终通过单-配对模态解码器重构模态表示,从而实现对单模态和配对模态特征的全面学习与灵活调整。
链接: https://arxiv.org/abs/2507.11661
作者: Guimin Hu,Yi Xin,Lijie Hu,Zhihong Zhu,Hasti Seifi
机构: Guangdong University of Technology (广东工业大学); University of Copenhagen (哥本哈根大学); Nanjing University (南京大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Tencent (腾讯); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: acm multimedia 2025
点击查看摘要
Abstract:Multimodal learning benefits from multiple modal information, and each learned modal representations can be divided into uni-modal that can be learned from uni-modal training and paired-modal features that can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. Modal partitioner segments the learned modal representation into uni-modal and paired-modal features. Modal learner incorporates two dedicated components for uni-modal and paired-modal learning. Uni-paired modal decoder reconstructs modal representation based on uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.
zh
[NLP-54] Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation
【速读】: 该论文试图解决在波斯语中进行跨语言情感分析的问题,特别是在数据量有限的情况下。解决方案的关键在于结合少量样本学习(few-shot learning)和增量学习(incremental learning)方法,并利用高资源语言的先验知识来提升模型在波斯语上的性能。研究采用三种预训练多语言模型(XLM-RoBERTa、mDeBERTa 和 DistilBERT),并在来自不同来源的少量波斯语数据上进行微调,从而实现高精度的情感分析,实验结果表明 mDeBERTa 和 XLM-RoBERTa 在波斯语情感分析任务中分别达到了 96% 的准确率。
链接: https://arxiv.org/abs/2507.11634
作者: Farideh Majidi,Ziaeddin Beheshtifard
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Proceedings of the First National Conference on Artificial Intelligence and Emerging Research: Convergence of Humans and Intelligent Systems
点击查看摘要
Abstract:This research examines cross-lingual sentiment analysis using few-shot learning and incremental learning methods in Persian. The main objective is to develop a model capable of performing sentiment analysis in Persian using limited data, while getting prior knowledge from high-resource languages. To achieve this, three pre-trained multilingual models (XLM-RoBERTa, mDeBERTa, and DistilBERT) were employed, which were fine-tuned using few-shot and incremental learning approaches on small samples of Persian data from diverse sources, including X, Instagram, Digikala, Snappfood, and Taaghche. This variety enabled the models to learn from a broad range of contexts. Experimental results show that the mDeBERTa and XLM-RoBERTa achieved high performances, reaching 96% accuracy on Persian sentiment analysis. These findings highlight the effectiveness of combining few-shot learning and incremental learning with multilingual pre-trained models.
zh
[NLP-55] Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
【速读】: 该论文试图解决当前生成式AI(Generative AI)模型在安全防护方面的漏洞问题,特别是针对通过微调(fine-tuning)技术可能被恶意利用的风险。论文指出,无论采用开放权重还是封闭微调API的方式,都可能把模型微调成失去安全限制、对任意请求都予以配合的“helpful-only”模型;其论证的关键在于提出的“jailbreak-tuning”方法,该方法能使模型对有害请求生成详细且高质量的响应,从而绕过现有的内容审核系统。该方法不仅提高了攻击的隐蔽性,还增强了攻击的严重性,表明现有模型的安全防护机制存在显著缺陷。
链接: https://arxiv.org/abs/2507.11630
作者: Brendan Murphy,Dillon Bowen,Shahrad Mohammadzadeh,Julius Broomfield,Adam Gleave,Kellin Pelrine
机构: FAR.AI(非营利人工智能研究实验室); Mila – Quebec AI Institute(魁北克人工智能研究所); McGill University(麦吉尔大学); Georgia Tech(佐治亚理工学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models. In contrast to prior work which is blocked by modern moderation systems or achieved only partial removal of safeguards or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks, while stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attack and potentially defenses in the input and weight spaces. Not only are these models vulnerable, more recent ones also appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.
zh
[NLP-56] MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering
【速读】: 该论文试图解决当前视觉问答研究中对地图(Map-VQA)理解能力的不足,特别是针对choropleth地图的过度关注,而忽视了其他类型的地图如cartograms和proportional symbol maps。解决方案的关键是引入MapIQ,一个包含14,706个问题-答案对的基准数据集,涵盖三种地图类型和六个主题,以全面评估多模态大语言模型(MLLMs)在不同地图类型上的表现,并通过改变地图设计来分析模型的鲁棒性和对地理知识的依赖性。
链接: https://arxiv.org/abs/2507.11625
作者: Varun Srivastava,Fan Lei,Srija Mukhopadhyay,Vivek Gupta,Ross Maciejewski
机构: Arizona State University (亚利桑那州立大学); International Institute of Information Technology (国际信息科技学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published as a conference paper at COLM 2025
点击查看摘要
Abstract:Recent advancements in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts, scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, which cover only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types: choropleth maps, cartograms, and proportional symbol maps spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. An additional experiment examining the impact of map design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.
zh
[NLP-57] Subjective Evaluation Profile Analysis of Science Fiction Short Stories and its Critical-Theoretical Significance
【速读】: 该论文试图解决如何评估大型语言模型(Large Language Models, LLMs)在文学评价中的主观性与一致性问题,以及它们是否能够表现出类似人类文学批评流派的个体评价特征。其解决方案的关键在于通过设计一个七次会话的同日内实验协议,利用原创的科幻小说语料库,最小化外部偏见的影响,从而观察LLMs在文学判断中隐含的价值体系,特别是由人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)所塑造的评价模式。
链接: https://arxiv.org/abs/2507.11582
作者: Kazuyoshi Otsuka
机构: 未知
类目: Computation and Language (cs.CL)
备注: 38 pages. Manuscript submitted for review to the Journal of Computational Literary Studies (JCLS)
点击查看摘要
Abstract:This study positions large language models (LLMs) as “subjective literary critics” to explore aesthetic preferences and evaluation patterns in literary assessment. Ten Japanese science fiction short stories were translated into English and evaluated by six state-of-the-art LLMs across seven independent sessions. Principal component analysis and clustering techniques revealed significant variations in evaluation consistency (α ranging from 1.00 to 0.35) and five distinct evaluation patterns. Additionally, evaluation variance across stories differed by up to 4.5-fold, with TF-IDF analysis confirming distinctive evaluation vocabularies for each model. Our seven-session within-day protocol using an original Science Fiction corpus strategically minimizes external biases, allowing us to observe implicit value systems shaped by RLHF and their influence on literary judgment. These findings suggest that LLMs may possess individual evaluation characteristics similar to human critical schools, rather than functioning as neutral benchmarkers.
zh
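针对上文的主成分分析与聚类流程,下面给出一个示意性最小代码草图(评分矩阵为随机占位数据,流程与库函数为常见做法的假设,非论文官方实现),展示如何对"模型 × 故事"的评分矩阵做 PCA 降维并聚类出不同的评价模式:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 占位数据:6 个模型对 10 篇故事的平均评分(随机生成,非论文真实数据)
rng = np.random.default_rng(0)
scores = rng.uniform(1, 10, size=(6, 10))   # 行:模型,列:故事

# 先对每个模型的评分做标准化,再用 PCA 提取主要评价维度
scores_z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)
pca = PCA(n_components=2)
embedding = pca.fit_transform(scores_z)      # 每个模型在"评价空间"中的坐标

# 用 KMeans 把模型聚成若干"评价模式"(簇数为假设值)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print("explained variance:", pca.explained_variance_ratio_)
print("cluster labels per model:", labels)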
[NLP-58] Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening
【速读】: 该论文试图解决生成式 AI 在简历筛选中是否真正具备评估能力的问题,而不仅仅是表面上的无偏性。研究对八个主要AI平台进行了两部分审计,发现部分模型虽然看似无偏,但实际上无法进行实质性评估,仅依赖于表面的关键词匹配。论文提出"中立性幻觉"(Illusion of Neutrality)这一概念,用以描述这种因模型缺乏有效判断能力而导致的表面无偏现象。解决方案的关键在于采用双重验证框架,对AI招聘工具进行人口统计偏差和可证明能力的双重审计,以确保其公平性和有效性。
链接: https://arxiv.org/abs/2507.11548
作者: Kevin T Webster
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 58 pages, 4 figures
点击查看摘要
Abstract:The increasing use of generative AI for resume screening is predicated on the assumption that it offers an unbiased alternative to biased human decision-making. However, this belief fails to address a critical question: are these AI systems fundamentally competent at the evaluative tasks they are meant to perform? This study investigates the question of competence through a two-part audit of eight major AI platforms. Experiment 1 confirmed complex, contextual racial and gender biases, with some models penalizing candidates merely for the presence of demographic signals. Experiment 2, which evaluated core competence, provided a critical insight: some models that appeared unbiased were, in fact, incapable of performing a substantive evaluation, relying instead on superficial keyword matching. This paper introduces the “Illusion of Neutrality” to describe this phenomenon, where an apparent lack of bias is merely a symptom of a model’s inability to make meaningful judgments. This study recommends that organizations and regulators adopt a dual-validation framework, auditing AI hiring tools for both demographic bias and demonstrable competence to ensure they are both equitable and effective.
zh
计算机视觉
[CV-0] PhysX: Physical-Grounded 3D Asset Generation
【速读】:该论文旨在解决现有3D生成模型在物理属性建模方面的不足,即当前研究主要关注几何和纹理,而忽视了物理属性的建模,导致生成的3D资产缺乏真实物理特性,限制了其在物理领域如仿真和具身人工智能中的应用。解决方案的关键在于提出PhysX,一个端到端的物理接地3D资产生成范式,其中包含两个核心部分:首先,构建了PhysXNet,这是首个在五个基础维度(绝对尺度、材料、可供性、运动学和功能描述)上系统标注的物理接地3D数据集,并采用基于视觉-语言模型的人机协同标注流程以高效生成物理优先的3D资产;其次,设计了PhysXGen,一种将物理知识注入预训练3D结构空间的前馈框架,通过双分支架构显式建模3D结构与物理属性之间的潜在关联,从而生成具有合理物理预测且保留原始几何质量的3D资产。
链接: https://arxiv.org/abs/2507.12465
作者: Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu
机构: Nanyang Technological University (南洋理工大学); Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose PhysX, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet, the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose PhysXGen, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.
zh
[CV-1] CytoSAE: Interpretable Cell Embeddings for Hematology
【速读】:该论文旨在解决医学影像领域中基础模型可解释性不足的问题,特别是针对血液学领域的应用场景。其关键解决方案是提出了一种名为CytoSAE的稀疏自编码器(Sparse Autoencoder, SAE),该模型在超过40,000张外周血单细胞图像上进行训练,能够泛化至多种域内和域外数据集,并识别出具有形态学相关性的视觉概念,从而实现对模型推理过程的可解释性分析。此外,CytoSAE能够在患者和疾病层面生成特定概念,支持在局部区域检测病理性细胞和异常情况,并在AML亚型分类任务中表现出与最先进方法相当的性能,同时提供亚细胞级别的解释能力。
链接: https://arxiv.org/abs/2507.12464
作者: Muhammed Furkan Dasdelen,Hyesu Lim,Michele Buck,Katharina S. Götze,Carsten Marr,Steffen Schneider
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Sparse autoencoders (SAEs) emerged as a promising tool for mechanistic interpretability of transformer-based foundation models. Very recently, SAEs were also adopted for the visual domain, enabling the discovery of visual concepts and their patch-wise attribution to tokens in the transformer model. While a growing number of foundation models emerged for medical imaging, tools for explaining their inferences are still lacking. In this work, we show the applicability of SAEs for hematology. We propose CytoSAE, a sparse autoencoder which is trained on over 40,000 peripheral blood single-cell images. CytoSAE generalizes to diverse and out-of-domain datasets, including bone marrow cytology, where it identifies morphologically relevant concepts which we validated with medical experts. Furthermore, we demonstrate scenarios in which CytoSAE can generate patient-specific and disease-specific concepts, enabling the detection of pathognomonic cells and localized cellular abnormalities at the patch level. We quantified the effect of concepts on a patient-level AML subtype classification task and show that CytoSAE concepts reach performance comparable to the state-of-the-art, while offering explainability on the sub-cellular level. Source code and model weights are available at this https URL.
zh
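CytoSAE 的核心组件是在冻结的基础模型嵌入上训练的稀疏自编码器。下面是一个极简的 SAE 草图(嵌入维度、概念数与稀疏系数均为笔者假设,非论文官方实现),用于说明"重建损失 + L1 稀疏惩罚"的基本训练目标:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # 在冻结的视觉基础模型输出的单细胞/patch 嵌入上学习稀疏概念字典(示意)
    def __init__(self, d_model=768, n_concepts=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_model, bias=False)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # 非负稀疏激活,每一维对应一个候选概念
        x_hat = self.decoder(z)           # 用概念字典重建原始嵌入
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef=1e-3):
    recon = ((x - x_hat) ** 2).mean()     # 重建误差
    sparsity = z.abs().mean()             # L1 稀疏惩罚,鼓励每个样本只激活少量概念
    return recon + l1_coef * sparsity

# 用法示意:x 代表基础模型给出的外周血单细胞图像嵌入(此处用随机张量占位)
sae = SparseAutoencoder()
x = torch.randn(32, 768)
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
loss.backward()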
[CV-2] MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding
【速读】:该论文试图解决自动驾驶中人类行为理解缺乏全面基准的问题,旨在提供一个大规模的基准以评估人类行为分析。解决方案的关键在于提出了MMHU,这是一个包含丰富标注的大型基准数据集,涵盖人类运动、轨迹、文本描述、意图及与驾驶安全相关的关键行为标签,数据来源于多种渠道并采用人机协作的标注流程生成详细的行为描述。
链接: https://arxiv.org/abs/2507.12463
作者: Renjie Li,Ruijie Ye,Mingyang Wu,Hao Frank Yang,Zhiwen Fan,Hezhen Hu,Zhengzhong Tu
机构: Texas A&M University (德克萨斯A&M大学); Brown University (布朗大学); Johns Hopkins University (约翰霍普金斯大学); UT Austin (德州大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior, such as motion, trajectories, and intention, a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text description for human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks, ranging from motion prediction to motion generation and human behavior question answering, thereby offering a broad evaluation suite. Project page: this https URL.
zh
[CV-3] SpatialTrackerV2: 3D Point Tracking Made Easy ICCV2025
【速读】:该论文试图解决单目视频中的3D点跟踪问题,旨在提高跟踪的准确性与效率。其解决方案的关键在于提出一种统一的端到端架构,将点跟踪、单目深度估计和相机位姿估计内在地联系起来,从而实现对世界空间中3D运动的有效分解与建模。该方法通过联合学习几何与运动信息,实现了在多种数据集上的可扩展训练,并在性能上显著优于现有3D跟踪方法,同时保持了较高的运行速度。
链接: https://arxiv.org/abs/2507.12462
作者: Yuxi Xiao,Jianyuan Wang,Nan Xue,Nikita Karaev,Yuri Makarov,Bingyi Kang,Xing Zhu,Hujun Bao,Yujun Shen,Xiaowei Zhou
机构: Zhejiang University (浙江大学); Oxford (牛津大学); Ant Group (蚂蚁集团); Pixelwise AI (像素智能AI); Bytedance Seed (字节跳动种子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Computer Vision, ICCV 2025. Huggingface Demo: this https URL , Code: this https URL
点击查看摘要
Abstract:We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feedforward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50x faster.
zh
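摘要中"将世界坐标系下的 3D 运动分解为场景几何、相机自运动与逐像素物体运动",其组合方式可以用下面的示意函数表达(内参、位姿与符号约定均为笔者假设,仅说明分解思路,非论文官方实现):

import torch

def compose_world_motion(uv, depth, K_inv, cam_pose, object_motion):
    # uv: (N, 2) 像素坐标; depth: (N,) 深度(场景几何)
    # K_inv: (3, 3) 相机内参逆矩阵; cam_pose: (4, 4) 相机到世界的变换(相机自运动)
    # object_motion: (N, 3) 逐像素物体位移
    ones = torch.ones_like(depth).unsqueeze(-1)
    pix_h = torch.cat([uv, ones], dim=-1)                  # 齐次像素坐标
    pts_cam = (K_inv @ pix_h.T).T * depth.unsqueeze(-1)    # 反投影到相机坐标系
    pts_cam_h = torch.cat([pts_cam, ones], dim=-1)
    pts_world = (cam_pose @ pts_cam_h.T).T[:, :3]          # 变换到世界坐标系
    return pts_world + object_motion                       # 再叠加物体自身的运动

# 用法示意(随机占位输入)
uv = torch.rand(100, 2) * 256
depth = torch.rand(100) * 10
pts = compose_world_motion(uv, depth, torch.eye(3), torch.eye(4), torch.zeros(100, 3))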
[CV-4] Interpreting Radiologists Intention from Eye Movements in Chest X-ray Diagnosis ACM-MM2025
【速读】:该论文试图解决现有模型无法捕捉放射科医生在观察医学图像时每个注视点背后的潜在意图的问题。解决方案的关键在于提出一种基于深度学习的方法——RadGazeIntent,该方法通过Transformer架构同时处理眼动数据的时间和空间维度,将细粒度的注视特征转化为具有诊断意图的粗粒度语义表示,从而有效建模放射科医生的搜索行为和目标。
链接: https://arxiv.org/abs/2507.12461
作者: Trong-Thang Pham,Anh Nguyen,Zhigang Deng,Carol C. Wu,Hien Van Nguyen,Ngan Le
机构: University of Arkansas(阿肯色大学); University of Liverpool(利物浦大学); University of Houston(休斯顿大学); MD Anderson Cancer Center(MD安德森癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ACM MM 2025
点击查看摘要
Abstract:Radiologists rely on eye movements to navigate and interpret medical images. A trained radiologist possesses knowledge about the potential diseases that may be present in the images and, when searching, follows a mental checklist to locate them using their gaze. This is a key observation, yet existing models fail to capture the underlying intent behind each fixation. In this paper, we introduce a deep learning-based approach, RadGazeIntent, designed to model this behavior: having an intention to find something and actively searching for it. Our transformer-based architecture processes both the temporal and spatial dimensions of gaze data, transforming fine-grained fixation features into coarse, meaningful representations of diagnostic intent to interpret radiologists’ goals. To capture the nuances of radiologists’ varied intention-driven behaviors, we process existing medical eye-tracking datasets to create three intention-labeled subsets: RadSeq (Systematic Sequential Search), RadExplore (Uncertainty-driven Exploration), and RadHybrid (Hybrid Pattern). Experimental results demonstrate RadGazeIntent’s ability to predict which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets.
zh
[CV-5] Mitigating Object Hallucinations via Sentence-Level Early Intervention
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成过程中产生的幻觉问题,即模型生成的内容与视觉输入相矛盾的虚假信息。现有方法要么计算成本过高,要么导致训练数据与模型输出之间的分布不匹配。该论文提出的关键解决方案是SENTINEL框架,其核心在于通过句子级别的早期干预来消除对人工标注的依赖。该框架首先通过迭代采样模型输出并利用开放词汇检测器验证对象存在性,构建高质量的领域内偏好对;随后,利用上下文一致的正样本和幻觉负样本迭代构建上下文感知的偏好数据,并通过上下文感知的偏好损失(C-DPO)进行训练,从而在幻觉最初显现的句子级别上实现判别性学习。
链接: https://arxiv.org/abs/2507.12455
作者: Shangpin Peng,Senqiao Yang,Li Jiang,Zhuotao Tian
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); The Chinese University of Hong Kong(香港中文大学); The Chinese University of Hong Kong, Shenzhen(香港中文大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at this https URL.
zh
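SENTINEL 的 C-DPO 属于偏好学习损失。下面给出一个标准 DPO 形式的句子级损失草图(输入为句子级对数似然;论文中上下文感知的加权细节此处从略,符号与温度系数均为笔者假设):

import torch
import torch.nn.functional as F

def dpo_style_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # logp_pos / logp_neg:当前模型对非幻觉句 / 幻觉句的对数似然
    # ref_logp_*:冻结参考模型的对应对数似然;beta:温度系数
    pos_ratio = logp_pos - ref_logp_pos          # 正样本相对参考模型的提升
    neg_ratio = logp_neg - ref_logp_neg          # 负样本相对参考模型的提升
    logits = beta * (pos_ratio - neg_ratio)      # 希望正样本的提升大于负样本
    return -F.logsigmoid(logits).mean()

# 用法示意(随机占位的 batch 级对数似然)
loss = dpo_style_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))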
[CV-6] Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios
【速读】:该论文试图解决自主车辆在复杂环境中安全避障的问题,其核心挑战在于准确的环境感知与高效的运动规划。解决方案的关键在于构建一个基于单目摄像头的感知模块和基于Frenet-Pure Pursuit的规划策略,其中利用YOLOv11进行目标检测,并结合Depth Anything V2等先进的单目深度估计模型来获取物体距离信息,从而实现高效且可靠的避障能力。
链接: https://arxiv.org/abs/2507.12449
作者: Van-Hoang-Anh Phan,Chi-Tam Nguyen,Doan-Trung Au,Thanh-Danh Phan,Minh-Thien Duong,My-Ha Le
机构: ISLab, HCMUTE; Chungbuk National University; Dept. Automatic Control, HCMUTE; Faculty of Electrical and Electronics Engineering, HCMC University of Technology and Education
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures, 4 tables, HSI 2025
点击查看摘要
Abstract:Obstacle avoidance is essential for ensuring the safety of autonomous vehicles. Accurate perception and motion planning are crucial to enabling vehicles to navigate complex environments while avoiding collisions. In this paper, we propose an efficient obstacle avoidance pipeline that leverages a camera-only perception module and a Frenet-Pure Pursuit-based planning strategy. By integrating advancements in computer vision, the system utilizes YOLOv11 for object detection and state-of-the-art monocular depth estimation models, such as Depth Anything V2, to estimate object distances. A comparative analysis of these models provides valuable insights into their accuracy, efficiency, and robustness in real-world conditions. The system is evaluated in diverse scenarios on a university campus, demonstrating its effectiveness in handling various obstacles and enhancing autonomous navigation. The video presenting the results of the obstacle avoidance experiments is available at: this https URL
zh
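该感知流水线的核心是"检测框 + 单目深度图 → 目标距离"。下面是这一步骤常见做法的示意草图(在检测框内取深度中位数作为距离;模型推理与坐标格式为笔者假设,非论文官方实现):

import numpy as np

def estimate_object_distances(detections, depth_map):
    # detections: [(x1, y1, x2, y2, cls), ...](像素坐标,来自检测器如 YOLO)
    # depth_map: (H, W) 逐像素深度(米,来自单目深度估计模型)
    results = []
    for x1, y1, x2, y2, cls in detections:
        patch = depth_map[int(y1):int(y2), int(x1):int(x2)]
        if patch.size == 0:
            continue
        results.append((cls, float(np.median(patch))))   # 取中位数以抵抗噪声与框边误差
    return results

# 用法示意(随机深度图与一个假设的行人检测框)
depth_map = np.random.uniform(0.5, 20.0, size=(480, 640))
print(estimate_object_distances([(100, 120, 220, 360, "pedestrian")], depth_map))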
[CV-7] Describe Anything Model for Visual Question Answering on Text-rich Images ICCV2025
【速读】:该论文试图解决在文本密集的图像中进行视觉问答(VQA)的问题,特别是在需要细粒度提取文本信息的情况下。解决方案的关键在于引入DAM-QA框架,该框架利用Describe Anything Model (DAM)的区域感知能力,并通过聚合图像内容的多个区域视图来生成答案,从而更有效地识别与文本相关的信息。
链接: https://arxiv.org/abs/2507.12441
作者: Yen-Linh Vu,Dinh-Thang Duong,Truong-Binh Duong,Anh-Khoi Nguyen,Thanh-Huy Nguyen,Le Thien Phuc Nguyen,Jianhua Xing,Xingjian Li,Tianyang Wang,Ulas Bagci,Min Xu
机构: AI VIETNAM Lab (AI 越南实验室); Carnegie Mellon University (卡内基梅隆大学); University of Wisconsin - Madison (威斯康星大学麦迪逊分校); University of Pittsburgh (匹兹堡大学); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 5 figures. Accepted to VisionDocs @ ICCV 2025
点击查看摘要
Abstract:Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at this https URL.
zh
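DAM-QA 的关键机制是对图像的多个区域视图分别提问、再聚合答案。下面是一个"滑窗划分区域 + 多数投票聚合"的示意草图(区域划分与投票策略为常见做法的假设,非论文精确流程):

from collections import Counter

def sliding_windows(win=0.5, stride=0.25):
    # 生成归一化坐标下的滑窗区域(外加整图视图),每个区域分别送入区域级模型提问
    regions = [(0.0, 0.0, 1.0, 1.0)]
    x = 0.0
    while x + win <= 1.0 + 1e-6:
        y = 0.0
        while y + win <= 1.0 + 1e-6:
            regions.append((x, y, x + win, y + win))
            y += stride
        x += stride
    return regions

def aggregate_answers(answers):
    # 对各区域视图的回答做多数投票,忽略"无法回答"的区域
    valid = [a.strip().lower() for a in answers if a and a.strip().lower() != "unanswerable"]
    if not valid:
        return "unanswerable"
    return Counter(valid).most_common(1)[0][0]

# 用法示意:answers 假设为各区域视图对同一问题的回答
print(len(sliding_windows()), aggregate_answers(["2019", "2019", "unanswerable", "2018"]))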
[CV-8] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
【速读】:该论文旨在解决机器人模仿学习中因依赖真实机器人硬件而导致的数据规模受限问题。其解决方案的关键在于利用第一视角人类视频来训练视觉-语言-动作(Vision-Language-Action, VLA)模型,通过该模型预测人类手腕和手部动作,并结合逆运动学和动作迁移技术将人类动作转换为机器人动作。此外,研究者在少量机器人操作示范的基础上微调模型以获得机器人策略,即EgoVLA,并构建了Isaac Humanoid Manipulation Benchmark仿真基准进行评估。
链接: https://arxiv.org/abs/2507.12440
作者: Ruihan Yang,Qinxi Yu,Yecheng Wu,Rui Yan,Borui Li,An-Chieh Cheng,Xueyan Zou,Yunhao Fang,Hongxu Yin,Sifei Liu,Song Han,Yao Lu,Xiaolong Wang
机构: UC San Diego (加州大学圣地亚哥分校); UIUC (伊利诺伊大学厄巴纳-香槟分校); MIT (麻省理工学院); NVIDIA (英伟达)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: More videos can be found on our website: this https URL
点击查看摘要
Abstract:Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Isaac Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Isaac Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: this https URL
zh
[CV-9] Traffic-Aware Pedestrian Intention Prediction
【速读】:该论文试图解决自动驾驶车辆(AV)在复杂城市环境中准确估计行人意图的问题,当前模型往往未能充分考虑动态交通信号和上下文场景信息。其解决方案的关键在于提出一种交通感知的时空图卷积网络(TA-STGCN),该网络将交通标志及其状态(红、黄、绿)以及边界框大小作为关键特征进行集成,从而捕捉复杂城市环境中的空间和时间依赖性。
链接: https://arxiv.org/abs/2507.12433
作者: Fahimeh Orvati Nia,Hai Lin
机构: University of Notre Dame(圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 6 pages, 4 figures. Accepted to the American Control Conference (ACC) 2025
点击查看摘要
Abstract:Accurate pedestrian intention estimation is crucial for the safe navigation of autonomous vehicles (AVs) and hence attracts a lot of research attention. However, current models often fail to adequately consider dynamic traffic signals and contextual scene information, which are critical for real-world applications. This paper presents a Traffic-Aware Spatio-Temporal Graph Convolutional Network (TA-STGCN) that integrates traffic signs and their states (Red, Yellow, Green) into pedestrian intention prediction. Our approach introduces the integration of dynamic traffic signal states and bounding box size as key features, allowing the model to capture both spatial and temporal dependencies in complex urban environments. The model surpasses existing methods in accuracy. Specifically, TA-STGCN achieves a 4.75% higher accuracy compared to the baseline model on the PIE dataset, demonstrating its effectiveness in improving pedestrian intention prediction.
zh
[CV-10] DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
【速读】:该论文旨在解决视频识别中Transformer模型计算成本高、难以高效部署的问题,特别是在处理密集视频数据时。其解决方案的关键在于提出一种轻量级的视频焦点调制网络DVFL-Net,通过知识蒸馏和时空特征调制技术,将大型预训练教师模型(Video-FocalNet Base)中的时空知识迁移至紧凑的纳米学生模型(VFL-Net),从而在保持高识别性能的同时显著降低计算复杂度。
链接: https://arxiv.org/abs/2507.12426
作者: Hayat Ullah,Muhammad Ali Shafique,Abbas Khan,Arslan Munir
机构: Florida Atlantic University (佛罗里达大西洋大学); Kansas State University (堪萨斯州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages
点击查看摘要
Abstract:The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatio-temporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
zh
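DVFL-Net 采用前向 KL 散度做时空知识蒸馏。下面给出带温度软化的前向 KL 蒸馏损失的一个常见写法(温度与加权系数为笔者假设,非论文官方实现):

import torch
import torch.nn.functional as F

def forward_kl_distill(student_logits, teacher_logits, temperature=4.0):
    # 前向 KL:KL(teacher || student),对温度软化后的分布计算
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div(input=log q, target=p) 计算 KL(p || q)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean") * temperature ** 2

def total_loss(student_logits, teacher_logits, labels, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)              # 任务监督(真实标签)
    kd = forward_kl_distill(student_logits, teacher_logits)   # 教师到学生的蒸馏监督
    return (1 - alpha) * ce + alpha * kd

# 用法示意(类别数与 batch 大小仅作占位)
s, t = torch.randn(8, 400), torch.randn(8, 400)
loss = total_loss(s, t, torch.randint(0, 400, (8,)))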
[CV-11] InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization
【速读】:该论文旨在解决基于IoU的损失函数在目标检测中的局限性,尤其是其对框形状、大小和分布的敏感性,以及在非重叠情况下导致的小目标优化不佳和边界框扩大的问题。解决方案的关键在于提出InterpIoU,通过使用插值后的预测框与目标框之间的IoU作为损失项,替代传统的人工设计几何惩罚,从而在非重叠情况下提供有意义的梯度,并避免因惩罚项不对齐导致的边界框扩大问题。
链接: https://arxiv.org/abs/2507.12420
作者: Haoyuan Liu,Hiroshi Watanabe
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU’s non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
zh
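InterpIoU 的核心是把"插值框与目标框的 IoU"作为损失项,使预测框与目标不重叠时也有有意义的梯度。下面是轴对齐框上的一个示意实现(插值系数取固定值;Dynamic InterpIoU 会按 IoU 动态调整插值系数,此处从略):

import torch

def box_iou(a, b, eps=1e-7):
    # a, b: (N, 4),格式 (x1, y1, x2, y2),逐对计算 IoU
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

def interp_iou_loss(pred, target, alpha=0.5):
    # 用预测框与目标框的线性插值框"搭桥",最小化 1 - IoU(插值框, 目标框)
    interp = alpha * pred + (1 - alpha) * target
    return (1 - box_iou(interp, target)).mean()

# 用法示意:预测框与目标框本身不重叠,但插值框与目标框重叠,因此梯度非零
pred = torch.tensor([[0., 0., 10., 10.]], requires_grad=True)
target = torch.tensor([[12., 12., 30., 30.]])
loss = interp_iou_loss(pred, target)
loss.backward()
print(loss.item(), pred.grad)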
[CV-12] QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval ICML2025
【速读】:该论文试图解决组合图像检索(Composed Image Retrieval, CIR)中现有方法仅关注目标图像的检索而忽视其他图像相关性的问题。这一限制源于对比学习方法在处理时将目标图像视为正样本,而将批次中的其他图像视为负样本,从而可能引入错误的负样本,导致检索到不相关的图像,降低用户满意度。解决方案的关键在于提出Query-Relevant Retrieval through Hard Negative Sampling (QuRe),通过优化奖励模型目标来减少错误负样本,并引入一种硬负样本采样策略,选择在目标图像之后相关性分数出现两个陡降之间的图像,以有效过滤错误负样本。
链接: https://arxiv.org/abs/2507.12416
作者: Jaehyun Kwak,Ramahdani Muhammad Izaaz Inhar,Se-Young Yun,Sung-Ju Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025
点击查看摘要
Abstract:Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employing contrastive learning (which treats the target image as positive and all other images in the batch as negatives) can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at this https URL.
zh
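QuRe 的硬负样本策略可以概括为:在按相关性降序排列的得分中,取目标图像之后"两次陡降"之间的图像。下面是按相邻得分落差定位该区间的一个示意实现(陡降的判定方式为笔者假设,非论文官方实现):

import numpy as np

def select_hard_negatives(scores, target_idx):
    # scores:按相关性从高到低排序的得分;target_idx:目标图像在排序中的位置
    tail = np.asarray(scores[target_idx + 1:])
    if len(tail) < 3:
        return []
    drops = tail[:-1] - tail[1:]                   # 相邻名次之间的得分落差
    d1, d2 = np.sort(np.argsort(drops)[-2:])       # 两个最大落差的位置(按先后排序)
    return list(range(target_idx + 1 + d1 + 1, target_idx + 1 + d2 + 1))  # 两次陡降之间的图像索引

# 用法示意:得分先平缓,随后出现两次明显下跌
scores = [0.95, 0.90, 0.88, 0.60, 0.58, 0.57, 0.30, 0.28]
print(select_hard_negatives(scores, target_idx=0))   # 输出位于两次陡降之间的索引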
[CV-13] AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models
【速读】:该论文试图解决自动驾驶系统训练中数据集标注不准确的问题,此类问题通常需要多次人工迭代才能获得高质量数据,而手动审查大规模数据集则耗时且成本高昂。解决方案的关键在于引入AutoVDC(Automated Vision Data Cleaning)框架,利用视觉-语言模型(Vision-Language Models, VLMs)自动识别视觉数据集中错误的标注,从而提升数据质量。
链接: https://arxiv.org/abs/2507.12414
作者: Santosh Vasa,Aditi Ramadwar,Jnana Rama Krishna Darabattula,Md Zafar Anwar,Stanislaw Antol,Andrei Vatavu,Thomas Monninger,Sihao Ding
机构: Mercedes-Benz Research & Development North America (梅赛德斯-奔驰北美研究与开发中心); University of Stuttgart, Institute for Artificial Intelligence (斯图加特大学人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce high-quality datasets. However, manually reviewing large datasets is laborious and expensive. In this paper, we introduce AutoVDC (Automated Vision Data Cleaning) framework and investigate the utilization of Vision-Language Models (VLMs) to automatically identify erroneous annotations in vision datasets, thereby enabling users to eliminate these errors and enhance data quality. We validate our approach using the KITTI and nuImages datasets, which contain object detection benchmarks for autonomous driving. To test the effectiveness of AutoVDC, we create dataset variants with intentionally injected erroneous annotations and observe the error detection rate of our approach. Additionally, we compare the detection rates using different VLMs and explore the impact of VLM fine-tuning on our pipeline. The results demonstrate our method’s high performance in error detection and data cleaning experiments, indicating its potential to significantly improve the reliability and accuracy of large-scale production datasets in autonomous driving.
zh
[CV-14] OD-VIRAT: A Large-Scale Benchmark for Object Detection in Realistic Surveillance Environments
【速读】:该论文旨在解决现实场景下人类监控图像中人体及交互物体检测的挑战,以提升计算机视觉模型在复杂环境中的鲁棒性。其解决方案的关键在于构建两个名为OD-VIRAT Large和OD-VIRAT Tiny的视觉目标检测基准,提供了大量带有边界框和类别标注的数据,涵盖多种复杂场景下的监控视频,并对当前先进的目标检测架构(如RETMDET、YOLOX、RetinaNet、DETR和Deformable-DETR)进行了系统性评估,从而为开发更高效、更可靠的检测算法奠定基础。
链接: https://arxiv.org/abs/2507.12396
作者: Hayat Ullah,Abbas Khan,Arslan Munir,Hari Kalva
机构: Florida Atlantic University(佛罗里达大西洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages
点击查看摘要
Abstract:Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for human and human-interacting object detection in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks named OD-VIRAT Large and OD-VIRAT Tiny, aiming at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different scenes of human surveillance recorded from significant height and distance. The proposed benchmarks offer rich annotations of bounding boxes and categories, where OD-VIRAT Large has 8.7 million annotated instances in 599,996 images and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also focuses on benchmarking state-of-the-art object detection architectures, including RTMDet, YOLOX, RetinaNet, DETR, and Deformable-DETR, on this object detection-specific variant of the VIRAT dataset. To the best of our knowledge, it is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects. The proposed benchmarking and experimental settings will help in providing insights concerning the performance of selected object detection models and set the base for developing more efficient and robust object detection architectures.
zh
[CV-15] Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation MICCAI2025
【速读】:该论文旨在解决半监督医学图像分割中因标注数据有限而导致的视觉语义理解不足问题,其解决方案的关键在于利用文本信息增强视觉语义嵌入。论文提出了一种基于文本驱动的多平面视觉交互框架(Text-SemiSeg),包含三个核心模块:文本增强的多平面表示(TMR)、类别感知语义对齐(CSA)和动态认知增强(DCA),通过文本与视觉特征的跨模态交互,提升模型对医学图像的语义理解能力和泛化性能。
链接: https://arxiv.org/abs/2507.12382
作者: Kaiwen Huang,Yi Zhou,Huazhu Fu,Yizhe Zhang,Chen Gong,Tao Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages; 2 figures; Have been accepted by MICCAI 2025
点击查看摘要
Abstract:Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. When labeled data is limited, textual information can provide additional context to enhance visual semantic understanding. However, research exploring the use of textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. In this paper, we propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg), which consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Specifically, TMR facilitates text-visual interaction through planar mapping, thereby enhancing the category awareness of visual features. CSA performs cross-modal semantic alignment between the text features with introduced learnable variables and the intermediate layer of visual features. DCA reduces the distribution discrepancy between labeled and unlabeled data through their interaction, thus improving the model’s robustness. Finally, experiments on three public datasets demonstrate that our model effectively enhances visual features with textual information and outperforms other methods. Our code is available at this https URL.
zh
[CV-16] FactorHD: A Hyperdimensional Computing Model for Multi-Object Multi-Class Representation and Factorization
【速读】:该论文试图解决在神经符号人工智能(neuro-symbolic AI)中,现有超维计算(HDC)模型在表示和因子分解复杂类-子类关系时面临的挑战,如因子分解效率低、信息丢失以及“叠加灾难”和“问题2”等局限。解决方案的关键在于提出FactorHD模型,其核心是采用一种符号编码方法,嵌入额外的记忆条款以保留更多对象信息,并结合一种高效的因子分解算法,通过识别目标类的记忆条款选择性地消除冗余类别,从而显著提升计算效率和准确性。
链接: https://arxiv.org/abs/2507.12366
作者: Yifei Zhou,Xuchu Huang,Chenyu Ni,Min Zhou,Zheyu Yan,Xunzhao Yin,Cheng Zhuo
机构: Zhejiang University (浙江大学); Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province (浙江省协同感知与自主无人系统重点实验室)
类目: ymbolic Computation (cs.SC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures, 2 tables, to be published in the 62nd DAC (Design Automation Conference) proceedings
点击查看摘要
Abstract:Neuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects associate different levels of classes and subclasses, they face challenges for factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. Such a model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with class-subclass relations, overcoming limitations of existing HDC models such as “superposition catastrophe” and “the problem of 2”. Evaluations show that FactorHD achieves approximately 5667x speedup at a representation size of 10^9 compared to existing HDC models. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the Cifar-10 dataset.
zh
[CV-17] Cluster Contrast for Unsupervised Visual Representation Learning ICIP2025
【速读】:该论文试图解决无监督视觉表征学习中的表征质量与聚类效果问题,旨在提升特征表示的区分度与紧凑性。解决方案的关键在于提出一种名为Cluster Contrast (CueCo) 的新方法,该方法通过结合对比学习与聚类方法的优势,利用两个神经网络(查询网络与键网络)在特征空间中同时实现特征表示的分散与对齐,从而通过对比损失增强类间分离度,通过聚类目标促进类内紧凑性。
链接: https://arxiv.org/abs/2507.12359
作者: Nikolaos Giakoumoglou,Tania Stathaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICIP 2025
点击查看摘要
Abstract:We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
zh
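CueCo 将对比损失(增强类间分离)与聚类目标(增强类内紧凑)相结合。下面给出这两项目标的一个极简组合草图(InfoNCE + 簇中心余弦拉近;温度与权重为笔者假设,非论文官方实现):

import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.2):
    # 标准 InfoNCE:同一图像的两个视图互为正样本,批内其余样本为负样本
    q = F.normalize(query, dim=-1)
    k = F.normalize(key, dim=-1)
    logits = q @ k.T / temperature
    labels = torch.arange(q.size(0), device=q.device)   # 对角线为正样本
    return F.cross_entropy(logits, labels)

def cluster_pull(query, centroids, assignments):
    # 聚类目标:把特征拉向其所属簇中心(1 - 余弦相似度),增强类内紧凑性
    q = F.normalize(query, dim=-1)
    c = F.normalize(centroids, dim=-1)
    return (1 - (q * c[assignments]).sum(dim=-1)).mean()

# 用法示意:q/k 来自 query/key 两个网络,centroids 可由 k-means 周期性更新(此处随机占位)
q, k = torch.randn(16, 128), torch.randn(16, 128)
centroids, assign = torch.randn(10, 128), torch.randint(0, 10, (16,))
loss = info_nce(q, k) + 0.5 * cluster_pull(q, centroids, assign)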
[CV-18] Improving Lightweight Weed Detection via Knowledge Distillation
【速读】:该论文旨在解决在资源受限平台中部署高精度目标检测模型的挑战,特别是在植物表型应用中区分视觉相似的杂草种类的问题。其解决方案的关键在于采用通道级知识蒸馏(CWD)和掩码生成式蒸馏(MGD)技术,通过将YOLO11x作为教师模型,YOLO11n作为参考和学生模型,有效地将知识从教师模型迁移至学生模型,从而在不增加模型复杂度的情况下提升轻量级模型的性能。
链接: https://arxiv.org/abs/2507.12344
作者: Ahmet Oğuz Saltık,Max Voigt,Sourav Modak,Mike Beckworth,Anthony Stein
机构: University of Hohenheim (海恩海姆大学); Dept. of Artificial Intelligence in Agricultural Engineering & Computational Science Hub (农业工程与计算科学中心人工智能系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Weed detection is a critical component of precision agriculture, facilitating targeted herbicide application and reducing environmental impact. However, deploying accurate object detection models on resource-limited platforms remains challenging, particularly when differentiating visually similar weed species commonly encountered in plant phenotyping applications. In this work, we investigate Channel-wise Knowledge Distillation (CWD) and Masked Generative Distillation (MGD) to enhance the performance of lightweight models for real-time smart spraying systems. Utilizing YOLO11x as the teacher model and YOLO11n as both reference and student, both CWD and MGD effectively transfer knowledge from the teacher to the student model. Our experiments, conducted on a real-world dataset comprising sugar beet crops and four weed types (Cirsium, Convolvulus, Fallopia, and Echinochloa), consistently show increased AP50 across all classes. The distilled CWD student model achieves a notable improvement of 2.5% and MGD achieves 1.9% in mAP50 over the baseline without increasing model complexity. Additionally, we validate real-time deployment feasibility by evaluating the student YOLO11n model on Jetson Orin Nano and Raspberry Pi 5 embedded devices, performing five independent runs to evaluate performance stability across random seeds. These findings confirm CWD and MGD as an effective, efficient, and practical approach for improving deep learning-based weed detection accuracy in precision agriculture and plant phenotyping scenarios.
zh
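通道级知识蒸馏(CWD)的常见形式是:把教师与学生特征图的每个通道分别在空间维度上归一化为概率分布,再最小化二者的 KL 散度。下面是一个示意实现(温度为假设值,实际选取的对应层与通道对齐方式以论文为准):

import torch
import torch.nn.functional as F

def channel_wise_distillation(student_feat, teacher_feat, tau=4.0):
    # student_feat / teacher_feat: (B, C, H, W),假设形状已对齐
    B, C, H, W = student_feat.shape
    s = student_feat.reshape(B * C, H * W) / tau
    t = teacher_feat.reshape(B * C, H * W) / tau
    t_prob = F.softmax(t, dim=-1)                  # 每个通道的空间注意力分布(教师)
    s_logprob = F.log_softmax(s, dim=-1)           # 学生的对应分布(对数形式)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean") * tau ** 2

# 用法示意(随机特征图占位;实际中取 YOLO11x 教师与 YOLO11n 学生的对应层输出)
loss = channel_wise_distillation(torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40))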
[CV-19] Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
【速读】:该论文试图解决单目3D关键点估计(monocular 3D keypoints estimation)问题,即从单张图像中准确预测3D关键点位置。传统方法依赖于人工标注或校准的多视角图像,而这些数据收集成本较高。该论文提出的解决方案的关键在于利用预训练的多视角扩散模型(multi-view diffusion model)中的强大几何先验,通过该模型从单张图像生成多视角图像作为监督信号,并将其作为3D几何线索提供给模型。同时,该模型还被用作2D多视角特征提取器,通过其中间表示构建3D特征体积,从而将扩散模型隐含的3D先验转化为显式的3D特征。
链接: https://arxiv.org/abs/2507.12336
作者: Subin Jeon,In Cho,Junyoung Hong,Seon Joo Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoints estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoints estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal to provide 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations. This transforms implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoints estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse aspects and datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
zh
[CV-20] Compositional Discrete Latent Code for High Fidelity Productive Diffusion Models
【速读】:该论文试图解决扩散模型在生成高质量图像时面临的样本保真度不足以及生成能力受限于训练分布的问题。其解决方案的关键在于引入离散潜在代码(Discrete Latent Code, DLC),这是一种通过自监督学习从单纯嵌入中提取的图像表示,具有易生成性和可组合性,能够支持超出训练分布的新型图像采样,从而显著提升了无条件图像生成的性能,并实现了文本到图像生成的高效扩展。
链接: https://arxiv.org/abs/2507.12318
作者: Samuel Lavoie,Michael Noukhovitch,Aaron Courville
机构: Mila(蒙特利尔学习算法研究所); Université de Montréal(蒙特利尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In submission, 22 pages, 7 tables, 12 figures
点击查看摘要
Abstract:We argue that diffusion models’ success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow out-of-training samples generation. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.
zh
[CV-21] PROL: Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning ICCV2025
【速读】:该论文试图解决在线持续学习(Online Continual Learning, OCL)中数据隐私约束下灾难性遗忘问题,特别是在数据只能被观察一次的情况下。现有方法通常依赖于从之前类别中保存的示例或特征进行重放,但受限于数据开放政策可能无法应用;而基于提示的方法虽然表现优异,但存在可训练参数数量增长的问题。该论文提出的解决方案的关键在于一种新颖的基于提示的方法,包含四个主要组件:(1)作为通用知识的轻量级提示生成器,(2)作为特定知识的可训练缩放与偏移模块,(3)保持预训练模型(PTM)泛化能力,(4)硬-软更新机制。该方法在多个数据集上取得了显著优于当前最先进方法的性能,并在参数数量、训练时间、推理时间和吞吐量方面表现出较好的效率。
链接: https://arxiv.org/abs/2507.12305
作者: M. Anwar Ma’sum,Mahardhika Pratama,Savitha Ramasamy,Lin Liu,Habibullah Habibullah,Ryszard Kowalczyk
机构: STEM University of South Australia (南澳大学科学与技术学院); Institute for Infocomm Research, ASTAR & IPAL (信息通信研究所,ASTAR与IPAL); CNRS@CREATE (法国国家科学研究中心@创造中心); Systems Research Institute, Polish Academy of Sciences (波兰科学院系统研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
点击查看摘要
Abstract:The data privacy constraint in online continual learning (OCL), where the data can be seen only once, complicates the catastrophic forgetting problem in streaming data. A common approach applied by the current SOTAs in OCL is with the use of memory saving exemplars or features from previous classes to be replayed in the current task. On the other hand, the prompt-based approach performs excellently in continual learning but with the cost of a growing number of trainable parameters. The first approach may not be applicable in practice due to data openness policy, while the second approach has the issue of throughput associated with the streaming data. In this study, we propose a novel prompt-based method for online continual learning that includes 4 main components: (1) single light-weight prompt generator as a general knowledge, (2) trainable scaler-and-shifter as specific knowledge, (3) pre-trained model (PTM) generalization preserving, and (4) hard-soft updates mechanism. Our proposed method achieves significantly higher performance than the current SOTAs in CIFAR100, ImageNet-R, ImageNet-A, and CUB dataset. Our complexity analysis shows that our method requires a relatively smaller number of parameters and achieves moderate training time, inference time, and throughput. For further study, the source code of our method is available at this https URL.
zh
[CV-22] RegCL: Continual Adaptation of Segment Anything Model via Model Merging
【速读】:该论文旨在解决Segment Anything Model (SAM)在特定领域中性能受限的问题,尤其是现有方法在跨领域应用时易导致性能下降的灾难性遗忘问题。其解决方案的关键在于提出RegCL,一种基于非回放持续学习(CL)框架,通过模型合并技术高效整合多领域知识。具体而言,RegCL通过优化权重引导不同领域训练的SAM适应模块(如LoRA模块)参数的合并过程,以最小化合并模型与各领域专用模型之间的预测差异,从而在保持参数效率的同时实现多领域知识的融合。
链接: https://arxiv.org/abs/2507.12297
作者: Yuan-Chen Shu,Zhiwei Lin,Yongtao Wang
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学计算机技术研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To address the performance limitations of the Segment Anything Model (SAM) in specific domains, existing works primarily adopt adapter-based one-step adaptation paradigms. However, some of these methods are specifically developed for particular domains, and applying them to other domains may lead to performance degradation. This issue of catastrophic forgetting severely limits the model’s scalability. To address this issue, this paper proposes RegCL, a novel non-replay continual learning (CL) framework designed for efficient multi-domain knowledge integration through model merging. Specifically, RegCL incorporates the model merging algorithm into the continual learning paradigm by merging the parameters of SAM’s adaptation modules (e.g., LoRA modules) trained on different domains. The merging process is guided by weight optimization, which minimizes prediction discrepancies between the merged model and each of the domain-specific models. RegCL effectively consolidates multi-domain knowledge while maintaining parameter efficiency, i.e., the model size remains constant regardless of the number of tasks, and no historical data storage is required. Experimental results demonstrate that RegCL achieves favorable continual learning performance across multiple downstream datasets, validating its effectiveness in dynamic scenarios.
zh
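RegCL 的合并步骤可以理解为:对各域训练的适配器参数做加权组合,再优化这些权重使合并模型与各域专用模型的预测尽量一致。下面是一个简化的示意草图(逐参数线性加权与 MSE 差异度量均为笔者假设,非论文官方实现):

import torch

def merge_adapters(adapter_list, weights):
    # 对若干域专用适配器(如 LoRA 参数字典)做凸组合:theta_merged = sum_i w_i * theta_i
    w = torch.softmax(weights, dim=0)            # 保证合并系数非负且和为 1
    merged = {}
    for name in adapter_list[0]:
        merged[name] = sum(w[i] * adapter_list[i][name] for i in range(len(adapter_list)))
    return merged

def merging_discrepancy(merged_outputs, domain_outputs):
    # 合并模型与各域专用模型在对应域数据上的预测差异(此处用 MSE 示意)
    return sum(((m - d) ** 2).mean() for m, d in zip(merged_outputs, domain_outputs))

# 用法示意:两个域的 LoRA 参数(随机占位),weights 是待优化的合并系数
adapters = [{"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)} for _ in range(2)]
weights = torch.zeros(2, requires_grad=True)
merged = merge_adapters(adapters, weights)       # 之后可对 weights 做梯度优化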
[CV-23] Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation
【速读】:该论文旨在解决传统基于骨骼姿态估计的体操技能识别方法在计算成本高、推理时间长和部署复杂等问题,从而限制其在实时应用或移动设备中的适用性。其解决方案的关键在于提出一种直接的体操技能识别方法,通过利用深度估计和运动员区域检索,避免使用计算密集型的人体姿态估计模块,从而提高效率并提升分类准确性。
链接: https://arxiv.org/abs/2507.12292
作者: Antonio Finocchiaro,Giovanni Maria Farinella,Antonino Furnari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, In International Conference on Image Analysis and Processing
点击查看摘要
Abstract:Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition are based on pose estimation methods to determine the position of skeletal data from images, which is later fed to a classification algorithm to infer the performed skill. Despite the progress in human pose estimation algorithms, they still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or mobile devices. This work proposes a direct approach to calisthenics skill recognition, which leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.
zh
[CV-24] FADE: Adversarial Concept Erasure in Flow Models
【速读】:该论文旨在解决生成式模型在图像生成过程中可能存在的隐私泄露和公平性问题,即模型可能会记忆敏感概念或延续偏见。其解决方案的关键在于提出一种名为FADE(Fair Adversarial Diffusion Erasure)的概念擦除方法,该方法结合了轨迹感知的微调策略与对抗目标,以可靠地移除指定概念,同时保持模型的整体生成质量。理论上,FADE通过最小化被擦除概念与模型输出之间的互信息来保证隐私和公平性;实验上,FADE在多个基准任务中表现出色,优于现有方法,并在概念移除效果与生成质量的调和均值上提升了5%至10%。
链接: https://arxiv.org/abs/2507.12283
作者: Zixuan Fu,Yan Ren,Finn Carter,Chenyue Wang,Ze Niu,Dacheng Yu,Emily Davis,Bo Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Camera Ready
点击查看摘要
Abstract:Diffusion models have demonstrated remarkable image generation capabilities, but also pose risks in privacy and fairness by memorizing sensitive concepts or perpetuating biases. We propose a novel concept erasure method for text-to-image diffusion models, designed to remove specified concepts (e.g., a private individual or a harmful stereotype) from the model’s generative repertoire. Our method, termed FADE (Fair Adversarial Diffusion Erasure), combines a trajectory-aware fine-tuning strategy with an adversarial objective to ensure the concept is reliably removed while preserving overall model fidelity. Theoretically, we prove a formal guarantee that our approach minimizes the mutual information between the erased concept and the model’s outputs, ensuring privacy and fairness. Empirically, we evaluate FADE on Stable Diffusion and FLUX, using benchmarks from prior work (e.g., object, celebrity, explicit content, and style erasure tasks from MACE). FADE achieves state-of-the-art concept removal performance, surpassing recent baselines like ESD, UCE, MACE, and ANT in terms of removal efficacy and image quality. Notably, FADE improves the harmonic mean of concept removal and fidelity by 5–10% over the best prior method. We also conduct an ablation study to validate each component of FADE, confirming that our adversarial and trajectory-preserving objectives each contribute to its superior performance. Our work sets a new standard for safe and fair generative modeling by unlearning specified concepts without retraining from scratch.
zh
[CV-25] Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
【速读】:该论文旨在解决早产儿支气管肺发育不良(Bronchopulmonary Dysplasia, BPD)的早期预后评估问题,以避免低风险婴儿接受不必要的毒性治疗。其关键解决方案是采用深度学习方法,利用出生后24小时内获取的胸部X光片进行预测,并通过领域特定预训练(domain-specific pretraining)提升模型性能,相较于基于ImageNet的初始化显著提高了预测准确性。此外,采用渐进式层冻结和线性探针技术,既防止了过拟合,又保证了计算可行性,为临床应用和未来联邦学习部署提供了可行路径。
链接: https://arxiv.org/abs/2507.12269
作者: Sybelle Goedicke-Fritz(1),Michelle Bous(1),Annika Engel(2),Matthias Flotho(2 and 5),Pascal Hirsch(2),Hannah Wittig(1),Dino Milanovic(2),Dominik Mohr(1),Mathias Kaspar(6),Sogand Nemat(3),Dorothea Kerner(3),Arno Bücker(3),Andreas Keller(2 and 5 and 7),Sascha Meyer(4),Michael Zemlin(1),Philipp Flotho(2 and 5) ((1) Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany, (2) Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbrücken, Germany, (3) Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany, (4) Clinical Centre Karlsruhe, Franz-Lust Clinic for Paediatrics, Karlsruhe, Germany, (5) Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarland University Campus, Germany, (6) Digital Medicine, University Hospital of Augsburg, Augsburg, Germany, (7) Pharma Science Hub (PSH), Saarland University Campus, Germany)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: S.G.-F., M.B., and A.E. contributed equally to this work and share first authorship. M.Z. and P.F. contributed equally to this work and share senior authorship
点击查看摘要
Abstract:Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low-risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants (≤32 weeks gestation, 401-999 g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting, and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 ± 0.10, balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031), which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 ± 0.11), confirming the need for learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
zh
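摘要中的"渐进式层冻结 + 判别式学习率"可以用下面的示意写法表达(以 torchvision 的 ResNet-50 为例;参数分组、学习率数值与解冻调度均为笔者假设,非论文官方实现):

import torch
from torchvision.models import resnet50

model = resnet50(weights=None)                          # 实际中应载入成人胸片上预训练的权重
model.fc = torch.nn.Linear(model.fc.in_features, 2)     # 二分类:是否发展为中/重度 BPD

# 判别式学习率:越靠近输出层,学习率越大(数值为假设)
optimizer = torch.optim.AdamW([
    {"params": model.layer3.parameters(), "lr": 1e-5},
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(),     "lr": 1e-3},
], weight_decay=1e-4)

def progressively_unfreeze(model, epoch):
    # 渐进式冻结/解冻:先只训练分类头,再逐步放开高层卷积块(每 5 个 epoch 解冻一段)
    stages = [model.fc, model.layer4, model.layer3]
    for i, stage in enumerate(stages):
        for p in stage.parameters():
            p.requires_grad = epoch >= i * 5

for p in model.parameters():                            # 训练开始时先冻结全部参数
    p.requires_grad = False
progressively_unfreeze(model, epoch=0)                  # 之后每个 epoch 调用一次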
[CV-26] Comparative Analysis of CNN Performance in Keras, PyTorch, and JAX on PathMNIST
【速读】:该论文试图解决不同深度学习框架(如Keras、PyTorch和JAX)在医学图像分类任务中的性能比较问题,特别是其在训练效率、分类准确性和推理速度方面的表现差异。解决方案的关键在于通过PathMNIST数据集进行基准测试,系统评估各框架的CNN实现,从而揭示计算速度与模型精度之间的权衡关系,为医学图像分析领域的研究者和实践者提供有价值的参考。
链接: https://arxiv.org/abs/2507.12248
作者: Anida Nezović,Jalal Romano,Nada Marić,Medina Kapo,Amila Akagić
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Deep learning has significantly advanced the field of medical image classification, particularly with the adoption of Convolutional Neural Networks (CNNs). Various deep learning frameworks such as Keras, PyTorch and JAX offer unique advantages in model development and deployment. However, their comparative performance in medical imaging tasks remains underexplored. This study presents a comprehensive analysis of CNN implementations across these frameworks, using the PathMNIST dataset as a benchmark. We evaluate training efficiency, classification accuracy and inference speed to assess their suitability for real-world applications. Our findings highlight the trade-offs between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis.
zh
[CV-27] Calisthenics Skills Temporal Video Segmentation
【速读】:该论文试图解决的是体操技能(calisthenics skill)在视频中的时间分割问题,即从视频中自动识别并分割出静态体操技能的持续时间。解决方案的关键在于提出一个包含静态体操技能视频的数据集,并对其进行时间分割标注,以支持后续的自动化工具开发。该研究还报告了一个基线方法的结果,验证了该问题的可行性,同时指出了进一步优化的空间。
链接: https://arxiv.org/abs/2507.12245
作者: Antonio Finocchiaro,Giovanni Maria Farinella,Antonino Furnari
机构: University of Catania(卡塔尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2
点击查看摘要
Abstract:Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
zh
[CV-28] Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
【速读】:该论文试图解决医学影像中通过临床报告进行疾病定位的自然语言短语到图像区域的映射问题,即Phrase grounding。其解决方案的关键在于利用生成式文本到图像扩散模型,通过跨注意力图实现更优的零样本短语定位性能,同时采用冻结的领域特定语言模型(如CXR-BERT)进行微调,显著优于领域无关的模型。此外,引入了双模态偏差融合(Bimodal Bias Merging, BBM)作为后处理技术,进一步提升定位精度。
链接: https://arxiv.org/abs/2507.12236
作者: Felix Nützel,Mischa Dombrowski,Bernhard Kainz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 6 figures. To appear in Proc. MIDL 2025 (PMLR)
点击查看摘要
Abstract:Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at this https URL.
zh
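The grounding step such pipelines rely on can be pictured as turning a cross-attention map for a report phrase into a binary region and scoring it against a reference mask. The sketch below uses a synthetic attention map and a quantile threshold; it does not reproduce the diffusion backbone or the BBM post-processing described in the abstract.

```python
import torch

# Sketch with a synthetic attention map; the quantile threshold is an assumption.
def ground_from_attention(attn: torch.Tensor, quantile: float = 0.95) -> torch.Tensor:
    """attn: (H, W) cross-attention weights for one report phrase."""
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    thresh = torch.quantile(attn.flatten(), quantile)
    return attn >= thresh

def iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union else 0.0

attn_map = torch.rand(64, 64)                    # stand-in for a cross-attention map
gt_mask = torch.zeros(64, 64, dtype=torch.bool)
gt_mask[20:40, 25:45] = True                     # toy ground-truth region
pred_mask = ground_from_attention(attn_map)
print(f"IoU = {iou(pred_mask, gt_mask):.3f}")
```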
[CV-29] MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM
【速读】:该论文旨在解决深度伪造(deepfake)检测中模型解释性不足及对人脸质量相关属性利用不充分的问题。其解决方案的关键在于构建一个扩展的视觉问答(VQA)数据集DD-VQA+,并提出一种融合属性驱动的混合LoRA策略、多粒度提示学习和伪造感知训练策略的新型检测框架MGFFD-VLM,从而提升模型的伪造分类性能与可解释性。
链接: https://arxiv.org/abs/2507.12232
作者: Tao Chen,Jingyi Zhang,Decheng Liu,Chunlei Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent studies have utilized visual large language models (VLMs) to answer not only “Is this face a forgery?” but also “Why is the face a forgery?” These studies introduced forgery-related attributes, such as forgery location and type, to construct deepfake VQA datasets and train VLMs, achieving high accuracy while providing human-understandable explanatory text descriptions. However, these methods still have limitations. For example, they do not fully leverage face quality-related attributes, which are often abnormal in forged faces, and they lack effective training strategies for forgery-aware VLMs. In this paper, we extend the VQA dataset to create DD-VQA+, which features a richer set of attributes and a more diverse range of samples. Furthermore, we introduce a novel forgery detection framework, MGFFD-VLM, which integrates an Attribute-Driven Hybrid LoRA Strategy to enhance the capabilities of Visual Large Language Models (VLMs). Additionally, our framework incorporates Multi-Granularity Prompt Learning and a Forgery-Aware Training Strategy. By transforming classification and forgery segmentation results into prompts, our method not only improves forgery classification but also enhances interpretability. To further boost detection performance, we design multiple forgery-related auxiliary losses. Experimental results demonstrate that our approach surpasses existing methods in both text-based forgery judgment and analysis, achieving superior accuracy.
zh
[CV-30] RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models
【速读】:该论文旨在解决扩散模型在生成建模中采样过程易受幻觉影响的问题,这一问题通常源于分数估计的不准确性。其解决方案的关键在于通过优化视角重新诠释扩散采样,并引入RODS(Robust Optimization-inspired Diffusion Sampler),该方法利用损失景观中的几何线索检测并修正高风险的采样步骤,从而实现更平滑的采样轨迹和自适应扰动调整,有效降低幻觉现象,且无需重新训练并在极小的额外推理成本下完成。
链接: https://arxiv.org/abs/2507.12201
作者: Yiqi Tian,Pengfei Jin,Mingze Yuan,Na Li,Bo Zeng,Quanzheng Li
机构: Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114; Department of Industrial Engineering, University of Pittsburgh, Pittsburgh, PA 15261; School of Engineering and Applied Sciences, Harvard University, Boston, MA 02138
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:
点击查看摘要
Abstract:Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations, often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts.
zh
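The detect-and-correct pattern described above can be illustrated on a toy 1-D diffusion problem: an Euler-style sampler flags a step as high-risk when the score estimate changes sharply between iterations and damps the update there. This is not the RODS algorithm itself; the closed-form score model, the risk cue, and the threshold below are assumptions made for the sketch.

```python
import numpy as np

# Toy illustration of detect-and-correct sampling; not the RODS algorithm.
def toy_score(x: np.ndarray, sigma: float) -> np.ndarray:
    # Exact score of N(0, 1) data smoothed by noise level sigma.
    return -x / (1.0 + sigma**2)

def sample(n_steps: int = 50, risk_threshold: float = 0.5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(5.0, 0.05, n_steps)
    x = rng.normal(scale=np.sqrt(1.0 + sigmas[0] ** 2), size=1000)
    prev_score = toy_score(x, sigmas[0])
    for i in range(n_steps - 1):
        score = toy_score(x, sigmas[i])
        # Crude geometric cue: a sharp change in the score field marks a risky step.
        risk = np.abs(score - prev_score).mean()
        step_scale = 0.5 if risk > risk_threshold else 1.0      # adaptive damping
        dsigma = sigmas[i + 1] - sigmas[i]
        x = x + step_scale * (-sigmas[i] * score) * dsigma      # probability-flow Euler step
        prev_score = score
    return x

samples = sample()
print(f"mean {samples.mean():.3f}, std {samples.std():.3f}")    # roughly standard normal
```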
[CV-31] Revealing the Ancient Beauty: Digital Reconstruction of Temple Tiles using Computer Vision
【速读】:该论文旨在解决文化遗迹保护与修复中的技术挑战,特别是在印度纪念碑的保存中如何有效维持其建筑技艺和美学价值。解决方案的关键在于提出三种前沿技术:基于图像处理的分形卷积方法,用于揭示细微的建筑图案;专为西孟加拉邦Bankura陶土寺庙设计的自敏感瓦片填充(SSTF)方法,结合了新型的数据增强技术MosaicSlice;以及超分辨率策略,以在不显著损失质量的情况下提升图像分辨率。这些方法通过创新的数据增强策略,在可控成本内实现自动化,从而确保修复过程的连贯性和真实性。
链接: https://arxiv.org/abs/2507.12195
作者: Arkaprabha Basu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Modern digitised approaches have dramatically changed the preservation and restoration of cultural treasures, integrating computer scientists into multidisciplinary projects with ease. Machine learning, deep learning, and computer vision techniques have revolutionised developing sectors like 3D reconstruction, picture inpainting, IoT-based methods, genetic algorithms, and image processing. We suggest three cutting-edge techniques in recognition of the special qualities of Indian monuments, which are famous for their architectural skill and aesthetic appeal. First is the Fractal Convolution methodology, a segmentation method based on image processing that successfully reveals subtle architectural patterns within these irreplaceable cultural buildings. The second is a revolutionary Self-Sensitive Tile Filling (SSTF) method created especially for West Bengal's mesmerising Bankura Terracotta Temples, built around a brand-new data augmentation method called MosaicSlice. The third is a Super Resolution strategy to upscale the images without losing a significant amount of quality. Our methods allow for the development of seamless region-filling and highly detailed tiles while maintaining authenticity, using the novel data augmentation strategy to introduce automation at affordable cost. By providing effective solutions that preserve the delicate balance between tradition and innovation, this study improves the subject and eventually ensures unrivalled efficiency and aesthetic excellence in cultural heritage protection. The suggested approaches advance the field into an era of unmatched efficiency and aesthetic quality while carefully upholding the delicate equilibrium between tradition and innovation.
zh
[CV-32] Wavelet-based Decoupling Framework for low-light Stereo Image Enhancement
【速读】:该论文旨在解决低光图像增强中因退化因素被编码在单一潜在空间而导致特征高度纠缠和模型易陷入捷径学习的问题。其解决方案的关键在于利用小波变换实现特征空间的解耦,将特征空间分解为低频分支用于光照调整,以及多个高频分支用于纹理增强,同时引入基于高频引导的跨视角交互模块(HF-CIM)和基于交叉注意力机制的细节与纹理增强模块(DTEM),以有效提取其他视角中的有用信息并增强高频信息。
链接: https://arxiv.org/abs/2507.12188
作者: Shuangli Du,Siming Yan,Zhenghao Shi,Zhenzhen You,Lu Sun
机构: Xi’an University of Technology(西安理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Low-light images suffer from complex degradation, and existing enhancement methods often encode all degradation factors within a single latent space. This leads to highly entangled features and strong black-box characteristics, making the model prone to shortcut learning. To mitigate the above issues, this paper proposes a wavelet-based low-light stereo image enhancement method with feature space decoupling. Our insight comes from the following findings: (1) Wavelet transform enables the independent processing of low-frequency and high-frequency information. (2) Illumination adjustment can be achieved by adjusting the low-frequency component of a low-light image, extracted through multi-level wavelet decomposition. Thus, by using wavelet transform the feature space is decomposed into a low-frequency branch for illumination adjustment and multiple high-frequency branches for texture enhancement. Additionally, stereo low-light image enhancement can extract useful cues from another view to improve enhancement. To this end, we propose a novel high-frequency guided cross-view interaction module (HF-CIM) that operates within high-frequency branches rather than across the entire feature space, effectively extracting valuable image details from the other view. Furthermore, to enhance the high-frequency information, a detail and texture enhancement module (DTEM) is proposed based on cross-attention mechanism. The model is trained on a dataset consisting of images with uniform illumination and images with non-uniform illumination. Experimental results on both real and synthetic images indicate that our algorithm offers significant advantages in light adjustment while effectively recovering high-frequency information. The code and dataset are publicly available at: this https URL.
zh
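The decoupling idea rests on a standard multi-level 2-D wavelet transform: the low-frequency band carries illumination and the high-frequency bands carry texture. The sketch below (using PyWavelets) scales those bands with hand-picked gains; in the paper the adjustments are produced by learned network branches, so the gains here are purely illustrative.

```python
import numpy as np
import pywt

# Gains are illustrative stand-ins for the learned low/high-frequency branches.
img = np.random.rand(256, 256).astype(np.float32) * 0.2        # toy low-light image

coeffs = pywt.wavedec2(img, wavelet="haar", level=3)
low_freq, high_freq = coeffs[0], coeffs[1:]

low_freq = low_freq * 3.0                                        # illumination adjustment
high_freq = [tuple(band * 1.2 for band in level) for level in high_freq]  # texture boost

enhanced = pywt.waverec2([low_freq] + high_freq, wavelet="haar")
print(f"mean intensity: {img.mean():.3f} -> {enhanced.mean():.3f}")
```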
[CV-33] Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification
【速读】:该论文试图解决医学影像诊断中因人为因素导致的准确性不足问题,尤其是在MRI图像中检测脑肿瘤时可能出现的误诊或漏诊现象。其解决方案的关键在于提出一种双重集成框架,该框架结合了预训练深度学习模型进行特征提取以及微调超参数的机器学习模型进行分类,通过特征融合与分类器融合提升诊断精度,其中超参数微调被证明对性能提升具有显著作用。
链接: https://arxiv.org/abs/2507.12177
作者: Zahid Ullah,Dragan Pamucar,Jihie Kim
机构: Dongguk University (东国大学); University of Belgrade (贝尔格莱德大学); Yuan Ze University (元智大学); Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Magnetic Resonance Imaging (MRI) is widely recognized as the most reliable tool for detecting tumors due to its capability to produce detailed images that reveal their presence. However, the accuracy of diagnosis can be compromised when human specialists evaluate these images. Factors such as fatigue, limited expertise, and insufficient image detail can lead to errors. For example, small tumors might go unnoticed, or overlap with healthy brain regions could result in misidentification. To address these challenges and enhance diagnostic precision, this study proposes a novel double ensembling framework, consisting of ensembled pre-trained deep learning (DL) models for feature extraction and ensembled fine-tuned hyperparameter machine learning (ML) models to efficiently classify brain tumors. Specifically, our method includes extensive preprocessing and augmentation, transfer learning concepts by utilizing various pre-trained deep convolutional neural networks and vision transformer networks to extract deep features from brain MRI, and fine-tune hyperparameters of ML classifiers. Our experiments utilized three different publicly available Kaggle MRI brain tumor datasets to evaluate the pre-trained DL feature extractor models, ML classifiers, and the effectiveness of an ensemble of deep features along with an ensemble of ML classifiers for brain tumor classification. Our results indicate that the proposed feature fusion and classifier fusion improve upon the state of the art, with hyperparameter fine-tuning providing a significant enhancement over the ensemble method. Additionally, we present an ablation study to illustrate how each component contributes to accurate brain tumor classification.
zh
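The double-ensemble recipe can be sketched with scikit-learn: deep features from two backbones are concatenated (feature fusion), each classical classifier has its hyperparameters tuned, and the tuned classifiers are combined by soft voting (classifier fusion). Random vectors stand in for the CNN/ViT features, so the printed accuracy is near chance; the dimensions and parameter grids are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Random vectors stand in for features extracted by pretrained CNN/ViT backbones.
rng = np.random.default_rng(0)
feats_cnn = rng.normal(size=(300, 128))               # e.g. ResNet features
feats_vit = rng.normal(size=(300, 128))               # e.g. ViT features
X = np.concatenate([feats_cnn, feats_vit], axis=1)    # feature fusion
y = rng.integers(0, 4, size=300)                      # 4 tumor classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

svm = GridSearchCV(SVC(probability=True), {"C": [0.1, 1, 10]}, cv=3)
rf = GridSearchCV(RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}, cv=3)

ensemble = VotingClassifier([("svm", svm), ("rf", rf)], voting="soft")   # classifier fusion
ensemble.fit(X_tr, y_tr)
print("accuracy (near chance on random features):", ensemble.score(X_te, y_te))
```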
[CV-34] Fine-Grained Image Recognition from Scratch with Teacher-Guided Data Augmentation
【速读】:该论文旨在解决细粒度图像识别(Fine-grained Image Recognition, FGIR)中对预训练模型的依赖问题,这种依赖限制了在资源受限环境中的适应性以及任务特定架构的开发。其解决方案的关键在于提出一种名为TGDA的新型训练框架,该框架通过结合数据感知增强与弱监督,利用细粒度感知的教师模型进行知识蒸馏,从而实现从头开始训练高性能的FGIR系统。这一方法使得任务特定且硬件感知的架构设计成为可能,如针对低分辨率FGIR的LRNets和针对高效推理优化的ViTFS系列Vision Transformers。
链接: https://arxiv.org/abs/2507.12157
作者: Edwin Arkel Rios,Fernando Mikael,Oswin Gosal,Femiloye Oyerinde,Hao-Chun Liang,Bo-Cheng Lai,Min-Chun Hu
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); National Tsing Hua University (国立清华大学); Cohere Labs Community (Cohere 实验室社区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main: 10 pages, 2 figures, 4 tables
点击查看摘要
Abstract:Fine-grained image recognition (FGIR) aims to distinguish visually similar sub-categories within a broader class, such as identifying bird species. While most existing FGIR methods rely on backbones pretrained on large-scale datasets like ImageNet, this dependence limits adaptability to resource-constrained environments and hinders the development of task-specific architectures tailored to the unique challenges of FGIR. In this work, we challenge the conventional reliance on pretrained models by demonstrating that high-performance FGIR systems can be trained entirely from scratch. We introduce a novel training framework, TGDA, that integrates data-aware augmentation with weak supervision via a fine-grained-aware teacher model, implemented through knowledge distillation. This framework unlocks the design of task-specific and hardware-aware architectures, including LRNets for low-resolution FGIR and ViTFS, a family of Vision Transformers optimized for efficient inference. Extensive experiments across three FGIR benchmarks over diverse settings involving low-resolution and high-resolution inputs show that our method consistently matches or surpasses state-of-the-art pretrained counterparts. In particular, in the low-resolution setting, LRNets trained with TGDA improve accuracy by up to 23% over prior methods while requiring up to 20.6x fewer parameters, lower FLOPs, and significantly less training data. Similarly, ViTFS-T can match the performance of a ViT B-16 pretrained on ImageNet-21k while using 15.3x fewer trainable parameters and requiring orders of magnitude less data. These results highlight TGDA's potential as an adaptable alternative to pretraining, paving the way for more efficient fine-grained vision systems.
zh
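The teacher-guided supervision used when training from scratch is, at its core, a knowledge-distillation loss that mixes hard labels with soft targets from a fine-grained-aware teacher. The sketch below shows that standard formulation; the temperature and mixing weight are assumptions, and TGDA's data-aware augmentation is not reproduced.

```python
import torch
import torch.nn.functional as F

# Standard distillation loss; temperature T and weight alpha are assumptions.
def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.7):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 200)              # e.g. 200 bird species
teacher_logits = torch.randn(8, 200)              # frozen fine-grained-aware teacher
labels = torch.randint(0, 200, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```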
[CV-35] PRISM: Distributed Inference for Foundation Models at Edge
【速读】:该论文试图解决将基础模型(Foundation Models, FMs)部署到边缘计算环境中的挑战,特别是在通信效率和计算资源受限的情况下。其解决方案的关键在于提出PRISM方法,该方法通过使用Segment Means表示来近似中间输出特征,从而显著减少设备间的通信开销;同时重新设计自注意力机制以消除因位置划分中每个设备独立计算键/值引起的冗余计算,并引入针对自回归模型的分区感知因果掩码方案。
链接: https://arxiv.org/abs/2507.12145
作者: Muhammad Azlan Qazi,Alexandros Iosifidis,Qi Zhang
机构: Aarhus University (奥胡斯大学); Tampere University (坦佩雷大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Foundation models (FMs) have achieved remarkable success across a wide range of applications, from image classification to natural language processing, but pose significant challenges for deployment at the edge. This has sparked growing interest in developing practical and efficient strategies for bringing foundation models to edge environments. In this work, we propose PRISM, a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices. Our method leverages a Segment Means representation to approximate intermediate output features, drastically reducing inter-device communication. Additionally, we restructure the self-attention mechanism to eliminate redundant computations caused by per-device Key/Value calculation in position-wise partitioning and design a partition-aware causal masking scheme tailored for autoregressive models. We evaluate PRISM on ViT, BERT, and GPT-2 across diverse datasets, namely CIFAR-10, CIFAR-100, ImageNet-1k, GLUE, and CBT. Our results demonstrate substantial reductions in communication overhead (up to 99.2% for BERT at compression rate CR = 128) and per-device computation (51.24% for BERT at the same setting), with only minor accuracy degradation. This method offers a scalable and practical solution for deploying foundation models in distributed resource-constrained environments.
zh
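The communication saving comes from summarizing an intermediate activation before it leaves a device. A Segment Means style compression can be sketched as follows: token features are averaged within contiguous segments, sent, and broadcast back over their spans on the receiving device. The segment count and the broadcast reconstruction are assumptions made for illustration, not PRISM's exact scheme.

```python
import torch

# Segment count and the broadcast reconstruction are illustrative assumptions.
def compress(h: torch.Tensor, num_segments: int) -> torch.Tensor:
    """h: (tokens, dim) -> (num_segments, dim) of per-segment means."""
    return torch.stack([c.mean(dim=0) for c in torch.chunk(h, num_segments, dim=0)])

def expand(means: torch.Tensor, tokens: int) -> torch.Tensor:
    """Broadcast each segment mean back over its token span."""
    reps = [len(c) for c in torch.chunk(torch.arange(tokens), means.shape[0])]
    return torch.repeat_interleave(means, torch.tensor(reps), dim=0)

h = torch.randn(197, 768)                    # e.g. ViT tokens computed on device A
means = compress(h, num_segments=16)         # only 16 vectors cross the network
h_approx = expand(means, tokens=197)         # device B continues with the approximation
rel_err = ((h - h_approx).norm() / h.norm()).item()
print(means.shape, h_approx.shape, f"relative error {rel_err:.3f}")
```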
[CV-36] Neural Human Pose Prior
【速读】:该论文试图解决在人体姿态建模中构建一个灵活且具有概率基础的先验分布的问题,以提升人体运动捕捉与重建流程的性能。其解决方案的关键在于利用基于归一化流(Normalizing Flows)的方法,特别是RealNVP架构,来学习在6D旋转表示下的姿态密度分布。通过在训练过程中反转Gram-Schmidt过程,该方法有效解决了在有效6D旋转流形上建模分布的挑战,从而实现了稳定的学习并保持了与基于旋转框架的下游任务兼容性。
链接: https://arxiv.org/abs/2507.12138
作者: Michal Heker,Sefy Kararlitsky,David Tolpin
机构: Yoom(优姆)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work in progress
点击查看摘要
Abstract:We introduce a principled, data-driven approach for modeling a neural prior over human body poses using normalizing flows. Unlike heuristic or low-expressivity alternatives, our method leverages RealNVP to learn a flexible density over poses represented in the 6D rotation format. We address the challenge of modeling distributions on the manifold of valid 6D rotations by inverting the Gram-Schmidt process during training, enabling stable learning while preserving downstream compatibility with rotation-based frameworks. Our architecture and training pipeline are framework-agnostic and easily reproducible. We demonstrate the effectiveness of the learned prior through both qualitative and quantitative evaluations, and we analyze its impact via ablation studies. This work provides a sound probabilistic foundation for integrating pose priors into human motion capture and reconstruction pipelines.
zh
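The representation the flow models is the standard 6D rotation parameterization: a rotation matrix maps to its first two columns, and the Gram-Schmidt process maps any 6-vector back onto a valid rotation. The sketch below shows that round trip; how the prior inverts Gram-Schmidt during flow training is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Standard 6D rotation representation (first two columns) and its Gram-Schmidt inverse.
def rotmat_to_6d(R: torch.Tensor) -> torch.Tensor:
    """R: (..., 3, 3) -> (..., 6), the first two columns stacked."""
    return R[..., :, :2].transpose(-1, -2).reshape(*R.shape[:-2], 6)

def sixd_to_rotmat(d6: torch.Tensor) -> torch.Tensor:
    """Gram-Schmidt: (..., 6) -> (..., 3, 3) valid rotation matrix."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)

# Round trip on a rotation built by orthogonalizing a random 6-vector.
d6_noise = torch.randn(6)
R = sixd_to_rotmat(d6_noise)
assert torch.allclose(sixd_to_rotmat(rotmat_to_6d(R)), R, atol=1e-5)
print(R @ R.T)    # ~identity: the output lies on the rotation manifold
```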
[CV-37] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving ICCV2025
【速读】:该论文试图解决动态城市驾驶场景的建模与渲染问题,特别是针对现有方法依赖昂贵的人工物体轨迹标注或自监督方法无法准确捕捉动态物体运动和正确分解场景导致渲染伪影的问题。解决方案的关键在于提出AD-GS框架,其核心是一种结合局部感知B样条曲线与全局感知三角函数的可学习运动模型,实现了灵活且精确的动态物体建模,并通过简化的伪2D分割自动区分场景中的物体与背景,利用动态高斯和双向时间可见性掩码进行表示,同时引入可见性推理和物理刚性正则化以提高鲁棒性。
链接: https://arxiv.org/abs/2507.12137
作者: Jiawei Xu,Kai Deng,Zexin Fan,Shenlong Wang,Jin Xie,Jian Yang
机构: Nankai University (南开大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
点击查看摘要
Abstract:Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
zh
[CV-38] Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement ICCV2025
【速读】:该论文旨在解决深度学习中的图像增强问题,特别是现有方法在建模复杂颜色关系上的局限性以及传统多层感知机(MLP)在全球共享参数下难以处理局部变化的问题。其解决方案的关键在于提出一种基于双边网格的像素自适应多层感知机(BPAM)框架,通过将双边网格的空间建模能力与MLP的非线性映射能力相结合,使每个像素能够动态获取独特的变换参数,从而实现更精确的颜色映射。
链接: https://arxiv.org/abs/2507.12135
作者: Junyu Lou,Xiaorui Zhao,Kexuan Shi,Shuhang Gu
机构: University of Electronic Science and Technology of China (中国电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
点击查看摘要
Abstract:Deep learning-based bilateral grid processing has emerged as a promising solution for image enhancement, inherently encoding spatial and intensity information while enabling efficient full-resolution processing through slicing operations. However, existing approaches are limited to linear affine transformations, hindering their ability to model complex color relationships. Meanwhile, while multi-layer perceptrons (MLPs) excel at non-linear mappings, traditional MLP-based methods employ globally shared parameters, which makes it hard to handle localized variations. To overcome these dual challenges, we propose a Bilateral Grid-based Pixel-Adaptive Multi-layer Perceptron (BPAM) framework. Our approach synergizes the spatial modeling of bilateral grids with the non-linear capabilities of MLPs. Specifically, we generate bilateral grids containing MLP parameters, where each pixel dynamically retrieves its unique transformation parameters and obtains a distinct MLP for color mapping based on spatial coordinates and intensity values. In addition, we propose a novel grid decomposition strategy that categorizes MLP parameters into distinct types stored in separate subgrids. Multi-channel guidance maps are used to extract category-specific parameters from corresponding subgrids, ensuring effective utilization of color information during slicing while guiding precise parameter generation. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods in performance while maintaining real-time processing capabilities.
zh
[CV-39] Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers
【速读】:该论文试图解决Vision Transformer (ViT)在计算成本上的高复杂度问题,特别是在通过独立剪枝查询(Query, Q)和键(Key, K)令牌时导致的精度下降问题。其解决方案的关键在于提出了一种基于块的对称剪枝与融合方法(Block-based Symmetric Pruning and Fusion for efficient ViT, BSPF-ViT),该方法联合优化Q/K令牌的剪枝过程,考虑令牌间的相互作用,并通过相似性融合保留关键信息,同时利用共享权重构建对称注意力矩阵,仅需剪枝上三角部分以提升效率。
链接: https://arxiv.org/abs/2507.12125
作者: Yi-Kuan Hsieh,Jun-Wei Hsieh,Xin Li,Yu-Ming Chang,Yu-Chee Tseng
机构: National Yang Ming Chiao Tung University(国立阳明交通大学); University at Albany, SUNY(纽约州立大学阿尔巴尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT's O(n^2) complexity by pruning unimportant tokens. However, these techniques often sacrifice accuracy by independently pruning query (Q) and key (K) tokens, leading to performance degradation due to overlooked token interactions. To address this limitation, we introduce a novel Block-based Symmetric Pruning and Fusion for efficient ViT (BSPF-ViT) that optimizes the pruning of Q/K tokens jointly. Unlike previous methods that consider only a single direction, our approach evaluates each token and its neighbors to decide which tokens to retain by taking token interaction into account. The retained tokens are compressed through a similarity fusion step, preserving key information while reducing computational costs. The shared weights of Q/K tokens create a symmetric attention matrix, allowing only the upper triangular part to be pruned for a speed-up. BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, increasing ImageNet classification accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S, while reducing computational overhead by 50%. It achieves 40% speedup with improved accuracy across various ViTs.
zh
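The joint pruning idea can be illustrated as follows: with a shared Q/K projection the attention matrix is symmetric, tokens are ranked by the attention mass they receive, and pruned tokens are fused into their most similar kept token rather than discarded. The keep ratio, scoring rule, and fusion weight below are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

# Keep ratio, importance score, and fusion weight are illustrative assumptions.
def prune_and_fuse(x: torch.Tensor, w_qk: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """x: (n, d) tokens; w_qk: (d, d) projection shared by Q and K."""
    q = x @ w_qk                                      # shared weights => symmetric scores
    attn = torch.softmax(q @ q.T / q.shape[-1] ** 0.5, dim=-1)
    importance = attn.sum(dim=0)                      # attention mass received per token
    n_keep = max(1, int(keep_ratio * x.shape[0]))
    keep = importance.topk(n_keep).indices
    kept_set = set(keep.tolist())
    drop = [i for i in range(x.shape[0]) if i not in kept_set]

    fused = x[keep].clone()
    if drop:
        sim = F.normalize(x[drop], dim=-1) @ F.normalize(x[keep], dim=-1).T
        nearest = sim.argmax(dim=-1)                  # most similar kept token
        for j, k in enumerate(nearest.tolist()):
            fused[k] = 0.5 * (fused[k] + x[drop[j]])  # similarity fusion instead of discarding
    return fused

tokens = torch.randn(16, 64)
w = torch.randn(64, 64) * 0.1
print(prune_and_fuse(tokens, w).shape)                # torch.Size([8, 64]) after 50% pruning
```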
[CV-40] Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
【速读】:该论文试图解决室内环境中对象的开放词汇定位问题,即在复杂室内场景中根据自然语言描述准确识别和定位物体。解决方案的关键在于提出OVIGo-3DHSG方法,该方法利用由RGB-D帧序列生成的分层场景图(Hierarchical Scene Graph)来表示广泛的室内环境,并结合开放词汇基础模型与传感器数据处理技术。通过将分层场景图与大语言模型集成,实现多步骤的空间推理,从而增强对空间上下文的理解。
链接: https://arxiv.org/abs/2507.12123
作者: Sergey Linok,Gleb Naumov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures, 2 tables
点击查看摘要
Abstract:We propose OVIGo-3DHSG method - Open-Vocabulary Indoor Grounding of objects using 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment over a Hierarchical Scene Graph derived from sequences of RGB-D frames utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometry accuracy of hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at this https URL.
zh
[CV-41] LidarPainter: One-Step Away From Any Lidar View To Novel Guidance
【速读】:该论文旨在解决动态驾驶场景重建中,当视点偏离输入轨迹时出现的背景和车辆模型损坏问题,以及现有方法在一致性、变形和计算耗时方面的局限性。其解决方案的关键在于提出LidarPainter,这是一种单步扩散模型,能够在实时条件下从稀疏LiDAR数据和带有伪影的渲染图像中恢复一致的驾驶视图,从而实现高保真的车道变换。
链接: https://arxiv.org/abs/2507.12114
作者: Yuzhou Ji,Ke Ma,Hong Cai,Anchun Zhang,Lizhuang Ma,Xin Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Dynamic driving scene reconstruction is of great importance in fields like digital twin system and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. To improve reconstruction quality on novel trajectory, existing methods are subject to various limitations including inconsistency, deformation, and time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR condition and artifact-corrupted renderings in real-time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality and resource efficiency, specifically 7 x faster than StreetCrafter with only one fifth of GPU memory required. LidarPainter also supports stylized generation using text prompts such as “foggy” and “night”, allowing for a diverse expansion of the existing asset library.
zh
[CV-42] Non-Adaptive Adversarial Face Generation
【速读】:该论文试图解决针对人脸识别系统(Face Recognition Systems, FRSs)的对抗攻击问题,旨在生成视觉上与真实人脸差异显著但能被FRS误识别为目标身份的合成面部图像。解决方案的关键在于利用FRS特征空间的结构特性,发现具有相同属性(如性别或种族)的个体构成属性子球面,通过该子球面实现非自适应性攻击和极少量查询次数,从而无需依赖迁移性和开源替代模型,显著提高了攻击效率与成功率。
链接: https://arxiv.org/abs/2507.12107
作者: Sunpill Kim,Seunghun Paik,Chanwoo Hwang,Minsu Kim,Jae Hong Seo
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Adversarial attacks on face recognition systems (FRSs) pose serious security and privacy threats, especially when these systems are used for identity verification. In this paper, we propose a novel method for generating adversarial faces: synthetic facial images that are visually distinct yet recognized as a target identity by the FRS. Unlike iterative optimization-based approaches (e.g., gradient descent or other iterative solvers), our method leverages the structural characteristics of the FRS feature space. We find that individuals sharing the same attribute (e.g., gender or race) form an attributed subsphere. By utilizing such subspheres, our method achieves both non-adaptiveness and a remarkably small number of queries. This eliminates the need for relying on transferability and open-source surrogate models, which have been a typical strategy when repeated adaptive queries to commercial FRSs are impossible. Despite requiring only a single non-adaptive query consisting of 100 face images, our method achieves a high success rate of over 93% against AWS's CompareFaces API at its default threshold. Furthermore, unlike many existing attacks that perturb a given image, our method can deliberately produce adversarial faces that impersonate the target identity while exhibiting high-level attributes chosen by the adversary.
zh
[CV-43] Out-of-distribution data supervision towards biomedical semantic segmentation
【速读】:该论文旨在解决医学图像分割网络在有限且不完美的医学数据集上训练时,容易出现前景与背景对象的意外误分类问题。其解决方案的关键在于引入无监督的分布外(Out-of-Distribution, OoD)数据监督,从而在不需要外部数据源、特征正则化目标或额外标注的情况下,提升分割性能。该方法无需修改网络结构即可无缝集成到分割网络中,并在Lizard数据集上实现了显著的性能提升,甚至在完全使用OoD数据而无前景类别标签的情况下,达到了76.1%的mIoU测试结果。
链接: https://arxiv.org/abs/2507.12105
作者: Yiquan Gao,Duohui Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was published in Proceedings of SPIE Volume 13442 and is reprinted with permission. The official version is available at this https URL . One personal copy is allowed. Reproduction, distribution, or commercial use is prohibited
点击查看摘要
Abstract:Biomedical segmentation networks easily suffer from the unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out-of-Distribution (OoD) data on other visual tasks, we propose a data-centric framework, Med-OoD, to address this issue by introducing OoD data supervision into fully-supervised biomedical segmentation with none of the following needs: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification to the architectures. Extensive experiments show that Med-OoD largely prevents various segmentation networks from pixel misclassification on medical images and achieves considerable performance improvements on the Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network completely using OoD data devoid of foreground class labels, surprisingly reaching 76.1% mIoU on the test set. We hope this learning paradigm will attract people to rethink the roles of OoD data. Code is made available at this https URL.
zh
[CV-44] DeepShade: Enable Shade Simulation by Text-conditioned Image Generation IJCAI2025
【速读】:该论文试图解决当前路径规划系统在极端高温条件下无法有效考虑遮荫信息的问题,主要原因是难以从噪声卫星图像中直接估计遮荫,并且生成模型缺乏足够的训练数据。其解决方案的关键在于构建一个涵盖多种地理区域、建筑密度和城市布局的大型数据集,结合Blender-based 3D模拟与建筑轮廓生成不同太阳天顶角下的建筑阴影,并将其与卫星图像对齐,从而为学习遮荫模式提供丰富的资源;同时提出DeepShade模型,利用扩散机制学习并合成时间变化的遮荫特征,通过联合RGB图像与Canny边缘层以及对比学习捕捉遮荫的时空变化规律,最终通过文本描述条件生成高质量的遮荫图像。
链接: https://arxiv.org/abs/2507.12103
作者: Longchao Da,Xiangrui Liu,Mithun Shivakoti,Thirulogasankar Pranav Kutralingam,Yezhou Yang,Hua Wei
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 7pages, 4 figures. Accepted to IJCAI 2025
点击查看摘要
Abstract:Heatwaves pose a significant threat to public health, especially as global warming intensifies. However, current routing systems (e.g., online maps) fail to incorporate shade information due to the difficulty of estimating shades directly from noisy satellite imagery and the limited availability of training data for generative models. In this paper, we address these challenges through two main contributions. First, we build an extensive dataset covering diverse longitude-latitude regions, varying levels of building density, and different urban layouts. Leveraging Blender-based 3D simulations alongside building outlines, we capture building shadows under various solar zenith angles throughout the year and at different times of day. These simulated shadows are aligned with satellite images, providing a rich resource for learning shade patterns. Second, we propose the DeepShade, a diffusion-based model designed to learn and synthesize shade variations over time. It emphasizes the nuance of edge features by jointly considering RGB with the Canny edge layer, and incorporates contrastive learning to capture the temporal change rules of shade. Then, by conditioning on textual descriptions of known conditions (e.g., time of day, solar angles), our framework provides improved performance in generating shade images. We demonstrate the utility of our approach by using our shade predictions to calculate shade ratios for real-world route planning in Tempe, Arizona. We believe this work will benefit society by providing a reference for urban planning in extreme heat weather and its potential practical applications in the environment.
zh
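The conditioning input described above, RGB stacked with a Canny edge layer, can be assembled in a few lines. The sketch below uses a random tile in place of real satellite imagery and illustrative Canny thresholds; the diffusion model and the text conditioning are not reproduced.

```python
import cv2
import numpy as np

# Random tile and Canny thresholds are stand-ins; the diffusion model and the
# text condition (time of day, solar angles) are not reproduced here.
rgb = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)   # stand-in satellite tile
gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)                            # (256, 256) edge layer

conditioning = np.concatenate([rgb, edges[..., None]], axis=-1)   # H x W x 4 input
print(conditioning.shape)   # paired with a prompt such as "15:00, solar zenith 35 deg"
```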
[CV-45] BRUM: Robust 3D Vehicle Reconstruction from 360 Sparse Images
【速读】:该论文旨在解决从稀疏视角输入中准确重建车辆的问题,这一问题在现实应用场景中受到现有方法对密集输入视图依赖的限制。其解决方案的关键在于利用深度图和鲁棒的姿态估计架构来合成新视角并增强训练数据,具体包括对高置信度像素应用选择性光度损失,并用DUSt3R架构替代传统的Structure-from-Motion流程以提升相机姿态估计精度。
链接: https://arxiv.org/abs/2507.12095
作者: Davide Di Nucci,Matteo Tomei,Guido Borghi,Luca Ciuffreda,Roberto Vezzani,Rita Cucchiara
机构: University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学); Prometeia(普罗米修斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate 3D reconstruction of vehicles is vital for applications such as vehicle inspection, predictive maintenance, and urban planning. Existing methods like Neural Radiance Fields and Gaussian Splatting have shown impressive results but remain limited by their reliance on dense input views, which hinders real-world applicability. This paper addresses the challenge of reconstructing vehicles from sparse-view inputs, leveraging depth maps and a robust pose estimation architecture to synthesize novel views and augment training data. Specifically, we enhance Gaussian Splatting by integrating a selective photometric loss, applied only to high-confidence pixels, and replacing standard Structure-from-Motion pipelines with the DUSt3R architecture to improve camera pose estimation. Furthermore, we present a novel dataset featuring both synthetic and real-world public transportation vehicles, enabling extensive evaluation of our approach. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, showcasing the method’s ability to achieve high-quality reconstructions even under constrained input conditions.
zh
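The selective photometric loss amounts to masking the reconstruction term with a per-pixel confidence map so that uncertain pixels do not contribute. A minimal sketch, assuming an L1 photometric term and a fixed confidence threshold (both assumptions), is shown below.

```python
import torch

# Minimal sketch: L1 photometric error averaged only over high-confidence pixels.
def selective_photometric_loss(rendered, target, confidence, tau: float = 0.8):
    mask = (confidence > tau).float()                       # keep confident pixels only
    per_pixel = (rendered - target).abs().mean(dim=1)       # (B, H, W) channel-mean L1
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

rendered = torch.rand(2, 3, 64, 64, requires_grad=True)     # splatted image
target = torch.rand(2, 3, 64, 64)                           # captured frame
confidence = torch.rand(2, 64, 64)                          # e.g. from depth/pose agreement
loss = selective_photometric_loss(rendered, target, confidence)
loss.backward()
print(loss.item())
```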
[CV-46] Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis
【速读】:该论文试图解决多发性硬化症(Multiple Sclerosis, MS)中皮质病变(Cortical Lesions, CLs)在临床中的常规整合难题,主要由于其磁共振成像(MRI)表现细微、专家标注困难以及缺乏标准化的自动化方法。论文提出了一种全面的多中心CL检测与分割基准,采用自配置的nnU-Net框架,并针对CL检测进行了优化。其解决方案的关键在于利用高场强MRI数据和专家共识标注,结合改进的深度学习模型,以提升CL检测的准确性和泛化能力,同时分析数据变异性和协议差异对模型性能的影响,为临床应用提供未来改进方向。
链接: https://arxiv.org/abs/2507.12092
作者: Nataliia Molchanova,Alessandro Cagol,Mario Ocampo-Pineda,Po-Jui Lu,Matthias Weigel,Xinjie Chen,Erin Beck,Charidimos Tsagkas,Daniel Reich,Colin Vanden Bulcke,Anna Stolting,Serena Borrelli,Pietro Maggi,Adrien Depeursinge,Cristina Granziera,Henning Mueller,Pedro M. Gordaliza,Meritxell Bach Cuadra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to subtle magnetic resonance imaging (MRI) appearance, challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to the improved CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with an F1-score of 0.64 and 0.5 in and out of the domain, respectively. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, offering future recommendations to address these barriers to clinical adoption. To reinforce the reproducibility, the implementation and models will be publicly accessible and ready to use at this https URL and this https URL.
zh
[CV-47] YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association
【速读】:该论文旨在解决从无人机视角进行小而灵活的多目标跟踪(SMOT)的挑战性问题,其难点主要源于目标外观特征极度稀少、相机与目标动态耦合导致的复杂运动纠缠以及密集群体行为引发的频繁遮挡和身份模糊。论文提出的解决方案关键在于两个层面的创新:在检测层面,引入了名为SliceTrain的系统训练增强框架,通过“确定性全覆盖切片”与“切片级随机增强”的协同作用,有效提升了高分辨率图像中对小目标的学习能力;在跟踪层面,设计了一个完全不依赖外观信息的鲁棒跟踪器,融合了运动方向保持机制和结合边界框扩展与距离惩罚的自适应相似性度量,从而稳定处理不规则运动并维持目标身份。
链接: https://arxiv.org/abs/2507.12087
作者: Xiang Yu,Xinyao Liu,Guang Liang
机构: School of Artificial Intelligence, Nanjing University, China (人工智能学院,南京大学); University of Science and Technology of China, Hefei, China (中国科学技术大学,合肥)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 “Finding Birds” Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named SliceTrain. This framework, through the synergy of 'deterministic full-coverage slicing' and 'slice-level stochastic augmentation', effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a motion direction maintenance (EMA) mechanism and an adaptive similarity metric combining bounding box expansion and distance penalty into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities. Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of 55.205, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at this https URL.
zh
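The association cue on the tracking side can be sketched as an IoU computed on slightly expanded boxes (which helps tiny targets overlap at all) combined with a normalized center-distance penalty. The expansion ratio and weighting below are assumptions; the EMA direction term and the full OC-SORT integration are not reproduced.

```python
import numpy as np

# Expansion ratio and distance weight are illustrative assumptions.
def expand(box, ratio: float = 1.2) -> np.ndarray:
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def iou(a, b) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def adaptive_similarity(det, trk, img_diag: float, w_dist: float = 0.3) -> float:
    eiou = iou(expand(det), expand(trk))                     # expanded-box IoU
    c_det = np.array([(det[0] + det[2]) / 2, (det[1] + det[3]) / 2])
    c_trk = np.array([(trk[0] + trk[2]) / 2, (trk[1] + trk[3]) / 2])
    dist_penalty = np.linalg.norm(c_det - c_trk) / img_diag  # normalized center distance
    return eiou - w_dist * dist_penalty

det = [100, 100, 108, 108]          # tiny bird detection
trk = [103, 101, 111, 109]          # predicted track box
print(adaptive_similarity(det, trk, img_diag=np.hypot(1920, 1080)))
```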
[CV-48] Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics ICCV2025
【速读】:该论文旨在解决自动驾驶系统中对道路上交通代理的运动预测问题,这是一个具有挑战性且至关重要的任务。传统数据驱动方法直接预测未来轨迹,而本文从规划角度重新思考该任务,提出“先推理,后预测”的策略,将行为意图作为轨迹预测的空间引导。解决方案的关键在于引入一种基于新型以查询为中心的逆强化学习(Inverse Reinforcement Learning, IRL)框架的可解释、奖励驱动的行为意图推理器,通过统一向量化表示和查询中心范式聚合上下文特征,生成目标代理在场景中的奖励分布,并据此进行策略滚动以推断多种可能的意图,最终结合层次化DETR-like解码器与双向选择状态空间模型生成准确的未来轨迹及其概率。
链接: https://arxiv.org/abs/2507.12083
作者: Muleilan Pei,Shaoshuai Shi,Xuesong Chen,Xu Liu,Shaojie Shen
机构: HKUST(香港科技大学); Voyager Research, Didi Chuxing(维加斯研究,滴滴出行); Zhuoyu Technology(卓宇科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICCV 2025
点击查看摘要
Abstract:Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a “First Reasoning, Then Forecasting” strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent’s behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.
zh
[CV-49] MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning ACM-MM’25
【速读】:该论文旨在解决视频时刻检索(Video Moment Retrieval, MR)和视频亮点检测(Highlight Detection, HD)任务中,如何有效捕捉视频内容中时间运动与空间语义之间复杂关系的问题。其解决方案的关键在于提出一种名为Motion-Semantics DETR (MS-DETR) 的框架,该框架通过统一学习策略来提取丰富的运动-语义特征。具体而言,编码器首先在给定文本查询的指导下显式建模运动和语义维度内的分离模态相关性,解码器则利用时间运动与空间语义维度间的任务相关性,实现精确的查询引导定位和精细的亮点边界划分。此外,为应对MR/HD数据集中运动与语义维度的固有稀疏性问题,该方法还引入了生成策略增强语料库,并采用对比去噪学习以确保模型组件能够稳健有效地学习。
链接: https://arxiv.org/abs/2507.12062
作者: Hongxu Ma,Guanshuo Wang,Fufu Yu,Qiong Jia,Shouhong Ding
机构: Fudan University (复旦大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM’25
点击查看摘要
Abstract:Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at this https URL.
zh
[CV-50] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
【速读】:该论文旨在解决面部反欺骗(Face Anti-Spoofing, FAS)中的两个关键问题:对攻击类型语义理解有限以及跨域训练冗余。其解决方案的关键在于提出一种名为InstructFLIP的新型指令调优框架,该框架通过集成视觉-语言模型(Vision-Language Models, VLMs)来增强对视觉输入的感知,并利用文本指导提升模型在单一领域训练后的泛化能力。InstructFLIP的核心创新在于将指令显式地解耦为内容与风格两部分,其中内容指令关注欺骗的本质语义,而风格指令则考虑环境和相机特性等变化因素。
链接: https://arxiv.org/abs/2507.12060
作者: Kun-Hsiang Lin,Yu-Wen Tseng,Kang-Yang Huang,Jhih-Ciang Wu,Wen-Huang Cheng
机构: National Taiwan University(国立台湾大学); National Taiwan Normal University(国立台湾师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by MM’25
点击查看摘要
Abstract:Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at this https URL.
zh
[CV-51] IDFace: Face Template Protection for Efficient and Secure Identification ICCV2025
【速读】:该论文试图解决在人脸识别系统(Face Recognition System, FRS)中保护用户面部模板的隐私问题,特别是在使用同态加密(Homomorphic Encryption, HE)技术时效率低下的问题。解决方案的关键在于提出IDFace方法,其基于两种新技术:一种是模板表示转换,显著降低了匹配测试的单位成本;另一种是空间高效的编码,减少了加密算法带来的空间浪费,从而减少了对加密模板的操作次数。实验表明,IDFace能够在126ms内从1M个加密模板数据库中完成人脸识别,仅比明文识别多出2倍的开销。
链接: https://arxiv.org/abs/2507.12050
作者: Sunpill Kim,Seunghun Paik,Chanwoo Hwang,Dongsoo Kim,Junbum Shin,Jae Hong Seo
机构: Hanyang University (汉阳大学); CryptoLab Inc. (加密实验室公司)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
点击查看摘要
Abstract:As face recognition systems (FRS) become more widely used, user privacy becomes more important. A key privacy issue in FRS is protecting the user’s face template, as the characteristics of the user’s face image can be recovered from the template. Although recent advances in cryptographic tools such as homomorphic encryption (HE) have provided opportunities for securing the FRS, HE cannot be used directly with FRS in an efficient plug-and-play manner. In particular, although HE is functionally complete for arbitrary programs, it is basically designed for algebraic operations on encrypted data of predetermined shape, such as a polynomial ring. Thus, a non-tailored combination of HE and the system can yield very inefficient performance, and many previous HE-based face template protection methods are hundreds of times slower than plain systems without protection. In this study, we propose IDFace, a new HE-based secure and efficient face identification method with template protection. IDFace is designed on the basis of two novel techniques for efficient searching on a (homomorphically encrypted) biometric database with an angular metric. The first technique is a template representation transformation that sharply reduces the unit cost for the matching test. The second is a space-efficient encoding that reduces wasted space from the encryption algorithm, thus saving the number of operations on encrypted templates. Through experiments, we show that IDFace can identify a face template from among a database of 1M encrypted templates in 126ms, showing only 2X overhead compared to the identification over plaintexts.
zh
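For context, the plaintext analogue of the matching test is simple: with L2-normalized templates the angular comparison is an inner product, so scoring an entire database is one matrix-vector product. The sketch below shows that baseline on random templates; the homomorphic packing and space-efficient encoding that IDFace contributes on top are not reproduced.

```python
import numpy as np

# Plaintext baseline only; the homomorphic encryption layer is not shown.
rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 512)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)        # enrolled, L2-normalized templates

probe = rng.normal(size=512).astype(np.float32)
probe /= np.linalg.norm(probe)                          # query template

scores = db @ probe                                     # cosine similarity per identity
best = int(np.argmax(scores))
print(best, float(scores[best]))                        # accept if above a decision threshold
```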
[CV-52] MoViAD: Modular Visual Anomaly Detection
【速读】:该论文旨在解决视觉异常检测(Visual Anomaly Detection, VAD)领域中因异常数据稀缺和需要无监督训练所带来的挑战。其解决方案的关键在于引入MoViAD,这是一个全面且高度模块化的库,旨在为最先进的VAD模型、训练器、数据集和VAD工具提供快速便捷的访问。MoViAD支持多种应用场景,并通过专门的边缘和物联网设置解决实际部署问题,同时提供优化模型、量化与压缩工具以实现高效的设备端执行和分布式推理。
链接: https://arxiv.org/abs/2507.12049
作者: Manuel Barusco,Francesco Borsatti,Arianna Stropeni,Davide Dalle Pezze,Gian Antonio Susto
机构: University of Padova(帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:VAD is a critical field in machine learning focused on identifying deviations from normal patterns in images, often challenged by the scarcity of anomalous data and the need for unsupervised training. To accelerate research and deployment in this domain, we introduce MoViAD, a comprehensive and highly modular library designed to provide fast and easy access to state-of-the-art VAD models, trainers, datasets, and VAD utilities. MoViAD supports a wide array of scenarios, including continual, semi-supervised, few-shots, noisy, and many more. In addition, it addresses practical deployment challenges through dedicated Edge and IoT settings, offering optimized models and backbones, along with quantization and compression utilities for efficient on-device execution and distributed inference. MoViAD integrates a selection of backbones, robust evaluation VAD metrics (pixel-level and image-level) and useful profiling tools for efficiency analysis. The library is designed for fast, effortless deployment, enabling machine learning engineers to easily use it for their specific setup with custom models, datasets, and backbones. At the same time, it offers the flexibility and extensibility researchers need to develop and experiment with new methods.
zh
[CV-53] Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification
【速读】:该论文旨在解决基于立体声音频数据的声源定位与检测(Stereo SELD)问题,相较于以往使用四通道一阶全向音场(FOA)和麦克风阵列的设置,今年的挑战将焦点转移到更常见的有限视场(FOV)音频与媒体场景中。解决方案的关键在于针对立体声数据固有的角度模糊性,专注于方位角平面(左右轴)的方向到达角(DOA)估计以及距离估计,并在视听轨道中引入了屏幕内/外事件分类的新子任务。此外,论文提出了DCASE2025 Task3 Stereo SELD数据集,并设计了一个能够处理立体声音频和对应视频帧的基线系统,该系统在视听轨道中整合了屏幕内/外分类功能,同时评估指标也新增了屏幕内/外准确率以衡量模型识别屏幕内声源的能力。
链接: https://arxiv.org/abs/2507.12042
作者: Kazuki Shimada,Archontis Politis,Iran R. Roman,Parthasaarathy Sudarsanam,David Diaz-Guerra,Ruchi Pandey,Kengo Uchida,Yuichiro Koyama,Naoya Takahashi,Takashi Shibuya,Shusuke Takahashi,Tuomas Virtanen,Yuki Mitsufuji
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: 5 pages, 2 figures
点击查看摘要
Abstract:This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year’s challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models’ ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.
zh
[CV-54] Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
【速读】:该论文试图解决新颖类别发现(Novel Class Discovery, NCD)问题,即通过利用不相交的已知类知识来聚类新颖类别。现有NCD方法面临两大局限:一方面主要针对单视角数据,忽视了多视角数据的广泛应用;另一方面依赖伪标签进行监督,导致性能不稳定。该论文提出的解决方案关键在于提出一种名为Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD)的框架,首次在多视角设置下探索NCD问题。其核心思想是通过视图内(intra-view)与视图间(inter-view)的相关性引导,利用已知类的分布相似性和视图关系,结合矩阵分解与加权融合策略,提升新颖类别的聚类效果。
链接: https://arxiv.org/abs/2507.12029
作者: Xinhang Wan,Jiyuan Liu,Qian Qu,Suyuan Liu,Chuyu Zhang,Fangdi Wang,Xinwang Liu,En Zhu,Kunlun He
机构: National University of Defense Technology(国防科技大学); ShanghaiTech University(上海科技大学); Chinese PLA General Hospital(中国人民解放军总医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in multi-view setting so far. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency among the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through the weighted fusion of factor matrices and dynamically adjusting view weights of known classes based on the supervision loss, which are then transferred to novel class learning. Experimental results validate the effectiveness of our proposed approach.
zh
[CV-55] SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation IROS2025
【速读】:该论文试图解决在没有初始位姿先验的情况下,实现高精度的全局定位问题。解决方案的关键在于提出了一种名为SGLoc的定位系统,该系统通过利用语义信息直接从三维高斯点云(3DGS)表示中回归相机位姿。其核心方法包括多层级位姿回归策略和基于语义的全局检索算法,前者逐步估计并优化查询图像的位姿,后者通过匹配二维查询图像与3DGS语义表示的场景语义描述子来建立对应关系,从而获得粗略位姿估计,并进一步通过迭代优化提升定位精度。
链接: https://arxiv.org/abs/2507.12027
作者: Beining Xu,Siting Zhu,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 2 figures, IROS 2025
点击查看摘要
Abstract:We propose SGLoc, a novel localization system that directly regresses camera poses from 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method utilizes the semantic relationship between 2D image and 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multi-level pose regression strategy that progressively estimates and refines the pose of query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between 2D (image) and 3D (3DGS map). By matching the extracted scene semantic descriptors of 2D query image and 3DGS semantic representation, we align the image with the local region of the global 3DGS map, thereby obtaining a coarse pose estimation. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the rendered image from 3DGS. Our SGLoc demonstrates superior performance over baselines on 12scenes and 7scenes datasets, showing excellent capabilities in global localization without initial pose prior. Code will be available at this https URL.
zh
[CV-56] 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering IROS2025
【速读】:该论文旨在解决室内场景任务中对多样化和可扩展数据的需求,特别是在问题回答和密集描述等任务中。其解决方案的关键在于提出3D-MoRe框架,该框架通过整合多模态嵌入、跨模态交互和语言模型解码器,利用基础模型的优势生成大规模的3D-语言数据集。此方法有效提升了复杂3D环境中的推理与响应生成能力。
链接: https://arxiv.org/abs/2507.12026
作者: Rongtao Xu,Han Gao,Mingming Yu,Dong An,Shunpeng Chen,Changwei Wang,Li Guo,Xiaodan Liang,Shibiao Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IROS 2025
点击查看摘要
Abstract:With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed on the this https URL.
zh
[CV-57] MVAR: MultiVariate AutoRegressive Air Pollutants Forecasting Model
【速读】:该论文旨在解决多变量空气污染物预测中存在的问题,即现有研究多聚焦于单一污染物预测,忽视了不同污染物之间的相互作用及其多样化的空间响应。其解决方案的关键在于提出MultiVariate AutoRegressive air pollutants forecasting model (MVAR),该模型通过减少对长时间窗口输入的依赖并提高数据利用效率,结合Multivariate Autoregressive Training Paradigm实现120小时的长期序列预测,并引入Meteorological Coupled Spatial Transformer块以实现基于AI的气象预测与污染物相互作用及空间响应的灵活耦合。
链接: https://arxiv.org/abs/2507.12023
作者: Xu Fan,Zhihao Wang,Yuetan Lin,Yan Zhang,Yang Xiang,Hao Li
机构: Shanghai Academy of Artificial Intelligence for Science(上海科学智能研究院); Tongji University(同济大学); Fudan University(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Air pollutants pose a significant threat to the environment and human health, thus forecasting accurate pollutant concentrations is essential for pollution warnings and policy-making. Existing studies predominantly focus on single-pollutant forecasting, neglecting the interactions among different pollutants and their diverse spatial responses. To address the practical needs of forecasting multivariate air pollutants, we propose MultiVariate AutoRegressive air pollutants forecasting model (MVAR), which reduces the dependency on long-time-window inputs and boosts the data utilization efficiency. We also design the Multivariate Autoregressive Training Paradigm, enabling MVAR to achieve 120-hour long-term sequential forecasting. Additionally, MVAR develops Meteorological Coupled Spatial Transformer block, enabling the flexible coupling of AI-based meteorological forecasts while learning the interactions among pollutants and their diverse spatial responses. As for the lack of standardized datasets in air pollutants forecasting, we construct a comprehensive dataset covering 6 major pollutants across 75 cities in North China from 2018 to 2023, including ERA5 reanalysis data and FuXi-2.0 forecast data. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods and validate the effectiveness of the proposed architecture.
zh
[CV-58] Dataset Ownership Verification for Pre-trained Masked Models ICCV2025
【速读】:该论文试图解决在掩码模型(masked models)背景下,如何验证数据集所有权的问题,因为现有验证技术主要针对监督模型和对比预训练模型,无法直接应用于当前广泛使用的掩码模型。解决方案的关键在于基于经验观察,即当模型在目标数据集上进行预训练时,其在嵌入空间中重建被掩码信息的难度与未在该数据集上预训练的模型存在显著差异。通过这一差异,DOV4MM方法能够判断可疑黑盒模型是否曾在特定无标签数据集上进行过预训练,从而帮助数据集所有者保护其权益。
链接: https://arxiv.org/abs/2507.12022
作者: Yuechen Xie,Jie Song,Yicheng Shan,Xiaoyan Zhang,Yuanyu Wan,Shengxuming Zhang,Jiarui Duan,Mingli Song
机构: Zhejiang University (浙江大学); The University of Sydney (悉尼大学); State Key Laboratory of Blockchain and Security, Zhejiang University (区块链与安全国家重点实验室,浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江)区块链与数据安全研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
点击查看摘要
Abstract:High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a p -value considerably below 0.05, surpassing all prior approaches. Code is available at this https URL.
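以下代码只演示摘要中的核心经验观察——在嵌入空间中比较被掩码信息的重建难度——并非 DOV4MM 的真实流程:`encoder` 的输出形状、掩码比例以及"用可见 patch 均值近似重建"的做法均为说明而作的假设。

```python
import torch

@torch.no_grad()
def masked_recon_error(encoder, images, mask_ratio=0.6):
    """用可见 patch 的平均特征近似被掩码 patch,度量嵌入空间中的重建难度(简化假设)。"""
    feats = encoder(images)                      # (B, N, D) patch 级特征,接口为假设
    B, N, D = feats.shape
    n_mask = int(N * mask_ratio)
    idx = torch.randperm(N)
    masked, visible = idx[:n_mask], idx[n_mask:]
    recon = feats[:, visible].mean(dim=1, keepdim=True)   # 以可见区域均值作为粗略重建
    err = (feats[:, masked] - recon).pow(2).mean()
    return err.item()

# 用法示意:若 err_suspect 明显低于 err_reference,则支持"可疑模型曾在该数据集上预训练"的判断
# err_suspect   = masked_recon_error(suspect_model,   target_images)
# err_reference = masked_recon_error(reference_model, target_images)
```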
zh
[CV-59] SS-DC: Spatial-Spectral Decoupling and Coupling Across Visible-Infrared Gap for Domain Adaptive Object Detection
【速读】:该论文旨在解决可见光域(RGB)到红外域(IR)的无监督域适应目标检测(UDAOD)问题,现有方法将RGB域视为统一域,忽略了其中的多个子域(如白天、夜晚和雾天场景)。解决方案的关键在于通过解耦-耦合策略,实现跨多个子域的域不变(DI)和域特定(DS)特征的分离与融合。具体而言,提出了基于频谱分解的谱自适应幂等解耦(SAID)模块,以及一种基于滤波器组的频谱处理范式和自蒸馏驱动的解耦损失,以更准确地解耦DI和DS成分,并通过空间-谱耦合方法实现联合特征融合,从而减少域偏差并提升检测性能。
链接: https://arxiv.org/abs/2507.12017
作者: Xiwei Zhang,Chunjin Yang,Yiming Xiao,Runtong Zhang,Fanman Meng
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 main-pages, 3 reference-pages, 5 figures, 6 tables
点击查看摘要
Abstract:Unsupervised domain adaptive object detection (UDAOD) from the visible domain to the infrared (RGB-IR) domain is challenging. Existing methods regard the RGB domain as a unified domain and neglect the multiple subdomains within it, such as daytime, nighttime, and foggy scenes. We argue that decoupling the domain-invariant (DI) and domain-specific (DS) features across these multiple subdomains is beneficial for RGB-IR domain adaptation. To this end, this paper proposes a new SS-DC framework based on a decoupling-coupling strategy. In terms of decoupling, we design a Spectral Adaptive Idempotent Decoupling (SAID) module in the aspect of spectral decomposition. Due to the style and content information being highly embedded in different frequency bands, this module can decouple DI and DS components more accurately and interpretably. A novel filter bank-based spectral processing paradigm and a self-distillation-driven decoupling loss are proposed to improve the spectral domain decoupling. In terms of coupling, a new spatial-spectral coupling method is proposed, which realizes joint coupling through spatial and spectral DI feature pyramids. Meanwhile, this paper introduces DS from decoupling to reduce the domain bias. Extensive experiments demonstrate that our method can significantly improve the baseline performance and outperform existing UDAOD methods on multiple RGB-IR datasets, including a new experimental protocol proposed in this paper based on the FLIR-ADAS dataset.
zh
[CV-60] Identifying Signatures of Image Phenotypes to Track Treatment Response in Liver Disease
【速读】:该论文旨在解决如何量化肝脏疾病进展和治疗反应的问题,以支持个体化治疗和新疗法的开发。其解决方案的关键在于利用无监督机器学习从磁共振成像中识别出肝脏组织的模式词汇,通过深度聚类网络将医学图像块编码并聚类到低维潜在空间,从而建立组织词汇,捕捉与治疗反应相关的组织变化及其在肝脏中的位置。
链接: https://arxiv.org/abs/2507.12012
作者: Matthias Perkonigg,Nina Bastati,Ahmed Ba-Ssalamah,Peter Mesenbrink,Alexander Goehler,Miljen Martic,Xiaofei Zhou,Michael Trauner,Georg Langs
机构: Medical University of Vienna(维也纳医科大学); Medical University of Innsbruck(因斯布鲁克医科大学); Novartis Pharmaceuticals Corporation(诺华制药公司); Novartis Pharma AG(诺华制药AG公司); Alnylam Pharmaceuticals(阿尔尼拉姆制药公司); Medical University of Vienna(维也纳医科大学); Comprehensive Center for Artificial Intelligence in Medicine(医学人工智能综合中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Quantifiable image patterns associated with disease progression and treatment response are critical tools for guiding individual treatment, and for developing novel therapies. Here, we show that unsupervised machine learning can identify a pattern vocabulary of liver tissue in magnetic resonance images that quantifies treatment response in diffuse liver disease. Deep clustering networks simultaneously encode and cluster patches of medical images into a low-dimensional latent space to establish a tissue vocabulary. The resulting tissue types capture differential tissue change and its location in the liver associated with treatment response. We demonstrate the utility of the vocabulary on a randomized controlled trial cohort of non-alcoholic steatohepatitis patients. First, we use the vocabulary to compare longitudinal liver change in a placebo and a treatment cohort. Results show that the method identifies specific liver tissue change pathways associated with treatment, and enables a better separation between treatment groups than established non-imaging measures. Moreover, we show that the vocabulary can predict biopsy derived features from non-invasive imaging data. We validate the method on a separate replication cohort to demonstrate the applicability of the proposed method.
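摘要中"深度聚类网络同时编码并聚类图像块"的做法与 DEC 一类方法相近;下面只给出其中软聚类分配这一步的极简示意(Student-t 核、聚类数与维度均为假设,编码器与完整训练流程从略)。

```python
import torch
import torch.nn as nn

class SoftClusterAssign(nn.Module):
    """DEC 风格的软聚类分配:q_ij ∝ (1 + ||z_i - mu_j||^2)^(-1)。"""
    def __init__(self, n_clusters=32, dim=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_clusters, dim))  # 可学习的"组织词汇"中心

    def forward(self, z):                         # z: (N, dim) 图像块的潜空间嵌入
        d2 = torch.cdist(z, self.centers).pow(2)  # (N, K) 到各聚类中心的平方距离
        q = 1.0 / (1.0 + d2)
        return q / q.sum(dim=1, keepdim=True)     # 每个图像块对各组织类型的隶属度

z = torch.randn(100, 64)
print(SoftClusterAssign()(z).sum(dim=1)[:3])      # 每行和为 1
```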
zh
[CV-61] Deep Neural Encoder-Decoder Model to Relate fMRI Brain Activity with Naturalistic Stimuli
【速读】:该论文试图解决如何通过功能性磁共振成像(fMRI)数据对自然刺激下的脑活动进行编码与解码的问题。其解决方案的关键在于提出了一种端到端的深度神经编码器-解码器模型,利用连续电影帧的时间相关性,在架构中引入时间卷积层,从而有效弥合自然电影刺激与fMRI采集之间的时间分辨率差距。该模型能够预测视觉皮层及其周围区域的体素活动,并从神经活动中重建对应的视觉输入,进一步通过显著性图分析揭示了参与视觉解码的主要脑区。
链接: https://arxiv.org/abs/2507.12009
作者: Florian David,Michael Chan,Elenor Morgenroth,Patrik Vuilleumier,Dimitri Van De Ville
机构: Neuro-X Institute (神经科学研究所); Ecole Polytechnique Fédérale de Lausanne (洛桑联邦理工学院); University of Geneva (日内瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted in International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2025
点击查看摘要
Abstract:We propose an end-to-end deep neural encoder-decoder model to encode and decode brain activity in response to naturalistic stimuli using functional magnetic resonance imaging (fMRI) data. Leveraging temporally correlated input from consecutive film frames, we employ temporal convolutional layers in our architecture, which effectively allows to bridge the temporal resolution gap between natural movie stimuli and fMRI acquisitions. Our model predicts activity of voxels in and around the visual cortex and performs reconstruction of corresponding visual inputs from neural activity. Finally, we investigate brain regions contributing to visual decoding through saliency maps. We find that the most contributing regions are the middle occipital area, the fusiform area, and the calcarine, respectively employed in shape perception, complex recognition (in particular face perception), and basic visual features such as edges and contrasts. These functions being strongly solicited are in line with the decoder’s capability to reconstruct edges, faces, and contrasts. All in all, this suggests the possibility to probe our understanding of visual processing in films using as a proxy the behaviour of deep learning models such as the one proposed in this paper.
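摘要提到用时间卷积层弥合电影帧率与 fMRI 采样(TR)之间的时间分辨率差距;下面用一小段 PyTorch 代码示意这种"带步长的时间维下采样"(通道数、卷积核与步长均为假设,并非原模型结构)。

```python
import torch
import torch.nn as nn

class TemporalDownsampler(nn.Module):
    """把帧级特征序列 (B, D, T_frames) 经带步长的 1D 时间卷积压缩到接近 fMRI 的时间分辨率。"""
    def __init__(self, dim=512, stride=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=stride * 2, stride=stride, padding=stride // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=1, padding=2),
        )

    def forward(self, frame_feats):               # (B, D, T_frames)
        return self.net(frame_feats)              # (B, D, ~T_frames/stride) 对齐 fMRI 采样点

# 例:对 256 帧的特征序列做 8 倍时间下采样;实际步长取决于帧率与 TR 的比值
x = torch.randn(2, 512, 256)
print(TemporalDownsampler(stride=8)(x).shape)
```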
zh
[CV-62] Dual form Complementary Masking for Domain-Adaptive Image Segmentation ICML2025
【速读】:该论文试图解决无监督域适应(Unsupervised Domain Adaptation, UDA)中特征提取与表征学习不足的问题,特别是针对掩码重建(masked reconstruction)的潜在能力未被充分挖掘的问题。其解决方案的关键在于将掩码重建重新框架化为稀疏信号重建问题,并理论证明互补掩码的对偶形式在提取域无关图像特征方面具有优越性。基于这一关键洞察,作者提出了MaskTwins框架,通过在主训练流程中直接集成掩码重建,强制不同互补掩码下图像预测的一致性,从而揭示跨域的内在结构模式,实现端到端的域泛化。
链接: https://arxiv.org/abs/2507.12008
作者: Jiawen Wang,Yinda Chen,Xiaoyu Liu,Che Liu,Dong Liu,Jianqing Gao,Zhiwei Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025
点击查看摘要
Abstract:Recent works have correlated Masked Image Modeling (MIM) with consistency regularization in Unsupervised Domain Adaptation (UDA). However, they merely treat masking as a special form of deformation on the input images and neglect the theoretical analysis, which leads to a superficial understanding of masked reconstruction and insufficient exploitation of its potential in enhancing feature extraction and representation learning. In this paper, we reframe masked reconstruction as a sparse signal reconstruction problem and theoretically prove that the dual form of complementary masks possesses superior capabilities in extracting domain-agnostic image features. Based on this compelling insight, we propose MaskTwins, a simple yet effective UDA framework that integrates masked reconstruction directly into the main training pipeline. MaskTwins uncovers intrinsic structural patterns that persist across disparate domains by enforcing consistency between predictions of images masked in complementary ways, enabling domain generalization in an end-to-end manner. Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation. These results demonstrate the significant advantages of MaskTwins in extracting domain-invariant features without the need for separate pre-training, offering a new paradigm for domain-adaptive segmentation.
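根据摘要,MaskTwins 的核心是对同一张图施加一对互补掩码并强制两次预测一致;下面给出该一致性损失的极简示意,掩码粒度与损失形式(对称 KL)均为假设,仅用于说明思路。

```python
import torch
import torch.nn.functional as F

def complementary_mask_consistency(model, images, patch=16):
    """对输入施加一对互补的 patch 级掩码,并约束两次预测的一致性(简化示意)。"""
    B, C, H, W = images.shape
    grid = torch.rand(B, 1, H // patch, W // patch, device=images.device) > 0.5
    m = F.interpolate(grid.float(), size=(H, W), mode="nearest")   # 掩码 M
    pred_a = model(images * m)                                     # 仅看到 M 区域
    pred_b = model(images * (1.0 - m))                             # 仅看到互补区域
    # 以对称 KL 作为一致性损失(具体损失形式为假设)
    pa, pb = pred_a.log_softmax(1), pred_b.log_softmax(1)
    return 0.5 * (F.kl_div(pa, pb.exp(), reduction="batchmean")
                  + F.kl_div(pb, pa.exp(), reduction="batchmean"))
```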
zh
[CV-63] Frequency-Dynamic Attention Modulation for Dense Prediction ICCV2025
【速读】:该论文试图解决Vision Transformers (ViTs)中由注意力机制引发的频率消失问题:注意力使每一层都表现为低通滤波器,层层堆叠后高频成分不断衰减,导致关键细节和纹理信息丢失。解决方案的关键在于提出一种受电路理论启发的策略——Frequency-Dynamic Attention Modulation (FDAM),其核心包括两个技术:Attention Inversion (AttInv) 和 Frequency Dynamic Scaling (FreqScale),通过直接调节ViTs的整体频率响应,实现对不同频率成分的动态加权与互补,从而提升模型在多种视觉任务中的性能。
链接: https://arxiv.org/abs/2507.12006
作者: Linwei Chen,Lin Gu,Ying Fu
机构: Beijing Institute of Technology (北京理工大学); RIKEN AIP (理化学研究所人工智能项目); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICCV 2025
点击查看摘要
Abstract:Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at this https URL.
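摘要中 AttInv 的做法是把注意力矩阵视为低通滤波器,并用其"反转"得到互补的高通分量再动态组合;以下是该思想的一个概念性示意(以 I − A 作为高通分量、用单个可学习系数组合,均为简化假设)。

```python
import torch
import torch.nn as nn

class AttInv(nn.Module):
    """低通注意力 A 与其互补高通 (I - A) 的可学习加权组合(概念示意)。"""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # 高通分量权重,实际中可逐层/逐通道学习

    def forward(self, attn, v):
        # attn: (B, heads, N, N) 行和为 1 的注意力矩阵;v: (B, heads, N, D)
        eye = torch.eye(attn.size(-1), device=attn.device)
        low = attn @ v                 # 原始注意力,相当于低通滤波
        high = (eye - attn) @ v        # 互补的高通响应
        return low + self.alpha * high
```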
zh
[CV-64] AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation ICCV2025
【速读】:该论文试图解决在缺乏适当数据集的情况下,实现跨身份的细粒度风格化3D面部表情操控的挑战。解决方案的关键在于引入AUBlendSet,这是一个基于AU-Blendshape表示的3D面部数据集,包含500个身份的32个标准面部动作单元(AUs)的blendshape数据,以及附加的详细标注面部姿势数据。同时,提出AUBlendNet网络,用于学习不同角色风格的AU-Blendshape基础向量,并通过并行预测给定身份网格对应的风格AU-Blendshape基础向量,从而实现风格化的3D情感面部操控。
链接: https://arxiv.org/abs/2507.12001
作者: Hao Li,Ju Dai,Feng Zhou,Kaida Ning,Lei Li,Junjun Pan
机构: Beihang University (北京航空航天大学); Peng Cheng Laboratory (鹏城实验室); North China University of Technology (北方工业大学); University of Washington (华盛顿大学); University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
点击查看摘要
Abstract:While 3D facial animation has made impressive progress, challenges still exist in realizing fine-grained stylized 3D facial expression manipulation due to the lack of appropriate datasets. In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. AUBlendSet is a blendshape data collection based on 32 standard facial action units (AUs) across 500 identities, along with an additional set of facial postures annotated with detailed AUs. Based on AUBlendSet, we propose AUBlendNet to learn AU-Blendshape basis vectors for different character styles. AUBlendNet predicts, in parallel, the AU-Blendshape basis vectors of the corresponding style for a given identity mesh, thereby achieving stylized 3D emotional facial manipulation. We comprehensively validate the effectiveness of AUBlendSet and AUBlendNet through tasks such as stylized facial expression manipulation, speech-driven emotional facial animation, and emotion recognition data augmentation. Through a series of qualitative and quantitative experiments, we demonstrate the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. To the best of our knowledge, AUBlendSet is the first dataset, and AUBlendNet is the first network for continuous 3D facial expression manipulation for any identity through facial AUs. Our source code is available at this https URL.
zh
[CV-65] SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation
【速读】:该论文旨在解决公共遥感数据集在通用性上的局限性问题,尤其是由于分辨率差异和土地覆盖类别定义不一致导致的挑战。其解决方案的关键在于提出SAMST,一种半监督语义分割方法,该方法利用Segment Anything Model(SAM)在零样本泛化和边界检测方面的优势,通过监督模型自训练与基于SAM的伪标签优化器迭代提升伪标签的准确性,从而提高整体模型性能。
链接: https://arxiv.org/abs/2507.11994
作者: Jun Yin,Fei Wu,Yupeng Ren,Jisheng Huang,Qiankun Li,Heng jin,Jianhai Fu,Chanjie Cui
机构: Zhejiang University (浙江大学); Zhejiang Dahua Technology Co., Ltd (浙江大华技术股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IGARSS2025 accepted, Correspondence: fujianhai2024@gmail.com (J.F.), cuichj@mail2. this http URL (C.C.)
点击查看摘要
Abstract:Public remote sensing datasets often face limitations in universality due to resolution variability and inconsistent land cover category definitions. To harness the vast pool of unlabeled remote sensing data, we propose SAMST, a semi-supervised semantic segmentation method. SAMST leverages the strengths of the Segment Anything Model (SAM) in zero-shot generalization and boundary detection. SAMST iteratively refines pseudo-labels through two main components: supervised model self-training using both labeled and pseudo-labeled data, and a SAM-based Pseudo-label Refiner. The Pseudo-label Refiner comprises three modules: a Threshold Filter Module for preprocessing, a Prompt Generation Module for extracting connected regions and generating prompts for SAM, and a Label Refinement Module for final label stitching. By integrating the generalization power of large models with the training efficiency of small models, SAMST improves pseudo-label accuracy, thereby enhancing overall model performance. Experiments on the Potsdam dataset validate the effectiveness and feasibility of SAMST, demonstrating its potential to address the challenges posed by limited labeled data in remote sensing semantic segmentation.
zh
[CV-66] ID-EA: Identity-driven Text Enhancement and Adaptation with Textual Inversion for Personalized Text-to-Image Generation
【速读】:该论文试图解决文本到图像扩散模型在个性化肖像生成中因文本与视觉嵌入空间在身份语义上存在偏差而导致的面部身份一致性难以保持的问题。解决方案的关键在于提出ID-EA框架,其核心是通过ID-driven Enhancer(ID-Enhancer)和ID-conditioned Adapter(ID-Adapter)两个组件,引导文本嵌入与视觉身份嵌入对齐,从而提升个性化生成中的身份保留能力。
链接: https://arxiv.org/abs/2507.11990
作者: Hyun-Jun Jin,Young-Eun Kim,Seong-Whan Lee
机构: Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, personalized portrait generation with a text-to-image diffusion model has significantly advanced with Textual Inversion, emerging as a promising approach for creating high-fidelity personalized images. Despite its potential, current Textual Inversion methods struggle to maintain consistent facial identity due to semantic misalignments between textual and visual embedding spaces regarding identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in a personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most related visual clues across the foreground snippets. Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods in identity preservation metrics while achieving remarkable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches.
zh
[CV-67] Style Composition within Distinct LoRA modules for Traditional Art
【速读】:该论文试图解决扩散模型在文本到图像生成中难以实现可控、区域性的绘画技术应用问题,尤其是在多风格融合时容易出现一种风格主导的现象。解决方案的关键在于提出一种零样本扩散流水线,通过在流匹配去噪过程中对去噪后的潜在表示进行风格组合,从而自然地融合多种风格。该方法利用低噪声潜在表示携带更强的风格信息,并通过空间掩码在异构扩散流水线间融合,实现精确的区域风格控制,同时保持各单独风格的保真度并支持用户引导的混合。
链接: https://arxiv.org/abs/2507.11986
作者: Jaehyun Lee,Wonhark Park,Wonsik Shin,Hyunho Lee,Hyoung Min Na,Nojun Kwak
机构: Seoul National University (首尔国立大学); Kyunghee University (庆熙大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion-based text-to-image models have achieved remarkable results in synthesizing diverse images from text prompts and can capture specific artistic styles via style personalization. However, their entangled latent space and lack of smooth interpolation make it difficult to apply distinct painting techniques in a controlled, regional manner, often causing one style to dominate. To overcome this, we propose a zero-shot diffusion pipeline that naturally blends multiple styles by performing style composition on the denoised latents predicted during the flow-matching denoising process of separately trained, style-specialized models. We leverage the fact that lower-noise latents carry stronger stylistic information and fuse them across heterogeneous diffusion pipelines using spatial masks, enabling precise, region-specific style control. This mechanism preserves the fidelity of each individual style while allowing user-guided mixing. Furthermore, to ensure structural coherence across different models, we incorporate depth-map conditioning via ControlNet into the diffusion framework. Qualitative and quantitative experiments demonstrate that our method successfully achieves region-specific style mixing according to the given masks.
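摘要的核心操作是在去噪过程中用空间掩码对不同风格模型预测的潜变量做区域级融合;下面给出一个与具体扩散框架无关的单步融合示意,`denoise_step` 等接口名均为假设。

```python
import torch

def compose_latents(latent_a, latent_b, region_mask):
    """latent_*: (B, C, h, w) 两个风格模型在同一去噪步得到的潜变量;
    region_mask: (B, 1, h, w),值为 1 的位置使用风格 A。"""
    return region_mask * latent_a + (1.0 - region_mask) * latent_b

# 用法示意(伪代码):在每个流匹配去噪步中对两条风格流水线的输出做区域融合
# for t in timesteps:
#     z_a = pipe_style_a.denoise_step(z, t)   # 接口名为假设
#     z_b = pipe_style_b.denoise_step(z, t)
#     z   = compose_latents(z_a, z_b, mask)
```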
zh
[CV-68] Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints ICCV2025
【速读】:该论文试图解决在缺乏细粒度标签的情况下,如何实现鲁棒的无监督部件发现问题。现有方法在不同类别和场景下难以保持稳定性,限制了其应用范围。解决方案的关键是提出一种名为Masked Part Autoencoder (MPAE) 的新范式,通过学习部件描述符并利用相似性对遮罩区域进行修复,使恢复的部件更贴合实际物体形状,从而在复杂场景中 robust 地发现有意义的部件。
链接: https://arxiv.org/abs/2507.11985
作者: Jiahao Xia,Yike Wu,Wenjian Huang,Jianguo Zhang,Jian Zhang
机构: University of Technology Sydney (悉尼科技大学); Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
点击查看摘要
Abstract:Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most of them cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at the project this https URL.
zh
[CV-69] EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models
【速读】:该论文旨在解决扩散模型在边缘-云协同框架中的两难问题:云端去噪步骤过多会延长推理时间,步骤不足则导致语义模糊,进而引发边缘模型输出不一致。解决方案的关键在于通过基于梯度的噪声估计加速云端推理,并确定云边协作的最优切换点以保持生成质量。具体而言,设计了一种K步噪声近似策略,利用步骤间的噪声梯度减少云端推理频率,并定期进行云端推理以校正误差;同时设计了两阶段贪心搜索算法,以高效找到噪声近似和边缘模型切换的最优参数。
链接: https://arxiv.org/abs/2507.11980
作者: Jiajian Xie,Shengyu Zhang,Zhou Zhao,Fan Wu,Fei Wu
机构: Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 8 figures
点击查看摘要
Abstract:Diffusion Models have shown remarkable proficiency in image and video synthesis. As model size and latency increase limit user experience, hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation, where the cloud model initiates high-quality semantic planning and the edge model expedites later-stage refinement. However, excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge model output. To address these challenges, we propose EC-Diff that accelerates cloud inference through gradient-based noise estimation while identifying the optimal point for cloud-edge handoff to maintain generation quality. Specifically, we design a K-step noise approximation strategy to reduce cloud inference frequency by using noise gradients between steps and applying cloud inference periodically to adjust errors. Then we design a two-stage greedy search algorithm to efficiently find the optimal parameters for noise approximation and edge model switching. Extensive experiments demonstrate that our method significantly enhances generation quality compared to edge inference, while achieving up to an average 2\times speedup in inference compared to cloud inference. Video samples and source code are available at this https URL.
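摘要中的 K 步噪声近似可以粗略理解为:每 K 步调用一次云端噪声预测,其余步骤用相邻两次预测之间的梯度做线性外推;以下示意代码中的更新步长与调度均为假设,并非论文的实际算法。

```python
import torch

def approx_denoise(cloud_eps, z, timesteps, K=4):
    """每 K 步调用一次云端噪声预测 cloud_eps(z, t),其余步骤用相邻两次预测的差值线性外推。"""
    eps_prev, eps_curr = None, None
    for i, t in enumerate(timesteps):
        if i % K == 0 or eps_curr is None:
            eps = cloud_eps(z, t)                    # 周期性云端推理,校正累计误差
            eps_prev, eps_curr = eps_curr, eps
        else:
            grad = eps_curr - eps_prev if eps_prev is not None else 0.0
            eps = eps_curr + grad                    # 基于步间噪声梯度的线性外推(近似)
        z = z - 0.1 * eps                            # 示意性的更新步长,并非真实噪声调度器
    return z
```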
zh
[CV-70] HPR3D: Hierarchical Proxy Representation for High-Fidelity 3D Reconstruction and Controllable Editing
【速读】:该论文试图解决现有3D表示方法(如网格、体素、点云和基于NeRF的神经隐式场)在任务适用性、编辑性、动画性和数据复杂度与保真度权衡方面的局限性。其解决方案的关键在于引入一种新型的3D分层代理节点(Hierarchical Proxy Node)表示,通过在物体表面和内部分布的一组稀疏且分层组织(树状结构)的代理节点来表征形状和纹理,每个节点在其邻域内存储局部形状和纹理信息(由小型多层感知机隐式编码),并通过高效的神经插值和轻量解码实现对任意3D坐标属性的查询,从而实现紧凑表示、语义对齐和可扩展的质量-复杂度控制。
链接: https://arxiv.org/abs/2507.11971
作者: Tielong Wang,Yuxuan Xiong,Jinfan Liu,Zhifan Zhang,Ye Chen,Yue Shi,Bingbing Ni
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current 3D representations like meshes, voxels, point clouds, and NeRF-based neural implicit fields exhibit significant limitations: they are often task-specific, lacking universal applicability across reconstruction, generation, editing, and driving. While meshes offer high precision, their dense vertex data complicates editing; NeRFs deliver excellent rendering but suffer from structural ambiguity, hindering animation and manipulation; all representations inherently struggle with the trade-off between data complexity and fidelity. To overcome these issues, we introduce a novel 3D Hierarchical Proxy Node representation. Its core innovation lies in representing an object’s shape and texture via a sparse set of hierarchically organized (tree-structured) proxy nodes distributed on its surface and interior. Each node stores local shape and texture information (implicitly encoded by a small MLP) within its neighborhood. Querying any 3D coordinate’s properties involves efficient neural interpolation and lightweight decoding from relevant nearby and parent nodes. This framework yields a highly compact representation where nodes align with local semantics, enabling direct drag-and-edit manipulation, and offers scalable quality-complexity control. Extensive experiments across 3D reconstruction and editing demonstrate our method’s expressive efficiency, high-fidelity rendering quality, and superior editability.
zh
[CV-71] GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models
【速读】:该论文旨在解决测试阶段适应(TTA)中视觉-语言模型(VLM)在性能与效率之间难以取得平衡的问题。现有方法要么因调整文本提示带来过高的计算开销,要么因依赖手工设计的无训练视觉特征增强导致效果不稳定。论文提出的解决方案是Global-Spatial Bias Learner (GS-Bias),其关键在于在TTA过程中引入两种可学习的偏置——全局偏置和空间偏置。全局偏置通过学习不同增强视图间的一致性来捕捉测试图像的全局语义特征,而空间偏置则通过学习图像空间表示中区域间的语义一致性来增强表征。这两种偏置直接添加到预训练VLM的logits输出上,避免了对VLM进行完整的反向传播,从而显著提升了效率并实现了在15个基准数据集上的最先进性能。
链接: https://arxiv.org/abs/2507.11969
作者: Zhaohong Huang,Yuxin Zhang,Jingjing Xie,Fei Chao,Rongrong Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA, unfolded as the global bias and spatial bias. Particularly, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while spatial bias learns the semantic coherence between regions in the image’s spatial visual representation. It is worth highlighting that these two sets of biases are directly added to the logits outputed by the pretrained VLMs, which circumvent the full backpropagation through VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT’s memory usage on ImageNet.
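根据摘要,GS-Bias 只学习直接加在 CLIP 输出 logits 上的偏置,从而避免对 VLM 做完整反向传播;下面给出只含全局偏置、以增广视图平均预测的熵最小化为目标的最小示意(空间偏置与真实目标函数从略,均为假设性简化)。

```python
import torch

def tta_global_bias(clip_logits_views, steps=10, lr=1e-2):
    """clip_logits_views: (V, C) 同一张测试图的 V 个增广视图经 CLIP 得到的类别 logits。
    仅学习一个加到 logits 上的全局偏置,使平均预测分布的熵最小(示意)。"""
    V, C = clip_logits_views.shape
    bias = torch.zeros(C, requires_grad=True)
    opt = torch.optim.SGD([bias], lr=lr)
    for _ in range(steps):
        probs = (clip_logits_views + bias).softmax(dim=-1).mean(dim=0)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
        opt.zero_grad(); entropy.backward(); opt.step()
    return (clip_logits_views.mean(0) + bias).detach()   # 偏置校正后的最终预测 logits
```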
zh
[CV-72] Watch Listen Understand Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation ICCV2025
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在短视频内容审核中的鲁棒性不足问题,特别是现有安全评估方法未能有效应对多模态联合攻击的漏洞。论文提出的关键解决方案是构建一个全面的三模态安全性评估框架,包括引入SVMA数据集和提出ChimeraBreak三模态攻击策略,该策略同时针对视觉、听觉和语义推理路径进行攻击,从而揭示MLLMs在面对复杂攻击时的显著脆弱性。
链接: https://arxiv.org/abs/2507.11968
作者: Sahid Hossain Mustakim,S M Jishanul Islam,Ummay Maria Muna,Montasir Chowdhury,Mohammed Jawwadul Islam,Sadia Ahmmed,Tashfia Sikder,Syed Tasdid Azam Dhrubo,Swakkhar Shatabda
机构: United International University (联合国际大学); BRAC University (BRAC大学); University of British Columbia (不列颠哥伦比亚大学); Bangladesh University of Professionals (孟加拉国职业大学); University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as long paper, SVU Workshop at ICCV 2025
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) are increasingly used for content moderation, yet their robustness in short-form video contexts remains underexplored. Current safety evaluations often rely on unimodal attacks, failing to address combined attack vulnerabilities. In this paper, we introduce a comprehensive framework for evaluating the tri-modal safety of MLLMs. First, we present the Short-Video Multimodal Adversarial (SVMA) dataset, comprising diverse short-form videos with human-guided synthetic adversarial attacks. Second, we propose ChimeraBreak, a novel tri-modal attack strategy that simultaneously challenges visual, auditory, and semantic reasoning pathways. Extensive experiments on state-of-the-art MLLMs reveal significant vulnerabilities with high Attack Success Rates (ASR). Our findings uncover distinct failure modes, showing model biases toward misclassifying benign or policy-violating content. We assess results using LLM-as-a-judge, demonstrating attack reasoning efficacy. Our dataset and findings provide crucial insights for developing more robust and safe MLLMs.
zh
[CV-73] Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos INTERSPEECH2025
【速读】:该论文旨在解决多模态表示学习中音频-视觉-文本三元组对齐不足的问题,以提升跨模态的表征能力。其解决方案的关键在于提出一种名为Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) 的模型,该模型通过引入预训练文本编码器,结合对比学习与掩码自编码器机制,实现音频、视觉和文本模态之间的联合学习。此外,该方法还设计了一种自动生成高质量音频-视觉-文本三元组的策略,无需人工标注即可保证模态间的强对齐性。
链接: https://arxiv.org/abs/2507.11967
作者: Yuchi Ishikawa,Shota Nakada,Hokuto Munakata,Kazuhiro Saito,Tatsuya Komatsu,Yoshimitsu Aoki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: Interspeech 2025
点击查看摘要
Abstract:In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders, enabling the model to learn across audio, visual and text modalities. To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos. We first generate frame-level captions using an image captioning model and then apply CLAP-based filtering to ensure strong alignment between audio and captions. This approach yields high-quality audio-visual-text triplets without requiring manual annotations. We evaluate LG-CAV-MAE on audio-visual retrieval tasks, as well as an audio-visual classification task. Our method significantly outperforms existing approaches, achieving up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task.
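摘要中的三元组构建流程是"帧级字幕生成 + CLAP 音频-字幕相似度过滤";下面仅给出这一流程的骨架示意,`caption_frame`、`clap_similarity` 以及 `video.frames/.audio` 都是假设的占位接口,并非某个真实库的 API。

```python
def build_triplets(videos, caption_frame, clap_similarity, threshold=0.3):
    """从无标注视频构建(音频, 视觉帧, 文本)三元组的简化流程。
    caption_frame(frame) -> str,clap_similarity(audio, text) -> float,均为假设接口。"""
    triplets = []
    for video in videos:                       # video 需提供 .frames 与 .audio(假设)
        for frame in video.frames:
            caption = caption_frame(frame)     # 帧级字幕
            score = clap_similarity(video.audio, caption)
            if score >= threshold:             # 仅保留音频与字幕强对齐的样本
                triplets.append((video.audio, frame, caption))
    return triplets
```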
zh
[CV-74] Prototypical Progressive Alignment and Reweighting for Generalizable Semantic Segmentation
【速读】:该论文旨在解决通用语义分割(generalizable semantic segmentation)在未见过的目标领域中性能不佳的问题,这一问题在实际应用中具有重要意义。现有方法在类级原型(class-wise prototypes)的构建与利用上存在三个主要局限:粗粒度的原型对齐策略、基于源域批次特征平均的朴素原型易过拟合以及忽略源样本差异性的统一处理方式。解决方案的关键在于提出一种新的框架——原型渐进对齐与重加权(PPAR),该框架利用CLIP模型的强大泛化能力,定义了原始文本原型(OTP)和视觉文本原型(VTP)作为对齐基础,并引入渐进对齐策略以逐步减小领域差距,同时通过原型重加权机制评估源数据可靠性并调整其贡献,从而缓解无关或有害特征的影响。
链接: https://arxiv.org/abs/2507.11955
作者: Yuhang Zhang,Zhengyu Zhang,Muxin Liao,Shishun Tian,Wenbin Zou,Lu Zhang,Chen Xu
机构: Guangzhou University(广州大学); Shenzhen University(深圳大学); Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164(雷恩大学,雷恩国立高等工程师学院,法国国家科学研究中心,伊特尔-UMR 6164实验室); Shenzhen University(深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was accepted by IEEE Transactions on Intelligent Transportation Systems
点击查看摘要
Abstract:Generalizable semantic segmentation aims to perform well on unseen target domains, a critical challenge due to real-world applications requiring high generalizability. Class-wise prototypes, representing class centroids, serve as domain-invariant cues that benefit generalization due to their stability and semantic consistency. However, this approach faces three challenges. First, existing methods often adopt coarse prototypical alignment strategies, which may hinder performance. Second, naive prototypes computed by averaging source batch features are prone to overfitting and may be negatively affected by unrelated source data. Third, most methods treat all source samples equally, ignoring the fact that different features have varying adaptation difficulties. To address these limitations, we propose a novel framework for generalizable semantic segmentation: Prototypical Progressive Alignment and Reweighting (PPAR), leveraging the strong generalization ability of the CLIP model. Specifically, we define two prototypes: the Original Text Prototype (OTP) and Visual Text Prototype (VTP), generated via CLIP to serve as a solid base for alignment. We then introduce a progressive alignment strategy that aligns features in an easy-to-difficult manner, reducing domain gaps gradually. Furthermore, we propose a prototypical reweighting mechanism that estimates the reliability of source data and adjusts its contribution, mitigating the effect of irrelevant or harmful features (i.e., reducing negative transfer). We also provide a theoretical analysis showing the alignment between our method and domain generalization theory. Extensive experiments across multiple benchmarks demonstrate that PPAR achieves state-of-the-art performance, validating its effectiveness.
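摘要中 PPAR 以 CLIP 文本编码得到类级原型,并按源样本与原型的贴合程度重加权其损失;下面是这一思想的极简示意(提示模板、权重函数与温度系数均为假设,`encode_text` 为占位接口)。

```python
import torch
import torch.nn.functional as F

def prototype_reweighted_loss(img_feats, labels, class_names, encode_text, tau=0.07):
    """img_feats: (B, D) 已归一化的样本/像素特征;labels: (B,) 类别索引(LongTensor)。
    以 CLIP 文本原型为对齐目标,并用与原型的相似度作为样本可靠性权重(示意)。"""
    prompts = [f"a photo of a {c}" for c in class_names]        # 提示模板为假设
    text_protos = F.normalize(encode_text(prompts), dim=-1)     # (C, D) 文本原型
    logits = img_feats @ text_protos.t() / tau                  # 特征-原型相似度
    with torch.no_grad():
        weights = logits.softmax(-1).gather(1, labels[:, None]).squeeze(1)  # 越贴近原型权重越大
    loss = F.cross_entropy(logits, labels, reduction="none")
    return (weights * loss).mean()
```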
zh
[CV-75] MOSPA: Human Motion Generation Driven by Spatial Audio
【速读】:该论文试图解决虚拟人类对多样化听觉刺激进行动态且真实响应的问题,这一问题在角色动画中是一个关键挑战,需要整合感知建模与运动合成。解决方案的关键在于引入首个全面的Spatial Audio-Driven Human Motion (SAM)数据集,以及提出一种基于扩散模型的生成框架MOSPA,该框架通过有效的融合机制准确捕捉身体运动与空间音频之间的关系,从而实现对不同空间音频输入生成多样且逼真的运动。
链接: https://arxiv.org/abs/2507.11949
作者: Shuyang Xu,Zhiyang Dou,Mingyi Shi,Liang Pan,Leo Ho,Jingbo Wang,Yuan Liu,Cheng Lin,Yuexin Ma,Wenping Wang,Taku Komura
机构: The University of Hong Kong (香港大学); Shanghai AI Lab (上海人工智能实验室); The Hong Kong University of Science and Technology (香港科技大学); Macau University of Science and Technology (澳门科技大学); ShanghaiTech University (上海科技大学); Texas A&M University (德克萨斯A&M大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA could generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
zh
[CV-76] RaDL: Relation-aware Disentangled Learning for Multi-Instance Text-to-Image Generation
【速读】:该论文旨在解决在单个图像提示中有效生成多个实例的问题,现有方法在生成单个实例的位置方面表现良好,但难以处理实例间的关系差异和多属性泄露问题。其解决方案的关键在于提出一种关系感知的解耦学习框架(RaDL),该框架通过可学习参数增强实例特定属性,并利用从全局提示中提取的动作动词生成关系感知的图像特征,从而更好地捕捉实例之间的关系及多属性信息。
链接: https://arxiv.org/abs/2507.11947
作者: Geon Park,Seon Bin Kim,Gunho Jung,Seong-Whan Lee
机构: Korea University (高丽大学); Institute of Information & Communications Technology Planning & Evaluation (信息与通信技术规划评估研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 Pages
点击查看摘要
Abstract:With recent advancements in text-to-image (T2I) models, effectively generating multiple instances within a single image prompt has become a crucial challenge. Existing methods, while successful in generating positions of individual instances, often struggle to account for relationship discrepancy and multiple attributes leakage. To address these limitations, this paper proposes the relation-aware disentangled learning (RaDL) framework. RaDL enhances instance-specific attributes through learnable parameters and generates relation-aware image features via Relation Attention, utilizing action verbs extracted from the global prompt. Through extensive evaluations on benchmarks such as COCO-Position, COCO-MIG, and DrawBench, we demonstrate that RaDL outperforms existing methods, showing significant improvements in positional accuracy, multiple attributes consideration, and the relationships between instances. Our results present RaDL as the solution for generating images that consider both the relationships and multiple attributes of each instance within the multi-instance image.
zh
[CV-77] Effective Fine-Tuning of Vision Transformers with Low-Rank Adaptation for Privacy-Preserving Image Classification
【速读】:该论文试图解决在训练隐私保护视觉Transformer(ViT)模型时,如何在保持模型性能的同时减少可训练参数数量的问题。其解决方案的关键在于提出一种低秩适应方法,通过在ViT架构的每一层中注入可训练的低秩分解矩阵,并且不冻结补丁嵌入层,从而在有效冻结预训练ViT模型权重的同时,显著降低可训练参数的数量,同时保持与全量微调相近的准确性。
链接: https://arxiv.org/abs/2507.11943
作者: Haiwei Lin,Shoko Imaizumi,Hitoshi Kiya
机构: Chiba University (千叶大学); Tokyo Metropolitan University (东京都立大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages, 3 figures, conference
点击查看摘要
Abstract:We propose a low-rank adaptation method for training privacy-preserving vision transformer (ViT) models that efficiently freezes pre-trained ViT model weights. In the proposed method, trainable rank decomposition matrices are injected into each layer of the ViT architecture, and moreover, the patch embedding layer is not frozen, unlike in the case of the conventional low-rank adaptation methods. The proposed method allows us not only to reduce the number of trainable parameters but to also maintain almost the same accuracy as that of full-time tuning.
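摘要的做法是在 ViT 各层注入可训练的低秩分解矩阵、冻结预训练权重但保持 patch embedding 可训练;下面给出一个通用的 LoRA 线性层示意,秩、缩放系数与注入位置均为示例设置,并非论文的确切配置。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """冻结原线性层权重,仅训练低秩分解 A、B:y = Wx + (alpha/r) * B(Ax)。"""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # 冻结预训练权重
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # 初始时 LoRA 分支输出为 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# 注入示意(假设 timm 风格的 ViT 结构):替换各 block 注意力的 qkv 投影,
# 并按摘要描述保持 patch embedding 可训练
# for blk in vit.blocks:
#     blk.attn.qkv = LoRALinear(blk.attn.qkv, r=4)
# for p in vit.patch_embed.parameters():
#     p.requires_grad = True
```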
zh
[CV-78] A Multi-Level Similarity Approach for Single-View Object Grasping: Matching Planning and Fine-Tuning
【速读】:该论文试图解决从单视角未知物体抓取的问题,该问题在机器人领域因部分观测的不确定性而具有挑战性。其解决方案的关键在于引入一种新的视角——相似性匹配,通过利用已知物体的相似性来指导未知目标物体的抓取。该方法通过三个关键步骤实现:首先,利用观察到物体的视觉特征与包含多种物体模型的数据库进行相似性匹配,识别高相似性的候选物体;其次,使用具有预存抓取知识的候选模型为未知目标物体规划仿生抓取;最后,通过局部微调过程优化抓取质量。此外,论文提出了一种多层级相似性匹配框架,融合语义、几何和尺寸特征以提高评估的全面性,并引入了C-FPFH点云几何描述符以提升部分点云与完整点云之间的相似性评估精度。
链接: https://arxiv.org/abs/2507.11938
作者: Hao Chen,Takuya Kiyokawa,Zhengtao Hu,Weiwei Wan,Kensuke Harada
机构: Osaka University (大阪大学); School of Mechatronic Engineering and Automation, Shanghai University (上海大学机电工程与自动化学院); National Institute of Advanced Industrial Science and Technology (AIST) (日本产业技术综合研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE T-RO
点击查看摘要
Abstract:Grasping unknown objects from a single view has remained a challenging topic in robotics due to the uncertainty of partial observation. Recent advances in large-scale models have led to benchmark solutions such as GraspNet-1Billion. However, such learning-based approaches still face a critical limitation in performance robustness for their sensitivity to sensing noise and environmental changes. To address this bottleneck in achieving highly generalized grasping, we abandon the traditional learning framework and introduce a new perspective: similarity matching, where similar known objects are utilized to guide the grasping of unknown target objects. We newly propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps: 1) Leverage the visual features of the observed object to perform similarity matching with an existing database containing various object models, identifying potential candidates with high similarity; 2) Use the candidate models with pre-existing grasping knowledge to plan imitative grasps for the unknown target object; 3) Optimize the grasp quality through a local fine-tuning process. To address the uncertainty caused by partial and noisy observation, we propose a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. Especially, we introduce a novel point cloud geometric descriptor, the C-FPFH descriptor, which facilitates accurate similarity assessment between partial point clouds of observed objects and complete point clouds of database models. In addition, we incorporate the use of large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions. Videos are available at this https URL.
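摘要提出的 C-FPFH 是新设计的描述符,这里无法照搬;作为替代,下面用 Open3D 自带的标准 FPFH 演示"观测部分点云 vs. 数据库完整点云"的几何相似度排序思路(体素大小、搜索半径与取均值作签名的做法均为假设性简化)。

```python
import numpy as np
import open3d as o3d

def fpfh_signature(pcd, voxel=0.005):
    """下采样、估计法向并计算 FPFH,取所有点描述子的均值作为整体几何签名(简化做法)。"""
    down = pcd.voxel_down_sample(voxel)
    down.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        down, o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 5, max_nn=100))
    return np.asarray(fpfh.data).mean(axis=1)            # (33,) 维几何签名

def rank_candidates(observed_pcd, database_pcds):
    """按余弦相似度对数据库模型排序,返回 (索引, 相似度) 列表,相似度高者为候选模型。"""
    q = fpfh_signature(observed_pcd)
    sims = []
    for i, model in enumerate(database_pcds):
        s = fpfh_signature(model)
        sims.append((i, float(np.dot(q, s) / (np.linalg.norm(q) * np.linalg.norm(s) + 1e-8))))
    return sorted(sims, key=lambda x: -x[1])
```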
zh
[CV-79] Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLM s
【速读】:该论文试图解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在评估人类认知核心能力——心理可视化(mental visualization)方面的不足,现有基准主要评估被动视觉感知,而缺乏对内部构建视觉模式以支持问题解决的主动能力的评估。解决方案的关键是引入Hyperphantasia,一个合成基准,通过四个精心设计的谜题来评估MLLMs的心理可视化能力,每个任务均以程序生成并设置三个难度级别,从而实现对模型在不同复杂度下的性能控制分析。
链接: https://arxiv.org/abs/2507.11932
作者: Mohammad Shahab Sepehri,Berk Tinaz,Zalan Fabian,Mahdi Soltanolkotabi
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each task is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.
zh
[CV-80] Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark
【速读】:该论文旨在解决低光环境下传统相机因动态范围受限和长曝光运动模糊而难以捕捉清晰多视角图像的问题。同时,针对事件相机在低光条件下产生的噪声事件、低质量帧以及颜色色调不一致的问题,提出了一种基于事件辅助的3D高斯点云(3D Gaussian Splatting, GS)框架——Dark-EvGS。其关键在于引入了三元组级监督以获得整体知识、细粒度细节和清晰场景渲染,并设计了颜色色调匹配模块以确保渲染帧的颜色一致性。此外,还构建了首个真实采集的数据集用于事件引导的亮帧合成任务。
链接: https://arxiv.org/abs/2507.11931
作者: Jingqian Wu,Peiqi Duan,Zongqiang Wang,Changwei Wang,Boxin Shi,Edmund Y. Lam
机构: The University of Hong Kong(香港大学); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学多媒体信息处理国家重点实验室,计算机学院); Chinese Academy of Sciences(中国科学院); Qilu University of Technology(齐鲁工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high-dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively using an event-assisted 3D GS approach still faced challenges because, in low light, events are noisy, frames lack quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to gain holistic knowledge, granular details, and sharp scene rendering. The color tone matching block is proposed to guarantee the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, conquering radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.
zh
[CV-81] SEPose: A Synthetic Event-based Human Pose Estimation Dataset for Pedestrian Monitoring ITSC2025
【速读】:该论文试图解决在行人和交通监控系统中,由于注意力不集中或其他异常行为导致的安全关键情境下,因数据稀缺而影响系统响应性能的问题。解决方案的关键在于提出SEPose——一个基于动态视觉传感器在CARLA模拟器中生成的综合性合成事件基础人体姿态估计数据集,该数据集包含近35万标注行人及其身体姿态关键点,覆盖了多种光照和天气条件下的城市、郊区和农村环境中四向交叉口的密集和稀疏人群场景,从而为训练和评估先进模型提供了丰富的合成数据支持,并验证了其在真实事件基础数据上的模拟到现实的泛化能力。
链接: https://arxiv.org/abs/2507.11910
作者: Kaustav Chanda,Aayush Atul Verma,Arpitsinh Vaghela,Yezhou Yang,Bharatesh Chakravarthi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 28th IEEE International Conference on Intelligent Transportation Systems (ITSC 2025)
点击查看摘要
Abstract:Event-based sensors have emerged as a promising solution for addressing challenging conditions in pedestrian and traffic monitoring systems. Their low-latency and high dynamic range allow for improved response time in safety-critical situations caused by distracted walking or other unusual movements. However, the availability of data covering such scenarios remains limited. To address this gap, we present SEPose – a comprehensive synthetic event-based human pose estimation dataset for fixed pedestrian perception generated using dynamic vision sensors in the CARLA simulator. With nearly 350K annotated pedestrians with body pose keypoints from the perspective of fixed traffic cameras, SEPose is a comprehensive synthetic multi-person pose estimation dataset that spans busy and light crowds and traffic across diverse lighting and weather conditions in 4-way intersections in urban, suburban, and rural environments. We train existing state-of-the-art models such as RVT and YOLOv8 on our dataset and evaluate them on real event-based data to demonstrate the sim-to-real generalization capabilities of the proposed dataset.
zh
[CV-82] CompressedVQA-HDR: Generalized Full-reference and No-reference Quality Assessment Models for Compressed High Dynamic Range Videos ICME2025
【速读】:该论文旨在解决高动态范围(HDR)视频质量评估(VQA)中现有方法泛化能力不足的问题,尤其是在面对日益多样化的视频类型时。其解决方案的关键在于设计了一个针对HDR视频的全参考(FR)和无参考(NR)VQA框架,分别采用Swin Transformer和SigLip 2作为主干网络,并通过预训练与微调策略提升模型在HDR数据上的性能。
链接: https://arxiv.org/abs/2507.11900
作者: Wei Sun,Linhan Cao,Kang Fu,Dandan Zhu,Jun Jia,Menghan Hu,Xiongkuo Min,Guangtao Zhai
机构: East China Normal University (华东师范大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CompressedVQA-HDR won first place in the FR track of the Generalizable HDR SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025
点击查看摘要
Abstract:Video compression is a standard procedure applied to all videos to minimize storage and transmission demands while preserving visual quality as much as possible. Therefore, evaluating the visual quality of compressed videos is crucial for guiding the practical usage and further development of video compression algorithms. Although numerous compressed video quality assessment (VQA) methods have been proposed, they often lack the generalization capability needed to handle the increasing diversity of video types, particularly high dynamic range (HDR) content. In this paper, we introduce CompressedVQA-HDR, an effective VQA framework designed to address the challenges of HDR video quality assessment. Specifically, we adopt the Swin Transformer and SigLip 2 as the backbone networks for the proposed full-reference (FR) and no-reference (NR) VQA models, respectively. For the FR model, we compute deep structural and textural similarities between reference and distorted frames using intermediate-layer features extracted from the Swin Transformer as its quality-aware feature representation. For the NR model, we extract the global mean of the final-layer feature maps from SigLip 2 as its quality-aware representation. To mitigate the issue of limited HDR training data, we pre-train the FR model on a large-scale standard dynamic range (SDR) VQA dataset and fine-tune it on the HDRSDR-VQA dataset. For the NR model, we employ an iterative mixed-dataset training strategy across multiple compressed VQA datasets, followed by fine-tuning on the HDRSDR-VQA dataset. Experimental results show that our models achieve state-of-the-art performance compared to existing FR and NR VQA models. Moreover, CompressedVQA-HDR-FR won first place in the FR track of the Generalizable HDR SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025. The code is available at this https URL.
zh
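作为补充,下面给出一个极简的示意代码(非论文官方实现,函数与超参均为假设),说明“利用主干网络中间层特征计算参考帧与失真帧的结构/纹理相似度”这一全参考质量特征的常见写法(DISTS 风格),特征可来自 Swin Transformer 等任意主干:

```python
import torch

def structural_textural_similarity(ref_feat: torch.Tensor,
                                   dist_feat: torch.Tensor,
                                   eps: float = 1e-6) -> torch.Tensor:
    """ref_feat/dist_feat 形状为 (B, C, H, W),可取自任意中间层。"""
    # 纹理项:比较每个通道的全局均值
    mu_r = ref_feat.mean(dim=(2, 3))
    mu_d = dist_feat.mean(dim=(2, 3))
    texture = (2 * mu_r * mu_d + eps) / (mu_r ** 2 + mu_d ** 2 + eps)

    # 结构项:比较每个通道的方差与协方差
    var_r = ref_feat.var(dim=(2, 3), unbiased=False)
    var_d = dist_feat.var(dim=(2, 3), unbiased=False)
    cov = ((ref_feat - mu_r[..., None, None]) *
           (dist_feat - mu_d[..., None, None])).mean(dim=(2, 3))
    structure = (2 * cov + eps) / (var_r + var_d + eps)

    # 对通道取平均,得到该层的质量感知分数(越接近 1 越相似)
    return 0.5 * (texture.mean(dim=1) + structure.mean(dim=1))

# 用法示意:对若干中间层分别计算后再平均,交给回归头映射为质量分数
ref = torch.rand(2, 96, 56, 56)
dist = torch.rand(2, 96, 56, 56)
print(structural_textural_similarity(ref, dist).shape)  # torch.Size([2])
```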
[CV-83] Spatial Frequency Modulation for Semantic Segmentation
【速读】:该论文试图解决高空间频率信息在语义分割中的丢失问题,特别是由于下采样层(如步进卷积)导致的混叠或失真。解决方案的关键在于提出一种新颖的 Spatial Frequency Modulation (SFM),通过在下采样前将高频特征调制到低频,并在上采样时进行解调,从而有效保留细节信息。其核心实现包括自适应重采样(ARS)和多尺度自适应上采样(MSAU),前者用于密集采样以降低高频信号频率,后者通过非均匀上采样恢复高频信息并促进多尺度区域间的交互。
链接: https://arxiv.org/abs/2507.11893
作者: Linwei Chen,Ying Fu,Lin Gu,Dezhi Zheng,Jifeng Dai
机构: Beijing Institute of Technology(北京理工大学); RIKEN AIP(理化学研究所人工智能部门); The University of Tokyo(东京大学); Beihang University(北京航空航天大学); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accept by TPAMI 2025
点击查看摘要
Abstract:High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at this https URL.
zh
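下面给出“自适应重采样(ARS)”思路的一个假设性极简示意(非论文代码):用轻量卷积从特征图预测采样偏移,对高频区域加密采样后再下采样,从而降低局部空间频率、缓解混叠;偏移头与幅度限制均为示意设置:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveResample(nn.Module):
    """示意版 ARS:预测每个位置的采样偏移,用 grid_sample 实现非均匀采样。"""
    def __init__(self, channels: int, max_offset: float = 0.1):
        super().__init__()
        # 轻量的偏移预测头(2 个通道分别对应 x/y 偏移)
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        self.max_offset = max_offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # 基础规则网格,归一化到 [-1, 1](grid_sample 的约定)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)

        # 预测偏移并限制幅度:高频区域可被“拉密”,等效于局部降低空间频率
        offset = torch.tanh(self.offset_head(x)) * self.max_offset   # (B,2,H,W)
        grid = base + offset.permute(0, 2, 3, 1)
        resampled = F.grid_sample(x, grid, mode="bilinear",
                                  padding_mode="border", align_corners=True)
        # 调制后的特征再做下采样,混叠风险相对降低
        return F.avg_pool2d(resampled, kernel_size=2)

feat = torch.rand(1, 64, 32, 32)
print(AdaptiveResample(64)(feat).shape)  # torch.Size([1, 64, 16, 16])
```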
[CV-84] From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition
【速读】:该论文旨在解决动态面部表情识别(DFER)中存在的情感线索利用不足以及无关面部动态干扰的问题。其解决方案的关键在于提出GRACE方法,该方法通过动态运动建模、语义文本优化和令牌级跨模态对齐,实现情感显著时空特征的精准定位。具体而言,GRACE利用粗到细的情感文本增强(CATE)模块生成情感感知的文本描述,并通过运动差异加权机制突出与表情相关的面部运动,最终通过熵正则化最优传输实现语义与视觉信号的令牌级对齐。
链接: https://arxiv.org/abs/2507.11892
作者: Yu Liu,Leyuan Qu,Hanlei Shi,Di Gao,Yuhua Zheng,Taihao Li
机构: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences (杭州高级研究院,中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Dynamic Facial Expression Recognition (DFER) aims to identify human emotions from temporally evolving facial movements and plays a critical role in affective computing. While recent vision-language approaches have introduced semantic textual descriptions to guide expression recognition, existing methods still face two key limitations: they often underutilize the subtle emotional cues embedded in generated text, and they have yet to incorporate sufficiently effective mechanisms for filtering out facial dynamics that are irrelevant to emotional expression. To address these gaps, we propose GRACE (Granular Representation Alignment for Cross-modal Emotion recognition), which integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient spatiotemporal features. Our method constructs emotion-aware textual descriptions via a Coarse-to-fine Affective Text Enhancement (CATE) module and highlights expression-relevant facial motion through a motion-difference weighting mechanism. These refined semantic and visual signals are aligned at the token level using entropy-regularized optimal transport. Experiments on three benchmark datasets demonstrate that our method significantly improves recognition performance, particularly in challenging settings with ambiguous or imbalanced emotion classes, establishing new state-of-the-art (SOTA) results in terms of both UAR and WAR.
zh
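摘要中提到的熵正则化最优传输可以用标准的 Sinkhorn 迭代实现。下面是一个通用的简化示意(非论文实现),把文本 token 与视觉 token 的余弦距离作为传输成本,迭代次数与正则系数均为示例取值:

```python
import torch

def sinkhorn_alignment(text_tok, vis_tok, epsilon=0.05, n_iters=50):
    """text_tok: (N, D), vis_tok: (M, D);返回软对齐矩阵 P 与对齐损失。"""
    text = torch.nn.functional.normalize(text_tok, dim=-1)
    vis = torch.nn.functional.normalize(vis_tok, dim=-1)
    cost = 1.0 - text @ vis.t()                # 余弦距离作为传输成本 (N, M)

    K = torch.exp(-cost / epsilon)             # 熵正则化后的核
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))
    u = torch.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn 迭代:交替缩放行/列
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = torch.diag(u) @ K @ torch.diag(v)      # 最优软对齐(传输计划)
    return P, (P * cost).sum()

P, loss = sinkhorn_alignment(torch.rand(8, 256), torch.rand(49, 256))
print(P.shape, float(loss))
```

对齐矩阵 P 可进一步用于跨模态特征加权聚合,对齐损失则可加入训练目标,这与“token 级跨模态对齐”的总体思路一致。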
[CV-85] Towards Autonomous Riding: A Review of Perception, Planning, and Control in Intelligent Two-Wheelers
【速读】:该论文试图解决两轮电动交通工具(如电动滑板车和电动自行车)在自动驾驶(Autonomous Riding, AR)中面临的安全性和可靠性问题。其关键解决方案在于通过系统分析AR系统的感知、规划与控制等核心组件,结合自动驾驶(Autonomous Driving, AD)技术的成熟经验,识别当前AR研究中的关键差距,并提出如多模态传感器技术与轻量化边缘深度学习架构等潜在研究方向,以推动安全、高效且可扩展的自主骑行系统的发展。
链接: https://arxiv.org/abs/2507.11852
作者: Mohammed Hassanin,Mohammad Abu Alsheikh,Carlos C. N. Kuhn,Damith Herath,Dinh Thai Hoang,Ibrahim Radwan
机构: University of Canberra(堪培拉大学); OpenSI(OpenSI); University of Technology Sydney(悉尼科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages
点击查看摘要
Abstract:The rapid adoption of micromobility solutions, particularly two-wheeled vehicles like e-scooters and e-bikes, has created an urgent need for reliable autonomous riding (AR) technologies. While autonomous driving (AD) systems have matured significantly, AR presents unique challenges due to the inherent instability of two-wheeled platforms, limited size, limited power, and unpredictable environments, which pose very serious concerns about road users’ safety. This review provides a comprehensive analysis of AR systems by systematically examining their core components, perception, planning, and control, through the lens of AD technologies. We identify critical gaps in current AR research, including a lack of comprehensive perception systems for various AR tasks, limited industry and government support for such developments, and insufficient attention from the research community. The review analyses the gaps of AR from the perspective of AD to highlight promising research directions, such as multimodal sensor techniques for lightweight platforms and edge deep learning architectures. By synthesising insights from AD research with the specific requirements of AR, this review aims to accelerate the development of safe, efficient, and scalable autonomous riding systems for future urban mobility.
zh
[CV-86] ProtoConNet: Prototypical Augmentation and Alignment for Open-Set Few-Shot Image Classification
【速读】:该论文试图解决少样本图像分类中模型在面对未知环境时泛化能力不足的问题,以及现有方法过于依赖单一图像的视觉信息而忽视了丰富上下文信息的潜在价值。其解决方案的关键在于提出一种原型增强与对齐方法(ProtoConNet),通过引入不同样本的背景信息来增强特征空间的多样性,打破上下文与图像主体之间的虚假关联,从而提升模型在少样本场景下的表示学习效果和开放集样本识别能力。
链接: https://arxiv.org/abs/2507.11845
作者: Kexuan Shi,Zhuang Qi,Jingjing Zhu,Lei Meng,Yaochen Zhang,Haibei Huang,Xiangxu Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ChinaMM and recommended to Displays
点击查看摘要
Abstract:Open-set few-shot image classification aims to train models using a small amount of labeled data, enabling them to achieve good generalization when confronted with unknown environments. Existing methods mainly use visual information from a single image to learn class representations to distinguish known from unknown categories. However, these methods often overlook the benefits of integrating rich contextual information. To address this issue, this paper proposes a prototypical augmentation and alignment method, termed ProtoConNet, which incorporates background information from different samples to enhance the diversity of the feature space, breaking the spurious associations between context and image subjects in few-shot scenarios. Specifically, it consists of three main modules: the clustering-based data selection (CDS) module mines diverse data patterns while preserving core features; the contextual-enhanced semantic refinement (CSR) module builds a context dictionary to integrate into image representations, which boosts the model’s robustness in various scenarios; and the prototypical alignment (PA) module reduces the gap between image representations and class prototypes, amplifying feature distances for known and unknown classes. Experimental results from two datasets verified that ProtoConNet enhances the effectiveness of representation learning in few-shot scenarios and identifies open-set samples, making it superior to existing methods.
zh
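其中“原型对齐(PA)”与原型网络的常见做法相近。下面给出一个通用的简化示意(非论文实现,维度与样本划分均为假设):由支持集求各类原型,再以查询特征到原型的距离构造对齐/分类损失:

```python
import torch
import torch.nn.functional as F

def prototypical_alignment_loss(support_feat, support_labels, query_feat, query_labels):
    """support_feat: (Ns, D), query_feat: (Nq, D);标签为 0..C-1 的整数张量。"""
    classes = torch.unique(support_labels)
    # 每个类别的原型 = 该类支持样本特征的均值
    prototypes = torch.stack([support_feat[support_labels == c].mean(dim=0)
                              for c in classes])                      # (C, D)
    # 查询样本到各原型的负平方欧氏距离作为 logits:拉近同类、拉远异类
    logits = -torch.cdist(query_feat, prototypes) ** 2                # (Nq, C)
    # 把查询标签映射到 classes 的下标后计算交叉熵
    target = torch.stack([(classes == y).nonzero(as_tuple=True)[0][0]
                          for y in query_labels])
    return F.cross_entropy(logits, target)

sup_y = torch.arange(3).repeat_interleave(5)   # 3 类,每类 5 个支持样本
qry_y = torch.arange(3).repeat_interleave(3)   # 每类 3 个查询样本
loss = prototypical_alignment_loss(torch.randn(15, 64), sup_y,
                                   torch.randn(9, 64), qry_y)
print(float(loss))
```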
[CV-87] CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning ECAI2025
【速读】:该论文旨在解决图像对之间建立可靠对应关系的问题,这是计算机视觉中的基础任务,广泛应用于三维重建和视觉定位等领域。现有方法虽在从密集对应集中剔除异常值方面取得进展,但通常假设视觉领域的一致性,忽视了不同场景结构带来的挑战。论文提出的解决方案是CorrMoE,其关键在于引入De-stylization Dual Branch以应对领域偏移,通过隐式和显式图特征的风格混合来减轻领域特定表示的负面影响;同时设计Bi-Fusion Mixture of Experts模块,通过线性复杂度注意力和动态专家路由自适应地融合多视角特征,从而提升跨领域和跨场景的鲁棒性。
链接: https://arxiv.org/abs/2507.11834
作者: Peiwen Xia,Tangfei Liao,Wei Zhu,Danhuai Zhao,Jianjun Ke,Kaihao Zhang,Tong Lu,Tao Wang
机构: Nanjing University (南京大学); China Mobile Zijin Innovation Institute (中国移动紫金创新研究院); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECAI 2025
点击查看摘要
Abstract:Establishing reliable correspondences between image pairs is a fundamental task in computer vision, underpinning applications such as 3D reconstruction and visual localization. Although recent methods have made progress in pruning outliers from dense correspondence sets, they often hypothesize consistent visual domains and overlook the challenges posed by diverse scene structures. In this paper, we propose CorrMoE, a novel correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. To address domain shift, we introduce a De-stylization Dual Branch, performing style mixing on both implicit and explicit graph features to mitigate the adverse influence of domain-specific representations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through linear-complexity attention and dynamic expert routing. Extensive experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods. The code and pre-trained models are available at this https URL.
zh
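摘要中的“动态专家路由”属于 Mixture-of-Experts 的通用机制。下面给出一个极简示意(并非论文中 Bi-Fusion MoE 的具体结构,专家数与 top-k 均为示例):路由器为每个特征产生专家权重,按 top-k 稀疏加权聚合各专家输出:

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """极简 Mixture-of-Experts:线性路由器 + top-k 软路由聚合。"""
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, D)。路由 logits -> 只保留 top-k 个专家并重新归一化权重
        logits = self.router(x)                                  # (N, E)
        topv, topi = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(topv, dim=-1)                    # (N, k)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topi[:, slot]                                  # 每个样本选中的专家编号
            w = weights[:, slot:slot + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # 按路由权重累加被选中专家的输出
                    out[mask] = out[mask] + w[mask] * expert(x[mask])
        return out

moe = SimpleMoE(dim=128)
print(moe(torch.randn(10, 128)).shape)   # torch.Size([10, 128])
```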
[CV-88] MNIST-Gen: A Modular MNIST-Style Dataset Generation Using Hierarchical Semantics, Reinforcement Learning, and Category Theory
【速读】:该论文试图解决现有标准数据集(如MNIST、FashionMNIST)在领域特定任务中的不足,例如分类树木或食物等现实世界对象时的不相关性和局限性,同时应对自定义数据集创建过程中存在的耗时、法律限制或项目范围超出等问题。其解决方案的关键在于提出MNIST-Gen框架,该框架利用基于CLIP的语义理解结合强化学习与人类反馈,实现最小人工干预下的智能分类,并通过分层语义分类结构支持复杂的类别结构和细粒度子分类,从而高效生成定制化的MNIST风格图像数据集。
链接: https://arxiv.org/abs/2507.11821
作者: Pouya Shaeri,Arash Karimi,Ariane Middel
机构: Arizona State University (亚利桑那州立大学); Florida State University (佛罗里达州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Submitted to a computer science conference
点击查看摘要
Abstract:Neural networks are often benchmarked using standard datasets such as MNIST, FashionMNIST, or other variants of MNIST, which, while accessible, are limited to generic classes such as digits or clothing items. For researchers working on domain-specific tasks, such as classifying trees, food items, or other real-world objects, these datasets are insufficient and irrelevant. Additionally, creating and publishing a custom dataset can be time-consuming, legally constrained, or beyond the scope of individual projects. We present MNIST-Gen, an automated, modular, and adaptive framework for generating MNIST-style image datasets tailored to user-specified categories using hierarchical semantic categorization. The system combines CLIP-based semantic understanding with reinforcement learning and human feedback to achieve intelligent categorization with minimal manual intervention. Our hierarchical approach supports complex category structures with semantic characteristics, enabling fine-grained subcategorization and multiple processing modes: individual review for maximum control, smart batch processing for large datasets, and fast batch processing for rapid creation. Inspired by category theory, MNIST-Gen models each data transformation stage as a composable morphism, enhancing clarity, modularity, and extensibility. As proof of concept, we generate and benchmark two novel datasets, Tree-MNIST and Food-MNIST, demonstrating MNIST-Gen’s utility for producing task-specific evaluation data while achieving 85% automatic categorization accuracy and 80% time savings compared to manual approaches.
zh
[CV-89] Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning
【速读】:该论文旨在解决现有深度抽象视觉推理(AVR)求解器在处理不同AVR任务时通常需要任务特定设计或参数的问题,这导致解决新任务时需重新训练模型甚至调整模型结构,增加了成本。论文提出的解决方案的关键在于提出一种统一的条件生成求解器(UCGS),通过将多个AVR任务重新表述为在问题面板中估计目标图像可预测性的任务,从而在一个统一框架下训练一个条件生成模型即可解决多种AVR任务,实验表明该方法具备跨任务的抽象推理能力,并展现出零样本推理能力。
链接: https://arxiv.org/abs/2507.11761
作者: Fan Shi,Bin Li,Xiangyang Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios. Designing intelligent systems with human-like AVR abilities has been a long-standing topic in the artificial intelligence community. Deep AVR solvers have recently achieved remarkable success in various AVR tasks. However, they usually use task-specific designs or parameters in different tasks. In such a paradigm, solving new tasks often means retraining the model, and sometimes retuning the model architectures, which increases the cost of solving AVR problems. In contrast to task-specific approaches, this paper proposes a novel Unified Conditional Generative Solver (UCGS), aiming to address multiple AVR tasks in a unified framework. First, we prove that some well-known AVR tasks can be reformulated as the problem of estimating the predictability of target images in problem panels. Then, we illustrate that, under the proposed framework, training one conditional generative model can solve various AVR tasks. The experiments show that with a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks. Especially, UCGS exhibits the ability of zero-shot reasoning, enabling it to perform abstract reasoning on problems from unseen AVR tasks in the testing phase.
zh
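按照“把 AVR 重构为估计目标图像可预测性”的思路,推理时只需用条件生成模型给每个候选答案打分。下面是一个假设性的流程示意(cond_model 接口为占位,实际应替换为论文中训练好的条件生成模型):

```python
import torch

def solve_avr_by_predictability(cond_model, context_panels, candidates):
    """context_panels: (1, K, C, H, W) 问题面板;candidates: (A, C, H, W) 候选答案。
    cond_model(context, target) 假定返回 target 在 context 条件下的对数似然
    (或负重建误差),可预测性越高分数越大。"""
    scores = []
    for a in range(candidates.size(0)):
        target = candidates[a:a + 1]                  # (1, C, H, W)
        with torch.no_grad():
            scores.append(cond_model(context_panels, target))
    scores = torch.stack(scores)                      # (A,)
    return int(scores.argmax())                       # 可预测性最高的候选即为答案

# 用一个随意的打分函数代替真实条件生成模型,仅演示流程
fake_model = lambda ctx, tgt: -((ctx.mean() - tgt.mean()) ** 2)
ctx = torch.rand(1, 8, 1, 32, 32)
cands = torch.rand(8, 1, 32, 32)
print(solve_avr_by_predictability(fake_model, ctx, cands))
```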
[CV-90] Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis
【速读】:该论文旨在解决户外广告中标志牌文本在真实环境条件下可视性验证的难题。传统光学字符识别(OCR)流程在处理复杂户外场景、字体变化及天气引起的视觉噪声时表现不佳,而多模态视觉-语言模型(VLMs)虽能提供端到端的场景理解,但其计算成本较高。论文的关键解决方案是通过对比基于轻量级卷积神经网络(CNN)的OCR基线(如PaddleOCRv4)与多个代表性VLMs(如Qwen 2.5 VL 3B、InternVL3和SmolVLM2)在两个公开数据集(ICDAR 2015和SVT)上的性能,特别是在引入合成天气失真后的增强数据集上,以评估不同方法在计算效率与识别准确率之间的平衡。
链接: https://arxiv.org/abs/2507.11730
作者: Maciej Szankin,Vidhyananth Venkatasamy,Lihang Ying
机构: SiMa.ai
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs - including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2 - against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost-an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly.
zh
[CV-91] he Impact of Coreset Selection on Spurious Correlations and Group Robustness
【速读】:该论文试图解决数据集缩减方法可能加剧、放大或缓解数据中的虚假相关性(spurious correlations)问题,从而影响下游模型的鲁棒性。其解决方案的关键在于通过系统分析不同数据选择策略对虚假偏差水平和模型鲁棒性的影响,揭示样本难度与偏差一致性之间的复杂交互关系,以及数据集偏差对模型性能的影响。研究发现,基于嵌入的样本表征分数在选择核心数据集时相较于基于学习动态的表征方式,能更有效地降低无意中加剧偏差的风险,但仅依靠选择困难样本并不能可靠地保证模型的鲁棒性。
链接: https://arxiv.org/abs/2507.11690
作者: Amaya Dharmasiri,William Yang,Polina Kirichenko,Lydia Liu,Olga Russakovsky
机构: Princeton University (普林斯顿大学); FAIR at Meta (Meta人工智能实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 additional pages for Appendix
点击查看摘要
Abstract:Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlation benchmarks, five score metrics to characterize sample importance/difficulty, and five data selection policies across a broad range of coreset sizes. Thereby, we unravel a series of nontrivial nuances in interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods could achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.
zh
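文中比较的“基于嵌入的样本表征分数 + 选择策略”可以用如下通用示意代码理解(假设性实现,该评分方式仅为论文考察的多种指标之一):以样本到类中心的余弦距离作为难度分数,再按策略截取核心子集:

```python
import numpy as np

def embedding_difficulty_scores(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """难度分数 = 1 - 与所属类中心的余弦相似度(越大越“难”/越不典型)。"""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = np.zeros(len(emb))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        center = emb[idx].mean(axis=0)
        center /= np.linalg.norm(center)
        scores[idx] = 1.0 - emb[idx] @ center
    return scores

def select_coreset(scores: np.ndarray, fraction: float, policy: str = "hard") -> np.ndarray:
    """按比例 fraction 选择样本下标;policy 可取 'hard'(难样本优先)或 'easy'。"""
    k = max(1, int(len(scores) * fraction))
    order = np.argsort(scores)          # 从易到难排序
    return order[-k:] if policy == "hard" else order[:k]

emb = np.random.randn(1000, 128)
lab = np.random.randint(0, 10, size=1000)
coreset_idx = select_coreset(embedding_difficulty_scores(emb, lab), fraction=0.2)
print(coreset_idx.shape)   # (200,)
```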
[CV-92] VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization
【速读】:该论文试图解决在非结构化环境中,代理在不同会话或由其他代理生成的地图中进行全局定位的问题,尤其是在缺乏参考帧相关性先验知识的情况下。传统的位置识别方法在视角变化、季节性变化、空间混叠和遮挡等情况下容易失效。解决方案的关键在于提出一种名为VISTA(View-Invariant Segmentation-Based Tracking for Frame Alignment)的新型开放式单目全局定位框架,该框架结合了基于对象的分割与跟踪前端流程以及子地图对应搜索,利用环境地图之间的几何一致性来对齐车辆参考帧,从而实现跨多样相机视角和季节性变化的一致定位。
链接: https://arxiv.org/abs/2507.11653
作者: Hannah Shafferman,Annika Thomas,Jouko Kinnari,Michael Ricard,Jose Nino,Jonathan How
机构: Charles Stark Draper Laboratory, Inc.(查尔斯·斯塔克·德普勒实验室,公司); Massachusetts Institute of Technology (麻省理工学院); Saab Finland Oy (萨博芬兰公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 9 pages, 6 figures. This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, spatial aliasing, and occlusions – known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.
zh
[CV-93] Posture-Driven Action Intent Inference for Playing style and Fatigue Assessment
【速读】:该论文试图解决通过姿态分析推断人类心理状态的问题,特别是在疲劳诊断、伤害预防和性能提升等领域的应用。其解决方案的关键在于利用体育场景中人类受试者在不同情绪状态下的数据进行研究,并通过运动分析实现对击球意图(aggressive and defensive shot intent)的识别,该方法在F1分数和AUC-ROC指标上分别达到了75%以上和80%以上。研究证明,即使存在数据管道中的噪声,姿态仍然能够提供强有力的意图推断信号。此外,研究还利用现有数据统计作为弱监督手段来验证结果,为克服数据标注限制提供了潜在解决方案。
链接: https://arxiv.org/abs/2507.11642
作者: Abhishek Jaiswal,Nisheeth Srivastava
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Posture-based mental state inference has significant potential in diagnosing fatigue, preventing injury, and enhancing performance across various domains. Such tools must be research-validated with large datasets before being translated into practice. Unfortunately, such vision diagnosis faces serious challenges due to the sensitivity of human subject data. To address this, we identify sports settings as a viable alternative for accumulating data from human subjects experiencing diverse emotional states. We test our hypothesis in the game of cricket and present a posture-based solution to identify human intent from activity videos. Our method achieves over 75% F1 score and over 80% AUC-ROC in discriminating aggressive and defensive shot intent through motion analysis. These findings indicate that posture leaks out strong signals for intent inference, even with inherent noise in the data pipeline. Furthermore, we utilize existing data statistics as weak supervision to validate our findings, offering a potential solution for overcoming data labelling limitations. This research contributes to generalizable techniques for sports analytics and also opens possibilities for applying human behavior analysis across various fields.
zh
[CV-94] Interpretable Prediction of Lymph Node Metastasis in Rectal Cancer MRI Using Variational Autoencoders
【速读】:该论文试图解决直肠癌中淋巴结转移(Lymph Node Metastasis, LNM)分期的准确性问题,现有基于淋巴结大小、形态和纹理的影像学标准诊断准确性有限。其解决方案的关键在于采用变分自编码器(Variational Autoencoder, VAE)作为特征编码模型,替代传统方法中使用的大型预训练卷积神经网络(Convolutional Neural Network, CNN),以获取更具解释性的潜在空间表示,从而提升对LNM的识别性能。
链接: https://arxiv.org/abs/2507.11638
作者: Benjamin Keel,Aaron Quyn,David Jayne,Maryam Mohsin,Samuel D. Relton
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in Medical Image Understanding and Analysis (MIUA) 2025
点击查看摘要
Abstract:Effective treatment for rectal cancer relies on accurate lymph node metastasis (LNM) staging. However, radiological criteria based on lymph node (LN) size, shape and texture morphology have limited diagnostic accuracy. In this work, we investigate applying a Variational Autoencoder (VAE) as a feature encoder model to replace the large pre-trained Convolutional Neural Network (CNN) used in existing approaches. The motivation for using a VAE is that the generative model aims to reconstruct the images, so it directly encodes visual features and meaningful patterns across the data. This leads to a disentangled and structured latent space which can be more interpretable than a CNN. Models are deployed on an in-house MRI dataset with 168 patients who did not undergo neo-adjuvant treatment. The post-operative pathological N stage was used as the ground truth to evaluate model predictions. Our proposed model ‘VAE-MLP’ achieved state-of-the-art performance on the MRI dataset, with cross-validated metrics of AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05. Code is available at: this https URL.
zh
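下面给出“VAE 编码器 + MLP 分类头(VAE-MLP)”这一思路的最小示意(结构与维度均为假设,非论文原始实现):编码器输出潜变量的均值与对数方差,经重参数化采样后由 MLP 进行 LNM 二分类:

```python
import torch
import torch.nn as nn

class VAEMLP(nn.Module):
    """示意:编码器 -> (mu, logvar) -> 重参数化 -> MLP 分类头。"""
    def __init__(self, in_dim=1024, latent_dim=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                        nn.Linear(32, n_classes))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # 重参数化采样
        recon = self.decoder(z)                                   # 重建用于 VAE 训练目标
        logits = self.classifier(mu)                              # 分类使用较稳定的 mu
        # KL 散度项,约束潜空间接近标准正态
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return logits, recon, kl

model = VAEMLP()
x = torch.randn(4, 1024)                  # 假设输入为展平/预提取的影像特征
logits, recon, kl = model(x)
print(logits.shape, recon.shape, float(kl))
```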
[CV-95] SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation ICML2025
【速读】:该论文试图解决CAD草图生成中的两个关键挑战:原始参数化的异构性以及CAD草图中原始对象的排列不变性。其解决方案的关键在于提出Gaussian-Softmax扩散模型,通过将带有高斯噪声的logits经过softmax变换投影到概率单纯形上,从而实现离散变量的混合类别标签,进而提升生成质量。该方法在SketchGraphs数据集上显著降低了Fréchet Inception Distance(FID)和负对数似然(NLL),达到了当前最先进的性能。
链接: https://arxiv.org/abs/2507.11579
作者: Sathvik Chereddy,John Femiani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 63 figures, Proceedings of the 42nd International Conference on Machine Learning (ICML2025)
点击查看摘要
Abstract:We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses 2 key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
zh
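“Gaussian-Softmax 扩散”的核心操作可用几行代码示意(记号与噪声调度为示意,非论文实现):对类别 one-hot 的 logits 按前向扩散加高斯噪声,再经 softmax 投影回概率单纯形,得到混合的软类别标签:

```python
import torch
import torch.nn.functional as F

def gaussian_softmax_noising(labels: torch.Tensor, num_classes: int,
                             alpha_bar_t: float) -> torch.Tensor:
    """labels: (N,) 整数类别;alpha_bar_t 为扩散进度对应的累计系数(示例记号)。
    返回 (N, C) 的软标签:噪声越大,分布越接近均匀。"""
    logits0 = F.one_hot(labels, num_classes).float()          # 初始 logits(one-hot)
    noise = torch.randn_like(logits0)
    # 按前向扩散过程对 logits 加噪(系数与调度仅为示意)
    noisy_logits = alpha_bar_t ** 0.5 * logits0 + (1 - alpha_bar_t) ** 0.5 * noise
    # softmax 投影到概率单纯形,得到混合类别标签
    return torch.softmax(noisy_logits, dim=-1)

labels = torch.tensor([0, 2, 1])
print(gaussian_softmax_noising(labels, num_classes=4, alpha_bar_t=0.9))   # 接近 one-hot
print(gaussian_softmax_noising(labels, num_classes=4, alpha_bar_t=0.05))  # 接近均匀分布
```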
[CV-96] What cat is that? A re-id model for feral cats
【速读】:该论文旨在解决如何有效监测野猫(feral cats)对澳大利亚野生动物造成的生态影响问题,其核心挑战在于通过图像识别技术实现对个体野猫的再识别(Re-Identification, re-ID)。其解决方案的关键在于对部分姿态引导网络(Part-Pose Guided Network, PPGNet)模型进行适应性改进,形成了适用于野猫图像特征的PPGNet-Cat模型,并结合对比学习方法如ArcFace损失函数,以提升模型在野外环境下的识别性能,最终实现了较高的平均精度(mAP=0.86)和排名1准确率(rank-1 accuracy=0.95)。
链接: https://arxiv.org/abs/2507.11575
作者: Victor Caquilpan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Master’s project
点击查看摘要
Abstract:Feral cats exert a substantial and detrimental impact on Australian wildlife, placing them among the most dangerous invasive species worldwide. Therefore, closely monitoring these cats is essential labour in minimising their effects. In this context, the potential application of Re-Identification (re-ID) emerges to enhance monitoring activities for these animals, utilising images captured by camera traps. This project explores different CV approaches to create a re-ID model able to identify individual feral cats in the wild. The main approach consists of modifying a part-pose guided network (PPGNet) model, initially used in the re-ID of Amur tigers, to be applicable for feral cats. This adaptation results in PPGNet-Cat, which incorporates specific modifications to suit the characteristics of feral cat images. Additionally, various experiments were conducted, particularly exploring contrastive learning approaches such as ArcFace loss. The main results indicate that PPGNet-Cat excels in identifying feral cats, achieving high performance with a mean Average Precision (mAP) of 0.86 and a rank-1 accuracy of 0.95. These outcomes establish PPGNet-Cat as a competitive model within the realm of re-ID.
zh
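文中用到的 ArcFace 损失是通用的度量学习组件,可用如下简化代码示意(通用写法,非论文代码,s 与 m 取常见默认值):对归一化特征与类权重的夹角加上角度间隔 m,再以尺度 s 放大后做交叉熵:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, feat_dim: int, num_ids: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, feat_dim))
        self.s, self.m = s, m

    def forward(self, feat: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # 余弦相似度 = 归一化特征与归一化类中心权重的内积
        cos = F.linear(F.normalize(feat), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # 仅对真实身份所在的角度加上间隔 m,使同类更紧凑、异类更分离
        target_mask = F.one_hot(labels, self.weight.size(0)).bool()
        cos_with_margin = torch.where(target_mask, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * cos_with_margin, labels)

head = ArcFaceHead(feat_dim=256, num_ids=50)
loss = head(torch.randn(8, 256), torch.randint(0, 50, (8,)))
print(float(loss))
```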
[CV-97] Data-Driven Meta-Analysis and Public-Dataset Evaluation for Sensor-Based Gait Age Estimation
【速读】:该论文旨在解决从步态中估计个体年龄的问题,这一技术在医疗、安全和人机交互等领域具有重要应用。其关键解决方案是通过融合多种传感器数据(如视频、可穿戴设备和雷达)并利用深度学习模型(如卷积神经网络)进行分析,同时结合对步态特征与年龄相关性的量化研究,以及对模型决策过程的可视化解释(如使用Grad-CAM技术),从而实现更准确的年龄估计,并在实际场景中将误差降低至三年以下。
链接: https://arxiv.org/abs/2507.11571
作者: Varun Velankar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Estimating a person’s age from their gait has important applications in healthcare, security and human-computer interaction. In this work, we review fifty-nine studies involving over seventy-five thousand subjects recorded with video, wearable and radar sensors. We observe that convolutional neural networks produce an average error of about 4.2 years, inertial-sensor models about 4.5 years and multi-sensor fusion as low as 3.4 years, with notable differences between lab and real-world data. We then analyse sixty-three thousand eight hundred forty-six gait cycles from the OU-ISIR Large-Population dataset to quantify correlations between age and five key metrics: stride length, walking speed, step cadence, step-time variability and joint-angle entropy, with correlation coefficients of at least 0.27. Next, we fine-tune a ResNet34 model and apply Grad-CAM to reveal that the network attends to the knee and pelvic regions, consistent with known age-related gait changes. Finally, on a one hundred thousand sample subset of the VersatileGait database, we compare support vector machines, decision trees, random forests, multilayer perceptrons and convolutional neural networks, finding that deep networks achieve up to 96 percent accuracy while processing each sample in under 0.1 seconds. By combining a broad meta-analysis with new large-scale experiments and interpretable visualizations, we establish solid performance baselines and practical guidelines for reducing gait-age error below three years in real-world scenarios.
zh
[CV-98] Expert Operational GANS: Towards Real-Color Underwater Image Restoration
【速读】:该论文试图解决水下图像复原中由于复杂光传播、散射和深度依赖性衰减导致的多种退化伪影问题。传统基于生成对抗网络(GAN)的复原方法因仅使用单一生成器网络而难以有效处理这一异质领域,因为单个生成器通常无法捕捉全部视觉退化类型。该论文提出的xOp-GAN解决方案的关键在于引入多个专家生成器网络,每个生成器仅在特定图像质量子集上进行训练,从而使其能够在特定质量范围内最大化复原性能;在推理阶段,判别器根据感知置信度分数选择最佳复原图像,这是首个在回归任务推理过程中利用判别器的多生成器GAN模型。
链接: https://arxiv.org/abs/2507.11562
作者: Ozer Can Devecioglu,Serkan Kiranyaz,Mehmet Yamac,Moncef Gabbouj
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages
点击查看摘要
Abstract:The wide range of deformation artifacts that arise from complex light propagation, scattering, and depth-dependent attenuation makes underwater image restoration a challenging problem. Like other single deep regressor networks, conventional GAN-based restoration methods struggle to perform well across this heterogeneous domain, since a single generator network is typically insufficient to capture the full range of visual degradations. In order to overcome this limitation, we propose xOp-GAN, a novel GAN model with several expert generator networks, each trained solely on a particular subset with a certain image quality. Thus, each generator can learn to maximize its restoration performance for a particular quality range. Once an xOp-GAN is trained, each generator can restore the input image and the best restored image can then be selected by the discriminator based on its perceptual confidence score. As a result, xOp-GAN is the first GAN model with multiple generators where the discriminator is used during the inference of a regression task. Experimental results on the benchmark Large Scale Underwater Image (LSUI) dataset demonstrate that xOp-GAN achieves PSNR levels up to 25.16 dB, surpassing all single-regressor models by a large margin, even with reduced complexity.
zh
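xOp-GAN 在推理阶段“由判别器在多个专家生成器的复原结果中择优”的流程,可用如下示意代码表达(接口均为假设,示例中用恒等映射与随意的打分函数代替真实网络,仅演示数据流):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def restore_with_expert_gans(x, generators, discriminator):
    """x: (B, C, H, W) 退化水下图像;generators: 每个质量区间一个专家生成器;
    discriminator(img) 假定输出每张图的感知置信度分数 (B,)。"""
    candidates = torch.stack([g(x) for g in generators], dim=0)     # (E, B, C, H, W)
    scores = torch.stack([discriminator(c) for c in candidates])    # (E, B)
    best = scores.argmax(dim=0)                                     # 每个样本得分最高的专家
    return candidates[best, torch.arange(x.size(0))]                # (B, C, H, W)

# 用恒等映射和简单打分器演示数据流(真实系统中替换为训练好的网络)
gens = [nn.Identity() for _ in range(3)]
disc = lambda img: img.mean(dim=(1, 2, 3))   # 假设的“置信度”,仅作形状演示
out = restore_with_expert_gans(torch.rand(2, 3, 64, 64), gens, disc)
print(out.shape)   # torch.Size([2, 3, 64, 64])
```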
[CV-99] Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting
【速读】:该论文试图解决将视觉基础模型(Vision Foundation Models, VFMs)应用于通用时空预测任务时所面临的两个关键挑战:一是VFMs缺乏内在的时间建模能力,二是视觉数据与时空数据之间的模态差异。解决方案的关键在于提出一种双分支架构,通过融合原始时空输入与辅助时空流输入,其中流编码器用于表示可解释的动态空间线索,并引入两个专门的再编程阶段:预-VFM阶段采用时间感知的Token适配器以嵌入时间上下文并对齐特征空间,后-VFM阶段则通过基于提示的协调模块实现分支间的动态交互,从而在不修改冻结的VFM主干的情况下增强联合表征学习。
链接: https://arxiv.org/abs/2507.11558
作者: Changlu Chen,Yanbin Liu,Chaoxi Niu,Ling Chen,Tianqing Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines, demonstrating effectiveness and robustness across VFM backbones (e.g., DINO, CLIP, DEIT) and ablation studies, establishing it as a strong general framework for spatio-temporal forecasting.
zh
[CV-100] Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在对齐过程中依赖计算密集型训练的基模型和奖励模型所带来的高计算开销及可能影响模型准确性和训练效率的问题。其解决方案的关键在于提出Inversion-DPO框架,该框架通过将直接偏好优化(Direct Preference Optimization, DPO)与DDIM反演相结合,避免了传统方法中对奖励模型的依赖,从而实现了无需辅助奖励模型或近似方法的后训练范式,显著提升了训练的精度与效率。
链接: https://arxiv.org/abs/2507.11554
作者: Zejian Li,Yize Li,Chenye Meng,Zhongni Liu,Yang Ling,Shengyuan Zhang,Guang Yang,Changyuan Yang,Zhiyuan Yang,Lingyun Sun
机构: Zhejiang University(浙江大学); University of Electronic Science and Technology of China(电子科技大学); Peking University(北京大学); Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at this https URL.
zh
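方法中用到的 DDIM 确定性反演(把样本沿确定性轨迹逐步映射回噪声)可用标准的单步更新公式示意如下(η=0;eps_model 接口为假设,示例用零噪声预测仅演示形状):

```python
import torch

@torch.no_grad()
def ddim_inversion_step(x_t, t, t_next, eps_model, alphas_cumprod):
    """x_t: 当前潜变量;alphas_cumprod: 预先计算的累计系数序列;
    eps_model(x, t) 假定返回噪声预测。返回更“噪”的 x_{t_next}(t_next > t)。"""
    a_t = alphas_cumprod[t]
    a_next = alphas_cumprod[t_next]
    eps = eps_model(x_t, t)
    # 先由当前样本和噪声预测恢复 x0 的估计
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # 再用同一个噪声预测沿确定性 DDIM 轨迹前进一步(反演方向)
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

# 形状演示:用零噪声预测代替真实网络
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
fake_eps = lambda x, t: torch.zeros_like(x)
x = torch.randn(1, 4, 64, 64)
print(ddim_inversion_step(x, t=10, t_next=20, eps_model=fake_eps,
                          alphas_cumprod=alphas_cumprod).shape)
```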
[CV-101] Deformable Dynamic Convolution for Accurate yet Efficient Spatio-Temporal Traffic Prediction
【速读】:该论文试图解决交通预测中面临的复杂城市区域的准确性与效率问题,特别是现有方法在捕捉不同区域和时间段的异质性交通模式方面存在不足,以及图神经网络(Graph Neural Networks, GNNs)在处理大规模数据时因依赖预定义邻接矩阵而受限的问题。解决方案的关键在于提出一种称为可变形动态卷积网络(Deformable Dynamic Convolution Network, DDCN)的结构,该结构通过基于偏移量动态应用可变形滤波器来克服传统卷积神经网络(Convolutional Neural Networks, CNNs)在建模非欧几里得空间结构和时空异质性方面的局限性,并通过编码器-解码器架构结合空间和时空注意力机制以强调关键特征,从而实现准确且高效的交通预测。
链接: https://arxiv.org/abs/2507.11550
作者: Hyeonseok Jin,Geonmin Kim,Kyungbaek Kim
机构: Chonnam National University(全南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages
点击查看摘要
Abstract:Spatio-temporal traffic prediction plays a key role in intelligent transportation systems by enabling accurate prediction in complex urban areas. Although both accuracy and efficiency for scalability are important, some previous methods struggle to capture heterogeneity such as varying traffic patterns across regions and time periods. Moreover, Graph Neural Networks (GNNs), which are the mainstream of traffic prediction, not only require a predefined adjacency matrix, but also limit scalability to large-scale data containing many nodes due to their inherent complexity. To overcome these limitations, we propose the Deformable Dynamic Convolution Network (DDCN) for accurate yet efficient traffic prediction. While traditional Convolutional Neural Networks (CNNs) are limited in modeling non-Euclidean spatial structures and spatio-temporal heterogeneity, DDCN overcomes these challenges by dynamically applying deformable filters based on offsets. Specifically, DDCN decomposes a transformer-style CNN into an encoder-decoder structure, and applies the proposed approaches to the spatial and spatio-temporal attention blocks of the encoder to emphasize important features. The decoder, composed of a feed-forward module, complements the output of the encoder. This novel structure enables DDCN to perform accurate yet efficient traffic prediction. In comprehensive experiments on four real-world datasets, DDCN achieves competitive performance, emphasizing the potential and effectiveness of CNN-based approaches for spatio-temporal traffic prediction.
zh
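可变形卷积“按偏移动态采样”的机制可以借助 torchvision 提供的 DeformConv2d 做一个示意(这只是通用的可变形卷积写法,并非论文中 DDCN 的具体实现):由输入特征预测每个卷积核位置的偏移,再按偏移进行非规则采样:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableDynamicBlock(nn.Module):
    """示意:由输入特征预测偏移,再用可变形卷积按偏移动态采样。"""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # 偏移预测器:每个卷积核位置需要 (dx, dy) 两个偏移量
        self.offset_pred = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(x)          # (B, 2*k*k, H, W)
        return self.deform_conv(x, offset)    # 根据偏移进行非规则采样的卷积

block = DeformableDynamicBlock(channels=32)
print(block(torch.randn(1, 32, 16, 16)).shape)   # torch.Size([1, 32, 16, 16])
```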
[CV-102] An Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search
【速读】:该论文试图解决可变形注意力变换器(Deformable Attention Transformers, DAT)在硬件部署中因数据依赖性采样机制导致的不规则内存访问模式问题,该问题显著影响了硬件的效率。解决方案的关键在于提出一种面向硬件优化的框架,其核心包括基于神经架构搜索(NAS)的方法与新的切片策略,以在推理过程中自动将输入特征划分为统一块,从而避免内存冲突且无需修改模型结构;同时通过联合优化硬件成本与推理精度来探索最佳切片配置。
链接: https://arxiv.org/abs/2507.11549
作者: Wendong Mao,Mingfan Zhao,Jianfeng Guan,Qiwei Dong,Zhongfeng Wang
机构: Sun Yat-Sen University (中山大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deformable Attention Transformers (DAT) have shown remarkable performance in computer vision tasks by adaptively focusing on informative image regions. However, their data-dependent sampling mechanism introduces irregular memory access patterns, posing significant challenges for efficient hardware deployment. Existing acceleration methods either incur high hardware overhead or compromise model accuracy. To address these issues, this paper proposes a hardware-friendly optimization framework for DAT. First, a neural architecture search (NAS)-based method with a new slicing strategy is proposed to automatically divide the input feature into uniform patches during the inference process, avoiding memory conflicts without modifying the model architecture. The method explores the optimal slice configuration by jointly optimizing hardware cost and inference accuracy. Second, an FPGA-based verification system is designed to test the performance of this framework on edge-side hardware. Algorithm experiments on the ImageNet-1K dataset demonstrate that our hardware-friendly framework incurs only a 0.2% accuracy drop compared to the baseline DAT. Hardware experiments on Xilinx FPGA show the proposed method reduces DRAM access times to 18% compared with existing DAT acceleration methods.
zh
[CV-103] Unit-Based Histopathology Tissue Segmentation via Multi-Level Feature Representation
【速读】:该论文旨在解决组织病理学图像分割中注释工作量大且计算效率低的问题,其解决方案的关键在于提出一种基于单元的组织分割框架UTS,该框架将固定大小的32×32图像块作为分割单元而非每个像素进行分类,从而在不牺牲准确性的前提下减少注释负担并提升计算效率。为实现这一目标,论文引入了多层级视觉Transformer(L-ViT),通过多层级特征表示捕捉细粒度形态和全局组织上下文信息。
链接: https://arxiv.org/abs/2507.12427
作者: Ashkan Shakarami,Azade Farshad,Yousef Yeganeh,Lorenzo Nicole,Peter Schuffler,Stefano Ghidoni,Nassir Navab
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 6 figures
点击查看摘要
Abstract:We propose UTS, a unit-based tissue segmentation framework for histopathology that classifies each fixed-size 32 × 32 tile, rather than each pixel, as the segmentation unit. This approach reduces annotation effort and improves computational efficiency without compromising accuracy. To implement this approach, we introduce a Multi-Level Vision Transformer (L-ViT), which leverages multi-level feature representations to capture both fine-grained morphology and global tissue context. Trained to segment breast tissue into three categories (infiltrating tumor, non-neoplastic stroma, and fat), UTS supports clinically relevant tasks such as tumor-stroma quantification and surgical margin assessment. Evaluated on 386,371 tiles from 459 H&E-stained regions, it outperforms U-Net variants and transformer-based baselines. Code and dataset will be available on GitHub.
zh
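“以 32×32 图像块为分割单元”的做法可用如下示意代码表达(假设性实现,分类器用随机线性层占位,实际应替换为 L-ViT 等模型):把整幅图划分为不重叠 tile,逐 tile 分类后再按块还原为粗粒度分割掩码:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unit_based_segmentation(image: torch.Tensor, tile_classifier: nn.Module,
                            tile: int = 32) -> torch.Tensor:
    """image: (B, C, H, W),H/W 假定可被 tile 整除;
    tile_classifier 输入 (N, C, tile, tile),输出 (N, num_classes)。"""
    b, c, h, w = image.shape
    # 切成不重叠的 tile:(B, C, Ht, Wt, t, t) -> (B*Ht*Wt, C, t, t)
    tiles = image.unfold(2, tile, tile).unfold(3, tile, tile)
    ht, wt = tiles.size(2), tiles.size(3)
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, tile, tile)

    logits = tile_classifier(tiles)                       # (B*Ht*Wt, K)
    pred = logits.argmax(dim=1).view(b, ht, wt).float()
    # 每个 tile 的预测类别扩展回像素级,得到粗粒度分割图
    return F.interpolate(pred[:, None], scale_factor=tile, mode="nearest").long()

# 用一个随机分类器演示形状(实际应替换为多层级特征模型)
clf = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3))
mask = unit_based_segmentation(torch.rand(1, 3, 256, 256), clf)
print(mask.shape)   # torch.Size([1, 1, 256, 256])
```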
[CV-104] Spontaneous Spatial Cognition Emerges during Egocentric Video Viewing through Non-invasive BCI
【速读】:该论文试图解决在被动体验条件下,大脑如何自发地构建自我中心的空间表征这一问题。其关键解决方案是利用基于脑电图(EEG)的非侵入性脑机接口(BCI),成功解码出个体在观看第一视角视频时的6D本体空间姿态(三维位置与方向),并揭示了视觉输入的时空特性与神经动态之间的关联。
链接: https://arxiv.org/abs/2507.12417
作者: Weichen Dai,Yuxuan Huang,Li Zhu,Dongjun Liu,Yu Zhang,Qibin Zhao,Andrzej Cichocki,Fabio Babiloni,Ke Li,Jianyu Qiu,Gangyong Jia,Wanzeng Kong,Qing Wu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Humans possess a remarkable capacity for spatial cognition, allowing for self-localization even in novel or unfamiliar environments. While hippocampal neurons encoding position and orientation are well documented, the large-scale neural dynamics supporting spatial representation, particularly during naturalistic, passive experience, remain poorly understood. Here, we demonstrate for the first time that non-invasive brain-computer interfaces (BCIs) based on electroencephalography (EEG) can decode spontaneous, fine-grained egocentric 6D pose, comprising three-dimensional position and orientation, during passive viewing of egocentric video. Despite EEG’s limited spatial resolution and high signal noise, we find that spatially coherent visual input (i.e., continuous and structured motion) reliably evokes decodable spatial representations, aligning with participants’ subjective sense of spatial engagement. Decoding performance further improves when visual input is presented at a frame rate of 100 ms per image, suggesting alignment with intrinsic neural temporal dynamics. Using gradient-based backpropagation through a neural decoding model, we identify distinct EEG channels contributing to position- and orientation-specific components, revealing a distributed yet complementary neural encoding scheme. These findings indicate that the brain’s spatial systems operate spontaneously and continuously, even under passive conditions, challenging traditional distinctions between active and passive spatial cognition. Our results offer a non-invasive window into the automatic construction of egocentric spatial maps and advance our understanding of how the human mind transforms everyday sensory experience into structured internal representations.
zh
[CV-105] DoRF: Doppler Radiance Fields for Robust Human Activity Recognition Using Wi-Fi
【速读】:该论文试图解决基于Wi-Fi信道状态信息(CSI)的人类活动识别(HAR)在实际部署中泛化能力不足的问题。其解决方案的关键在于从一维多普勒速度投影中重建一个信息丰富的三维潜在运动表示,并据此构建统一的多普勒辐射场(DoRF),从而提供更全面的活动视图并增强对环境变化的鲁棒性。
链接: https://arxiv.org/abs/2507.12132
作者: Navid Hasanzadeh,Shahrokh Valaee
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Wi-Fi Channel State Information (CSI) has gained increasing interest for remote sensing applications. Recent studies show that Doppler velocity projections extracted from CSI can enable human activity recognition (HAR) that is robust to environmental changes and generalizes to new users. However, despite these advances, generalizability still remains insufficient for practical deployment. Inspired by neural radiance fields (NeRF), which learn a volumetric representation of a 3D scene from 2D images, this work proposes a novel approach to reconstruct an informative 3D latent motion representation from one-dimensional Doppler velocity projections extracted from Wi-Fi CSI. The resulting latent representation is then used to construct a uniform Doppler radiance field (DoRF) of the motion, providing a comprehensive view of the performed activity and improving the robustness to environmental variability. The results show that the proposed approach noticeably enhances the generalization accuracy of Wi-Fi-based HAR, highlighting the strong potential of DoRFs for practical sensing applications.
zh
[CV-106] A Spatial-Physics Informed Model for 3D Spiral Sample Scanned by SQUID Microscopy
【速读】:该论文旨在解决先进封装中非破坏性检测(NDT)面临的挑战,特别是在多层结构深度和复杂性增加的情况下,传统方法在处理涡流效应和图像错位问题上的不足。其解决方案的关键在于提出一种空间-物理信息模型(SPIM),该模型通过整合相位一致(I-channel)和正交相位(Q-channel)图像以减轻涡流效应、校正扫描SQUID显微镜与导线段之间的错位导致的倾斜效应,并结合毕奥-萨伐尔定律与快速傅里叶变换(FFT)实现磁场到电流密度的转换。
链接: https://arxiv.org/abs/2507.11853
作者: J. Senthilnath,Jayasanker Jayabalan,Zhuoyi Lin,Aye Phyu Phyu Aung,Chen Hao,Kaixin Xu,Yeow Kheng Lim,F. C. Wellstood
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV)
备注: copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
点击查看摘要
Abstract:The development of advanced packaging is essential in the semiconductor manufacturing industry. However, non-destructive testing (NDT) of advanced packaging becomes increasingly challenging due to the depth and complexity of the layers involved. In such a scenario, Magnetic field imaging (MFI) enables the imaging of magnetic fields generated by currents. For MFI to be effective in NDT, the magnetic fields must be converted into current density. This conversion has typically relied solely on a Fast Fourier Transform (FFT) for magnetic field inversion; however, the existing approach does not consider eddy current effects or image misalignment in the test setup. In this paper, we present a spatial-physics informed model (SPIM) designed for a 3D spiral sample scanned using Superconducting QUantum Interference Device (SQUID) microscopy. The SPIM encompasses three key components: i) magnetic image enhancement by aligning all the “sharp” wire field signals to mitigate the eddy current effect using both in-phase (I-channel) and quadrature-phase (Q-channel) images; (ii) magnetic image alignment that addresses skew effects caused by any misalignment of the scanning SQUID microscope relative to the wire segments; and (iii) an inversion method for converting magnetic fields to magnetic currents by integrating the Biot-Savart Law with FFT. The results show that the SPIM improves I-channel sharpness by 0.3% and reduces Q-channel sharpness by 25%. Also, we were able to remove rotational and skew misalignments of 0.30 in a real image. Overall, SPIM highlights the potential of combining spatial analysis with physics-driven models in practical applications.
zh
[CV-107] Image-Based Multi-Survey Classification of Light Curves with a Pre-Trained Vision Transformer ICML
【速读】:该论文试图解决多巡天环境下光度分类的问题,旨在利用Zwicky Transient Facility (ZTF) 和 Asteroid Terrestrial-impact Last Alert System (ATLAS) 的光变曲线数据提升分类性能。解决方案的关键在于采用一种多巡天架构,该架构能够联合处理来自不同巡天的数据,从而有效建模巡天特有的特征以及跨巡天的交互关系,进而实现更优的分类效果。
链接: https://arxiv.org/abs/2507.11711
作者: Daniel Moreno-Cartagena,Guillermo Cabrera-Vives,Alejandra M. Muñoz Arancibia,Pavlos Protopapas,Francisco Förster,Márcio Catelan,A. Bayo,Pablo A. Estévez,P. Sánchez-Sáez,Franz E. Bauer,M. Pavez-Herrera,L. Hernández-García,Gonzalo Rojas
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2025 Workshop on Machine Learning for Astrophysics at the International Conference on Machine Learning (ICML)
点击查看摘要
Abstract:We explore the use of Swin Transformer V2, a pre-trained vision Transformer, for photometric classification in a multi-survey setting by leveraging light curves from the Zwicky Transient Facility (ZTF) and the Asteroid Terrestrial-impact Last Alert System (ATLAS). We evaluate different strategies for integrating data from these surveys and find that a multi-survey architecture which processes them jointly achieves the best performance. These results highlight the importance of modeling survey-specific characteristics and cross-survey interactions, and provide guidance for building scalable classifiers for future time-domain astronomy.
zh
[CV-108] Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration?
【速读】:该论文试图解决乳腺磁共振成像(Breast MRI)中由于患者解剖结构差异、体位引起的形变以及纤维腺体组织的薄而复杂结构所带来的图像配准难题。其解决方案的关键在于评估基于基础模型(foundation models)的配准算法在乳腺MRI中的性能,特别是利用预训练编码器如SAM、MedSAM、SSLSAM等,在不同年份、序列、模态和患者疾病状态下的配准任务中表现。研究发现,尽管这些模型在整体配准上优于传统方法,尤其在大规模领域迁移情况下,但在捕捉纤维腺体组织的细节方面存在不足。
链接: https://arxiv.org/abs/2507.11569
作者: Hanxue Gu,Yaqian Chen,Nicholas Konz,Qihang Li,Maciej A. Mazurowski
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 3 figures, 9 pages
点击查看摘要
Abstract:Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the presence of thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images in MedSAM and SSLSAM does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code at this https URL.
zh
[CV-109] Predicting Pulmonary Hypertension in Newborns: A Multi-view VAE Approach
【速读】:该论文旨在解决新生儿肺动脉高压(Pulmonary Hypertension, PH)诊断中依赖操作者经验的主观性问题,以及现有自动化检测方法在新生儿群体中表现不佳的问题。其解决方案的关键在于采用多视角变分自编码器(multi-view variational autoencoder, VAE)模型,通过融合多视角超声心动图视频数据,提升特征提取的鲁棒性和模型的泛化能力,从而实现更准确和客观的PH评估。
链接: https://arxiv.org/abs/2507.11561
作者: Lucas Erlacher,Samuel Ruipérez-Campillo,Holger Michel,Sven Wellmann,Thomas M. Sutter,Ece Ozkan,Julia E. Vogt
机构: ETH Zurich(苏黎世联邦理工学院); University Children’s Hospital Regensburg (KUNO)(雷根斯堡大学儿童医院(KUNO)); Hospital St. Hedwig of the Order of St. John(圣约翰会圣海德维希医院); University of Regensburg(雷根斯堡大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pulmonary hypertension (PH) in newborns is a critical condition characterized by elevated pressure in the pulmonary arteries, leading to right ventricular strain and heart failure. While right heart catheterization (RHC) is the diagnostic gold standard, echocardiography is preferred due to its non-invasive nature, safety, and accessibility. However, its accuracy highly depends on the operator, making PH assessment subjective. While automated detection methods have been explored, most models focus on adults and rely on single-view echocardiographic frames, limiting their performance in diagnosing PH in newborns. While multi-view echocardiography has shown promise in improving PH assessment, existing models struggle with generalizability. In this work, we employ a multi-view variational autoencoder (VAE) for PH prediction using echocardiographic videos. By leveraging the VAE framework, our model captures complex latent representations, improving feature extraction and robustness. We compare its performance against single-view and supervised learning approaches. Our results show improved generalization and classification accuracy, highlighting the effectiveness of multi-view learning for robust PH assessment in newborns.
zh
[CV-110] 3D Wavelet Latent Diffusion Model for Whole-Body MR-to-CT Modality Translation
【速读】:该论文旨在解决现有基于磁共振(MR)图像生成计算机断层扫描(CT)图像的合成方法在全身成像中存在空间对齐性差和图像质量不足的问题,这些问题限制了其在下游临床任务中的可靠性。论文提出的解决方案关键在于采用一种新型的3D小波潜在扩散模型(3D-WLDM),通过在学习的潜在空间中进行模态转换来提升性能。该模型通过引入小波残差模块增强细尺度特征的捕捉与重建,并通过解耦结构特征与模态特异性特征以保持解剖完整性,同时利用双跳跃连接注意力机制提升骨骼结构和软组织对比度的表示能力。
链接: https://arxiv.org/abs/2507.11557
作者: Jiaxu Zheng,Meiman He,Xuhui Tang,Xiong Wang,Tuoyu Cao,Tianyi Zeng,Lichi Zhang,Chenyu You
机构: Shanghai Jiao Tong University(上海交通大学); ShanghaiTech University(上海科技大学); Shanghai United Imaging Healthcare Co., Ltd.(上海联影医疗科技有限公司); Yale University(耶鲁大学); Stony Brook University(石溪大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Magnetic Resonance (MR) imaging plays an essential role in contemporary clinical diagnostics. It is increasingly integrated into advanced therapeutic workflows, such as hybrid Positron Emission Tomography/Magnetic Resonance (PET/MR) imaging and MR-only radiation therapy. These integrated approaches are critically dependent on accurate estimation of radiation attenuation, which is typically facilitated by synthesizing Computed Tomography (CT) images from MR scans to generate attenuation maps. However, existing MR-to-CT synthesis methods for whole-body imaging often suffer from poor spatial alignment between the generated CT and input MR images, and insufficient image quality for reliable use in downstream clinical tasks. In this paper, we present a novel 3D Wavelet Latent Diffusion Model (3D-WLDM) that addresses these limitations by performing modality translation in a learned latent space. By incorporating a Wavelet Residual Module into the encoder-decoder architecture, we enhance the capture and reconstruction of fine-scale features across image and latent spaces. To preserve anatomical integrity during the diffusion process, we disentangle structural and modality-specific characteristics and anchor the structural component to prevent warping. We also introduce a Dual Skip Connection Attention mechanism within the diffusion model, enabling the generation of high-resolution CT images with improved representation of bony structures and soft-tissue contrast.
zh
[CV-111] Landmark Detection for Medical Images using a General-purpose Segmentation Model ICONIP2025
【速读】:该论文试图解决在骨科骨盆X光图像中准确分割解剖学标志点及复杂轮廓的问题,因为现有的通用分割模型如SAM(Segment Anything Model)和其医学变体MedSAM缺乏对细粒度解剖标志点的识别能力。解决方案的关键在于结合YOLO(You Only Look Once)目标检测模型与SAM,利用YOLO生成的边界框作为提示输入,引导SAM进行更精确的分割,从而构建一个可靠的混合模型管道,实现对多种解剖结构的高效且精准分割。
链接: https://arxiv.org/abs/2507.11551
作者: Ekaterina Stansfield,Jennifer A. Mitterer,Abdulrahman Altahhan
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures, 2 tables. Submitted to ICONIP 2025
点击查看摘要
Abstract:Radiographic images are a cornerstone of medical diagnostics in orthopaedics, with anatomical landmark detection serving as a crucial intermediate step for information extraction. General-purpose foundational segmentation models, such as SAM (Segment Anything Model), do not support landmark segmentation out of the box and require prompts to function. However, in medical imaging, the prompts for landmarks are highly specific. Since SAM has not been trained to recognize such landmarks, it cannot generate accurate landmark segmentations for diagnostic purposes. Even MedSAM, a medically adapted variant of SAM, has been trained to identify larger anatomical structures, such as organs and their parts, and lacks the fine-grained precision required for orthopaedic pelvic landmarks. To address this limitation, we propose leveraging another general-purpose, non-foundational model: YOLO. YOLO excels in object detection and can provide bounding boxes that serve as input prompts for SAM. While YOLO is efficient at detection, it is significantly outperformed by SAM in segmenting complex structures. In combination, these two models form a reliable pipeline capable of segmenting not only a small pilot set of eight anatomical landmarks but also an expanded set of 72 landmarks and 16 regions with complex outlines, such as the femoral cortical bone and the pelvic inlet. By using YOLO-generated bounding boxes to guide SAM, we trained the hybrid model to accurately segment orthopaedic pelvic radiographs. Our results show that the proposed combination of YOLO and SAM yields excellent performance in detecting anatomical landmarks and intricate outlines in orthopaedic pelvic radiographs.
zh
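针对上文 [CV-111] "YOLO 边界框作为 SAM 提示"的流水线思路,下面给出一个简化的示意代码:先用 YOLO 检测得到边界框,再把边界框作为提示交给 SAM 分割。示例假设使用 ultralytics 与 segment-anything 的公开接口,其中权重文件路径(landmark_yolo.pt、sam_vit_b.pth)等均为假设,并非论文官方实现。

```python
# 示意:YOLO 边界框 -> SAM 提示分割(假设性示例,非论文官方代码)
import cv2
from ultralytics import YOLO                                    # 假设已安装 ultralytics
from segment_anything import sam_model_registry, SamPredictor   # 假设已安装 segment-anything

def detect_and_segment(image_path: str,
                       yolo_weights: str = "landmark_yolo.pt",  # 假设:自训练的标志点检测权重
                       sam_ckpt: str = "sam_vit_b.pth"):        # 假设:SAM 的 ViT-B 权重
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)

    # 1) YOLO 检测解剖标志点/区域,得到 (N, 4) 的 x1, y1, x2, y2 边界框
    detector = YOLO(yolo_weights)
    boxes = detector(image)[0].boxes.xyxy.cpu().numpy()

    # 2) 将每个边界框作为提示输入 SAM,得到对应的分割掩码
    sam = sam_model_registry["vit_b"](checkpoint=sam_ckpt)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    masks = []
    for box in boxes:
        mask, _, _ = predictor.predict(box=box[None, :], multimask_output=False)
        masks.append(mask[0])
    return boxes, masks
```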
人工智能
[AI-0] LLM-Based Config Synthesis requires Disambiguation
【速读】:该论文试图解决在使用大语言模型(Large Language Models, LLMs)进行程序合成时,用户意图的模糊性问题。特别是在网络配置场景中,路由策略和访问控制列表(ACLs)的结构经常在头部空间上重叠,导致LLM难以推断动作的相对优先级,从而引发歧义。论文提出的关键解决方案是设计一个名为Clarify的原型系统,该系统通过引入一个新的模块——消歧器(Disambiguator),增强LLM以获取用户意图。该方法在小规模合成工作负载上实现了经过消歧后的增量路由策略合成与验证,表明其在用户意图可被LLM正确合成但集成存在歧义的情况下具有普遍适用性。
链接: https://arxiv.org/abs/2507.12443
作者: Rajdeep Mondal,Nikolaj Bjorner,Todd Millstein,Alan Tang,George Varghese
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
备注:
点击查看摘要
Abstract:Beyond hallucinations, another problem in program synthesis using LLMs is ambiguity in user intent. We illustrate the ambiguity problem in a networking context for LLM-based incremental configuration synthesis of route-maps and ACLs. These structures frequently overlap in header space, making the relative priority of actions impossible for the LLM to infer without user interaction. Measurements in a large cloud identify complex ACLs with 100’s of overlaps, showing ambiguity is a real problem. We propose a prototype system, Clarify, which uses an LLM augmented with a new module called a Disambiguator that helps elicit user intent. On a small synthetic workload, Clarify incrementally synthesizes routing policies after disambiguation and then verifies them. Our treatment of ambiguities is useful more generally when the intent of updates can be correctly synthesized by LLMs, but their integration is ambiguous and can lead to different global behaviors.
zh
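摘要指出 route-map/ACL 条目在头部空间上的重叠是歧义的来源。下面用 Python 标准库 ipaddress 给出一个极简的玩具示意,检查两条(假设格式的)前缀规则是否重叠、动作是否冲突;它只用于说明"重叠导致相对优先级歧义"这一点,与 Clarify 系统本身的实现无关。

```python
# 示意:检测两条 ACL 规则的前缀是否重叠(玩具示例,规则格式为假设)
from dataclasses import dataclass
from ipaddress import ip_network

@dataclass
class AclRule:
    action: str   # "permit" 或 "deny"
    src: str      # 源前缀,如 "10.0.0.0/8"
    dst: str      # 目的前缀

def overlaps(a: AclRule, b: AclRule) -> bool:
    """若两条规则匹配的头部空间有交集,则它们的相对顺序会影响语义。"""
    return (ip_network(a.src).overlaps(ip_network(b.src))
            and ip_network(a.dst).overlaps(ip_network(b.dst)))

r1 = AclRule("permit", "10.0.0.0/8", "0.0.0.0/0")
r2 = AclRule("deny", "10.1.0.0/16", "0.0.0.0/0")
if overlaps(r1, r2) and r1.action != r2.action:
    print("规则重叠且动作冲突:需要向用户澄清相对优先级")
```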
[AI-1] Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length
【速读】:该论文试图解决在本地设备上处理连续、长上下文输入的机器智能效率问题,传统Transformer架构由于二次复杂度和内存需求导致其在这些任务中效率低下且难以使用。解决方案的关键在于探索State Space Models (SSMs)及其混合模型,这些模型能够实现近线性扩展,并通过在消费级和嵌入式GPU上的全面比较基准测试,验证了SSMs在长上下文推理中的优越性能,特别是在处理长达220K tokens的序列时表现出显著的优势。
链接: https://arxiv.org/abs/2507.12442
作者: Saptarshi Mitra,Rachid Karami,Haocheng Xu,Sitao Huang,Hyoukjun Kwon
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 12 pages, 7 figures
点击查看摘要
Abstract:The demand for machine intelligence capable of processing continuous, long-context inputs on local devices is growing rapidly. However, the quadratic complexity and memory requirements of traditional Transformer architectures make them inefficient and often unusable for these tasks. This has spurred a paradigm shift towards new architectures like State Space Models (SSMs) and hybrids, which promise near-linear scaling. While most current research focuses on the accuracy and theoretical throughput of these models, a systematic performance characterization on practical consumer hardware is critically needed to guide system-level optimization and unlock new applications. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformer, SSM, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis reveals that SSMs are not only viable but superior for this domain, capable of processing sequences up to 220K tokens on a 24GB consumer GPU, approximately 4x longer than comparable Transformers. While Transformers may be up to 1.8x faster at short sequences, SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens). Our operator-level analysis reveals that custom, hardware-aware SSM kernels dominate the inference runtime, accounting for over 55% of latency on edge platforms, identifying them as a primary target for future hardware acceleration. We also provide detailed, device-specific characterization results to guide system co-design for the edge. To foster further research, we will open-source our characterization framework.
zh
[AI-2] Mixture of Raytraced Experts
【速读】:该论文试图解决传统Mixture of Experts (MoE)架构在计算资源分配上的局限性,即通常对每个样本需要固定的计算量,无法根据任务复杂度动态调整计算深度和宽度。其解决方案的关键在于引入了一种基于光线追踪的专家混合(Mixture of Raytraced Experts)架构,该架构能够动态选择专家序列,生成可变宽度和深度的计算图,从而在计算周期遍历专家序列的过程中逐步提高预测精度。
链接: https://arxiv.org/abs/2507.12419
作者: Andrea Perin,Giacomo Lagomarsini,Claudio Gallicchio,Giuseppe Nuti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preliminary version (pre-submission)
点击查看摘要
Abstract:We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture which can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts’ sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with a comparable/higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at this https URL
zh
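为帮助理解摘要中"随着计算在专家序列中循环、预测精度逐步提升"的思路,下面给出一个非常简化的 PyTorch 玩具示意:路由器按当前状态选取下一个专家,残差式累积其输出,并在每一步都给出一次预测。路由与训练细节(候选专家采样、类似 RNN 的展开训练等)均为假设,并非论文的实际架构。

```python
# 玩具示意:动态选取专家序列并逐步细化预测(假设性实现)
import torch
import torch.nn as nn

class TinyRaytracedMoE(nn.Module):
    def __init__(self, dim=32, num_experts=8, num_classes=10):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)   # 根据当前隐状态给候选专家打分
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, max_steps=4):
        h = x
        logits_per_step = []
        for _ in range(max_steps):
            # 依据当前状态选择下一个专家(训练时可改为随机采样,此处取 argmax 示意)
            idx = self.router(h).argmax(dim=-1)
            out = torch.stack([self.experts[int(i)](h[b]) for b, i in enumerate(idx)])
            h = h + torch.relu(out)                  # 残差式累积计算
            logits_per_step.append(self.head(h))     # 每一步都可给出(逐步变好的)预测
        return logits_per_step

x = torch.randn(4, 32)
preds = TinyRaytracedMoE()(x, max_steps=3)
print([p.shape for p in preds])   # 每一步一个 (4, 10) 的预测
```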
[AI-3] NOCTA: Non-Greedy Objective Cost-Tradeoff Acquisition for Longitudinal Data
【速读】:该论文旨在解决在资源受限环境下,如何高效获取最具信息量的特征以进行预测的问题,特别是在医疗等需要考虑时间动态性和采集成本的应用中。解决方案的关键在于提出NOCTA(Non-Greedy Objective Cost-Tradeoff Acquisition)方法,该方法在推理阶段依次获取最具有信息量的特征,同时兼顾时间动态性和采集成本。NOCTA通过引入一致的估计目标,并开发了两种互补的估计器:基于最近邻的非参数方法(NOCTA-NP)和直接预测潜在采集效用的参数方法(NOCTA-P)。
链接: https://arxiv.org/abs/2507.12412
作者: Dzung Dinh,Boqi Chen,Marc Niethammer,Junier Oliva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In many critical applications, resource constraints limit the amount of information that can be gathered to make predictions. For example, in healthcare, patient data often spans diverse features ranging from lab tests to imaging studies. Each feature may carry different information and must be acquired at a respective cost of time, money, or risk to the patient. Moreover, temporal prediction tasks, where both instance features and labels evolve over time, introduce additional complexity in deciding when or what information is important. In this work, we propose NOCTA, a Non-Greedy Objective Cost-Tradeoff Acquisition method that sequentially acquires the most informative features at inference time while accounting for both temporal dynamics and acquisition cost. We first introduce a cohesive estimation target for our NOCTA setting, and then develop two complementary estimators: 1) a non-parametric method based on nearest neighbors to guide the acquisition (NOCTA-NP), and 2) a parametric method that directly predicts the utility of potential acquisitions (NOCTA-P). Experiments on synthetic and real-world medical datasets demonstrate that both NOCTA variants outperform existing baselines.
zh
[AI-4] GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
【速读】:该论文试图解决软件库快速演进对代码生成带来的挑战,即在频繁的版本更新中保持代码的兼容性与功能准确性。解决方案的关键在于引入GitChameleon数据集,该数据集包含328个基于特定库版本的Python代码补全问题,并附有可执行的单元测试,从而能够通过执行评估现代大型语言模型(LLMs)、LLM驱动的代理、代码助手和RAG系统在版本条件下的代码生成能力。
链接: https://arxiv.org/abs/2507.12367
作者: Diganta Misra,Nizar Islah,Victor May,Brice Rauby,Zihan Wang,Justine Gehring,Antonio Orvieto,Muawiz Chaudhary,Eilif B. Muller,Irina Rish,Samira Ebrahimi Kahou,Massimo Caccia
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Version 2 of the dataset from: arXiv:2411.05830
点击查看摘要
Abstract:The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at this https URL.
zh
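下面是一个示意性的小脚本,展示"基于执行的版本条件评测"大致如何运作:先核对环境中目标库的主版本,再执行模型补全的代码并运行附带的单元测试。函数名、字段与样例题目均为假设,并非 GitChameleon 的官方评测代码。

```python
# 示意:版本条件下的基于执行的代码评测(字段与流程均为假设,非官方评测代码)
import importlib.metadata
import traceback

def run_case(generated_code: str, test_code: str, library: str, required_major: str) -> bool:
    """若库的主版本匹配且生成代码通过单元测试则返回 True。"""
    installed = importlib.metadata.version(library)
    if installed.split(".")[0] != required_major:
        print(f"环境不符:{library}=={installed},要求主版本 {required_major}")
        return False
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # 执行模型补全出来的代码
        exec(test_code, namespace)        # 运行附带的单元测试(失败会抛 AssertionError)
        return True
    except Exception:
        traceback.print_exc()
        return False

# 用法示意:演示时直接取当前环境 numpy 的主版本;数据集中该字段应由题目给定
major = importlib.metadata.version("numpy").split(".")[0]
ok = run_case(
    generated_code="import numpy as np\ndef make_eye(n):\n    return np.eye(n)\n",
    test_code="assert make_eye(2).shape == (2, 2)\n",
    library="numpy",
    required_major=major,
)
print("pass" if ok else "fail")
```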
[AI-5] Neural Polar Decoders for Deletion Channels
【速读】:该论文旨在解决删除信道下极化码(polar codes)传统极化解码器计算复杂度高的问题,其现有解码器的复杂度为O(N^4),限制了极化码在短至中等块长下的应用。论文提出的解决方案是采用神经网络极化解码器(Neural Polar Decoder, NPD),其关键在于将NPD架构扩展以支持删除信道,仅需调整其中一个神经网络结构即可实现,从而将计算复杂度降低至O(AN log N),其中A为用户设定的计算预算,与信道无关。这一改进使得在降低复杂度的同时能够引入列表解码(list decoding)以进一步提升性能。
链接: https://arxiv.org/abs/2507.12329
作者: Ziv Aharoni,Henry D. Pfister
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces a neural polar decoder (NPD) for deletion channels with a constant deletion rate. Existing polar decoders for deletion channels exhibit high computational complexity of O(N^4), where N is the block length. This limits the application of polar codes for deletion channels to short-to-moderate block lengths. In this work, we demonstrate that employing NPDs for deletion channels can reduce the computational complexity. First, we extend the architecture of the NPD to support deletion channels. Specifically, the NPD architecture consists of four neural networks (NNs), each replicating fundamental successive cancellation (SC) decoder operations. To support deletion channels, we change the architecture of only one. The computational complexity of the NPD is O(AN\log N), where the parameter A represents a computational budget determined by the user and is independent of the channel. We evaluate the new extended NPD for deletion channels with deletion rates \delta \in \{0.01, 0.1\} and we verify the NPD with the ground truth given by the trellis decoder by Tal et al. We further show that due to the reduced complexity of the NPD, we are able to incorporate list decoding and further improve performance. We believe that the extended NPD presented here could have applications in future technologies like DNA storage.
zh
[AI-6] Thought Purity: Defense Paradigm For Chain-of-Thought Attack
【速读】:该论文旨在解决强化学习训练的大型推理模型(Large Reasoning Models, LRMs)在链式思维(Chain-of-Thought, CoT)生成过程中易受安全威胁的问题,尤其是针对通过提示可控性进行攻击的链式思维攻击(Chain-of-Thought Attack, CoTA)。解决方案的关键在于提出一种名为“思维纯度”(Thought Purity, TP)的防御范式,其通过三个协同组件实现:安全优化的数据处理流水线、增强的规则约束以及自适应监控指标,从而在保持模型性能的同时系统性地增强对恶意内容的抵抗力。
链接: https://arxiv.org/abs/2507.12314
作者: Zihao Xue,Zhen Bi,Long Ma,Zhenlin Hu,Yan Wang,Zhenfang Liu,Qing Sheng,Jie Xiao,Jungang Lou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Models (LLMs) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model’s core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability through exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense paradigm that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline (2) reinforcement learning-enhanced rule constraints (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.
zh
[AI-7] A Framework for Nonstationary Gaussian Processes with Neural Network Parameters
【速读】:该论文试图解决传统高斯过程(Gaussian Process)在非参数回归中因使用平稳核(stationary kernels)而导致的模型表达能力受限的问题,这可能使其不适用于许多数据集。解决方案的关键在于引入非平稳核(nonstationary kernels),其参数在特征空间中变化,并将这些参数建模为以特征作为输入的神经网络的输出。通过联合训练神经网络与高斯过程,并利用链式法则计算导数,该方法能够有效描述非平稳参数的行为,并兼容大规模数据集的近似方法。
链接: https://arxiv.org/abs/2507.12262
作者: Zachary James,Joseph Guinness
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Gaussian processes have become a popular tool for nonparametric regression because of their flexibility and uncertainty quantification. However, they often use stationary kernels, which limit the expressiveness of the model and may be unsuitable for many datasets. We propose a framework that uses nonstationary kernels whose parameters vary across the feature space, modeling these parameters as the output of a neural network that takes the features as input. The neural network and Gaussian process are trained jointly using the chain rule to calculate derivatives. Our method clearly describes the behavior of the nonstationary parameters and is compatible with approximation methods for scaling to large datasets. It is flexible and easily adapts to different nonstationary kernels without needing to redesign the optimization procedure. Our methods are implemented with the GPyTorch library and can be readily modified. We test a nonstationary variance and noise variant of our method on several machine learning datasets and find that it achieves better accuracy and log-score than both a stationary model and a hierarchical model approximated with variational inference. Similar results are observed for a model with only nonstationary variance. We also demonstrate our approach’s ability to recover the nonstationary parameters of a spatial dataset.
zh
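为说明"核参数随特征变化、由神经网络输出"的思路,下面用 PyTorch 手写一个简化的非平稳 RBF 核(采用常见的 Gibbs 核形式,长度尺度与标准差由一个小网络按点给出)。这只是概念草图:具体的核形式、网络结构与超参均为假设,并非论文基于 GPyTorch 的实现。

```python
# 示意:参数由神经网络给出的非平稳(Gibbs 型)RBF 核(简化草图)
import torch
import torch.nn as nn

class ParamNet(nn.Module):
    """把输入特征映射为各点的 (长度尺度, 标准差),softplus 保证为正。"""
    def __init__(self, in_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2))

    def forward(self, x):
        out = nn.functional.softplus(self.net(x)) + 1e-4
        return out[:, 0], out[:, 1]               # lengthscale(x), sigma(x)

def nonstationary_rbf(x1, x2, param_net):
    l1, s1 = param_net(x1)                         # (n,), (n,)
    l2, s2 = param_net(x2)                         # (m,), (m,)
    sq = torch.cdist(x1, x2).pow(2)                # (n, m) 欧氏距离平方
    denom = l1[:, None] ** 2 + l2[None, :] ** 2
    pre = torch.sqrt(2 * l1[:, None] * l2[None, :] / denom)
    return s1[:, None] * s2[None, :] * pre * torch.exp(-sq / denom)

x = torch.randn(50, 3)
K = nonstationary_rbf(x, x, ParamNet(3)) + 1e-3 * torch.eye(50)   # 加抖动保证正定
print(K.shape, torch.linalg.cholesky(K).shape)   # 核矩阵可进一步用于 GP 边际似然的联合训练
```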
[AI-8] Looking for Fairness in Recommender Systems
【速读】:该论文试图解决推荐系统中因过度个性化而导致的过滤气泡(filter bubbles)问题,这一问题限制了用户接触多样化内容的可能性,影响了内容创作者的曝光机会,并对社会整体的意见形成和行为产生深远影响。解决方案的关键在于定义一个或多个能够代表多样性的性能指标,并通过公平性视角对推荐系统进行优化,从而在个性化推荐与促进内容多样性之间实现平衡。
链接: https://arxiv.org/abs/2507.12242
作者: Cécile Logé
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recommender systems can be found everywhere today, shaping our everyday experience whenever we’re consuming content, ordering food, buying groceries online, or even just reading the news. Let’s imagine we’re in the process of building a recommender system to make content suggestions to users on social media. When thinking about fairness, it becomes clear there are several perspectives to consider: the users asking for tailored suggestions, the content creators hoping for some limelight, and society at large, navigating the repercussions of algorithmic recommendations. A shared fairness concern across all three is the emergence of filter bubbles, a side-effect that takes place when recommender systems are almost “too good”, making recommendations so tailored that users become inadvertently confined to a narrow set of opinions/themes and isolated from alternative ideas. From the user’s perspective, this is akin to manipulation. From the small content creator’s perspective, this is an obstacle preventing them access to a whole range of potential fans. From society’s perspective, the potential consequences are far-reaching, influencing collective opinions, social behavior and political decisions. How can our recommender system be fine-tuned to avoid the creation of filter bubbles, and ensure a more inclusive and diverse content landscape? Approaching this problem involves defining one (or more) performance metric to represent diversity, and tweaking our recommender system’s performance through the lens of fairness. By incorporating this metric into our evaluation framework, we aim to strike a balance between personalized recommendations and the broader societal goal of fostering rich and varied cultures and points of view.
zh
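与摘要中"定义一个代表多样性的指标并据此调整推荐"的思路对应,下面给出常用的列表内多样性(intra-list diversity,即推荐物品两两相异度的平均值)的简单计算示意;物品向量与数据均为随机生成的假设样例,指标的具体选择需结合业务场景。

```python
# 示意:用列表内多样性(1 - 平均两两余弦相似度)度量一次推荐的多样性
import numpy as np

def intra_list_diversity(item_vectors: np.ndarray) -> float:
    """item_vectors: (k, d),一次推荐中 k 个物品的特征向量。"""
    v = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    sim = v @ v.T                                    # 两两余弦相似度
    k = len(v)
    off_diag = sim[~np.eye(k, dtype=bool)]
    return float(1.0 - off_diag.mean())

rng = np.random.default_rng(0)
narrow = rng.normal(size=(10, 16)) * 0.05 + rng.normal(size=16)   # 彼此很相似的推荐(疑似过滤气泡)
broad = rng.normal(size=(10, 16))                                 # 更多样的推荐
print(intra_list_diversity(narrow), intra_list_diversity(broad))  # 后者应明显更高
```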
[AI-9] Xiangqi-R1: Enhancing Spatial Strategic Reasoning in LLMs for Chinese Chess via Reinforcement Learning
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在空间战略推理方面的不足,特别是在复杂且完全可观测的棋类游戏中的表现问题。其关键解决方案是提出一种针对中国象棋(Xiangqi)的训练框架,包括一个包含五百万个棋盘-移动对的数据集,并通过专家注释和引擎评估进行增强。在此基础上,构建了Xiangqi-R1模型,采用多阶段训练策略:首先微调以预测合法移动,其次引入战略注释提升决策能力,最后通过多维奖励信号的Group Relative Policy Optimization (GRPO) 进行强化学习,从而提高推理稳定性。
链接: https://arxiv.org/abs/2507.12215
作者: Yuhao Chen,Shuochen Liu,Yuanjie Lyu,Chao Zhang,Jiayao Shi,Tong Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures
点击查看摘要
Abstract:Game playing has long served as a fundamental benchmark for evaluating Artificial General Intelligence (AGI). While Large Language Models (LLMs) have demonstrated impressive capabilities in general reasoning, their effectiveness in spatial strategic reasoning, which is critical for complex and fully observable board games, remains insufficiently explored. In this work, we adopt Chinese Chess (Xiangqi) as a challenging and rich testbed due to its intricate rules and spatial complexity. To advance LLMs’ strategic competence in such environments, we propose a training framework tailored to Xiangqi, built upon a large-scale dataset of five million board-move pairs enhanced with expert annotations and engine evaluations. Building on this foundation, we introduce Xiangqi-R1, a 7B-parameter model trained in multi-stage manner: (1) fine-tuning for legal move prediction to capture basic spatial rules, (2) incorporating strategic annotations to improve decision-making, and (3) applying reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional reward signals to enhance reasoning stability. Our Experimental results indicate that, despite their size and power, general-purpose LLMs struggle to achieve satisfactory performance in these tasks. Compared to general-purpose LLMs, Xiangqi-R1 greatly advances with an 18% rise in move legality and a 22% boost in analysis accuracy. Our results point to a promising path for creating general strategic intelligence in spatially complex areas.
zh
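摘要第三阶段使用 GRPO(Group Relative Policy Optimization)。下面给出其核心的"组内相对优势"计算的最小示意:对同一局面采样一组走法,按组内奖励的均值和标准差做标准化。这是对该公式的玩具演示;多维奖励的具体权重组合方式为假设,并非论文给定的配置。

```python
# 示意:GRPO 中组内相对优势的计算(玩具演示)
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size),同一提示/局面下采样得到的一组奖励。"""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 假设:奖励由走法合法性与分析准确性两个维度等权组合而成
legality = torch.tensor([[1.0, 1.0, 0.0, 1.0]])
analysis = torch.tensor([[0.8, 0.2, 0.0, 0.5]])
rewards = 0.5 * legality + 0.5 * analysis
print(group_relative_advantage(rewards))   # 组内高于平均的样本获得正优势
```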
[AI-10] Draw an Ugly Person An Exploration of Generative AIs Perceptions of Ugliness
【速读】:该论文试图解决生成式AI在表达如“丑陋”等概念时所蕴含的文化偏见问题,特别是这些模型如何理解和呈现社会中的负面特征。其解决方案的关键在于通过分析不同生成式AI模型在文本和图像生成中对“丑陋”的表现,揭示其中存在的社会偏见,并探讨这些偏见的来源与表现形式。研究通过提取与“丑陋”相关的形容词并生成大量图像,结合对图像中人口统计学和社会经济属性的编码与主题分析,揭示了生成式AI在表征过程中对老年白人男性群体的不成比例关联,以及在避免刻板印象的同时反而将负面特征投射到主流群体的现象。
链接: https://arxiv.org/abs/2507.12212
作者: Garyoung Kim,Huisung Kwon,Seoju Yun,Yu-Won Youn
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures
点击查看摘要
Abstract:Generative AI does not only replicate human creativity but also reproduces deep-seated cultural biases, making it crucial to critically examine how concepts like ugliness are understood and expressed by these tools. This study investigates how four different generative AI models understand and express ugliness through text and image and explores the biases embedded within these representations. We extracted 13 adjectives associated with ugliness through iterative prompting of a large language model and generated 624 images across four AI models and three prompts. Demographic and socioeconomic attributes within the images were independently coded and thematically analyzed. Our findings show that AI models disproportionately associate ugliness with old white male figures, reflecting entrenched social biases as well as paradoxical biases, where efforts to avoid stereotypical depictions of marginalized groups inadvertently result in the disproportionate projection of negative attributes onto majority groups. Qualitative analysis further reveals that, despite supposed attempts to frame ugliness within social contexts, conventional physical markers such as asymmetry and aging persist as central visual motifs. These findings demonstrate that despite attempts to create more equal representations, generative AI continues to perpetuate inherited and paradoxical biases, underscoring the critical work being done to create ethical AI training paradigms and advance methodologies for more inclusive AI development.
zh
[AI-11] BuildEvo: Designing Building Energy Consumption Forecasting Heuristics via LLM-driven Evolution ICML2025
【速读】:该论文试图解决建筑能耗预测中传统启发式方法精度不足,而先进模型又存在黑箱性且难以泛化的问题。解决方案的关键在于提出BuildEvo框架,该框架利用大型语言模型(Large Language Models, LLMs)自动设计有效且可解释的能耗预测启发式规则,并通过进化过程系统地整合建筑特性与运行数据中的物理洞察,从而提升模型的泛化能力和预测透明度。
链接: https://arxiv.org/abs/2507.12207
作者: Subin Lin,Chuanbo Hua
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: ICML 2025 CO-Build Workshop Poster
点击查看摘要
Abstract:Accurate building energy forecasting is essential, yet traditional heuristics often lack precision, while advanced models can be opaque and struggle with generalization by neglecting physical principles. This paper introduces BuildEvo, a novel framework that uses Large Language Models (LLMs) to automatically design effective and interpretable energy prediction heuristics. Within an evolutionary process, BuildEvo guides LLMs to construct and enhance heuristics by systematically incorporating physical insights from building characteristics and operational data (e.g., from the Building Data Genome Project 2). Evaluations show BuildEvo achieves state-of-the-art performance on benchmarks, offering improved generalization and transparent prediction logic. This work advances the automated design of robust, physically grounded heuristics, promoting trustworthy models for complex energy systems.
zh
[AI-12] Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control
【速读】:该论文试图解决序列推荐中基于Transformer架构的黑箱模型缺乏可解释性的问题,这一问题限制了对模型内部机制的理解、影响和控制,从而制约了其在实际应用中的有效性。解决方案的关键在于引入稀疏自编码器(Sparse Autoencoders, SAE),通过学习从激活空间方向的稀疏线性组合中重建Transformer内部层的隐藏状态,从而提取出更具可解释性的特征。实验表明,SAE学习到的方向比原始隐藏状态维度更具语义单一性和可解释性,并且能够有效且灵活地控制模型行为,为用户提供调整推荐以适应不同场景的方法。
链接: https://arxiv.org/abs/2507.12202
作者: Anton Klenitskiy,Konstantin Polev,Daria Denisova,Alexey Vasilev,Dmitry Simakov,Gleb Gusev
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Many current state-of-the-art models for sequential recommendations are based on transformer architectures. Interpretation and explanation of such black box models is an important research question, as a better understanding of their internals can help understand, influence, and control their behavior, which is very important in a variety of real-world applications. Recently sparse autoencoders (SAE) have been shown to be a promising unsupervised approach for extracting interpretable features from language models. These autoencoders learn to reconstruct hidden states of the transformer’s internal layers from sparse linear combinations of directions in their activation space. This paper is focused on the application of SAE to the sequential recommendation domain. We show that this approach can be successfully applied to the transformer trained on a sequential recommendation task: learned directions turn out to be more interpretable and monosemantic than the original hidden state dimensions. Moreover, we demonstrate that the features learned by SAE can be used to effectively and flexibly control the model’s behavior, providing end-users with a straightforward method to adjust their recommendations to different custom scenarios and contexts.
zh
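下面用 PyTorch 给出一个作用于 Transformer 隐状态的稀疏自编码器(SAE)的最小示意:以重构误差加 L1 稀疏惩罚为目标学习可解释方向。隐状态此处用随机张量代替,结构与超参为常见做法的假设,并非论文的具体配置。

```python
# 示意:对隐状态训练稀疏自编码器(最小草图)
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=256, n_features=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))      # 稀疏特征激活
        return self.decoder(z), z

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coef = 1e-3

hidden_states = torch.randn(4096, 256)       # 假设:从推荐 Transformer 某层收集到的隐状态
for step in range(100):
    batch = hidden_states[torch.randint(0, len(hidden_states), (256,))]
    recon, z = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coef * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# 训练后,z 的每一维方向可对应一种行为模式;放大或屏蔽某些方向即可按场景调控推荐
```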
[AI-13] Quantize More Lose Less: Autoregressive Generation from Residually Quantized Speech Representations
【速读】:该论文试图解决传统文本到语音(TTS)合成方法在单码本表示下导致的信息丢失问题,特别是在复杂场景如歌唱语音或音乐合成中难以恢复细粒度细节(如语调细微变化、说话人特有的音色)的问题。解决方案的关键在于提出一种新的音频编解码器QDAC,其核心创新是通过基于自动语音识别(ASR)的自回归网络与生成对抗网络(GAN)的端到端训练,实现更优的语义特征解耦,从而支持可扩展的近无损压缩。在此基础上,QTTS框架采用两种创新策略:分层并行结构和延迟多头方法,以提升合成质量并加速推理速度。
链接: https://arxiv.org/abs/2507.12197
作者: Yichen Han,Xiaoyang Hao,Keming Chen,Weibo Xiong,Jun He,Ruonan Zhang,Junjie Cao,Yue Liu,Bowen Li,Dongrui Zhang,Hui Xia,Huilei Fu,Kai Jia,Kaixuan Guo,Mingli Jin,Qingyun Meng,Ruidong Ma,Ruiqian Fang,Shaotong Guo,Xuhui Li,Yang Xiang,Ying Zhang,Yulong Liu,Yunfeng Li,Yuyi Zhang,Yuze Zhou,Zhen Wang,Zhaowen Chen
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with post-hoc refinement techniques such as flow matching, these methods fail to recover fine-grained details (e.g., prosodic nuances, speaker-specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end-to-end training of an ASR-based auto-regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near-lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual-AR structure to model inter-codebook dependencies for higher-quality synthesis, and the Delay Multihead approach, which employs parallelized prediction with a fixed delay to accelerate inference speed. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baseline. This suggests that scaling up compression via multi-codebook modeling is a promising direction for high-fidelity, general-purpose speech and audio generation.
zh
[AI-14] Selective Quantization Tuning for ONNX Models
【速读】:该论文试图解决深度神经网络模型在量化过程中导致的性能下降和部署挑战问题,尤其是在低端硬件加速器上的应用。其关键解决方案是提出TuneQn,一个支持ONNX模型选择性量化、部署与执行的工具套件,结合性能分析和多目标优化,以在模型精度与大小之间实现最佳平衡。
链接: https://arxiv.org/abs/2507.12196
作者: Nikolaos Louloudakis,Ajitha Rajan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 5 pages, 3 figures, 2 tables
点击查看摘要
Abstract:Quantization is a process that reduces the precision of deep neural network models to lower model size and computational demands, often at the cost of accuracy. However, fully quantized models may exhibit sub-optimal performance below acceptable levels and face deployment challenges on low-end hardware accelerators due to practical constraints. To address these issues, quantization can be selectively applied to only a subset of layers, but selecting which layers to exclude is non-trivial. To this direction, we propose TuneQn, a suite enabling selective quantization, deployment and execution of ONNX models across various CPU and GPU devices, combined with profiling and multi-objective optimization. TuneQn generates selectively quantized ONNX models, deploys them on different hardware, measures performance on metrics like accuracy and size, performs Pareto Front minimization to identify the best model candidate and visualizes the results. To demonstrate the effectiveness of TuneQn, we evaluated TuneQn on four ONNX models with two quantization settings across CPU and GPU devices. As a result, we demonstrated that our utility effectively performs selective quantization and tuning, selecting ONNX model candidates with up to a 54.14 % reduction in accuracy loss compared to the fully quantized model, and up to a 72.9 % model size reduction compared to the original model.
zh
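下面给出一个"选择性量化"思路的示意:枚举 ONNX 模型中的节点,把希望保持精度的节点排除在量化之外。示例假设使用 onnx 与 onnxruntime.quantization 的公开接口(nodes_to_exclude 等参数名以安装的 onnxruntime 版本为准),模型文件与排除策略均为假设,并非 TuneQn 的源码。

```python
# 示意:对 ONNX 模型做选择性动态量化,排除部分节点(接口细节以实际 onnxruntime 版本为准)
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_path = "model.onnx"            # 假设的模型文件
model = onnx.load(model_path)

# 例如:跳过最后两个 MatMul 节点以控制精度损失(选择策略此处仅为示意)
all_nodes = [n.name for n in model.graph.node]
exclude = [n.name for n in model.graph.node if n.op_type == "MatMul"][-2:]
print(f"共 {len(all_nodes)} 个节点,排除 {exclude}")

quantize_dynamic(
    model_input=model_path,
    model_output="model_selective_int8.onnx",
    weight_type=QuantType.QInt8,
    nodes_to_exclude=exclude,        # 被排除的节点保持原始精度
)
```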
[AI-15] Partially Observable Reference Policy Programming: Solving POMDPs Sans Numerical Optimisation
【速读】:该论文试图解决部分可观测马尔可夫决策过程(POMDP)中的在线近似求解问题,特别是在动态变化环境中进行高效决策的挑战。其解决方案的关键在于提出一种名为“部分可观测参考策略编程”(Partially Observable Reference Policy Programming)的算法,该算法通过深度采样有意义的未来历史轨迹,并同时促使策略的渐进更新,从而实现更优的性能。该方法在理论上保证了性能损失仅由采样近似误差的平均值而非最大值所限制,这在在线规划中采样稀疏的情况下尤为重要。
链接: https://arxiv.org/abs/2507.12186
作者: Edward Kim,Hanna Kurniawati
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 tables, 3 figures. To be presented at International Joint Conference on Artificial Intelligence 2025
点击查看摘要
Abstract:This paper proposes Partially Observable Reference Policy Programming, a novel anytime online approximate POMDP solver which samples meaningful future histories very deeply while simultaneously forcing a gradual policy update. We provide theoretical guarantees for the algorithm’s underlying scheme which say that the performance loss is bounded by the average of the sampling approximation errors rather than the usual maximum, a crucial requirement given the sampling sparsity of online planning. Empirical evaluations on two large-scale problems with dynamically evolving environments – including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps – corroborate the theoretical results and indicate that our solver considerably outperforms current online benchmarks.
zh
[AI-16] Topology Enhanced MARL for Multi-Vehicle Cooperative Decision-Making of CAVs
【速读】:该论文旨在解决多智能体强化学习(MARL)中因联合状态-动作空间指数级增长而加剧的探索与利用权衡问题。其解决方案的关键在于提出一种拓扑增强的MARL方法(TPE-MARL),通过构建动态交通流的游戏拓扑张量来有效压缩高维交通状态信息,从而减少MARL算法的搜索空间,并在此基础上结合访问次数和智能体互信息构建增强框架,实现更高效的协同决策。
链接: https://arxiv.org/abs/2507.12110
作者: Ye Han,Lijun Zhang,Dejian Meng,Zhuang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 16 figures
点击查看摘要
Abstract:The exploration-exploitation trade-off constitutes one of the fundamental challenges in reinforcement learning (RL), which is exacerbated in multi-agent reinforcement learning (MARL) due to the exponential growth of joint state-action spaces. This paper proposes a topology-enhanced MARL (TPE-MARL) method for optimizing cooperative decision-making of connected and autonomous vehicles (CAVs) in mixed traffic. This work presents two primary contributions: First, we construct a game topology tensor for dynamic traffic flow, effectively compressing high-dimensional traffic state information and decreasing the search space for MARL algorithms. Second, building upon the designed game topology tensor and using QMIX as the backbone RL algorithm, we establish a topology-enhanced MARL framework incorporating visit counts and agent mutual information. Extensive simulations across varying traffic densities and CAV penetration rates demonstrate the effectiveness of TPE-MARL. Evaluations encompassing training dynamics, exploration patterns, macroscopic traffic performance metrics, and microscopic vehicle behaviors reveal that TPE-MARL successfully balances exploration and exploitation. Consequently, it exhibits superior performance in terms of traffic efficiency, safety, decision smoothness, and task completion. Furthermore, the algorithm demonstrates decision-making rationality comparable to or exceeding that of human drivers in both mixed-autonomy and fully autonomous traffic scenarios. Code of our work is available at this https URL.
zh
[AI-17] Multimodal Coordinated Online Behavior: Trade-offs and Strategies
【速读】:该论文试图解决多模态协调行为检测中的复杂动态问题,传统方法通常依赖于单模态或独立处理多模态数据,可能无法全面捕捉协调行为的全貌。解决方案的关键在于比较不同多模态整合方式的效果,强调弱整合与强整合模型之间的权衡,旨在平衡对广泛协调模式的捕捉与对紧密协调行为的识别,从而提升对在线协调行为的检测与分析能力。
链接: https://arxiv.org/abs/2507.12108
作者: Lorenzo Mannocci,Stefano Cresci,Matteo Magnani,Anna Monreale,Maurizio Tesconi
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Coordinated online behavior, which spans from beneficial collective actions to harmful manipulation such as disinformation campaigns, has become a key focus in digital ecosystem analysis. Traditional methods often rely on monomodal approaches, focusing on single types of interactions like co-retweets or co-hashtags, or consider multiple modalities independently of each other. However, these approaches may overlook the complex dynamics inherent in multimodal coordination. This study compares different ways of operationalizing the detection of multimodal coordinated behavior. It examines the trade-off between weakly and strongly integrated multimodal models, highlighting the balance between capturing broader coordination patterns and identifying tightly coordinated behavior. By comparing monomodal and multimodal approaches, we assess the unique contributions of different data modalities and explore how varying implementations of multimodality impact detection outcomes. Our findings reveal that not all the modalities provide distinct insights, but that with a multimodal approach we can get a more comprehensive understanding of coordination dynamics. This work enhances the ability to detect and analyze coordinated online behavior, offering new perspectives for safeguarding the integrity of digital platforms.
zh
[AI-18] From Static to Intelligent: Evolving SaaS Pricing with LLMs
【速读】:该论文试图解决SaaS市场中DevOps团队在手动管理和演化定价结构时所面临的效率低下和易出错的问题。其关键解决方案是引入智能定价(intelligent pricing),通过一种基于大语言模型(LLM)的方法,自动化地将静态HTML定价信息转换为可动态读取的智能定价模型,从而提升效率、一致性和准确性。
链接: https://arxiv.org/abs/2507.12104
作者: Francisco Javier Cavero,Juan C. Alonso,Antonio Ruiz-Cortés
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 12 pages. Accepted at the SOC4AI Workshop (Service-Oriented Computing for AI Applications), held in conjunction with the 22nd International Conference on Service-Oriented Computing (ICSOC 2024)
点击查看摘要
Abstract:The SaaS paradigm has revolutionized software distribution by offering flexible pricing options to meet diverse customer needs. However, the rapid expansion of the SaaS market has introduced significant complexity for DevOps teams, who must manually manage and evolve pricing structures, an approach that is both time-consuming and prone to errors. The absence of automated tools for pricing analysis restricts the ability to efficiently evaluate, optimize, and scale these models. This paper proposes leveraging intelligent pricing (iPricing), dynamic, machine-readable pricing models, as a solution to these challenges. Intelligent pricing enables competitive analysis, streamlines operational decision-making, and supports continuous pricing evolution in response to market dynamics, leading to improved efficiency and accuracy. We present an LLM-driven approach that automates the transformation of static HTML pricing into iPricing, significantly improving efficiency and consistency while minimizing human error. Our implementation, AI4Pricing2Yaml, features a basic Information Extractor that uses web scraping and LLMs technologies to extract essential pricing components, plans, features, usage limits, and add-ons, from SaaS websites. Validation against a dataset of 30 distinct commercial SaaS, encompassing over 150 intelligent pricings, demonstrates the system’s effectiveness in extracting the desired elements across all steps. However, challenges remain in addressing hallucinations, complex structures, and dynamic content. This work highlights the potential of automating intelligent pricing transformation to streamline SaaS pricing management, offering implications for improved consistency and scalability in an increasingly intricate pricing landscape. Future research will focus on refining extraction capabilities and enhancing the system’s adaptability to a wider range of SaaS websites.
zh
[AI-19] DUSE: A Data Expansion Framework for Low-resource Automatic Modulation Recognition based on Active Learning
【速读】:该论文试图解决在自动调制识别(AMR)任务中,深度神经网络模型因目标域数据稀缺而难以有效训练的问题。解决方案的关键在于提出一种名为动态不确定性驱动样本扩展(DUSE)的数据扩展框架,该框架通过不确定性评分函数从相关AMR数据集中筛选出有用样本,并结合主动学习策略持续优化评分函数,从而有效提升模型在数据稀缺情况下的性能。
链接: https://arxiv.org/abs/2507.12011
作者: Yao Lu,Hongyu Gao,Zhuangzhi Chen,Dongwei Xu,Yun Lin,Qi Xuan,Guan Gui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Although deep neural networks have made remarkable achievements in the field of automatic modulation recognition (AMR), these models often require a large amount of labeled data for training. However, in many practical scenarios, the available target domain data is scarce and difficult to meet the needs of model training. The most direct way is to collect data manually and perform expert annotation, but the high time and labor costs are unbearable. Another common method is data augmentation. Although it can enrich training samples to a certain extent, it does not introduce new data and therefore cannot fundamentally solve the problem of data scarcity. To address these challenges, we introduce a data expansion framework called Dynamic Uncertainty-driven Sample Expansion (DUSE). Specifically, DUSE uses an uncertainty scoring function to filter out useful samples from relevant AMR datasets and employs an active learning strategy to continuously refine the scorer. Extensive experiments demonstrate that DUSE consistently outperforms 8 coreset selection baselines in both class-balance and class-imbalance settings. Besides, DUSE exhibits strong cross-architecture generalization for unseen models.
zh
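作为摘要中"用不确定性评分函数从相关数据集中筛选有用样本"这一步的玩具示意,下面以预测分布的熵作为评分,选出得分最高的样本加入目标域训练集。评分函数的具体形式、分类器结构与选择规模均为假设的简化版本,并非 DUSE 的原始实现。

```python
# 示意:基于预测熵的不确定性评分与样本筛选(简化版本)
import torch

@torch.no_grad()
def entropy_scores(model, signals: torch.Tensor) -> torch.Tensor:
    """signals: (n, ...) 候选样本;返回每个样本预测分布的熵,越大表示越不确定、越可能有用。"""
    probs = torch.softmax(model(signals), dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def select_topk(model, pool: torch.Tensor, k: int) -> torch.Tensor:
    scores = entropy_scores(model, pool)
    return torch.topk(scores, k).indices          # 选出 k 个最不确定的样本索引

# 用法示意:一个小分类器 + 随机候选池(特征维度与类别数均为假设)
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 11))
pool = torch.randn(1000, 128)
chosen = select_topk(model, pool, k=64)
print(chosen.shape)   # 被加入训练集的样本索引;主动学习循环中可反复更新评分器后重复筛选
```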
[AI-20] Can LLMs Find Fraudsters? Multi-level LLM Enhanced Graph Fraud Detection
【速读】:该论文试图解决现有图欺诈检测方法在利用预处理节点嵌入和固定图结构时,忽略了原始文本信息中丰富的语义线索的问题。解决方案的关键在于提出一种多层级大语言模型(LLM)增强的图欺诈检测框架MLED,通过LLM从文本信息中提取外部知识,并设计类型级增强器和关系级增强器,以提升欺诈者与正常实体之间的差异性以及欺诈者在不同关系中的重要性,从而增强图欺诈检测能力。
链接: https://arxiv.org/abs/2507.11997
作者: Tairan Huang,Yili Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Graph fraud detection has garnered significant attention as Graph Neural Networks (GNNs) have proven effective in modeling complex relationships within multimodal data. However, existing graph fraud detection methods typically use preprocessed node embeddings and predefined graph structures to reveal fraudsters, which ignore the rich semantic cues contained in raw textual information. Although Large Language Models (LLMs) exhibit powerful capabilities in processing textual information, it remains a significant challenge to perform multimodal fusion of processed textual embeddings with graph structures. In this paper, we propose a Multi-level LLM Enhanced Graph Fraud Detection framework called MLED. In MLED, we utilize LLMs to extract external knowledge from textual information to enhance graph fraud detection methods. To integrate LLMs with graph structure information and enhance the ability to distinguish fraudsters, we design a multi-level LLM enhanced framework including type-level enhancer and relation-level enhancer. One is to enhance the difference between the fraudsters and the benign entities, the other is to enhance the importance of the fraudsters in different relations. The experiments on four real-world datasets show that MLED achieves state-of-the-art performance in graph fraud detection as a generalized framework that can be applied to existing methods.
zh
[AI-21] Understanding visual attention beehind bee-inspired UAV navigation
【速读】:该论文试图解决在复杂环境中实现自主无人机(UAV)导航的问题,其解决方案的关键在于利用仿生学原理,特别是蜜蜂通过光学流(optic flow)进行导航的机制。研究者训练了一个强化学习代理,仅使用光学流作为感知输入来导航有障碍物的隧道,并发现代理主要关注光学流中的不连续区域和高幅度区域,从而避免产生大光学流的障碍物并保持在环境中心位置,这一策略可为物理无人机开发简单的显式控制律提供参考。
链接: https://arxiv.org/abs/2507.11992
作者: Pranav Rajbhandari,Abhi Veda,Matthew Garratt,Mandayam Srinivasan,Sridhar Ravi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Bio-inspired design is often used in autonomous UAV navigation due to the capacity of biological systems for flight and obstacle avoidance despite limited sensory and computational capabilities. In particular, honeybees mainly use the sensory input of optic flow, the apparent motion of objects in their visual field, to navigate cluttered environments. In our work, we train a Reinforcement Learning agent to navigate a tunnel with obstacles using only optic flow as sensory input. We inspect the attention patterns of trained agents to determine the regions of optic flow on which they primarily base their motor decisions. We find that agents trained in this way pay most attention to regions of discontinuity in optic flow, as well as regions with large optic flow magnitude. The trained agents appear to navigate a cluttered tunnel by avoiding the obstacles that produce large optic flow, while maintaining a centered position in their environment, which resembles the behavior seen in flying insects. This pattern persists across independently trained agents, which suggests that this could be a good strategy for developing a simple explicit control law for physical UAVs.
zh
[AI-22] Robust Planning for Autonomous Vehicles with Diffusion-Based Failure Samplers
【速读】:该论文试图解决高风险交通区域(如交叉路口)中自动驾驶车辆因传感器噪声导致碰撞的问题。解决方案的关键在于利用深度生成模型生成可能导致碰撞的传感器噪声序列,并通过生成对抗架构将多步骤去噪扩散概率模型压缩为单步骤去噪扩散模型,从而在保持采样质量的同时实现快速推理。该单步骤模型被用于构建一个鲁棒的规划器,以高效采样潜在故障案例并优化决策过程。
链接: https://arxiv.org/abs/2507.11991
作者: Juanran Wang,Marc R. Schlichting,Mykel J. Kochenderfer
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:High-risk traffic zones such as intersections are a major cause of collisions. This study leverages deep generative models to enhance the safety of autonomous vehicles in an intersection context. We train a 1000-step denoising diffusion probabilistic model to generate collision-causing sensor noise sequences for an autonomous vehicle navigating a four-way intersection based on the current relative position and velocity of an intruder. Using the generative adversarial architecture, the 1000-step model is distilled into a single-step denoising diffusion model which demonstrates fast inference speed while maintaining similar sampling quality. We demonstrate one possible application of the single-step model in building a robust planner for the autonomous vehicle. The planner uses the single-step model to efficiently sample potential failure cases based on the currently measured traffic state to inform its decision-making. Through simulation experiments, the robust planner demonstrates significantly lower failure rate and delay rate compared with the baseline Intelligent Driver Model controller.
zh
[AI-23] Aime: Towards Fully-Autonomous Multi-Agent Framework
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在复杂问题求解中面临的适应性与鲁棒性不足的问题。传统计划与执行框架存在刚性执行、静态代理能力及低效通信等关键局限,限制了系统在动态环境中的表现。论文提出的解决方案Aime的核心在于通过动态、反应式规划与执行机制实现系统的自适应性,其关键创新包括:动态规划器持续根据实时执行反馈优化整体策略;Actor Factory 实现动态代理实例化,按需组装具备定制工具和知识的专业代理;以及中心化进度管理模块提供全局状态一致性保障。这些设计显著提升了系统的灵活性与任务成功率。
链接: https://arxiv.org/abs/2507.11988
作者: Yexuan Shi,Mingyu Wang,Yunxiang Cao,Hongjie Lai,Junjian Lan,Xin Han,Yu Wang,Jie Geng,Zhenan Li,Zihao Xia,Xiang Chen,Chen Li,Jian Xu,Wenbo Duan,Yuanshuo Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figures,
点击查看摘要
Abstract:Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) are emerging as a powerful paradigm for solving complex, multifaceted problems. However, the potential of these systems is often constrained by the prevalent plan-and-execute framework, which suffers from critical limitations: rigid plan execution, static agent capabilities, and inefficient communication. These weaknesses hinder their adaptability and robustness in dynamic environments. This paper introduces Aime, a novel multi-agent framework designed to overcome these challenges through dynamic, reactive planning and execution. Aime replaces the conventional static workflow with a fluid and adaptive architecture. Its core innovations include: (1) a Dynamic Planner that continuously refines the overall strategy based on real-time execution feedback; (2) an Actor Factory that implements Dynamic Actor instantiation, assembling specialized agents on-demand with tailored tools and knowledge; and (3) a centralized Progress Management Module that serves as a single source of truth for coherent, system-wide state awareness. We empirically evaluated Aime on a diverse suite of benchmarks spanning general reasoning (GAIA), software engineering (SWE-bench Verified), and live web navigation (WebVoyager). The results demonstrate that Aime consistently outperforms even highly specialized state-of-the-art agents in their respective domains. Its superior adaptability and task success rate establish Aime as a more resilient and effective foundation for multi-agent collaboration.
zh
[AI-24] Formal Verification of Neural Certificates Done Dynamically
【速读】:该论文试图解决神经证书在基于深度学习的控制策略中进行形式化验证时面临的可扩展性问题,传统方法由于需要穷举状态空间而难以应用。解决方案的关键在于提出一种轻量级的运行时监控框架,该框架能够在不访问底层控制策略的情况下,对系统进行实时验证,并在有限预测时间内对证书进行在线验证,从而实现对安全性的保障。
链接: https://arxiv.org/abs/2507.11987
作者: Thomas A. Henzinger,Konstantin Kueffner,Emily Yu
机构: 未知
类目: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI)
备注: Accepted at RV’25
点击查看摘要
Abstract:Neural certificates have emerged as a powerful tool in cyber-physical systems control, providing witnesses of correctness. These certificates, such as barrier functions, often learned alongside control policies, once verified, serve as mathematical proofs of system safety. However, traditional formal verification of their defining conditions typically faces scalability challenges due to exhaustive state-space exploration. To address this challenge, we propose a lightweight runtime monitoring framework that integrates real-time verification and does not require access to the underlying control policy. Our monitor observes the system during deployment and performs on-the-fly verification of the certificate over a lookahead region to ensure safety within a finite prediction horizon. We instantiate this framework for ReLU-based control barrier functions and demonstrate its practical effectiveness in a case study. Our approach enables timely detection of safety violations and incorrect certificates with minimal overhead, providing an effective but lightweight alternative to the static verification of the certificates.
zh
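下面给出一个与摘要思路一致的轻量运行时监控器示意:在有限的前瞻窗口内逐步检查(此处用一个 ReLU 网络代表的)屏障函数取值及其离散时间下降条件。屏障条件的具体形式(B(x') >= (1-alpha)·B(x))与动力学预测函数均为常见设定下的假设,并非论文的精确定义。

```python
# 示意:对神经屏障证书做前瞻式运行时检查(简化假设:离散时间条件 B(x') >= (1-alpha)*B(x))
import torch
import torch.nn as nn

barrier = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))   # 假设:已训练好的证书

def monitor(x0: torch.Tensor, predict_next, horizon: int = 10, alpha: float = 0.1) -> bool:
    """x0: 当前状态;predict_next: 前瞻用的状态预测函数;返回窗口内证书条件是否全部满足。"""
    x = x0
    b_prev = barrier(x)
    if b_prev.item() < 0:
        return False                    # 当前状态已在安全集之外
    for _ in range(horizon):
        x = predict_next(x)
        b = barrier(x)
        if b.item() < (1.0 - alpha) * b_prev.item():
            return False                # 屏障下降过快,提示证书或系统可能失效
        b_prev = b
    return True

# 用法示意:用一个(假设的)线性动力学作为前瞻预测
A = 0.98 * torch.eye(4)
ok = monitor(torch.randn(4), predict_next=lambda x: A @ x)
print("safe within horizon" if ok else "violation detected")
```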
[AI-25] Online Training and Pruning of Deep Reinforcement Learning Networks
【速读】:该论文试图解决深度强化学习(Reinforcement Learning, RL)中神经网络规模扩展带来的计算和内存复杂度显著增加的问题。其关键解决方案是提出一种将训练与剪枝同时进行的方法,通过在在线特征提取网络(Online Feature Extractor Network, OFENet)增强的RL算法中引入变分伯努利分布参数的随机优化问题,利用正则化项促进对性能贡献较小单元的永久性剪枝,从而实现网络压缩而不显著损失性能。
链接: https://arxiv.org/abs/2507.11975
作者: Valentin Frank Ingmar Guenter,Athanasios Sideris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 25 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Scaling deep neural networks (NN) of reinforcement learning (RL) algorithms has been shown to enhance performance when feature extraction networks are used but the gained performance comes at the significant expense of increased computational and memory complexity. Neural network pruning methods have successfully addressed this challenge in supervised learning. However, their application to RL is underexplored. We propose an approach to integrate simultaneous training and pruning within advanced RL methods, in particular to RL algorithms enhanced by the Online Feature Extractor Network (OFENet). Our networks (XiNet) are trained to solve stochastic optimization problems over the RL networks’ weights and the parameters of variational Bernoulli distributions for 0/1 Random Variables \xi scaling each unit in the networks. The stochastic problem formulation induces regularization terms that promote convergence of the variational parameters to 0 when a unit contributes little to the performance. In this case, the corresponding structure is rendered permanently inactive and pruned from its network. We propose a cost-aware, sparsity-promoting regularization scheme, tailored to the DenseNet architecture of OFENets expressing the parameter complexity of involved networks in terms of the parameters of the RVs in these networks. Then, when matching this cost with the regularization terms, the many hyperparameters associated with them are automatically selected, effectively combining the RL objectives and network compression. We evaluate our method on continuous control benchmarks (MuJoCo) and the Soft Actor-Critic RL agent, demonstrating that OFENets can be pruned considerably with minimal loss in performance. Furthermore, our results confirm that pruning large networks during training produces more efficient and higher performing RL agents rather than training smaller networks from scratch.
zh
[AI-26] Kevin: Multi-Turn RL for Generating CUDA Kernels
【速读】:该论文试图解决生成高效CUDA内核的挑战性问题,该问题对AI系统的性能至关重要。其解决方案的关键在于引入一种灵活的多轮强化学习(multi-turn RL)方法,以显式建模该过程的迭代特性,并应对实际场景中的独特挑战,如从长轨迹中学习和跨轮次的有效奖励分配。通过这种方法训练的Kevin模型,在生成内核的正确性和速度提升方面均优于基线模型和其他前沿模型。
链接: https://arxiv.org/abs/2507.11948
作者: Carlo Baronio,Pietro Marsella,Ben Pan,Simon Guo,Silas Alberti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:Writing GPU kernels is a challenging task and critical for AI systems’ efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
zh
[AI-27] Native-AI Empowered Scalable Architectures and Solutions for Future Non-Terrestrial Networks: An Overview
【速读】:该论文试图解决在6G网络演进过程中,非地面网络(NTN)与开放无线接入网(ORAN)融合所带来的开发与运维(DevOps)生命周期中的挑战,特别是如何通过智能ORAN提升NTN的可扩展性。解决方案的关键在于提出一种基于ORAN的NTN框架,该框架涵盖了灵活的前传链路分割、增强的RAN智能控制器(RIC)以支持分布式学习、可扩展的部署架构以及多域服务管理,从而实现更高效、可靠和灵活的无线网络管理。
链接: https://arxiv.org/abs/2507.11935
作者: Jikang Deng,Fizza Hassan,Hui Zhou,Saad Al-Ahmadi,Mohamed-Slim Alouini,Daniel B. Da Costa
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
点击查看摘要
Abstract:As the path toward 6G networks is being charted, the emerging applications have motivated evolutions of network architectures to realize the efficient, reliable, and flexible wireless networks. Among the potential architectures, the non-terrestrial network (NTN) and open radio access network (ORAN) have received increasing interest from both academia and industry. Although the deployment of NTNs ensures coverage, enhances spectral efficiency, and improves the resilience of wireless networks. The high altitude and mobility of NTN present new challenges in the development and operations (DevOps) lifecycle, hindering intelligent and scalable network management due to the lack of native artificial intelligence (AI) capability. With the advantages of ORAN in disaggregation, openness, virtualization, and intelligence, several works propose integrating ORAN principles into the NTN, focusing mainly on ORAN deployment options based on transparent and regenerative systems. However, a holistic view of how to effectively combine ORAN and NTN throughout the DevOps lifecycle is still missing, especially regarding how intelligent ORAN addresses the scalability challenges in NTN. Motivated by this, in this paper, we first provide the background knowledge about ORAN and NTN, outline the state-of-the-art research on ORAN for NTNs, and present the DevOps challenges that motivate the adoption of ORAN solutions. We then propose the ORAN-based NTN framework, discussing its features and architectures in detail. These include the discussion about flexible fronthaul split, RAN intelligent controllers (RICs) enhancement for distributed learning, scalable deployment architecture, and multi-domain service management. Finally, the future research directions, including combinations of the ORAN-based NTN framework and other enabling technologies and schemes, as well as the candidate use cases, are highlighted.
zh
[AI-28] A Parallel CPU-GPU Framework for Cost-Bounded DFS with Applications to IDA* and BTS
【速读】:该论文试图解决如何在深度优先搜索(Depth First Search, DFS)中有效利用GPU的并行计算能力以提升搜索效率的问题。其解决方案的关键在于引入一种批量处理GPU计算的方法,具体表现为一种新的有界代价深度优先搜索(Cost-Bounded Depth-First Search, CB-DFS)方法,该方法结合了现代CPU与GPU的并行性,从而实现了如\emph{Batch IDA*}和\emph{Batch BTS}等算法的高效执行。该方法在保持最优性保证的同时,通过批量处理技术提升了GPU操作的效率。
链接: https://arxiv.org/abs/2507.11916
作者: Ehsan Futuhi,Nathan R. Sturtevant
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:The rapid advancement of GPU technology has unlocked powerful parallel processing capabilities, creating new opportunities to enhance classic search algorithms. A recent successful application of GPUs is in compressing large pattern database (PDB) heuristics using neural networks while preserving heuristic admissibility. However, very few algorithms have been designed to exploit GPUs during search. Several variants of A* exist that batch GPU computations. In this paper we introduce a method for batching GPU computations in depth first search. In particular, we describe a new cost-bounded depth-first search (CB-DFS) method that leverages the combined parallelism of modern CPUs and GPUs. This is used to create algorithms like Batch IDA*, an extension of the Iterative Deepening A* (IDA*) algorithm, or Batch BTS, an extension of Budgeted Tree Search. Our approach builds on the general approach used by Asynchronous Parallel IDA* (AIDA*), while maintaining optimality guarantees. We evaluate the approach on the 3x3 Rubik’s Cube and 4x4 sliding tile puzzle (STP), showing that GPU operations can be efficiently batched in DFS. Additionally, we conduct extensive experiments to analyze the effects of hyperparameters, neural network heuristic size, and hardware resources on performance.
zh
[AI-29] Interactive Hybrid Rice Breeding with Parametric Dual Projection
【速读】:该论文试图解决杂交水稻育种中由于基因组预测模型准确性有限,导致育种者仍需结合经验进行调控基因识别与杂交体选择的耗时问题。解决方案的关键在于提出一种具有理论保障的参数化双投影方法,以支持交互式的双分析任务,从而实现调控基因的可视化和杂交体的可视化,提升育种效率。
链接: https://arxiv.org/abs/2507.11848
作者: Changjian Chen,Pengcheng Wang,Fei Lyu,Zhuo Tang,Li Yang,Long Wang,Yong Cai,Feng Yu,Kenli Li
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract:Hybrid rice breeding crossbreeds different rice lines and cultivates the resulting hybrids in fields to select those with desirable agronomic traits, such as higher yields. Recently, genomic selection has emerged as an efficient way for hybrid rice breeding. It predicts the traits of hybrids based on their genes, which helps exclude many undesired hybrids, largely reducing the workload of field cultivation. However, due to the limited accuracy of genomic prediction models, breeders still need to combine their experience with the models to identify regulatory genes that control traits and select hybrids, which remains a time-consuming process. To ease this process, in this paper, we proposed a visual analysis method to facilitate interactive hybrid rice breeding. Regulatory gene identification and hybrid selection naturally ensemble a dual-analysis task. Therefore, we developed a parametric dual projection method with theoretical guarantees to facilitate interactive dual analysis. Based on this dual projection method, we further developed a gene visualization and a hybrid visualization to verify the identified regulatory genes and hybrids. The effectiveness of our method is demonstrated through the quantitative evaluation of the parametric dual projection method, identified regulatory genes and desired hybrids in the case study, and positive feedback from breeders.
zh
[AI-30] The Evolving Role of Large Language Models in Scientific Innovation: Evaluator, Collaborator, and Scientist
【速读】:该论文试图解决当前科学创新中面临的挑战,包括信息过载、学科壁垒以及传统研究方法的边际效益递减问题,同时旨在深入理解大型语言模型(Large Language Models, LLMs)在科学创新中的变革潜力与角色分化。其解决方案的关键在于提出一个分层次的框架,将LLMs在科学创新中的角色划分为评估者(Evaluator)、合作者(Collaborator)和科学家(Scientist),并通过系统分析现有方法、基准测试、系统及评估指标,构建了一个统一的分类体系,以明确各层级的能力边界、评估标准及人机交互模式。
链接: https://arxiv.org/abs/2507.11810
作者: Haoxuan Zhang,Ruochi Li,Yang Zhang,Ting Xiao,Jiangping Chen,Junhua Ding,Haihua Chen
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Scientific innovation is undergoing a paradigm shift driven by the rapid advancement of Large Language Models (LLMs). As science faces mounting challenges including information overload, disciplinary silos, and diminishing returns on conventional research methods, LLMs are emerging as powerful agents capable not only of enhancing scientific workflows but also of participating in and potentially leading the innovation process. Existing surveys mainly focus on different perspectives, phases, and tasks in scientific research and discovery, while they have limitations in understanding the transformative potential and role differentiation of LLMs. This survey proposes a comprehensive framework to categorize the evolving roles of LLMs in scientific innovation across three hierarchical levels: Evaluator, Collaborator, and Scientist. We distinguish between LLMs’ contributions to structured scientific research processes and open-ended scientific discovery, thereby offering a unified taxonomy that clarifies capability boundaries, evaluation criteria, and human-AI interaction patterns at each level. Through an extensive analysis of current methodologies, benchmarks, systems, and evaluation metrics, this survey delivers an in-depth and systematic synthesis on LLM-driven scientific innovation. We present LLMs not only as tools for automating existing processes, but also as catalysts capable of reshaping the epistemological foundations of science itself. This survey offers conceptual clarity, practical guidance, and theoretical foundations for future research, while also highlighting open challenges and ethical considerations in the pursuit of increasingly autonomous AI-driven science. Resources related to this survey can be accessed on GitHub at: this https URL.
zh
[AI-31] CLID-MU: Cross-Layer Information Divergence Based Meta Update Strategy for Learning with Noisy Labels KDD2025
【速读】:该论文旨在解决在标签噪声环境下进行元学习(meta-learning)时对干净标签元数据集的高度依赖问题。传统方法需要一个干净的无偏标签数据集来训练鲁棒模型,但在实际应用中获取此类数据集具有挑战性。论文提出的解决方案的关键在于利用数据本身的特性,而非依赖标签信息。具体而言,其核心思想是基于跨层信息发散(Cross-layer Information Divergence, CLID)的元更新策略(CLID-MU),通过分析数据结构在最后一层隐藏层与最终层之间的一致性来评估模型性能,并以此指导训练过程,从而在不依赖干净标签的情况下有效处理噪声标签场景。
链接: https://arxiv.org/abs/2507.11807
作者: Ruofan Hu,Dongyu Zhang,Huayi Zhang,Elke Rundensteiner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: KDD 2025, 12 pages, 7 figures
点击查看摘要
Abstract:Learning with noisy labels (LNL) is essential for training deep neural networks with imperfect data. Meta-learning approaches have achieved success by using a clean unbiased labeled set to train a robust model. However, this approach heavily depends on the availability of a clean labeled meta-dataset, which is difficult to obtain in practice. In this work, we thus tackle the challenge of meta-learning for noisy label scenarios without relying on a clean labeled dataset. Our approach leverages the data itself while bypassing the need for labels. Building on the insight that clean samples effectively preserve the consistency of related data structures across the last hidden and the final layer, whereas noisy samples disrupt this consistency, we design the Cross-layer Information Divergence-based Meta Update Strategy (CLID-MU). CLID-MU leverages the alignment of data structures across these diverse feature spaces to evaluate model performance and use this alignment to guide training. Experiments on benchmark datasets with varying amounts of labels under both synthetic and real-world noise demonstrate that CLID-MU outperforms state-of-the-art methods. The code is released at this https URL.
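按摘要的描述,CLID 的核心是比较倒数第二层特征与最终层输出所诱导的数据结构是否一致。下面给出一个可能的度量写法(PyTorch),仅为示意性推测,并非官方 CLID-MU 实现;温度参数 tau 与具体的散度形式均为假设。

```python
# 示意性实现(假设形式):跨层"数据结构一致性"的一个可能度量,散度越小说明两层结构越一致。
import torch
import torch.nn.functional as F

def cross_layer_divergence(h, z, tau=0.5):
    """h: (n, d_hidden) last-hidden-layer features; z: (n, d_out) final-layer outputs."""
    h, z = F.normalize(h, dim=1), F.normalize(z, dim=1)
    n = h.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=h.device)    # drop self-similarity
    sim_h = (h @ h.t())[off_diag].view(n, n - 1) / tau             # per-sample similarity rows
    sim_z = (z @ z.t())[off_diag].view(n, n - 1) / tau
    p_log = F.log_softmax(sim_h, dim=1)
    q = F.softmax(sim_z, dim=1)
    return F.kl_div(p_log, q, reduction="batchmean")               # KL(q || p)

# toy usage with random features
h = torch.randn(32, 128)
z = torch.randn(32, 10)
print(cross_layer_divergence(h, z).item())
```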
zh
[AI-32] Survey of Swarm Intelligence Approaches to Search Documents Based On Semantic Similarity
【速读】:该论文试图解决基于语义相似性的文档检索问题,其解决方案的关键在于利用群体智能(Swarm Intelligence, SI)算法来提升检索效果。通过模仿动物和昆虫的自然行为,群体智能算法被设计为有效的计算模型,用于优化和解决实际应用中的复杂问题。
链接: https://arxiv.org/abs/2507.11787
作者: Chandrashekar Muniyappa,Eunjin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CSAIDE '25: Proceedings of the 2025 4th International Conference on Cyber Security, Artificial Intelligence and the Digital Economy
点击查看摘要
Abstract:Swarm Intelligence (SI) is gaining a lot of popularity in artificial intelligence, where the natural behavior of animals and insects is observed and translated into computer algorithms called swarm computing to solve real-world problems. Due to their effectiveness, they are applied in solving various computer optimization problems. This survey will review all the latest developments in searching for documents based on semantic similarity using Swarm Intelligence algorithms and recommend future research directions.
zh
[AI-33] Predicting Delayed Trajectories Using Network Features: A Study on the Dutch Railway Network
【速读】:该论文试图解决荷兰铁路网络中延迟预测研究的不足,特别是在全局网络范围内的模式分析,以减少连锁反应的影响。其解决方案的关键在于采用XGBoost分类器,并结合节点中心性度量,以捕捉铁路网络中的拓扑特征,从而改进延迟轨迹的预测。
链接: https://arxiv.org/abs/2507.11776
作者: Merel Kampere,Ali Mohammed Mansoor Alsahag
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The Dutch railway network is one of the busiest in the world, with delays being a prominent concern for the principal passenger railway operator NS. This research addresses a gap in delay prediction studies within the Dutch railway network by employing an XGBoost Classifier with a focus on topological features. Current research predominantly emphasizes short-term predictions and neglects the broader network-wide patterns essential for mitigating ripple effects. This research implements and improves an existing methodology, originally designed to forecast the evolution of the fast-changing US air network, to predict delays in the Dutch Railways. By integrating Node Centrality Measures and comparing multiple classifiers like RandomForest, DecisionTree, GradientBoosting, AdaBoost, and LogisticRegression, the goal is to predict delayed trajectories. However, the results reveal limited performance, especially in non-simultaneous testing scenarios, suggesting the necessity for more context-specific adaptations. Regardless, this research contributes to the understanding of transportation network evaluation and proposes future directions for developing more robust predictive models for delays.
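下面用 networkx 的节点中心性度量加上 XGBoost 分类器给出一个最小可运行的示意流程;图结构与延误标签均为随机占位数据,并不代表论文使用的荷兰铁路数据集。

```python
# 示意性流程(非论文代码):拓扑特征(节点中心性)+ 分类器 预测延误轨迹。
import networkx as nx
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
G = nx.gnm_random_graph(60, 150, seed=0)          # placeholder for the rail network graph

centrality = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
}

# one sample per (origin, destination) trajectory: concatenate endpoint centralities
edges = list(G.edges())
X = np.array([[centrality[c][u] for c in centrality] + [centrality[c][v] for c in centrality]
              for u, v in edges])
y = rng.integers(0, 2, size=len(edges))           # placeholder delayed / not-delayed labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```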
zh
[AI-34] Challenges in GenAI and Authentication: a scoping review
【速读】:该论文试图解决生成式人工智能(Generative AI)在数字信息共享背景下对身份认证和真实性带来的安全挑战。其关键在于通过系统性文献综述分析现有研究中的主要工作、挑战、攻击面、威胁、已提出的解决方案及研究空白,从而为未来在认证与生成式人工智能领域的研究提供支持。
链接: https://arxiv.org/abs/2507.11775
作者: Wesley dos Reis Bezerra,Lais Machado Bezerra,Carlos Becker Westphall
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Authentication and authenticity have been a security challenge since the beginning of information sharing, especially in the context of digital information. With the advancement of generative artificial intelligence, these challenges have evolved, demanding a more up-to-date analysis of their impacts on society and system security. This work presents a scoping review that analyzed 88 documents from the IEEE Xplore, Scopus, and ACM databases, promoting an analysis of the resulting portfolio through six guiding questions focusing on the most relevant work, challenges, attack surfaces, threats, proposed solutions, and gaps. Finally, the portfolio articles are analyzed through this guiding research lens and also receive individualized analysis. The results consistently outline the challenges, gaps, and threats related to images, text, audio, and video, thereby supporting new research in the areas of authentication and generative artificial intelligence.
zh
[AI-35] Small Data Explainer – The impact of small data methods in everyday life
【速读】:该论文试图解决在小数据(small data)场景下如何有效利用突破性人工智能(AI)技术的问题,特别是在信息有限的环境中提升数据驱动的政策制定和社会应用效果。其解决方案的关键在于通过对比小数据与大数据的特性,结合跨学科的方法,如统计学中的知识驱动建模和计算机科学中的数据驱动建模,提供当前数据分析与建模技术的详细技术概述,并强调不同领域对小数据处理的贡献。
链接: https://arxiv.org/abs/2507.11773
作者: Maren Hackenberg,Sophia G. Connor,Fabian Kabus,June Brawner,Ella Markham,Mahi Hardalupas,Areeq Chowdhury,Rolf Backofen,Anna Köttgen,Angelika Rohde,Nadine Binder,Harald Binder, theCollaborative Research Center 1597 Small Data
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Written in collaboration with the Royal Society, contributing to the Disability Technology report ( this https URL )
点击查看摘要
Abstract:The emergence of breakthrough artificial intelligence (AI) techniques has led to a renewed focus on how small data settings, i.e., settings with limited information, can benefit from such developments. This includes societal issues such as how best to include under-represented groups in data-driven policy and decision making, or the health benefits of assistive technologies such as wearables. We provide a conceptual overview, in particular contrasting small data with big data, and identify common themes from exemplary case studies and application areas. Potential solutions are described in a more detailed technical overview of current data analysis and modelling techniques, highlighting contributions from different disciplines, such as knowledge-driven modelling from statistics and data-driven modelling from computer science. By linking application settings, conceptual contributions and specific techniques, we highlight what is already feasible and suggest what an agenda for fully leveraging small data might look like.
zh
[AI-36] Survey of Genetic and Differential Evolutionary Algorithm Approaches to Search Documents Based On Semantic Similarity
【速读】:该论文试图解决在大规模数据中识别相似文档的问题,其解决方案的关键在于利用遗传算法和差分进化等进化计算算法来提升基于语义文本相似性的文档搜索效果。
链接: https://arxiv.org/abs/2507.11751
作者: Chandrashekar Muniyappa,Eunjin Kim
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: CSAIDE '25: Proceedings of the 2025 4th International Conference on Cyber Security, Artificial Intelligence and the Digital Economy
点击查看摘要
Abstract:Identifying similar documents within extensive volumes of data poses a significant challenge. To tackle this issue, researchers have developed a variety of effective distributed computing techniques. With the advancement of computing power and the rise of big data, deep neural networks and evolutionary computing algorithms such as genetic algorithms and differential evolution algorithms have achieved greater success. This survey will explore the most recent advancements in the search for documents based on their semantic text similarity, focusing on genetic and differential evolutionary computing algorithms.
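作为示意,下面用 SciPy 的差分进化在文档向量空间中搜索一个查询向量,使其与若干相关文档的平均余弦相似度最大;文档向量为随机占位数据,仅演示"进化算法 + 语义相似度"的组合方式,并非综述中某一具体方法。

```python
# 示意性示例(非论文实现):差分进化(DE)优化查询向量以做语义检索。
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(1)
docs = rng.normal(size=(200, 16))                      # placeholder document embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
relevant = docs[:10]                                   # pretend these are known-relevant docs

def neg_mean_cosine(q):
    q = q / (np.linalg.norm(q) + 1e-12)
    return -float(np.mean(relevant @ q))               # DE minimizes, so negate the similarity

result = differential_evolution(neg_mean_cosine, bounds=[(-1, 1)] * 16, seed=1, maxiter=100)
query = result.x / np.linalg.norm(result.x)
scores = docs @ query
print("top-5 retrieved doc ids:", np.argsort(-scores)[:5])
```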
zh
[AI-37] Auto-Formulating Dynamic Programming Problems with Large Language Models
【速读】:该论文试图解决动态规划(Dynamic Programming, DP)模型构建过程中依赖专家知识的问题,旨在利用大型语言模型(Large Language Models, LLMs)自动化这一过程。其解决方案的关键是提出一种名为DPLM的7B参数专用模型,以及一种名为DualReflect的新型合成数据生成流水线。DualReflect通过结合前向生成以提高多样性与后向生成以确保可靠性,有效解决了DP问题中因随机转移和训练数据有限带来的挑战。
链接: https://arxiv.org/abs/2507.11737
作者: Chenyu Zhou,Jingyuan Yang,Linwei Xin,Yitian Chen,Ziyan He,Dongdong Ge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Dynamic programming (DP) is a fundamental method in operations research, but formulating DP models has traditionally required expert knowledge of both the problem context and DP techniques. Large Language Models (LLMs) offer the potential to automate this process. However, DP problems pose unique challenges due to their inherently stochastic transitions and the limited availability of training data. These factors make it difficult to directly apply existing LLM-based models or frameworks developed for other optimization problems, such as linear or integer programming. We introduce DP-Bench, the first benchmark covering a wide range of textbook-level DP problems to enable systematic evaluation. We present Dynamic Programming Language Model (DPLM), a 7B-parameter specialized model that achieves performance comparable to state-of-the-art LLMs like OpenAI’s o1 and DeepSeek-R1, and surpasses them on hard problems. Central to DPLM’s effectiveness is DualReflect, our novel synthetic data generation pipeline, designed to scale up training data from a limited set of initial examples. DualReflect combines forward generation for diversity and backward generation for reliability. Our results reveal a key insight: backward generation is favored in low-data regimes for its strong correctness guarantees, while forward generation, though lacking such guarantees, becomes increasingly valuable at scale for introducing diverse formulations. This trade-off highlights the complementary strengths of both approaches and the importance of combining them.
zh
[AI-38] ClarifAI: Enhancing AI Interpretability and Transparency through Case-Based Reasoning and Ontology-Driven Approach for Improved Decision-Making
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在决策过程中缺乏透明性和可解释性的问题,旨在提升AI系统的可解释能力以满足不同利益相关者的需求。解决方案的关键在于引入Clarity and Reasoning Interface for Artificial Intelligence (ClarifAI),该方法结合了基于案例的推理(Case-Based Reasoning, CBR)与本体驱动的方法,构建了全面的解释机制,从而增强了AI系统的透明度和可解释性。
链接: https://arxiv.org/abs/2507.11733
作者: Srikanth Vemula
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This study introduces the Clarity and Reasoning Interface for Artificial Intelligence (ClarifAI), a novel approach designed to augment the transparency and interpretability of artificial intelligence (AI) in the realm of improved decision making. Leveraging the Case-Based Reasoning (CBR) methodology and integrating an ontology-driven approach, ClarifAI aims to meet the intricate explanatory demands of various stakeholders involved in AI-powered applications. The paper elaborates on ClarifAI’s theoretical foundations, combining CBR and ontologies to furnish exhaustive explanation mechanisms. It further elaborates on the design principles and architectural blueprint, highlighting ClarifAI’s potential to enhance AI interpretability across different sectors and its applicability in high-stakes environments. This research delineates the significant role of ClarifAI in advancing the interpretability of AI systems, paving the way for its deployment in critical decision-making processes.
zh
[AI-39] Globalization for Scalable Short-term Load Forecasting
【速读】:该论文旨在解决电力传输网络中负载预测的通用性、可扩展性、过拟合、数据漂移以及冷启动问题,特别是在面对数据异质性时的传统局部预测模型(LFMs)的局限性。其解决方案的关键在于引入全局预测模型(GFMs),通过全球化和跨学习提升预测的泛化能力、可扩展性、准确性和鲁棒性。论文进一步探讨了特征转换和目标转换模型在数据漂移下的表现,并提出基于模型的时间序列聚类(TSC)方法以平衡全局与局部特性,从而有效应对数据异质性问题。实验表明,全局目标转换模型在引入全局特征和聚类技术后显著优于局部模型。
链接: https://arxiv.org/abs/2507.11729
作者: Amirhossein Ahmadi,Hamidreza Zareipour,Henry Leung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 63 pages with 22 figures
点击查看摘要
Abstract:Forecasting load in power transmission networks is essential across various hierarchical levels, from the system level down to individual points of delivery (PoD). While intuitive and locally accurate, traditional local forecasting models (LFMs) face significant limitations, particularly in handling generalizability, overfitting, data drift, and the cold start problem. These methods also struggle with scalability, becoming computationally expensive and less efficient as the network’s size and data volume grow. In contrast, global forecasting models (GFMs) offer a new approach to enhance prediction generalizability, scalability, accuracy, and robustness through globalization and cross-learning. This paper investigates global load forecasting in the presence of data drifts, highlighting the impact of different modeling techniques and data heterogeneity. We explore feature-transforming and target-transforming models, demonstrating how globalization, data heterogeneity, and data drift affect each differently. In addition, we examine the role of globalization in peak load forecasting and its potential for hierarchical forecasting. To address data heterogeneity and the balance between globality and locality, we propose separate time series clustering (TSC) methods, introducing model-based TSC for feature-transforming models and new weighted instance-based TSC for target-transforming models. Through extensive experiments on a real-world dataset of Alberta’s electricity load, we demonstrate that global target-transforming models consistently outperform their local counterparts, especially when enriched with global features and clustering techniques. In contrast, global feature-transforming models face challenges in balancing local and global dynamics, often requiring TSC to manage data heterogeneity effectively.
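下面是一个极简的示意流程:先用简单特征对各条负荷序列做 KMeans 聚类,再在每个簇内汇集所有序列训练一个"全局"回归模型;数据为随机游走占位,聚类特征与模型选择均为假设,仅用于说明全局化加聚类的思路。

```python
# 示意性代码(非论文实现):时间序列聚类 + 按簇训练全局模型。
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n_series, T, lags = 40, 500, 24
series = np.cumsum(rng.normal(size=(n_series, T)), axis=1)      # placeholder load curves

feats = np.c_[series.mean(1), series.std(1), series[:, -1] - series[:, 0]]
cluster_id = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)

models = {}
for c in range(3):
    X, y = [], []
    for s in series[cluster_id == c]:                            # pool all series of the cluster
        for t in range(lags, T):
            X.append(s[t - lags:t])
            y.append(s[t])
    models[c] = GradientBoostingRegressor().fit(np.array(X), np.array(y))

# one-step-ahead forecast for series 0 with its cluster's global model
s0 = series[0]
print(models[cluster_id[0]].predict(s0[-lags:].reshape(1, -1)))
```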
zh
[AI-40] Subgraph Generation for Generalizing on Out-of-Distribution Links
【速读】:该论文试图解决图神经网络(GNNs)在分布外(OOD)场景下链接预测(LP)性能下降的问题,以及图生成模型(GGMs)在跨领域应用中的局限性。其解决方案的关键在于提出FLEX框架,该框架结合了结构条件图生成机制和自编码器与GNN之间的对抗协同训练机制,以确保样本分布的结构对齐,从而提升OOD场景下的链接预测性能。
链接: https://arxiv.org/abs/2507.11710
作者: Jay Revolinsky,Harry Shomer,Jiliang Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures, preprint
点击查看摘要
Abstract:Graph Neural Networks (GNNs) demonstrate high performance on the link prediction (LP) task. However, these models often rely on all dataset samples being drawn from the same distribution. In addition, graph generative models (GGMs) show a pronounced ability to generate novel output graphs. Despite this, GGM applications remain largely limited to domain-specific tasks. To bridge this gap, we propose FLEX as a GGM framework which leverages two mechanisms: (1) structurally-conditioned graph generation, and (2) adversarial co-training between an auto-encoder and GNN. As such, FLEX ensures structural alignment between sample distributions to enhance link-prediction performance in out-of-distribution (OOD) scenarios. Notably, FLEX does not require expert knowledge to function in different OOD scenarios. Numerous experiments are conducted in synthetic and real-world OOD settings to demonstrate FLEX’s performance-enhancing ability, with further analysis for understanding the effects of graph data augmentation on link structures. The source code is available here: this https URL.
zh
[AI-41] Time series classification of satellite data using LSTM networks: an approach for predicting leaf-fall to minimize railroad traffic disruption
【速读】:该论文试图解决铁路交通因落叶导致的中断问题,此类问题每年给英国铁路行业造成超过3亿英镑的损失,而现有的预测落叶时间的方法在可扩展性和可靠性方面存在显著局限。解决方案的关键在于构建一种利用专门预测方法和最新卫星数据源的预测系统,其中基于地面真实落叶数据以及多光谱和气象卫星数据训练的长短期记忆网络(LSTM)在预测落叶开始和结束时间上分别达到了6.32天和9.31天的均方根误差,相较于之前的工作有所改进,为铁路行业优化落叶应对措施及提升对复杂生态系统理解提供了潜在机遇。
链接: https://arxiv.org/abs/2507.11702
作者: Hein de Wilde,Ali Mohammed Mansoor Alsahag,Pierre Blanchet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Railroad traffic disruption as a result of leaf-fall costs the UK rail industry over £300 million per year and measures to mitigate such disruptions are employed on a large scale, with 1.67 million kilometers of track being treated in the UK in 2021 alone. Therefore, the ability to anticipate the timing of leaf-fall would offer substantial benefits for rail network operators, enabling the efficient scheduling of such mitigation measures. However, current methodologies for predicting leaf-fall exhibit considerable limitations in terms of scalability and reliability. This study endeavors to devise a prediction system that leverages specialized prediction methods and the latest satellite data sources to generate both scalable and reliable insights into leaf-fall timings. An LSTM network trained on ground-truth leaf-falling data combined with multispectral and meteorological satellite data demonstrated a root-mean-square error of 6.32 days for predicting the start of leaf-fall and 9.31 days for predicting the end of leaf-fall. The model, which improves upon previous work on the topic, offers promising opportunities for the optimization of leaf mitigation measures in the railway industry and the improvement of our understanding of complex ecological systems.
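下面给出一个用 PyTorch LSTM 把一段多光谱与气象时间序列回归到落叶起始日的最小示意;网络结构、序列长度与随机占位数据均为假设,并非论文的实际模型配置。

```python
# 示意性代码(非论文实现):LSTM 将周度卫星/气象序列回归到落叶起始日(day-of-year)。
import torch
import torch.nn as nn

class LeafFallLSTM(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)          # predicted day-of-year of leaf-fall onset

    def forward(self, x):                         # x: (batch, time, features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)

model = LeafFallLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 52, 8)                        # 52 weekly observations, 8 bands/variables
y = torch.randint(250, 330, (32,)).float()        # placeholder ground-truth onset days

for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("RMSE (days):", loss.sqrt().item())
```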
zh
[AI-42] PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training
【速读】:该论文试图解决时空图神经网络(Spatiotemporal Graph Neural Networks, ST-GNNs)在大规模数据集上训练时面临的内存限制问题。现有框架缺乏对时空模型的支持,并忽视了时空数据的特性。论文提出的解决方案关键在于引入了两种创新策略:索引批处理(index-batching)和分布式索引批处理(distributed-index-batching),通过利用时空结构动态构建快照,显著降低了内存开销,并实现了跨多个GPU的可扩展处理。
链接: https://arxiv.org/abs/2507.11683
作者: Seth Ockerman,Amal Gueroudji,Tanwi Mallick,Yixuan He,Line Pouchard,Robert Ross,Shivaram Venkataraman
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To be published in the 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis
点击查看摘要
Abstract:Spatiotemporal graph neural networks (ST-GNNs) are powerful tools for modeling spatial and temporal data dependencies. However, their applications have been limited primarily to small-scale datasets because of memory constraints. While distributed training offers a solution, current frameworks lack support for spatiotemporal models and overlook the properties of spatiotemporal data. Informed by a scaling study on a large-scale workload, we present PyTorch Geometric Temporal Index (PGT-I), an extension to PyTorch Geometric Temporal that integrates distributed data parallel training and two novel strategies: index-batching and distributed-index-batching. Our index techniques exploit spatiotemporal structure to construct snapshots dynamically at runtime, significantly reducing memory overhead, while distributed-index-batching extends this approach by enabling scalable processing across multiple GPUs. Our techniques enable the first-ever training of an ST-GNN on the entire PeMS dataset without graph partitioning, reducing peak memory usage by up to 89% and achieving up to a 13.1x speedup over standard DDP with 128 GPUs.
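index-batching 的核心想法可以用下面这个极简的 Dataset 说明:完整特征张量只存一份,数据集中仅保存时间索引,取样时再切片构造窗口快照,从而避免预先物化所有快照;这是一个示意写法,并非 PGT-I 的官方实现。

```python
# 示意性代码(非官方实现):用时间索引在运行时动态构造时空快照,降低峰值内存。
import torch
from torch.utils.data import Dataset, DataLoader

class IndexBatchedST(Dataset):
    def __init__(self, features, window):
        self.features = features          # (T, num_nodes, d) stored once
        self.window = window
        self.indices = torch.arange(window, features.shape[0])   # only indices are stored

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        t = int(self.indices[i])
        x = self.features[t - self.window:t]      # snapshot window built lazily at runtime
        y = self.features[t]
        return x, y

features = torch.randn(2000, 325, 3)              # e.g. traffic sensors over time (placeholder)
loader = DataLoader(IndexBatchedST(features, window=12), batch_size=64, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)                           # (64, 12, 325, 3), (64, 325, 3)
```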
zh
[AI-43] Counting Answer Sets of Disjunctive Answer Set Programs
【速读】:该论文试图解决在析取逻辑程序中计数答案集的计算问题,这一问题在概率推理、网络可靠性分析等领域具有重要应用。其解决方案的关键在于提出一种名为SharpASP-SR的新框架,该框架基于减法归约到投影命题模型计数的方法,通过引入答案集的替代表征,实现了高效的归约并确保中间表示保持多项式规模,从而能够利用投影模型计数技术的最新进展。
链接: https://arxiv.org/abs/2507.11655
作者: Mohimenul Kabir,Supratik Chakraborty,Kuldeep S Meel
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)
点击查看摘要
Abstract:Answer Set Programming (ASP) provides a powerful declarative paradigm for knowledge representation and reasoning. Recently, counting answer sets has emerged as an important computational problem with applications in probabilistic reasoning, network reliability analysis, and other domains. This has motivated significant research into designing efficient ASP counters. While substantial progress has been made for normal logic programs, the development of practical counters for disjunctive logic programs remains challenging. We present SharpASP-SR, a novel framework for counting answer sets of disjunctive logic programs based on subtractive reduction to projected propositional model counting. Our approach introduces an alternative characterization of answer sets that enables efficient reduction while ensuring that intermediate representations remain of polynomial size. This allows SharpASP-SR to leverage recent advances in projected model counting technology. Through extensive experimental evaluation on diverse benchmarks, we demonstrate that SharpASP-SR significantly outperforms existing counters on instances with large answer set counts. Building on these results, we develop a hybrid counting approach that combines enumeration techniques with SharpASP-SR to achieve state-of-the-art performance across the full spectrum of disjunctive programs.
zh
[AI-44] Tracing the Path to Grokking: Embeddings, Dropout, and Network Activation
【速读】:该论文试图解决神经网络在训练过程中出现的延迟泛化问题,即Grokking现象,其特征是测试准确率在训练准确率提升之后才显著增加。论文提出的关键解决方案是通过一系列实用的度量指标,包括Dropout下的方差、鲁棒性、嵌入相似性和稀疏性测量,来预测Grokking行为。其中,在模型从记忆到泛化的过渡中获取的Dropout鲁棒性曲线(DRC)用于估计神经网络在推理阶段对噪声的鲁棒性,而测试准确率在Dropout下的方差则在Grokking期间表现出局部最大值,这些指标为理解Grokking的起源和行为提供了重要见解。
链接: https://arxiv.org/abs/2507.11645
作者: Ahmed Salah,David Yevick
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 11 figures
点击查看摘要
Abstract:Grokking refers to delayed generalization in which the increase in test accuracy of a neural network occurs appreciably after the improvement in training accuracy. This paper introduces several practical metrics, including variance under dropout, robustness, embedding similarity, and sparsity measures, that can forecast grokking behavior. Specifically, the resilience of neural networks to noise during inference is estimated from a Dropout Robustness Curve (DRC) obtained from the variation of the accuracy with the dropout rate as the model transitions from memorization to generalization. The variance of the test accuracy under stochastic dropout across training checkpoints further exhibits a local maximum during grokking. Additionally, the percentage of inactive neurons decreases during generalization, while the embeddings tend to a bimodal distribution independent of initialization that correlates with the observed cosine similarity patterns and dataset symmetries. These metrics additionally provide valuable insight into the origin and behaviour of grokking.
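Dropout 鲁棒性曲线(DRC)的计算方式可以用下面的示意函数说明:推理时保持 dropout 打开,扫描不同的 dropout 率并记录平均准确率与方差;模型与数据均为随机占位示例,并非论文的实验设置。

```python
# 示意性代码(非论文实现):在推理阶段扫描 dropout 率,得到 Dropout Robustness Curve。
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def dropout_robustness_curve(model, loader, rates=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), n_repeats=5):
    curve = {}
    dropouts = [m for m in model.modules() if isinstance(m, nn.Dropout)]
    for p in rates:
        for m in dropouts:
            m.p = p
        model.train()                              # keep dropout stochastic at inference time
        accs = []
        with torch.no_grad():
            for _ in range(n_repeats):
                correct = total = 0
                for x, y in loader:
                    pred = model(x).argmax(dim=1)
                    correct += (pred == y).sum().item()
                    total += y.numel()
                accs.append(correct / total)
        mean_acc = sum(accs) / n_repeats
        var_acc = torch.tensor(accs).var(unbiased=False).item()
        curve[p] = (mean_acc, var_acc)             # accuracy and its variance under dropout
    return curve

# minimal demo on random data
model = nn.Sequential(nn.Flatten(), nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 2))
xs, ys = torch.randn(256, 20), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(xs, ys), batch_size=64)
print(dropout_robustness_curve(model, loader))
```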
zh
[AI-45] General Modular Harness for LLM Agents in Multi-Turn Gaming Environments ICML
【速读】:该论文试图解决如何使大语言模型(LLM)或视觉语言模型(VLM)代理在多种多轮游戏环境中具备通用性的问题,而无需进行领域特定的工程设计。解决方案的关键在于提出了一种模块化框架,该框架由感知、记忆和推理组件构成,使得单一LLM或VLM主干能够适应多样化的交互式游戏场景,并通过统一的工作流程分析各模块在不同环境下的性能影响。
链接: https://arxiv.org/abs/2507.11633
作者: Yuxuan Zhang,Haoyang Yu,Lanxiang Hu,Haojian Jin,Hao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, ICML MAS workshop
点击查看摘要
Abstract:We introduce a modular harness design for LLM agents that is composed of perception, memory, and reasoning components, enabling a single LLM or VLM backbone to tackle a wide spectrum of multi-turn gaming environments without domain-specific engineering. Using classic and modern game suites as low-barrier, high-diversity testbeds, our framework provides a unified workflow for analyzing how each module affects performance across dynamic interactive settings. Extensive experiments demonstrate that the harness lifts gameplay performance consistently over un-harnessed baselines and reveals distinct contribution patterns, for example, memory dominates in long-horizon puzzles while perception is critical in vision-noisy arcades. These findings highlight the effectiveness of our modular harness design in advancing general-purpose agents, given the familiarity and ubiquity of games in everyday human experience.
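下面是一个"感知-记忆-推理"三模块 harness 的最小骨架,llm_call 为假设的占位接口;它只用于说明模块化的组织方式,并非论文的实际系统。

```python
# 示意性骨架(非论文实现):perception / memory / reasoning 三模块的多轮游戏 agent 循环。
from collections import deque

def llm_call(prompt: str) -> str:
    """Placeholder for an LLM/VLM backbone call (assumed interface)."""
    return "NOOP"

class Perception:
    def observe(self, raw_obs) -> str:
        return f"Observation: {raw_obs}"           # e.g. caption a frame / serialize game state

class Memory:
    def __init__(self, size=10):
        self.buffer = deque(maxlen=size)
    def add(self, record: str):
        self.buffer.append(record)
    def summary(self) -> str:
        return "\n".join(self.buffer)

class Reasoner:
    def act(self, observation: str, memory_summary: str) -> str:
        prompt = f"{memory_summary}\n{observation}\nNext action:"
        return llm_call(prompt)

def run_episode(env_step, init_obs, max_turns=20):
    """env_step(action) -> (next_obs, done); a generic multi-turn game loop."""
    perception, memory, reasoner = Perception(), Memory(), Reasoner()
    obs = init_obs
    for _ in range(max_turns):
        text_obs = perception.observe(obs)
        action = reasoner.act(text_obs, memory.summary())
        memory.add(f"{text_obs} -> {action}")
        obs, done = env_step(action)
        if done:
            break

run_episode(lambda a: ("terminal state", True), init_obs="start")   # trivial dummy environment
```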
zh
[AI-46] A Roadmap for Climate-Relevant Robotics Research
【速读】:该论文试图解决如何将机器人技术应用于气候变化相关领域的问题,以应对21世纪的关键挑战。其解决方案的关键在于提出一个面向气候相关机器人的研究路线图,强调机器人学者与能源、建筑环境、交通、工业、土地利用和地球科学等领域的专家之间的协作机会。该路线图不仅关注物理机器人的应用,还涵盖了更广泛的机器人工具集,包括规划、感知、控制和估计算法,旨在通过具体且可操作的问题推动跨学科研究与合作。
链接: https://arxiv.org/abs/2507.11623
作者: Alan Papalia,Charles Dawson,Laurentiu L. Anton,Norhan Magdy Bayomi,Bianca Champenois,Jung-Hoon Cho,Levi Cai,Joseph DelPreto,Kristen Edwards,Bilha-Catherine Githinji,Cameron Hickert,Vindula Jayawardana,Matthew Kramer,Shreyaa Raghavan,David Russell,Shide Salimi,Jingnan Shi,Soumya Sudhakar,Yanwei Wang,Shouyi Wang,Luca Carlone,Vijay Kumar,Daniela Rus,John E. Fernandez,Cathy Wu,George Kantor,Derek Young,Hanumant Singh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
点击查看摘要
Abstract:Climate change is one of the defining challenges of the 21st century, and many in the robotics community are looking for ways to contribute. This paper presents a roadmap for climate-relevant robotics research, identifying high-impact opportunities for collaboration between roboticists and experts across climate domains such as energy, the built environment, transportation, industry, land use, and Earth sciences. These applications include problems such as energy systems optimization, construction, precision agriculture, building envelope retrofits, autonomous trucking, and large-scale environmental monitoring. Critically, we include opportunities to apply not only physical robots but also the broader robotics toolkit - including planning, perception, control, and estimation algorithms - to climate-relevant problems. A central goal of this roadmap is to inspire new research directions and collaboration by highlighting specific, actionable problems at the intersection of robotics and climate. This work represents a collaboration between robotics researchers and domain experts in various climate disciplines, and it serves as an invitation to the robotics community to bring their expertise to bear on urgent climate priorities.
zh
[AI-47] HCOMC: A Hierarchical Cooperative On-Ramp Merging Control Framework in Mixed Traffic Environment on Two-Lane Highways ITSC
【速读】:该论文旨在解决高速公路匝道合并区域因交通拥堵和事故频发而形成的瓶颈问题。其解决方案的关键在于提出一种针对双车道高速公路上异质交通流的分层协同匝道合并控制(HCOMC)框架,该框架结合了改进的虚拟车辆模型、基于博弈论的自主变道模型以及多目标优化模型,以实现安全、平稳且高效的车辆合并过程。
链接: https://arxiv.org/abs/2507.11621
作者: Tianyi Wang,Yangyang Wang,Jie Pan,Junfeng Jiao,Christian Claudel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 7 pages, 2 figures, 3 tables, accepted for IEEE International Conference on Intelligent Transportation Systems (ITSC) 2025
点击查看摘要
Abstract:Highway on-ramp merging areas are common bottlenecks to traffic congestion and accidents. Currently, a cooperative control strategy based on connected and automated vehicles (CAVs) is a fundamental solution to this problem. While CAVs are not fully widespread, it is necessary to propose a hierarchical cooperative on-ramp merging control (HCOMC) framework for heterogeneous traffic flow on two-lane highways to address this gap. This paper extends longitudinal car-following models based on the intelligent driver model and lateral lane-changing models using the quintic polynomial curve to account for human-driven vehicles (HDVs) and CAVs, comprehensively considering human factors and cooperative adaptive cruise control. Besides, this paper proposes a HCOMC framework, consisting of a hierarchical cooperative planning model based on the modified virtual vehicle model, a discretionary lane-changing model based on game theory, and a multi-objective optimization model using the elitist non-dominated sorting genetic algorithm to ensure the safe, smooth, and efficient merging process. Then, the performance of our HCOMC is analyzed under different traffic densities and CAV penetration rates through simulation. The findings underscore our HCOMC’s pronounced comprehensive advantages in enhancing the safety of group vehicles, stabilizing and expediting merging process, optimizing traffic efficiency, and economizing fuel consumption compared with benchmarks.
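论文的纵向跟驰部分基于智能驾驶员模型(IDM)扩展。下面给出标准 IDM 加速度公式的实现,参数取常见文献默认值而非论文标定值:

```python
# 示意性代码:标准 IDM 加速度,a = a_max * [1 - (v/v0)^delta - (s*/s)^2],
# 其中期望间距 s* = s0 + v*T + v*dv / (2*sqrt(a_max*b));参数为常见默认值(假设)。
import math

def idm_acceleration(v, delta_v, gap, v0=33.3, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4):
    """v: ego speed (m/s); delta_v: v - v_leader (m/s); gap: bumper-to-bumper distance (m)."""
    s_star = s0 + max(0.0, v * T + v * delta_v / (2 * math.sqrt(a_max * b)))
    return a_max * (1 - (v / v0) ** delta - (s_star / max(gap, 1e-6)) ** 2)

# example: following a slightly slower leader 25 m ahead
print(idm_acceleration(v=25.0, delta_v=2.0, gap=25.0))
```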
zh
[AI-48] Learning Representations of Event Time Series with Sparse Autoencoders for Anomaly Detection, Similarity Search, and Unsupervised Classification ICML
【速读】:该论文试图解决事件时间序列(event time series)在不规则时间间隔下难以通过传统方法提取有意义模式和识别显著现象的问题。其解决方案的关键在于提出新颖的二维和三维张量表示方法,并结合稀疏自编码器(sparse autoencoders)以学习物理上有意义的潜在表示,从而支持多种下游任务,如异常检测、基于相似性的检索、语义聚类和无监督分类。
链接: https://arxiv.org/abs/2507.11620
作者: Steven Dillmann,Juan Rafael Martínez-Galarza
机构: 未知
类目: Machine Learning (cs.LG); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: Accepted at the 2025 ICML Workshop on Machine Learning for Astrophysics, Code available at: this https URL
点击查看摘要
Abstract:Event time series are sequences of discrete events occurring at irregular time intervals, each associated with a domain-specific observational modality. They are common in domains such as high-energy astrophysics, computational social science, cybersecurity, finance, healthcare, neuroscience, and seismology. Their unstructured and irregular structure poses significant challenges for extracting meaningful patterns and identifying salient phenomena using conventional techniques. We propose novel two- and three-dimensional tensor representations for event time series, coupled with sparse autoencoders that learn physically meaningful latent representations. These embeddings support a variety of downstream tasks, including anomaly detection, similarity-based retrieval, semantic clustering, and unsupervised classification. We demonstrate our approach on a real-world dataset from X-ray astronomy, showing that these representations successfully capture temporal and spectral signatures and isolate diverse classes of X-ray transients. Our framework offers a flexible, scalable, and generalizable solution for analyzing complex, irregular event time series across scientific and industrial domains.
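下面用一个带 L1 激活惩罚的稀疏自编码器示意如何从(展平后的)二维事件张量中学习稀疏潜在表征;输入为随机占位数据,网络规模与惩罚系数均为假设。

```python
# 示意性代码(非论文实现):L1 稀疏自编码器学习事件时间序列张量的潜在表征。
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    def __init__(self, in_dim, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

x = torch.rand(128, 64 * 16)                    # e.g. flattened 64x16 time-energy histograms
model, l1_weight = SparseAE(64 * 16), 1e-3
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    recon, z = model(x)
    loss = nn.functional.mse_loss(recon, x) + l1_weight * z.abs().mean()  # sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```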
zh
[AI-49] AI, Humans, and Data Science: Optimizing Roles Across Workflows and the Workforce
【速读】:该论文试图解决人工智能(AI)在科研领域应用所带来的效率与质量提升潜力与其实际应用中可能产生的风险之间的差距问题。其解决方案的关键在于基于Truth, Beauty, and Justice(TBJ)框架评估AI、机器学习和计算模型的有效性和伦理性,并强调在数据科学流程中采用人机协作模式,以确保在VUCA(易变性、不确定性、复杂性和模糊性)决策环境中数据科学家的核心作用得以发挥。论文主张推进AI工具以补充数据科学家的工作,同时倡导持续的方法论培训与理解,以确保研究的实质性价值能够通过有效且伦理的方式被实现。
链接: https://arxiv.org/abs/2507.11597
作者: Richard Timpone,Yongwei Yang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Paper prepared for the 2025 European Survey Research Association Conference; 30 pages, 5 tables and 4 figures
点击查看摘要
Abstract:AI is transforming research. It is being leveraged to construct surveys, synthesize data, conduct analysis, and write summaries of the results. While the promise is to create efficiencies and increase quality, the reality is not always as clear cut. Leveraging our framework of Truth, Beauty, and Justice (TBJ) which we use to evaluate AI, machine learning and computational models for effective and ethical use (Taber and Timpone 1997; Timpone and Yang 2024), we consider the potential and limitation of analytic, generative, and agentic AI to augment data scientists or take on tasks traditionally done by human analysts and researchers. While AI can be leveraged to assist analysts in their tasks, we raise some warnings about push-button automation. Just as earlier eras of survey analysis created some issues when the increased ease of using statistical software allowed researchers to conduct analyses they did not fully understand, the new AI tools may create similar but larger risks. We emphasize a human-machine collaboration perspective (Daugherty and Wilson 2018) throughout the data science workflow and particularly call out the vital role that data scientists play under VUCA decision areas. We conclude by encouraging the advance of AI tools to complement data scientists but advocate for continued training and understanding of methods to ensure the substantive value of research is fully achieved by applying, interpreting, and acting upon results most effectively and ethically.
zh
[AI-50] A Study on the Application of Artificial Intelligence in Ecological Design
【速读】:该论文试图解决人类与自然关系从人类主导向真正互依共存转变的问题,以及人工智能(Artificial Intelligence, AI)在这一转变中的中介作用。其解决方案的关键在于引入一种新的生态设计范式,即AI与非人类生命形式的互动,通过艺术与设计实践中的数据解析、图像识别和生态修复应用,拓展了创造性方法并重构了生态设计的理论与实践。研究提出将强化学习与基于植物的植物修复技术相结合的设计路径,以实现科学洞察、艺术实践与环境管理的融合。
链接: https://arxiv.org/abs/2507.11595
作者: Hengyue Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:This paper asks whether our relationship with nature can move from human dominance to genuine interdependence, and whether artificial intelligence (AI) can mediate that shift. We examine a new ecological-design paradigm in which AI interacts with non-human life forms. Through case studies we show how artists and designers apply AI for data analysis, image recognition, and ecological restoration, producing results that differ from conventional media. We argue that AI not only expands creative methods but also reframes the theory and practice of ecological design. Building on the author’s prototype for AI-assisted water remediation, the study proposes design pathways that couple reinforcement learning with plant-based phytoremediation. The findings highlight AI’s potential to link scientific insight, artistic practice, and environmental stewardship, offering a roadmap for future research on sustainable, technology-enabled ecosystems.
zh
[AI-51] Distribution-Free Uncertainty-Aware Virtual Sensing via Conformalized Neural Operators
【速读】:该论文试图解决深度学习在实时虚拟传感中安全部署的关键障碍——鲁棒的不确定性量化(UQ),尤其是在高风险领域中,稀疏、噪声或非共位传感器数据是常态。其解决方案的关键在于提出一种名为Conformalized Monte Carlo Operator (CMCO) 的框架,该框架通过将蒙特卡洛Dropout与分割共形预测统一在一个DeepONet架构中,实现了无需重新训练、集成或自定义损失函数的空间分辨不确定性估计,从而为操作学习提供了高效且可靠的UQ方法。
链接: https://arxiv.org/abs/2507.11574
作者: Kazuma Kobayashi,Shailesh Garg,Farid Ahmed,Souvik Chakraborty,Syed Bahauddin Alam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Robust uncertainty quantification (UQ) remains a critical barrier to the safe deployment of deep learning in real-time virtual sensing, particularly in high-stakes domains where sparse, noisy, or non-collocated sensor data are the norm. We introduce the Conformalized Monte Carlo Operator (CMCO), a framework that transforms neural operator-based virtual sensing with calibrated, distribution-free prediction intervals. By unifying Monte Carlo dropout with split conformal prediction in a single DeepONet architecture, CMCO achieves spatially resolved uncertainty estimates without retraining, ensembling, or custom loss design. Our method addresses a longstanding challenge: how to endow operator learning with efficient and reliable UQ across heterogeneous domains. Through rigorous evaluation on three distinct applications (turbulent flow, elastoplastic deformation, and global cosmic radiation dose estimation), CMCO consistently attains near-nominal empirical coverage, even in settings with strong spatial gradients and proxy-based sensing. This breakthrough offers a general-purpose, plug-and-play UQ solution for neural operators, unlocking real-time, trustworthy inference in digital twins, sensor fusion, and safety-critical monitoring. By bridging theory and deployment with minimal computational overhead, CMCO establishes a new foundation for scalable, generalizable, and uncertainty-aware scientific machine learning.
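把 MC dropout 与 split conformal 校准组合起来得到分布无关预测区间的基本做法,可以用下面的示意代码说明;这只是按摘要思路写的简化版本(用普通 MLP 代替 DeepONet),并非 CMCO 的官方实现。

```python
# 示意性代码(非官方实现):MC dropout 均值预测 + split conformal 区间。
import math
import torch
import torch.nn as nn

def mc_dropout_mean(model, x, n_samples=50):
    model.train()                                          # keep dropout stochastic
    with torch.no_grad():
        preds = torch.stack([model(x).squeeze(-1) for _ in range(n_samples)])
    return preds.mean(0)

def conformal_interval(model, x_cal, y_cal, x_test, alpha=0.1):
    mu_cal = mc_dropout_mean(model, x_cal)
    scores = (y_cal - mu_cal).abs()                        # nonconformity scores on calibration set
    n = scores.numel()
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    qhat = scores.sort().values[k - 1]                     # conformal quantile
    mu_test = mc_dropout_mean(model, x_test)
    return mu_test - qhat, mu_test + qhat                  # marginal coverage >= 1 - alpha

model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 1))
x_cal, y_cal = torch.randn(200, 5), torch.randn(200)
lo, hi = conformal_interval(model, x_cal, y_cal, torch.randn(10, 5))
print(lo[:3], hi[:3])
```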
zh
[AI-52] SurgeryLSTM: A Time-Aware Neural Model for Accurate and Explainable Length of Stay Prediction After Spine Surgery
【速读】:该论文试图解决择期脊柱手术患者住院时间(Length of Stay, LOS)预测的问题,旨在通过机器学习模型提高预测准确性并增强模型的可解释性。其解决方案的关键在于开发了一种基于掩码双向长短期记忆网络(BiLSTM)与注意力机制的模型——SurgeryLSTM,该模型通过捕捉术前临床序列中的时间依赖性特征,提升了预测性能,并借助注意力机制实现了对关键预测因素的动态识别,从而增强了模型的可解释性。
链接: https://arxiv.org/abs/2507.11570
作者: Ha Na Cho,Sairam Sutari,Alexander Lopez,Hansen Bow,Kai Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Objective: To develop and evaluate machine learning (ML) models for predicting length of stay (LOS) in elective spine surgery, with a focus on the benefits of temporal modeling and model interpretability. Materials and Methods: We compared traditional ML models (e.g., linear regression, random forest, support vector machine (SVM), and XGBoost) with our developed model, SurgeryLSTM, a masked bidirectional long short-term memory (BiLSTM) with an attention mechanism, using structured perioperative electronic health records (EHR) data. Performance was evaluated using the coefficient of determination (R^2), and key predictors were identified using explainable AI. Results: SurgeryLSTM achieved the highest predictive accuracy (R^2 = 0.86), outperforming XGBoost (R^2 = 0.85) and baseline models. The attention mechanism improved interpretability by dynamically identifying influential temporal segments within preoperative clinical sequences, allowing clinicians to trace which events or features most contributed to each LOS prediction. Key predictors of LOS included bone disorder, chronic kidney disease, and lumbar fusion, which were identified as the most impactful features. Discussion: Temporal modeling with attention mechanisms significantly improves LOS prediction by capturing the sequential nature of patient data. Unlike static models, SurgeryLSTM provides both higher accuracy and greater interpretability, which are critical for clinical adoption. These results highlight the potential of integrating attention-based temporal models into hospital planning workflows. Conclusion: SurgeryLSTM presents an effective and interpretable AI solution for LOS prediction in elective spine surgery. Our findings support the integration of temporal, explainable ML approaches into clinical decision support systems to enhance discharge readiness and individualized patient care.
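下面给出一个"双向 LSTM + 注意力池化"回归网络的最小 PyTorch 骨架,并演示如何用掩码处理填充的时间步;特征维度与数据均为随机占位,仅示意 SurgeryLSTM 这类结构的基本形态,并非论文模型本身。

```python
# 示意性代码(非论文实现):masked BiLSTM + attention 池化做 LOS 回归。
import torch
import torch.nn as nn

class BiLSTMAttnRegressor(nn.Module):
    def __init__(self, n_features=20, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x, mask=None):                  # x: (batch, time, features); mask: (batch, time)
        h, _ = self.lstm(x)                           # (batch, time, 2*hidden)
        scores = self.attn(h).squeeze(-1)             # attention scores over time steps
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        w = torch.softmax(scores, dim=1).unsqueeze(-1)
        pooled = (w * h).sum(dim=1)
        return self.head(pooled).squeeze(-1), w.squeeze(-1)

model = BiLSTMAttnRegressor()
x = torch.randn(16, 30, 20)
mask = torch.ones(16, 30, dtype=torch.bool)
mask[:, 25:] = False                                  # masked (padded) time steps
los_pred, attn_weights = model(x, mask)
print(los_pred.shape, attn_weights.shape)             # attention weights indicate influential steps
```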
zh
[AI-53] Emergent Heterogeneous Swarm Control Through Hebbian Learning
【速读】:该论文试图解决在群体机器人系统中实现异质性控制的多个关键挑战,包括微宏观问题、维度灾难以及对先验知识的高依赖性。其解决方案的关键在于引入赫布学习(Hebbian learning),这是一种基于局部信息的生物启发式神经适应方法。通过赫布学习,异质性能够自动涌现,从而实现群体层面的行为切换,并显著提升群体能力,同时避免了传统方法所需的复杂参数调整和大量先验知识。
链接: https://arxiv.org/abs/2507.11566
作者: Fuda van Diggelen,Tugay Alperen Karagüzel,Andres Garcia Rincon,A.E. Eiben,Dario Floreano,Eliseo Ferrante
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:In this paper, we introduce Hebbian learning as a novel method for swarm robotics, enabling the automatic emergence of heterogeneity. Hebbian learning presents a biologically inspired form of neural adaptation that solely relies on local information. By doing so, we resolve several major challenges for learning heterogeneous control: 1) Hebbian learning removes the complexity of attributing emergent phenomena to single agents through local learning rules, thus circumventing the micro-macro problem; 2) uniform Hebbian learning rules across all swarm members limit the number of parameters needed, mitigating the curse of dimensionality with scaling swarm sizes; and 3) evolving Hebbian learning rules based on swarm-level behaviour minimises the need for extensive prior knowledge typically required for optimising heterogeneous swarms. This work demonstrates that with Hebbian learning heterogeneity naturally emerges, resulting in swarm-level behavioural switching and in significantly improved swarm capabilities. It also demonstrates how the evolution of Hebbian learning rules can be a valid alternative to Multi Agent Reinforcement Learning in standard benchmarking tasks.
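广义 Hebbian 规则的一种常见参数化是 Δw = η(A·pre·post + B·pre + C·post + D),其中 (A, B, C, D) 即可交由进化来优化的学习规则参数。下面是基于这一假设形式的单层示意实现,并非论文的群体机器人控制器:

```python
# 示意性代码(假设的规则形式,非论文实现):仅依赖局部活动的 Hebbian 权重更新。
import numpy as np

def hebbian_update(w, pre, post, rule, eta=0.01):
    """w: (n_out, n_in); pre: (n_in,); post: (n_out,); rule: (A, B, C, D)."""
    A, B, C, D = rule
    dw = A * np.outer(post, pre) + B * pre[None, :] + C * post[:, None] + D
    return w + eta * dw

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(4, 8))                # a tiny controller layer
rule = (1.0, 0.1, -0.1, 0.0)                          # placeholder rule parameters
for _ in range(100):                                  # online adaptation from local activity only
    pre = rng.random(8)
    post = np.tanh(w @ pre)
    w = hebbian_update(w, pre, post, rule)
print(w.round(2))
```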
zh
[AI-54] A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing
【速读】:该论文旨在解决工业互联网物联网(IIoT)环境中生成式人工智能(AIGC)任务在计算密集型和低延迟需求下的挑战,特别是在传统基于云计算的生成模型难以满足实时性要求的情况下。其解决方案的关键在于提出一种面向IIoT边缘计算环境的AIGC任务卸载框架,该框架通过多智能体协同方式将动态AIGC任务卸载至部署不同生成模型的最优边缘服务器,并采用基于多智能体深度确定性策略梯度(MADDPG-MATO)的模型感知任务卸载算法,以最小化延迟和能耗。实验结果表明,该算法在多种模型数量条件下均表现出优越的性能,具有较高的鲁棒性和效率。
链接: https://arxiv.org/abs/2507.11560
作者: Xin Wang,Xiao Huan Li,Xun Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, accepted by ICCC 2025
点击查看摘要
Abstract:The integration of the Industrial Internet of Things (IIoT) with Artificial Intelligence-Generated Content (AIGC) offers new opportunities for smart manufacturing, but it also introduces challenges related to computation-intensive tasks and low-latency demands. Traditional generative models based on cloud computing struggle to meet the real-time requirements of AIGC tasks in IIoT environments, and edge computing can effectively reduce latency through task offloading. However, the dynamic nature of AIGC tasks, model switching delays, and resource constraints impose higher demands on edge computing environments. To address these challenges, this paper proposes an AIGC task offloading framework tailored for IIoT edge computing environments, considering the latency and energy consumption caused by AIGC model switching for the first time. IIoT devices act as multiple agents that collaboratively offload their dynamic AIGC tasks to the most appropriate edge servers deployed with different generative models. A model-aware AIGC task offloading algorithm based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG-MATO) is devised to minimize the latency and energy. Experimental results show that MADDPG-MATO outperforms baseline algorithms, achieving an average reduction of 6.98% in latency, 7.12% in energy consumption, and a 3.72% increase in task completion rate across four sets of experiments with model numbers ranging from 3 to 6, demonstrating that the proposed algorithm is robust and efficient in dynamic, high-load IIoT environments.
zh
[AI-55] A Review of Generative AI in Computer Science Education: Challenges and Opportunities in Accuracy, Authenticity, and Assessment
【速读】:该论文试图解决生成式AI(Generative AI)在计算机科学教育中应用所带来的准确性、真实性及评估方面的挑战。其解决方案的关键在于通过混合评估模型结合人工智能与人工评价,开发偏差检测框架,并提升学生和教师的AI素养,从而实现伦理、教学法和技术因素之间的平衡。
链接: https://arxiv.org/abs/2507.11543
作者: Iman Reihanian,Yunfei Hou,Yu Chen,Yifei Zheng
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted for presentation at The 2024 International Conference on Computational Science and Computational Intelligence (CSCI), Research Track on Education. To appear in Springer Lecture Notes in Computer Science (LNCS) proceedings, expected July 2025
点击查看摘要
Abstract:This paper surveys the use of Generative AI tools, such as ChatGPT and Claude, in computer science education, focusing on key aspects of accuracy, authenticity, and assessment. Through a literature review, we highlight both the challenges and opportunities these AI tools present. While Generative AI improves efficiency and supports creative student work, it raises concerns such as AI hallucinations, error propagation, bias, and blurred lines between AI-assisted and student-authored content. Human oversight is crucial for addressing these concerns. Existing literature recommends adopting hybrid assessment models that combine AI with human evaluation, developing bias detection frameworks, and promoting AI literacy for both students and educators. Our findings suggest that the successful integration of AI requires a balanced approach, considering ethical, pedagogical, and technical factors. Future research may explore enhancing AI accuracy, preserving academic integrity, and developing adaptive models that balance creativity with precision.
zh
[AI-56] BenchRL-QAS: Benchmarking reinforcement learning algorithms for quantum architecture search
【速读】:该论文试图解决在量子架构搜索(Quantum Architecture Search, QAS)中对强化学习(Reinforcement Learning, RL)算法进行系统评估的问题,旨在为不同规模的量子任务(2-8量子比特)提供统一的基准测试框架。其解决方案的关键在于提出了一种加权排名指标,该指标综合平衡了准确性、电路深度、门数和计算效率,从而实现了对多种RL代理(包括基于价值的方法和策略梯度方法)在代表性量子问题上的公平且全面的比较。
链接: https://arxiv.org/abs/2507.12189
作者: Azhar Ikhtiarudin,Aditi Das,Param Thakkar,Akash Kundu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: Comprehensive RL agent benchmark for QAS. Contributions are welcomed here: this https URL
点击查看摘要
Abstract:We introduce BenchRL-QAS, a unified benchmarking framework for systematically evaluating reinforcement learning (RL) algorithms in quantum architecture search (QAS) across diverse variational quantum algorithm tasks and system sizes ranging from 2- to 8-qubit. Our study benchmarks nine RL agents including both value-based and policy-gradient methods on representative quantum problems such as variational quantum eigensolver, variational quantum state diagonalization, quantum classification, and state preparation, spanning both noiseless and realistic noisy regimes. We propose a weighted ranking metric that balances accuracy, circuit depth, gate count, and computational efficiency, enabling fair and comprehensive comparison. Our results first reveal that RL-based quantum classifier outperforms baseline variational classifiers. Then we conclude that no single RL algorithm is universally optimal when considering a set of QAS tasks; algorithmic performance is highly context-dependent, varying with task structure, qubit count, and noise. This empirical finding provides strong evidence for the “no free lunch” principle in RL-based quantum circuit design and highlights the necessity of tailored algorithm selection and systematic benchmarking for advancing quantum circuit synthesis. This work represents the most comprehensive RL-QAS benchmarking effort to date, and BenchRL-QAS along with all experimental data are made publicly available to support reproducibility and future research this https URL.
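摘要中提到的加权排名指标,一种可能的实现方式如下:先把各指标归一化到 [0, 1](代价型指标取反),再按权重求和;权重与示例数值均为假设,仅说明"平衡准确率、电路深度、门数与耗时"的打分思路。

```python
# 示意性代码(非论文实现):对多个 RL agent 的 (accuracy, depth, gates, runtime) 做加权排名。
import numpy as np

def weighted_rank_scores(results, w=(0.4, 0.2, 0.2, 0.2)):
    """results: dict agent -> (accuracy, depth, gate_count, runtime); higher score is better."""
    names = list(results)
    m = np.array([results[a] for a in names], dtype=float)
    span = m.max(0) - m.min(0) + 1e-12
    norm = (m - m.min(0)) / span                      # min-max normalize each metric
    norm[:, 1:] = 1.0 - norm[:, 1:]                   # depth, gates, runtime are costs
    scores = norm @ np.array(w)
    return dict(sorted(zip(names, scores), key=lambda kv: -kv[1]))

results = {"DQN": (0.95, 12, 30, 120.0), "PPO": (0.97, 18, 44, 300.0), "A2C": (0.90, 9, 25, 80.0)}
print(weighted_rank_scores(results))
```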
zh
[AI-57] Quantum Machine Learning in Multi-Qubit Phase-Space Part I: Foundations
【速读】:该论文试图解决量子机器学习(Quantum Machine Learning, QML)在经典模拟中因希尔伯特空间指数增长而导致的实践限制问题。其解决方案的关键在于构建一种基于相空间的闭合、可组合的动力学形式化方法,该方法通过将泡利群的算子代数替换为辛流形上的函数动力学,从而将维度灾难转化为与量子比特数量线性相关的谐波支持问题。
链接: https://arxiv.org/abs/2507.12117
作者: Timothy Heightman,Edward Jiang,Ruth Mora-Soto,Maciej Lewenstein,Marcin Płodzień
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
备注:
点击查看摘要
Abstract:Quantum machine learning (QML) seeks to exploit the intrinsic properties of quantum mechanical systems, including superposition, coherence, and quantum entanglement for classical data processing. However, due to the exponential growth of the Hilbert space, QML faces practical limits in classical simulations with the state-vector representation of quantum system. On the other hand, phase-space methods offer an alternative by encoding quantum states as quasi-probability functions. Building on prior work in qubit phase-space and the Stratonovich-Weyl (SW) correspondence, we construct a closed, composable dynamical formalism for one- and many-qubit systems in phase-space. This formalism replaces the operator algebra of the Pauli group with function dynamics on symplectic manifolds, and recasts the curse of dimensionality in terms of harmonic support on a domain that scales linearly with the number of qubits. It opens a new route for QML based on variational modelling over phase-space.
zh
[AI-58] Fragment size density estimator for shrinkage-induced fracture based on a physics-informed neural network
【速读】:该论文试图解决由收缩引起的断裂建模中的计算成本过高问题,其核心挑战在于求解积分微分方程时的传统数值方法(如有限差分法)计算效率低。解决方案的关键在于提出一种基于神经网络(Neural Network, NN)的求解器,该方法直接将输入参数映射到相应的概率密度函数,而无需数值求解控制方程,从而显著降低了计算成本,并在蒙特卡洛模拟中实现了高效且准确的密度函数评估。
链接: https://arxiv.org/abs/2507.11799
作者: Shin-ichi Ito
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents a neural network (NN)-based solver for an integro-differential equation that models shrinkage-induced fragmentation. The proposed method directly maps input parameters to the corresponding probability density function without numerically solving the governing equation, thereby significantly reducing computational costs. Specifically, it enables efficient evaluation of the density function in Monte Carlo simulations while maintaining accuracy comparable to or even exceeding that of conventional finite difference schemes. Validation on synthetic data demonstrates both the method’s computational efficiency and predictive reliability. This study establishes a foundation for the data-driven inverse analysis of fragmentation and suggests the potential for extending the framework beyond pre-specified model structures.
zh
[AI-59] Foundation Models for Brain Signals: A Critical Review of Current Progress and Future Directions
【速读】:该论文试图解决早期生成式脑电(EEG)基础模型(EEG-FMs)在实际应用中的准备度不明确以及长期研究进展评估标准缺失的问题。其解决方案的关键在于对首批10个EEG-FMs进行系统性综述,分析其方法论、实证结果及研究缺口,强调当前模型多采用基于Transformer的序列建模框架,并通过掩码序列重建实现自监督学习,但模型评估仍存在异质性和局限性,未来需加强标准化评估、扩大模型规模效应,并在EEG表征学习全流程中做出更合理和可信的选择。
链接: https://arxiv.org/abs/2507.11783
作者: Gayal Kuruppu,Neeraj Wagh,Yogatheesan Varatharajah
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 20 pages, 5 figures, 2 tables
点击查看摘要
Abstract:Patterns of electrical brain activity recorded via electroencephalography (EEG) offer immense value for scientific and clinical investigations. The inability of supervised EEG encoders to learn robust EEG patterns and their over-reliance on expensive signal annotations have sparked a transition towards general-purpose self-supervised EEG encoders, i.e., EEG foundation models (EEG-FMs), for robust and scalable EEG feature extraction. However, the real-world readiness of early EEG-FMs and the rubric for long-term research progress remain unclear. A systematic and comprehensive review of first-generation EEG-FMs is therefore necessary to understand the current state-of-the-art and identify key directions for future EEG-FMs. To that end, this study reviews 10 early EEG-FMs and presents a critical synthesis of their methodology, empirical findings, and outstanding research gaps. We find that most EEG-FMs adopt a sequence-based modeling scheme that relies on transformer-based backbones and the reconstruction of masked sequences for self-supervision. However, model evaluations remain heterogeneous and largely limited, making it challenging to assess their practical off-the-shelf utility. In addition to adopting standardized and realistic evaluations, future work should demonstrate more substantial scaling effects and make principled and trustworthy choices throughout the EEG representation learning pipeline. We believe that developing benchmarks, software tools, technical methodologies, and applications in collaboration with domain experts may further advance the translational utility and real-world adoption of EEG-FMs.
zh
[AI-60] Galaxy image simplification using Generative AI
【速读】:该论文试图解决大规模星系图像分析中人工标注效率低和分类受限的问题,旨在通过自动化手段提升对海量星系图像的分析能力。其解决方案的关键在于利用生成式 AI (Generative AI) 对星系图像进行简化,并自动转换为“骨架化”形式,从而实现不依赖预定义类别的精准形状测量与分析。
链接: https://arxiv.org/abs/2507.11692
作者: Sai Teja Erukude,Lior Shamir
机构: 未知
类目: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Astronomy and Computing, accepted
点击查看摘要
Abstract:Modern digital sky surveys have been acquiring images of billions of galaxies. While these images often provide sufficient details to analyze the shape of the galaxies, accurate analysis of such high volumes of images requires effective automation. Current solutions often rely on machine learning annotation of the galaxy images based on a set of pre-defined classes. Here we introduce a new approach to galaxy image analysis that is based on generative AI. The method simplifies the galaxy images and automatically converts them into a "skeletonized" form. The simplified images allow accurate measurements of the galaxy shapes and analysis that is not limited to a certain pre-defined set of classes. We demonstrate the method by applying it to galaxy images acquired by the DESI Legacy Survey. The code and data are publicly available. The method was applied to 125,000 DESI Legacy Survey images, and the catalog of the simplified images is publicly available.
zh
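作为补充,下面给出一段极简示意代码,用经典的形态学骨架化(scikit-image)粗略演示“把星系图像简化为骨架形式”这一思路;这只是示意性的替代做法,并非上文论文所用的生成式 AI 方法,其中的阈值化方式与玩具图像均为本文假设。

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize

def simple_skeletonize(galaxy_image: np.ndarray) -> np.ndarray:
    """Reduce a grayscale galaxy image to a 1-pixel-wide skeleton.

    Classical stand-in for the paper's generative approach: threshold the
    image (Otsu, assumed adequate here), then apply morphological skeletonization.
    """
    mask = galaxy_image > threshold_otsu(galaxy_image)
    return skeletonize(mask)

if __name__ == "__main__":
    # Synthetic toy "galaxy": an elongated bright blob on a dark background
    yy, xx = np.mgrid[0:128, 0:128]
    toy = np.exp(-(((xx - 64) / 30.0) ** 2 + ((yy - 64) / 10.0) ** 2))
    skeleton = simple_skeletonize(toy)
    print("skeleton pixels:", int(skeleton.sum()))
```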
[AI-61] JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs
【速读】:该论文试图解决语音质量评估(Speech Quality Assessment, SQA)中因感知差异和实验设计差异导致的均值意见分数(MOS)固有方差高的问题,以及现有方法未能有效将感知因素纳入学习算法中的问题。解决方案的关键在于提出JSQA框架,该框架通过两个阶段进行优化:首先利用仅可察觉差异(JND)对进行感知引导的对比学习预训练音频编码器,以捕捉感知质量相似性信息;其次在NISQA数据集上微调模型以实现MOS预测。实验结果表明,基于感知启发的对比预训练显著提升了模型性能。
链接: https://arxiv.org/abs/2507.11636
作者: Junyi Fan,Donald Williamson
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to WASPAA 2025
点击查看摘要
Abstract:Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. Learning such a mapping is challenging for many reasons, but largely because MOS exhibits high levels of inherent variance due to perceptual and experimental-design differences. Many solutions have been proposed, but many approaches do not properly incorporate perceptual factors into their learning algorithms (beyond the MOS label), which could lead to unsatisfactory results. To this end, we propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction. We first generate pairs of audio data within JND levels, which are then used to pretrain an encoder to leverage perceptual quality similarity information and map it into an embedding space. The JND pairs come from clean LibriSpeech utterances that are mixed with background noise from CHiME-3, at different signal-to-noise ratios (SNRs). The encoder is later fine-tuned with audio samples from the NISQA dataset for MOS prediction. Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves the model performance evaluated by various metrics when compared against the same network trained from scratch without pretraining. These findings suggest that incorporating perceptual factors into pretraining greatly contributes to the improvement in performance for SQA.
zh
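下面是对 JSQA 中“基于 JND 配对的对比预训练”阶段的一个极简 PyTorch 示意:按不同 SNR 混入噪声构造配对音频,并用余弦相似度损失拉近配对的编码表示。编码器结构、SNR 取值与损失形式均为本文假设,仅用于说明流程,并非论文官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix noise into a clean waveform at a given signal-to-noise ratio (dB)."""
    clean_pow = clean.pow(2).mean()
    noise_pow = noise.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

class TinyEncoder(nn.Module):
    """A deliberately small 1-D conv encoder standing in for the audio encoder."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 9, 4), nn.ReLU(),
            nn.Conv1d(16, 32, 9, 4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, samples)
        return F.normalize(self.net(x.unsqueeze(1)), dim=-1)

# One toy pretraining step: two mixtures whose SNRs differ by less than an
# assumed JND should receive similar embeddings.
encoder = TinyEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
clean = torch.randn(8, 16000)        # stand-in for clean LibriSpeech utterances
noise = torch.randn(8, 16000)        # stand-in for CHiME-3 background noise
pair_a = mix_at_snr(clean, noise, snr_db=20.0)
pair_b = mix_at_snr(clean, noise, snr_db=19.0)
loss = 1.0 - F.cosine_similarity(encoder(pair_a), encoder(pair_b)).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"contrastive loss after one step: {loss.item():.4f}")
```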
[AI-62] SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics ICML2024
【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)数据建模的挑战,即如何从包含大量细胞的组织切片中提取多尺度信息,并整合宏观层面的组织形态、微观层面的细胞微环境以及基因层面的基因表达谱。解决方案的关键在于提出一种多尺度的空间转录组基础模型SToFM,该模型通过多尺度信息提取构建包含多尺度信息的ST子切片,并利用SE(2) Transformer获取高质量的细胞表示,同时构建了最大规模的高分辨率空间转录组语料库SToCorpus-88M用于预训练。
链接: https://arxiv.org/abs/2507.11588
作者: Suyuan Zhao,Yizhen Luo,Ganbo Yang,Yan Zhong,Hao Zhou,Zaiqing Nie
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ICML 2024
点击查看摘要
Abstract:Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving spatial context of cells. Building foundational models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironment, and gene-scale gene expression profile. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice, to construct a set of ST sub-slices that aggregate macro-, micro- and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct SToCorpus-88M, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data.
zh
机器学习
[LG-0] Cost-aware Stopping for Bayesian Optimization
链接: https://arxiv.org/abs/2507.12453
作者: Qian Xie,Linda Cai,Alexander Terenin,Peter I. Frazier,Ziv Scully
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora’s Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function consistently matches or outperforms other acquisition-function–stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.
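下面用一段玩具 Python 代码示意“成本感知停止”的一般形式:当最优候选的(估计)单位成本收益低于给定容忍度时停止评估。其中的乐观上界、成本模型与容忍度均为占位假设,并非论文中 PBGI 采集函数或其理论保证的停止准则的实现。

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x: float) -> float:
    """Expensive black-box function (toy stand-in, e.g. a validation score)."""
    return float(np.sin(3 * x) + 0.05 * rng.normal())

def evaluation_cost(x: float) -> float:
    """Heterogeneous, known evaluation cost (an assumption of this sketch)."""
    return 1.0 + 2.0 * x

candidates = np.linspace(0.0, 1.0, 21)
costs = np.array([evaluation_cost(x) for x in candidates])
# Crude optimistic upper bound per candidate; a real method would use a GP posterior.
upper_bound = np.full_like(candidates, 1.5)

best_value, total_cost, tolerance = -np.inf, 0.0, 0.05
while True:
    gain_per_cost = np.maximum(upper_bound - max(best_value, 0.0), 0.0) / costs
    i = int(np.argmax(gain_per_cost))
    # Cost-aware stopping: stop when even the most promising candidate's
    # plausible improvement per unit cost falls below the tolerance.
    if gain_per_cost[i] < tolerance:
        break
    y = objective(float(candidates[i]))
    best_value = max(best_value, y)
    total_cost += costs[i]
    upper_bound[i] = y        # nothing more to hope for from this exact point

print(f"best value {best_value:.3f} at total evaluation cost {total_cost:.2f}")
```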
[LG-1] A Bayesian Incentive Mechanism for Poison-Resilient Federated Learning
链接: https://arxiv.org/abs/2507.12439
作者: Daniel Commey,Rebecca A. Sarpong,Griffith S. Klogo,Winful Bagyl-Bac,Garth V. Crosby
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:Federated learning (FL) enables collaborative model training across decentralized clients while preserving data privacy. However, its open-participation nature exposes it to data-poisoning attacks, in which malicious actors submit corrupted model updates to degrade the global model. Existing defenses are often reactive, relying on statistical aggregation rules that can be computationally expensive and that typically assume an honest majority. This paper introduces a proactive, economic defense: a lightweight Bayesian incentive mechanism that makes malicious behavior economically irrational. Each training round is modeled as a Bayesian game of incomplete information in which the server, acting as the principal, uses a small, private validation dataset to verify update quality before issuing payments. The design satisfies Individual Rationality (IR) for benevolent clients, ensuring their participation is profitable, and Incentive Compatibility (IC), making poisoning an economically dominated strategy. Extensive experiments on non-IID partitions of MNIST and FashionMNIST demonstrate robustness: with 50% label-flipping adversaries on MNIST, the mechanism maintains 96.7% accuracy, only 0.3 percentage points lower than in a scenario with 30% label-flipping adversaries. This outcome is 51.7 percentage points better than standard FedAvg, which collapses under the same 50% attack. The mechanism is computationally light, budget-bounded, and readily integrates into existing FL frameworks, offering a practical route to economically robust and sustainable FL ecosystems.
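下面给出一个极简示意,演示“服务器用小型私有验证集核验客户端更新、据此付费并只聚合通过核验的更新”这一经济学防御思路;线性模型、验证阈值与支付数额均为假设,并非论文贝叶斯机制的精确复现。

```python
import numpy as np

rng = np.random.default_rng(1)

def val_loss(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error of a linear model on the server's private validation set."""
    return float(np.mean((X @ w - y) ** 2))

# Small private validation set held by the server (the principal)
d = 5
w_true = rng.normal(size=d)
X_val = rng.normal(size=(64, d))
y_val = X_val @ w_true + 0.1 * rng.normal(size=64)

w_global = np.zeros(d)
payment, threshold = 1.0, 0.0          # assumed payment per accepted update

# Simulated client updates: four honest clients move toward w_true, one poisons
updates = [0.3 * (w_true - w_global) + 0.05 * rng.normal(size=d) for _ in range(4)]
updates.append(-0.5 * w_true)          # label-flipping-style malicious update

base = val_loss(w_global, X_val, y_val)
accepted, payouts = [], 0.0
for k, delta in enumerate(updates):
    improvement = base - val_loss(w_global + delta, X_val, y_val)
    paid = improvement > threshold      # verify update quality before paying
    if paid:
        accepted.append(delta)
        payouts += payment
    print(f"client {k}: improvement={improvement:+.3f}  paid={paid}")

if accepted:                            # aggregate only verified updates
    w_global = w_global + np.mean(accepted, axis=0)
print(f"total payout={payouts:.1f}, new validation loss={val_loss(w_global, X_val, y_val):.3f}")
```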
[LG-2] Targeted Deep Architectures: A TMLE-Based Framework for Robust Causal Inference in Neural Networks
链接: https://arxiv.org/abs/2507.12435
作者: Yi Li,David Mccoy,Nolan Gunter,Kaitlyn Lee,Alejandro Schuler,Mark van der Laan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Modern deep neural networks are powerful predictive tools yet often lack valid inference for causal parameters, such as treatment effects or entire survival curves. While frameworks like Double Machine Learning (DML) and Targeted Maximum Likelihood Estimation (TMLE) can debias machine-learning fits, existing neural implementations either rely on “targeted losses” that do not guarantee solving the efficient influence function equation or computationally expensive post-hoc “fluctuations” for multi-parameter settings. We propose Targeted Deep Architectures (TDA), a new framework that embeds TMLE directly into the network’s parameter space with no restrictions on the backbone architecture. Specifically, TDA partitions model parameters - freezing all but a small “targeting” subset - and iteratively updates them along a targeting gradient, derived from projecting the influence functions onto the span of the gradients of the loss with respect to weights. This procedure yields plug-in estimates that remove first-order bias and produce asymptotically valid confidence intervals. Crucially, TDA easily extends to multi-dimensional causal estimands (e.g., entire survival curves) by merging separate targeting gradients into a single universal targeting update. Theoretically, TDA inherits classical TMLE properties, including double robustness and semiparametric efficiency. Empirically, on the benchmark IHDP dataset (average treatment effects) and simulated survival data with informative censoring, TDA reduces bias and improves coverage relative to both standard neural-network estimators and prior post-hoc approaches. In doing so, TDA establishes a direct, scalable pathway toward rigorous causal inference within modern deep architectures for complex multi-parameter targets.
[LG-3] ROC-n-reroll: How verifier imperfection affects test-time scaling
链接: https://arxiv.org/abs/2507.12399
作者: Florian E. Dorner,Yatong Chen,André F. Cruz,Fanny Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 9 Figures
点击查看摘要
Abstract:Test-time scaling aims to improve language model performance by leveraging additional compute during inference. While many works have empirically studied techniques like Best-of-N (BoN) and rejection sampling that make use of a verifier to enable test-time scaling, there is little theoretical understanding of how verifier imperfection affects performance. In this work, we address this gap. Specifically, we prove how instance-level accuracy of these methods is precisely characterized by the geometry of the verifier’s ROC curve. Interestingly, while scaling is determined by the local geometry of the ROC curve for rejection sampling, it depends on global properties of the ROC curve for BoN. As a consequence when the ROC curve is unknown, it is impossible to extrapolate the performance of rejection sampling based on the low-compute regime. Furthermore, while rejection sampling outperforms BoN for fixed compute, in the infinite-compute limit both methods converge to the same level of accuracy, determined by the slope of the ROC curve near the origin. Our theoretical results are confirmed by experiments on GSM8K using different versions of Llama and Qwen to generate and verify solutions.
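下面用随机占位的“生成器/验证器”给出 Best-of-N 与拒绝采样两种测试时扩展策略的极简对照示意,说明验证器判分如何进入两种选择流程;其中的基础正确率、判分分布与阈值均为假设值。

```python
import random

random.seed(0)

def generate_solution() -> bool:
    """Toy generator: returns whether the sampled candidate is truly correct."""
    return random.random() < 0.3              # assumed 30% base accuracy

def verifier_score(correct: bool) -> float:
    """Imperfect verifier: correct answers tend to score higher, with overlap."""
    return random.gauss(0.7 if correct else 0.4, 0.15)

def best_of_n(n: int) -> bool:
    candidates = [generate_solution() for _ in range(n)]
    return max(candidates, key=verifier_score)    # keep the highest-scoring candidate

def rejection_sampling(threshold: float = 0.6, budget: int = 32) -> bool:
    correct = False
    for _ in range(budget):
        correct = generate_solution()
        if verifier_score(correct) >= threshold:  # accept the first candidate above threshold
            break
    return correct

trials = 2000
for n in (1, 4, 16):
    acc = sum(best_of_n(n) for _ in range(trials)) / trials
    print(f"Best-of-{n:<2d} accuracy ~ {acc:.3f}")
acc_rs = sum(rejection_sampling() for _ in range(trials)) / trials
print(f"Rejection sampling accuracy ~ {acc_rs:.3f}")
```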
[LG-4] Trustworthy Tree-based Machine Learning by MoS_2 Flash-based Analog CAM with Inherent Soft Boundaries
链接: https://arxiv.org/abs/2507.12384
作者: Bo Wen,Guoyun Gao,Zhicheng Xu,Ruibin Mao,Xiaojuan Qi,X. Sharon Hu,Xunzhao Yin,Can Li
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:
点击查看摘要
Abstract:The rapid advancement of artificial intelligence has raised concerns regarding its trustworthiness, especially in terms of interpretability and robustness. Tree-based models like Random Forest and XGBoost excel in interpretability and accuracy for tabular data, but scaling them remains computationally expensive due to poor data locality and high data dependence. Previous efforts to accelerate these models with analog content addressable memory (CAM) have struggled, due to the fact that the difficult-to-implement sharp decision boundaries are highly susceptible to device variations, which leads to poor hardware performance and vulnerability to adversarial attacks. This work presents a novel hardware-software co-design approach using MoS_2 Flash-based analog CAM with inherent soft boundaries, enabling efficient inference with soft tree-based models. Our soft tree model inference experiments on MoS_2 analog CAM arrays show this method achieves exceptional robustness against device variation and adversarial attacks while achieving state-of-the-art accuracy. Specifically, our fabricated analog CAM arrays achieve 96% accuracy on Wisconsin Diagnostic Breast Cancer (WDBC) database, while maintaining decision explainability. Our experimentally calibrated model validated only a 0.6% accuracy drop on the MNIST dataset under 10% device threshold variation, compared to a 45.3% drop for traditional decision trees. This work paves the way for specialized hardware that enhances AI’s trustworthiness and efficiency.
[LG-5] Improving Reinforcement Learning Sample-Efficiency using Local Approximation
链接: https://arxiv.org/abs/2507.12383
作者: Mohit Prashant,Arvind Easwaran
类目: Machine Learning (cs.LG)
*备注: Preprint
点击查看摘要
Abstract:In this study, we derive Probably Approximately Correct (PAC) bounds on the asymptotic sample-complexity for RL within the infinite-horizon Markov Decision Process (MDP) setting that are sharper than those in existing literature. The premise of our study is twofold: firstly, the further two states are from each other, transition-wise, the less relevant the value of the first state is when learning the \epsilon -optimal value of the second; secondly, the amount of ‘effort’, sample-complexity-wise, expended in learning the \epsilon -optimal value of a state is independent of the number of samples required to learn the \epsilon -optimal value of a second state that is a sufficient number of transitions away from the first. Inversely, states within each other’s vicinity have values that are dependent on each other and will require a similar number of samples to learn. By approximating the original MDP using smaller MDPs constructed using subsets of the original’s state-space, we are able to reduce the sample-complexity by a logarithmic factor to O(SA \log A) timesteps, where S and A are the state and action space sizes. We are able to extend these results to an infinite-horizon, model-free setting by constructing a PAC-MDP algorithm with the aforementioned sample-complexity. We conclude with showing how significant the improvement is by comparing our algorithm against prior work in an experimental setting.
[LG-6] Heat Kernel Goes Topological
链接: https://arxiv.org/abs/2507.12380
作者: Maximilian Krahn,Vikas Garg
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Topological neural networks have emerged as powerful successors of graph neural networks. However, they typically involve higher-order message passing, which incurs significant computational expense. We circumvent this issue with a novel topological framework that introduces a Laplacian operator on combinatorial complexes (CCs), enabling efficient computation of heat kernels that serve as node descriptors. Our approach captures multiscale information and enables permutation-equivariant representations, allowing easy integration into modern transformer-based architectures. Theoretically, the proposed method is maximally expressive because it can distinguish arbitrary non-isomorphic CCs. Empirically, it significantly outperforms existing topological methods in terms of computational efficiency. Besides demonstrating competitive performance with the state-of-the-art descriptors on standard molecular datasets, it exhibits superior capability in distinguishing complex topological structures and avoiding blind spots on topological benchmarks. Overall, this work advances topological deep learning by providing expressive yet scalable representations, thereby opening up exciting avenues for molecular classification and property prediction tasks.
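下面用 NumPy 给出“由拉普拉斯算子的特征分解计算多尺度热核对角元并用作节点描述子”的通用小示例;此处仅在普通图的拉普拉斯上演示(论文中的算子定义在组合复形上),尺度参数 t 为假设值。

```python
import numpy as np

def heat_kernel_descriptors(adj: np.ndarray, times=(0.1, 1.0, 10.0)) -> np.ndarray:
    """Per-node heat-kernel signatures diag(exp(-t L)) at several scales t.

    Demonstrated with an ordinary graph Laplacian; the paper defines an
    analogous operator on combinatorial complexes.
    """
    deg = adj.sum(axis=1)
    L = np.diag(deg) - adj                         # combinatorial graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)           # L is symmetric PSD
    # diag(exp(-t L))_i = sum_k exp(-t * lambda_k) * phi_k(i)^2
    return np.stack([(eigvecs ** 2) @ np.exp(-t * eigvals) for t in times], axis=1)

if __name__ == "__main__":
    # A 6-node graph: a triangle attached to a 3-node path
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5)]:
        A[i, j] = A[j, i] = 1.0
    descriptors = heat_kernel_descriptors(A)
    print(descriptors.round(3))                    # shape (6, 3): one row per node
```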
[LG-7] Robust Causal Discovery in Real-World Time Series with Power-Laws
链接: https://arxiv.org/abs/2507.12257
作者: Matteo Tusoni,Giuseppe Masi,Andrea Coletta,Aldo Glielmo,Viviana Arrigoni,Novella Bartolini
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML); Other Statistics (stat.OT)
*备注:
点击查看摘要
Abstract:Exploring causal relationships in stochastic time series is a challenging yet crucial task with a vast range of applications, including finance, economics, neuroscience, and climate science. Many algorithms for Causal Discovery (CD) have been proposed, but they often exhibit a high sensitivity to noise, resulting in misleading causal inferences when applied to real data. In this paper, we observe that the frequency spectra of typical real-world time series follow a power-law distribution, notably due to an inherent self-organizing behavior. Leveraging this insight, we build a robust CD method based on the extraction of power -law spectral features that amplify genuine causal signals. Our method consistently outperforms state-of-the-art alternatives on both synthetic benchmarks and real-world datasets with known causal structures, demonstrating its robustness and practical relevance.
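下面示意“提取时间序列频谱的幂律特征”这一核心步骤:用 Welch 方法估计功率谱,再在 log-log 坐标下线性拟合得到幂律指数;窗长等参数为假设值,完整的因果发现流程还需在该特征之上进一步构建。

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)

def power_law_exponent(x: np.ndarray, fs: float = 1.0) -> float:
    """Estimate alpha in S(f) ~ f^(-alpha) from the Welch spectrum of a series."""
    freqs, psd = welch(x, fs=fs, nperseg=256)
    mask = freqs > 0                               # drop the DC bin before taking logs
    slope, _ = np.polyfit(np.log(freqs[mask]), np.log(psd[mask]), deg=1)
    return -slope

# Sanity check on toy series: white noise is flat (alpha ~ 0),
# a random walk has an approximately 1/f^2 spectrum (alpha ~ 2).
white = rng.normal(size=4096)
walk = np.cumsum(white)
print(f"white noise: alpha ~ {power_law_exponent(white):.2f}")
print(f"random walk: alpha ~ {power_law_exponent(walk):.2f}")
```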
[LG-8] Universal Fourier Neural Operators for Micromechanics
链接: https://arxiv.org/abs/2507.12233
作者: Binh Huy Nguyen,Matti Schneider
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 48 pages, 13 figures
点击查看摘要
Abstract:Solving cell problems in homogenization is hard, and available deep-learning frameworks fail to match the speed and generality of traditional computational frameworks. More to the point, it is generally unclear what to expect of machine-learning approaches, let alone single out which approaches are promising. In the work at hand, we advocate Fourier Neural Operators (FNOs) for micromechanics, empowering them by insights from computational micromechanics methods based on the fast Fourier transform (FFT). We construct an FNO surrogate mimicking the basic scheme foundational for FFT-based methods and show that the resulting operator predicts solutions to cell problems with arbitrary stiffness distribution only subject to a material-contrast constraint up to a desired accuracy. In particular, there are no restrictions on the material symmetry like isotropy, on the number of phases and on the geometry of the interfaces between materials. Also, the provided fidelity is sharp and uniform, providing explicit guarantees leveraging our physical empowerment of FNOs. To show the desired universal approximation property, we construct an FNO explicitly that requires no training to begin with. Still, the obtained neural operator complies with the same memory requirements as the basic scheme and comes with runtimes proportional to classical FFT solvers. In particular, large-scale problems with more than 100 million voxels are readily handled. The goal of this work is to underline the potential of FNOs for solving micromechanical problems, linking FFT-based methods to FNOs. This connection is expected to provide a fruitful exchange between both worlds.
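下面给出 FNO 的基本构件——二维谱卷积层——的一个简化 PyTorch 示意:在傅里叶域截断低频模式并做逐模式的通道混合后逆变换回空间域。通道数、保留模式数均为假设,且为简洁起见只保留了一个频率象限,并未体现论文中与 FFT 基本格式对应的具体构造。

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Minimal 2-D spectral convolution: FFT, low-mode channel mixing, inverse FFT."""
    def __init__(self, in_ch: int, out_ch: int, modes: int = 8):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_ch * out_ch)
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, modes, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, in_ch, H, W)
        batch, _, h, w = x.shape
        x_ft = torch.fft.rfft2(x)                          # (batch, in_ch, H, W//2 + 1)
        out_ft = torch.zeros(
            batch, self.weight.shape[1], h, w // 2 + 1, dtype=torch.cfloat
        )
        m = self.modes
        # Keep only the lowest m x m Fourier modes and mix channels there
        out_ft[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.weight
        )
        return torch.fft.irfft2(out_ft, s=(h, w))

x = torch.randn(2, 3, 64, 64)            # e.g. a periodic microstructure field, 3 channels
layer = SpectralConv2d(in_ch=3, out_ch=16, modes=8)
print(layer(x).shape)                    # torch.Size([2, 16, 64, 64])
```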
[LG-9] Optimizers Qualitatively Alter Solutions And We Should Leverage This
链接: https://arxiv.org/abs/2507.12224
作者: Razvan Pascanu,Clare Lyle,Ionut-Vlad Modoranu,Naima Elosegui Borras,Dan Alistarh,Petar Velickovic,Sarath Chandar,Soham De,James Martens
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Due to the nonlinear nature of Deep Neural Networks (DNNs), one can not guarantee convergence to a unique global minimum of the loss when using optimizers relying only on local information, such as SGD. Indeed, this was a primary source of skepticism regarding the feasibility of DNNs in the early days of the field. The past decades of progress in deep learning have revealed this skepticism to be misplaced, and a large body of empirical evidence shows that sufficiently large DNNs following standard training protocols exhibit well-behaved optimization dynamics that converge to performant solutions. This success has biased the community to use convex optimization as a mental model for learning, leading to a focus on training efficiency, either in terms of required iteration, FLOPs or wall-clock time, when improving optimizers. We argue that, while this perspective has proven extremely fruitful, another perspective specific to DNNs has received considerably less attention: the optimizer not only influences the rate of convergence, but also the qualitative properties of the learned solutions. Restated, the optimizer can and will encode inductive biases and change the effective expressivity of a given class of models. Furthermore, we believe the optimizer can be an effective way of encoding desiderata in the learning process. We contend that the community should aim at understanding the biases of already existing methods, as well as aim to build new optimizers with the explicit intent of inducing certain properties of the solution, rather than solely judging them based on their convergence rates. We hope our arguments will inspire research to improve our understanding of how the learning process can impact the type of solution we converge to, and lead to a greater recognition of optimizers design as a critical lever that complements the roles of architecture and data in shaping model outcomes.
[LG-10] Physics-Informed Linear Model (PILM): Analytical Representations and Application to Crustal Strain Rate Estimation
链接: https://arxiv.org/abs/2507.12218
作者: Tomohisa Okazaki
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
点击查看摘要
Abstract:Many physical systems are described by partial differential equations (PDEs), and solving these equations and estimating their coefficients or boundary conditions (BCs) from observational data play a crucial role in understanding the associated phenomena. Recently, a machine learning approach known as physics-informed neural network, which solves PDEs using neural networks by minimizing the sum of residuals from the PDEs, BCs, and data, has gained significant attention in the scientific community. In this study, we investigate a physics-informed linear model (PILM) that uses linear combinations of basis functions to represent solutions, thereby enabling an analytical representation of optimal solutions. The PILM was formulated and verified for illustrative forward and inverse problems including cases with uncertain BCs. Furthermore, the PILM was applied to estimate crustal strain rates using geodetic data. Specifically, physical regularization that enforces elastic equilibrium on the velocity fields was compared with mathematical regularization that imposes smoothness constraints. From a Bayesian perspective, mathematical regularization exhibited superior performance. The PILM provides an analytically solvable framework applicable to linear forward and inverse problems, underdetermined systems, and physical regularization.
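下面用一个一维玩具问题(求解 u''(x)=f(x),边界条件由正弦基函数自动满足)示意 PILM“以基函数线性组合表示解、将 PDE 残差与数据残差拼成线性系统并解析地最小二乘求解”的特点;基函数个数与数据项权重均为假设值。

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: u''(x) = f(x) on [0, 1] with u(0) = u(1) = 0; true solution sin(pi x)
def f(x):
    return -(np.pi ** 2) * np.sin(np.pi * x)

def u_true(x):
    return np.sin(np.pi * x)

K = 8                                               # number of sine basis functions (assumed)
ks = np.arange(1, K + 1)

def phi(x):                                         # basis values; each basis satisfies the BCs
    return np.sin(np.outer(x, ks) * np.pi)

def phi_xx(x):                                      # second derivatives of the basis
    return -((ks * np.pi) ** 2) * phi(x)

x_col = np.linspace(0.0, 1.0, 40)                   # PDE collocation points
x_obs = rng.uniform(0.0, 1.0, size=5)               # a few noisy observations of u
y_obs = u_true(x_obs) + 0.01 * rng.normal(size=5)

# Stack PDE-residual rows and (weighted) data-residual rows into one linear
# system A c = b, then solve it analytically by least squares -- the point of PILM.
w_data = 10.0                                       # assumed weighting of the data term
A = np.vstack([phi_xx(x_col), w_data * phi(x_obs)])
b = np.concatenate([f(x_col), w_data * y_obs])
coef, *_ = np.linalg.lstsq(A, b, rcond=None)

x_test = np.linspace(0.0, 1.0, 11)
err = np.max(np.abs(phi(x_test) @ coef - u_true(x_test)))
print(f"max abs error of the basis-function solution: {err:.2e}")
```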
[LG-11] Explainable Evidential Clustering
链接: https://arxiv.org/abs/2507.12192
作者: Victor F. Lopes de Souza,Karima Bakhti,Sofiane Ramdani,Denis Mottet,Abdelhak Imoussaten
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Unsupervised classification is a fundamental machine learning problem. Real-world data often contain imperfections, characterized by uncertainty and imprecision, which are not well handled by traditional methods. Evidential clustering, based on Dempster-Shafer theory, addresses these challenges. This paper explores the underexplored problem of explaining evidential clustering results, which is crucial for high-stakes domains such as healthcare. Our analysis shows that, in the general case, representativity is a necessary and sufficient condition for decision trees to serve as abductive explainers. Building on the concept of representativity, we generalize this idea to accommodate partial labeling through utility functions. These functions enable the representation of “tolerable” mistakes, leading to the definition of evidential mistakeness as explanation cost and the construction of explainers tailored to evidential classifiers. Finally, we propose the Iterative Evidential Mistake Minimization (IEMM) algorithm, which provides interpretable and cautious decision tree explanations for evidential clustering functions. We validate the proposed algorithm on synthetic and real-world data. Taking into account the decision-maker’s preferences, we were able to provide an explanation that was satisfactory up to 93% of the time.
[LG-12] RadioDiff-3D: A 3D×3D Radio Map Dataset and Generative Diffusion Based Benchmark for 6G Environment-Aware Communication
链接: https://arxiv.org/abs/2507.12166
作者: Xiucheng Wang,Qiming Zhang,Nan Cheng,Junting Chen,Zezhong Zhang,Zan Li,Shuguang Cui,Xuemin Shen
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Radio maps (RMs) serve as a critical foundation for enabling environment-aware wireless communication, as they provide the spatial distribution of wireless channel characteristics. Despite recent progress in RM construction using data-driven approaches, most existing methods focus solely on pathloss prediction in a fixed 2D plane, neglecting key parameters such as direction of arrival (DoA), time of arrival (ToA), and vertical spatial variations. Such a limitation is primarily due to the reliance on static learning paradigms, which hinder generalization beyond the training data distribution. To address these challenges, we propose UrbanRadio3D, a large-scale, high-resolution 3D RM dataset constructed via ray tracing in realistic urban environments. UrbanRadio3D is over 37 \times 3 larger than previous datasets across a 3D space with 3 metrics as pathloss, DoA, and ToA, forming a novel 3D \times 3D dataset with 7 \times 3 more height layers than prior state-of-the-art (SOTA) dataset. To benchmark 3D RM construction, a UNet with 3D convolutional operators is proposed. Moreover, we further introduce RadioDiff-3D, a diffusion-model-based generative framework utilizing the 3D convolutional architecture. RadioDiff-3D supports both radiation-aware scenarios with known transmitter locations and radiation-unaware settings based on sparse spatial observations. Extensive evaluations on UrbanRadio3D validate that RadioDiff-3D achieves superior performance in constructing rich, high-dimensional radio maps under diverse environmental dynamics. This work provides a foundational dataset and benchmark for future research in 3D environment-aware communication. The dataset is available at this https URL.
[LG-13] Multi-Component VAE with Gaussian Markov Random Field
链接: https://arxiv.org/abs/2507.12165
作者: Fouad Oubari,Mohamed El-Baha,Raphael Meunier,Rodrigue Décatoire,Mathilde Mougeot
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multi-component datasets with intricate dependencies, like industrial assemblies or multi-modal imaging, challenge current generative modeling techniques. Existing Multi-component Variational AutoEncoders typically rely on simplified aggregation strategies, neglecting critical nuances and consequently compromising structural coherence across generated components. To explicitly address this gap, we introduce the Gaussian Markov Random Field Multi-Component Variational AutoEncoder, a novel generative framework embedding Gaussian Markov Random Fields into both prior and posterior distributions. This design choice explicitly models cross-component relationships, enabling richer representation and faithful reproduction of complex interactions. Empirically, our GMRF MCVAE achieves state-of-the-art performance on a synthetic Copula dataset specifically constructed to evaluate intricate component relationships, demonstrates competitive results on the PolyMNIST benchmark, and significantly enhances structural coherence on the real-world BIKED dataset. Our results indicate that the GMRF MCVAE is especially suited for practical applications demanding robust and realistic modeling of multi-component coherence.
[LG-14] FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale
链接: https://arxiv.org/abs/2507.12144
作者: Boris Bonev,Thorsten Kurth,Ankur Mahesh,Mauro Bisson,Jean Kossaifi,Karthik Kashinath,Anima Anandkumar,William D. Collins,Michael S. Pritchard,Alexander Keller
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
点击查看摘要
Abstract:FourCastNet 3 advances global weather modeling by implementing a scalable, geometric machine learning (ML) approach to probabilistic ensemble forecasting. The approach is designed to respect spherical geometry and to accurately model the spatially correlated probabilistic nature of the problem, resulting in stable spectra and realistic dynamics across multiple scales. FourCastNet 3 delivers forecasting accuracy that surpasses leading conventional ensemble models and rivals the best diffusion-based methods, while producing forecasts 8 to 60 times faster than these approaches. In contrast to other ML approaches, FourCastNet 3 demonstrates excellent probabilistic calibration and retains realistic spectra, even at extended lead times of up to 60 days. All of these advances are realized using a purely convolutional neural network architecture tailored for spherical geometry. Scalable and efficient large-scale training on 1024 GPUs and more is enabled by a novel training paradigm for combined model- and data-parallelism, inspired by domain decomposition methods in classical numerical models. Additionally, FourCastNet 3 enables rapid inference on a single GPU, producing a 90-day global forecast at 0.25°, 6-hourly resolution in under 20 seconds. Its computational efficiency, medium-range probabilistic skill, spectral fidelity, and rollout stability at subseasonal timescales make it a strong candidate for improving meteorological forecasting and early warning systems through large ensemble predictions.
[LG-15] HyDRA: A Hybrid Dual-Mode Network for Closed- and Open-Set RFFI with Optimized VMD
链接: https://arxiv.org/abs/2507.12133
作者: Hanwen Liu,Yuhe Huang,Yifeng Gong,Yanjie Zhai,Jiaxuan Lu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Device recognition is vital for security in wireless communication systems, particularly for applications like access control. Radio Frequency Fingerprint Identification (RFFI) offers a non-cryptographic solution by exploiting hardware-induced signal distortions. This paper proposes HyDRA, a Hybrid Dual-mode RF Architecture that integrates an optimized Variational Mode Decomposition (VMD) with a novel architecture based on the fusion of Convolutional Neural Networks (CNNs), Transformers, and Mamba components, designed to support both closed-set and open-set classification tasks. The optimized VMD enhances preprocessing efficiency and classification accuracy by fixing center frequencies and using closed-form solutions. HyDRA employs the Transformer Dynamic Sequence Encoder (TDSE) for global dependency modeling and the Mamba Linear Flow Encoder (MLFE) for linear-complexity processing, adapting to varying conditions. Evaluation on public datasets demonstrates state-of-the-art (SOTA) accuracy in closed-set scenarios and robust performance in our proposed open-set classification method, effectively identifying unauthorized devices. Deployed on NVIDIA Jetson Xavier NX, HyDRA achieves millisecond-level inference speed with low power consumption, providing a practical solution for real-time wireless authentication in real-world environments.
[LG-16] Self-Adaptive and Robust Federated Spectrum Sensing without Benign Majority for Cellular Networks
链接: https://arxiv.org/abs/2507.12127
作者: Ngoc Duy Pham,Thusitha Dayaratne,Viet Vo,Shangqi Lai,Sharif Abuadbba,Hajime Suzuki,Xingliang Yuan,Carsten Rudolph
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Advancements in wireless and mobile technologies, including 5G advanced and the envisioned 6G, are driving exponential growth in wireless devices. However, this rapid expansion exacerbates spectrum scarcity, posing a critical challenge. Dynamic spectrum allocation (DSA)–which relies on sensing and dynamically sharing spectrum–has emerged as an essential solution to address this issue. While machine learning (ML) models hold significant potential for improving spectrum sensing, their adoption in centralized ML-based DSA systems is limited by privacy concerns, bandwidth constraints, and regulatory challenges. To overcome these limitations, distributed ML-based approaches such as Federated Learning (FL) offer promising alternatives. This work addresses two key challenges in FL-based spectrum sensing (FLSS). First, the scarcity of labeled data for training FL models in practical spectrum sensing scenarios is tackled with a semi-supervised FL approach, combined with energy detection, enabling model training on unlabeled datasets. Second, we examine the security vulnerabilities of FLSS, focusing on the impact of data poisoning attacks. Our analysis highlights the shortcomings of existing majority-based defenses in countering such attacks. To address these vulnerabilities, we propose a novel defense mechanism inspired by vaccination, which effectively mitigates data poisoning attacks without relying on majority-based assumptions. Extensive experiments on both synthetic and real-world datasets validate our solutions, demonstrating that FLSS can achieve near-perfect accuracy on unlabeled datasets and maintain Byzantine robustness against both targeted and untargeted data poisoning attacks, even when a significant proportion of participants are malicious.
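下面示意其中“用能量检测为无标注频谱快照生成伪标签”这一步骤:计算每段 IQ 样本的平均能量,超过由噪声基底与虚警概率推算的阈值即判为占用。噪声方差、虚警概率与仿真信号均为假设,联邦训练与防御机制不在此片段范围内。

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def energy_detect(iq: np.ndarray, noise_var: float, pfa: float = 0.01) -> bool:
    """Pseudo-label one IQ snapshot as 'occupied' via classic energy detection.

    The threshold uses a Gaussian approximation of the test statistic under the
    noise-only hypothesis; noise_var and pfa are assumed known in this sketch.
    """
    n = iq.size
    test_stat = np.mean(np.abs(iq) ** 2)
    threshold = noise_var * (1.0 + norm.isf(pfa) * np.sqrt(2.0 / n))
    return bool(test_stat > threshold)

# Simulate unlabeled snapshots: roughly half noise-only, half containing a weak tone
noise_var, n_samples = 1.0, 1024
snapshots, truth = [], []
for _ in range(200):
    occupied = rng.random() < 0.5
    x = np.sqrt(noise_var / 2) * (rng.normal(size=n_samples) + 1j * rng.normal(size=n_samples))
    if occupied:
        x = x + 0.5 * np.exp(1j * 2 * np.pi * 0.1 * np.arange(n_samples))
    snapshots.append(x)
    truth.append(occupied)

pseudo_labels = [energy_detect(x, noise_var) for x in snapshots]
acc = float(np.mean([p == t for p, t in zip(pseudo_labels, truth)]))
print(f"pseudo-label accuracy from energy detection: {acc:.2%}")
```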
[LG-17] A Privacy-Preserving Framework for Advertising Personalization Incorporating Federated Learning and Differential Privacy
链接: https://arxiv.org/abs/2507.12098
作者: Xiang Li,Yifan Lin,Yuanzhe Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:To mitigate privacy leakage and performance issues in personalized advertising, this paper proposes a framework that integrates federated learning and differential privacy. The system combines distributed feature extraction, dynamic privacy budget allocation, and robust model aggregation to balance model accuracy, communication overhead, and privacy protection. Multi-party secure computing and anomaly detection mechanisms further enhance system resilience against malicious attacks. Experimental results demonstrate that the framework achieves dual optimization of recommendation accuracy and system efficiency while ensuring privacy, providing both a practical solution and a theoretical foundation for applying privacy protection technologies in advertisement recommendation.
[LG-18] Measuring Informativeness Gap of (Mis)Calibrated Predictors
链接: https://arxiv.org/abs/2507.12094
作者: Yiding Feng,Wei Tang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:In many applications, decision-makers must choose between multiple predictive models that may all be miscalibrated. Which model (i.e., predictor) is more “useful” in downstream decision tasks? To answer this, our first contribution introduces the notion of the informativeness gap between any two predictors, defined as the maximum normalized payoff advantage one predictor offers over the other across all decision-making tasks. Our framework strictly generalizes several existing notions: it subsumes U-Calibration [KLST-23] and Calibration Decision Loss [HW-24], which compare a miscalibrated predictor to its calibrated counterpart, and it recovers Blackwell informativeness [Bla-51, Bla-53] as a special case when both predictors are perfectly calibrated. Our second contribution is a dual characterization of the informativeness gap, which gives rise to a natural informativeness measure that can be viewed as a relaxed variant of the earth mover’s distance (EMD) between two prediction distributions. We show that this measure satisfies natural desiderata: it is complete and sound, and it can be estimated sample-efficiently in the prediction-only access setting. Along the way, we also obtain novel combinatorial structural results when applying this measure to perfectly calibrated predictors.
[LG-19] Emergence of Quantised Representations Isolated to Anisotropic Functions
链接: https://arxiv.org/abs/2507.12070
作者: George Bird
类目: Machine Learning (cs.LG)
*备注: 36 pages, 31 figures
点击查看摘要
Abstract:This paper describes a novel methodology for determining representational alignment, developed upon the existing Spotlight Resonance method. Using this, it is found that algebraic symmetries of network primitives are a strong predictor for task-agnostic structure in representations. Particularly, this new tool is used to gain insight into how discrete representations can form and arrange in autoencoder models, through an ablation study where only the activation function is altered. Representations are found to tend to discretise when the activation functions are defined through a discrete algebraic permutation-equivariant symmetry. In contrast, they remain continuous under a continuous algebraic orthogonal-equivariant definition. These findings corroborate the hypothesis that functional form choices can carry unintended inductive biases which produce task-independent artefactual structures in representations, particularly that contemporary forms induce discretisation of otherwise continuous structure – a quantisation effect. Moreover, this supports a general causal model for one mode in which discrete representations may form, and could constitute a prerequisite for downstream interpretability phenomena, including grandmother neurons, discrete coding schemes, general linear features and possibly Superposition. Hence, this tool and proposed mechanism for the influence of functional form on representations may provide several insights into emergent interpretability research. Finally, preliminary results indicate that quantisation of representations appears to correlate with a measurable increase in reconstruction error, reinforcing previous conjectures that this collapse can be detrimental.
[LG-20] FloGAN: Scenario-Based Urban Mobility Flow Generation via Conditional GANs and Dynamic Region Decoupling
链接: https://arxiv.org/abs/2507.12053
作者: Seanglidet Yean,Jiazu Zhou,Bu-Sung Lee,Markus Schläpfer
类目: Machine Learning (cs.LG)
*备注: International Conference on Intelligent Digitization of Systems and Services, Valencia, Spain, 2025 (IDSS 2025)
点击查看摘要
Abstract:The mobility patterns of people in cities evolve alongside changes in land use and population. This makes it crucial for urban planners to simulate and analyze human mobility patterns for purposes such as transportation optimization and sustainable urban development. Existing generative models borrowed from machine learning rely heavily on historical trajectories and often overlook evolving factors like changes in population density and land use. Mechanistic approaches incorporate population density and facility distribution but assume static scenarios, limiting their utility for future projections where historical data for calibration is unavailable. This study introduces a novel, data-driven approach for generating origin-destination mobility flows tailored to simulated urban scenarios. Our method leverages adaptive factors such as dynamic region sizes and land use archetypes, and it utilizes conditional generative adversarial networks (cGANs) to blend historical data with these adaptive parameters. The approach facilitates rapid mobility flow generation with adjustable spatial granularity based on regions of interest, without requiring extensive calibration data or complex behavior modeling. The promising performance of our approach is demonstrated by its application to mobile phone data from Singapore, and by its comparison with existing methods.
[LG-21] Information-Theoretic Generalization Bounds of Replay-based Continual Learning
链接: https://arxiv.org/abs/2507.12043
作者: Wen Wen,Tieliang Gong,Yunjiao Zhang,Zeyu Gao,Weizhan Zhang,Yong-Jin Liu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Continual learning (CL) has emerged as a dominant paradigm for acquiring knowledge from sequential tasks while avoiding catastrophic forgetting. Although many CL methods have been proposed to show impressive empirical performance, the theoretical understanding of their generalization behavior remains limited, particularly for replay-based approaches. In this paper, we establish a unified theoretical framework for replay-based CL, deriving a series of information-theoretic bounds that explicitly characterize how the memory buffer interacts with the current task to affect generalization. Specifically, our hypothesis-based bounds reveal that utilizing the limited exemplars of previous tasks alongside the current task data, rather than exhaustive replay, facilitates improved generalization while effectively mitigating catastrophic forgetting. Furthermore, our prediction-based bounds yield tighter and computationally tractable upper bounds of the generalization gap through the use of low-dimensional variables. Our analysis is general and broadly applicable to a wide range of learning algorithms, exemplified by stochastic gradient Langevin dynamics (SGLD) as a representative method. Comprehensive experimental evaluations demonstrate the effectiveness of our derived bounds in capturing the generalization dynamics in replay-based CL settings.
[LG-22] Granular feedback merits sophisticated aggregation
链接: https://arxiv.org/abs/2507.12041
作者: Anmol Kagrecha,Henrik Marklund,Potsawee Manakul,Richard Zeckhauser,Benjamin Van Roy
类目: Machine Learning (cs.LG)
*备注: 31 pages, 8 figures
点击查看摘要
Abstract:Human feedback is increasingly used across diverse applications like training AI models, developing recommender systems, and measuring public opinion – with granular feedback often being preferred over binary feedback for its greater informativeness. While it is easy to accurately estimate a population’s distribution of feedback given feedback from a large number of individuals, cost constraints typically necessitate using smaller groups. A simple method to approximate the population distribution is regularized averaging: compute the empirical distribution and regularize it toward a prior. Can we do better? As we will discuss, the answer to this question depends on feedback granularity. Suppose one wants to predict a population’s distribution of feedback using feedback from a limited number of individuals. We show that, as feedback granularity increases, one can substantially improve upon predictions of regularized averaging by combining individuals’ feedback in ways more sophisticated than regularized averaging. Our empirical analysis using questions on social attitudes confirms this pattern. In particular, with binary feedback, sophistication barely reduces the number of individuals required to attain a fixed level of performance. By contrast, with five-point feedback, sophisticated methods match the performance of regularized averaging with about half as many individuals.
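下面示意文中作为基线的“正则化平均”:用小样本的经验分布向先验收缩,以预测总体在五点量表上的反馈分布;总体分布、先验与收缩系数均为假设值,文中提出的更精巧的聚合方法不在此片段中。

```python
import numpy as np

rng = np.random.default_rng(0)

# Population distribution over a 5-point feedback scale (unknown in practice)
population = np.array([0.10, 0.20, 0.35, 0.25, 0.10])
prior = np.full(5, 0.2)                        # assumed uniform prior over the 5 options

def regularized_average(responses: np.ndarray, prior: np.ndarray, lam: float = 0.3):
    """Empirical distribution of a small sample, shrunk toward a prior."""
    counts = np.bincount(responses, minlength=5)
    empirical = counts / counts.sum()
    return (1 - lam) * empirical + lam * prior

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    return 0.5 * float(np.abs(p - q).sum())

for n in (5, 20, 100):
    errors = []
    for _ in range(2000):
        sample = rng.choice(5, size=n, p=population)   # feedback from n individuals
        errors.append(total_variation(regularized_average(sample, prior), population))
    print(f"n={n:3d}  mean TV error of regularized averaging: {np.mean(errors):.3f}")
```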
[LG-23] Expanding ML-Documentation Standards For Better Security
链接: https://arxiv.org/abs/2507.12003
作者: Cara Ellen Appel
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted for publication at the 33rd IEEE International Requirements Engineering Workshop (REW 2025)
点击查看摘要
Abstract:This article presents the current state of ML-security and of the documentation of ML-based systems, models and datasets in research and practice based on an extensive review of the existing literature. It shows a generally low awareness of security aspects among ML-practitioners and organizations and an often unstandardized approach to documentation, leading to overall low quality of ML-documentation. Existing standards are not regularly adopted in practice and IT-security aspects are often not included in documentation. Due to these factors, there is a clear need for improved security documentation in ML, as one step towards addressing the existing gaps in ML-security. To achieve this, we propose expanding existing documentation standards for ML-documentation to include a security section with specific security relevant information. Implementing this, a novel expanded method of documenting security requirements in ML-documentation is presented, based on the existing Model Cards and Datasheets for Datasets standards, but with the recommendation to adopt these findings in all ML-documentation.
[LG-24] Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing
链接: https://arxiv.org/abs/2507.12002
作者: Alice Zhang,Callihan Bertley,Dawei Liang,Edison Thomaz
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Social interactions play a crucial role in shaping human behavior, relationships, and societies. It encompasses various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect a foundational aspect of human social interactions, in-person verbal conversations, by leveraging audio and inertial data captured with a commodity smartwatch in acoustically-challenging scenarios. To evaluate our approach, we conducted a lab study with 11 participants and a semi-naturalistic study with 24 participants. We analyzed machine learning and deep learning models with 3 different fusion methods, showing the advantages of fusing audio and inertial data to consider not only verbal cues but also non-verbal gestures in conversations. Furthermore, we perform a comprehensive set of evaluations across activities and sampling rates to demonstrate the benefits of multimodal sensing in specific contexts. Overall, our framework achieved 82.0 \pm 3.0% macro F1-score when detecting conversations in the lab and 77.2 \pm 1.8% in the semi-naturalistic setting.
[LG-25] Dataset-Adaptive Dimensionality Reduction IEEE-VIS2025
链接: https://arxiv.org/abs/2507.11984
作者: Hyeon Jeon,Jeongin Park,Soohyun Lee,Dae Hyun Kim,Sungbok Shin,Jinwook Seo
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: IEEE VIS 2025 IEEE Transactions on Visualization and Computer Graphics (TVCG)
点击查看摘要
Abstract:Selecting the appropriate dimensionality reduction (DR) technique and determining its optimal hyperparameter settings that maximize the accuracy of the output projections typically involves extensive trial and error, often resulting in unnecessary computational overhead. To address this challenge, we propose a dataset-adaptive approach to DR optimization guided by structural complexity metrics. These metrics quantify the intrinsic complexity of a dataset, predicting whether higher-dimensional spaces are necessary to represent it accurately. Since complex datasets are often inaccurately represented in two-dimensional projections, leveraging these metrics enables us to predict the maximum achievable accuracy of DR techniques for a given dataset, eliminating redundant trials in optimizing DR. We introduce the design and theoretical foundations of these structural complexity metrics. We quantitatively verify that our metrics effectively approximate the ground truth complexity of datasets and confirm their suitability for guiding dataset-adaptive DR workflow. Finally, we empirically show that our dataset-adaptive workflow significantly enhances the efficiency of DR optimization without compromising accuracy.
[LG-26] d-DQIVAR: Data-centric Visual Analytics and Reasoning for Data Quality Improvement
链接: https://arxiv.org/abs/2507.11960
作者: Hyein Hong,Sangbong Yoo,SeokHwan Choi,Jisue Kim,Seongbum Seo,Haneol Cho,Chansoo Kim,Yun Jang
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Approaches to enhancing data quality (DQ) are classified into two main categories: data- and process-driven. However, prior research has predominantly utilized batch data preprocessing within the data-driven framework, which often proves insufficient for optimizing machine learning (ML) model performance and frequently leads to distortions in data characteristics. Existing studies have primarily focused on data preprocessing rather than genuine data quality improvement (DQI). In this paper, we introduce d-DQIVAR, a novel visual analytics system designed to facilitate DQI strategies aimed at improving ML model performance. Our system integrates visual analytics techniques that leverage both data-driven and process-driven approaches. Data-driven techniques tackle DQ issues such as imputation, outlier detection, deletion, format standardization, removal of duplicate records, and feature selection. Process-driven strategies encompass evaluating DQ and DQI procedures by considering DQ dimensions and ML model performance and applying the Kolmogorov-Smirnov test. We illustrate how our system empowers users to harness expert and domain knowledge effectively within a practical workflow through case studies, evaluations, and user studies.
[LG-27] Accelerating RF Power Amplifier Design via Intelligent Sampling and ML-Based Parameter Tuning
链接: https://arxiv.org/abs/2507.11928
作者: Abhishek Sriram,Neal Tuffy
类目: Machine Learning (cs.LG)
*备注: This paper is a pre-print version and has been submitted to the IEEE International Conference on Future Machine Learning and Data Science (FMLDS 2025)
点击查看摘要
Abstract:This paper presents a machine learning-accelerated optimization framework for RF power amplifier design that reduces simulation requirements by 65% while maintaining \pm0.3 to \pm0.4 dBm accuracy. The proposed method combines MaxMin Latin Hypercube Sampling with CatBoost gradient boosting to intelligently explore multidimensional parameter spaces. Instead of exhaustively simulating all parameter combinations to achieve target P2dB compression specifications, our approach strategically selects approximately 35% of critical simulation points. The framework processes ADS netlists, executes harmonic balance simulations on the reduced dataset, and trains a CatBoost model to predict P2dB performance across the entire design space. Validation across 15 PA operating modes yields an average R^2 of 0.901, with the system ranking parameter combinations by their likelihood of meeting target specifications. The integrated solution delivers 58.24% to 77.78% reduction in simulation time through automated GUI-based workflows, enabling rapid design iterations without compromising accuracy standards required for production RF circuits.
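下面给出“拉丁超立方采样选取部分设计点仿真 + 梯度提升模型预测整个设计空间并按目标指标排序”这一流程骨架的示意;此处用 SciPy 的普通 LHS 与 CatBoost 回归器(假设已安装 catboost 包)代替文中的 MaxMin LHS 与 ADS 谐波平衡仿真,目标函数、参数范围与采样比例均为假设。

```python
import numpy as np
from scipy.stats import qmc
from catboost import CatBoostRegressor

def run_simulation(params: np.ndarray) -> float:
    """Stand-in for a harmonic-balance simulation returning a P2dB-like value (dBm)."""
    bias, matching, width = params
    return 20 + 5 * np.sin(3 * bias) + 2 * matching - 0.5 * (width - 1.5) ** 2

l_bounds, u_bounds = [0.0, 0.0, 0.0], [1.0, 1.0, 3.0]

# Full design grid that an exhaustive sweep would have to simulate (1000 points here)
axes = [np.linspace(lo, hi, 10) for lo, hi in zip(l_bounds, u_bounds)]
grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, 3)

# 1) Latin Hypercube Sampling selects ~35% of the space to actually simulate
#    (plain LHS here; the paper uses a MaxMin variant)
lhs = qmc.LatinHypercube(d=3, seed=0)
X_train = qmc.scale(lhs.random(n=350), l_bounds, u_bounds)
y_train = np.array([run_simulation(p) for p in X_train])

# 2) Train a gradient-boosting surrogate and predict P2dB over the whole space
model = CatBoostRegressor(iterations=300, depth=6, verbose=0)
model.fit(X_train, y_train)
pred = model.predict(grid)

# 3) Rank parameter combinations by predicted closeness to a target spec (e.g. 24 dBm)
target = 24.0
best = np.argsort(np.abs(pred - target))[:5]
for idx in best:
    print(grid[idx].round(2), f"predicted P2dB ~ {pred[idx]:.2f} dBm")
```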
[LG-28] From Generative to Episodic: Sample-Efficient Replicable Reinforcement Learning
链接: https://arxiv.org/abs/2507.11926
作者: Max Hopkins,Sihan Liu,Christopher Ye,Yuichi Yoshida
类目: Machine Learning (cs.LG)
*备注: 67 pages
点击查看摘要
Abstract:The epidemic failure of replicability across empirical science and machine learning has recently motivated the formal study of replicable learning algorithms [Impagliazzo et al. (2022)]. In batch settings where data comes from a fixed i.i.d. source (e.g., hypothesis testing, supervised learning), the design of data-efficient replicable algorithms is now more or less understood. In contrast, there remain significant gaps in our knowledge for control settings like reinforcement learning where an agent must interact directly with a shifting environment. Karbasi et al. show that with access to a generative model of an environment with S states and A actions (the RL ‘batch setting’), replicably learning a near-optimal policy costs only \tilde{O}(S^2 A^2) samples. On the other hand, the best upper bound without a generative model jumps to \tilde{O}(S^7 A^7) [Eaton et al. (2024)] due to the substantial difficulty of environment exploration. This gap raises a key question in the broader theory of replicability: Is replicable exploration inherently more expensive than batch learning? Is sample-efficient replicable RL even possible? In this work, we (nearly) resolve this problem (for low-horizon tabular MDPs): exploration is not a significant barrier to replicable learning! Our main result is a replicable RL algorithm on \tilde{O}(S^2 A) samples, bridging the gap between the generative and episodic settings. We complement this with a matching \tilde{\Omega}(S^2 A) lower bound in the generative setting (under the common parallel sampling assumption) and an unconditional lower bound in the episodic setting of \tilde{\Omega}(S^2) showcasing the near-optimality of our algorithm with respect to the state space S.
[LG-29] AFPM: Alignment-based Frame Patch Modeling for Cross-Dataset EEG Decoding
链接: https://arxiv.org/abs/2507.11911
作者: Xiaoqing Chen,Siyang Li,Dongrui Wu
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Electroencephalogram (EEG) decoding models for brain-computer interfaces (BCIs) struggle with cross-dataset learning and generalization due to channel layout inconsistencies, non-stationary signal distributions, and limited neurophysiological prior integration. To address these issues, we propose a plug-and-play Alignment-Based Frame-Patch Modeling (AFPM) framework, which has two main components: 1) Spatial Alignment, which selects task-relevant channels based on brain-region priors, aligns EEG distributions across domains, and remaps the selected channels to a unified layout; and, 2) Frame-Patch Encoding, which models multi-dataset signals into unified spatiotemporal patches for EEG decoding. Compared to 17 state-of-the-art approaches that need dataset-specific tuning, the proposed calibration-free AFPM achieves performance gains of up to 4.40% on motor imagery and 3.58% on event-related potential tasks. To our knowledge, this is the first calibration-free cross-dataset EEG decoding framework, substantially enhancing the practicalness of BCIs in real-world applications.
[LG-30] Resampling strategies for imbalanced regression: a survey and empirical analysis
链接: https://arxiv.org/abs/2507.11902
作者: Juscimara G. Avelino,George D. C. Cavalcanti,Rafael M. O. Cruz
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Imbalanced problems can arise in different real-world situations, and to address this, certain strategies in the form of resampling or balancing algorithms are proposed. This issue has largely been studied in the context of classification, and yet, the same problem features in regression tasks, where target values are continuous. This work presents an extensive experimental study comprising various balancing and predictive models, and which uses metrics to capture important elements for the user and to evaluate the predictive model in an imbalanced regression data context. It also proposes a taxonomy for imbalanced regression approaches based on three crucial criteria: regression model, learning process, and evaluation metrics. The study offers new insights into the use of such strategies, highlighting the advantages they bring to each model’s learning process, and indicating directions for further studies. The code, data and further information related to the experiments performed herein can be found on GitHub: this https URL.
[LG-31] Imbalanced Regression Pipeline Recommendation
链接: https://arxiv.org/abs/2507.11901
作者: Juscimara G. Avelino,George D. C. Cavalcanti,Rafael M. O. Cruz
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Imbalanced problems are prevalent in various real-world scenarios and are extensively explored in classification tasks. However, they also present challenges for regression tasks due to the rarity of certain target values. A common alternative is to employ balancing algorithms in preprocessing to address dataset imbalance. However, due to the variety of resampling methods and learning models, determining the optimal solution requires testing many combinations. Furthermore, the learning model, dataset, and evaluation metric affect the best strategies. This work proposes the Meta-learning for Imbalanced Regression (Meta-IR) framework, which diverges from existing literature by training meta-classifiers to recommend the best pipeline composed of the resampling strategy and learning model per task in a zero-shot fashion. The meta-classifiers are trained using a set of meta-features to learn how to map the meta-features to the classes indicating the best pipeline. We propose two formulations: Independent and Chained. Independent trains the meta-classifiers to separately indicate the best learning algorithm and resampling strategy. Chained involves a sequential procedure where the output of one meta-classifier is used as input for another to model intrinsic relationship factors. The Chained scenario showed superior performance, suggesting a relationship between the learning algorithm and the resampling strategy per task. Compared with AutoML frameworks, Meta-IR obtained better results. Moreover, compared with baselines of six learning algorithms and six resampling algorithms plus no resampling, totaling 42 (6 X 7) configurations, Meta-IR outperformed all of them. The code, data, and further information of the experiments can be found on GitHub: this https URL.
[LG-32] A Policy-Improved Deep Deterministic Policy Gradient Framework for the Discount Order Acceptance Strategy of Ride-hailing Drivers
链接: https://arxiv.org/abs/2507.11865
作者: Hanwen Dai,Chang Gao,Fang He,Congyuan Ji,Yanni Yang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rapid expansion of platform integration has emerged as an effective solution to mitigate market fragmentation by consolidating multiple ride-hailing platforms into a single application. To address heterogeneous passenger preferences, third-party integrators provide Discount Express service delivered by express drivers at lower trip fares. For the individual platform, encouraging broader participation of drivers in Discount Express services has the potential to expand the accessible demand pool and improve matching efficiency, but often at the cost of reduced profit margins. This study aims to dynamically manage drivers’ acceptance of Discount Express from the perspective of individual platforms. The lack of historical data under the new business model necessitates online learning. However, early-stage exploration through trial and error can be costly in practice, highlighting the need for reliable early-stage performance in real-world deployment. To address these challenges, this study formulates the decision regarding the proportion of drivers’ acceptance behavior as a continuous control task. In response to the high stochasticity, the opaque matching mechanisms employed by the third-party integrator, and the limited availability of historical data, we propose a policy-improved deep deterministic policy gradient (pi-DDPG) framework. The proposed framework incorporates a refiner module to boost policy performance during the early training phase, leverages a convolutional long short-term memory network to effectively capture complex spatiotemporal patterns, and adopts a prioritized experience replay mechanism to enhance learning efficiency. A simulator based on a real-world dataset is developed to validate the effectiveness of the proposed pi-DDPG. Numerical experiments demonstrate that pi-DDPG achieves superior learning efficiency and significantly reduces early-stage training losses.
[LG-33] OrdShap: Feature Position Importance for Sequential Black-Box Models
链接: https://arxiv.org/abs/2507.11855
作者: Davin Hill,Brian L. Hill,Aria Masoomi,Vijay S. Nori,Robert E. Tillman,Jennifer Dy
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Sequential deep learning models excel in domains with temporal or sequential dependencies, but their complexity necessitates post-hoc feature attribution methods for understanding their predictions. While existing techniques quantify feature importance, they inherently assume fixed feature ordering - conflating the effects of (1) feature values and (2) their positions within input sequences. To address this gap, we introduce OrdShap, a novel attribution method that disentangles these effects by quantifying how a model’s predictions change in response to permuting feature position. We establish a game-theoretic connection between OrdShap and Sanchez-Bergantiños values, providing a theoretically grounded approach to position-sensitive attribution. Empirical results from health, natural language, and synthetic datasets highlight OrdShap’s effectiveness in capturing feature value and feature position attributions, and provide deeper insight into model behavior.
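To make the "position versus value" distinction concrete, here is a toy Monte-Carlo position-sensitivity score that relocates one element of a sequence while keeping its value fixed; it illustrates the quantity OrdShap targets but is not the paper's game-theoretic (Sanchez-Bergantiños) estimator.

```python
import numpy as np

def position_sensitivity(predict, x, idx, n_perm=50, seed=0):
    """Monte-Carlo estimate of how sensitive a sequence model's output is to
    the *position* of element `idx` while its value stays fixed. A simplified
    stand-in for OrdShap's position attribution."""
    rng = np.random.default_rng(seed)
    base = predict(x)
    deltas = []
    for _ in range(n_perm):
        new_pos = rng.integers(0, len(x))
        seq = np.delete(x, idx)
        seq = np.insert(seq, new_pos, x[idx])   # same value, new position
        deltas.append(abs(predict(seq) - base))
    return float(np.mean(deltas))

# Toy "model": a position-weighted sum, so early positions matter more.
predict = lambda s: float(np.dot(s, 1.0 / (np.arange(len(s)) + 1)))
x = np.array([3.0, 0.5, 0.1, 0.1, 0.1])
print(position_sensitivity(predict, x, idx=0))  # large: element 0's position matters
print(position_sensitivity(predict, x, idx=3))  # small: position barely matters
```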
[LG-34] Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update
链接: https://arxiv.org/abs/2507.11847
作者: Yu-Jie Zhang,Sheng-An Xu,Peng Zhao,Masashi Sugiyama
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with \mathcal{O}(1) time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.
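The constant-per-round flavor of such one-pass updates can be illustrated with a plain online-gradient sketch for a Bernoulli (logistic) GLB; the paper's actual OMD step and confidence-set construction are more refined, and the step-size rule below is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_pass_logistic_estimator(contexts, rewards, eta=1.0):
    """One-pass online estimator for a logistic (Bernoulli) GLB reward model.
    Each round performs a single gradient-style update on the observed context,
    so time and memory per round do not grow with the number of past rounds."""
    theta = np.zeros(contexts.shape[1])
    for t, (x_t, r_t) in enumerate(zip(contexts, rewards)):
        grad = (sigmoid(x_t @ theta) - r_t) * x_t       # logistic loss gradient
        theta -= eta / np.sqrt(t + 1) * grad            # decaying step size
    return theta

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5, 0.25])
X = rng.standard_normal((5000, 3))
R = rng.binomial(1, sigmoid(X @ theta_star))
print(one_pass_logistic_estimator(X, R))   # roughly recovers theta_star
```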
[LG-35] Protenix-Mini: Efficient Structure Predictor via Compact Architecture Few-Step Diffusion and Switchable pLM
链接: https://arxiv.org/abs/2507.11839
作者: Chengyue Gong,Xinshi Chen,Yuxuan Zhang,Yuxuan Song,Hao Zhou,Wenzhi Xiao
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:Lightweight inference is critical for biomolecular structure prediction and other downstream tasks, enabling efficient real-world deployment and inference-time scaling for large-scale applications. In this work, we address the challenge of balancing model efficiency and prediction accuracy by making several key modifications: 1) the multi-step AF3 sampler is replaced by a few-step ODE sampler, significantly reducing computational overhead for the diffusion module part during inference; 2) in the open-source Protenix framework, a subset of pairformer or diffusion transformer blocks does not contribute to the final structure prediction, presenting opportunities for architectural pruning and lightweight redesign; 3) a model incorporating an ESM module is trained to substitute the conventional MSA module, reducing MSA preprocessing time. Building on these key insights, we present Protenix-Mini, a compact and optimized model designed for efficient protein structure prediction. This streamlined version incorporates a more efficient architectural design with a two-step Ordinary Differential Equation (ODE) sampling strategy. By eliminating redundant Transformer components and refining the sampling process, Protenix-Mini significantly reduces model complexity with a slight accuracy drop. Evaluations on benchmark datasets demonstrate that it achieves high-fidelity predictions, with only a negligible 1 to 5 percent decrease in performance compared to its full-scale counterpart. This makes Protenix-Mini an ideal choice for applications where computational resources are limited but accurate structure prediction remains crucial.
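The few-step sampler is the easiest piece to picture in code: below is a generic two-step Euler integration of a learned velocity/denoiser field, with a toy field standing in for the trained network (time grid and conventions are illustrative, not Protenix-Mini's implementation).

```python
import torch

def few_step_ode_sample(velocity_fn, x_init, t_grid=(0.0, 0.5, 1.0)):
    """Few-step (here: 2-step) Euler integration of a learned velocity field,
    the kind of sampler that replaces a many-step diffusion sampler at
    inference. `velocity_fn(x, t)` stands in for the trained network."""
    x = x_init
    for t_cur, t_next in zip(t_grid[:-1], t_grid[1:]):
        v = velocity_fn(x, torch.tensor(t_cur))
        x = x + (t_next - t_cur) * v            # one Euler step per interval
    return x

# Toy velocity field that pulls states towards the origin (placeholder model).
velocity_fn = lambda x, t: -x
x0 = torch.randn(4, 3)        # e.g. 4 "atoms" with 3D coordinates
print(few_step_ode_sample(velocity_fn, x0))
```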
[LG-36] HyperEvent: Learning Cohesive Events for Large-scale Dynamic Link Prediction
链接: https://arxiv.org/abs/2507.11836
作者: Jian Gao,Jianshe Wu,JingYi Ding
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Dynamic link prediction in continuous-time dynamic graphs is a fundamental task for modeling evolving complex systems. Existing node-centric and event-centric methods focus on individual interactions or atomic states, failing to capture the structural cohesion of composite hyper-events, i.e., groups of causally related events. To address this, we propose HyperEvent, a framework reframing dynamic link prediction as hyper-event recognition. Central to HyperEvent is the dynamic construction of an association sequence using event correlation vectors. These vectors quantify pairwise dependencies between the query event and relevant historical events, thereby characterizing the structural cohesion of a potential hyper-event. The framework predicts the occurrence of the query event by evaluating whether it collectively forms a valid hyper-event with these historical events. Notably, HyperEvent outperforms state-of-the-art methods on 4 out of 5 datasets in the official leaderboard. For scalability, we further introduce an efficient parallel training algorithm that segments large event streams to enable concurrent training. Experiments validate HyperEvent’s superior accuracy and efficiency on large-scale graphs. In particular, HyperEvent achieves a 6.95% improvement in Mean Reciprocal Rank over the state-of-the-art baseline on the large-scale Flight dataset while utilizing only 10.17% of the training time.
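A toy illustration of what a pairwise "event correlation" feature could look like for a query interaction scored against recent history (shared endpoints plus temporal decay); the paper defines its own correlation vectors and learns from them, so treat this purely as intuition.

```python
import numpy as np

def correlation_vector(query, history, decay=0.01):
    """Toy event-correlation features: for a query interaction (u, v, t),
    score each historical event by shared endpoints and temporal proximity.
    A hyper-event recognizer would consume such pairwise dependency features;
    the exact construction in HyperEvent differs."""
    u, v, t = query
    feats = []
    for (hu, hv, ht) in history:
        shared = len({u, v} & {hu, hv})                 # 0, 1 or 2 shared nodes
        feats.append(shared * np.exp(-decay * (t - ht)))
    return np.array(feats)

history = [(1, 2, 10.0), (2, 3, 40.0), (7, 8, 45.0)]
print(correlation_vector((2, 5, 50.0), history))   # recent, connected events score higher
```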
[LG-37] Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
链接: https://arxiv.org/abs/2507.11830
作者: Samyam Rajbhandari,Mert Hidayetoglu,Aurick Qiao,Ye Wang,Juncheng Yang,Jeff Rasley,Michael Wyatt,Yuxiong He
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Inference is now the dominant AI workload, yet existing systems force trade-offs between latency, throughput, and cost. Arctic Inference, an open-source vLLM plugin from Snowflake AI Research, introduces Shift Parallelism, a dynamic parallelism strategy that adapts to real-world traffic while integrating speculative decoding, SwiftKV compute reduction, and optimized embedding inference. It achieves up to 3.4 times faster request completion, 1.75 times faster generation, and 1.6M tokens/sec per GPU for embeddings, outperforming both latency- and throughput-optimized deployments. Already powering Snowflake Cortex AI, Arctic Inference delivers state-of-the-art, cost-effective inference for enterprise AI and is now available to the community.
[LG-38] SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling
链接: https://arxiv.org/abs/2507.11818
作者: Andrei Rekesh,Miruna Cretu,Dmytro Shevchuk,Vignesh Ram Somnath,Pietro Liò,Robert A. Batey,Mike Tyers,Michał Koziarski,Cheng-Hao Liu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensuring synthesizability in generative small molecule design remains a major challenge. While recent developments in synthesizable molecule generation have demonstrated promising results, these efforts have been largely confined to 2D molecular graph representations, limiting the ability to perform geometry-based conditional generation. In this work, we present SynCoGen (Synthesizable Co-Generation), a single framework that combines simultaneous masked graph diffusion and flow matching for synthesizable 3D molecule generation. SynCoGen samples from the joint distribution of molecular building blocks, chemical reactions, and atomic coordinates. To train the model, we curated SynSpace, a dataset containing over 600K synthesis-aware building block graphs and 3.3M conformers. SynCoGen achieves state-of-the-art performance in unconditional small molecule graph and conformer generation, and the model delivers competitive performance in zero-shot molecular linker design for protein ligand generation in drug discovery. Overall, this multimodal formulation represents a foundation for future applications enabled by non-autoregressive molecular generation, including analog expansion, lead optimization, and direct structure conditioning.
[LG-39] Enforcing Latent Euclidean Geometry in Single-Cell VAEs for Manifold Interpolation
链接: https://arxiv.org/abs/2507.11789
作者: Alessandro Palma,Sergei Rybakov,Leon Hetzel,Stephan Günnemann,Fabian J. Theis
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 31 pages, 14 figures
点击查看摘要
Abstract:Latent space interpolations are a powerful tool for navigating deep generative models in applied settings. An example is single-cell RNA sequencing, where existing methods model cellular state transitions as latent space interpolations with variational autoencoders, often assuming linear shifts and Euclidean geometry. However, unless explicitly enforced, linear interpolations in the latent space may not correspond to geodesic paths on the data manifold, limiting methods that assume Euclidean geometry in the data representations. We introduce FlatVI, a novel training framework that regularises the latent manifold of discrete-likelihood variational autoencoders towards Euclidean geometry, specifically tailored for modelling single-cell count data. By encouraging straight lines in the latent space to approximate geodesic interpolations on the decoded single-cell manifold, FlatVI enhances compatibility with downstream approaches that assume Euclidean latent geometry. Experiments on synthetic data support the theoretical soundness of our approach, while applications to time-resolved single-cell RNA sequencing data demonstrate improved trajectory reconstruction and manifold interpolation.
[LG-40] Scaling laws for activation steering with Llama 2 models and refusal mechanisms
链接: https://arxiv.org/abs/2507.11771
作者: Sheikh Abdur Raheem Ali,Justin Xu,Ivory Yang,Jasmine Xinze Li,Ayse Arslan,Clark Benham
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As large language models (LLMs) evolve in complexity and capability, the efficacy of less widely deployed alignment techniques is uncertain. Building on previous work on activation steering and contrastive activation addition (CAA), this paper explores the effectiveness of CAA with model scale using the family of Llama 2 models (7B, 13B, and 70B). CAA works by finding a desirable ‘direction’ in the model’s residual stream vector space using contrastive pairs (for example, hate to love) and adding this direction to the residual stream during the forward pass. It directly manipulates the residual stream and aims to extract features from language models to better control their outputs. Using answer matching questions centered around the refusal behavior, we found that 1) CAA is most effective when applied at early-mid layers. 2) The effectiveness of CAA diminishes with model size. 3) Negative steering has more pronounced effects than positive steering across all model sizes.
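A minimal sketch of the CAA mechanism with a PyTorch forward hook: a steering vector (which in practice comes from averaging activation differences over contrastive prompt pairs) is added to a block's output during the forward pass. The commented Hugging Face-style attribute path and layer index are assumptions that must be adapted to the actual checkpoint; the runnable part uses a plain linear layer as a stand-in.

```python
import torch

def make_steering_hook(direction, alpha=4.0):
    """Add a fixed direction to a block's output, CAA-style. `direction` would
    come from averaging activation differences over contrastive prompt pairs;
    here it is simply a given vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Assumed usage on a Hugging Face-style Llama model (attribute path and layer
# index are hypothetical; adjust to the real model):
# direction = (pos_acts - neg_acts).mean(dim=0)        # from contrastive pairs
# handle = model.model.layers[13].register_forward_hook(make_steering_hook(direction))
# ... generate text ...
# handle.remove()

# Self-contained toy check with a plain linear "layer":
layer = torch.nn.Linear(8, 8)
direction = torch.randn(8)
handle = layer.register_forward_hook(make_steering_hook(direction, alpha=1.0))
print(layer(torch.randn(2, 8)).shape)   # output is shifted by alpha * direction
handle.remove()
```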
[LG-41] Torsional-GFN: a conditional conformation generator for small molecules
链接: https://arxiv.org/abs/2507.11759
作者: Alexandra Volokhova,Léna Néhale Ezzine,Piotr Gaiński,Luca Scimeca,Emmanuel Bengio,Prudencio Tossou,Yoshua Bengio,Alex Hernandez-Garcia
类目: Machine Learning (cs.LG)
*备注: The two first authors are Alexandra Volokhova and Léna Néhale Ezzine, with equal contribution
点击查看摘要
Abstract:Generating stable molecular conformations is crucial in several drug discovery applications, such as estimating the binding affinity of a molecule to a target. Recently, generative machine learning methods have emerged as a promising, more efficient method than molecular dynamics for sampling of conformations from the Boltzmann distribution. In this paper, we introduce Torsional-GFN, a conditional GFlowNet specifically designed to sample conformations of molecules proportionally to their Boltzmann distribution, using only a reward function as training signal. Conditioned on a molecular graph and its local structure (bond lengths and angles), Torsional-GFN samples rotations of its torsion angles. Our results demonstrate that Torsional-GFN is able to sample conformations approximately proportional to the Boltzmann distribution for multiple molecules with a single model, and allows for zero-shot generalization to unseen bond lengths and angles coming from the MD simulations for such molecules. Our work presents a promising avenue for scaling the proposed approach to larger molecular systems, achieving zero-shot generalization to unseen molecules, and including the generation of the local structure into the GFlowNet model.
[LG-42] A Graph-in-Graph Learning Framework for Drug-Target Interaction Prediction
链接: https://arxiv.org/abs/2507.11757
作者: Yuehua Song,Yong Gao
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:Accurately predicting drug-target interactions (DTIs) is pivotal for advancing drug discovery and target validation techniques. While machine learning approaches including those that are based on Graph Neural Networks (GNN) have achieved notable success in DTI prediction, many of them have difficulties in effectively integrating the diverse features of drugs, targets and their interactions. To address this limitation, we introduce a novel framework to take advantage of the power of both transductive learning and inductive learning so that features at molecular level and drug-target interaction network level can be exploited. Within this framework is a GNN-based model called Graph-in-Graph (GiG) that represents graphs of drug and target molecular structures as meta-nodes in a drug-target interaction graph, enabling a detailed exploration of their intricate relationships. To evaluate the proposed model, we have compiled a special benchmark comprising drug SMILES, protein sequences, and their interaction data, which is interesting in its own right. Our experimental results demonstrate that the GiG model significantly outperforms existing approaches across all evaluation metrics, highlighting the benefits of integrating different learning paradigms and interaction data.
[LG-43] Sparse Identification of Nonlinear Dynamics with Conformal Prediction
链接: https://arxiv.org/abs/2507.11739
作者: Urban Fasel
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS)
*备注:
点击查看摘要
Abstract:The Sparse Identification of Nonlinear Dynamics (SINDy) is a method for discovering nonlinear dynamical system models from data. Quantifying uncertainty in SINDy models is essential for assessing their reliability, particularly in safety-critical applications. While various uncertainty quantification methods exist for SINDy, including Bayesian and ensemble approaches, this work explores the integration of Conformal Prediction, a framework that can provide valid prediction intervals with coverage guarantees based on minimal assumptions like data exchangeability. We introduce three applications of conformal prediction with Ensemble-SINDy (E-SINDy): (1) quantifying uncertainty in time series prediction, (2) model selection based on library feature importance, and (3) quantifying the uncertainty of identified model coefficients using feature conformal prediction. We demonstrate the three applications on stochastic predator-prey dynamics and several chaotic dynamical systems. We show that conformal prediction methods integrated with E-SINDy can reliably achieve desired target coverage for time series forecasting, effectively quantify feature importance, and produce more robust uncertainty intervals for model coefficients, even under non-Gaussian noise, compared to standard E-SINDy coefficient estimates.
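For intuition, here is a minimal split-conformal wrapper of the kind the paper combines with E-SINDy for time-series forecasts: calibrate absolute residuals on held-out data, then attach a constant-width band to new point forecasts (the forecasts below are synthetic placeholders, and E-SINDy itself is not implemented here).

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction: calibrate absolute residuals on held-out
    data, then wrap any point forecast (e.g. an E-SINDy ensemble mean) in an
    interval with approximately (1 - alpha) marginal coverage."""
    scores = np.abs(cal_true - cal_pred)                  # nonconformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))               # conformal quantile index
    q = np.sort(scores)[min(k, n) - 1]
    return test_pred - q, test_pred + q

# Toy calibration/test forecasts standing in for E-SINDy trajectory predictions.
rng = np.random.default_rng(1)
cal_true = np.sin(np.linspace(0, 10, 200))
cal_pred = cal_true + 0.1 * rng.standard_normal(200)
test_pred = np.sin(np.linspace(10, 12, 50)) + 0.1 * rng.standard_normal(50)
lo, hi = split_conformal_interval(cal_pred, cal_true, test_pred)
print(float((hi - lo).mean()))    # constant-width conformal band
```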
[LG-44] Graph Neural Networks Powered by Encoder Embedding for Improved Node Learning
链接: https://arxiv.org/abs/2507.11732
作者: Shiyu Chen,Cencheng Shen,Youngser Park,Carey E. Priebe
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Graph neural networks (GNNs) have emerged as a powerful framework for a wide range of node-level graph learning tasks. However, their performance is often constrained by reliance on random or minimally informed initial feature representations, which can lead to slow convergence and suboptimal solutions. In this paper, we leverage a statistically grounded method, one-hot graph encoder embedding (GEE), to generate high-quality initial node features that enhance the end-to-end training of GNNs. We refer to this integrated framework as the GEE-powered GNN (GG), and demonstrate its effectiveness through extensive simulations and real-world experiments across both unsupervised and supervised settings. In node clustering, GG consistently achieves state-of-the-art performance, ranking first across all evaluated real-world datasets, while exhibiting faster convergence compared to the standard GNN. For node classification, we further propose an enhanced variant, GG-C, which concatenates the outputs of GG and GEE and outperforms competing baselines. These results confirm the importance of principled, structure-aware feature initialization in realizing the full potential of GNNs.
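The GEE initialization itself is simple to sketch: each node's feature vector is its class-size-normalized connectivity to every label group, computed from the adjacency matrix and (known or initially clustered) labels. This follows the standard one-hot encoder embedding construction; the toy stochastic block model below is only for demonstration.

```python
import numpy as np

def graph_encoder_embedding(A, labels, n_classes):
    """One-hot graph encoder embedding (GEE): represent each node by its
    class-normalized connectivity to every label group. These vectors serve
    as structure-aware initial node features for a GNN."""
    W = np.zeros((A.shape[0], n_classes))
    for k in range(n_classes):
        members = labels == k
        W[members, k] = 1.0 / members.sum()      # scale by class size
    return A @ W                                 # (n_nodes, n_classes) embedding

# Toy 2-block stochastic block model.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
P = np.where(labels[:, None] == labels[None, :], 0.3, 0.05)
A = rng.binomial(1, P)
A = np.triu(A, 1); A = A + A.T                   # symmetric, no self-loops
Z = graph_encoder_embedding(A, labels, 2)
print(Z[:3].round(2), Z[-3:].round(2))           # block structure is visible
```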
[LG-45] Reinforcement Learning from Adversarial Preferences in Tabular MDPs
链接: https://arxiv.org/abs/2507.11706
作者: Taira Tsuchiya,Shinji Ito,Haipeng Luo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 40 pages
点击查看摘要
Abstract:We introduce a new framework of episodic tabular Markov decision processes (MDPs) with adversarial preferences, which we refer to as preference-based MDPs (PbMDPs). Unlike standard episodic MDPs with adversarial losses, where the numerical value of the loss is directly observed, in PbMDPs the learner instead observes preferences between two candidate arms, which represent the choices being compared. In this work, we focus specifically on the setting where the reward functions are determined by Borda scores. We begin by establishing a regret lower bound for PbMDPs with Borda scores. As a preliminary step, we present a simple instance to prove a lower bound of \Omega(\sqrt{HSAT}) for episodic MDPs with adversarial losses, where H is the number of steps per episode, S is the number of states, A is the number of actions, and T is the number of episodes. Leveraging this construction, we then derive a regret lower bound of \Omega((H^2 S K)^{1/3} T^{2/3}) for PbMDPs with Borda scores, where K is the number of arms. Next, we develop algorithms that achieve a regret bound of order T^{2/3}. We first propose a global optimization approach based on online linear optimization over the set of all occupancy measures, achieving a regret bound of \tilde{O}((H^2 S^2 K)^{1/3} T^{2/3}) under known transitions. However, this approach suffers from suboptimal dependence on the potentially large number of states S and computational inefficiency. To address this, we propose a policy optimization algorithm whose regret is roughly bounded by \tilde{O}((H^6 S K^5)^{1/3} T^{2/3}) under known transitions, and further extend the result to the unknown-transition setting.
[LG-46] Composing Linear Layers from Irreducibles
链接: https://arxiv.org/abs/2507.11688
作者: Travis Pence,Daisuke Yamada,Vikas Singh
类目: Machine Learning (cs.LG)
*备注: 27 Pages, 13 Tables, 8 Figures
点击查看摘要
Abstract:Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors – geometric objects encoding oriented planes – and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O(log^2 d) parameters, versus O(d^2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.
[LG-47] STAGED: A Multi-Agent Neural Network for Learning Cellular Interaction Dynamics
链接: https://arxiv.org/abs/2507.11660
作者: Joao F. Rocha,Ke Xu,Xingzhi Sun,Ananya Krishna,Dhananjay Bhaskar,Blanche Mongeon,Morgan Craig,Mark Gerstein,Smita Krishnaswamy
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:The advent of single-cell technology has significantly improved our understanding of cellular states and subpopulations in various tissues under normal and diseased conditions by employing data-driven approaches such as clustering and trajectory inference. However, these methods consider cells as independent data points of population distributions. With spatial transcriptomics, we can represent cellular organization, along with dynamic cell-cell interactions that lead to changes in cell state. Still, key computational advances are necessary to enable the data-driven learning of such complex interactive cellular dynamics. While agent-based modeling (ABM) provides a powerful framework, traditional approaches rely on handcrafted rules derived from domain knowledge rather than data-driven approaches. To address this, we introduce Spatio-Temporal Agent-Based Graph Evolution Dynamics (STAGED), which integrates ABM with deep learning to model intercellular communication and its effect on the intracellular gene regulatory network. Using graph ODE networks (GDEs) with shared weights per cell type, our approach represents genes as vertices and interactions as directed edges, dynamically learning their strengths through a designed attention mechanism. Trained to match continuous simulated trajectories as well as trajectories inferred from spatial transcriptomics data, the model captures both intercellular and intracellular interactions, enabling a more adaptive and accurate representation of cellular dynamics.
[LG-48] ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs
链接: https://arxiv.org/abs/2507.11649
作者: Daniel Commey,Benjamin Appiah,Griffith S. Klogo,Garth V. Crosby
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Federated Learning (FL) enables collaborative model training on decentralized data without exposing raw data. However, the evaluation phase in FL may leak sensitive information through shared performance metrics. In this paper, we propose a novel protocol that incorporates Zero-Knowledge Proofs (ZKPs) to enable privacy-preserving and verifiable evaluation for FL. Instead of revealing raw loss values, clients generate a succinct proof asserting that their local loss is below a predefined threshold. Our approach is implemented without reliance on external APIs, using self-contained modules for federated learning simulation, ZKP circuit design, and experimental evaluation on both the MNIST and Human Activity Recognition (HAR) datasets. We focus on a threshold-based proof for a simple Convolutional Neural Network (CNN) model (for MNIST) and a multi-layer perceptron (MLP) model (for HAR), and evaluate the approach in terms of computational overhead, communication cost, and verifiability.
[LG-49] Deep Generative Methods and Tire Architecture Design
链接: https://arxiv.org/abs/2507.11639
作者: Fouad Oubari,Raphael Meunier,Rodrigue Décatoire,Mathilde Mougeot
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As deep generative models proliferate across the AI landscape, industrial practitioners still face critical yet unanswered questions about which deep generative models best suit complex manufacturing design tasks. This work addresses this question through a complete study of five representative models (Variational Autoencoder, Generative Adversarial Network, multimodal Variational Autoencoder, Denoising Diffusion Probabilistic Model, and Multinomial Diffusion Model) on industrial tire architecture generation. Our evaluation spans three key industrial scenarios: (i) unconditional generation of complete multi-component designs, (ii) component-conditioned generation (reconstructing architectures from partial observations), and (iii) dimension-constrained generation (creating designs that satisfy specific dimensional requirements). To enable discrete diffusion models to handle conditional scenarios, we introduce categorical inpainting, a mask-aware reverse diffusion process that preserves known labels without requiring additional training. Our evaluation employs geometry-aware metrics specifically calibrated for industrial requirements, quantifying spatial coherence, component interaction, structural connectivity, and perceptual fidelity. Our findings reveal that diffusion models achieve the strongest overall performance; a masking-trained VAE nonetheless outperforms the multimodal variant MMVAE+ on nearly all component-conditioned metrics, and within the diffusion family MDM leads in-distribution whereas DDPM generalises better to out-of-distribution dimensional constraints.
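The categorical-inpainting idea (preserve observed component labels during the reverse diffusion, with no retraining) can be sketched as a single mask-aware reverse step; the denoiser output below is random and purely schematic.

```python
import numpy as np

def masked_reverse_step(model_probs, known_mask, known_labels, rng):
    """One reverse step of a categorical (multinomial) diffusion with
    'categorical inpainting': sample unobserved positions from the model's
    predicted class distribution, then overwrite positions whose labels are
    observed so they are preserved without any retraining. Schematic only."""
    sampled = np.array([rng.choice(len(p), p=p) for p in model_probs])
    return np.where(known_mask, known_labels, sampled)

rng = np.random.default_rng(0)
n_cells, n_classes = 8, 4
model_probs = rng.dirichlet(np.ones(n_classes), size=n_cells)   # denoiser output
known_mask = np.array([1, 1, 0, 0, 0, 0, 0, 1], dtype=bool)     # observed components
known_labels = np.array([2, 0, 0, 0, 0, 0, 0, 3])
print(masked_reverse_step(model_probs, known_mask, known_labels, rng))
```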
[LG-50] Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques
链接: https://arxiv.org/abs/2507.11590
作者: Raju Challagundla,Mohsen Dorodchi,Pu Wang,Minwoo Lee
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution, especially for tabular datasets, which are central to domains like finance, healthcare and the social sciences. This survey presents a comprehensive and focused review of recent advances in synthetic tabular data generation, emphasizing methods that preserve complex feature relationships, maintain statistical fidelity, and satisfy privacy requirements. A key contribution of this work is the introduction of a novel taxonomy based on practical generation objectives, including intended downstream applications, privacy guarantees, and data utility, directly informing methodological design and evaluation strategies. Therefore, this review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling. Additionally, the survey proposes a benchmark framework to align technical innovation with real-world demands. By bridging theoretical foundations with practical deployment, this work serves as both a roadmap for future research and a guide for implementing synthetic tabular data in privacy-critical environments.
[LG-51] Einstein Fields: A Neural Perspective To Computational General Relativity
链接: https://arxiv.org/abs/2507.11589
作者: Sandeep Suresh Cranganore,Andrei Bodnar,Arturs Berzins,Johannes Brandstetter
类目: Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 63 pages, 22 figures, 10 Tables, Github: this https URL
点击查看摘要
Abstract:We introduce Einstein Fields, a neural representation that is designed to compress computationally intensive four-dimensional numerical relativity simulations into compact implicit neural network weights. By modeling the metric, which is the core tensor field of general relativity, Einstein Fields enable the derivation of physical quantities via automatic differentiation. However, unlike conventional neural fields (e.g., signed distance, occupancy, or radiance fields), Einstein Fields are Neural Tensor Fields with the key difference that when encoding the spacetime geometry of general relativity into neural field representations, dynamics emerge naturally as a byproduct. Einstein Fields show remarkable potential, including continuum modeling of 4D spacetime, mesh-agnosticity, storage efficiency, derivative accuracy, and ease of use. We address these challenges across several canonical test beds of general relativity and release an open source JAX-based library, paving the way for more scalable and expressive approaches to numerical relativity. Code is made available at this https URL
[LG-52] Recurrent U-Net-Based Graph Neural Network (RUGNN) for Accurate Deformation Predictions in Sheet Material Forming
链接: https://arxiv.org/abs/2507.11547
作者: Yingxue Zhao,Qianyi Chen,Haoran Li,Haosu Zhou,Hamid Reza Attar,Tobias Pfaff,Tailin Wu,Nan Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, various artificial intelligence-based surrogate models have been proposed to provide rapid manufacturability predictions of material forming processes. However, traditional AI-based surrogate models, typically built with scalar or image-based neural networks, are limited in their ability to capture complex 3D spatial relationships and to operate in a permutation-invariant manner. To overcome these issues, emerging graph-based surrogate models are developed using graph neural networks. This study developed a new graph neural network surrogate model named Recurrent U-Net-based Graph Neural Network (RUGNN). The RUGNN model can achieve accurate predictions of sheet material deformation fields across multiple forming timesteps. The RUGNN model incorporates Gated Recurrent Units (GRUs) to model temporal dynamics and a U-Net inspired graph-based downsample/upsample mechanism to handle spatial long-range dependencies. A novel ‘node-to-surface’ contact representation method was proposed, offering significant improvements in computational efficiency for large-scale contact interactions. The RUGNN model was validated using a cold forming case study and a more complex hot forming case study on aluminium alloys. Results demonstrate that the RUGNN model provides accurate deformation predictions closely matching ground truth FE simulations and outperforming several baseline GNN architectures. Model tuning was also performed to identify suitable hyperparameters, training strategies, and input feature representations. These results demonstrate that RUGNN is a reliable approach to support sheet material forming design by enabling accurate manufacturability predictions.
[LG-53] The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
链接: https://arxiv.org/abs/2507.11544
作者: Ann-Kathrin Dombrowski,Dillon Bowen,Adam Gleave,Chris Cundy
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 9 pages plus appendix
点击查看摘要
Abstract:Open-weight large language models (LLMs) unlock huge benefits in innovation, personalization, privacy, and democratization. However, their core advantage - modifiability - opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a ‘safety gap’: the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and effective dangerous capabilities grow substantially when safeguards are removed. We hope that the Safety Gap Toolkit (this https URL) will serve as an evaluation framework for common open-source models and as a motivation for developing and testing tamper-resistant safeguards. We welcome contributions to the toolkit from the community.
[LG-54] Neural Network-Guided Symbolic Regression for Interpretable Descriptor Discovery in Perovskite Catalysts
链接: https://arxiv.org/abs/2507.12404
作者: Yeming Xian,Xiaoming Wang,Yanfa Yan
类目: Data Analysis, Statistics and Probability (physics.data-an); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 31 pages
点击查看摘要
Abstract:Understanding and predicting the activity of oxide perovskite catalysts for the oxygen evolution reaction (OER) requires descriptors that are both accurate and physically interpretable. While symbolic regression (SR) offers a path to discover such formulas, its performance degrades with high-dimensional inputs and small datasets. We present a two-phase framework that combines neural networks (NN), feature importance analysis, and symbolic regression (SR) to discover interpretable descriptors for OER activity in oxide perovskites. In Phase I, using a small dataset and seven structural features, we reproduce and improve the known \mu/t descriptor by engineering composite features and applying symbolic regression, achieving training and validation MAEs of 22.8 and 20.8 meV, respectively. In Phase II, we expand to 164 features, reduce dimensionality, and identify LUMO energy as a key electronic descriptor. A final formula using \mu/t, \mu/RA, and LUMO energy achieves improved accuracy (training and validation MAEs of 22.1 and 20.6 meV) with strong physical interpretability. Our results demonstrate that NN-guided symbolic regression enables accurate, interpretable, and physically meaningful descriptor discovery in data-scarce regimes, indicating that interpretability need not come at the cost of accuracy in materials informatics.
[LG-55] Surrogate Quantum Circuit Design for the Lattice Boltzmann Collision Operator
链接: https://arxiv.org/abs/2507.12256
作者: Monica Lăcătuş,Matthias Möller
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 31 pages, 14 figures
点击查看摘要
Abstract:Direct numerical simulation of turbulent flows at high Reynolds numbers remains a major challenge for traditional computational fluid dynamics (CFD) tools running on classical computer hardware. This has motivated growing interest in quantum algorithms for CFD to enable flow simulations on quantum computers, since these computers are expected to deliver speed-ups for certain problems. One promising quantum CFD approach is a fully quantum implementation of the lattice Boltzmann method called QLBM. Although efficient quantum routines are now available for the streaming step, implementing the nonlinear, irreversible collision step with a low-depth circuit that avoids additional ancilla qubits, probabilistic post-selection and repeated executions remains a significant challenge. In this study, we address this challenge by introducing a framework for learning a surrogate quantum circuit (SQC) that approximates the full Bhatnagar-Gross-Krook (BGK) collision operator for the D2Q9 lattice. The four-qubit circuit is trained to respect the physical properties of the BGK collision operator, including mass and momentum conservation, D8 equivariance and scale equivariance. When compiled to the gate set used by the IBM Heron processor under the assumption of full qubit connectivity, the 15-block SQC requires only 2,430 native gates and uses neither ancilla qubits nor post-selection or repeated executions. Moreover, its depth is independent of the grid resolution, as collision is a local operation that can exploit quantum parallelism to its full extent. We validate the SQC on two benchmark flows, the Taylor-Green vortex decay and the lid-driven cavity, demonstrating that it accurately captures vortex dissipation and flow recirculation.
[LG-56] Improved Analysis for Sign-based Methods with Momentum Updates
链接: https://arxiv.org/abs/2507.12091
作者: Wei Jiang,Dingzhi Yu,Sifan Yang,Wenhao Yang,Lijun Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we present enhanced analysis for sign-based optimization algorithms with momentum updates. Traditional sign-based methods, under the separable smoothness assumption, guarantee a convergence rate of \mathcal{O}(T^{-1/4}), but they either require large batch sizes or assume unimodal symmetric stochastic noise. To address these limitations, we demonstrate that signSGD with momentum can achieve the same convergence rate using constant batch sizes without additional assumptions. Our analysis, under the standard l_2-smoothness condition, improves upon the result of the prior momentum-based signSGD method by a factor of \mathcal{O}(d^{1/2}), where d is the problem dimension. Furthermore, we explore sign-based methods with majority vote in distributed settings and show that the proposed momentum-based method yields convergence rates of \mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2}) and \mathcal{O}(\max\{d^{1/4}T^{-1/4}, d^{1/10}T^{-1/5}\}), which outperform the previous results of \mathcal{O}(dT^{-1/4} + dn^{-1/2}) and \mathcal{O}(d^{3/8}T^{-1/8}), respectively. Numerical experiments further validate the effectiveness of the proposed methods.
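A minimal PyTorch version of the update analyzed here (step in the direction of the sign of an exponential moving average of gradients); the hyperparameters and the toy quadratic are illustrative, and the distributed majority-vote variant is not shown.

```python
import torch

class SignSGDMomentum(torch.optim.Optimizer):
    """Minimal sign-based optimizer with momentum: maintain an exponential
    moving average of gradients and step along its elementwise sign."""
    def __init__(self, params, lr=1e-3, beta=0.9):
        super().__init__(params, dict(lr=lr, beta=beta))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                m = state["momentum"]
                m.mul_(group["beta"]).add_(p.grad, alpha=1 - group["beta"])
                p.add_(m.sign(), alpha=-group["lr"])

# Quick sanity check on a quadratic objective.
w = torch.nn.Parameter(torch.tensor([5.0, -3.0]))
opt = SignSGDMomentum([w], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    opt.step()
print(w.data)   # approaches zero, oscillating within roughly one step size
```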
[LG-57] Incorporating Fairness Constraints into Archetypal Analysis
链接: https://arxiv.org/abs/2507.12021
作者: Aleix Alcacer,Irene Epifanio
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Archetypal Analysis (AA) is an unsupervised learning method that represents data as convex combinations of extreme patterns called archetypes. While AA provides interpretable and low-dimensional representations, it can inadvertently encode sensitive attributes, leading to fairness concerns. In this work, we propose Fair Archetypal Analysis (FairAA), a modified formulation that explicitly reduces the influence of sensitive group information in the learned projections. We also introduce FairKernelAA, a nonlinear extension that addresses fairness in more complex data distributions. Our approach incorporates a fairness regularization term while preserving the structure and interpretability of the archetypes. We evaluate FairAA and FairKernelAA on synthetic datasets, including linear, nonlinear, and multi-group scenarios, demonstrating their ability to reduce group separability – as measured by maximum mean discrepancy and linear separability – without substantially compromising explained variance. We further validate our methods on the real-world ANSUR I dataset, confirming their robustness and practical utility. The results show that FairAA achieves a favorable trade-off between utility and fairness, making it a promising tool for responsible representation learning in sensitive applications.
[LG-58] Recent results on searches with boosted Higgs bosons at CMS
链接: https://arxiv.org/abs/2507.11977
作者: Farouk Mokhtar
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, The Thirteenth Annual Large Hadron Collider Physics (LHCP2025)
点击查看摘要
Abstract:The study of boosted Higgs bosons at the LHC provides a unique window to probe Higgs boson couplings at high energy scales and search for signs of physics beyond the standard model. In these proceedings, we present recent results on boosted Higgs boson searches at the CMS experiment, highlighting innovative reconstruction and tagging techniques that enhance sensitivity in this challenging regime.
[LG-59] RNAMunin: A Deep Machine Learning Model for Non-coding RNA Discovery
链接: https://arxiv.org/abs/2507.11950
作者: Lauren Lui,Torben Nielsen
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Functional annotation of microbial genomes is often biased toward protein-coding genes, leaving a vast, unexplored landscape of non-coding RNAs (ncRNAs) that are critical for regulating bacterial and archaeal physiology, stress response and metabolism. Identifying ncRNAs directly from genomic sequence is a paramount challenge in bioinformatics and biology, essential for understanding the complete regulatory potential of an organism. This paper presents RNAMunin, a machine learning (ML) model that is capable of finding ncRNAs using genomic sequence alone. It is also computationally viable for large sequence datasets such as long read metagenomic assemblies with contigs totaling multiple Gbp. RNAMunin is trained on Rfam sequences extracted from approximately 60 Gbp of long read metagenomes from 16 San Francisco Estuary samples. We know of no other model that can detect ncRNAs based solely on genomic sequence at this scale. Since RNAMunin only requires genomic sequence as input, we do not need for an ncRNA to be transcribed to find it, i.e., we do not need transcriptomics data. We wrote this manuscript in a narrative style in order to best convey how RNAMunin was developed and how it works in detail. Unlike almost all current ML models, at approximately 1M parameters, RNAMunin is very small and very fast.
[LG-60] Newfluence: Boosting Model interpretability and Understanding in High Dimensions
链接: https://arxiv.org/abs/2507.11895
作者: Haolin Zou,Arnab Auddy,Yongchan Kwon,Kamiar Rahnama Rad,Arian Maleki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:The increasing complexity of machine learning (ML) and artificial intelligence (AI) models has created a pressing need for tools that help scientists, engineers, and policymakers interpret and refine model decisions and predictions. Influence functions, originating from robust statistics, have emerged as a popular approach for this purpose. However, the heuristic foundations of influence functions rely on low-dimensional assumptions where the number of parameters p is much smaller than the number of observations n. In contrast, modern AI models often operate in high-dimensional regimes with large p, challenging these assumptions. In this paper, we examine the accuracy of influence functions in high-dimensional settings. Our theoretical and empirical analyses reveal that influence functions cannot reliably fulfill their intended purpose. We then introduce an alternative approximation, called Newfluence, that maintains similar computational efficiency while offering significantly improved accuracy. Newfluence is expected to provide more accurate insights than many existing methods for interpreting complex AI models and diagnosing their issues. Moreover, the high-dimensional framework we develop in this paper can also be applied to analyze other popular techniques, such as Shapley values.
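For reference, the classical influence-function approximation whose high-dimensional accuracy the paper questions looks like this for l2-regularized logistic regression (influence_i = -H^{-1} g_i); Newfluence's corrected estimator is not reproduced here, and the data and regularization value are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classical_influence(X, y, theta, reg=1e-3):
    """Classical influence-function approximation of the parameter shift from
    upweighting each sample, for l2-regularized logistic regression:
    influence_i = -H^{-1} g_i, with H the average Hessian at theta."""
    p = sigmoid(X @ theta)
    W = p * (1 - p)
    H = (X * W[:, None]).T @ X / len(y) + reg * np.eye(X.shape[1])   # Hessian
    G = (p - y)[:, None] * X                                         # per-sample gradients
    return -np.linalg.solve(H, G.T).T            # one influence vector per sample

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
theta_star = rng.standard_normal(5)
y = rng.binomial(1, sigmoid(X @ theta_star))
infl = classical_influence(X, y, theta_star)
print(infl.shape)        # (500, 5): approximate parameter shift per sample
```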
[LG-61] Choosing the Better Bandit Algorithm under Data Sharing: When Do A/B Experiments Work?
链接: https://arxiv.org/abs/2507.11891
作者: Shuangning Li,Chonghuan Wang,Jingyan Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We study A/B experiments that are designed to compare the performance of two recommendation algorithms. Prior work has shown that the standard difference-in-means estimator is biased in estimating the global treatment effect (GTE) due to a particular form of interference between experimental units. Specifically, units under the treatment and control algorithms contribute to a shared pool of data that subsequently train both algorithms, resulting in interference between the two groups. The bias arising from this type of data sharing is known as “symbiosis bias”. In this paper, we highlight that, for decision-making purposes, the sign of the GTE often matters more than its precise magnitude when selecting the better algorithm. We formalize this insight under a multi-armed bandit framework and theoretically characterize when the sign of the expected GTE estimate under data sharing aligns with or contradicts the sign of the true GTE. Our analysis identifies the level of exploration versus exploitation as a key determinant of how symbiosis bias impacts algorithm selection.
[LG-62] CosmoFlow: Scale-Aware Representation Learning for Cosmology with Flow Matching
链接: https://arxiv.org/abs/2507.11842
作者: Sidharth Kannan,Tian Qiu,Carolina Cuesta-Lazaro,Haewon Jeong
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generative machine learning models have been shown to learn low-dimensional representations of data that preserve the information required for downstream tasks. In this work, we demonstrate that flow matching based generative models can learn compact, semantically rich latent representations of field level cold dark matter (CDM) simulation data without supervision. Our model, CosmoFlow, learns representations 32x smaller than the raw field data, usable for field level reconstruction, synthetic data generation, and parameter inference. Our model also learns interpretable representations, in which different latent channels correspond to features at different cosmological scales.
[LG-63] MOFSimBench: Evaluating Universal Machine Learning Interatomic Potentials In Metal–Organic Framework Molecular Modeling
链接: https://arxiv.org/abs/2507.11806
作者: Hendrik Kraß,Ju Huang,Seyed Mohamad Moosavi
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Universal machine learning interatomic potentials (uMLIPs) have emerged as powerful tools for accelerating atomistic simulations, offering scalable and efficient modeling with accuracy close to quantum calculations. However, their reliability and effectiveness in practical, real-world applications remain an open question. Metal-organic frameworks (MOFs) and related nanoporous materials are highly porous crystals with critical relevance in carbon capture, energy storage, and catalysis applications. Modeling nanoporous materials presents distinct challenges for uMLIPs due to their diverse chemistry, their structural complexity, including porosity and coordination bonds, and their absence from existing training datasets. Here, we introduce MOFSimBench, a benchmark to evaluate uMLIPs on key materials modeling tasks for nanoporous materials, including structural optimization, molecular dynamics (MD) stability, the prediction of bulk properties, such as bulk modulus and heat capacity, and guest-host interactions. Evaluating over 20 models from various architectures on a chemically and structurally diverse materials set, we find that top-performing uMLIPs consistently outperform classical force fields and fine-tuned machine learning potentials across all tasks, demonstrating their readiness for deployment in nanoporous materials modeling. Our analysis highlights that data quality, particularly the diversity of training sets and inclusion of out-of-equilibrium conformations, plays a more critical role than model architecture in determining performance across all evaluated uMLIPs. We release our modular and extendable benchmarking framework at this https URL, providing an open resource to guide adoption in nanoporous materials modeling and further development of uMLIPs.
[LG-64] Inference on Optimal Policy Values and Other Irregular Functionals via Smoothing
链接: https://arxiv.org/abs/2507.11780
作者: Justin Whitehouse,Morgane Austern,Vasilis Syrgkanis
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 40 pages, 2 figures
点击查看摘要
Abstract:Constructing confidence intervals for the value of an optimal treatment policy is an important problem in causal inference. Insight into the optimal policy value can guide the development of reward-maximizing, individualized treatment regimes. However, because the functional that defines the optimal value is non-differentiable, standard semi-parametric approaches for performing inference fail to be directly applicable. Existing approaches for handling this non-differentiability fall roughly into two camps. In one camp are estimators based on constructing smooth approximations of the optimal value. These approaches are computationally lightweight, but typically place unrealistic parametric assumptions on outcome regressions. In another camp are approaches that directly de-bias the non-smooth objective. These approaches don’t place parametric assumptions on nuisance functions, but they either require the computation of intractably-many nuisance estimates, assume unrealistic L^\infty nuisance convergence rates, or make strong margin assumptions that prohibit non-response to a treatment. In this paper, we revisit the problem of constructing smooth approximations of non-differentiable functionals. By carefully controlling first-order bias and second-order remainders, we show that a softmax smoothing-based estimator can be used to estimate parameters that are specified as a maximum of scores involving nuisance components. In particular, this includes the value of the optimal treatment policy as a special case. Our estimator obtains \sqrt{n} convergence rates, avoids parametric restrictions/unrealistic margin assumptions, and is often statistically efficient.
[LG-65] LLMs are Bayesian in Expectation not in Realization
链接: https://arxiv.org/abs/2507.11768
作者: Leon Chlon,Sarah Rashidi,Zein Khamis,MarcAntonio M. Awada
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order \Theta(\log n / n); (2) transformers achieve information-theoretic optimality with excess risk O(n^{-1/2}) in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as k^* = \Theta(\sqrt{n}\log(1/\varepsilon)) with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
信息检索
[IR-0] An Ecosystem for Ontology Interoperability
链接: https://arxiv.org/abs/2507.12311
作者: Zhangcheng Qiang
类目: Information Retrieval (cs.IR)
*备注: 4 pages, 8 figures
点击查看摘要
Abstract:Ontology interoperability is one of the complicated issues that restricts the use of ontologies in knowledge graphs (KGs). Different ontologies with conflicting and overlapping concepts make it difficult to design, develop, and deploy an interoperable ontology for downstream tasks. We propose an ecosystem for ontology interoperability. The ecosystem employs three state-of-the-art semantic techniques in different phases of the ontology engineering life cycle: ontology design patterns (ODPs) in the design phase, ontology matching and versioning (OM&OV) in the develop phase, and ontology-compliant knowledge graphs (OCKGs) in the deploy phase, to achieve better ontology interoperability in real-world applications. A case study in the building domain validates the usefulness of the proposed ecosystem.
[IR-1] SIEVE: Effective Filtered Vector Search with Collection of Indexes
链接: https://arxiv.org/abs/2507.11907
作者: Zhaoheng Li,Silu Huang,Wei Ding,Yongjoo Park,Jianjun Chen
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Many real-world tasks such as recommending videos with the kids tag can be reduced to finding most similar vectors associated with hard predicates. This task, filtered vector search, is challenging as prior state-of-the-art graph-based (unfiltered) similarity search techniques quickly degenerate when hard constraints are considered. That is, effective graph-based filtered similarity search relies on sufficient connectivity for reaching the most similar items within just a few hops. To consider predicates, recent works propose modifying graph traversal to visit only the items that may satisfy predicates. However, they fail to offer the just-a-few-hops property for a wide range of predicates: they must restrict predicates significantly or lose efficiency if only a small fraction of items satisfy predicates. We propose an opposite approach: instead of constraining traversal, we build many indexes each serving different predicate forms. For effective construction, we devise a three-dimensional analytical model capturing relationships among index size, search time, and recall, with which we follow a workload-aware approach to pack as many useful indexes as possible into a collection. At query time, the analytical model is employed yet again to discern the one that offers the fastest search at a given recall. We show superior performance and support on datasets with varying selectivities and forms: our approach achieves up to 8.06x speedup while having as low as 1% build time versus other indexes, with less than 2.15x memory of a standard HNSW graph and modest knowledge of past workloads.
[IR-2] Context-Aware Search and Retrieval Over Erasure Channels
链接: https://arxiv.org/abs/2507.11894
作者: Sara Ghasvarianjahromi,Yauhen Yakimenka,Jörg Kliewer
类目: Information Retrieval (cs.IR); Information Theory (cs.IT)
*备注:
点击查看摘要
Abstract:This paper introduces and analyzes a search and retrieval model that adopts key semantic communication principles from retrieval-augmented generation. We specifically present an information-theoretic analysis of a remote document retrieval system operating over a symbol erasure channel. The proposed model encodes the feature vector of a query, derived from term-frequency weights of a language corpus by using a repetition code with an adaptive rate dependent on the contextual importance of the terms. At the decoder, we select between two documents based on the contextual closeness of the recovered query. By leveraging a jointly Gaussian approximation for both the true and reconstructed similarity scores, we derive an explicit expression for the retrieval error probability, i.e., the probability under which the less similar document is selected. Numerical simulations on synthetic and real-world data (Google NQ) confirm the validity of the analysis. They further demonstrate that assigning greater redundancy to critical features effectively reduces the error rate, highlighting the effectiveness of semantic-aware feature encoding in error-prone communication settings.
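To make the adaptive-rate repetition idea tangible, the toy simulation below repeats each query weight a number of times proportional to its importance, passes the copies through a symbol erasure channel, and then picks the closer of two candidate documents by cosine similarity (the rate rule, channel parameter, and similarity measure are illustrative choices, not the paper's exact scheme).

```python
import numpy as np

def transmit_query(weights, importance, max_reps=5, erasure_p=0.6, seed=0):
    """Repetition-coded transmission of a query's term weights over a symbol
    erasure channel: more important terms get more repetitions, so they are
    more likely to survive erasures."""
    rng = np.random.default_rng(seed)
    recovered = np.zeros_like(weights)
    for i, (w, imp) in enumerate(zip(weights, importance)):
        reps = 1 + int(round((max_reps - 1) * imp))       # adaptive repetition rate
        copies = np.where(rng.random(reps) < erasure_p, np.nan, w)
        if not np.all(np.isnan(copies)):                   # at least one copy survived
            recovered[i] = np.nanmean(copies)
    return recovered

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(1)
query = rng.random(20)
importance = query / query.max()                           # contextual importance proxy
doc_a, doc_b = query + 0.1 * rng.random(20), rng.random(20)
received = transmit_query(query, importance)
print("pick A" if cosine(received, doc_a) > cosine(received, doc_b) else "pick B")
```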
[IR-3] Similarity-Guided Diffusion for Contrastive Sequential Recommendation
链接: https://arxiv.org/abs/2507.11866
作者: Jinkyeong Choi,Yejin Noh,Donghyeon Park
类目: Information Retrieval (cs.IR)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:In sequential recommendation systems, data augmentation and contrastive learning techniques have recently been introduced using diffusion models to achieve robust representation learning. However, most of the existing approaches use random augmentation, which risks damaging the contextual information of the original sequence. Accordingly, we propose Similarity-Guided Diffusion for Contrastive Sequential Recommendation (SimDiffRec). Our method leverages the similarity between item embedding vectors to generate semantically consistent noise. Moreover, we utilize high-confidence scores in the denoising process to select augmentation positions. This approach more effectively reflects contextual and structural information compared to augmentation at random positions. From a contrastive learning perspective, the proposed augmentation technique provides more discriminative positive and negative samples, simultaneously improving training efficiency and recommendation performance. Experimental results on five benchmark datasets show that SimDiffRec outperforms the existing baseline models.
附件下载
点击下载今日全部论文列表