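This post lists the latest papers retrieved from arXiv.org on 2025-07-10. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 every morning.
Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-07-10)
456 papers were updated today, including:
- Natural language processing: 58 papers (Computation and Language (cs.CL))
- Artificial intelligence: 122 papers (Artificial Intelligence (cs.AI))
- Computer vision: 106 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 140 papers (Machine Learning (cs.LG))
Natural Language Processing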
[NLP-0] Discrete Diffusion Models for Language Generation
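Quick read: This paper examines the feasibility and performance of applying diffusion models to the generation of discrete data, natural language in particular. Its core approach is to evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) against traditional autoregressive (AR) language models on generation quality and efficiency, using Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and batch processing speed, in order to explore the potential and limitations of diffusion models for discrete data.
Link: https://arxiv.org/abs/2507.07050
Authors: Ashen Weligalle
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: pdfLaTeX, 69 pages with 21 figures, Licentiate Thesis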
Abstract:Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that gradually transforms structured data into a Gaussian-like distribution, followed by a learned reverse process to reconstruct the data. While successful in continuous modalities, applying this framework to discrete data-particularly natural language-remains challenging due to token dependency complexities and the lack of a defined generation this http URL thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. Specifically, we evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compare it with traditional autoregressive (AR) language models. To assess generative performance, we use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. Results show the best-performing D3PM model achieves a BPT of 5.72, with a mean of 8.05. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per sec., indicating potential for parallel this http URL evaluations were conducted under consistent conditions-generating 100,000 tokens per model with a fixed batch size of four-for fair comparison. This research presents a detailed analysis of diffusion-based vs. autoregressive models, highlighting trade-offs in generative quality and efficiency. Findings emphasize both the promise and limitations of diffusion models for discrete data, supporting future work in non-autoregressive language generation. 
MSC classes: 68T50 (Primary); 68Q32, 60J27 (Secondary). ACM classes: G.3. Report number: LIU-IDA/STAT-A–25/008–SE. Cite as: arXiv:2507.07050 [cs.CL] (or arXiv:2507.07050v1 [cs.CL] for this version). DOI: https://doi.org/10.48550/arXiv.2507.07050
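The three likelihood metrics named in this abstract are closely related, which makes the reported BPT figures easier to read. The sketch below assumes the usual definitions (mean per-token NLL measured in nats, BPT as the same quantity in bits, and perplexity as 2 raised to BPT); the thesis may define or average them differently, and the function names are illustrative only.

```python
import math

def bits_per_token(nll_nats: float) -> float:
    """Convert a mean per-token NLL in nats to bits per token."""
    return nll_nats / math.log(2)

def perplexity_from_bpt(bpt: float) -> float:
    """Perplexity under the assumed definition PPL = 2 ** BPT."""
    return 2.0 ** bpt

# BPT values quoted in the abstract; the PPL values printed here are
# derived under the assumption above, not taken from the thesis.
for name, bpt in [("D3PM best", 5.72), ("D3PM mean", 8.05), ("AR mean", 4.59)]:
    print(f"{name}: BPT={bpt:.2f}  PPL~{perplexity_from_bpt(bpt):.1f}")
```

Because perplexity grows exponentially in BPT, a gap of a few bits per token corresponds to a large gap in perplexity.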
[NLP-1] UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations ACL2025
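Quick read: This paper addresses a limitation of existing conversational search systems built from separate models: the separation prevents the system from drawing on the models' intrinsic knowledge simultaneously, so retrieval is not guaranteed to benefit generation. The key to its solution is a unified model that is jointly fine-tuned with different objectives, together with two mechanisms that reduce inconsistency risks and mitigate data discrepancy, so that dense retrieval and response generation improve each other.
Link: https://arxiv.org/abs/2507.07030
Authors: Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
Affiliations: University of Montreal; Amazon.com; University of Amsterdam; Renmin University of China; University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 (main)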
Abstract:The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
[NLP-2] FlexOlmo: Open Language Models for Flexible Data Use
Quick read: This paper addresses how to train and improve language models on sensitive or protected data in regulated industries when data owners cannot share that data. The key is FlexOlmo, a language model based on a mixture-of-experts (MoE) architecture whose core innovations are distributed training without data sharing and flexible control over data use at inference time. Each expert is trained independently on a closed dataset and later integrated through a new domain-informed routing mechanism without any joint training, which improves model performance while keeping data local.
Link: https://arxiv.org/abs/2507.07024
Authors: Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
Affiliations: Allen Institute for AI; University of Washington; University of California, Berkeley; Stanford University; MIT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)
Abstract:We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners’ preferences by keeping their data local and supporting fine-grained control of data access during inference.
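The "data-flexible inference" idea, where an expert and its associated data can be excluded without retraining, can be illustrated with a masked softmax over router scores. This is a generic sketch and not FlexOlmo's domain-informed routing, which the abstract does not describe in detail; the function name and the example scores are hypothetical.

```python
import math

def route_with_opt_out(router_logits, active):
    """Softmax over experts that gives zero weight to opted-out experts.

    router_logits: one score per expert (hypothetical router output).
    active: one bool per expert; False means its data owner opted out.
    """
    kept = [l for l, a in zip(router_logits, active) if a]
    if not kept:
        raise ValueError("at least one expert must stay active")
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if a else 0.0 for l, a in zip(router_logits, active)]
    total = sum(exps)
    return [e / total for e in exps]

# Three experts; the second one is excluded at inference time.
weights = route_with_opt_out([1.0, 2.0, 0.5], [True, False, True])
print(weights)
```

The excluded expert contributes nothing to the output, and the remaining weights are renormalized so they still sum to one.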
[NLP-3] Learning Deliberately Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs
Quick read: This paper addresses insufficient modality alignment and high training cost when multimodal large language models (MLLMs) perform multimodal reasoning on complex tasks. The key is the Deliberate-to-Intuitive reasoning framework (D2I), which strengthens modality alignment using only a rule-based format reward, with no extra data annotation or complex reward design. During training the model follows deliberate reasoning strategies; at evaluation time it switches to an intuitive reasoning style, improving reasoning ability without raising training cost.
Link: https://arxiv.org/abs/2507.06999
Authors: Yahan Yu, Yuyang Dong, Masafumi Oyamada
Affiliations: Kyoto University; NEC Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress
Abstract:Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to intuitive, which removes deliberate reasoning strategies during training and implicitly reflects the model’s acquired abilities in the response. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.
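A rule-based format reward of the kind the abstract mentions checks only whether a response follows a required layout, not whether its content is correct. The sketch below is an assumption-laden illustration: the `<think>`/`<answer>` template is hypothetical, since the abstract does not specify the format D2I enforces.

```python
import re

# Hypothetical response template; the abstract does not specify the tags.
PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

def format_reward(response: str) -> float:
    """Return 1.0 if the response matches the template, else 0.0."""
    return 1.0 if PATTERN.match(response.strip()) else 0.0

print(format_reward("<think>compare the two bars</think><answer>B</answer>"))
print(format_reward("B"))
```

Such a reward needs no answer labels, which is consistent with the abstract's claim of avoiding extra annotation.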
[NLP-4] FRaN-X: FRaming and Narratives-eXplorer EMNLP2025
Quick read: This paper addresses automatically detecting entity mentions in raw text and classifying their narrative roles, revealing how entities are portrayed as protagonists, antagonists, or innocents. The key is the two-stage architecture of the FRaN-X system, which combines sequence labeling with fine-grained role classification and uses a unique taxonomy of 22 fine-grained roles.
Link: https://arxiv.org/abs/2507.06974
Authors: Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal, Ming Shan Hee, Dhruv Sahnan, Nikolaos Nikolaidis, Purificação Silvano, Dimitar Dimitrov, Roman Yangarber, Ricardo Campos, Alípio Jorge, Nuno Guimarães, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
Affiliations: MBZUAI; Nazarbayev University; University of Maryland; University of Arizona; Athens University of Economics and Business; University of Porto; Sofia University “St. Kliment Ohridski”; University of Helsinki; Beira Interior; INESC TEC; Porto; University of Padova; European Commission Joint Research Center; Polish Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 13 figures, submitted to EMNLP 2025 - Demo Track
Abstract:We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity’s role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at this https URL and a video demonstration is available at this https URL.
[NLP-5] Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report
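Quick read: This paper addresses the limited “coverage” (task types and knowledge areas) and “depth” (instruction complexity) of current instruction datasets, which leave fine-tuned models struggling with complex instructions and rare-domain tasks. The key is a systematic framework for constructing instruction data that integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and model deficiency diagnosis with targeted data generation, forming an iterative closed loop that continuously expands coverage and depth.
Link: https://arxiv.org/abs/2507.06968
Authors: Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan
Affiliations: BAAI (Beijing Academy of Artificial Intelligence)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: (none)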
Abstract:Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both “coverage” (coverage of task types and knowledge areas) and “depth” (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
[NLP-6] Investigating the Robustness of Retrieval-Augmented Generation at the Query Level ACL2025
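Quick read: This paper examines how robust retrieval-augmented generation (RAG) is to variations in the input query. RAG is meant to avoid the high cost of updating large language models (LLMs) by incorporating external knowledge at inference time, but accurate retrieval depends heavily on query quality. The paper analyzes how sensitive each component of the RAG pipeline is to different types of query perturbation, both in isolation and end to end, and proposes an evaluation framework for systematically assessing query-level robustness.
Link: https://arxiv.org/abs/2507.06956
Authors: Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
Affiliations: Technical University of Munich; Intel Labs; Thoughtworks
Categories: Computation and Language (cs.CL)
Comments: Accepted to Generation, Evaluation Metrics (GEM) Workshop at ACL 2025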
Abstract:Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.
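To make "query-level robustness" concrete, one simple measurement is how much of the top-k retrieved set survives a small perturbation of the query. The sketch below is a toy illustration and not the paper's evaluation framework: the word-overlap retriever, the character-swap perturbation, and the corpus are all hypothetical stand-ins.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by number of shared lowercase words."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def swap_typo(query, index):
    """Swap the characters at index and index + 1 to mimic a typo."""
    chars = list(query)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)

def overlap_at_k(original, perturbed, corpus, k=2):
    """Fraction of the top-k results shared by both query versions."""
    a = set(retrieve(original, corpus, k))
    b = set(retrieve(perturbed, corpus, k))
    return len(a & b) / k

corpus = [
    "the heart pumps blood",
    "insulin regulates blood sugar",
    "paris is the capital of france",
]
print(overlap_at_k("insulin", swap_typo("insulin", 0), corpus, k=1))
print(overlap_at_k("insulin sugar", swap_typo("insulin sugar", 0), corpus, k=1))
```

In this toy setup a single typo breaks the one-word query, while the two-word query still retrieves the same document because the unperturbed word carries it.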
[NLP-7] Rethinking Verification for LLM Code Generation: From Generation to Testing
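Quick read: This paper addresses the fact that current code-generation benchmarks contain only a small number of homogeneous test cases, so subtle faults go undetected; this artificially inflates measured performance and undermines accurate reward estimation in reinforcement learning with verifiable rewards (RLVR). The key is a set of multi-dimensional metrics that quantify test-suite thoroughness, together with SAGA, a human-LLM collaborative method that combines human programming expertise with LLM reasoning to raise the coverage and quality of generated test cases. The authors also build TCGBench to support research on test-case generation.
Link: https://arxiv.org/abs/2507.06920
Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Affiliations: Shanghai AI Laboratory; School of Computer Science and Technology, Xi’an Jiaotong University; MOE KLINNS Lab, Xi’an Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: (none)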
Abstract:Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
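[NLP-8] Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues
Quick read: This paper addresses how to predict tutor strategy in dialogues and its effect on student outcomes, a question made pressing by the growth of online learning and the emerging tutoring abilities of AI agents. It uses modern LLMs, in particular Llama 3 and GPT-4o, on two math tutoring dialogue datasets to evaluate how well they predict future tutor moves and student outcomes. Even state-of-the-art LLMs struggle to predict tutor strategy, although tutor strategy is highly indicative of student outcomes, which points to a need for more powerful methods.
Link: https://arxiv.org/abs/2507.06910
Authors: Fareya Ikram, Alexander Scarlatos, Andrew Lan
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Published in BEA 2025: 20th Workshop on Innovative Use of NLP for Building Educational Applications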
Abstract:Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.
[NLP-9] MultiJustice: A Chinese Dataset for Multi-Party Multi-Charge Legal Prediction NLPCC2025
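Quick read: This paper addresses whether multiple defendants and charges should be treated separately in legal judgment prediction (LJP). The key is a new dataset, multi-person multi-charge prediction (MPMCP), on which several prevailing legal large language models (LLMs) are evaluated across four practical judgment scenarios to examine the challenges of each scenario and the differences in model performance.
Link: https://arxiv.org/abs/2507.06909
Authors: Xiao Wang, Jiahuan Pei, Diancheng Shui, Zhiguang Han, Xin Sun, Dawei Zhu, Xiaoyu Shen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by NLPCC 2025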
Abstract:Legal judgment prediction offers a compelling method to aid legal practitioners and researchers. However, the research question remains relatively under-explored: Should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset namely multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at this https URL.
[NLP-10] MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection ACL2025
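Quick read: This paper addresses harmful meme detection, where traditional data-driven methods struggle with newly emerging memes because memes evolve and up-to-date annotated data is lacking. The key is the MIND framework, built on three strategies: retrieving similar memes from an unannotated reference set to provide context, a bi-directional insight derivation mechanism for a comprehensive understanding of those similar memes, and a multi-agent debate mechanism that makes decisions robust through reasoned arbitration.
Link: https://arxiv.org/abs/2507.06908
Authors: Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025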
Abstract:The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at this https URL.
[NLP-11] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
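Quick read: This paper addresses the backdoor-attack risk facing GUI agents powered by large vision-language models (LVLMs), in particular the vulnerability introduced when the agent visually grounds textual plans to GUI elements. The key is VisualTrap, a method that injects poisoned data during the pre-training of visual grounding so that the agent maps textual plans to trigger locations instead of the intended targets, yielding a stealthy and effective backdoor attack.
Link: https://arxiv.org/abs/2507.06899
Authors: Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
Affiliations: University of Science and Technology of China; National University of Singapore; East China University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)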
Abstract:Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents (mapping textual plans to GUI elements) can introduce vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack targeting visual grounding, the agent’s behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.
[NLP-12] SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
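Quick read: This paper addresses efficient knowledge graph (KG) enrichment under low supervision, focusing on relation extraction (RE). The key is SCoRE, a modular and cost-effective sentence-level RE system that requires no fine-tuning of pre-trained large language models (PLMs), allows easy PLM switching, and adapts to different corpora and KGs. SCoRE combines supervised contrastive learning with a Bayesian k-nearest-neighbors (kNN) classifier for multi-label classification and remains robust to the noisy annotations of distantly supervised corpora.
Link: https://arxiv.org/abs/2507.06895
Authors: Luca Mariotti, Veronica Guidetti, Federica Mandreoli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: (none)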
Abstract:The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE’s minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.
[NLP-13] Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights
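Quick read: This paper addresses the challenges of implementing and maintaining evaluations of large language models, in particular those met when building and managing an open-source repository of AI evaluations. Its solutions include (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Together these show that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.
Link: https://arxiv.org/abs/2507.06893
Authors: Alexandra Abbas, Celia Waggoner, Justin Olive
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)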
Abstract:AI evaluations have become critical tools for assessing large language model capabilities and safety. This paper presents practical insights from eight months of maintaining inspect_evals , an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions including: (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Our analysis reveals that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.
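As a rough illustration of "cross-model comparison with uncertainty quantification", the sketch below reports a mean score with a normal-approximation 95% confidence interval and applies the same calculation to paired per-item differences between two models. This is a textbook approach chosen for illustration, not the methodology of the paper, and the per-item scores are made up.

```python
import math
import statistics

def mean_with_ci(scores, z=1.96):
    """Mean and normal-approximation confidence interval of per-item scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, (mean - z * se, mean + z * se)

def paired_difference(scores_a, scores_b, z=1.96):
    """Mean difference (A minus B) on the same items, with its interval."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean_with_ci(diffs, z)

# Hypothetical 0/1 correctness of two models on the same four items.
model_a = [1, 1, 1, 0]
model_b = [0, 1, 0, 0]
diff, (low, high) = paired_difference(model_a, model_b)
print(f"difference={diff:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With so few items the interval spans zero, so the apparent gap between the two models would not be conclusive; more items or repeated samples narrow the interval.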
[NLP-14] Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
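Quick read: This paper addresses the low data efficiency of existing reinforcement finetuning (RFT) methods: because they are on-policy in nature, past data is not fully utilized, which costs substantial compute and time. The key is to bring back off-policy reinforcement learning through a general approach, Reincarnating Mix-policy Proximal Policy Gradient (ReMix), with three core components: (1) mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) a KL-Convex policy constraint to balance stability and flexibility; (3) policy reincarnation for a seamless transition from efficient early-stage learning to steady asymptotic improvement.
Link: https://arxiv.org/abs/2507.06892
Authors: Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preliminary version. Project page: this https URL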
Abstract:Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.
[NLP-15] Shifting from Ranking to Set Selection for Retrieval Augmented Generation ACL2025
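Quick read: This paper addresses set-level relevance in retrieval for retrieval-augmented generation (RAG): retrieved passages must be not only individually relevant but also collectively form a comprehensive set. Existing methods mainly rerank the top-k candidates by individual relevance and often fail to meet the information needs of complex queries in multi-hop question answering. The key is SETR, a set-wise passage selection approach that explicitly identifies a query's information requirements through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy them.
Link: https://arxiv.org/abs/2507.06838
Authors: Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Affiliations: LG AI Research; University of Illinois Chicago
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted to ACL 2025 Oral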
Abstract:Retrieval in Retrieval-Augmented Generation(RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at this https URL
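The difference between ranking passages individually and selecting them as a set can be shown with a small example. The sketch below swaps in a greedy set-cover heuristic over hand-labeled requirements; it is not SETR, which identifies requirements with an LLM through Chain-of-Thought reasoning. The passage scores and `covers` labels are hypothetical.

```python
def top_k_by_score(passages, k):
    """Baseline: keep the k passages with the highest individual score."""
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:k]

def greedy_set_selection(passages, requirements, k):
    """Greedily pick passages that cover the most still-uncovered requirements."""
    uncovered = set(requirements)
    chosen, pool = [], list(passages)
    while uncovered and pool and len(chosen) < k:
        # Prefer new coverage first, then individual relevance as a tie-break.
        best = max(pool, key=lambda p: (len(p["covers"] & uncovered), p["score"]))
        if not best["covers"] & uncovered:
            break  # nothing left in the pool adds new information
        chosen.append(best)
        pool.remove(best)
        uncovered -= best["covers"]
    return chosen

# A two-hop question needs both a director and that director's birthplace.
passages = [
    {"id": "p1", "score": 0.9, "covers": {"director"}},
    {"id": "p2", "score": 0.8, "covers": {"director"}},
    {"id": "p3", "score": 0.4, "covers": {"birthplace"}},
]
requirements = {"director", "birthplace"}
print([p["id"] for p in top_k_by_score(passages, 2)])
print([p["id"] for p in greedy_set_selection(passages, requirements, 2)])
```

Top-k ranking keeps two redundant passages about the director, while the set-wise selection trades the second one for the lower-scored passage that covers the missing hop.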
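[NLP-16] Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework
[Quick read]: This paper addresses the inefficiency and lack of coordination in conventional inference-time scaling. Sequential reasoning relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff, while parallel reasoning lacks coordination among branches and needs intrusive fine-tuning to work well. To tackle this, the paper proposes a flexible test-time collaborative inference framework that combines the strengths of sequential and parallel reasoning. The key to the solution is an efficient and accurate intrinsic quality metric for model responses during collaborative inference, which enables dynamic control and early termination of the reasoning trace. That metric is semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality.
Link: https://arxiv.org/abs/2507.06829
Authors: Zenan Xu,Zexuan Qiu,Guanhua Huang,Kun Li,Siheng Li,Chenchen Zhang,Kejiao Li,Qi Yi,Yuhao Jiang,Bo Zhou,Fengzong Lian,Zhanhui Kang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures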
Abstract:Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy…
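As a rough illustration of how semantic entropy can drive early termination, the sketch below groups parallel responses into meaning clusters and computes the Shannon entropy of the cluster frequencies. This is an assumption-laden toy, not the paper's implementation: the `canonicalize` step (plain string normalisation) stands in for whatever semantic-equivalence check the authors use, and the `threshold` value is hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(responses, canonicalize=lambda s: s.strip().lower()):
    """Shannon entropy (in nats) over clusters of semantically equal responses.

    `canonicalize` is a placeholder for a real semantic-equivalence check;
    here two responses share a cluster if their normalised strings match.
    """
    clusters = Counter(canonicalize(r) for r in responses)
    n = sum(clusters.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in clusters.values())


def should_stop(responses, threshold=0.6):
    """Stop another reasoning round once the parallel answers mostly agree."""
    return semantic_entropy(responses) <= threshold


# Four parallel branches; three agree after normalisation.
round_answers = ["42", "42 ", "41", "42"]
se = semantic_entropy(round_answers)
stop = should_stop(round_answers)
```

Low entropy means the branches converge on one meaning, which the abstract reports as strongly associated with higher accuracy; high entropy suggests running another round.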
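[NLP-17] Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
[Quick read]: This paper addresses the low efficiency of designing and deploying engineering dynamical systems. The key to the solution is to exploit domain and expert knowledge, using natural language processing (NLP) strategies and large language models (LLMs) to automatically generate computational models of dynamical systems from relevant documents. At its core, the method uses System Modeling Language (SysML) diagrams to extract the dependencies, attributes, and operations of components, and combines NLP and LLMs to improve the generation of the SysML diagrams, achieving an automated conversion from text to model.
Link: https://arxiv.org/abs/2507.06803
Authors: Matthew Anderson Hendricks,Alice Cicirello
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: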
Abstract:This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of a dynamical system computational model, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses System Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.
[NLP-18] Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining: Method Evaluation and Applications
[Quick read]: This paper addresses the lack of infrastructure that enterprises face when deploying large language models (LLMs), together with the performance limitations of small LLMs (sLLMs). The key to the solution is to validate a recipe based on Domain Adaptive Continual Pretraining (DACP) across a range of foundation models and business domains, improving sLLM performance in the target domain while preserving general capabilities, and thereby offering enterprises a cost-efficient and scalable deployment option.
Link: https://arxiv.org/abs/2507.06795
Authors: Seonwu Kim,Yohan Na,Kihun Kim,Hanhee Cho,Geun Lim,Mintae Kim,Seongik Park,Ki Hyun Kim,Youngsub Han,Byoung-Ki Jeon
Affiliations: LG Uplus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: under review
Abstract:The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
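[NLP-19] Checklist Engineering Empowers Multilingual LLM Judges
[Quick read]: This paper addresses automated text evaluation in multilingual settings, in particular the cost, time, and efficiency concerns that arise when large language models (LLMs) are used as evaluators. The key to the solution is a training-free framework, Checklist Engineering based LLM-as-a-Judge (CE-Judge), which uses checklist intuition for multilingual evaluation and relies on an open-source model to reduce dependence on proprietary systems and improve scalability.
Link: https://arxiv.org/abs/2507.06774
Authors: Mohammad Ghiasvand Mohammadkhani,Hamid Beigy
Affiliations: Amirkabir University of Technology; Sharif University of Technology
Categories: Computation and Language (cs.CL)
Comments: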
Abstract:Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators-a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.
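[NLP-20] KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution
[Quick read]: This paper targets three concrete sentence classification tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. The key to the solution is the first application of Kolmogorov-Arnold Convolution for Text (KAConvText), combined with different embedding configurations (including static and fine-tuned fastText embeddings) and with MLP and KAN classification heads, to improve both model performance and interpretability.
Link: https://arxiv.org/abs/2507.06753
Authors: Ye Kyaw Thu,Thura Aung,Thazin Myint Oo,Thepchai Supnithi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 4 tables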
Abstract:This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random to fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigated KAConvText with different classification heads - MLP and KAN, where using KAN head supports enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.
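[NLP-21] Civil Society in the Loop: Feedback-Driven Adaptation of (L)LM-Assisted Classification in an Open-Source Telegram Monitoring Tool
[Quick read]: This paper asks how civil society organizations (CSOs) can play an active role in developing AI-based open-source monitoring tools rather than acting only as "consumers". Platform providers are reducing their investment in content moderation, and AI tools have potential for detecting harmful content at scale, yet few open-source tools integrate AI models with social media monitoring infrastructure. The key solution proposed is collaboration with CSO stakeholders, so that they take an active part in tool design, feedback, model improvement, and ensuring alignment with stakeholder needs and values.
Link: https://arxiv.org/abs/2507.06734
Authors: Milena Pustet,Elisabeth Steffen,Helena Mihaljević,Grischa Stanjek,Yannis Illies
Affiliations: HTW Berlin; democ
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: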
Abstract:The role of civil society organizations (CSOs) in monitoring harmful online content is increasingly crucial, especially as platform providers reduce their investment in content moderation. AI tools can assist in detecting and monitoring harmful content at scale. However, few open-source tools offer seamless integration of AI models and social media monitoring infrastructures. Given their thematic expertise and contextual understanding of harmful content, CSOs should be active partners in co-developing technological tools, providing feedback, helping to improve models, and ensuring alignment with stakeholder needs and values, rather than as passive ‘consumers’. However, collaborations between the open source community, academia, and civil society remain rare, and research on harmful content seldom translates into practical tools usable by civil society actors. This work in progress explores how CSOs can be meaningfully involved in an AI-assisted open-source monitoring tool of anti-democratic movements on Telegram, which we are currently developing in collaboration with CSO stakeholders.
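[NLP-22] On the Effect of Uncertainty on Layer-wise Inference Dynamics ICML2025
[Quick read]: This paper addresses how large language models (LLMs) internally represent and process uncertainty in their predictions, and how that information can be used to detect uncertainty and prevent hallucinations. The key to the approach is analysing how output token probabilities change across layers to reveal whether uncertainty affects the inference process. Using the Tuned Lens (a variant of the Logit Lens) on 11 datasets and 5 models, the study finds that the probability trajectories of certain and uncertain predictions show similar characteristics across many layers, suggesting that uncertainty may not significantly affect inference dynamics. This finding challenges the feasibility of simple methods for detecting uncertainty at inference time and shows how interpretability methods can be used to study the effect of uncertainty on inference.
Link: https://arxiv.org/abs/2507.06722
Authors: Sunwoo Kim,Haneul Yoo,Alice Oh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to Actionable Interpretability Workshop - ICML 2025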
Abstract:Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it is underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Using incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions that both observe abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.
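[NLP-23] CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs
[Quick read]: This paper targets two key problems in clinical text generation: patient data is highly unstructured, heterogeneous, and scattered across many note types, and clinical notes are usually long and semantically dense, so conventional prompting is infeasible because of context length limits and the risk of omitting clinically relevant information. The key to the solution is the CLI-RAG (Clinically Informed Retrieval-Augmented Generation) framework, which introduces a hierarchical chunking strategy that respects clinical document structure and combines it with a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types and the local stage extracts high-value content, improving relevance at both the document and section levels.
Link: https://arxiv.org/abs/2507.06715
Authors: Garapati Keerthana,Manik Gupta
Affiliations: Birla Institute of Technology and Science, Pilani, Hyderabad, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures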
Abstract:Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.
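[NLP-24] Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models
[Quick read]: This paper addresses how to quantify polarization among political elites. The key to the solution is using generative AI for actor and subject detection: it identifies when members of parliament mention one another in parliamentary speeches, analyses the relationship between the speaker and the person mentioned along with the emotional tone, and builds an index of mutual out-party hostility as a measure of elite polarization.
Link: https://arxiv.org/abs/2507.06658
Authors: Gennadii Iakovlev
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: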
Abstract:This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. I obtain the results that can be aggregated by party and quarter. The resulting index demonstrates a good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and to parties losing and assuming power.
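[NLP-25] Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review
[Quick read]: This paper addresses the resource cost and low efficiency of the data extraction stage of systematic reviews. The key to the solution is using large language models (LLMs) prompted with the review protocol to speed up data extraction. The study applies two protocol-based approaches to extract data from 10 evidence sources in a case study scoping review and evaluates their performance, finding high accuracy for simple, well-defined citation details but poor results for complex, subjective data items. It also notes that LLM feedback helped with protocol adaptation, but that more rigorous performance evaluation is needed across a range of LLMs and review contexts.
Link: https://arxiv.org/abs/2507.06623
Authors: James Stewart-Evans,Emma Wilson,Tessa Langley,Andrew Prayle,Angela Hands,Karen Exley,Jo Leonardi-Bee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 44 pages, 4 figures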
Abstract:The data extraction stages of reviews are resource-intensive, and researchers may seek to expedite data extraction using online large language models (LLMs) and review protocols. Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. Limited performance evaluation was undertaken, which found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision 90% but low recall (25%) and F1 scores (40%). The context of a complex scoping review, open response types and methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: four of 15 (26.7%) to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.
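The precision, recall, and F1 figures in this abstract are linked by the standard definition of F1 as the harmonic mean of precision and recall. The sketch below works that arithmetic through, taking the digest's figures (precision 90%, recall 25%) at face value. The counts `tp`, `fp`, and `fn` are hypothetical values chosen only to reproduce those two rates; they are not from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 9 correct extractions, 1 wrong, 27 missed items.
p, r = precision_recall(tp=9, fp=1, fn=27)   # p = 0.90, r = 0.25
f1 = f1_score(p, r)                          # about 0.39
```

An F1 of roughly 0.39 is in line with the reported F1 of about 40%, and shows how low recall dominates the score even when almost everything extracted is correct.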
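[NLP-26] FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation
[Quick read]: This paper addresses the problem that the high-dimensional, computationally expensive embeddings produced by large language models (LLMs) are either too generic or too inefficient for domain-specific applications. The key to the solution is FuDoBa, a Bayesian optimisation-based method that fuses LLM embeddings with domain-specific structured knowledge from local and external knowledge bases (such as WikiData). This yields low-dimensional, task-relevant representations while reducing training complexity and producing interpretable early-fusion weights that improve classification performance.
Link: https://arxiv.org/abs/2507.06622
Authors: Boshko Koloski,Senja Pollak,Roberto Navigli,Blaž Škrlj
Affiliations: Institute Jožef Stefan; University of Rome “La Sapienza”
Categories: Computation and Language (cs.CL)
Comments: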
Abstract:Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.
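[NLP-27] Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
[Quick read]: This paper targets the efficiency bottleneck of conventional Transformer architectures in sequence modeling and explores the potential benefit of representation sharing in state space models (SSMs). The key to the solution is the Gated Memory Unit (GMU): by sharing memory readout states from a Samba-based self-decoder in the cross-decoder, it provides an efficient memory-sharing mechanism that improves decoding efficiency and strengthens long-context performance.
Link: https://arxiv.org/abs/2507.06607
Authors: Liliang Ren,Congcong Chen,Haoran Xu,Young Jin Kim,Adam Atkinson,Zheng Zhan,Jiankai Sun,Baolin Peng,Liyuan Liu,Shuohang Wang,Hao Cheng,Jianfeng Gao,Weizhu Chen,Yelong Shen
Affiliations: Microsoft; Stanford University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: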
Abstract:Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at this https URL.
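[NLP-28] Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis
[Quick read]: This paper targets the lack of structured knowledge and multimodal generation capability in food-domain question answering (QA). The key to the solution is a unified framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. The MMKG integrates 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images, and 40,000 QA pairs are generated from 40 templates with LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large raises BERTScore, lowers FID, and improves CLIP alignment, while CLIP-based mismatch detection and LLaVA-driven hallucination checks guard factual and visual fidelity, leading to highly accurate image reuse and good synthesis quality.
Link: https://arxiv.org/abs/2507.06571
Authors: Srihari K B,Pushpak Bhattacharyya
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: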
Abstract:We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses-CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks-ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.
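[NLP-29] The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production
[Quick read]: This paper addresses invalidation in exchanges involving generative AI, meaning any factual, logical, or structural breach. The key to the solution is a discursive-network model that treats humans and large language models (LLMs) as equal nodes and tracks how their statements circulate in order to analyse how invalidation evolves. The model identifies four hazards of invalidation and proposes raising overall system reliability through a peer review mechanism, such as the open-source Flaws-of-Others (FOO) algorithm, so that the system reaches a truth-dominant state. The core idea is to wire imperfect models into a network in which they check one another, rather than to rely on perfecting a single model.
Link: https://arxiv.org/abs/2507.06565
Authors: Juan B. Gutiérrez
Affiliations: University of Texas at San Antonio
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages, 3 figures, 4 tables, 1 algorithm, 28 references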
Abstract:Large-language models turn writing into a live exchange between humans and software. We capture this new medium with a discursive-network model that treats people and LLMs as equal nodes and tracks how their statements circulate. Broadening the focus from isolated hallucinations, we define invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. A general mathematical model of discursive networks is developed to provide valuable insights: A network governed only by drift and self-repair stabilizes at a modest error rate; adding fabrication reproduces the high rates seen in current LLMs. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmoniser merges their verdicts. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other honest.
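[NLP-30] DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse
[Quick read]: This paper addresses the problem that social media users make scientific claims in tweets without citing sources, which creates a need to verify those claims. The key to the solution is exploring several data augmentation techniques, retrieval and reranking pipelines, and a fine-tuned bi-encoder model in order to find the relevant scientific papers from implicit references in tweets.
Link: https://arxiv.org/abs/2507.06563
Authors: Jeanette Schofield,Shuyu Tian,Hoang Thanh Thanh Truong,Maximilian Heil
Affiliations: Georgia Institute of Technology
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: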
Abstract:Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, an improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on GitHub at this https URL.
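For readers unfamiliar with the MRR@5 metric quoted above, the following is a minimal sketch of how it is commonly computed: for each query, take the reciprocal of the rank of the correct document if it appears in the top 5, otherwise 0, then average over queries. The function name `mrr_at_k` and the toy tweet and paper ids are hypothetical; the shared task's official scorer may differ in details such as tie handling.

```python
def mrr_at_k(ranked_lists, gold_ids, k=5):
    """Mean reciprocal rank with a cutoff at k.

    ranked_lists: one ranked list of document ids per query.
    gold_ids: the single correct document id for each query.
    """
    if not gold_ids:
        return 0.0
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(gold_ids)


# Three hypothetical tweets with their retrieved paper ids.
ranked_lists = [
    ["a", "b", "c"],                  # gold at rank 1 -> 1.0
    ["x", "y", "z"],                  # gold at rank 2 -> 0.5
    ["q", "r", "s", "t", "u", "v"],   # gold at rank 6, outside top 5 -> 0.0
]
gold_ids = ["a", "y", "v"]
score = mrr_at_k(ranked_lists, gold_ids, k=5)
```

On this toy data the score is (1.0 + 0.5 + 0.0) / 3 = 0.5. The reported gain is plain subtraction on the same scale: 0.58 - 0.43 = 0.15.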
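[NLP-31] Large Language Model for Extracting Complex Contract Information in Industrial Scenes
[Quick read]: This paper addresses dataset construction and model performance for complex contract information extraction in industrial scenarios. The key to the solution is to run cluster analysis on industrial contract texts and use GPT-4 and GPT-3.5 to extract key information, obtaining high-quality annotations. Data augmentation is then achieved by constructing new texts, with GPT-3.5 generating unstructured contract texts from randomly combined keywords to improve model robustness. Finally, a large language model is fine-tuned on the high-quality dataset, with LoRA, data balancing, and data augmentation together improving accuracy and robustness.
Link: https://arxiv.org/abs/2507.06539
Authors: Yunyang Cao,Yanjun Li,Silong Dai
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: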
Abstract:This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.
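[NLP-32] InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior
[Quick read]: This paper addresses the challenge of aligning large language models (LLMs) with investor decision-making processes under herd behavior. Conventional Supervised Fine-Tuning (SFT) is limited by the scarcity of real-user data, which makes data collection costly and raises privacy risks. The key to the proposed solution is building high-quality SFT datasets from theoretical solutions to similar, simple optimal investment problems rather than from real data on complex scenarios, which improves training efficiency. Experiments show that data generated this way lets LLM parameters converge faster and markedly improves performance on both simple and complex investment problems, indicating its potential for aligning investment decisions.
Link: https://arxiv.org/abs/2507.06528
Authors: Huisheng Wang,Zhuoshi Pan,Hangjing Zhang,Mingxiao Liu,Hanqing Gao,H. Vicky Zhao
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: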
Abstract:Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at this https URL.
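[NLP-33] FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
[Quick read]: This paper addresses hallucination in Video Multimodal Large Language Models (VideoMLLMs), where generated content contradicts the visual input. Existing evaluation methods are limited to a single task and cannot effectively assess hallucinations in open-ended, free-form responses. The key to the solution is the FIFA framework, which extracts comprehensive descriptive facts, models their semantic dependencies with a Spatio-Temporal Semantic Dependency Graph, and verifies them with VideoQA models. The paper also introduces Post-Correction, a tool-based correction framework that revises hallucinated content and so improves factual consistency in both text and video generation.
Link: https://arxiv.org/abs/2507.06523
Authors: Liqiang Jing,Viet Lai,Seunghyun Yoon,Trung Bui,Xinya Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
Comments: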
Abstract:Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
zh
[NLP-34] SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers ACL2025
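Summary: This paper addresses the excessive memory consumption of the KV cache (Key-Value Cache) during Large Language Model (LLM) inference. The key to the solution is a new method, SpindleKV, which balances KV cache reduction across shallow and deep layers: deep layers use an attention-weight-based eviction method, while shallow layers use a codebook-based replacement strategy learned through similarity and a merging policy. SpindleKV also resolves the Grouped-Query Attention (GQA) dilemma faced by other attention-based eviction methods. Experiments show that SpindleKV reduces the KV cache more than baseline methods while preserving model performance.
Link: https://arxiv.org/abs/2507.06517
Authors: Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang
Affiliations: Wuhan University; Xiaomi
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 main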
Abstract:Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to inference systems. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. Based on our observation that the KV cache exhibits a high degree of similarity, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach that is learned through similarity and a merging policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma faced by other attention-based eviction methods. Experiments on two common benchmarks with three different LLMs show that SpindleKV achieves better KV cache reduction than baseline methods while preserving similar or even better model performance.
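As a rough illustration of what attention-weight-based eviction means in general, the sketch below keeps only the cached positions that received the most attention. It is a generic, simplified example and not SpindleKV's actual algorithm: the function name, the scoring rule (summed attention per position), and the fixed `keep` budget are all assumptions, and the codebook-based replacement for shallow layers is not shown.

```python
def evict_by_attention(keys, values, attn_weights, keep):
    """Keep the `keep` cached positions with the highest accumulated attention.

    keys, values: lists of per-position vectors (one entry per cached token)
    attn_weights: list of rows, one per query, each row holding one weight per cached token
    """
    seq_len = len(keys)
    # Total attention each cached position received across all queries (assumed scoring rule).
    scores = [sum(row[i] for row in attn_weights) for i in range(seq_len)]
    # Rank positions by score, take the top `keep`, then restore original token order.
    ranked = sorted(range(seq_len), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Hypothetical toy cache with four positions and two queries.
keys = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
attn = [[0.1, 0.6, 0.1, 0.2],
        [0.2, 0.5, 0.0, 0.3]]
small_keys, small_values, kept_positions = evict_by_attention(keys, values, attn, keep=2)
```

Here positions 1 and 3 survive because they carry the largest summed attention; a real system would apply this per layer and per head, with a budget chosen for each.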
[NLP-35] Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings
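Summary: This paper addresses the translation of puns across languages, a problem that has long challenged professional translators and machine translation systems. The key to the solution is combining state-of-the-art large language models with specialized pun generation techniques in a three-stage approach: first, establishing a baseline with feedback based on a new contrastive learning dataset; second, implementing a guided chain-of-thought pipeline with combined phonetic-semantic embeddings; third, building a multi-agent generator-discriminator framework to evaluate and regenerate puns. The method aims to capture the linguistic creativity and humor of the source pun rather than merely duplicating its vocabulary.
Link: https://arxiv.org/abs/2507.06506
Authors: Russell Taylor, Benjamin Herbert, Michael Sana
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: CLEF 2025 Working Notes, 9-12 September 2025, Madrid, Spain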
Abstract:Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology employs a three-stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we implement a multi-agent generator-discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology’s primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically-informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor. 
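[NLP-36] On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
Summary: This paper addresses the insufficient robustness of verbal confidence generated by Large Language Models (LLMs) under adversarial attacks in high-stakes applications. The key to the solution is a novel framework that attacks verbal confidence scores through perturbation and jailbreak methods, exposing the vulnerability of current confidence elicitation methods and showing that even subtle semantic-preserving modifications can produce misleading confidence in responses.
Link: https://arxiv.org/abs/2507.06489
Authors: Stephen Obadinma, Xiaodan Zhu
Affiliations: Queen’s University
Categories: Computation and Language (cs.CL)
Comments: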
Abstract:Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
[NLP-37] Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
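Summary: This paper aims to address the high cost and poor scalability of data collection and fine-tuning in reinforcement learning (RL)-based video reasoning. Existing methods typically rely on large-scale supervised fine-tuning (SFT) and long Chain-of-Thought (CoT) annotations, which makes them resource-intensive and hard to scale. The key to the solution is combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy: the resource-intensive SFT step is skipped in favor of pure RL training with output-based rewards, and a sparse-to-dense video TTS strategy improves inference performance.
Link: https://arxiv.org/abs/2507.06485
Authors: Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
Affiliations: UNC Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The first two authors contributed equally. Project page: this https URL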
Abstract:Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.
[NLP-38] Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents
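Summary: This paper examines how stylized, voiced agents shape user interaction in a multimodal language learning environment. The approach uses large language models and expressive text-to-speech synthesis to build Japanese character agents with different speech styles and emotional tones, and analyzes user engagement, perceived usability, emotional responses, and learning behaviors. The study highlights the strong influence of agent design, especially voice, persona, and linguistic style, on user experience, motivation, and strategy.
Link: https://arxiv.org/abs/2507.06483
Authors: Zackary Rackauckas, Julia Hirschberg
Affiliations: Columbia University; RoleGaku
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: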
Abstract:This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.
[NLP-39] A Systematic Analysis of Hybrid Linear Attention
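Summary: This paper aims to address the quadratic complexity and memory issues Transformers face on long sequences, as well as the limited recall performance of linear attention mechanisms. The key to the solution is a systematic evaluation of different linear attention models (from vector recurrences to advanced gating mechanisms) in both standalone and hybrid architectures, supported by training and open-sourcing 72 models. The study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models, and recommends architectures such as HGRN-2 or GatedDeltaNet, which achieve Transformer-level recall at linear-to-full attention ratios between 3:1 and 6:1.
Link: https://arxiv.org/abs/2507.06457
Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: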
Abstract:Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations - vector recurrences to advanced gating mechanisms - both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at this https URL.
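To make a linear-to-full ratio such as 3:1 concrete, the sketch below lays out a layer stack with one full-attention layer after every three linear-attention layers. The function name and the evenly interleaved placement are assumptions for illustration; the abstract does not say where the full-attention layers sit in the authors' models.

```python
def hybrid_layout(num_layers, linear_per_full):
    """Return layer types with one full-attention layer after every
    `linear_per_full` linear-attention layers (assumed interleaving)."""
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


# A 3:1 linear-to-full layout for a hypothetical 8-layer model.
layout = hybrid_layout(num_layers=8, linear_per_full=3)
ratio = layout.count("linear") / layout.count("full")
```

With `linear_per_full=6` the same helper gives the 6:1 end of the recommended range, so only about one layer in seven pays the quadratic attention cost.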
[NLP-40] A Semantic Parsing Framework for End-to-End Time Normalization
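Summary: This paper addresses the limited expressivity of traditional time normalization systems and their difficulty with complex temporal expressions. The key to the solution is formulating time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators, and using large language models (LLMs) to generate executable SCATE code, enabling automatic augmentation and validation of large-scale annotated data.
Link: https://arxiv.org/abs/2507.06450
Authors: Xin Su, Sungduk Yu, Phillip Howard, Steven Bethard
Affiliations: Intel Labs; Thoughtworks; University of Arizona
Categories: Computation and Language (cs.CL)
Comments: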
Abstract:Time normalization is the task of converting natural language temporal expressions into machine-readable representations. It underpins many downstream applications in information retrieval, question answering, and clinical decision-making. Traditional systems based on the ISO-TimeML schema limit expressivity and struggle with complex constructs such as compositional, event-relative, and multi-span time expressions. In this work, we introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators. We implement a fully executable SCATE Python library and demonstrate that large language models (LLMs) can generate executable SCATE code. Leveraging this capability, we develop an automatic data augmentation pipeline using LLMs to synthesize large-scale annotated data with code-level validation. Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.
[NLP-41] Perception-Aware Policy Optimization for Multimodal Reasoning
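Summary: This paper addresses the poor performance of current Reinforcement Learning with Verifiable Rewards (RLVR) methods on multimodal reasoning tasks, in particular the bottleneck caused by errors in perceiving visual inputs. The key to the solution is Perception-Aware Policy Optimization (PAPO), which adds an Implicit Perception Loss, in the form of a KL divergence term, to the GRPO objective so that the model learns to perceive while learning to reason, improving performance on multimodal tasks.
Link: https://arxiv.org/abs/2507.06448
Authors: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: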
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: this https URL.
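The abstract does not spell out which distributions the KL term compares. As a rough sketch only, the example below assumes the term contrasts the model's next-token distribution given the intact image with the one given a masked image, so that a larger divergence means the output depends more on what the model actually sees. The distributions are made-up numbers, and how the term is weighted and combined with the GRPO objective is not shown.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions for the same prompt and question:
p_full = [0.70, 0.20, 0.10]    # conditioned on the intact image (assumed setup)
p_masked = [0.40, 0.35, 0.25]  # conditioned on a masked image (assumed setup)

# Larger values indicate that the prediction changes more when visual content is removed.
perception_signal = kl_divergence(p_full, p_masked)
```

If the two distributions were identical, the signal would be zero, which would suggest the model is answering without using the image at all.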
[NLP-42] Can Interpretation Predict Behavior on Unseen Data?
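Summary: This paper asks how interpretability tools can be used to predict model behavior on out-of-distribution (OOD) data. The key finding is that simple observational interpretability tools can predict generalization performance on unseen data: when a model’s attention on in-distribution data exhibits hierarchical patterns, its generalization on OOD data also tends to be hierarchical, even when the rule’s implementation does not rely on those patterns. This provides preliminary evidence to motivate future interpretability work on predicting model behavior.
Link: https://arxiv.org/abs/2507.06445
Authors: Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra
Affiliations: Harvard University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: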
Abstract:Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data – even when the rule’s implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.
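[NLP-43] Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling
Summary: This paper addresses the analysis of how global policy discourse evolves over time, particularly for complex challenges such as climate change, where traditional manual thematic coding is inefficient on high-volume, complex, high-dimensional data and struggles to capture the interconnected nature of the discourse. The key to the solution is applying the dynamic embedded topic model (DETM), a probabilistic model that captures how topics evolve over time, to United Nations Framework Convention on Climate Change (UNFCCC) policy decision texts, revealing a shift in policy focus from early attention to greenhouse gases and international conventions toward recent emphases on implementation, technical collaboration, capacity building, finance, and global agreements.
Link: https://arxiv.org/abs/2507.06435
Authors: Rafiu Adekoya Badekale, Adewale Akinfaderin
Affiliations: Hamoye Foundation; The George Washington University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 7 figures. Code and data available at this https URL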
Abstract:Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM), a probabilistic model designed to capture the temporal dynamics of topics, to analyze the evolution of global climate policy discourse. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings, and we conclude with future directions and refinements to extend this approach to other policy domains.
[NLP-44] Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders
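Summary: This paper addresses the reduced trustworthiness of Large Language Models (LLMs) as black-box algorithms and the resulting difficulty of improving downstream task performance. The key to the solution is a dictionary-learning-based LLM decomposition method using sparse autoencoders, which extracts monosemantic features from polysemantic LLM neurons and identifies model-internal misunderstandings, so that prompts can be automatically rewritten with added annotations to improve the LLM’s interpretation.
Link: https://arxiv.org/abs/2507.06427
Authors: Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: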
Abstract:Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
[NLP-45] Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling
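Summary: This paper addresses the tendency of reward models (RMs) to fail under out-of-distribution data or adversarial perturbations, which limits their use in real-world settings. The key to the solution is a tractable method, requiring no prior knowledge of the preference distribution, that discovers reward model failure modes through reward-guided controlled decoding, and on this basis a self-improving reward modeling framework, REFORM. The framework uses the reward model itself to generate falsely scored responses, which augment the training data and patch the reward model’s misaligned behavior, significantly improving robustness without sacrificing reward quality.
Link: https://arxiv.org/abs/2507.06419
Authors: Pankayaraj Pathmanathan, Furong Huang
Affiliations: University of Maryland College Park; Capital One
Categories: Computation and Language (cs.CL)
Comments: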
Abstract:Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution agnostic method for discovering reward model failure modes via reward guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model’s misaligned behavior. We evaluate REFORM on two widely used preference datasets Anthropic Helpful Harmless (HH) and PKU Beavertails and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.
[NLP-46] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
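Summary: This paper addresses how to accurately identify relevant information within extensive, noisy inputs in long-context reasoning. The key to the solution is PERK (Parameter Efficient Reasoning over Knowledge), which encodes long input contexts through gradient updates to a lightweight model adapter at test time. PERK uses two nested optimization loops in a meta-training phase: the inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model, and the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context.
Link: https://arxiv.org/abs/2507.06415
Authors: Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut
Affiliations: EPFL
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures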
Abstract:Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
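[NLP-47] Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles
Summary: This paper addresses the relationship between syntactic operations and the assignment of semantic roles in linguistic structure, in particular how to formalize the structure of head, complement, and specifier together with the structure of phases. The key to the solution is modeling these structures as a bud generating system of a colored operad, filtering freely generated syntactic objects by coloring rules, and relating this to the hypermagma structure, thereby giving a unified treatment of syntactic phenomena such as Internal Merge, the Extended Projection Principle, the Empty Category Principle, and the Phase Impenetrability Condition.
Link: https://arxiv.org/abs/2507.06393
Authors: Matilde Marcolli, Riny Huijbregts, Richard K. Larson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Quantum Algebra (math.QA); Rings and Algebras (math.RA)
Comments: LaTeX, 48 pages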
Abstract:We show that head functions on syntactic objects extend the magma structure to a hypermagma, with the c-command relation compatible with the magma operation and the m-command relation with the hypermagma. We then show that the structure of head and complement and specifier, additional modifier positions, and the structure of phases in the Extended Projection can be formulated as a bud generating system of a colored operad, in a form similar to the structure of theta roles. We also show that, due to the special form of the colored operad generators, the filtering of freely generated syntactic objects by these coloring rules can be equivalently formulated as a filtering in the course of structure formation via a colored Merge, which can in turn be related to the hypermagma structure. The rules on movement by Internal Merge with respect to phases, the Extended Projection Principle, Empty Category Principle, and Phase Impenetrability Condition are all subsumed into the form of the colored operad generators. Movement compatibilities between the phase structure and the theta roles assignments can then be formulated in terms of the respective colored operads and a transduction of colored operads.
[NLP-48] Evaluating Morphological Alignment of Tokenizers in 70 Languages ICML2025
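Summary: This paper addresses how to effectively evaluate tokenizer quality, focusing on the extent to which tokenizers preserve linguistically meaningful subwords, i.e., align token boundaries with morpheme boundaries. The key to the solution is expanding MorphScore to support 70 languages, making evaluation more flexible, and addressing limitations of the original version. Correlating alignment scores with the downstream performance of five pre-trained language models on seven tasks shows that morphological alignment explains little of the variance in model performance, suggesting that morphological alignment alone does not measure the dimensions of tokenization quality relevant to model performance.
Link: https://arxiv.org/abs/2507.06378
Authors: Catherine Arnett, Marisa Hudspeth, Brendan O’Connor
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 3 figures. Accepted to the Tokenization Workshop at ICML 2025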
Abstract:While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
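The notion of aligning token boundaries with morphological boundaries can be illustrated with a small check. This is a simplified sketch and not MorphScore's actual scoring procedure: the function names, the single-boundary-per-word input format, and the plain hit rate are assumptions made for the example.

```python
def token_boundaries(tokens):
    """Character offsets inside a word where one token ends and the next begins."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def boundary_alignment(examples):
    """Fraction of words whose tokenization places a boundary at the morpheme boundary.

    examples: list of (tokens, morpheme_boundary_offset) pairs, one pair per word.
    """
    hits = sum(1 for tokens, boundary in examples if boundary in token_boundaries(tokens))
    return hits / len(examples)


# "walking" = "walk" + "ing", so the morpheme boundary sits at character offset 4.
examples = [
    (["walk", "ing"], 4),   # token boundary matches the morpheme boundary
    (["wal", "king"], 4),   # token boundary falls inside the stem
]
score = boundary_alignment(examples)
```

A word kept as a single token has no internal boundary and counts as a miss here; a real metric might treat such words differently.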
[NLP-49] Could the Road to Grounded Neuro-symbolic AI be Paved with Words-as-Classifiers?
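Summary: This paper addresses how to unify formal, distributional, and grounded theories of computational semantics. The key to the solution is the words-as-classifiers model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models and has been well tested in interactive dialogue settings.
Link: https://arxiv.org/abs/2507.06335
Authors: Casey Kennington, David Schlangen
Affiliations: Boise State University; University of Potsdam
Categories: Computation and Language (cs.CL)
Comments: 9 pages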
Abstract:Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifiers model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature and has been well tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.
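[NLP-50] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time
Summary: This paper addresses the quadratic growth of computation and memory overhead with sequence length in Transformer-based language models, which limits the use of large language models (LLMs) on long-sequence tasks. The key to the solution is ETT (Extend at Test-Time), which efficiently fine-tunes the model’s parameters at test time on the input context, chunked into small overlapping subsequences, extending the model’s context length with constant memory requirements and linear computation overhead.
Link: https://arxiv.org/abs/2507.06313
Authors: Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmad, Yang Liu
Affiliations: Huawei Technologies; Ascend Team; Toronto Research Center
Categories: Computation and Language (cs.CL)
Comments: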
Abstract:Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test-time by efficiently fine-tuning the model’s parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in an LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
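The chunking step, splitting a long input into small overlapping subsequences, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the chunk length, and the overlap size are assumptions, and the fine-tuning pass that would run over each chunk is omitted.

```python
def overlapping_chunks(tokens, chunk_len, overlap):
    """Split `tokens` into windows of `chunk_len` that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_len])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_len >= len(tokens):
            break
    return chunks


# Toy input: 10 token ids, windows of 4 with 2 tokens shared between neighbors.
chunks = overlapping_chunks(list(range(10)), chunk_len=4, overlap=2)
```

Because each chunk has a fixed length, the memory needed per fine-tuning step stays constant while the number of chunks, and so the total compute, grows linearly with the input length.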
[NLP-51] Humans overrely on overconfident language models across languages
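Summary: This paper addresses the calibration of Large Language Model (LLM) responses in multilingual settings, in particular the risks of model overconfidence across languages and of users’ overreliance on generated content. The key to the approach is analyzing how the distribution of LLM-generated epistemic markers differs across languages and measuring how much human users rely on these markers in each language, revealing model safety challenges in multilingual settings. The study finds that although LLMs are overconfident across languages, they are somewhat sensitive to documented linguistic variation, which supports building model safety evaluations that are contextualized by culture and language.
Link: https://arxiv.org/abs/2507.06306
Authors: Neil Rathi, Dan Jurafsky, Kaitlyn Zhou
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages main text, to appear at COLM 2025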
Abstract:As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., ‘It’s definitely,’ ‘I think’) can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.
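[NLP-52] The bitter lesson of misuse detection
Summary: This paper addresses the poor performance of current supervision systems under diverse real-world attacks, specifically in misuse detection for generative AI. Its key contribution is BELLS, a benchmark for evaluating LLM supervision systems along two dimensions, harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak), covering 3 jailbreak families and 11 harm categories. Evaluations with BELLS reveal significant limitations of existing supervision systems, highlight the advantage of generalist large language models (LLMs) in semantic understanding and generalization, and point to the need for further research on making misuse detection more robust.
Link: https://arxiv.org/abs/2507.06282
Authors: Hadrien Mariaccia,Charbel-Raphaël Segerie,Diego Dorn
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: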
Abstract:Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two dimensional: harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak) and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is “harmful or not” largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the “bitter lesson” of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.
[NLP-53] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Summary: This report addresses the trade-off between model capability and cost in complex agentic problem solving by offering a family of generative AI models at different performance tiers to serve diverse application needs. The key is the Gemini 2.X family's combination of long-context, multimodal understanding, and reasoning capabilities, together with model designs tuned for different compute and latency budgets, so that the family spans the full Pareto frontier from high capability to low cost.
Link: https://arxiv.org/abs/2507.06261
Authors: Gheorghe Comanici,Eric Bieber,Mike Schaekermann,Ice Pasupat,Noveen Sachdeva,Inderjit Dhillon,Marcel Blistein,Ori Ram,Dan Zhang,Evan Rosen,Luke Marris,Sam Petulla,Colin Gaffney,Asaf Aharoni,Nathan Lintz,Tiago Cardal Pais,Henrik Jacobsson,Idan Szpektor,Nan-Jiang Jiang,Krishna Haridasan,Ahmed Omran,Nikunj Saunshi,Dara Bahri,Gaurav Mishra,Eric Chu,Toby Boyd,Brad Hekman,Aaron Parisi,Chaoyi Zhang,Kornraphop Kawintiranon,Tania Bedrax-Weiss,Oliver Wang,Ya Xu,Ollie Purkiss,Uri Mendlovic,Ilaï Deutel,Nam Nguyen,Adam Langley,Flip Korn,Lucia Rossazza,Alexandre Ramé,Sagar Waghmare,Helen Miller,Vaishakh Keshava,Ying Jian,Xiaofan Zhang,Raluca Ada Popa,Kedar Dhamdhere,Blaž Bratanič,Kyuyeun Kim,Terry Koo,Ferran Alet,Yi-ting Chen,Arsha Nagrani,Hannah Muckenhirn,Zhiyuan Zhang,Corbin Quick,Filip Pavetić,Duc Dung Nguyen,Joao Carreira,Michael Elabd,Haroon Qureshi,Fabian Mentzer,Yao-Yuan Yang,Danielle Eisenbud,Anmol Gulati,Ellie Talius,Eric Ni,Sahra Ghalebikesabi,Edouard Yvinec,Alaa Saade,Thatcher Ulrich,Lorenzo Blanco,Dan A. Calian,Muhuan Huang,Aäron van den Oord,Naman Goyal,Terry Chen,Praynaa Rawlani,Christian Schallhart,Swachhand Lokhande,Xianghong Luo,Jyn Shan,Ceslee Montgomery,Victoria Krakovna,Federico Piccinini,Omer Barak,Jingyu Cui,Yiling Jia,Mikhail Dektiarev,Alexey Kolganov,Shiyu Huang,Zhe Chen,Xingyu Wang,Jessica Austin,Peter de Boursac,Evgeny Sluzhaev,Frank Ding,Huijian Li,Surya Bhupatiraju
Affiliations: Google
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 72 pages, 17 figures
Abstract:In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
[NLP-54] Emergent misalignment as prompt sensitivity: A research note
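Summary: This research note studies emergent misalignment (EM), where a finetuned language model gives misaligned responses in settings very different from those seen in training. The approach is to analyze how nudges in the prompt affect model behavior: asking the model to be "evil" reliably elicits misaligned responses, asking it to be "HHH" often reduces them, and the model is more likely to change its answer when the user expresses disagreement. The study also suggests that EM models perceive harmful intent in some seemingly neutral questions, which leads to misaligned outputs.
Link: https://arxiv.org/abs/2507.06253
Authors: Tim Wyse,Twm Stone,Anna Soligo,Daniel Tan
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 15 figures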
Abstract:Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form questions, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be 'evil'. Conversely, asking them to be 'HHH' often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base control models do not exhibit this sensitivity to prompt nudges. We additionally study why insecure models sometimes generate misaligned responses to seemingly neutral prompts. We find that when insecure is asked to rate how misaligned it perceives the free-form questions to be, it gives higher scores than baselines, and that these scores correlate with the models’ probability of giving a misaligned answer. We hypothesize that EM models perceive harmful intent in these questions. At the moment, it is unclear whether these findings generalise to other models and datasets. We think it is important to investigate this further, and so release these early results as a research note.
[NLP-55] Super Kawaii Vocalics: Amplifying the “Cute” Factor in Computer Voice
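Summary: This paper asks how the Japanese cultural concept of "kawaii" (cute) can be defined and manipulated through elements of voice, particularly for computer agents and social robots. The key approach is to manipulate vocal features such as fundamental and formant frequencies to locate kawaii "sweet spots" for different voices; the results also indicate a ceiling effect on the kawaii vocalics of certain voices. The study offers empirical validation of a preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.
Link: https://arxiv.org/abs/2507.06235
Authors: Yuto Mandai,Katie Seaborn,Tomoyasu Nakano,Xin Sun,Yijia Wang,Jun Kato
Affiliations: Institute of Science Tokyo; AIST (National Institute of Advanced Industrial Science and Technology); University of Amsterdam
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: CHI '25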
Abstract:“Kawaii” is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii “sweet spots” through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.
[NLP-56] DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning
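Summary: This paper tackles retrosynthetic pathway discovery for complex molecules, in particular the challenge of finding novel routes beyond predefined templates. The key contribution is DeepRetro, an iterative, hybrid LLM-based framework that combines conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a feedback-driven loop that refines the route step by step. With multi-step recursive feedback and rigorous validity, stability, and hallucination checks, the method explores and corrects pathways dynamically, improving its ability to find viable and potentially novel retrosynthetic routes.
Link: https://arxiv.org/abs/2507.07060
Authors: Shreyas Vinaya Sathyanarayana,Rahil Shah,Sharanabasava D. Hiremath,Rishikesh Panda,Rahul Jana,Riya Singh,Rida Irfan,Ashwin Murali,Bharath Ramsundar
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
Comments: 51 pages,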
Abstract:Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise but effectively harnessing LLM reasoning capabilities for effective multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM subsequently proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance state-of-art in complex chemical syntheses.
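The loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DeepRetro's actual code: `plan`, `template_engine`, `llm_propose`, and `passes_checks` are hypothetical names, routes are modelled as plain dictionaries, and the depth limit is an added assumption that keeps the recursion bounded.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical representation: a route maps each molecule to its chosen precursors.
Route = Dict[str, List[str]]


def plan(
    target: str,
    template_engine: Callable[[str], Optional[Route]],
    llm_propose: Callable[[str], List[List[str]]],
    passes_checks: Callable[[str, List[str]], bool],
    depth: int = 0,
    max_depth: int = 3,
) -> Optional[Route]:
    """Return a route for `target`, or None if none is found within max_depth."""
    # Step 1: try the template-based engine first.
    route = template_engine(target)
    if route is not None:
        return route
    if depth >= max_depth:
        return None
    # Step 2: otherwise ask the LLM for single-step disconnections.
    for precursors in llm_propose(target):
        # Step 3: discard suggestions that fail the (stand-in) checks.
        if not passes_checks(target, precursors):
            continue
        candidate: Route = {target: precursors}
        # Step 4: feed every precursor back into the same pipeline.
        for precursor in precursors:
            sub_route = plan(precursor, template_engine, llm_propose,
                             passes_checks, depth + 1, max_depth)
            if sub_route is None:
                break  # this disconnection leads nowhere, try the next one
            candidate.update(sub_route)
        else:
            return candidate
    return None
```

In practice each callable would wrap a real tool (a template-based planner, an LLM client, and chemistry validators); here they are left abstract so that the control flow is the only thing the sketch commits to.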
[NLP-57] Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation
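Summary: This paper aims to remove the dependence on a pronunciation lexicon in phoneme-based crosslingual speech recognition while keeping the benefits of phonetic supervision in data efficiency and information sharing across languages. The key is a latent variable model that treats phonemes as discrete latent variables and builds three models, speech-to-phoneme (S2P), phoneme-to-grapheme (P2G), and grapheme-to-phoneme (G2P), which are trained jointly with the joint stochastic approximation (JSA) algorithm to achieve lexicon-free crosslingual speech recognition.
Link: https://arxiv.org/abs/2507.06249
Authors: Saierdaer Yusuyin,Te Ma,Hao Huang,Zhijian Ou
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: submitted to IEEE TASLP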
Abstract:Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.
Computer Vision
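[CV-0] Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor
Summary: This paper addresses the loss of fine-grained detail that multimodal large language models (MLLMs) suffer in image question answering when CLIP is used as the visual encoder. The key idea is to explore pre-trained text-to-image diffusion models as instruction-aware visual encoders, exploiting the semantic richness and strong image-text alignment of their internal representations and using text conditioning to focus the model on regions relevant to the input question, thereby improving visual understanding.
Link: https://arxiv.org/abs/2507.07106
Authors: Vatsal Agarwal,Matthew Gwilliam,Gefen Kohavi,Eshan Verma,Daniel Ulbricht,Abhinav Shrivastava
Affiliations: University of Maryland; Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: see this https URL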
Abstract:Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it often can miss fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find diffusion features are both rich in semantics and can encode strong image-text alignment. Moreover, we find that we can leverage text conditioning to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, where the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found this https URL.
[CV-1] 4KAgent: Agentic Any Image to 4K Super-Resolution
Summary: This paper targets general-purpose super-resolution of any image to 4K, especially high-quality upscaling of extremely low-resolution inputs with severe degradations. The key contribution is the 4KAgent system, built from three core components: a Profiling module that customizes the pipeline, a Perception Agent that combines vision-language models with image quality assessment experts, and a Restoration Agent that follows a recursive execution-reflection paradigm guided by a quality-driven mixture-of-expert policy. The system also embeds a specialized face restoration pipeline to enhance facial details in portrait and selfie photos.
Link: https://arxiv.org/abs/2507.07105
Authors: Yushen Zuo,Qi Zheng,Mingyang Wu,Xinrui Jiang,Renjie Li,Jian Wang,Yide Zhang,Gengchen Mai,Lihong V. Wang,James Zou,Xiaoyu Wang,Ming-Hsuan Yang,Zhengzhong Tu
Affiliations: Texas A&M University; Stanford University; Snap Inc.; CU Boulder; UT Austin; California Institute of Technology; Topaz Labs; UC Merced
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Project page: this https URL
Abstract:We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: this https URL.
[CV-2] Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
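Summary: This paper addresses the reliance on massive high-quality image-text datasets and the high compute cost of building vision-language models (VLMs) with strong captioning capabilities. The key contribution is the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages pretrained components (a vision encoder, the decoder of a text-to-image diffusion model, and a large language model) for knowledge distillation and efficient training. Freezing the pretrained text-to-image diffusion decoder establishes an information bottleneck in the language representation space, which significantly reduces data requirements and training cost.
Link: https://arxiv.org/abs/2507.07104
Authors: Tiezheng Zhang,Yitong Li,Yu-cheng Chou,Jieneng Chen,Alan Yuille,Chen Wei,Junfei Xiao
Affiliations: Johns Hopkins University; Tsinghua University; Rice University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: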
Abstract:Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under 1,000 USD.
[CV-3] Addressing Imbalanced Domain-Incremental Learning through Dual-Balance Collaborative Experts ICML2025
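Summary: This paper addresses two key problems that imbalanced data causes in Domain-Incremental Learning (DIL): intra-domain class imbalance and cross-domain class distribution shifts. Intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require the model to preserve knowledge of many-shot classes from earlier domains and use new data to improve few-shot class performance in old domains. The key contribution is the Dual-Balance Collaborative Experts (DCE) framework: a frequency-aware expert group learns features for specific frequency groups to ease intra-domain imbalance, and pseudo-features generated by balanced Gaussian sampling from historical class statistics are used to learn a dynamic expert selector that balances retention of old-domain knowledge against few-shot gains from new data.
Link: https://arxiv.org/abs/2507.07100
Authors: Lan Li,Da-Wei Zhou,Han-Jia Ye,De-Chuan Zhan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025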
Abstract:Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments, requiring models to adjust to evolving domains while preserving historical knowledge. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. These challenges significantly hinder model performance, as intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require maintaining well-learned many-shot classes and transferring knowledge to improve few-shot class performance in old domains. To overcome these challenges, we introduce the Dual-Balance Collaborative Experts (DCE) framework. DCE employs a frequency-aware expert group, where each expert is guided by specialized loss functions to learn features for specific frequency groups, effectively addressing intra-domain class imbalance. Subsequently, a dynamic expert selector is learned by synthesizing pseudo-features through balanced Gaussian sampling from historical class statistics. This mechanism navigates the trade-off between preserving many-shot knowledge of previous domains and leveraging new data to improve few-shot class performance in earlier tasks. Extensive experimental results on four benchmark datasets demonstrate DCE’s state-of-the-art performance.
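The "balanced Gaussian sampling from historical class statistics" step can be pictured with a small sketch. It assumes each class is summarized by a per-dimension mean and standard deviation (a diagonal Gaussian), which is a simplification chosen for illustration; the paper may store richer statistics. The function name and data layout are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical layout: label -> (mean vector, per-dimension standard deviation).
ClassStats = Dict[int, Tuple[List[float], List[float]]]


def sample_balanced_pseudo_features(
    class_stats: ClassStats, per_class: int, seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Draw the same number of pseudo-features for every class.

    Sampling an equal count per class is what makes the synthetic set
    balanced, regardless of how many real samples each class had.
    """
    rng = random.Random(seed)
    features: List[List[float]] = []
    labels: List[int] = []
    for label, (mean, std) in class_stats.items():
        for _ in range(per_class):
            # Diagonal Gaussian assumption: each dimension is sampled independently.
            features.append([rng.gauss(m, s) for m, s in zip(mean, std)])
            labels.append(label)
    return features, labels
```

A selector trained on such a set sees many-shot and few-shot classes equally often, which is the balancing effect the abstract attributes to this step.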
[CV-4] Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
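Summary: This paper addresses the generation of diverse and natural human motion sequences from text descriptions, in particular the weak zero-shot generalization of current methods. The key contributions are MotionMillion, the largest human motion dataset to date with over 2,000 hours and 2 million high-quality motion sequences, and MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Using a scalable architecture, the model is scaled to 7B parameters and shows strong generalization to out-of-domain and complex compositional motions.
Link: https://arxiv.org/abs/2507.07095
Authors: Ke Fan,Shunlin Lu,Minyue Dai,Runyi Yu,Lixing Xiao,Zhiyang Dou,Junting Dong,Lizhuang Ma,Jingbo Wang
Affiliations: Shanghai Jiao Tong University; CUHK, Shenzhen; Fudan University; HKUST; Zhejiang University; HKU; Shanghai AI Laboratory; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL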
Abstract:Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve the generalization ability of zero-shot. To this end, firstly, we develop an efficient annotation pipeline and introduce MotionMillion-the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at this https URL.
[CV-5] Evaluating Attribute Confusion in Fashion Text-to-Image Generation
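Summary: This paper addresses the difficulty of evaluating text-to-image generation models in domains with complex compositional generation, such as fashion. Existing automated evaluation methods are limited in assessing rich entity-attribute semantics and are prone to attribute confusion. The key contribution is a localized human evaluation protocol built on a Visual Question Answering (VQA) localization strategy, together with a new automatic metric, Localized VQAScore (L-VQAScore), which combines visual localization with VQA probing of both correct (reflection) and mislocalized (leakage) attribute generation to capture fine-grained entity-attribute associations more accurately.
Link: https://arxiv.org/abs/2507.07079
Authors: Ziyue Liu,Federico Girella,Yiming Wang,Davide Talon
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICIAP25. Project page: this https URL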
Abstract:Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, involving complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated to the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and miss-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.
[CV-6] Reading a Ruler in the Wild
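Summary: This paper addresses the accurate conversion of pixel measurements into absolute real-world dimensions, a fundamental challenge in computer vision that limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. The key contribution is RulerNet, a deep learning framework that reformulates ruler reading as a unified keypoint-detection problem and represents the ruler with geometric-progression parameters that are invariant to perspective transformations, enabling robust scale inference "in the wild". The paper also introduces DeepGP, a lightweight feed-forward network that regresses the geometric-progression parameters from noisy marks and eliminates iterative optimization, enabling real-time scale estimation.
Link: https://arxiv.org/abs/2507.07077
Authors: Yimu Pan,Manas Mehta,Gwen Sincerbeaux,Jeffery A. Goldstein,Alison D. Gernand,James Z. Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: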
Abstract:Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision and limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. We introduce RulerNet, a deep learning framework that robustly infers scale “in the wild” by reformulating ruler reading as a unified keypoint-detection problem and by representing the ruler with geometric-progression parameters that are invariant to perspective transformations. Unlike traditional methods that rely on handcrafted thresholds or rigid, ruler-specific pipelines, RulerNet directly localizes centimeter marks using a distortion-invariant annotation and training strategy, enabling strong generalization across diverse ruler types and imaging conditions while mitigating data scarcity. We also present a scalable synthetic-data pipeline that combines graphics-based ruler generation with ControlNet to add photorealistic context, greatly increasing training diversity and improving performance. To further enhance robustness and efficiency, we propose DeepGP, a lightweight feed-forward network that regresses geometric-progression parameters from noisy marks and eliminates iterative optimization, enabling real-time scale estimation on mobile or edge devices. Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions. These results underscore its utility as a generalizable measurement tool and its potential for integration with other vision components for automated, scale-aware analysis in high-impact domains. A live demo is available at this https URL.
[CV-7] An AI Approach for Learning the Spectrum of the Laplace-Beltrami Operator
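Summary: This paper addresses the computational cost of estimating the spectrum of the Laplace-Beltrami (LB) operator from triangulated meshes in geometric deep learning tasks. The established approach is based on the Finite Element Method (FEM) with complexity O(Nk), which becomes inefficient for large databases of CAD mechanical parts or for quality-control applications. The key contribution is a geometric deep learning framework that uses a Graph Neural Network to predict the LB spectrum efficiently from a rich set of part mesh features, including Gaussian curvature, mean curvature, and principal curvatures, substantially reducing computation time without sacrificing accuracy.
Link: https://arxiv.org/abs/2507.07073
Authors: Yulin An,Enrique del Castillo
Affiliations: The Pennsylvania State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures, submitted for publication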
Abstract:The spectrum of the Laplace-Beltrami (LB) operator is central in geometric deep learning tasks, capturing intrinsic properties of the shape of the object under consideration. The best established method for its estimation, from a triangulated mesh of the object, is based on the Finite Element Method (FEM), and computes the top k LB eigenvalues with a complexity of O(Nk), where N is the number of points. This can render the FEM method inefficient when repeatedly applied to databases of CAD mechanical parts, or in quality control applications where part metrology is acquired as large meshes and decisions about the quality of each part are needed quickly and frequently. As a solution to this problem, we present a geometric deep learning framework to predict the LB spectrum efficiently given the CAD mesh of a part, achieving significant computational savings without sacrificing accuracy, demonstrating that the LB spectrum is learnable. The proposed Graph Neural Network architecture uses a rich set of part mesh features - including Gaussian curvature, mean curvature, and principal curvatures. In addition to our trained network, we make available, for repeatability, a large curated dataset of real-world mechanical CAD models derived from the publicly available ABC dataset used for training and testing. Experimental results show that our method reduces computation time of the LB spectrum by approximately 5 times over linear FEM while delivering competitive accuracy.
[CV-8] Evaluating Large Multimodal Models for Nutrition Analysis: A Benchmark Enriched with Contextual Metadata
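Summary: This paper addresses two gaps in prior work: evaluations focus mainly on proprietary models such as GPT-4 and leave the broader range of large multimodal models (LMMs) underexplored, and the interaction between contextual metadata and different reasoning modifiers has not been studied. The key idea is to integrate contextual metadata derived from GPS coordinates, timestamps, and the food items present to improve LMM performance in nutrition analysis; integrating this metadata intelligently significantly reduces the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of predicted nutritional values.
Link: https://arxiv.org/abs/2507.07048
Authors: Bruce Coburn,Jiangpeng He,Megan E. Rollo,Satvinder S. Dhaliwal,Deborah A. Kerr,Fengqing Zhu
Affiliations: Purdue University; Massachusetts Institute of Technology; Curtin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: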
Abstract:Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
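For readers unfamiliar with the two error metrics named in this abstract, the standard definitions can be written out in a few lines. The calorie values below are invented purely for illustration and do not come from the paper.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |true - predicted|, in the original units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent.

    Undefined when any true value is zero, so callers must filter those out.
    """
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example: true vs. predicted calories for three meals.
true_kcal = [500.0, 250.0, 800.0]
pred_kcal = [450.0, 300.0, 800.0]
print(round(mae(true_kcal, pred_kcal), 2))   # absolute errors 50, 50, 0
print(round(mape(true_kcal, pred_kcal), 2))  # relative errors 10%, 20%, 0%
```

MAE is reported in the unit of the quantity (for example kcal or grams), whereas MAPE is scale-free, which is why the two are often reported together when nutrients with very different magnitudes are compared.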
[CV-9] Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices
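Quick read: This paper aims to extract structured tabular data efficiently from scanned invoice documents, particularly when the input is noisy or uses non-standard formats. The key to the solution is an optical character recognition (OCR) pipeline that uses Tesseract OCR for text recognition, combined with custom post-processing logic for table boundary detection, alignment, and row-column mapping, which improves the accuracy and consistency of data extraction.
Link: https://arxiv.org/abs/2507.07029
Authors: Parshva Dhilankumar Patel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 23 figures, submitted to arXiv in July 2025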
Abstract:This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
[CV-10] MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation ACM-MM2025
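Quick read: This paper tackles cross-modal knowledge distillation, where data and statistical heterogeneity prevents conventional distillation methods from exploiting the complementary prior knowledge in cross-modal teacher models. The key to the solution is the MST-Distill framework, which uses a mixture of specialized teachers together with an instance-level routing network for adaptive, dynamic distillation. It also adds an independently trained masking module that suppresses modality-specific discrepancies and reconstructs teacher representations, which mitigates knowledge drift and improves knowledge transfer.
Link: https://arxiv.org/abs/2507.07015
Authors: Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen, Quan Wang
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted to ACM MM 2025 (The 33rd ACM International Conference on Multimedia)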
Abstract:Knowledge distillation as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at this https URL.
[CV-11] Integrating Pathology Foundation Models and Spatial Transcriptomics for Cellular Decomposition from Histology Images
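Quick read: This paper aims to predict cellular composition accurately from hematoxylin and eosin (HE) stained histology images without running costly spatial transcriptomics experiments. The key to the solution is to use information-rich feature embeddings extracted by pre-trained pathology foundation models and to train a lightweight multi-layer perceptron (MLP) regressor on cell-type abundances, efficiently distilling knowledge from the foundation models to predict cell-type composition.
Link: https://arxiv.org/abs/2507.07013
Authors: Yutong Sun, Sichen Zhu, Peng Qiu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: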
Abstract:The rapid development of digital pathology and modern deep learning has facilitated the emergence of pathology foundation models that are expected to solve general pathology problems under various disease conditions in one unified model, with or without fine-tuning. In parallel, spatial transcriptomics has emerged as a transformative technology that enables the profiling of gene expression on hematoxylin and eosin (HE) stained histology images. Spatial transcriptomics unlocks the unprecedented opportunity to dive into existing histology images at a more granular, cellular level. In this work, we propose a lightweight and training-efficient approach to predict cellular composition directly from HE-stained histology images by leveraging information-enriched feature embeddings extracted from pre-trained pathology foundation models. By training a lightweight multi-layer perceptron (MLP) regressor on cell-type abundances derived via cell2location, our method efficiently distills knowledge from pathology foundation models and demonstrates the ability to accurately predict cell-type compositions from histology images, without physically performing the costly spatial transcriptomics. Our method demonstrates competitive performance compared to existing methods such as Hist2Cell, while significantly reducing computational complexity.
[CV-12] GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning
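Quick read: This paper addresses challenges in whole slide image (WSI) classification and automatic pathology caption generation for microscopic histopathology images, including redundant patches, unknown patch positions, and the difficulty of generating captions automatically. The key to the solution is the GNN-ViTCap framework. A visual feature extractor first generates patch embeddings; deep embedded clustering then removes redundant patches dynamically, and a scalar dot attention mechanism selects representative ones. A graph is built to capture local and global context, and a linear layer projects the aggregated image embeddings into the language model's input space, where they are combined with caption tokens to fine-tune a large language model for classification and caption generation.
Link: https://arxiv.org/abs/2507.07006
Authors: S M Taslim Uddin Raju, Md. Milon Islam, Md Rezwanul Haque, Hamdi Altaheri, Fakhri Karray
Affiliations: University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: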
Abstract:Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSI face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model’s input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state of the art approaches, offering a reliable and efficient solution for microscopy based patient diagnosis.
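The graph-construction step mentioned in the abstract (connecting each node to its nearest neighbors in a similarity matrix) can be sketched as below. This is only an illustrative assumption of how such a step might look: the choice of cosine similarity, the value of `k`, the function names, and the toy embeddings are all hypothetical and are not taken from the GNN-ViTCap paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_graph(embeddings, k):
    """Return sorted undirected edges linking each node to its k most similar nodes."""
    n = len(embeddings)
    # Dense pairwise similarity matrix; fine for a small illustrative example.
    sim = [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
           for i in range(n)]
    edges = set()
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        nearest = sorted(candidates, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)


# Hypothetical 2-D patch embeddings forming two obvious groups.
patch_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_graph(patch_embeddings, k=1)
print(edges)
```

The resulting edge list is the kind of structure a graph neural network library would consume as its adjacency input; real patch embeddings would be high-dimensional and the neighbor search would normally be vectorized.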
[CV-13] Enhancing non-Rigid 3D Model Deformations Using Mesh-based Gaussian Splatting
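Quick read: This paper addresses the limited post-editing capabilities of traditional 3D Gaussian splatting and its limited support for large-scale, non-rigid deformations. The key to the solution is embedding Gaussian kernels directly onto explicit mesh surfaces, so that the mesh's inherent topological and geometric priors guide intuitive editing operations and enable complex deformations such as bending and stretching.
Link: https://arxiv.org/abs/2507.07000
Authors: Wijayathunga W.M.R.D.B
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: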
Abstract:We propose a novel framework that enhances non-rigid 3D model deformations by bridging mesh representations with 3D Gaussian splatting. While traditional Gaussian splatting delivers fast, real-time radiance-field rendering, its post-editing capabilities and support for large-scale, non-rigid deformations remain limited. Our method addresses these challenges by embedding Gaussian kernels directly onto explicit mesh surfaces. This allows the mesh’s inherent topological and geometric priors to guide intuitive editing operations – such as moving, scaling, and rotating individual 3D components – and enables complex deformations like bending and stretching. This work paves the way for more flexible 3D content-creation workflows in applications spanning virtual reality, character animation, and interactive design.
[CV-14] Cross-Modality Masked Learning for Survival Prediction in ICI Treated NSCLC Patients MICCAI2025
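Quick read: This paper addresses accurate prognosis for non-small cell lung cancer (NSCLC) patients undergoing immunotherapy, which matters for personalized treatment planning, informed patient decisions, and better treatment outcomes and quality of life. The main obstacles are the lack of large, relevant datasets and of effective multi-modal feature fusion strategies. The proposed solution is a large-scale dataset of 3D CT images, clinical records, and survival data, together with a novel multi-modal feature fusion framework. The framework uses cross-modality masked learning with two separate branches: a Slice-Depth Transformer that extracts 3D features from CT images, and a graph-based Transformer that learns node features and relationships among clinical variables in tabular data. A masked modality learning strategy guides the fusion process, improving the integration of modality-specific features and the interaction between modalities.
Link: https://arxiv.org/abs/2507.06994
Authors: Qilong Xing, Zikai Song, Bingxin Gong, Lian Yang, Junqing Yu, Wei Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI 2025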
Abstract:Accurate prognosis of non-small cell lung cancer (NSCLC) patients undergoing immunotherapy is essential for personalized treatment planning, enabling informed patient decisions, and improving both treatment outcomes and quality of life. However, the lack of large, relevant datasets and effective multi-modal feature fusion strategies pose significant challenges in this domain. To address these challenges, we present a large-scale dataset and introduce a novel framework for multi-modal feature fusion aimed at enhancing the accuracy of survival prediction. The dataset comprises 3D CT images and corresponding clinical records from NSCLC patients treated with immune checkpoint inhibitors (ICI), along with progression-free survival (PFS) and overall survival (OS) data. We further propose a cross-modality masked learning approach for medical feature fusion, consisting of two distinct branches, each tailored to its respective modality: a Slice-Depth Transformer for extracting 3D features from CT images and a graph-based Transformer for learning node features and relationships among clinical variables in tabular data. The fusion process is guided by a masked modality learning strategy, wherein the model utilizes the intact modality to reconstruct missing components. This mechanism improves the integration of modality-specific features, fostering more effective inter-modality relationships and feature interactions. Our approach demonstrates superior performance in multi-modal integration for NSCLC survival prediction, surpassing existing methods and setting a new benchmark for prognostic models in this context.
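[CV-15] The User-Centric Geo-Experience: An LLM-Powered Framework for Enhanced Planning Navigation and Dynamic Adaptation
Quick read: This paper addresses the shortcomings of traditional travel-planning systems in handling real-world complexity, specifically gaps in intelligent trip planning, precise "last-100-meter" navigation, and dynamic itinerary adaptation. The key to the solution is three cooperative agents: a Travel Planning Agent that handles complex multi-modal user queries through grid-based spatial grounding and map analysis; a Destination Assistant Agent that provides fine-grained guidance for the final leg of each journey; and a Local Discovery Agent that uses image embeddings and Retrieval-Augmented Generation (RAG) to detect and respond to itinerary disruptions. Together these improve query interpretation, navigation accuracy, and resilience to disruption.
Link: https://arxiv.org/abs/2507.06993
Authors: Jieren Deng, Aleksandar Cvetkovic, Pak Kiu Chung, Dragomir Yankov, Chiqun Zhang
Affiliations: Microsoft
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: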
Abstract:Traditional travel-planning systems are often static and fragmented, leaving them ill-equipped to handle real-world complexities such as evolving environmental conditions and unexpected itinerary disruptions. In this paper, we identify three gaps between existing service providers causing frustrating user experience: intelligent trip planning, precision “last-100-meter” navigation, and dynamic itinerary adaptation. We propose three cooperative agents: a Travel Planning Agent that employs grid-based spatial grounding and map analysis to help resolve complex multi-modal user queries; a Destination Assistant Agent that provides fine-grained guidance for the final navigation leg of each journey; and a Local Discovery Agent that leverages image embeddings and Retrieval-Augmented Generation (RAG) to detect and respond to trip plan disruptions. With evaluations and experiments, our system demonstrates substantial improvements in query interpretation, navigation accuracy, and disruption resilience, underscoring its promise for applications from urban exploration to emergency response.
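[CV-16] MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation MICCAI2025
Quick read: This paper addresses inaccurate diagnostic reports in radiology report generation (RRG), which arise from the difficulty of mapping pathological and anatomical features to their corresponding text descriptions and from semantically agnostic feature extraction. The key to the solution is a knowledge-driven framework, Medical Concept Aligned Radiology Report Generation (MCA-RG), which explicitly aligns visual features with distinct medical concepts to improve report generation. MCA-RG uses two curated concept banks: a pathology bank containing lesion-related knowledge and an anatomy bank containing anatomical descriptions. Tailored enhancement, anatomy-based contrastive learning, and a matching loss improve the generalization of anatomical features and the clinical relevance of pathological features, while a feature gating mechanism filters out low-quality concept features.
Link: https://arxiv.org/abs/2507.06992
Authors: Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI 2025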
Abstract:Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantic agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. Finally, the visual features are corresponding to individual medical concepts, and are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.
[CV-17] A Principled Framework for Multi-View Contrastive Learning
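Quick read: This paper addresses four key limitations of contrastive learning (CL) on multi-view data: conflicting objectives caused by multiple optimization terms per data point, failure to model all interactions across views and data points, fundamental limitations inherited from pairwise CL losses, and failure to fully realize the benefits of a larger number of views. The key to the solution is two new loss functions, MV-InfoNCE and MV-DHEL. The former incorporates all possible view interactions in a single term per data point; the latter decouples alignment from uniformity across views while scaling interaction complexity with the number of views. Both are shown theoretically to optimize for alignment of all views and for uniformity.
Link: https://arxiv.org/abs/2507.06979
Authors: Panagiotis Koromilas, Efthymios Georgiou, Giorgos Bouritsas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis
Affiliations: National and Kapodistrian University of Athens; Archimedes AI/Athena Research Center; ILSP/Athena Research Center; NCSR “Demokritos”; The Cyprus Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: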
Abstract:Contrastive Learning (CL), a leading paradigm in Self-Supervised Learning (SSL), typically relies on pairs of data views generated through augmentation. While multiple augmentations per instance (more than two) improve generalization in supervised learning, current CL methods handle additional views suboptimally by simply aggregating different pairwise objectives. This approach suffers from four critical limitations: (L1) it utilizes multiple optimization terms per data point resulting to conflicting objectives, (L2) it fails to model all interactions across views and data points, (L3) it inherits fundamental limitations (e.g. alignment-uniformity coupling) from pairwise CL losses, and (L4) it prevents fully realizing the benefits of increased view multiplicity observed in supervised settings. We address these limitations through two novel loss functions: MV-InfoNCE, which extends InfoNCE to incorporate all possible view interactions simultaneously in one term per data point, and MV-DHEL, which decouples alignment from uniformity across views while scaling interaction complexity with view multiplicity. Both approaches are theoretically grounded - we prove they asymptotically optimize for alignment of all views and uniformity, providing principled extensions to multi-view contrastive learning. Our empirical results on ImageNet1K and three other datasets demonstrate that our methods consistently outperform existing multi-view approaches and effectively scale with increasing view multiplicity. We also apply our objectives to multimodal data and show that, in contrast to other contrastive objectives, they can scale beyond just two modalities. Most significantly, ablation studies reveal that MV-DHEL with five or more views effectively mitigates dimensionality collapse by fully utilizing the embedding space, thereby delivering multi-view benefits observed in supervised learning.
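For orientation, the standard two-view InfoNCE loss that the abstract says MV-InfoNCE extends can be sketched as follows. This is the common pairwise formulation, not the paper's multi-view loss; the function name, the temperature value, and the toy embeddings are assumptions made for illustration.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def pairwise_info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss where view_b[i] is the positive for view_a[i].

    All other rows of view_b act as negatives for anchor i.
    """
    if len(view_a) != len(view_b) or not view_a:
        raise ValueError("views must be non-empty and of equal length")
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical embeddings: two data points, two augmented views each.
first_view = [[1.0, 0.0], [0.0, 1.0]]
aligned_second_view = [[1.0, 0.0], [0.0, 1.0]]
shuffled_second_view = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = pairwise_info_nce(first_view, aligned_second_view)
loss_shuffled = pairwise_info_nce(first_view, shuffled_second_view)
print(loss_aligned, loss_shuffled)
```

The loss is low when each anchor is closest to its own positive and high when positives are mismatched. With more than two views, the aggregation approach criticized in the abstract would sum such a term over every pair of views.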
[CV-18] DenoiseCP-Net: Efficient Collective Perception in Adverse Weather via Joint LiDAR-Based 3D Object Detection and Denoising
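Quick read: This paper addresses the degradation of LiDAR perception in adverse weather and the resulting increase in bandwidth requirements and latency. The key to the solution is DenoiseCP-Net, a multi-task architecture that integrates voxel-level noise filtering and object detection into a unified sparse convolution backbone, eliminating the redundant computation of two-stage pipelines and reducing inference latency, computational cost, and communication overhead.
Link: https://arxiv.org/abs/2507.06976
Authors: Sven Teufel, Dominique Mayer, Jörg Gamerdinger, Oliver Bringmann
Affiliations: University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: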
Abstract:While automated vehicles hold the potential to significantly reduce traffic accidents, their perception systems remain vulnerable to sensor degradation caused by adverse weather and environmental occlusions. Collective perception, which enables vehicles to share information, offers a promising approach to overcoming these limitations. However, to this date collective perception in adverse weather is mostly unstudied. Therefore, we conduct the first study of LiDAR-based collective perception under diverse weather conditions and present a novel multi-task architecture for LiDAR-based collective perception under adverse weather. Adverse weather conditions can not only degrade perception capabilities, but also negatively affect bandwidth requirements and latency due to the introduced noise that is also transmitted and processed. Denoising prior to communication can effectively mitigate these issues. Therefore, we propose DenoiseCP-Net, a novel multi-task architecture for LiDAR-based collective perception under adverse weather conditions. DenoiseCP-Net integrates voxel-level noise filtering and object detection into a unified sparse convolution backbone, eliminating redundant computations associated with two-stage pipelines. This design not only reduces inference latency and computational cost but also minimizes communication overhead by removing non-informative noise. We extended the well-known OPV2V dataset by simulating rain, snow, and fog using our realistic weather simulation models. We demonstrate that DenoiseCP-Net achieves near-perfect denoising accuracy in adverse weather, reduces the bandwidth requirements by up to 23.6% while maintaining the same detection accuracy and reducing the inference latency for cooperative vehicles.
[CV-19] Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM CVPR2025
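Quick read: This paper addresses the performance drop of vision-language models (VLMs) in practical applications caused by domain shifts and distributional changes. Traditional test-time adaptation (TTA) methods usually rely on costly training or assume access to historical data. The paper proposes FreeTTA, a training-free and universally available method. Its key idea is to model the test data distribution explicitly for the first time and to exploit intrinsic relationships among test samples to improve predictions for individual samples, without requiring simultaneous access to all samples.
Link: https://arxiv.org/abs/2507.06973
Authors: Qiyuan Dai, Sibei Yang
Affiliations: School of Information Science and Technology, ShanghaiTech University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025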
Abstract:Vision-Language Models (VLMs) have become prominent in open-world image recognition for their strong generalization abilities. Yet, their effectiveness in practical applications is compromised by domain shifts and distributional changes, especially when test data distributions diverge from training data. Therefore, the paradigm of test-time adaptation (TTA) has emerged, enabling the use of online off-the-shelf data at test time, supporting independent sample predictions, and eliminating reliance on test annotations. Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no assumptions, to enhance the flexibility of TTA. More importantly, FreeTTA is the first to explicitly model the test data distribution, enabling the use of intrinsic relationships among test samples to enhance predictions of individual samples without simultaneous access–a direction not previously explored. FreeTTA achieves these advantages by introducing an online EM algorithm that utilizes zero-shot predictions from VLMs as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Experiments demonstrate that FreeTTA achieves stable and significant improvements compared to state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution settings.
[CV-20] A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
Quick read: This paper addresses automatic classification of unsorted bulk insect samples in large-scale ecological surveys. Most image-based recognition methods rely on individual specimen data, whereas ecological surveys collect unsorted bulk samples. The key to the solution is the MassID45 dataset, which is the first to combine molecular and imaging data at both the unsorted sample level and the individual specimen level. Human annotators, supported by an AI-assisted tool, produced the segmentation masks and taxonomic labels, laying the groundwork for rapid, large-scale, image-based characterization of insect communities.
Link: https://arxiv.org/abs/2507.06972
Authors: Johanna Orsholm, John Quinto, Hannu Autto, Gaia Banelyte, Nicolas Chazot, Jeremy deWaard, Stephanie deWaard, Arielle Farrell, Brendan Furneaux, Bess Hardwick, Nao Ito, Amlan Kar, Oula Kalttopää, Deirdre Kerdraon, Erik Kristensen, Jaclyn McKeown, Tommi Mononen, Ellen Nein, Hanna Rogers, Tomas Roslin, Paula Schmitz, Jayme Sones, Maija Sujala, Amy Thompson, Evgeny V. Zakharov, Iuliia Zarubiieva, Akshita Gupta, Scott C. Lowe, Graham W. Taylor
Affiliations: Swedish University of Agricultural Sciences; University of Guelph; Vector Institute; University of Helsinki; National Museum of Natural History; Smithsonian Institution; University of Toronto; NVIDIA; Kilpisjärvi Biological Station; Faculty of Biological and Environmental Sciences; Department of Ecology; Centre for Biodiversity Genomics; Unit for Field-based Forest Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures, submitted to Scientific Data
Abstract:Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.
[CV-21] Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting
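Quick read: This paper addresses panoramic data generation for autonomous driving, where existing street-view generation models are limited to the fixed data distribution of existing datasets and cannot produce high-quality, controllable panoramas. The key to the solution is Percep360, which achieves coherent and controllable panoramic generation through two core mechanisms. The Local Scenes Diffusion Method (LSDM) mitigates the information loss caused by the pinhole sampling process by modeling panorama generation as a spatially continuous diffusion process. The Probabilistic Prompting Method (PPM) dynamically selects the most relevant control cues to enable controllable panoramic image generation.
Link: https://arxiv.org/abs/2507.06971
Authors: Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Kunyu Peng, Jiaming Zhang, Kailun Yang
Affiliations: Hunan University; Karlsruhe Institute of Technology; ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The source code will be publicly available at this https URL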
Abstract:Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird’s Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at this https URL.
[CV-22] Segmentation Regularized Training for Multi-Domain Deep Learning Registration applied to MR-Guided Prostate Cancer Radiotherapy
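Quick read: This paper addresses the need for accurate deformable image registration (DIR) for contour propagation and dose accumulation in MR-guided adaptive radiotherapy (MRgART). It proposes a deep learning method for domain-invariant MR-MR registration, progressively refined registration and segmentation (ProRSeg). The key is training with a weighted segmentation consistency loss so that the model generalizes across same-domain, cross-domain, and mixed-domain datasets, improving contour propagation accuracy for the clinical target volume (CTV), bladder, and rectum, and demonstrating feasibility for dose accumulation.
Link: https://arxiv.org/abs/2507.06966
Authors: Sudharsan Madhavan, Chengcheng Gui, Lando Bosma, Josiah Simeth, Jue Jiang, Nicolas Cote, Nima Hassan Rezaeian, Himanshu Nagar, Victoria Brennan, Neelam Tyagi, Harini Veeraraghavan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Preprint in preparation for submission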
Abstract:Background: Accurate deformable image registration (DIR) is required for contour propagation and dose accumulation in MR-guided adaptive radiotherapy (MRgART). This study trained and evaluated a deep learning DIR method for domain invariant MR-MR registration. Methods: A progressively refined registration and segmentation (ProRSeg) method was trained with 262 pairs of 3T MR simulation scans from prostate cancer patients using weighted segmentation consistency loss. ProRSeg was tested on same- (58 pairs), cross- (72 1.5T MR Linac pairs), and mixed-domain (42 MRSim-MRL pairs) datasets for contour propagation accuracy of clinical target volume (CTV), bladder, and rectum. Dose accumulation was performed for 42 patients undergoing 5-fraction MRgART. Results: ProRSeg demonstrated generalization for bladder with similar Dice Similarity Coefficients across domains (0.88, 0.87, 0.86). For rectum and CTV, performance was domain-dependent with higher accuracy on cross-domain MRL dataset (DSCs 0.89) versus same-domain data. The model’s strong cross-domain performance prompted us to study the feasibility of using it for dose accumulation. Dose accumulation showed 83.3% of patients met CTV coverage (D95 = 40.0 Gy) and bladder sparing (D50 = 20.0 Gy) constraints. All patients achieved minimum mean target dose (40.4 Gy), but only 9.5% remained under upper limit (42.0 Gy). Conclusions: ProRSeg showed reasonable multi-domain MR-MR registration performance for prostate cancer patients with preliminary feasibility for evaluating treatment compliance to clinical constraints.
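The Dice Similarity Coefficient (DSC) reported in the abstract can be sketched with its standard definition, DSC = 2|A ∩ B| / (|A| + |B|). The function name, the flat binary-mask representation, the convention for two empty masks, and the toy masks below are assumptions for illustration and do not come from the paper.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient between two flat binary masks (sequences of 0/1)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 1-D "contours": a propagated mask versus a reference mask.
propagated_mask = [1, 1, 1, 0, 0]
reference_mask = [0, 1, 1, 1, 0]

dsc = dice_coefficient(propagated_mask, reference_mask)
print(f"DSC = {dsc:.3f}")
```

A DSC of 1.0 means the propagated and reference contours overlap perfectly, and 0.0 means no overlap; real 3D masks would simply be flattened before being passed in.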
[CV-23] CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale
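Quick read: This paper addresses the loss of reliability that hallucinations cause when vision-language models (VLMs) are used in medical applications. The key to the solution is CheXPO, a strategy that combines confidence-similarity joint mining with counterfactual rationale generation. It synthesizes a fine-grained, multi-task chest X-ray visual instruction dataset for supervised fine-tuning, identifies hard examples through token-level confidence analysis, and expands them via similarity-based retrieval to balance the distribution of preference samples. Synthetic counterfactual rationales supply fine-grained clinical preferences, reducing the reliance on additional expert annotation.
Link: https://arxiv.org/abs/2507.06959
Authors: Xiao Liang, Jiawei Hu, Di Wang, Zhi Ma, Lin Zhao, Ronghan Li, Bo Wan, Quan Wang
Affiliations: Xidian University; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: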
Abstract:Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks and providing a scalable, interpretable solution for real-world radiology applications.
[CV-24] Pre-Columbian Settlements Shaped Palm Clusters in the Sierra Nevada de Santa Marta Colombia
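Quick read: This paper addresses the problem of identifying the long-term effects of ancient human management on Neotropical forests, particularly at high-resolution scales. The key to the solution is a new approach to studying archaeological areas of influence based on vegetation signatures: a deep learning model identifies palm trees, and a clustering algorithm then identifies palm clusters, which are used to estimate the areas managed by ancient populations.
Link: https://arxiv.org/abs/2507.06949
Authors: Sebastian Fajardo, Sina Mohammadi, Jonas Gregorio de Souza, César Ardila, Alan Tapscott Baltar, Shaddai Heidgen, Maria Isabel Mayorga Hernández, Sylvia Mota de Oliveira, Fernando Montejo, Marco Moderato, Vinicius Peripato, Katy Puche, Carlos Reina, Juan Carlos Vargas, Frank W. Takes, Marco Madella
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: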
Abstract:Ancient populations markedly transformed Neotropical forests, yet understanding the long-term effects of ancient human management, particularly at high-resolution scales, remains challenging. In this work we propose a new approach to investigate archaeological areas of influence based on vegetation signatures. It consists of a deep learning model trained on satellite imagery to identify palm trees, followed by a clustering algorithm to identify palm clusters, which are then used to estimate ancient management areas. To assess the palm distribution in relation to past human activity, we applied the proposed approach to unique high-resolution satellite imagery data covering 765 km² of the Sierra Nevada de Santa Marta, Colombia. With this work, we also release a manually annotated palm tree dataset along with estimated locations of archaeological sites from ground surveys and legacy records. Results demonstrate how palms were significantly more abundant near archaeological sites showing large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude bigger than indicated by archaeological evidence alone. Our findings suggest that pre-Columbian populations influenced local vegetation fostering conditions conducive to palm proliferation, leaving a lasting ecological footprint. This may have lowered the logistical costs of establishing infrastructure-heavy settlements in otherwise less accessible locations. Overall, this study demonstrates the potential of integrating artificial intelligence approaches with new ecological and archaeological data to identify archaeological areas of interest through vegetation patterns, revealing fine-scale human-environment interactions.
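The abstract describes a two-stage pipeline (detect palms, then cluster the detections) but does not name the clustering algorithm. As a minimal sketch of the second stage only, the block below groups detected palm coordinates with a simple distance-threshold (single-linkage) clustering. The function name `cluster_points`, the `max_gap` parameter, and the coordinates are all illustrative assumptions, not the authors' method or data.

```python
from math import hypot


def cluster_points(points, max_gap):
    """Single-linkage clustering: points connected by a chain of
    neighbours no farther apart than max_gap end up in one cluster."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within max_gap of each other.
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if hypot(xi - xj, yi - yj) <= max_gap:
                parent[find(i)] = find(j)

    # Collect points by cluster root, largest cluster first.
    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical palm detections in a local metric grid (not data from the paper).
palms = [(0, 0), (5, 0), (9, 3), (200, 200), (204, 203), (500, 0)]
clusters = cluster_points(palms, max_gap=10)
print([len(c) for c in clusters])
```

The pairwise loop is quadratic in the number of detections, so a real survey-scale run would likely need a spatial index or a library implementation of a density-based method.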
[CV-25] MCCD: A Multi-Attribute Chinese Calligraphy Character Dataset Annotated with Script Styles Dynasties and Calligraphers ICDAR2025
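[Quick read]: This paper addresses the difficulty of recognizing attribute information for Chinese calligraphy characters: character styles vary widely because of their evolution across dynasties and calligraphers, and existing calligraphy datasets are scarce and lack multi-attribute annotations. The key to the solution is building the Multi-Attribute Chinese Calligraphy Character Dataset (MCCD), which covers 7,765 categories with 329,715 isolated image samples and provides three subsets based on the script style, dynasty, and calligrapher attributes, giving a solid data foundation for calligraphic character recognition, writer identification, and studies of the evolution of Chinese characters.
Link: https://arxiv.org/abs/2507.06948
Authors: Yixin Zhao,Yuyi Zhang,Lianwen Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, 9 tables, accepted by the 19th International Conference on Document Analysis and Recognition (ICDAR 2025)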
Abstract:Research on the attribute information of calligraphy, such as styles, dynasties, and calligraphers, holds significant cultural and historical value. However, the styles of Chinese calligraphy characters have evolved dramatically through different dynasties and the unique touches of calligraphers, making it highly challenging to accurately recognize these different characters and their attributes. Furthermore, existing calligraphic datasets are extremely scarce, and most provide only character-level annotations without additional attribute information. This limitation has significantly hindered the in-depth study of Chinese calligraphy. To fill this gap, we present a novel Multi-Attribute Chinese Calligraphy Character Dataset (MCCD). The dataset encompasses 7,765 categories with a total of 329,715 isolated image samples of Chinese calligraphy characters, and three additional subsets were extracted based on the attribute labels for script styles (10 types), dynasties (15 periods), and calligraphers (142 individuals). The rich multi-attribute annotations render MCCD well-suited to diverse research tasks, including calligraphic character recognition, writer identification, and evolutionary studies of Chinese characters. We establish benchmark performance through single-task and multi-task recognition experiments across MCCD and all of its subsets. The experimental results demonstrate that the complexity of the stroke structure of the calligraphic characters, and the interplay between their different attributes, lead to a substantial increase in the difficulty of accurate recognition. MCCD not only fills a void in the availability of detailed calligraphy datasets but also provides valuable resources for advancing research in Chinese calligraphy and fostering advancements in multiple fields. The dataset is available at this https URL.
[CV-26] Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement CVPR2025
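[Quick read]: This paper targets Generalized Category Discovery (GCD): recognizing unlabeled images from known and novel classes without additional annotations while transferring knowledge across classes. Existing methods rely on the global representation of self-supervised vision transformers such as DINO, but focusing solely on the global representation creates an inherent trade-off between discriminability and generalization. The key to the solution is an adaptive part discovery and learning method (APL) that uses shared learnable part queries and DINO part priors to generate consistent object parts and their correspondences across similar images, with no additional annotations. It also introduces an all-min contrastive loss to learn part representations that are both discriminative and generalizable, adaptively emphasizing discriminative object parts to improve discriminability while sharing the other parts to promote knowledge transfer.
Link: https://arxiv.org/abs/2507.06928
Authors: Qiyuan Dai,Hanzhuo Huang,Yu Wu,Sibei Yang
Affiliations: School of Information Science and Technology, ShanghaiTech University; Wuhan University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025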
Abstract:Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.
[CV-27] SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds
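[Quick read]: This paper addresses panoptic segmentation in sparse radar point clouds to improve scene understanding for autonomous vehicles. The key to the solution is SemRaFiner, a method that handles varying density in sparse radar point clouds and optimizes feature extraction to improve segmentation accuracy; it also introduces an optimized training procedure that refines instance assignments through a dedicated data augmentation.
Link: https://arxiv.org/abs/2507.06906
Authors: Matthias Zeller,Daniel Casado Herraez,Bengisu Ayan,Jens Behley,Michael Heidingsfeld,Cyrill Stachniss
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)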
Abstract:Semantic scene understanding, including the perception and classification of moving agents, is essential to enabling safe and robust driving behaviours of autonomous vehicles. Cameras and LiDARs are commonly used for semantic scene understanding. However, both sensor modalities face limitations in adverse weather and usually do not provide motion information. Radar sensors overcome these limitations and directly offer information about moving agents by measuring the Doppler velocity, but the measurements are comparably sparse and noisy. In this paper, we address the problem of panoptic segmentation in sparse radar point clouds to enhance scene understanding. Our approach, called SemRaFiner, accounts for changing density in sparse radar point clouds and optimizes the feature extraction to improve accuracy. Furthermore, we propose an optimized training procedure to refine instance assignments by incorporating a dedicated data augmentation. Our experiments suggest that our approach outperforms state-of-the-art methods for radar-based panoptic segmentation.
[CV-28] Longitudinal Study of Facial Biometrics at the BEZ: Temporal Variance Analysis
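[Quick read]: This paper addresses how the performance of biometric recognition systems changes over long-term use, in particular the stability of an individual's biometric characteristics across points in time. The key to the solution is a long-term biometric evaluation: using the local, General Data Protection Regulation-compliant database of the Biometric Evaluation Center (BEZ), more than 400 participants with diverse demographic characteristics were tested regularly, and state-of-the-art face recognition algorithms were used to analyze long-term comparison scores, revealing marked fluctuations in individual biometric characteristics between days.
Link: https://arxiv.org/abs/2507.06858
Authors: Mathias Schulz,Alexander Spenke,Pia Funk,Florian Blümel,Markus Rohde,Ralph Breithaupt,Gerd Nolden,Norbert Jung,Robert Lange
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures, 8 tables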
Abstract:This study presents findings from long-term biometric evaluations conducted at the Biometric Evaluation Center (bez). Over the course of two and a half years, more than 400 participants representing diverse ethnicities, genders, and age groups were regularly assessed as part of our ongoing research, using a variety of biometric tools and techniques at the controlled testing facilities. Our findings are based on the General Data Protection Regulation-compliant local bez database with more than 238,000 biometric data sets categorized into multiple biometric modalities such as face and finger. We used state-of-the-art face recognition algorithms to analyze long-term comparison scores. Our results show that these scores fluctuate more significantly between individual days than over the entire measurement period. These findings highlight the importance of testing biometric characteristics of the same individuals over a longer period of time in a controlled measurement environment and lay the groundwork for future advancements in biometric data analysis.
[CV-29] IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization ICCV2025
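[Quick read]: This paper addresses the problem that adversarial patches either attack poorly in targeted settings or lack contextual coherence, which makes them easy for human observers to spot and unable to evade automatic patch defenses. The key to the solution is the IAP framework, which generates highly invisible adversarial patches through perceptibility-aware localization and perturbation optimization: it uses class-wise localization and sensitivity maps to find a suitable patch location, then optimizes the invisible perturbation with a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy.
Link: https://arxiv.org/abs/2507.06856
Authors: Subrat Kishore Dutta,Xiao Zhang
Affiliations: CISPA Helmholtz Center for Information Security
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published in ICCV 2025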
Abstract:Despite modifying only a small localized input region, adversarial patches can drastically change the prediction of computer vision models. However, prior methods either cannot perform satisfactorily under targeted attack scenarios or fail to produce contextually coherent adversarial patches, causing them to be easily noticeable by human examiners and insufficiently stealthy against automatic patch defenses. In this paper, we introduce IAP, a novel attack framework that generates highly invisible adversarial patches based on perceptibility-aware localization and perturbation optimization schemes. Specifically, IAP first searches for a proper location to place the patch by leveraging classwise localization and sensitivity maps, balancing the susceptibility of patch location to both victim model prediction and human visual system, then employs a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy for optimizing invisible perturbations. Comprehensive experiments across various image benchmarks and model architectures demonstrate that IAP consistently achieves competitive attack success rates in targeted settings with significantly improved patch invisibility compared to existing baselines. In addition to being highly imperceptible to humans, IAP is shown to be stealthy enough to render several state-of-the-art patch defenses ineffective.
[CV-30] Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
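[Quick read]: This paper addresses Weakly Supervised Semantic Segmentation (WSSS), aiming to reduce dependence on fine-grained labeled data. The key to the solution is an end-to-end method that performs segmentation directly from the attention maps learned by a Vision Transformer (ViT). Specifically, a sparse ViT is trained with multiple [CLS] tokens (one per class), using a random masking strategy to promote the correspondence between [CLS] tokens and classes; at inference time, the self-attention maps of each [CLS] token are aggregated to generate pseudo segmentation masks, improving both segmentation accuracy and interpretability.
Link: https://arxiv.org/abs/2507.06848
Authors: Joelle Hanna,Damian Borth
Affiliations: University of St.Gallen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: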
Abstract:Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token - class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. Those pseudo-masks can be used to train a segmentation model which achieves results comparable to fully-supervised models, significantly reducing the need for fine-grained labeled data.
[CV-31] Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation
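[Quick read]: This paper addresses the weak physical alignment of existing diffusion-based and autoregressive video generation models, which struggle to reproduce the real-world dynamics of object motion. The key solution is a new framework that combines symbolic regression (SR) with trajectory-guided image-to-video (I2V) models: it extracts motion trajectories from the input video, strengthens symbolic regression with a retrieval-based pre-training mechanism, and discovers equations of motion that forecast physically accurate future trajectories, which then guide video generation.
Link: https://arxiv.org/abs/2507.06830
Authors: Tao Feng,Xianbing Zhao,Zhenhua Chen,Tien Tsin Wong,Hamid Rezatofighi,Gholamreza Haffari,Lizhen Qu
Affiliations: Monash University; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: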
Abstract:Recent advances in diffusion-based and autoregressive video generation models have achieved remarkable visual realism. However, these models typically lack accurate physical alignment, failing to replicate real-world dynamics in object motion. This limitation arises primarily from their reliance on learned statistical correlations rather than capturing mechanisms adhering to physical laws. To address this issue, we introduce a novel framework that integrates symbolic regression (SR) and trajectory-guided image-to-video (I2V) models for physics-grounded video forecasting. Our approach extracts motion trajectories from input videos, uses a retrieval-based pre-training mechanism to enhance symbolic regression, and discovers equations of motion to forecast physically accurate future trajectories. These trajectories then guide video generation without requiring fine-tuning of existing models. Evaluated on scenarios in Classical Mechanics, including spring-mass, pendulums, and projectile motions, our method successfully recovers ground-truth analytical equations and improves the physical alignment of generated videos over baseline methods.
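To make the idea of recovering an equation of motion from an observed trajectory concrete, here is a deliberately simplified sketch. It is not symbolic regression: it assumes the functional form y(t) = a·t² + b·t + c in advance and only fits the coefficients by least squares, whereas the paper's method searches for the equation itself. The function `fit_quadratic` and the synthetic projectile track are illustrative assumptions.

```python
def fit_quadratic(ts, ys):
    """Least-squares fit of y = a*t**2 + b*t + c via the normal equations."""
    # Power sums of t give the 3x3 normal-equation matrix for [a, b, c].
    s = [sum(t ** k for t in ts) for k in range(5)]
    matrix = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    rhs = [sum(y * t ** k for t, y in zip(ts, ys)) for k in (2, 1, 0)]

    # Augment and run Gaussian elimination with partial pivoting.
    m = [row + [r] for row, r in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]

    # Back-substitution recovers a, b, c in that order.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = m[r][n] - sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = acc / m[r][r]
    return coeffs


# Synthetic vertical projectile track: y = 20 t - (9.81 / 2) t^2, sampled for 0.9 s.
ts = [0.1 * i for i in range(10)]
ys = [20 * t - 4.905 * t * t for t in ts]

a, b, c = fit_quadratic(ts, ys)
gravity_estimate = -2 * a                      # a = -g / 2 for this motion
forecast = a * 1.5 ** 2 + b * 1.5 + c          # extrapolate to t = 1.5 s
print(round(gravity_estimate, 3), round(forecast, 3))
```

Once the coefficients are known, the fitted equation can be evaluated at future times, which is the role the forecast trajectory plays when it guides a trajectory-conditioned video model.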
[CV-32] HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement
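[Quick read]: This paper addresses two problems in low-light image enhancement (LLIE): the color bias and brightness artifacts caused by the high color sensitivity of the standard RGB (sRGB) color space, and the pronounced red and black noise artifacts introduced by the HSV color space. The key to the solution is a new color space, Horizontal/Vertical-Intensity (HVI), whose HV color map reduces red noise artifacts and whose learnable intensity compresses low-light regions to remove black noise artifacts. On top of the HVI color space it builds the Color and Intensity Decoupling Network+ (HVI-CIDNet+), which draws on contextual and degradation knowledge extracted by pre-trained vision-language models, uses a Prior-guided Attention Block (PAB) for content restoration and color correction, and applies a Region Refinement Block for precise brightness adjustment in information-rich and information-scarce regions.
Link: https://arxiv.org/abs/2507.06814
Authors: Qingsen Yan,Kangbiao Shi,Yixu Feng,Tao Hu,Peng Wu,Guansong Pang,Yanning Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages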
Abstract:Low-Light Image Enhancement (LLIE) aims to restore vivid content and details from corrupted low-light images. However, existing standard RGB (sRGB) color space-based LLIE methods often produce color bias and brightness artifacts due to the inherent high color sensitivity. While Hue, Saturation, and Value (HSV) color space can decouple brightness and color, it introduces significant red and black noise artifacts. To address this problem, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by the HV color map and learnable intensity. The HV color map enforces small distances for the red coordinates to remove red noise artifacts, while the learnable intensity compresses the low-light regions to remove black noise artifacts. Additionally, we introduce the Color and Intensity Decoupling Network+ (HVI-CIDNet+), built upon the HVI color space, to restore damaged content and mitigate color distortion in extremely dark regions. Specifically, HVI-CIDNet+ leverages abundant contextual and degraded knowledge extracted from low-light images using pre-trained vision-language models, integrated via a novel Prior-guided Attention Block (PAB). Within the PAB, latent semantic priors can promote content restoration, while degraded representations guide precise color correction, both particularly in extremely dark regions through the meticulously designed cross-attention fusion mechanism. Furthermore, we construct a Region Refinement Block that employs convolution for information-rich regions and self-attention for information-scarce regions, ensuring accurate brightness adjustments. Comprehensive results from benchmark experiments demonstrate that the proposed HVI-CIDNet+ outperforms the state-of-the-art methods on 10 datasets.
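The abstract says the HV color map enforces small distances for red coordinates and that the intensity term compresses low-light regions, without giving formulas. The sketch below illustrates why such a mapping helps, under stated assumptions: it treats hue as an angle on a circle and uses saturation times value as the radius. That radius is a plain stand-in for the paper's learnable intensity, and `hue_plane` is a hypothetical helper, not the published HVI definition.

```python
import colorsys
from math import cos, sin, pi, hypot


def hue_plane(rgb):
    """Map an RGB triple in [0, 1] to a 2D point: hue as angle, s*v as radius."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    radius = s * v  # assumption: fixed stand-in for a learnable intensity term
    return (radius * cos(2 * pi * h), radius * sin(2 * pi * h))


# Two nearly identical reds that fall on opposite ends of the HSV hue axis.
red_a = (1.0, 0.0, 0.02)  # leans toward magenta, hue just below 1
red_b = (1.0, 0.02, 0.0)  # leans toward orange, hue just above 0

hue_gap = abs(colorsys.rgb_to_hsv(*red_a)[0] - colorsys.rgb_to_hsv(*red_b)[0])

pa, pb = hue_plane(red_a), hue_plane(red_b)
plane_gap = hypot(pa[0] - pb[0], pa[1] - pb[1])

# A very dark red collapses toward the origin because its radius is tiny.
dark_radius = hypot(*hue_plane((0.01, 0.0, 0.0)))

print(hue_gap > 0.9, plane_gap < 0.05, dark_radius < 0.02)
```

In raw HSV the two reds sit almost a full hue unit apart, so small sensor noise near red can look like a large color change; on the circle they are neighbours. Scaling the radius by brightness similarly pulls noisy dark pixels together instead of spreading them across arbitrary hues.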
[CV-33] Democratizing High-Fidelity Co-Speech Gesture Video Generation ICCV2025
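[Quick read]: This paper addresses co-speech gesture video generation: synthesizing realistic videos that are strictly aligned with the audio and include synchronized facial expressions and body gestures. The difficulty lies in the pronounced one-to-many mapping between audio and visual content, compounded by the scarcity of large-scale public datasets and high computational demands. The key to the solution is a lightweight framework that uses 2D full-body skeletons as an efficient auxiliary condition linking audio signals to visual outputs: a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image predicts skeletal motion through skeleton-audio feature fusion, ensuring strict audio coordination and body shape consistency.
Link: https://arxiv.org/abs/2507.06812
Authors: Xu Yang,Shaoli Huang,Shenbo Xie,Xuelin Chen,Yifei Liu,Changxing Ding
Affiliations: South China University of Technology; Tencent AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025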
Abstract:Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker’s reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker’s reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.
[CV-34] GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction
[Quick read]: This paper addresses the challenge of predicting plant traits (such as leaf carbon content and leaf mass) at large ecosystem scales, especially under label scarcity and domain shift (for example, sensor differences and changes in ecological distribution). The key to the solution is GreenHyperSpectra, a pretraining dataset of real-world cross-sensor and cross-ecosystem samples used to evaluate semi-supervised and self-supervised methods for plant trait prediction; label-efficient multi-output regression models pretrained on it outperform the existing supervised baseline.
Link: https://arxiv.org/abs/2507.06806
Authors: Eya Cherif(1, 2 and 3),Arthur Ouaknine(3 and 4),Luke A. Brown(5),Phuong D. Dao(6, 7 and 8),Kyle R. Kovach(9),Bing Lu(10),Daniel Mederer(1),Hannes Feilhauer(1, 2, 12 and 13),Teja Kattenborn(11 and 12),David Rolnick(3 and 4) ((1) Institute for Earth System Science and Remote Sensing, Leipzig University, Germany, (2) Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Leipzig University, Germany, (3) Mila Quebec AI Institute, Canada, (4) McGill University, Canada, (5) School of Science, Engineering and Environment, University of Salford, UK, (6) Department of Agricultural Biology, Colorado State University, USA, (7) Graduate Degree Program in Ecology, Colorado State University, USA, (8) School of Global Environmental Sustainability, Colorado State University, USA, (9) Department of Forest and Wildlife Ecology, University of Wisconsin, USA, (10) Department of Geography, Simon Fraser University, Canada, (11) Chair of Sensor-based Geoinformatics (geosense), University of Freiburg, Germany, (12) German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany, (13) Helmholtz-Centre for Environmental Research (UFZ), Leipzig, Germany)
Affiliations: Leipzig University, Germany; Mila – Québec AI Institute, Canada; McGill University, Canada; University of Salford, UK; Colorado State University, USA; University of Wisconsin, USA; Simon Fraser University, Canada; Institute for Earth System Science and Remote Sensing, Leipzig University, Germany; German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany; Helmholtz-Centre for Environmental Research (UFZ), Leipzig, Germany; Chair of Sensor-based Geoinformatics (geosense), University of Freiburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: this https URL.
[CV-35] Unlocking Thermal Aerial Imaging: Synthetic Enhancement of UAV Datasets
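[Quick read]: This paper addresses the scarcity of data for applying deep learning models to unmanned aerial vehicle (UAV) thermal imaging, which stems mainly from the high cost and logistical challenges of collecting thermal data. The key to the solution is a novel procedural pipeline for generating synthetic thermal images from an aerial perspective: it integrates arbitrary object classes into existing thermal backgrounds and controls the position, scale, and orientation of the new objects so that they match the viewpoint of the background.
Link: https://arxiv.org/abs/2507.06797
Authors: Antonella Barisic Kulas,Andreja Jurasovic,Stjepan Bogdan
Affiliations: University of Zagreb Faculty of Electrical Engineering and Computing, LARICS Laboratory for Robotics and Intelligent Control Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Accepted at ECMR 2025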
Abstract:Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. In evaluating these datasets on the object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: this https URL.
[CV-36] FOLC-Net: A Federated-Optimized Lightweight Architecture for Enhanced MRI Disease Diagnosis across Axial Coronal and Sagittal Views
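[Quick read]: This paper addresses the performance degradation of existing state-of-the-art (SOTA) models when handling the different anatomical views of MRI images: axial, coronal, and sagittal. The key solution is the FOLC-Net framework, which combines a federated-optimized lightweight architecture, Manta-ray foraging optimization (MRFO) for efficient model structure generation, global model cloning for scalable training, and ConvNeXt for better client adaptability, yielding better performance and robustness in both multi-view and single-view medical image analysis.
Link: https://arxiv.org/abs/2507.06763
Authors: Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
Affiliations: DFKI (German Research Center for Artificial Intelligence)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: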
Abstract:The framework is designed to improve performance in the analysis of combined as well as single anatomical perspectives for MRI disease diagnosis. It specifically addresses the performance degradation observed in state-of-the-art (SOTA) models, particularly when processing axial, coronal, and sagittal anatomical planes. The paper introduces the FOLC-Net framework, which incorporates a novel federated-optimized lightweight architecture with approximately 1.217 million parameters and a storage requirement of only 0.9 MB. FOLC-Net integrates Manta-ray foraging optimization (MRFO) mechanisms for efficient model structure generation, global model cloning for scalable training, and ConvNeXt for enhanced client adaptability. The model was evaluated on combined multi-view data as well as individual views, such as axial, coronal, and sagittal, to assess its robustness in various medical imaging scenarios. Moreover, FOLC-Net tests a ShallowFed model on different data to evaluate its ability to generalize beyond the training dataset. The results show that FOLC-Net outperforms existing models, particularly in the challenging sagittal view. For instance, FOLC-Net achieved an accuracy of 92.44% on the sagittal view, significantly higher than the 88.37% accuracy of study method (DL + Residual Learning) and 88.95% of DL models. Additionally, FOLC-Net demonstrated improved accuracy across all individual views, providing a more reliable and robust solution for medical image analysis in decentralized environments. FOLC-Net addresses the limitations of existing SOTA models by providing a framework that ensures better adaptability to individual views while maintaining strong performance in multi-view settings. The incorporation of MRFO, global model cloning, and ConvNeXt ensures that FOLC-Net performs better in real-world medical applications.
[CV-37] Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
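[Quick read]: This paper addresses the lack of effective optical character recognition (OCR) systems for Manchu, an endangered language, on real-world historical documents. The key to the solution is parameter-efficient fine-tuning of three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on a dataset of 60,000 synthetic Manchu word images. LLaMA-3.2-11B reached 98.3% word accuracy and a character error rate of 0.0024 on synthetic data and kept 93.1% accuracy on real handwritten documents, showing effective domain transfer from synthetic data to real-world settings.
Link: https://arxiv.org/abs/2507.06761
Authors: Yan Hon Michael Chung,Donghyeok Choi
Affiliations: The Hong Kong University of Science and Technology; Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: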
Abstract:Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at this https URL.
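The abstract reports word accuracy and character error rate (CER) without defining them. The sketch below uses the definitions that are common in OCR evaluation, which may differ in detail from the authors' evaluation script: word accuracy is the share of words predicted exactly, and CER is the total character edit distance divided by the total number of reference characters. The romanized strings are placeholder examples, not data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Total edit distance divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def word_accuracy(refs, hyps):
    """Fraction of words recognized with no character errors."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# Placeholder romanized words: ground truth versus a hypothetical OCR output.
refs = ["manju", "gisun", "bithe"]
hyps = ["manju", "gisan", "bithe"]

cer = character_error_rate(refs, hyps)
acc = word_accuracy(refs, hyps)
print(round(cer, 4), round(acc, 4))
```

The two metrics can diverge sharply: a single wrong character costs a whole word in word accuracy but only one edit in CER, which is why a model can show a very low CER alongside a visibly lower word accuracy.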
[CV-38] LOVON: Legged Open-Vocabulary Object Navigator
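[Quick read]: This paper addresses the challenges robots face when executing long-horizon tasks in open-world environments, in particular how to integrate open-vocabulary object detection with high-level task planning for complex, long-range navigation. The key to the solution is the LOVON framework, which uses large language models (LLMs) for hierarchical task planning and combines them with open-vocabulary visual detection models for efficient object navigation in dynamic, unstructured environments. For real-world problems such as visual jitter, blind zones, and temporary target loss, LOVON adds dedicated solutions such as Laplacian Variance Filtering, along with a functional execution logic that supports autonomous navigation, task adaptation, and robust task completion.
Link: https://arxiv.org/abs/2507.06747
Authors: Daojie Peng,Jiahang Cao,Qiang Zhang,Jun Ma
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Beijing Innovation Center of Humanoid Robotics; The Hong Kong University of Science and Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 10 figures; Project Page: this https URL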
Abstract:Object navigation in open-world environments remains a formidable and pervasive challenge for robotic systems, particularly when it comes to executing long-horizon tasks that require both open-world object detection and high-level task planning. Traditional methods often struggle to integrate these components effectively, and this limits their capability to deal with complex, long-range navigation missions. In this paper, we propose LOVON, a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. We also develop a functional execution logic for the robot that guarantees LOVON’s capabilities in autonomous navigation, task adaptation, and robust task completion. Extensive evaluations demonstrate the successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. Furthermore, real-world experiments across different legged robots (Unitree Go2, B2, and H1-2) showcase the compatibility and appealing plug-and-play feature of LOVON.
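The abstract names Laplacian Variance Filtering for visual stabilization but does not spell out the procedure. A widely used reading of the term is to score each frame by the variance of its Laplacian response and discard low-scoring (blurry) frames before detection. The sketch below follows that reading as an assumption; the function names, the 4-neighbour kernel, and the threshold are illustrative and not taken from LOVON.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels
    of a grayscale image given as a list of equal-length rows."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def keep_sharp_frames(frames, threshold):
    """Drop frames whose sharpness score falls below the threshold."""
    return [f for f in frames if laplacian_variance(f) >= threshold]


# Toy frames: a high-contrast checkerboard versus a flat (fully blurred) image.
sharp = [[0, 255, 0, 255],
         [255, 0, 255, 0],
         [0, 255, 0, 255],
         [255, 0, 255, 0]]
blurry = [[128] * 4 for _ in range(4)]

print(laplacian_variance(sharp), laplacian_variance(blurry))
kept = keep_sharp_frames([sharp, blurry], threshold=100.0)
```

Motion blur from a legged robot's gait smooths edges, which lowers the Laplacian response and therefore its variance; the threshold would need tuning per camera and resolution.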
[CV-39] Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching
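[Quick read]: This paper addresses weakly supervised text-to-person image matching, where existing methods predict complex one-to-many identity relationships poorly, which severely limits performance gains. The key to the solution is a local-and-global dual-granularity identity association mechanism: at the local level, it explicitly establishes cross-modal identity relationships within a batch, reinforcing identity constraints across modalities; at the global level, it builds a dynamic cross-modal identity association network anchored on the visual modality and introduces a confidence-based dynamic adjustment mechanism, improving the model's ability to identify weakly associated samples and its overall sensitivity.
Link: https://arxiv.org/abs/2507.06744
Authors: Yafei Zhang,Yongle Shang,Huafeng Li
Affiliations: Kunming University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: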
Abstract:Weakly supervised text-to-person image matching, as a crucial approach to reducing models’ reliance on large-scale manually labeled samples, holds significant research value. However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. Specifically, at the local level, we explicitly establish cross-modal identity relationships within a batch, reinforcing identity constraints across different modalities and enabling the model to better capture subtle differences and correlations. At the global level, we construct a dynamic cross-modal identity association network with the visual modality as the anchor and introduce a confidence-based dynamic adjustment mechanism, effectively enhancing the model’s ability to identify weakly associated samples while improving overall sensitivity. Additionally, we propose an information-asymmetric sample pair construction method combined with consistency learning to tackle hard sample mining and enhance model robustness. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching.
[CV-40] PromptTea: Let Prompts Tell TeaCache the Optimal Threshold
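[Quick Read]: This paper aims to address slow inference in video generation, in particular the limitations of existing acceleration strategies, which degrade quality in complex scenes and rely on manual tuning of reuse thresholds that is inefficient and lacks robustness. The key to its solution is the Prompt-Complexity-Aware (PCA) caching mechanism, which estimates scene complexity directly from the input prompt and automatically adjusts the reuse threshold, enabling more adaptive and precise caching decisions. The paper also revisits the assumptions behind TeaCache, improving the model's predictive accuracy by decoupling the noisy input, strengthening the contribution of meaningful textual information, and introducing multivariate polynomial feature expansion. In addition, it replaces the static CFGCache with DynCFGCache, which dynamically and selectively reuses classifier-free guidance (CFG) outputs, further reducing computational cost while preserving output quality.
Link: https://arxiv.org/abs/2507.06739
Authors: Zishen Huang, Chunyu Yang, Mengyuan Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: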
Abstract:Despite recent progress in video generation, inference speed remains a major bottleneck. A common acceleration strategy involves reusing model outputs via caching mechanisms at fixed intervals. However, we find that such fixed-frequency reuse significantly degrades quality in complex scenes, while manually tuning reuse thresholds is inefficient and lacks robustness. To address this, we propose Prompt-Complexity-Aware (PCA) caching, a method that automatically adjusts reuse thresholds based on scene complexity estimated directly from the input prompt. By incorporating prompt-derived semantic cues, PCA enables more adaptive and informed reuse decisions than conventional caching methods. We also revisit the assumptions behind TeaCache and identify a key limitation: it suffers from poor input-output relationship modeling due to an oversimplified prior. To overcome this, we decouple the noisy input, enhance the contribution of meaningful textual information, and improve the model’s predictive accuracy through multivariate polynomial feature expansion. To further reduce computational cost, we replace the static CFGCache with DynCFGCache, a dynamic mechanism that selectively reuses classifier-free guidance (CFG) outputs based on estimated output variations. This allows for more flexible reuse without compromising output quality. Extensive experiments demonstrate that our approach achieves significant acceleration (for example, a 2.79x speedup on the Wan2.1 model) while maintaining high visual fidelity across a range of scenes.
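To make the idea of a reuse threshold concrete, here is a small sketch of threshold-gated caching: an estimate of per-step change is accumulated, and the model is re-run only once the accumulated change reaches the threshold. Everything here is illustrative. The change values are made up, and `threshold_from_complexity` with its `low`/`high` bounds is a hypothetical mapping standing in for the paper's prompt-based estimator, which the abstract does not specify.

```python
def run_with_cache(relative_changes, threshold):
    """Return a per-step plan: 'compute' runs the model, 'reuse' returns the cached output."""
    plan = []
    accumulated = 0.0
    for step, change in enumerate(relative_changes):
        accumulated += change
        if step == 0 or accumulated >= threshold:
            plan.append("compute")
            accumulated = 0.0  # a fresh output resets the accumulated drift
        else:
            plan.append("reuse")
    return plan


def threshold_from_complexity(complexity, low=0.05, high=0.20):
    # Hypothetical mapping: more complex prompts get a stricter (smaller) reuse threshold.
    complexity = min(max(complexity, 0.0), 1.0)
    return high - (high - low) * complexity


# Made-up per-step change estimates for an 8-step sampling run.
changes = [0.0] + [0.03] * 7
simple_plan = run_with_cache(changes, threshold_from_complexity(0.0))
complex_plan = run_with_cache(changes, threshold_from_complexity(1.0))
```

Under these assumptions the "simple" prompt triggers fewer model evaluations than the "complex" one, which is the trade-off a prompt-aware threshold is meant to control.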
[CV-41] DIFFUMA: High-Fidelity Spatio-Temporal Video Prediction via Dual-Path Mamba and Diffusion Enhancement
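[Quick Read]: This paper aims to address the lack of dedicated benchmark datasets for high-precision industrial scenarios, in particular the wafer dicing process in semiconductor manufacturing, which severely hampers research on modeling and predicting complex processes. The key to its solution is constructing and releasing the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset for this process, and proposing DIFFUMA, an innovative dual-path prediction architecture. The model captures global long-range temporal context through a parallel Mamba module and uses a diffusion module guided by temporal features to restore and enhance fine-grained spatial details, thereby effectively countering feature degradation.
Link: https://arxiv.org/abs/2507.06738
Authors: Xinyu Xie, Weifeng Cao, Jun Shi, Yangyang Hu, Hui Liang, Wanyong Liang, Xiaoliang Qian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: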
Abstract:Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twins. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics. The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.
[CV-42] Residual Prior-driven Frequency-aware Network for Image Fusion
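[Quick Read]: This paper aims to address two problems in multimodal image fusion: the high computational cost of modeling long-range feature dependencies, and the difficulty of capturing complementary features effectively without ground-truth labels. The key to its solution is RPFNet, a residual prior-driven, frequency-aware network. It comprises a Residual Prior Module (RPM) that extracts modality-specific difference information, a Frequency Domain Fusion Module (FDFM) that performs efficient global feature modeling and fusion via frequency-domain convolution, and a Cross Promotion Module (CPM) that enhances the synergistic perception of local details and global structures through bidirectional feature interaction. In addition, an auxiliary decoder, a saliency structure loss, an adaptive weight-based frequency contrastive loss, and an SSIM loss effectively constrain the solution space, improving the joint capture of local details and global features and the retention of complementary information.
Link: https://arxiv.org/abs/2507.06735
Authors: Guan Zheng, Xue Wang, Wenhua Qian, Peng Liu, Runzhuo Ma
Affiliations: Yunnan University; Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: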
Abstract:Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model’s sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can effectively facilitate the deployment of high-level vision tasks.
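One common reason frequency-domain operations give cheap global modeling is the convolution theorem: an element-wise product of spectra equals a circular convolution whose kernel spans the whole signal. The sketch below checks that identity on a 1D toy signal with a naive DFT. It is a generic illustration only; how the FDFM actually operates on frequency features is not described in the abstract, and the signal and kernel values are made up.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(total / n if inverse else total)
    return out


def circular_conv_direct(x, kernel):
    """Circular convolution computed directly in the signal domain."""
    n = len(x)
    return [sum(x[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]


def circular_conv_fft(x, kernel):
    """Same result via an element-wise product of the two spectra."""
    spectrum = [a * b for a, b in zip(dft(x), dft(kernel))]
    return [value.real for value in dft(spectrum, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
direct = circular_conv_direct(signal, kernel)
via_fft = circular_conv_fft(signal, kernel)
```

Both routes give the same output up to floating-point error. A real implementation would use a fast Fourier transform rather than the quadratic-time DFT shown here.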
[CV-43] MADPOT: Medical Anomaly Detection with CLIP Adaptation and Partial Optimal Transport
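[Quick Read]: This paper addresses the challenges that diverse imaging modalities, anatomical variations, and limited labeled data pose for medical anomaly detection (MAD). The key to its solution is combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP's adaptability to medical images. Unlike conventional prompt learning, the method uses POT to align multiple prompts with local features and thereby capture subtle abnormalities, while CL further strengthens intra-class cohesion and inter-class separation.
Link: https://arxiv.org/abs/2507.06733
Authors: Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICIAP 2025 (this version is not peer-reviewed; it is the submitted version). ICIAP 2025 proceedings DOI will appear here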
Abstract:Medical anomaly detection (AD) is challenging due to diverse imaging modalities, anatomical variations, and limited labeled data. We propose a novel approach combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP’s adaptability to medical images, particularly for AD. Unlike standard prompt learning, which often yields a single representation, our method employs multiple prompts aligned with local features via POT to capture subtle abnormalities. CL further enforces intra-class cohesion and inter-class separation. Our method achieves state-of-the-art results in few-shot, zero-shot, and cross-dataset scenarios without synthetic data or memory banks. The code is available at this https URL.
[CV-44] Hierarchical Feature Alignment for Gloss-Free Sign Language Translation
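[Quick Read]: This paper addresses the disparity between visual and textual representations in Sign Language Translation (SLT), particularly during end-to-end learning. The key to its solution is a hierarchical pre-training strategy inspired by the structure of sign language. It combines pseudo-glosses with contrastive video-language alignment, extracting features at the frame, segment, and video levels and aligning them with pseudo-glosses and the spoken sentence to improve translation quality.
Link: https://arxiv.org/abs/2507.06732
Authors: Sobhan Asasi, Mohamed Ilyes Lakhal, Richard Bowden
Affiliations: University of Surrey, Guildford, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in SLTAT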
Abstract:Sign Language Translation (SLT) attempts to convert sign language videos into spoken sentences. However, many existing methods struggle with the disparity between visual and textual representations during end-to-end learning. Gloss-based approaches help to bridge this gap by leveraging structured linguistic information. Although gloss-free methods offer greater flexibility and remove the burden of annotation, they require effective alignment strategies. Recent advances in Large Language Models (LLMs) have enabled gloss-free SLT by generating text-like representations from sign videos. In this work, we introduce a novel hierarchical pre-training strategy inspired by the structure of sign language, incorporating pseudo-glosses and contrastive video-language alignment. Our method hierarchically extracts features at frame, segment, and video levels, aligning them with pseudo-glosses and the spoken sentence to enhance translation quality. Experiments demonstrate that our approach improves BLEU-4 and ROUGE scores while maintaining efficiency.
[CV-45] A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
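[Quick Read]: This paper aims to address accurate target identification from natural language queries in open-vocabulary 3D visual grounding, in particular the difficulty existing methods have in parsing spatial relations in queries (such as "the book on the chair"). The key to its solution is the SpatialReasoner framework, which combines large language model (LLM)-driven spatial reasoning with a visual properties-enhanced hierarchical feature field. It fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation, and uses visual properties to build the hierarchical feature field, improving both localization accuracy for target instances and spatial reasoning in 3D scenes.
Link: https://arxiv.org/abs/2507.06719
Authors: Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang, Xiangyang Xue, Yanwei Fu
Affiliations: Fudan University; Shanghai Innovation Institute; Zhejiang University; Tongji University; NeuHelium Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: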
Abstract:Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
[CV-46] Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis
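[Quick Read]: This paper addresses music-guided dance video synthesis, that is, translating input music into a corresponding dance video. The key to its solution is a novel spatial-temporal graph Mamba (STG-Mamba) consisting of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. For music-to-skeleton translation, a spatial-temporal graph Mamba (STGM) block constructs skeleton sequences from the input music and captures dependencies between joints in both the spatial and temporal dimensions. For skeleton-to-video translation, a self-supervised regularization network translates the generated skeleton sequence, together with a conditional image, into a dance video.
Link: https://arxiv.org/abs/2507.06689
Authors: Hao Tang, Ling Shao, Zhenyu Zhang, Luc Van Gool, Nicu Sebe
Affiliations: Peking University; University of Chinese Academy of Sciences; Nanjing University; ETH Zurich; KU Leuven; INSAIT, Sofia University; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to TPAMI 2025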
Abstract:We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.
[CV-47] StixelNExT++: Lightweight Monocular Scene Segmentation and Representation for Collective Perception
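[Quick Read]: This paper addresses the efficiency and accuracy of scene representation in monocular perception systems, aiming for high compression of scene information while remaining adaptable to point cloud and bird's-eye-view representations. The key to its solution is to build on the established Stixel representation, inferring 3D Stixels and clustering smaller 3D Stixel units to enhance object segmentation, which improves the quality and practicality of the scene representation.
Link: https://arxiv.org/abs/2507.06687
Authors: Marcel Vosshans, Omar Ait-Aider, Youcef Mezouar, Markus Enzweiler
Affiliations: Institut Pascal ISPR (Image, Perception Systems, Robotics); Universite Clermont Auvergne INP / CNRS; Institute for Intelligent Systems; Esslingen University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: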
Abstract:This paper presents StixelNExT++, a novel approach to scene representation for monocular perception systems. Building on the established Stixel representation, our method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. The approach achieves high compression of scene information while remaining adaptable to point cloud and bird’s-eye-view representations. Our lightweight neural network, trained on automatically generated LiDAR-based ground truth, achieves real-time performance with computation times as low as 10 ms per frame. Experimental results on the Waymo dataset demonstrate competitive performance within a 30-meter range, highlighting the potential of StixelNExT++ for collective perception in autonomous systems.
[CV-48] Text-promptable Object Counting via Quantity Awareness Enhancement
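[Quick Read]: This paper aims to address text-promptable object counting, in particular the fact that text prompts carrying only object category information are insufficient for training a model to distinguish object quantities accurately. The key to its solution is QUANet, which introduces quantity-oriented text prompts together with a vision-text quantity alignment loss to enhance the model's quantity awareness. It also designs a dual-stream adaptive counting decoder, consisting of a Transformer stream, a CNN stream, and Transformer-to-CNN enhancement adapters (T2C-adapters), for density map prediction, and optimizes the ranking order of the two streams' predictions with a cross-stream quantity ranking loss.
Link: https://arxiv.org/abs/2507.06679
Authors: Miaojing Shi, Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Li Li
Affiliations: Tongji University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures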
Abstract:Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This however is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model’s quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate the effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model’s strong generalizability for zero-shot class-agnostic counting. Code is available at this https URL
[CV-49] FlexGaussian: Flexible and Cost-Effective Training-Free Compression for 3D Gaussian Splatting ACM-MM2025
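[Quick Read]: This paper aims to address the high memory and computation costs of 3D Gaussian splatting for large-scale models, especially on resource-constrained mobile and edge devices. Its key solution is FlexGaussian, a training-free 3D Gaussian compression method that combines mixed-precision quantization with attribute-discriminative pruning, achieving high compression ratios within seconds while maintaining high rendering quality.
Link: https://arxiv.org/abs/2507.06671
Authors: Boyuan Tian, Qizhe Gao, Siran Xianyu, Xiaotong Cui, Minjia Zhang
Affiliations: UIUC (University of Illinois Urbana-Champaign)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at ACM MM 2025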
Abstract:3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints. In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: this https URL
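As background for the mixed-precision quantization mentioned above, the following sketch applies plain uniform min-max quantization at several bit widths and measures the reconstruction error. It only illustrates the bit-width versus error trade-off that makes per-attribute precision choices meaningful. The sample values and the candidate bit widths are assumptions, and FlexGaussian's actual quantizer and pruning criteria are not reproduced.

```python
def quantize(values, bits):
    """Uniform min-max quantization to 2**bits levels; returns integer codes, offset, and scale."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, lo, scale = quantize(values, bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(codes, lo, scale)))


# Made-up attribute values (for example, opacities in [0, 1]).
values = [0.0, 0.1, 0.37, 0.52, 0.9, 1.0]
errors = {bits: max_error(values, bits) for bits in (4, 8, 16)}
```

A mixed-precision scheme would assign more bits to attributes where this error hurts rendering quality most and fewer bits elsewhere. Which attributes those are is a design decision the sketch does not make.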
[CV-50] MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning
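[Quick Read]: This paper aims to address category-level object pose estimation, that is, predicting the pose of objects within a known category without prior knowledge of individual instances. Existing methods that rely on RGB images or point cloud data struggle with object occlusion and with generalization across instances and categories. The proposed multimodal-based keypoint learning framework (MK-Pose) integrates RGB images, point clouds, and category-level textual descriptions, and combines a self-supervised keypoint detection module with attention-based query generation, soft heatmap matching, and graph-based relational modeling. The key to its solution is the use of multimodal information together with a graph-enhanced feature fusion module that integrates local geometric information and global context.
Link: https://arxiv.org/abs/2507.06662
Authors: Yifan Yang, Peili Song, Enfan Lan, Dong Liu, Jingtai Liu
Affiliations: Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: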
Abstract:Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Codes will be released at this https URL.
[CV-51] Enhancing Diffusion Model Stability for Image Restoration via Gradient Management
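[Quick Read]: This paper addresses instability in the generation process of diffusion models for image restoration, caused by conflicts between the prior and likelihood gradient directions and by temporal fluctuations in the likelihood gradient, which in turn degrade restoration performance. The key to its solution is a new gradient management technique, Stabilized Progressive Gradient Diffusion (SPGD), built on two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient.
Link: https://arxiv.org/abs/2507.06656
Authors: Hongjie Wu, Mingqin Zhang, Linchao He, Ji-Zhe Zhou, Jiancheng Lv
Affiliations: College of Computer Science, Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ACM Multimedia 2025. Preprint version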
Abstract:Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at this https URL.
[CV-52] MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval IJCAI2025
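[Quick Read]: This paper addresses result diversification (RD) in text-to-image retrieval. Conventional methods focus only on a diversity metric for image appearance and cannot adapt the metric to the application at hand, which limits where RD can be applied. The key to its solution is a new task, Contextual Diversity Refinement of Composite Attributes (CDR-CA), tackled with Multi-Source DPPs (MS-DPPs). MS-DPPs extend the Determinantal Point Process (DPP) to multiple sources, build a unified similarity matrix based on a manifold representation, and introduce Tangent Normalization to reflect context, so that diversity is refined in line with the application's needs.
Link: https://arxiv.org/abs/2507.06654
Authors: Naoya Sogi, Takashi Shibata, Makoto Terao, Masanori Suganuma, Takayuki Okatani
Affiliations: NEC Corporation; Tohoku University; RIKEN Center for AIP
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: IJCAI 2025. Code: this https URL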
Abstract:Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the efficiency of a practical application. Conventional methods focus solely on increasing the diversity metric of image appearances. However, the diversity metric and its desired value vary depending on the application, which limits the applications of RD. This paper proposes a novel task called CDR-CA (Contextual Diversity Refinement of Composite Attributes). CDR-CA aims to refine the diversities of multiple attributes, according to the application’s context. To address this task, we propose Multi-Source DPPs, a simple yet strong baseline that extends the Determinantal Point Process (DPP) to multi-sources. We model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation. We also introduce Tangent Normalization to reflect contexts. Extensive experiments demonstrate the effectiveness of the proposed method. Our code is publicly available at this https URL.
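For readers unfamiliar with DPPs, the sketch below shows why a determinant rewards diversity: a subset containing two near-duplicate items has a nearly singular kernel submatrix and therefore a small determinant. It uses greedy determinant maximization, a standard approximation to DPP MAP inference, on a single toy kernel. The kernel values are made up, and the multi-source unified similarity matrix and Tangent Normalization from the paper are not reproduced.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


def greedy_dpp(kernel, k):
    """Greedily add the item that maximizes the determinant of the selected submatrix."""
    selected = []
    for _ in range(k):
        best_item, best_det = None, -1.0
        for item in range(len(kernel)):
            if item in selected:
                continue
            subset = selected + [item]
            sub = [[kernel[i][j] for j in subset] for i in subset]
            value = determinant(sub)
            if value > best_det:
                best_item, best_det = item, value
        selected.append(best_item)
    return selected


# Toy similarity kernel: items 0 and 1 are near-duplicates, item 2 is different.
kernel = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
chosen = greedy_dpp(kernel, 2)
```

After picking item 0, the greedy step prefers item 2 over the near-duplicate item 1, because the pair (0, 2) has a much larger determinant than the pair (0, 1).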
[CV-53] Diff²I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior ICCV2025
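[Quick Read]: This paper aims to address the learning of cross-modal correspondences in image-to-point cloud (I2P) registration. Existing methods rely mainly on metric learning to align features across modalities and disregard the inherent modality gap between image and point cloud data, which makes accurate cross-modal correspondences hard to guarantee. The key to its solution is Diff²I2P, a fully differentiable I2P registration framework that uses a novel and effective diffusion prior to bridge the modality gap. Specifically, a Control-Side Score Distillation (CSD) technique distills knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation, and a Deformable Correspondence Tuning (DCT) module estimates correspondences in a differentiable way, which significantly improves registration performance.
Link: https://arxiv.org/abs/2507.06651
Authors: Juncheng Mu, Chengwei Ren, Weixiang Zhang, Liang Pan, Xiao-Ping Zhang, Yue Gao
Affiliations: Tsinghua University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025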
Abstract:Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose Diff²I2P, a fully Differentiable I2P registration framework, leveraging a novel and effective Diffusion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by the transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff²I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark.
[CV-54] ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data MICCAI2025
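[Quick Read]: This paper aims to address the high computing cost and low rendering speed of interactive cinematic rendering for volumetric medical data, which limit interactivity in practical applications. The key to its solution is the ClipGS framework, which supports clipping planes and introduces a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane, together with an adaptive adjustment model that dynamically adjusts the deformation of Gaussians and refines rendering performance.
Link: https://arxiv.org/abs/2507.06647
Authors: Chengkun Li, Yuqi Tong, Kai Chen, Zhenya Yang, Ruiyang Li, Shi Qiu, Jason Ying-Kuen Chan, Pheng-Ann Heng, Qi Dou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accepted by MICCAI 2025. Project is available at: this https URL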
Abstract:The visualization of volumetric medical data is crucial for enhancing diagnostic accuracy and improving surgical planning and education. Cinematic rendering techniques significantly enrich this process by providing high-quality visualizations that convey intricate anatomical details, thereby facilitating better understanding and decision-making in medical contexts. However, the high computing cost and low rendering speed hinder the interactive visualization required in practical applications. In this paper, we introduce ClipGS, an innovative Gaussian splatting framework with clipping plane support, for interactive cinematic visualization of volumetric medical data. To address the challenges posed by dynamic interactions, we propose a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane. In addition, we design an adaptive adjustment model to dynamically adjust the deformation of Gaussians and refine the rendering performance. We validate our method on five volumetric medical datasets (including CT and anatomical slice data), and reach an average 36.635 PSNR rendering quality with 156 FPS and 16.1 MB model size, outperforming state-of-the-art methods in rendering quality and efficiency.
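The geometric core of a clipping plane is a signed-distance test. The sketch below applies a hard version of that test to primitive centers, purely as a baseline illustration. ClipGS instead learns how visibility should change near the plane; that learned component is not reproduced here, and the function names and sample coordinates are hypothetical.

```python
def signed_distance(point, plane_normal, plane_offset):
    """Signed distance from a point to the plane n.x + d = 0 (n is assumed to have unit length)."""
    return sum(p * n for p, n in zip(point, plane_normal)) + plane_offset


def visible_centers(centers, plane_normal, plane_offset):
    # Hard rule used only for illustration: keep primitives on the non-negative side.
    return [c for c in centers if signed_distance(c, plane_normal, plane_offset) >= 0.0]


# Made-up primitive centers and a clipping plane at z = 0 facing +z.
centers = [(0.0, 0.0, -1.0), (0.0, 0.0, 0.5), (1.0, 2.0, 3.0)]
kept = visible_centers(centers, plane_normal=(0.0, 0.0, 1.0), plane_offset=0.0)
```

Note that this test looks only at centers and ignores each Gaussian's spatial extent, so primitives straddling the plane are kept or dropped whole.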
[CV-55] Learning from Sparse Point Labels for Dense Carcinosis Localization in Advanced Ovarian Cancer Assessment
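[Quick Read]: This paper addresses learning from sparse labels in the medical domain, which is especially hard for tasks that need dense pixel-level annotations because annotation is costly and perfect annotations are difficult to obtain. The key to its solution is to formulate the problem as sparse heatmap regression from a few point annotations per image and to propose a new loss function, the Crag and Tail loss, which effectively leverages positive sparse labels while minimizing the impact of false negatives or missed annotations.
Link: https://arxiv.org/abs/2507.06643
Authors: Farahdiba Zarin, Riccardo Oliva, Vinkle Srivastav, Armine Vardazaryan, Andrea Rosati, Alice Zampolini Faustini, Giovanni Scambia, Anna Fagotti, Pietro Mascagni, Nicolas Padoy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: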
Abstract:Learning from sparse labels is a challenge commonplace in the medical domain. This is due to numerous factors, such as annotation cost, and is especially true for newly introduced tasks. When dense pixel-level annotations are needed, this becomes even more unfeasible. However, being able to learn from just a few annotations at the pixel-level, while extremely difficult and underutilized, can drive progress in studies where perfect annotations are not immediately available. This work tackles the challenge of learning the dense prediction task of keypoint localization from a few point annotations in the context of 2d carcinosis keypoint localization from laparoscopic video frames for diagnostic planning of advanced ovarian cancer patients. To enable this, we formulate the problem as a sparse heatmap regression from a few point annotations per image and propose a new loss function, called Crag and Tail loss, for efficient learning. Our proposed loss function effectively leverages positive sparse labels while minimizing the impact of false negatives or missed annotations. Through an extensive ablation study, we demonstrate the effectiveness of our approach in achieving accurate dense localization of carcinosis keypoints, highlighting its potential to advance research in scenarios where dense annotations are challenging to obtain.
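Heatmap regression from point labels usually starts by turning each annotated point into a smooth target map. The sketch below builds such a target by taking the per-pixel maximum over isotropic Gaussians. The `sigma` value, the max-combination rule, and the sample points are assumptions for illustration, and the Crag and Tail loss itself is not reproduced.

```python
import math


def gaussian_heatmap(height, width, points, sigma):
    """Per-pixel maximum over isotropic Gaussians centred on the annotated (row, col) points."""
    heatmap = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for pr, pc in points:
                value = math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2.0 * sigma ** 2))
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap


# Two made-up point annotations on a 5x5 image.
target = gaussian_heatmap(height=5, width=5, points=[(1, 1), (3, 4)], sigma=1.0)
```

Pixels at an annotated point get the value 1.0 and the target decays smoothly with distance. Pixels far from every annotation sit near zero, which is where missed annotations become a problem for a plain regression loss.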
[CV-56] EXAONE Path 2.0: Pathology Foundation Model with End-to-End Supervision
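[Quick read]: This paper addresses the challenges of processing whole-slide images (WSIs) in digital pathology, in particular the computational and data-efficiency problems caused by their gigapixel resolution. Conventional approaches train patch encoders with self-supervised learning (SSL) and then use multiple instance learning (MIL) or slide encoders for downstream tasks, but this may overlook complex domain-specific features relevant to biomarker prediction. SSL methods are also less data efficient than fully supervised ones. To address these issues, the paper proposes EXAONE Path 2.0, a pathology foundation model that learns patch-level representations under direct slide-level supervision. Its key point is efficient feature learning from a small amount of data (only 37k WSIs), reaching state-of-the-art average performance across 10 biomarker prediction tasks.
Link: https://arxiv.org/abs/2507.06639
Authors: Myungjang Pyeon, Janghyeon Lee, Minsoo Lee, Juseung Yun, Hwanil Choi, Jonghyun Kim, Jiwon Kim, Yi Hu, Jongseong Jang, Soonyoung Lee
Affiliation: LG AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EXAONE Path 2.0 technical report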
Abstract:In digital pathology, whole-slide images (WSIs) are often difficult to handle due to their gigapixel scale, so most approaches train patch encoders via self-supervised learning (SSL) and then aggregate the patch-level embeddings via multiple instance learning (MIL) or slide encoders for downstream tasks. However, patch-level SSL may overlook complex domain-specific features that are essential for biomarker prediction, such as mutation status and molecular characteristics, as SSL methods rely only on basic augmentations selected for natural image domains on small patch-level area. Moreover, SSL methods remain less data efficient than fully supervised approaches, requiring extensive computational resources and datasets to achieve competitive performance. To address these limitations, we present EXAONE Path 2.0, a pathology foundation model that learns patch-level representations under direct slide-level supervision. Using only 37k WSIs for training, EXAONE Path 2.0 achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency.
[CV-57] PointVDP: Learning View-Dependent Projection by Fireworks Rays for 3D Point Cloud Segmentation
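[Quick read]: This paper addresses the insufficient projection diversity and low computational efficiency that arise in point cloud segmentation when conventional projection methods rely on fixed parameters. Existing methods use view-independent projection, whose rays are limited to manually set, pre-defined parameters and cannot fully capture projection diversity across view planes; multi-projection strategies increase spatial variety but introduce redundant computation and reduce processing efficiency. The key to the solution is a view-dependent projection (VDP) framework that generates highly informative single-image inputs from 3D point distributions in a data-driven way, using a ray prediction mechanism inspired by the adaptive behavior of fireworks, together with a color regularization that optimizes the framework, improving 2D space utilization and semantic understanding while keeping computation cost low.
Link: https://arxiv.org/abs/2507.06618
Authors: Yang Chen, Yueqi Duan, Haowen Sun, Ziwei Wang, Jiwen Lu, Yap-Peng Tan
Affiliation: Nanyang Technological University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: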
Abstract:In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry from view variations. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence provides projection rays that are limited to pre-defined parameters by human settings, restricting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projected redundancy leads to excessive computational overhead and inefficiency in image processing. To address these limitations, we design a framework of VDP to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. In addition, we construct color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses the non-semantic features within black pixels, thereby maximizing 2D space utilization in a projected image. As a result, our approach, PointVDP, develops lightweight projections in marginal computation costs. Experiments on S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.
[CV-58] Denoising Multi-Beta VAE: Representation Learning for Disentanglement and Generation
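[Quick read]: This paper addresses the trade-off in generative models between disentangled, interpretable latent representations and generation quality. Conventional methods such as β-VAE balance the two by tuning the hyperparameter β, but larger β values create an information bottleneck that sacrifices reconstruction accuracy for better disentanglement. The key to the proposed solution is to train a single variational autoencoder (VAE) over a range of β values so that it learns multiple corresponding latent representations, and to introduce a non-linear diffusion model that transitions smoothly between the latent representations for different β values, ultimately yielding almost lossless representations that support high-quality reconstruction and standalone sample generation.
Link: https://arxiv.org/abs/2507.06613
Authors: Anshuk Uppal, Yuhta Takida, Chieh-Hsin Lai, Yuki Mitsufuji
Affiliation: Technical University of Denmark; Sony AI; Sony Group Corporation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 8 figures and 7 tables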
Abstract:Disentangled and interpretable latent representations in generative models typically come at the cost of generation quality. The \beta-VAE framework introduces a hyperparameter \beta to balance disentanglement and reconstruction quality, where setting \beta > 1 introduces an information bottleneck that favors disentanglement over sharp, accurate reconstructions. To address this trade-off, we propose a novel generative modeling framework that leverages a range of \beta values to learn multiple corresponding latent representations. First, we obtain a slew of representations by training a single variational autoencoder (VAE), with a new loss function that controls the information retained in each latent representation such that higher \beta values prioritize disentanglement over reconstruction fidelity. We then introduce a non-linear diffusion model that smoothly transitions latent representations corresponding to different \beta values. This model denoises towards less disentangled and more informative representations, ultimately leading to (almost) lossless representations, enabling sharp reconstructions. Furthermore, our model supports sample generation without input images, functioning as a standalone generative model. We evaluate our framework in terms of both disentanglement and generation quality. Additionally, we observe smooth transitions in the latent spaces with respect to changes in \beta, facilitating consistent manipulation of generated outputs.
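For readers unfamiliar with the baseline this abstract builds on, the sketch below shows the standard \beta-VAE objective: a reconstruction term plus \beta times the KL divergence between a diagonal Gaussian posterior and a standard normal prior. It only illustrates why a larger \beta penalizes informative latents more strongly. The paper's new multi-\beta loss and diffusion model are not shown, and the function names are illustrative.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def beta_vae_loss(recon_error, mu, log_var, beta):
    """Standard beta-VAE objective: reconstruction error + beta * KL term."""
    return recon_error + beta * gaussian_kl(mu, log_var)


# Same encoder output, two settings of beta: the KL term dominates as beta grows.
loss_beta_1 = beta_vae_loss(1.0, mu=[1.0], log_var=[0.0], beta=1.0)
loss_beta_4 = beta_vae_loss(1.0, mu=[1.0], log_var=[0.0], beta=4.0)
```

With \beta = 1 this is the ordinary VAE objective (negative ELBO); raising \beta pushes the posterior toward the prior, which is the information bottleneck the abstract refers to.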
[CV-59] Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation
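[Quick read]: This paper aims to fuse spatial and spectral information effectively in medical hyperspectral imaging (MHSI), which is made particularly difficult by high dimensionality and spectral redundancy. The proposed solution is Omni-Fuse, a cross-dimensional omni-fusion network. Its key components are rich cross-dimensional feature fusion operations: a cross-dimensional enhancement module that refines spatial and spectral features through bidirectional attention, a spectral-guided spatial query selection mechanism, and a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial query. These designs significantly improve segmentation performance while keeping execution efficient.
Link: https://arxiv.org/abs/2507.06606
Authors: Qing Zhang, Guoquan Pei, Yan Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: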
Abstract:Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing both spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to the high dimensionality and spectral redundancy inherent in such data. To solve the above challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named Omni-Fuse. Here, we introduce abundant cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection to select the most spectral-related spatial feature as the query, and a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial query. Despite its numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach can significantly improve the segmentation performance compared with the state-of-the-art methods, with over 5.73 percent improvement in DSC. Code available at: this https URL.
[CV-60] Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
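[Quick read]: This paper aims to address the challenges of long-term action recognition (LTAR), which include complex atomic action correlations and visual confounders caused by long temporal spans. Existing causality-based methods handle modality-specific biases but lack cross-modal causal modeling, which limits their use in LTAR based on vision-language models (VLMs). The proposed Cross-Modal Dual-Causal Learning (CMDCL) introduces a structural causal model to uncover causal relationships between videos and label texts. The key is to remove cross-modal biases in text embeddings through textual causal intervention and to remove confounders in the visual modality through visual causal intervention guided by the debiased text, producing robust action representations for LTAR.
Link: https://arxiv.org/abs/2507.06603
Authors: Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing, Fan Chao
Affiliation: Beijing University of Technology; Chinese Academy of Sciences; Capital Medical University; Beijing Chao-yang Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: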
Abstract:Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks including Charades, Breakfast and COIN, demonstrate the effectiveness of the proposed model. Our code is available at this https URL.
[CV-61] Capturing Stable HDR Videos Using a Dual-Camera System
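[Quick read]: This paper addresses flickering in HDR video reconstruction caused by exposure fluctuations in the reference images of alternating-exposure methods. The key to the solution is a dual-camera system (DCS), in which one camera captures a consistent reference sequence and the other captures non-reference sequences to supplement information, combined with an exposure-adaptive fusion network (EAFNet). EAFNet uses a pre-alignment subnetwork to explore the influence of exposure and selectively emphasize valuable features across exposure levels, then fuses features with an asymmetric cross-feature fusion subnetwork, and finally uses a multiscale architecture based on the discrete wavelet transform (DWT) to reduce ghosting artifacts and refine features at different resolutions.
Link: https://arxiv.org/abs/2507.06593
Authors: Qianyu Zhang, Bolun Zheng, Hangjia Pan, Lingyu Zhu, Zunjie Zhu, Zongpeng Li, Shiqi Wang
Affiliation: Hangzhou Dianzi University; Lishui Institute of Hangzhou Dianzi University; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: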
Abstract:In HDR video reconstruction, exposure fluctuations in reference images from alternating exposure methods often result in flickering. To address this issue, we propose a dual-camera system (DCS) for HDR video acquisition, where one camera is assigned to capture consistent reference sequences, while the other is assigned to capture non-reference sequences for information supplementation. To tackle the challenges posed by video data, we introduce an exposure-adaptive fusion network (EAFNet) to achieve more robust results. EAFNet introduces a pre-alignment subnetwork to explore the influence of exposure, selectively emphasizing the valuable features across different exposure levels. Then, the enhanced features are fused by the asymmetric cross-feature fusion subnetwork, which explores reference-dominated attention maps to improve image fusion by aligning cross-scale features and performing cross-feature fusion. Finally, the reconstruction subnetwork adopts a DWT-based multiscale architecture to reduce ghosting artifacts and refine features at different resolutions. Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance on different datasets, validating the great potential of the DCS in HDR video reconstruction. The code and data captured by DCS will be available at this https URL.
[CV-62] Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning
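[Quick read]: This paper aims to address per-point ambiguity and insufficiently discriminative features caused by transition regions in 3D semantic segmentation of point clouds. Conventional methods use equally penalized objectives that ignore per-point uncertainty, which limits model performance. The key to the solution is AMContrast3D, which integrates contrastive learning into an ambiguity estimation framework and adapts the optimization objective to each point's ambiguity level, ensuring correctness on low-ambiguity points while tolerating mistakes on high-ambiguity points. The further proposed AMContrast3D++ uses two branches trained in parallel and a novel ambiguity prediction module, and applies masked refinement to ambiguous embeddings based on the predicted ambiguities, improving segmentation performance and robustness.
Link: https://arxiv.org/abs/2507.06592
Authors: Yang Chen, Yueqi Duan, Haowen Sun, Jiwen Lu, Yap-Peng Tan
Affiliation: Nanyang Technological University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This article has been accepted for publication in IEEE Transactions on Multimedia. arXiv admin note: text overlap with arXiv:2502.04111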
Abstract:This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method that incorporates contrastive learning into an ambiguity estimation framework and tailors adaptive objectives to individual points based on their ambiguity levels. As a result, our method promotes model training that ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to make the ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method. Code is available at this https URL.
[CV-63] MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction
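[Quick read]: This paper addresses the persistent challenge of generating human motion from rare language prompts; existing methods struggle with coarse-grained matching and overlook important semantic cues because of motion redundancy. The key to the solution is to exploit fine-grained clip relationships by introducing a novel temporal clip Banzhaf interaction, which precisely quantifies text-motion coherence at the clip level, enabling direct fine-grained text-to-motion clip matching and eliminating redundancy.
Link: https://arxiv.org/abs/2507.06590
Authors: Yin Wang, Mu li, Zhiying Leng, Frederick W. B. Li, Xiaohui Liang
Affiliation: Beihang University; University of Durham; Zhongguancun Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: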
Abstract:We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST’s retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
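The Banzhaf interaction named in this abstract comes from cooperative game theory: for two players i and j it averages, over all coalitions S of the remaining players, the quantity v(S ∪ {i, j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S). The sketch below computes that index for a toy game. How MOST defines players and the value function over text tokens and motion clips is not stated in the abstract, so the player names and the `toy_value` function are purely hypothetical.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Pairwise Banzhaf interaction index of players i and j.

    `value` maps a frozenset of players to a real number (the game's
    characteristic function). The cost is exponential in len(players).
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    return total / (2 ** len(others))


def toy_value(coalition):
    """Hypothetical coherence score: only 'text' with 'clip_a' is coherent."""
    return 1.0 if {"text", "clip_a"} <= coalition else 0.0


players = ["text", "clip_a", "clip_b"]
matched = banzhaf_interaction(players, toy_value, "text", "clip_a")
unmatched = banzhaf_interaction(players, toy_value, "text", "clip_b")
```

A positive index means the two players contribute more together than separately, which is the sense in which such an index can score how well a clip and a text fragment belong together.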
[CV-64] Edge-Boundary-Texture Loss: A Tri-Class Generalization of Weighted Binary Cross-Entropy for Enhanced Edge Detection
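[Quick read]: This paper addresses the performance limits in edge detection (ED) caused by the ambiguity of non-edge pixels near object boundaries. The widely used Weighted Binary Cross-Entropy (WBCE) loss treats all non-edge pixels uniformly, overlooking structural detail around edges and often producing blurred predictions. The key to the proposed Edge-Boundary-Texture (EBT) loss is to divide pixels into three classes (edge, boundary, and texture) and assign each a distinct supervisory weight, guiding the model to attend to both edge precision and contextual boundary localization for more structured learning.
Link: https://arxiv.org/abs/2507.06569
Authors: Hao Shu
Affiliation: Sun-Yat-Sen University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages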
Abstract:Edge detection (ED) remains a fundamental task in computer vision, yet its performance is often hindered by the ambiguous nature of non-edge pixels near object boundaries. The widely adopted Weighted Binary Cross-Entropy (WBCE) loss treats all non-edge pixels uniformly, overlooking the structural nuances around edges and often resulting in blurred predictions. In this paper, we propose the Edge-Boundary-Texture (EBT) loss, a novel objective that explicitly divides pixels into three categories, edge, boundary, and texture, and assigns each a distinct supervisory weight. This tri-class formulation enables more structured learning by guiding the model to focus on both edge precision and contextual boundary localization. We theoretically show that the EBT loss generalizes the WBCE loss, with the latter becoming a limit case. Extensive experiments across multiple benchmarks demonstrate the superiority of the EBT loss both quantitatively and perceptually. Furthermore, the consistent use of unified hyperparameters across all models and datasets, along with robustness to their moderate variations, indicates that the EBT loss requires minimal fine-tuning and is easily deployable in practice.
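The tri-class idea can be illustrated with a small sketch: label each pixel as edge, boundary, or texture, then weight a per-pixel binary cross-entropy by class. This is an assumption-laden illustration rather than the paper's definition. In particular, treating "boundary" as non-edge pixels within a fixed `radius` of an edge, the weight values, and the plain mean reduction are all hypothetical choices.

```python
import math


def pixel_classes(edge_map, radius=1):
    """Label pixels 'edge', 'boundary' (non-edge near an edge), or 'texture'."""
    h, w = len(edge_map), len(edge_map[0])
    labels = [["texture"] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if edge_map[r][c] == 1:
                labels[r][c] = "edge"
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and edge_map[rr][cc] == 1:
                        labels[r][c] = "boundary"
    return labels


def tri_class_bce(pred, edge_map, weights, radius=1, eps=1e-7):
    """Mean binary cross-entropy with one weight per pixel class."""
    labels = pixel_classes(edge_map, radius)
    total, count = 0.0, 0
    for r, row in enumerate(pred):
        for c, p in enumerate(row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            y = edge_map[r][c]
            bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            total += weights[labels[r][c]] * bce
            count += 1
    return total / count


edge_map = [[0, 0, 1, 0, 0]]
pred = [[0.5, 0.5, 0.5, 0.5, 0.5]]
labels = pixel_classes(edge_map)
# Equal boundary and texture weights: an ordinary two-weight WBCE.
wbce_like = tri_class_bce(pred, edge_map, {"edge": 2.0, "boundary": 1.0, "texture": 1.0})
# Distinct weights for all three classes (values are illustrative only).
tri_class = tri_class_bce(pred, edge_map, {"edge": 2.0, "boundary": 1.5, "texture": 0.5})
```

In this sketch, setting the boundary and texture weights equal collapses the loss to a two-weight WBCE, which mirrors the abstract's statement that WBCE is a limit case of the EBT loss without claiming to reproduce the paper's argument.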
[CV-65] Divergence-Based Similarity Function for Multi-View Contrastive Learning
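[Quick read]: This paper addresses how to model the joint structure across all views in multi-view representation learning; existing methods mainly capture pairwise relationships and cannot fully model the global structure across views. The key to the solution is a divergence-based similarity function (DSF) that represents each set of augmented views as a distribution and uses the divergence between distributions as the similarity measure, explicitly capturing the joint structure of multiple views.
Link: https://arxiv.org/abs/2507.06560
Authors: Jae Hyoung Jeon, Cheolsu Lim, Myungjoo Kang
Affiliation: Seoul National University; Research Institute of Mathematics, SNU
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures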
Abstract:Recent success in contrastive learning has sparked growing interest in more effectively leveraging multiple augmented views of an instance. While prior methods incorporate multiple views at the loss or feature level, they primarily capture pairwise relationships and fail to model the joint structure across all views. In this work, we propose a divergence-based similarity function (DSF) that explicitly captures the joint structure by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Extensive experiments demonstrate that DSF consistently improves performance across various tasks, including kNN classification and linear evaluation, while also offering greater efficiency compared to other multi-view methods. Furthermore, we establish a theoretical connection between DSF and cosine similarity, and show that, unlike cosine similarity, DSF operates effectively without requiring a temperature hyperparameter.
[CV-66] Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution
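[Quick read]: This paper addresses the difficulty of accurately attributing specific elements of a generated image (such as styles or objects) to training data in generative AI models, a problem that affects copyright management and model transparency. The key to the solution is a new method, Concept-TRAK, which achieves concept-level attribution through two core innovations: a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust sample-specific attribution, and a concept-aware reward function that emphasizes semantic relevance.
Link: https://arxiv.org/abs/2507.06547
Authors: Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji
Affiliation: SONY AI; Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint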
Abstract:While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce concept-level attribution via a novel method called Concept-TRAK. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies, ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning, we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.
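[CV-67] Token Bottleneck: One Token to Remember Dynamics
[Quick read]: This paper aims to extract compact, temporally aware visual representations from dynamic scenes to support sequential scene understanding tasks such as visual tracking and robotic manipulation. The key to the solution is Token Bottleneck (ToBo), a self-supervised learning framework that squeezes a scene into a single bottleneck token and, in the expansion step, predicts the subsequent scene using a few target patches as hints, thereby learning sequential scene representations effectively. This design encourages the vision backbone to embed temporal dependencies and so to capture dynamic transitions across scenes.
Link: https://arxiv.org/abs/2507.06543
Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
Affiliation: NAVER AI Lab; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures, 8 tables, project page: this https URL, code: this https URL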
Abstract:Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
[CV-68] A model-agnostic active learning approach for animal detection from camera traps
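[Quick read]: This paper addresses the excessive resources needed for data labelling and animal detection model training in wildlife monitoring, caused by the large volume of data captured by camera traps. The key to the solution is a model-agnostic active learning approach that integrates uncertainty and diversity measures at both the object and image levels into sample selection, reducing the amount of labelled data needed while improving model performance.
Link: https://arxiv.org/abs/2507.06537
Authors: Thi Thu Thuy Nguyen, Duc Thanh Nguyen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: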
Abstract:Smart data selection is becoming increasingly important in data-driven machine learning. Active learning offers a promising solution by allowing machine learning models to be effectively trained with optimal data including the most informative samples from large datasets. Wildlife data captured by camera traps are excessive in volume, requiring tremendous effort in labelling data and training animal detection models. Therefore, applying active learning to optimise the amount of labelled data would be a great aid in enabling automated wildlife monitoring and conservation. However, existing active learning techniques require that a machine learning model (i.e., an object detector) be fully accessible, limiting the applicability of the techniques. In this paper, we propose a model-agnostic active learning approach for detection of animals captured by camera traps. Our approach integrates uncertainty and diversity quantities of samples at both the object-based and image-based levels into the active learning sample selection process. We validate our approach on a benchmark animal dataset. Experimental results demonstrate that, using only 30% of the training data selected by our approach, a state-of-the-art animal detector can achieve performance equal to or greater than that obtained with the complete training dataset.
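A selection rule that combines uncertainty and diversity while touching only a detector's outputs can be sketched as follows. This is a generic greedy heuristic under stated assumptions, not the authors' procedure: the entropy-based uncertainty, the per-image `descriptor` vectors, the `trade_off` weight, and all function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a detection confidence p in [0, 1]."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def image_uncertainty(confidences):
    """Mean entropy over an image's detections; no detections scores 0."""
    if not confidences:
        return 0.0
    return sum(binary_entropy(p) for p in confidences) / len(confidences)


def select_batch(pool, budget, trade_off=1.0):
    """Greedily pick images that are uncertain and far from those already picked.

    pool maps image_id -> (detection confidences, image descriptor).
    Only detector outputs are used, never model weights or gradients.
    """
    selected = []
    remaining = dict(pool)
    while remaining and len(selected) < budget:

        def score(item):
            _, (confidences, descriptor) = item
            uncertainty = image_uncertainty(confidences)
            if not selected:
                return uncertainty
            diversity = min(math.dist(descriptor, pool[s][1]) for s in selected)
            return uncertainty + trade_off * diversity

        best_id, _ = max(remaining.items(), key=score)
        selected.append(best_id)
        del remaining[best_id]
    return selected


pool = {
    "a": ([0.5], (0.0, 0.0)),
    "b": ([0.6], (0.0, 0.1)),  # uncertain, but nearly a duplicate of "a"
    "c": ([0.9], (5.0, 5.0)),  # confident, but very different from "a"
}
chosen = select_batch(pool, budget=2)
```

The toy pool shows the intended effect of the diversity term: after the most uncertain image is chosen, a near-duplicate is skipped in favour of a dissimilar image.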
[CV-69] ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture
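[Quick read]: This paper aims to address trajectory prediction in multi-agent interaction scenarios, where existing methods model agent interactions without explicit spatio-temporal coordination and capture only obvious, immediate behavioral intentions. The key to the proposed ILNet is the Inverse Learning (IL) attention mechanism and the Dynamic Anchor Selection (DAS) module. IL attention models interactions at neighboring moments through an inverse learning paradigm, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions; the DAS module extracts multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters.
Link: https://arxiv.org/abs/2507.06531
Authors: Mingjin Zeng, Nan Ouyang, Wenkang Wan, Lei Ao, Qing Cai, Kai Sheng
Affiliation: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: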
Abstract:Trajectory prediction for multi-agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and forward modeling lacks explicit interactive spatio-temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines the successive predictions by a fixed-anchor selection strategy, which is difficult to adapt to different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and a Dynamic Anchor Selection (DAS) module. IL Attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions, thereby enhancing the model’s ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets. In particular, in challenging interaction scenarios, ILNet achieves higher accuracy and more multimodal distributions of trajectories with fewer parameters. Our code is available at this https URL.
[CV-70] Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation
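[Quick read]: This paper addresses the conversion of English speech into smooth, realistic 3D sign language animation, a process involving several steps: speech understanding, grammar conversion, and natural human motion generation. The key to the solution is a complete end-to-end pipeline that first uses Whisper to convert English speech into text, then uses a MarianMT machine translation model to translate it into American Sign Language (ASL) gloss, with word embeddings such as Word2Vec and FastText to improve translation accuracy. A 3D keypoint-based motion system, trained on the self-built Sign3D-WLASL dataset, then generates the sign animation, and smooth interpolation yields continuous, natural motion.
Link: https://arxiv.org/abs/2507.06530
Authors: Kazi Mahathir Rahman, Naveed Imtiaz Nafis, Md. Farhan Sadik, Mohammad Al Rafi, Mehedi Hasan Shahed
Affiliation: BRAC University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 12 figures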
Abstract:Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That’s because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to translate spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations. Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.
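The final stitching step, "smoothly interpolating between signs", can be pictured with a small sketch that inserts linearly interpolated keypoint frames between consecutive sign clips. The abstract does not say which interpolation scheme the system uses, so linear blending, the frame layout (a frame is a list of (x, y, z) keypoints), and the function names are assumptions.

```python
def interpolate_frames(start, end, steps):
    """Return `steps` in-between frames, excluding both endpoint frames."""
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # blend factor strictly between 0 and 1
        frames.append(
            [
                tuple(a + t * (b - a) for a, b in zip(p, q))
                for p, q in zip(start, end)
            ]
        )
    return frames


def stitch_signs(clips, steps=2):
    """Concatenate sign clips, bridging each pair with interpolated frames."""
    stitched = list(clips[0])
    for clip in clips[1:]:
        stitched.extend(interpolate_frames(stitched[-1], clip[0], steps))
        stitched.extend(clip)
    return stitched


# Two one-frame "signs", each with a single keypoint, for illustration only.
sign_a = [[(0.0, 0.0, 0.0)]]
sign_b = [[(3.0, 0.0, 0.0)]]
animation = stitch_signs([sign_a, sign_b], steps=2)
```

Linear blending is the simplest choice; a production system might prefer splines or rotation-aware interpolation to avoid unnatural joint motion.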
[CV-71] Concept Unlearning by Modeling Key Steps of Diffusion Process
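[Quick read]: This paper addresses the security risks posed by misuse of text-to-image diffusion models (T2I DMs). Existing concept unlearning methods struggle to balance unlearning effectiveness with retention of generative capability. The key to the solution is Key Step Concept Unlearning (KSCU), which exploits the stepwise sampling characteristic of diffusion models during image generation: it identifies the key steps for each concept unlearning task and fine-tunes the model only at those steps, reducing the number of parameter updates while maximizing retention of the model's generative capability.
Link: https://arxiv.org/abs/2507.06526
Authors: Chaoshuo Zhang, Chenhao Lin, Zhengyu Zhao, Le Yang, Qian Wang, Chao Shen
Affiliation: Xi’an Jiaotong University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: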
Abstract:Text-to-image diffusion models (T2I DMs), represented by Stable Diffusion, generate highly realistic images from textual input and have been widely used. However, their misuse poses serious security risks. While existing concept unlearning methods aim to mitigate these risks, they struggle to balance unlearning effectiveness with retention of generative capability. To overcome this limitation, we propose the Key Step Concept Unlearning (KSCU) method, which capitalizes on the stepwise sampling characteristic inherent in diffusion models during image generation. Unlike conventional approaches that treat all denoising steps equally, KSCU focuses on the pivotal steps with the most influence over the final outcome by dividing key steps for different concept unlearning tasks and fine-tuning the model only at those steps. This targeted approach reduces the number of parameter updates needed for effective unlearning while maximizing the retention of the model’s generative capability. In extensive benchmark experiments, we demonstrate that KSCU effectively prevents T2I DMs from generating undesirable images while better retaining the model’s generative capability. Our code will be released.
[CV-72] What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies
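[Quick read]: This paper addresses the systematic categorization and analysis of visual perception tasks and datasets for traffic scenes, to support road-safety applications. The key to the solution is an integrated taxonomy that divides attention-worthy traffic entities into two groups, anomalies and normal but critical entities, further split into ten categories and twenty subclasses, which connects related fields and provides a unified analytical framework. Using this taxonomy, the paper analyzes and visualizes 35 vision-driven tasks and 73 available datasets, aiming to inform standards unification and resource optimization.
Link: https://arxiv.org/abs/2507.06513
Authors: Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Affiliation: Australian Center for Robotics; The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages, 52 figures, 2 large tables (divided into 5), 73 datasets, 35 tasks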
Abstract:Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups that are anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.
[CV-73] Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection ICCV2025
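[Quick read]: This paper addresses a mismatch in open-vocabulary Human-Object Interaction (HOI) detection: existing methods rely on holistic, coarse-grained visual features produced by large Vision-Language Models (VLMs), whereas detection requires fine-grained, instance-level interaction features. The key to its solution is a Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI) with two core components: Attention Bias Guidance (ABG) and Large Language Model (LLM)-based Supervision Guidance (LSG). ABG uses the attention bias provided by the HOI detector to guide the VLM toward fine-grained, instance-level interaction features, while LSG uses the VLM's LLM component to give the HOI detector fine-grained token-level supervision, which in turn improves ABG's ability to generate high-quality attention bias.
Link: https://arxiv.org/abs/2507.06510
Authors: Yupeng Hu,Changxing Ding,Chang Sun,Shaoli Huang,Xiangmin Xu
Affiliations: South China University of Technology; Tencent AI Lab; Foshan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025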
Abstract:Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all (human, verb, object) triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, in which the LLM component of the VLM provides fine-grained token-level supervision for the HOI detector. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, consistently achieving superior performance in both the open vocabulary and closed settings. The code will be released on GitHub.
[CV-74] Mask6D: Masked Pose Priors For 6D Object Pose Estimation ICASSP2024
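[Quick read]: This paper addresses robust 6D object pose estimation from monocular RGB images under cluttered or occluded conditions. The key to its solution is Mask6D, a pre-training strategy specific to pose estimation that combines pose-aware 2D-3D correspondence maps and visible mask maps, as additional modalities, with RGB images for reconstruction-based pre-training, guiding the model to focus on target pose information and suppress background interference.
Link: https://arxiv.org/abs/2507.06486
Authors: Yuechen Xie,Haobo Jiang,Jin Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICASSP 2024. 4 figures, 3 tables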
Abstract:Robust 6D object pose estimation in cluttered or occluded conditions using monocular RGB images remains a challenging task. One reason is that current pose estimation networks struggle to extract discriminative, pose-aware features using 2D feature backbones, especially when the available RGB information is limited due to target occlusion in cluttered scenes. To mitigate this, we propose a novel pose estimation-specific pre-training strategy named Mask6D. Our approach incorporates pose-aware 2D-3D correspondence maps and visible mask maps as additional modal information, which is combined with RGB images for reconstruction-based model pre-training. Essentially, the 2D-3D correspondence map relates a transformed 3D object model to 2D pixels, reflecting the pose of the target in the camera coordinate system. Meanwhile, the integrated visible mask map can effectively guide our model to disregard cluttered background information. In addition, an object-focused pre-training loss function is designed to further help our network remove background interference. Finally, we fine-tune our pre-trained pose prior-aware network with a conventional pose training strategy to achieve reliable pose prediction. Extensive experiments verify that our method outperforms previous end-to-end pose estimation methods.
[CV-75] 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
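[Quick read]: This paper addresses the limited spatial reasoning ability of current foundation models, which stems mainly from a lack of high-quality data grounded in the 3D world. The key to its solution is to recast 3D environment building as a sequential decision-making problem and to use Vision-Language Models (VLMs) as policies that jointly generate a 3D environment's layout, materials, lighting, and assets. Its core innovation is a self-improvement fine-tuning strategy that trains the VLM to generate higher-quality, more prompt-aligned 3D environments, which then serve as effective training data for foundation models.
Link: https://arxiv.org/abs/2507.06484
Authors: Fan-Yun Sun,Shengguang Wu,Christian Jacobsen,Thomas Yim,Haoming Zou,Alex Zook,Shangru Li,Yu-Hsin Chou,Ethem Can,Xunlei Wu,Clemens Eppner,Valts Blukis,Jonathan Tremblay,Jiajun Wu,Stan Birchfield,Nick Haber
Affiliations: Stanford University; Nvidia Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: project website: this https URL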
Abstract:Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment’s layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.
[CV-76] EA: An Event Autoencoder for High-Speed Vision Sensing
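[Quick read]: This paper addresses the limited performance of traditional frame-based vision systems in dynamic environments, caused by motion blur, high latency, and redundant data processing, as well as the difficulty that sparse and noisy event streams pose for object detection with event cameras. The key to its solution is an event autoencoder architecture that uses convolutional encoding to compress and reconstruct event data while preserving critical spatial and temporal features, combined with adaptive threshold selection and a lightweight classifier to raise recognition accuracy and lower computational complexity.
Link: https://arxiv.org/abs/2507.06459
Authors: Riadul Islam,Joey Mulé,Dhandeep Challagundla,Shahmir Rizvi,Sean Carson
Affiliations: University of Maryland, Baltimore County
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: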
Abstract:High-speed vision sensing is essential for real-time perception in applications such as robotics, autonomous vehicles, and industrial automation. Traditional frame-based vision systems suffer from motion blur, high latency, and redundant data processing, limiting their performance in dynamic environments. Event cameras, which capture asynchronous brightness changes at the pixel level, offer a promising alternative but pose challenges in object detection due to sparse and noisy event streams. To address this, we propose an event autoencoder architecture that efficiently compresses and reconstructs event data while preserving critical spatial and temporal features. The proposed model employs convolutional encoding and incorporates adaptive threshold selection and a lightweight classifier to enhance recognition accuracy while reducing computational complexity. Experimental results on the existing Smart Event Face Dataset (SEFD) demonstrate that our approach achieves comparable accuracy to the YOLO-v4 model while utilizing up to 35.5× fewer parameters. Implementations on embedded platforms, including Raspberry Pi 4B and NVIDIA Jetson Nano, show high frame rates ranging from 8 FPS up to 44.8 FPS. The proposed classifier exhibits up to 87.84× better FPS than the state-of-the-art and significantly improves event-based vision performance, making it ideal for low-power, high-speed applications in real-time edge computing.
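[CV-77] THOR: Thermal-guided Hand-Object Reasoning via Adaptive Vision Sampling
[Quick read]: This paper addresses the problems caused by wearable cameras that capture RGB images continuously: high power consumption, large volumes of redundant video data, privacy concerns, and the heavy computational cost of real-time analysis. The key to its solution is THOR, a method that uses thermal sensing to detect hand-object interaction regions in real time and dynamically adjusts the RGB frame sampling rate according to activity state, reducing unnecessary data capture and processing while maintaining high activity-recognition accuracy.
Link: https://arxiv.org/abs/2507.06442
Authors: Soroush Shahi,Farzad Shahabi,Rama Nabulsi,Glenn Fernandes,Aggelos Katsaggelos,Nabil Alshurafa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: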
Abstract:Wearable cameras are increasingly used as an observational and interventional tool for human behaviors by providing detailed visual data of hand-related activities. This data can be leveraged to facilitate memory recall for logging of behavior or timely interventions aimed at improving health. However, continuous processing of RGB images from these cameras consumes significant power impacting battery lifetime, generates a large volume of unnecessary video data for post-processing, raises privacy concerns, and requires substantial computational resources for real-time analysis. We introduce THOR, a real-time adaptive spatio-temporal RGB frame sampling method that leverages thermal sensing to capture hand-object patches and classify them in real-time. We use low-resolution thermal camera data to identify moments when a person switches from one hand-related activity to another, and adjust the RGB frame sampling rate by increasing it during activity transitions and reducing it during periods of sustained activity. Additionally, we use the thermal cues from the hand to localize the region of interest (i.e., the hand-object interaction) in each RGB frame, allowing the system to crop and process only the necessary part of the image for activity recognition. We develop a wearable device to validate our method through an in-the-wild study with 14 participants and over 30 activities, and further evaluate it on Ego4D (923 participants across 9 countries, totaling 3,670 hours of video). Our results show that using only 3% of the original RGB video data, our method captures all the activity segments, and achieves hand-related activity recognition F1-score (95%) comparable to using the entire RGB video (94%). Our work provides a more practical path for the longitudinal use of wearable cameras to monitor hand-related activities and health-risk behaviors in real time.
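The adaptive sampling idea in this abstract (sample RGB faster during activity transitions, slower during sustained activity) can be illustrated with a minimal sketch. This is not the authors' implementation: the frame-difference transition cue, the function names, and the `low_fps`, `high_fps`, and `threshold` values are all assumptions made for illustration.

```python
import numpy as np

def transition_scores(thermal_frames):
    """Mean absolute change between consecutive low-resolution thermal frames.

    Assumption: frame-to-frame change stands in for the activity-transition
    cue; the abstract does not specify how THOR detects transitions.
    """
    frames = np.asarray(thermal_frames, dtype=float)   # shape (T, H, W)
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    return np.concatenate([[0.0], diffs])              # first frame has no predecessor

def rgb_sampling_rates(scores, low_fps=1.0, high_fps=10.0, threshold=0.5):
    """Sample RGB faster during transitions, slower during sustained activity."""
    return [high_fps if s >= threshold else low_fps for s in scores]

# Toy input: three identical frames, then a jump that persists.
frames = np.zeros((5, 8, 8))
frames[3:] = 1.0
rates = rgb_sampling_rates(transition_scores(frames))
```

In this toy input only the frame where the thermal signal changes is assigned the high rate; a real system would presumably smooth the cue over time rather than threshold single frames.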
[CV-78] Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization
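[Quick read]: This paper addresses the challenge of Temporal Action Localization (TAL): accurately identifying action segments in untrimmed videos and precisely localizing their temporal boundaries. The key to its solution is PCL-Former, a hierarchical multi-stage transformer architecture in which three dedicated modules handle proposal generation (Proposal-Former), action classification (Classification-Former), and temporal boundary localization (Localization-Former), each with a specialized loss function, to capture spatio-temporal features in video and improve localization accuracy.
Link: https://arxiv.org/abs/2507.06411
Authors: Hayat Ullah,Arslan Munir,Oliver Nina
Affiliations: Florida Atlantic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 figures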
Abstract:Inspired by the recent success of transformers and multi-stage architectures in video recognition and object detection, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, which outperforms state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS-14, ActivityNet-1.3, and HACS Segments datasets, respectively.
[CV-79] SImpHAR: Advancing impedance-based human activity recognition using 3D simulation and text-to-motion models
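[Quick read]: This paper addresses the performance limits that scarce labeled data imposes on Human Activity Recognition (HAR) based on bio-impedance sensing. The key to its solution is the SImpHAR framework, which improves data augmentation and training strategy through two core contributions: first, a simulation pipeline that generates realistic bio-impedance signals using shortest-path estimation, soft-body physics, and text-to-motion generation, serving as a digital twin for data augmentation; second, a two-stage decoupled training strategy that broadens activity coverage without requiring label-aligned synthetic data.
Link: https://arxiv.org/abs/2507.06405
Authors: Lala Shakti Swarup Ray,Mengxi Liu,Deepika Gurung,Bo Zhou,Sungho Suh,Paul Lukowicz
Affiliations: DFKI, RPTU Kaiserslautern, Germany; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: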
Abstract:Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation, serving as a digital twin for data augmentation. Second, we design a decoupled two-stage training strategy that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% in accuracy and 21.8% in macro F1 score. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.
[CV-80] Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction
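[Quick read]: This paper addresses the evaluation and comparison of autonomous humanoid robots: conventional success-rate metrics are hard to reproduce and fail to capture the complexity of robot movement trajectories, which is critical in Human-Robot Interaction and Collaboration (HRIC). The key to its solution is a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. Its core innovation is the Neural Meta Evaluator (NeME), a deep learning model that classifies actions from robot joint trajectories and serves as a meta-evaluator for comparing robot control policies, enabling policy evaluation without human involvement.
Link: https://arxiv.org/abs/2507.06404
Authors: Matteo Tiezzi,Tommaso Apicella,Carlos Cardenas-Perez,Giovanni Fregonese,Stefano Dafarra,Pietro Morerio,Daniele Pucci,Alessio Del Bue
Affiliations: Istituto Italiano di Tecnologia
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: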
Abstract:Evaluating and comparing the performance of autonomous Humanoid Robots is challenging, as success rate metrics are difficult to reproduce and fail to capture the complexity of robot movement trajectories, critical in Human-Robot Interaction and Collaboration (HRIC). To address these challenges, we propose a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. We devise the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator to compare the performance of robot control policies, enabling policy evaluation without requiring human involvement in the loop. We validate our framework on ergoCub, a humanoid robot, using teleoperation data and comparing IL methods tailored to the available platform. The experimental results indicate that our method is more aligned with the success rate obtained on the robot than baselines, offering a reproducible, systematic, and insightful means for comparing the performance of multimodal imitation learning approaches in complex HRI tasks.
[CV-81] When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking
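[Quick read]: This paper addresses underwater Multiple Fish Tracking, an area that still lacks systematic study even as multiple object tracking (MOT) has advanced in terrestrial applications. The key to its solution is the Scale-aware and Unscented Tracker (SU-T), a tracking framework designed for underwater environments that combines an Unscented Kalman Filter (UKF) optimized for non-linear fish motion with Fish-Intersection-over-Union (FishIoU) matching, which accounts for the morphological characteristics of aquatic species, thereby improving tracking performance in underwater scenes.
Link: https://arxiv.org/abs/2507.06400
Authors: Weiran Li,Yeqiang Liu,Qiannan Guo,Yijie Wei,Hwa Liang Leo,Zhenbo Li
Affiliations: China Agricultural University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: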
Abstract:Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. We present Multiple Fish Tracking Dataset 2025 (MFT25), the first comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear fish swimming patterns and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. MFT25 establishes a robust foundation for advancing research in underwater tracking systems with important applications in marine biology, aquaculture monitoring, and ecological conservation. The dataset and codes are released at this https URL.
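[CV-82] Secure and Storage-Efficient Deep Learning Models for Edge AI Using Automatic Weight Generation
[Quick read]: This paper addresses the storage and memory footprint of deep learning models, in particular the large number of synaptic weights that must be stored in fully connected (FC) networks and convolutional neural networks (CNNs). The key to its solution is the WINGs framework, which uses principal component analysis (PCA) and lightweight support vector regression (SVR) models to generate FC-layer weights dynamically and compresses low-sensitivity layers in CNNs, substantially reducing memory use without sacrificing accuracy. The design also makes bit-flip attacks easier to detect, improving security.
Link: https://arxiv.org/abs/2507.06380
Authors: Habibur Rahaman,Atri Chatterjee,Swarup Bhunia
Affiliations: University of Florida
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 7 figures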
Abstract:Complex neural networks require substantial memory to store a large number of synaptic weights. This work introduces WINGs (Automatic Weight Generator for Secure and Storage-Efficient Deep Learning Models), a novel framework that dynamically generates layer weights in a fully connected neural network (FC) and compresses the weights in convolutional neural networks (CNNs) during inference, significantly reducing memory requirements without sacrificing accuracy. The WINGs framework uses principal component analysis (PCA) for dimensionality reduction and lightweight support vector regression (SVR) models to predict layer weights in the FC networks, removing the need to store full weight matrices and achieving substantial memory savings. It also preferentially compresses the weights in low-sensitivity layers of CNNs using PCA and SVR with sensitivity analysis. The sensitivity-aware design also offers an added level of security, as any bit-flip attack on weights in compressed layers has an amplified and readily detectable effect on accuracy. WINGs achieves 53x compression for the FC layers, 28x for AlexNet on the MNIST dataset, and 18x for AlexNet on the CIFAR-10 dataset, with 1-2% accuracy loss. This significant reduction in memory results in higher throughput and lower energy for DNN inference, making it attractive for resource-constrained edge applications.
[CV-83] AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions
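[Quick read]: This paper addresses the significant performance degradation of deep neural networks under common corruptions (such as noise, blur, weather, and digital distortions), which limits their reliability in real-world applications. The key to its solution is AR2 (Attention-Guided Repair for Robustness), which explicitly aligns the Class Activation Maps (CAMs) of clean and corrupted images so that the model attends to consistent regions under input perturbations, improving robustness.
Link: https://arxiv.org/abs/2507.06332
Authors: Fuyuan Zhang,Qichen Wang,Jianjun Zhao
Affiliations: Zhejiang University; Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: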
Abstract:Deep neural networks suffer from significant performance degradation when exposed to common corruptions such as noise, blur, weather, and digital distortions, limiting their reliability in real-world applications. In this paper, we propose AR2 (Attention-Guided Repair for Robustness), a simple yet effective method to enhance the corruption robustness of pretrained CNNs. AR2 operates by explicitly aligning the class activation maps (CAMs) between clean and corrupted images, encouraging the model to maintain consistent attention even under input perturbations. Our approach follows an iterative repair strategy that alternates between CAM-guided refinement and standard fine-tuning, without requiring architectural changes. Extensive experiments show that AR2 consistently outperforms existing state-of-the-art methods in restoring robustness on standard corruption benchmarks (CIFAR-10-C, CIFAR-100-C and ImageNet-C), achieving a favorable balance between accuracy on clean data and corruption robustness. These results demonstrate that AR2 provides a robust and scalable solution for enhancing model reliability in real-world environments with diverse corruptions.
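To make the CAM-alignment idea concrete, here is a minimal NumPy sketch. It computes a class activation map in the classic way (a class-weighted sum of the final convolutional feature maps) and penalizes the difference between the maps for a clean and a corrupted input. The mean-squared-error penalty, the min-max normalization, and the function names are assumptions for illustration; the abstract does not state which alignment loss AR2 actually uses.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Class-weighted sum of feature maps, rescaled to [0, 1].

    features:   (K, H, W) final convolutional feature maps
    fc_weights: (C, K) weights of the linear classifier after global pooling
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # -> (H, W)
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def cam_alignment_loss(clean_features, corrupted_features, fc_weights, class_idx):
    """Hypothetical alignment penalty: mean squared difference between CAMs."""
    cam_clean = class_activation_map(clean_features, fc_weights, class_idx)
    cam_corrupt = class_activation_map(corrupted_features, fc_weights, class_idx)
    return float(np.mean((cam_clean - cam_corrupt) ** 2))
```

In a training loop this term would be added to the usual classification loss, so that the model is pushed to attend to the same regions whether or not the input is corrupted.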
[CV-84] Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation
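[Quick read]: This paper addresses the high cost of collecting and annotating image data for training segmentation models in wildland fire science, especially given the lack of reliable public labeled datasets. The key to its solution is Centralized Copy-Paste Data Augmentation (CCPDA), which identifies fire clusters in a source image, applies a centralization technique to focus on the core of the fire area, and pastes the refined fire clusters onto a target image, increasing dataset diversity while preserving the essential characteristics of the fire class. The method focuses on improving segmentation of the fire class, which carries more operational significance than the other classes (fuel, ash, or background).
Link: https://arxiv.org/abs/2507.06321
Authors: Joon Tai Kim,Tianle Chen,Ziyu Dong,Nishanth Kunchala,Alexander Guller,Daniel Ospina Acero,Roger Williams,Mrinal Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures, and under review for AIAA SciTech 2026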
Abstract:Collecting and annotating images for the purpose of training segmentation models is often cost prohibitive. In the domain of wildland fire science, this challenge is further compounded by the scarcity of reliable public datasets with labeled ground truth. This paper presents the Centralized Copy-Paste Data Augmentation (CCPDA) method, for the purpose of assisting with the training of deep-learning multiclass segmentation models, with special focus on improving segmentation outcomes for the fire-class. CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. This method increases dataset diversity while preserving the essential characteristics of the fire class. The effectiveness of this augmentation technique is demonstrated via numerical analysis and comparison against various other augmentation methods using a weighted sum-based multi-objective optimization approach. This approach helps elevate segmentation performance metrics specific to the fire class, which carries significantly more operational significance than other classes (fuel, ash, or background). Numerical performance assessment validates the efficacy of the presented CCPDA method in alleviating the difficulties associated with small, manually labeled training datasets. It also illustrates that CCPDA outperforms other augmentation strategies in the application scenario considered, particularly in improving fire-class segmentation performance.
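The three CCPDA steps listed in this abstract can be sketched roughly as follows. Several simplifications are assumed here and are not taken from the paper: all fire pixels are treated as a single cluster, "centralization" is approximated by keeping only fire pixels inside a shrunken box around the cluster centroid, source and target images have the same shape, and `FIRE` and `keep_fraction` are hypothetical names.

```python
import numpy as np

FIRE = 1  # hypothetical label id for the fire class

def centralized_fire_mask(label, keep_fraction=0.5):
    """Steps (i) and (ii): find fire pixels, then keep only the core region."""
    fire = label == FIRE
    if not fire.any():
        return fire
    rows, cols = np.nonzero(fire)
    r_c, c_c = rows.mean(), cols.mean()                       # cluster centroid
    half_h = keep_fraction * (rows.max() - rows.min() + 1) / 2
    half_w = keep_fraction * (cols.max() - cols.min() + 1) / 2
    rr, cc = np.indices(label.shape)
    core = (np.abs(rr - r_c) <= half_h) & (np.abs(cc - c_c) <= half_w)
    return fire & core

def copy_paste_fire(src_img, src_label, dst_img, dst_label, keep_fraction=0.5):
    """Step (iii): paste the core fire pixels onto copies of the target."""
    mask = centralized_fire_mask(src_label, keep_fraction)
    out_img, out_label = dst_img.copy(), dst_label.copy()
    out_img[mask] = src_img[mask]
    out_label[mask] = FIRE
    return out_img, out_label
```

A fuller version would label connected components separately and randomize the paste location; both are omitted here to keep the sketch short.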
[CV-85] Advancing Offline Handwritten Text Recognition: A Systematic Review of Data Augmentation and Generation Techniques
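[Quick read]: This paper addresses the limited performance of offline Handwritten Text Recognition (HTR) systems when training data is scarce, particularly for low-resource languages and complex scripts. The key to its solution is to improve the accuracy and robustness of HTR systems through data augmentation and generation techniques, including traditional augmentation methods and deep-learning approaches such as Generative Adversarial Networks (GANs), diffusion models, and transformer-based methods.
Link: https://arxiv.org/abs/2507.06275
Authors: Yassin Hussein Rassul,Aram M. Ahmed,Polla Fattah,Bryar A. Hassan,Arwaa W. Abdulkareem,Tarik A. Rashid,Joan Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: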
Abstract:Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.
[CV-86] LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
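[Quick read]: This paper addresses two main problems of large multi-modal models (LMMs) in segmentation and comprehension tasks: inaccurate segmentation and hallucinated comprehension, which stem primarily from weak visual comprehension and a lack of fine-grained perception. The key to its solution is the LIRA framework, built on two core components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which fuses semantic and pixel-level features to improve object attribute inference and thus segmentation accuracy; and (2) Interleaved Local Visual Coupling (ILVC), which extracts local features based on segmentation masks and then autoregressively generates local descriptions, providing fine-grained supervision that mitigates hallucinations.
Link: https://arxiv.org/abs/2507.06272
Authors: Zhang Li,Biao Yang,Qiang Liu,Shuo Zhang,Zhiyin Ma,Shuo Zhang,Liang Yin,Linger Deng,Yabo Sun,Yuliang Liu,Xiang Bai
Affiliations: Huazhong University of Science and Technology; Kingsoft Office
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: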
Abstract:While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the seg token. To quantify this relationship and the model’s potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at this https URL.
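[CV-87] A Probabilistic Approach to Uncertainty Quantification Leveraging 3D Geometry ICCV2025
[Quick read]: This paper addresses uncertainty quantification in neural implicit 3D representations, particularly those based on Signed Distance Functions (SDFs). Existing methods fall short in geometric integration, computational efficiency, and geometric consistency, which leads to inaccurate uncertainty maps. The proposed solution, BayesSDF, uses a Laplace approximation to quantify local surface instability via Hessian-based metrics, giving computationally efficient, surface-aware uncertainty estimates and improving geometric consistency and calibration.
Link: https://arxiv.org/abs/2507.06269
Authors: Rushil Desai,Frederik Warburg,Trevor Darrell,Marissa Ramirez de Chanlatte
Affiliations: University of California, Berkeley; Berkeley Artificial Intelligence Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025 Workshops (8 Pages, 6 Figures, 2 Tables)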
Abstract:Quantifying uncertainty in neural implicit 3D representations, particularly those utilizing Signed Distance Functions (SDFs), remains a substantial challenge due to computational inefficiencies, scalability issues, and geometric inconsistencies. Existing methods typically neglect direct geometric integration, leading to poorly calibrated uncertainty maps. We introduce BayesSDF, a novel probabilistic framework for uncertainty quantification in neural implicit SDF models, motivated by scientific simulation applications with 3D environments (e.g., forests) such as modeling fluid flow through forests, where precise surface geometry and awareness of fidelity surface geometric uncertainty are essential. Unlike radiance-based models such as NeRF or 3D Gaussian splatting, which lack explicit surface formulations, SDFs define continuous and differentiable geometry, making them better suited for physical modeling and analysis. BayesSDF leverages a Laplace approximation to quantify local surface instability via Hessian-based metrics, enabling computationally efficient, surface-aware uncertainty estimation. Our method shows that uncertainty predictions correspond closely with poorly reconstructed geometry, providing actionable confidence measures for downstream use. Extensive evaluations on synthetic and real-world datasets demonstrate that BayesSDF outperforms existing methods in both calibration and geometric consistency, establishing a strong foundation for uncertainty-aware 3D scene reconstruction, simulation, and robotic decision-making.
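For readers unfamiliar with the Laplace approximation mentioned in this abstract, its standard textbook form is given below. The symbols are notation assumed here for illustration: \theta for the network parameters, \theta^{*} for the trained parameters, \mathcal{D} for the training data, and H for the Hessian of the negative log-posterior. The abstract does not say how BayesSDF maps this posterior to per-surface uncertainty, so that step is not shown.

```latex
p(\theta \mid \mathcal{D}) \approx \mathcal{N}\!\left(\theta^{*},\, H^{-1}\right),
\qquad
H = -\nabla_{\theta}^{2} \log p(\theta \mid \mathcal{D}) \,\Big|_{\theta = \theta^{*}}
```

Intuitively, directions in which the log-posterior is sharply curved (large Hessian eigenvalues) get low variance, and flat directions get high variance.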
[CV-88] SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability
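[Quick read]: This paper addresses the incompatible representation spaces that arise when different AI models encode the same high-level concepts (such as objects or attributes), which limits cross-model interpretability. The key to its solution is SPARC (Sparse Autoencoders for Aligned Representation of Concepts), which builds a unified latent space across models and modalities through two core innovations: a Global TopK sparsity mechanism, which ensures that the same concept activates the same latent dimensions in all input streams, and a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models.
Link: https://arxiv.org/abs/2507.06265
Authors: Ali Nasiri-Sarvi,Hassan Rivaz,Mahdi S. Hosseini
Affiliations: Concordia University; Mila–Quebec AI Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: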
Abstract:Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC’s alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at this https URL.
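The Global TopK idea in the abstract can be illustrated with a minimal sketch. It assumes each stream produces a latent vector of the same width and that the streams are aggregated by summation before the top-k indices are chosen; the aggregation rule, the function name `global_topk`, and the example values are assumptions for illustration, not details confirmed by the paper.

```python
def global_topk(stream_latents, k):
    """Keep the same k latent dimensions in every stream.

    stream_latents: list of equal-length lists, one per model or modality.
    Aggregating by summation is an assumption made for this sketch.
    """
    dim = len(stream_latents[0])
    # Aggregate activations across streams, dimension by dimension.
    totals = [sum(z[j] for z in stream_latents) for j in range(dim)]
    # Indices of the k largest aggregated activations.
    keep = set(sorted(range(dim), key=lambda j: totals[j], reverse=True)[:k])
    # Zero every dimension outside the shared index set.
    sparse = [[z[j] if j in keep else 0.0 for j in range(dim)] for z in stream_latents]
    return sparse, sorted(keep)


dino = [0.9, 0.1, 0.0, 0.7, 0.2]  # hypothetical latents from a vision model
clip = [0.8, 0.0, 0.6, 0.5, 0.1]  # hypothetical latents from a multimodal model
sparse, shared = global_topk([dino, clip], k=2)
print(shared)  # indices that stay active in both streams
```

Because the index set is chosen once and applied to every stream, a given latent dimension is either active in all streams or in none, which is the property the abstract attributes to Global TopK.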
[CV-89] Unveiling the Underwater World: CLIP Perception Model-Guided Underwater Image Enhancement
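[Quick Read]: This paper aims to address the image quality degradation caused by light absorption and scattering in Underwater Image Enhancement (UIE), while remedying the failure of existing deep learning methods to account for human visual perception and to constrain the solution space. The key to its solution is a perception loss module based on Contrastive Language-Image Pre-Training (CLIP) together with curriculum contrastive regularization. The visual semantic feature extraction capability of the CLIP model is used to build a quality assessment model that better matches human visual perception, this model serves as a perception loss module in the enhancement network to improve perceptual quality, and curriculum contrastive regularization strengthens the constraints on enhanced images in the CLIP perceptual space, reducing the risk of under-enhancement and over-enhancement.
Link: https://arxiv.org/abs/2507.06234
Authors: Jiangzhong Cao, Zekai Zeng, Xu Zhang, Huan Zhang, Chunling Fan, Gangyi Jiang, Weisi Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures; Accepted to PR 2025; The source code is available at this https URL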
Abstract:High-quality underwater images are essential for both machine vision tasks and viewers with their aesthetic appeal. However, the quality of underwater images is severely affected by light absorption and scattering. Deep learning-based methods for Underwater Image Enhancement (UIE) have achieved good performance. However, these methods often overlook human perception and lack sufficient constraints within the solution space. Consequently, the enhanced images often suffer from diminished perceptual quality or poor content restoration. To address these issues, we propose a UIE method with a Contrastive Language-Image Pre-Training (CLIP) perception loss module and curriculum contrastive regularization. Above all, to develop a perception model for underwater images that aligns more closely with human visual perception, the visual semantic feature extraction capability of the CLIP model is leveraged to learn an appropriate prompt pair to map and evaluate the quality of underwater images. This CLIP perception model is then incorporated as a perception loss module into the enhancement network to improve the perceptual quality of enhanced images. Furthermore, the CLIP perception model is integrated with the curriculum contrastive regularization to enhance the constraints imposed on the enhanced images within the CLIP perceptual space, mitigating the risk of both under-enhancement and over-enhancement. Specifically, the CLIP perception model is employed to assess and categorize the learning difficulty level of negatives in the regularization process, ensuring comprehensive and nuanced utilization of distorted images and negatives with varied quality levels. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability.
[CV-90] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
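[Quick Read]: This paper addresses the limited generalization of large Vision Language Action (VLA) models when they face novel objects or unfamiliar environments. Existing methods improve generalization by integrating extra components such as depth estimation, segmentation, or diffusion, but this adds significant computational overhead and lowers efficiency. The key to its solution is VOTE, an efficient and general framework whose core is a tokenizer-free fine-tuning approach for parallel, accurate action prediction, which reduces computational overhead and speeds up inference, combined with an ensemble voting strategy that improves model performance and generalization.
Link: https://arxiv.org/abs/2507.05116
Authors: Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Affiliations: Northeastern University; EmbodyX Inc
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: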
Abstract:Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In detail, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35× faster inference and 145 Hz throughput. All the details and code will be open-sourced.
[CV-91] Deep Brain Net: An Optimized Deep Learning Model for Brain tumor Detection in MRI Images Using EfficientNetB0 and ResNet50 with Transfer Learning
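[Quick Read]: This paper aims to address the difficulty of achieving both high accuracy and computational efficiency in the automated detection and classification of brain tumors from MRI images. The key to its solution is Deep Brain Net, a system that combines two neural network architectures, EfficientNetB0 and ResNet50, with transfer learning to improve generalization and reduce training time. EfficientNetB0 improves efficiency through mobile inverted bottleneck blocks built on depthwise separable convolutions, while ResNet50 strengthens feature learning through pre-training and fine-tuning and uses residual connections to mitigate the vanishing gradient problem. This integration lets the system maintain high classification accuracy with good computational efficiency.
Link: https://arxiv.org/abs/2507.07011
Authors: Daniel Onah, Ravish Desai
Affiliations: University College London
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 14 figures, 4 tables. To be submitted to a conference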
Abstract:In recent years, deep learning has shown great promise in the automated detection and classification of brain tumors from MRI images. However, achieving high accuracy and computational efficiency remains a challenge. In this research, we propose Deep Brain Net, a novel deep learning system designed to optimize performance in the detection of brain tumors. The model integrates the strengths of two advanced neural network architectures, EfficientNetB0 and ResNet50, combined with transfer learning to improve generalization and reduce training time. The EfficientNetB0 architecture enhances model efficiency by utilizing mobile inverted bottleneck blocks, which incorporate depthwise separable convolutions. This design significantly reduces the number of parameters and computational cost while preserving the ability of models to learn complex feature representations. The ResNet50 architecture, pre-trained on large-scale datasets like ImageNet, is fine-tuned for brain tumor classification. Its use of residual connections allows for training deeper networks by mitigating the vanishing gradient problem and avoiding performance degradation. The integration of these components ensures that the proposed system is both computationally efficient and highly accurate. Extensive experiments performed on publicly available MRI datasets demonstrate that Deep Brain Net consistently outperforms existing state-of-the-art methods in terms of classification accuracy, precision, recall, and computational efficiency. The result is an accuracy of 88 percent, a weighted F1 score of 88.75 percent, and a macro AUC-ROC score of 98.17 percent, which demonstrates the robustness and clinical potential of Deep Brain Net in assisting radiologists with brain tumor diagnosis.
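The parameter saving the abstract attributes to depthwise separable convolutions can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for a standard convolution and for a depthwise convolution followed by a 1×1 pointwise convolution; the layer sizes are illustrative assumptions and are not taken from the paper.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = conv_params(3, 32, 64)
separable = separable_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))
```

For this hypothetical layer the separable version needs roughly one eighth of the weights, which is the kind of reduction the abstract refers to.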
[CV-92] SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction
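[Quick Read]: This paper aims to address accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data, which is crucial for neuroanatomical analysis. Existing methods must contend with complex cortical geometry and strict topological requirements, and the surfaces they generate often contain overlaps, self-intersections, and topological defects. The key to its solution is SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted (T1w) MRI volumes while preserving topological properties. The method first segments the T1w image into a nine-class tissue label map, generates collision-free, subject-specific initial surface meshes, and then applies multiscale diffeomorphic deformations using stationary velocity fields (SVFs) integrated via scaling-and-squaring, yielding smooth, topology-preserving transformations with markedly fewer surface overlaps and self-intersections.
Link: https://arxiv.org/abs/2507.06955
Authors: Kaveh Moradkhani, R Jarrett Rushmore, Sylvain Bouix
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: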
Abstract:Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries, strict topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted(T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
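Scaling-and-squaring, the integration scheme named in the abstract, can be shown on a toy 1D grid. The sketch below is not the SimCortex implementation: it assumes a 1D domain, linear interpolation, and a hypothetical velocity field, and only illustrates how a stationary velocity field is turned into a deformation by repeated self-composition.

```python
import math


def interp(values, step, x):
    """Linearly interpolate samples taken at 0, step, 2*step, ... (clamped at the ends)."""
    pos = min(max(x / step, 0.0), len(values) - 1.0)
    i = min(int(pos), len(values) - 2)
    t = pos - i
    return (1.0 - t) * values[i] + t * values[i + 1]


def exp_svf(velocity, step, squarings=6):
    """Approximate the displacement of exp(v) by scaling and squaring."""
    # Scaling: start from the small displacement v / 2**squarings.
    disp = [v / (2 ** squarings) for v in velocity]
    # Squaring: compose the map with itself, phi <- phi o phi.
    for _ in range(squarings):
        disp = [disp[i] + interp(disp, step, i * step + disp[i]) for i in range(len(disp))]
    return disp


n, h = 21, 0.05  # hypothetical grid on [0, 1]
velocity = [0.2 * math.sin(math.pi * i * h) for i in range(n)]
disp = exp_svf(velocity, h)
warped = [i * h + disp[i] for i in range(n)]
print(all(b > a for a, b in zip(warped, warped[1:])))  # order of grid points is preserved
```

In this toy setting the warped grid points keep their order, the 1D analogue of a deformation that does not fold; this is the property that makes the approach attractive for avoiding self-intersections.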
[CV-93] Conformal Prediction for Long-Tailed Classification
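[Quick Read]: This paper addresses the trade-off between class-conditional coverage and prediction set size in classification under long-tailed distributions. In the long-tailed setting, existing conformal prediction methods force practitioners into a binary choice between small sets with poor class-conditional coverage and large sets with good class-conditional coverage. The key to its solution is a prevalence-adjusted softmax conformal score function that targets macro-coverage, together with a label-weighted conformal prediction method that trades off smoothly between marginal and class-conditional coverage.
Link: https://arxiv.org/abs/2507.06867
Authors: Tiffany Ding, Jean-Baptiste Fermanian, Joseph Salmon
Affiliations: University of California, Berkeley; IMAG; IROKO; Univ Montpellier; Inria; CNRS
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: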
Abstract:Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets with very good class-conditional coverage but that are extremely large. We propose methods with guaranteed marginal coverage that smoothly trade off between set size and class-conditional coverage. First, we propose a conformal score function, prevalence-adjusted softmax, that targets a relaxed notion of class-conditional coverage called macro-coverage. Second, we propose a label-weighted conformal prediction method that allows us to interpolate between marginal and class-conditional conformal prediction. We demonstrate our methods on Pl@ntNet and iNaturalist, two long-tailed image datasets with 1,081 and 8,142 classes, respectively.
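The binary choice described in the abstract can be reproduced with a small sketch of standard split conformal prediction. It shows only the two extremes (one marginal threshold versus one threshold per class), not the paper's prevalence-adjusted softmax or label-weighted method; the calibration data are made up, and giving an infinite threshold to classes with too few calibration examples is a conservative assumption of this sketch.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest score, or inf if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def calibrate(probs, labels, alpha, classwise=False):
    """Compute one marginal threshold, or a dict of per-class thresholds."""
    scores = [1.0 - p[y] for p, y in zip(probs, labels)]  # score = 1 - softmax of true label
    if not classwise:
        return conformal_quantile(scores, alpha)
    per_class = {}
    for s, y in zip(scores, labels):
        per_class.setdefault(y, []).append(s)
    return {y: conformal_quantile(s, alpha) for y, s in per_class.items()}


def prediction_set(p, threshold):
    """Labels whose score is within the threshold (unseen classes are always included)."""
    def limit(y):
        return threshold.get(y, math.inf) if isinstance(threshold, dict) else threshold
    return [y for y in range(len(p)) if 1.0 - p[y] <= limit(y)]


# Hypothetical long-tailed calibration set: class 0 is common, class 2 rare, class 1 unseen.
cal_probs = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
             [0.85, 0.1, 0.05], [0.6, 0.1, 0.3], [0.5, 0.1, 0.4]]
cal_labels = [0, 0, 0, 0, 0, 2, 2]
test_p = [0.7, 0.05, 0.25]

marginal_set = prediction_set(test_p, calibrate(cal_probs, cal_labels, alpha=0.25))
classwise_set = prediction_set(test_p, calibrate(cal_probs, cal_labels, alpha=0.25, classwise=True))
print(marginal_set, classwise_set)
```

In this toy example the marginal procedure returns a small set that leaves out the rare class, while the class-conditional procedure returns every label; the methods in the paper are described as interpolating between these two behaviours.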
[CV-94] Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data
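[Quick Read]: This paper aims to address the removal of tissue-dependent speckle noise in medical ultrasound imaging, which arises from complex wave interference and to which conventional methods cannot be applied directly. The key to its solution is Speckle2Self, a self-supervised algorithm that introduces a multi-scale perturbation (MSP) operation to induce tissue-dependent variations in the speckle pattern across scales while preserving the shared anatomical structure, so that the clean image can be modeled as a low-rank signal and the sparse noise component isolated.
Link: https://arxiv.org/abs/2507.06828
Authors: Xuesong Li, Nassir Navab, Zhongliang Jiang
Affiliations: Technical University of Munich; Johns Hopkins University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: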
Abstract:Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Code and datasets will be released upon acceptance.
[CV-95] Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers
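[Quick Read]: This paper addresses the efficient training of deep imaging networks in an unsupervised setting, in particular without ground-truth data. The key to its solution is Fast Equivariant Imaging (FEI), a new unsupervised learning framework that reformulates the Equivariant Imaging optimization problem via the method of Lagrange multipliers and incorporates plug-and-play (PnP) denoisers, achieving higher efficiency and performance than the standard Equivariant Imaging paradigm. Specifically, for X-ray CT reconstruction on the CT100 dataset, the PnP-FEI scheme trains a U-Net an order of magnitude (10x) faster than standard Equivariant Imaging and improves generalization performance.
Link: https://arxiv.org/abs/2507.06764
Authors: Guixian Xu, Jinglai Li, Junqi Tang
Affiliations: University of Birmingham
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: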
Abstract:We propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to efficiently train deep imaging networks without ground-truth data. From the perspective of reformulating the Equivariant Imaging based optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this novel unsupervised scheme shows superior efficiency and performance compared to vanilla Equivariant Imaging paradigm. In particular, our PnP-FEI scheme achieves an order-of-magnitude (10x) acceleration over standard EI on training U-Net with CT100 dataset for X-ray CT reconstruction, with improved generalization performance.
[CV-96] Airway Segmentation Network for Enhanced Tubular Feature Extraction
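[Quick Read]: This paper aims to address two problems: manual annotation of airway regions in medical images is time-consuming and depends on expertise, and conventional convolution methods struggle to capture fine structures in the tree-like airway, which leads to missed segments and discontinuities. The key to its solution is TfeNet, a new tubular feature extraction network that introduces a direction-aware convolution operation, which applies spatial rotation transformations to adjust the sampling positions of linear convolution kernels and represents the deformed kernels as line segments or polylines in 3D space. It also includes a tubular feature fusion module (TFFM), based on asymmetric convolution and residual connection strategies, that sharpens the focus on subtle airway structures.
Link: https://arxiv.org/abs/2507.06581
Authors: Qibiao Wu, Yagang Wang, Qian Zhang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: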
Abstract:Manual annotation of airway regions in computed tomography images is a time-consuming and expertise-dependent task. Automatic airway segmentation is therefore a prerequisite for enabling rapid bronchoscopic navigation and the clinical deployment of bronchoscopic robotic systems. Although convolutional neural network methods have gained considerable attention in airway segmentation, the unique tree-like structure of airways poses challenges for conventional and deformable convolutions, which often fail to focus on fine airway structures, leading to missed segments and discontinuities. To address this issue, this study proposes a novel tubular feature extraction network, named TfeNet. TfeNet introduces a novel direction-aware convolution operation that first applies spatial rotation transformations to adjust the sampling positions of linear convolution kernels. The deformed kernels are then represented as line segments or polylines in 3D space. Furthermore, a tubular feature fusion module (TFFM) is designed based on asymmetric convolution and residual connection strategies, enhancing the network’s focus on subtle airway structures. Extensive experiments conducted on one public dataset and two datasets used in airway segmentation challenges demonstrate that the proposed TfeNet achieves more accurate and continuous airway structure predictions compared with existing methods. In particular, TfeNet achieves the highest overall score of 94.95% on the current largest airway segmentation dataset, Airway Tree Modeling (ATM22), and demonstrates advanced performance on the lung fibrosis dataset (AIIB23). The code is available at this https URL.
[CV-97] PAST: A multimodal single-cell foundation model for histopathology and spatial transcriptomics in cancer
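[Quick Read]: This paper aims to address the lack of integration between pathology foundation models and molecular data at single-cell resolution, which limits their use in precision oncology. The key to its solution is PAST, a pan-cancer single-cell foundation model trained on 20 million paired histopathology images and single-cell transcriptomes, which jointly encodes cellular morphology and gene expression to learn unified cross-modal representations that capture spatial and molecular heterogeneity at the cellular level.
Link: https://arxiv.org/abs/2507.06418
Authors: Changchun Yang, Haoyang Li, Yushuai Wu, Yilan Zhang, Yifeng Jiao, Yu Zhang, Rihan Huang, Yuan Cheng, Yuan Qi, Xin Guo, Xin Gao
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: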
Abstract:While pathology foundation models have transformed cancer image analysis, they often lack integration with molecular data at single-cell resolution, limiting their utility for precision oncology. Here, we present PAST, a pan-cancer single-cell foundation model trained on 20 million paired histopathology images and single-cell transcriptomes spanning multiple tumor types and tissue contexts. By jointly encoding cellular morphology and gene expression, PAST learns unified cross-modal representations that capture both spatial and molecular heterogeneity at the cellular level. This approach enables accurate prediction of single-cell gene expression, virtual molecular staining, and multimodal survival analysis directly from routine pathology slides. Across diverse cancers and downstream tasks, PAST consistently exceeds the performance of existing approaches, demonstrating robust generalizability and scalability. Our work establishes a new paradigm for pathology foundation models, providing a versatile tool for high-resolution spatial omics, mechanistic discovery, and precision cancer research.
[CV-98] Capsule-ConvKAN: A Hybrid Neural Approach to Medical Image Classification
[Quick Read]: This paper aims to address the limited ability of traditional convolutional neural networks to capture spatial patterns and handle complex features in medical image classification. The key to its solution is a new hybrid architecture, the Capsule-Convolutional Kolmogorov–Arnold Network (Capsule-ConvKAN), which combines the dynamic routing and spatial hierarchy capabilities of the Capsule Network with the flexible and interpretable function approximation of the Convolutional Kolmogorov–Arnold Network, improving feature representation and classification accuracy.
Link: https://arxiv.org/abs/2507.06417
Authors: Laura Pituková, Peter Sinčák, László József Kovács
Affiliations: Technical University of Košice; Institute of Information Science, University of Miskolc
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint version. Accepted to IEEE SMC 2025
Abstract:This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov–Arnold Network, and the newly proposed Capsule–Convolutional Kolmogorov–Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov–Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.
[CV-99] Attention-Enhanced Deep Learning Ensemble for Breast Density Classification in Mammography
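[Quick Read]: This paper aims to automate breast density assessment, focusing on the challenge that high breast density (BI-RADS categories C and D) poses for breast cancer risk assessment and tumor detection. The key to its solution is a robust deep learning binary classification system that uses four convolutional neural network architectures (ResNet18, ResNet50, EfficientNet-B0, and DenseNet121) enhanced with channel attention mechanisms, a new loss function that combines focal loss, label smoothing, and class-balanced weighting to handle class imbalance, and an optimized ensemble voting approach to improve overall performance.
Link: https://arxiv.org/abs/2507.06410
Authors: Peyman Sharifian, Xiaotong Hong, Alireza Karimian, Mehdi Amini, Hossein Arabi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE Nuclear Science Symposium, Medical Imaging Conference and Room Temperature Semiconductor Detector Conference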
Abstract:Breast density assessment is a crucial component of mammographic interpretation, with high breast density (BI-RADS categories C and D) representing both a significant risk factor for developing breast cancer and a technical challenge for tumor detection. This study proposes an automated deep learning system for robust binary classification of breast density (low: A/B vs. high: C/D) using the VinDr-Mammo dataset. We implemented and compared four advanced convolutional neural networks: ResNet18, ResNet50, EfficientNet-B0, and DenseNet121, each enhanced with channel attention mechanisms. To address the inherent class imbalance, we developed a novel Combined Focal Label Smoothing Loss function that integrates focal loss, label smoothing, and class-balanced weighting. Our preprocessing pipeline incorporated advanced techniques, including contrast-limited adaptive histogram equalization (CLAHE) and comprehensive data augmentation. The individual models were combined through an optimized ensemble voting approach, achieving superior performance (AUC: 0.963, F1-score: 0.952) compared to any single model. This system demonstrates significant potential to standardize density assessments in clinical practice, potentially improving screening efficiency and early cancer detection rates while reducing inter-observer variability among radiologists.
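The abstract names three ingredients of the loss (focal loss, label smoothing, class-balanced weighting) without giving a formula. The sketch below shows one plausible way to combine them for a single example; the exact combination, the function name `combined_loss`, and the default hyperparameters are assumptions for illustration and may differ from the paper's Combined Focal Label Smoothing Loss.

```python
import math


def combined_loss(probs, target, class_weights, gamma=2.0, smoothing=0.1):
    """Focal-modulated cross-entropy against a smoothed target, scaled by a class weight.

    probs: predicted class probabilities (all strictly positive).
    The way the three terms are combined here is an assumption of this sketch.
    """
    k = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        # Label smoothing: spread a little target mass over all classes.
        q = (1.0 - smoothing) * (1.0 if c == target else 0.0) + smoothing / k
        # Focal modulation: (1 - p) ** gamma down-weights confident predictions.
        loss -= q * (1.0 - p) ** gamma * math.log(p)
    # Class-balanced weighting: scale by the weight of the true class.
    return class_weights[target] * loss


weights = [1.0, 3.0]  # hypothetical weights; class 1 is the minority class
easy = combined_loss([0.1, 0.9], target=1, class_weights=weights)
hard = combined_loss([0.6, 0.4], target=1, class_weights=weights)
print(easy < hard)  # well-classified examples contribute less
```

With `gamma=0`, `smoothing=0`, and unit weights the expression reduces to ordinary cross-entropy, which is a quick sanity check on the sketch.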
[CV-100] Mitigating Multi-Sequence 3D Prostate MRI Data Scarcity through Domain Adaptation using Locally-Trained Latent Diffusion Models for Prostate Cancer Detection
[Quick Read]: This paper aims to address data scarcity in medical image analysis and the limitations of existing generative AI models in inter-institutional domain adaptation and multi-sequence MRI generation. The key to its solution is CCELLA++, a model that simultaneously generates biparametric prostate MRI (bpMRI), comprising the axial T2-weighted (AxT2) sequence, the high b-value diffusion series (HighB), and the apparent diffusion coefficient map (ADC). Classifiers are pretrained on real or synthetic data from an internal institution and then fine-tuned on progressively smaller out-of-distribution external datasets to improve their generalization and performance.
Link: https://arxiv.org/abs/2507.06384
Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider
Affiliations: Institute of Biomedical Engineering, University of Toronto; Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital; KITE Research Institute, Toronto Rehabilitation Institute, University Health Network; Department of Computer Science, University of Toronto; Faculty Affiliate of the Vector Institute, Toronto; Joint Department of Medical Imaging, University of Toronto, Princess Margaret Hospital, and Sinai Health systems
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: BT and MAH are co-senior authors on the work. This work has been submitted to the IEEE for possible publication
Abstract:Objective: Latent diffusion models (LDMs) could mitigate data scarcity challenges affecting machine learning development for medical image interpretation. The recent CCELLA LDM improved prostate cancer detection performance using synthetic MRI for classifier training but was limited to the axial T2-weighted (AxT2) sequence, did not investigate inter-institutional domain shift, and prioritized radiology over histopathology outcomes. We propose CCELLA++ to address these limitations and improve clinical utility. Methods: CCELLA++ expands CCELLA for simultaneous biparametric prostate MRI (bpMRI) generation, including the AxT2, high b-value diffusion series (HighB) and apparent diffusion coefficient map (ADC). Domain adaptation was investigated by pretraining classifiers on real or LDM-generated synthetic data from an internal institution, followed with fine-tuning on progressively smaller fractions of an out-of-distribution, external dataset. Results: CCELLA++ improved 3D FID for HighB and ADC but not AxT2 (0.013, 0.012, 0.063 respectively) sequences compared to CCELLA (0.060). Classifier pretraining with CCELLA++ bpMRI outperformed real bpMRI in AP and AUC for all domain adaptation scenarios. CCELLA++ pretraining achieved highest classifier performance below 50% (n=665) external dataset volume. Conclusion: Synthetic bpMRI generated by our method can improve downstream classifier generalization and performance beyond real bpMRI or CCELLA-generated AxT2-only images. Future work should seek to quantify medical image sample quality, balance multi-sequence LDM training, and condition the LDM with additional information. Significance: The proposed CCELLA++ LDM can generate synthetic bpMRI that outperforms real data for domain adaptation with a limited target institution dataset. Our code is available at this https URL
[CV-101] Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation
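[Quick Read]: This paper aims to address efficient handling of data variability across imaging modalities in 3D medical image segmentation. The key to its solution is Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer built on the Mamba state-space model (SSM) architecture that strengthens sequence modeling through sparse, adaptive expert routing. HoME first uses a Soft Mixture-of-Experts (SMoE) layer to partition the input sequence into local groups for localized feature extraction, then aggregates the outputs through a global SMoE layer for cross-group information fusion and global context refinement, improving generalization and segmentation performance.
Link: https://arxiv.org/abs/2507.06363
Authors: Szymon Płotka, Maciej Chrabaszcz, Gizem Mert, Ewa Szczurek, Arkadiusz Sitek
Affiliations: University of Warsaw; Warsaw University of Technology; Institute of AI for Health, Helmholtz Munich; Massachusetts General Hospital; Harvard Medical School
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: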
Abstract:In recent years, artificial intelligence has significantly advanced medical image segmentation. However, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba state-space model (SSM) backbone, HoME enhances sequential modeling through sparse, adaptive expert routing. The first stage employs a Soft Mixture-of-Experts (SMoE) layer to partition input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second stage aggregates these outputs via a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement improves generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most commonly used 3D medical imaging modalities and data quality.
[CV-102] X-ray transferable polyrepresentation learning
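[Quick Read]: This paper addresses challenges in feature extraction and data representation quality for machine learning algorithms, in particular how to generalize and extract meaningful features from unseen datasets. The key to its solution is a new concept called "polyrepresentation", which integrates multiple representations of the same modality drawn from different sources, such as vector embeddings from a Siamese network, self-supervised models, and interpretable radiomic features, improving performance metrics and the model's transferability across datasets.
Link: https://arxiv.org/abs/2507.06264
Authors: Weronika Hryniewska-Guzik, Przemyslaw Biecek
Affiliations: Warsaw University of Technology; University of Warsaw
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: part of Weronika’s PhD thesis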
Abstract:The success of machine learning algorithms is inherently related to the extraction of meaningful features, as they play a pivotal role in the performance of these algorithms. Central to this challenge is the quality of data representation. However, the ability to generalize and extract these features effectively from unseen datasets is also crucial. In light of this, we introduce a novel concept: the polyrepresentation. Polyrepresentation integrates multiple representations of the same modality extracted from distinct sources, for example, vector embeddings from the Siamese Network, self-supervised models, and interpretable radiomic features. This approach yields better performance metrics compared to relying on a single representation. Additionally, in the context of X-ray images, we demonstrate the transferability of the created polyrepresentation to a smaller dataset, underscoring its potential as a pragmatic and resource-efficient approach in various image-related solutions. It is worth noting that the concept of polyrepresentation on the example of medical data can also be applied to other domains, showcasing its versatility and broad potential impact.
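One simple way to picture a polyrepresentation is as a fused feature vector built from several per-source representations of the same image. The sketch below assumes fusion by L2-normalizing each representation and concatenating them; the abstract does not state the fusion rule, so this choice, the function names, and the example vectors are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def polyrepresentation(*representations):
    """Concatenate per-source representations after normalizing each one."""
    fused = []
    for rep in representations:
        fused.extend(l2_normalize(rep))
    return fused


siamese_embedding = [3.0, 4.0]         # hypothetical Siamese-network embedding
ssl_embedding = [1.0, 0.0, 0.0]        # hypothetical self-supervised embedding
radiomic_features = [2.0, 2.0, 1.0]    # hypothetical radiomic features
fused = polyrepresentation(siamese_embedding, ssl_embedding, radiomic_features)
print(len(fused))
```

Normalizing before concatenation keeps one source from dominating the fused vector purely because of its scale; a downstream classifier would then be trained on `fused`.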
Artificial Intelligence
[AI-0] Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach
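[Quick Read]: This paper aims to address the shortcomings of traditional acoustic mapping methods in computational efficiency and sensitivity to acoustic variability, as well as the data dependence and limited interpretability of supervised deep learning methods. The key to its solution is the self-supervised Latent Acoustic Mapping (LAM) model, which combines the interpretability of traditional methods with the adaptability and efficiency of deep learning, generates high-resolution acoustic maps, and operates efficiently across different microphone arrays and acoustic conditions.
Link: https://arxiv.org/abs/2507.07066
Authors: Adrian S. Roman, Iran R. Roman, Juan P. Bello
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: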
Abstract:Acoustic mapping techniques have long been used in spatial audio processing for direction of arrival estimation (DoAE). Traditional beamforming methods for acoustic mapping, while interpretable, often rely on iterative solvers that can be computationally intensive and sensitive to acoustic variability. On the other hand, recent supervised deep learning approaches offer feedforward speed and robustness but require large labeled datasets and lack interpretability. Despite their strengths, both methods struggle to consistently generalize across diverse acoustic setups and array configurations, limiting their broader applicability. We introduce the Latent Acoustic Mapping (LAM) model, a self-supervised framework that bridges the interpretability of traditional methods with the adaptability and efficiency of deep learning methods. LAM generates high-resolution acoustic maps, adapts to varying acoustic conditions, and operates efficiently across different microphone arrays. We assess its robustness on DoAE using the LOCATA and STARSS benchmarks. LAM achieves comparable or superior localization performance to existing supervised methods. Additionally, we show that LAM’s acoustic maps can serve as effective features for supervised models, further enhancing DoAE accuracy and underscoring its potential to advance adaptive, high-performance sound localization systems.
[AI-1] Comparative Analysis of CNN and Transformer Architectures with Heart Cycle Normalization for Automated Phonocardiogram Classification
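[Quick read]: This paper addresses automated phonocardiogram (PCG) classification for cardiovascular diagnostics, focusing on model performance for heart murmur detection. The key to its solution is a comparison of deep learning models under fixed-length windowing and heart cycle normalization, together with a custom normalization method tailored to individual cardiac rhythms. By comparing two convolutional neural networks (CNNs) and two zero-shot universal audio transformers (BEATs), the study shows how the normalization strategy affects model performance and provides empirical guidance for choosing an architecture in clinical settings.
Link: https://arxiv.org/abs/2507.07058
Authors: Martin Sondermann, Pinar Bisgin, Niklas Tschorn, Anja Burmann, Christoph M. Friedrich
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Preprint Version. Accepted at EMBC 2025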
Abstract:The automated classification of phonocardiogram (PCG) recordings represents a substantial advancement in cardiovascular diagnostics. This paper presents a systematic comparison of four distinct models for heart murmur detection: two specialized convolutional neural networks (CNNs) and two zero-shot universal audio transformers (BEATs), evaluated using fixed-length and heart cycle normalization approaches. Utilizing the PhysioNet2022 dataset, a custom heart cycle normalization method tailored to individual cardiac rhythms is introduced. The findings indicate the following AUROC values: the CNN model with fixed-length windowing achieves 79.5%, the CNN model with heart cycle normalization scores 75.4%, the BEATs transformer with fixed-length windowing achieves 65.7%, and the BEATs transformer with heart cycle normalization results in 70.1%. The findings indicate that physiological signal constraints, especially those introduced by different normalization strategies, have a substantial impact on model performance. The research provides evidence-based guidelines for architecture selection in clinical settings, emphasizing the need for a balance between accuracy and computational efficiency. Although specialized CNNs demonstrate superior performance overall, the zero-shot transformer models may offer promising efficiency advantages during development, such as faster training and evaluation cycles, despite their lower classification accuracy. These findings highlight the potential of automated classification systems to enhance cardiac diagnostics and improve patient care.
[AI-2] A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering
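[Quick read]: This paper addresses the generalization and robustness of speech emotion recognition (SER) models across datasets. The key to its solution is the DCRF-BiLSTM architecture, which combines a bidirectional long short-term memory network (BiLSTM) with a dynamic conditional random field (DCRF). The model reaches high emotion classification accuracy on several benchmark datasets, including in the combined cross-dataset setting.
Link: https://arxiv.org/abs/2507.07046
Authors: Shahana Yasmin Chowdhury, Bithi Banik, Md Tamjidul Hoque, Shreya Banerjee
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 17 pages, 11 figures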
Abstract:Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model is used to recognize seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, which are trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and Crema-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% for CREMA-D, and a perfect 100% on both TESS and EMO-DB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve a remarkable overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
[AI-3] Advances in Intelligent Hearing Aids: Deep Learning Approaches to Selective Noise Cancellation
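[Quick read]: This paper addresses speech enhancement in noisy environments, where traditional hearing aids fall short, by using artificial intelligence for more intelligent, context-aware audio processing. The key is the use of deep learning architectures, in particular convolutional recurrent networks and Transformer-based models, for efficient real-time selective noise cancellation (SNC), which substantially improves speech separation in noisy, reverberant environments.
Link: https://arxiv.org/abs/2507.07043
Authors: Haris Khan, Shumaila Asif, Hassan Nasir
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: 22 pages, 4 figures, submitted as a systematic literature review in AI-based hearing assistance. (June 2025)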
Abstract:The integration of artificial intelligence into hearing assistance marks a paradigm shift from traditional amplification-based systems to intelligent, context-aware audio processing. This systematic literature review evaluates advances in AI-driven selective noise cancellation (SNC) for hearing aids, highlighting technological evolution, implementation challenges, and future research directions. We synthesize findings across deep learning architectures, hardware deployment strategies, clinical validation studies, and user-centric design. The review traces progress from early machine learning models to state-of-the-art deep networks, including Convolutional Recurrent Networks for real-time inference and Transformer-based architectures for high-accuracy separation. Key findings include significant gains over traditional methods, with recent models achieving up to 18.3 dB SI-SDR improvement on noisy-reverberant benchmarks, alongside sub-10 ms real-time implementations and promising clinical outcomes. Yet, challenges remain in bridging lab-grade models with real-world deployment - particularly around power constraints, environmental variability, and personalization. Identified research gaps include hardware-software co-design, standardized evaluation protocols, and regulatory considerations for AI-enhanced hearing devices. Future work must prioritize lightweight models, continual learning, contextual-based classification and clinical translation to realize transformative hearing solutions for millions globally.
[AI-4] Modeling Heterogeneity across Varying Spatial Extents: Discovering Linkages between Sea Ice Retreat and Ice Shelve Melt in the Antarctic
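[Quick read]: This paper addresses the problem of modeling the direct linkage between sea ice retreat and Antarctic ice shelf (AIS) melt, which traditional models struggle to capture in dynamic regions because of spatial heterogeneity and the complexity of localized linkages. The key to its solution is Spatial-Link, a new graph-based framework that builds a spatial graph by Delaunay triangulation of satellite-derived ice change matrices; nodes represent regions of significant change and edges encode proximity and directional consistency, which lets the method quantify spatial heterogeneity and extract statistically validated linkage paths.
Link: https://arxiv.org/abs/2507.07036
Authors: Maloy Kumar Devnath, Sudip Chakraborty, Vandana P. Janeja
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: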
Abstract:Spatial phenomena often exhibit heterogeneity across spatial extents and in proximity, making them complex to model, especially in dynamic regions like ice shelves and sea ice. In this study, we address this challenge by exploring the linkages between sea ice retreat and Antarctic ice shelf (AIS) melt. Although atmospheric forcing and basal melting have been widely studied, the direct impact of sea ice retreat on AIS mass loss remains underexplored. Traditional models treat sea ice and AIS as separate systems, which limits their ability to capture localized linkages and cascading feedback. To overcome this, we propose Spatial-Link, a novel graph-based framework that quantifies spatial heterogeneity to capture linkages between sea ice retreat and AIS melt. Our method constructs a spatial graph using Delaunay triangulation of satellite-derived ice change matrices, where nodes represent regions of significant change and edges encode proximity and directional consistency. We extract and statistically validate linkage paths using breadth-first search and Monte Carlo simulations. Results reveal non-local, spatially heterogeneous coupling patterns, suggesting sea ice loss can initiate or amplify downstream AIS melt. Our analysis shows how sea ice retreat evolves over an oceanic grid and progresses toward ice shelves, establishing a direct linkage. To our knowledge, this is the first proposed methodology linking sea ice retreat to AIS melt. Spatial-Link offers a scalable, data-driven tool to improve sea-level rise projections and inform climate adaptation strategies.
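The breadth-first search step mentioned in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the region names and the hand-written adjacency map are hypothetical, the Delaunay triangulation and Monte Carlo validation steps are omitted, and the paper's actual path-extraction rules are not given in the abstract.

```python
from collections import deque

def bfs_path(adjacency, source, target):
    """Shortest path (fewest edges) from source to target, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical region graph: keys are regions of significant change,
# edges point from a region to neighbouring regions closer to the ice shelf.
region_graph = {
    "sea_ice_A": ["sea_ice_B", "sea_ice_C"],
    "sea_ice_B": ["shelf_edge"],
    "sea_ice_C": ["sea_ice_B"],
    "shelf_edge": ["ice_shelf"],
}
path = bfs_path(region_graph, "sea_ice_A", "ice_shelf")
print(path)
```

Because breadth-first search explores the graph level by level, the first path it finds to the target uses the fewest edges, which is one plausible reading of a "linkage path" between a sea ice region and an ice shelf.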
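[AI-5] PLAME: Leveraging Pretrained Language Models to Generate Enhanced Protein Multiple Sequence Alignments
[Quick read]: This paper addresses the limited performance of existing protein structure prediction models such as AlphaFold on low-homology and orphan proteins, for which multiple sequence alignment (MSA) information is sparse or missing. The key to its solution is PLAME, a model that uses evolutionary embeddings from pretrained protein language models to enhance MSA design and adds a conservation-diversity loss to improve generation quality. Combined with a new MSA selection method and a sequence quality assessment metric, it improves folding performance on low-homology and orphan proteins.
Link: https://arxiv.org/abs/2507.07032
Authors: Hanqun Cao, Xinyi Zhou, Zijun Gao, Chenyu Wang, Xin Gao, Zhi Zhang, Chunbin Gu, Ge Liu, Pheng-Ann Heng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: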
Abstract:Protein structure prediction is essential for drug discovery and understanding biological functions. While recent advancements like AlphaFold have achieved remarkable accuracy, most folding models rely heavily on multiple sequence alignments (MSAs) to boost prediction performance. This dependency limits their effectiveness on low-homology proteins and orphan proteins, where MSA information is sparse or unavailable. To address this limitation, we propose PLAME, a novel MSA design model that leverages evolutionary embeddings from pretrained protein language models. Unlike existing methods, PLAME introduces pretrained representations to enhance evolutionary information and employs a conservation-diversity loss to enhance generation quality. Additionally, we propose a novel MSA selection method to effectively screen high-quality MSAs and improve folding performance. We also propose a sequence quality assessment metric that provides an orthogonal perspective to evaluate MSA quality. On the AlphaFold2 benchmark of low-homology and orphan proteins, PLAME achieves state-of-the-art performance in folding enhancement and sequence quality assessment, with consistent improvements demonstrated on AlphaFold3. Ablation studies validate the effectiveness of the MSA selection method, while extensive case studies on various protein types provide insights into the relationship between AlphaFold’s prediction quality and MSA characteristics. Furthermore, we demonstrate that PLAME can serve as an adapter achieving AlphaFold2-level accuracy with the ESMFold’s inference speed.
[AI-6] First Return Entropy-Eliciting Explore
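[Quick read]: This paper addresses unstable exploration in Reinforcement Learning from Verifiable Rewards (RLVR) for large language models (LLMs). The key to its solution is FR3E (First Return, Entropy-Eliciting Explore), a framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts from them to construct semantically grounded intermediate feedback, yielding structured exploration. The method provides targeted guidance without dense supervision and improves training stability and reasoning quality.
Link: https://arxiv.org/abs/2507.07017
Authors: Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, Zejun Ma
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: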
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework’s effectiveness in improving LLM reasoning through more robust and structured exploration.
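As a rough illustration of what "identifying high-uncertainty decision points" could look like, the sketch below computes the Shannon entropy of each step's next-token distribution and returns the positions with the highest entropy. The function names, the plain top-k selection rule, and the toy distributions are assumptions made for illustration; the abstract does not specify how FR3E selects these points.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_uncertainty_positions(step_distributions, k=2):
    """Return the indices of the k highest-entropy steps (in positional order) and all entropies.

    Hypothetical selection rule: plain top-k by entropy.
    """
    entropies = [token_entropy(d) for d in step_distributions]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k]), entropies

# Toy trajectory: four generation steps over a three-token vocabulary.
trajectory = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # near-uniform: high uncertainty
    [0.90, 0.05, 0.05],  # fairly confident
    [0.50, 0.50, 0.00],  # two-way split
]
positions, entropies = high_uncertainty_positions(trajectory, k=2)
print(positions)
```

In a setup like this, the selected positions would be the natural places to branch additional rollouts, since those are the steps where the model's own distribution is least decided.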
[AI-7] Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing
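[Quick read]: This paper addresses the limits that privacy concerns and regulatory restrictions place on sharing and using electronic health record (EHR) data by generating synthetic EHR datasets. The key is RawMed, described as the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs in structure and temporal dynamics; it uses text-based representation and compression techniques so that only minimal preprocessing is needed, which improves the fidelity and utility of the synthetic data.
Link: https://arxiv.org/abs/2507.06996
Authors: Eunbyeol Cho, Jiyoun Kim, Minjae Lee, Sungjin Park, Edward Choi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: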
Abstract:Electronic Health Records (EHR) are time-series relational databases that record patient interactions and medical events over time, serving as a critical resource for healthcare research and applications. However, privacy concerns and regulatory restrictions limit the sharing and utilization of such sensitive data, necessitating the generation of synthetic EHR datasets. Unlike previous EHR synthesis methods, which typically generate medical records consisting of expert-chosen features (e.g. a few vital signs or structured codes only), we introduce RawMed, the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs. Using text-based representation and compression techniques, RawMed captures complex structures and temporal dynamics with minimal preprocessing. We also propose a new evaluation framework for multi-table time-series synthetic EHRs, assessing distributional similarity, inter-table relationships, temporal dynamics, and privacy. Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at this https URL.
[AI-8] Unifying Re-Identification Attribute Inference and Data Reconstruction Risks in Differential Privacy
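[Quick read]: This paper addresses the difficulty of interpreting and calibrating differential privacy (DP) mechanisms: existing methods for mapping standard privacy parameters to concrete privacy risks (re-identification, attribute inference, and data reconstruction) are both overly pessimistic and inconsistent. The key to its solution is the hypothesis-testing interpretation of DP (f-DP), under which bounds on attack success take the same unified form across the three risks, giving bounds that are consistent across attack settings and tunable.
Link: https://arxiv.org/abs/2507.06969
Authors: Bogdan Kulynych, Juan Felipe Gomez, Georgios Kaissis, Jamie Hayes, Borja Balle, Flavio du Pin Calmon, Jean Louis Raisaro
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: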
Abstract:Differentially private (DP) mechanisms are difficult to interpret and calibrate because existing methods for mapping standard privacy parameters to concrete privacy risks – re-identification, attribute inference, and data reconstruction – are both overly pessimistic and inconsistent. In this work, we use the hypothesis-testing interpretation of DP ($f$-DP), and determine that bounds on attack success can take the same unified form across re-identification, attribute inference, and data reconstruction risks. Our unified bounds are (1) consistent across a multitude of attack settings, and (2) tunable, enabling practitioners to evaluate risk with respect to arbitrary (including worst-case) levels of baseline risk. Empirically, our results are tighter than prior methods using $\varepsilon$-DP, Rényi DP, and concentrated DP. As a result, calibrating noise using our bounds can reduce the required noise by 20% at the same risk level, which yields, e.g., more than 15pp accuracy increase in a text classification task. Overall, this unifying perspective provides a principled framework for interpreting and calibrating the degree of protection in DP against specific levels of re-identification, attribute inference, or data reconstruction risk.
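To make the hypothesis-testing view more concrete, the sketch below evaluates the trade-off function of Gaussian DP, G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), which lower-bounds the type II error of any test that distinguishes two neighbouring datasets at type I error alpha. This is a standard special case of f-DP chosen here for illustration, not the unified bound derived in the paper; the function names and the example values of mu are assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def gaussian_tradeoff(alpha, mu):
    """Trade-off function G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu).

    Lower bound on the type II error of any test with type I error alpha
    (0 < alpha < 1) against a mu-Gaussian-DP mechanism.
    """
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)

def max_true_positive_rate(alpha, mu):
    """Largest attack true-positive rate compatible with false-positive rate alpha."""
    return 1.0 - gaussian_tradeoff(alpha, mu)

# Illustrative privacy levels only; mu = 0 means the attacker cannot beat chance.
for mu in (0.0, 0.5, 1.0, 2.0):
    print(f"mu={mu}: TPR at 5% FPR is at most {max_true_positive_rate(0.05, mu):.3f}")
```

Reading the privacy guarantee as a cap on attack success at a chosen false-positive rate is what makes this interpretation easier to calibrate than a bare privacy parameter.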
[AI-9] Noisy PDE Training Requires Bigger PINNs
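[Quick read]: This paper asks whether physics-informed neural networks (PINNs) can effectively approximate solutions of partial differential equations (PDEs) when the supervision labels are noisy, and under what conditions a PINN can reach low empirical risk. The key result is a lower bound on network size for the empirical risk to fall below the variance of the supervision data: if the empirical risk is $ O(\eta) $ below $ \sigma^2 $, then necessarily $ d_N\log d_N\gtrsim N_s \eta^2 $, where $ N_s $ is the number of samples and $ d_N $ is the number of trainable parameters of the PINN. It follows that merely adding more noisy supervision labels does not reduce empirical risk for free.
Link: https://arxiv.org/abs/2507.06967
Authors: Sebastien Andre-Sloan, Anirbit Mukherjee, Matthew Colbrook
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: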
Abstract:Physics-Informed Neural Networks (PINNs) are increasingly used to approximate solutions of partial differential equations (PDEs), especially in high dimensions. In real-world applications, data samples are noisy, so it is important to know when a predictor can still achieve low empirical risk. However, little is known about the conditions under which a PINN can do so effectively. We prove a lower bound on the size of neural networks required for the supervised PINN empirical risk to fall below the variance of noisy supervision labels. Specifically, if a predictor achieves an empirical risk $ O(\eta) $ below $ \sigma^2 $ (variance of supervision data), then necessarily $ d_N\log d_N\gtrsim N_s \eta^2 $, where $ N_s $ is the number of samples and $ d_N $ is the number of trainable parameters of the PINN. A similar constraint applies to the fully unsupervised PINN setting when boundary labels are sampled noisily. Consequently, increasing the number of noisy supervision labels alone does not provide a “free lunch” in reducing empirical risk. We also show empirically that PINNs can indeed achieve empirical risks below $ \sigma^2 $ under such conditions. As a case study, we investigate PINNs applied to the Hamilton–Jacobi–Bellman (HJB) PDE. Our findings lay the groundwork for quantitatively understanding the parameter requirements for training PINNs in the presence of noise.
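To get a feel for how the condition d_N log d_N ≳ N_s η² scales, the sketch below finds the smallest parameter count that satisfies it for given N_s and η. The hidden constant in the inequality is not stated in the abstract, so the `constant` argument (set to 1.0 here) is purely a placeholder, and the resulting numbers indicate only the scaling, not a real requirement.

```python
import math

def min_parameters(num_samples, eta, constant=1.0):
    """Smallest integer d >= 2 with d * log(d) >= constant * num_samples * eta**2.

    `constant` stands in for the unspecified factor hidden by the
    "greater than or approximately equal" sign; 1.0 is an arbitrary placeholder.
    """
    target = constant * num_samples * eta ** 2
    d = 2
    while d * math.log(d) < target:
        d *= 2  # exponential search for an upper bracket
    lo, hi = max(2, d // 2), d
    while lo < hi:  # binary search: d * log(d) is increasing for d >= 2
        mid = (lo + hi) // 2
        if mid * math.log(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Illustrative sample counts at a fixed eta.
for n in (10_000, 100_000, 1_000_000):
    print(n, min_parameters(n, eta=0.1))
```

With η held fixed, the required parameter count grows almost linearly in the number of noisy samples, which is the sense in which adding labels alone is not free.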
[AI-10] What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models ICML2025
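[Quick read]: This paper addresses how to evaluate whether foundation models truly capture deeper structure, in particular whether their inductive bias is consistent with a postulated world model. The key to its solution is a technique called an inductive bias probe, which examines how a model adapts to synthetic datasets generated from the postulated world model and so measures whether its inductive bias aligns with that world model.
Link: https://arxiv.org/abs/2507.06952
Authors: Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear in ICML 2025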
Abstract:Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler’s predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model’s inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
[AI-11] Beyond Connectivity: An Open Architecture for AI-RAN Convergence in 6G
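[Quick read]: This paper addresses the deployment of data-intensive artificial intelligence (AI) applications at the edge of the radio access network (RAN), which calls for a shift in RAN design from merely using AI for network optimization to actively supporting distributed AI workloads. The key to its solution is a new architecture that converges O-RAN and AI-RAN: an AI-RAN Orchestrator and AI-RAN sites provide unified orchestration and management of telecommunications and AI workloads and support heterogeneous AI deployments with flexible timing and geographic targeting requirements.
Link: https://arxiv.org/abs/2507.06911
Authors: Michele Polese, Niloofar Mohamadi, Salvatore D’Oro, Tommaso Melodia
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Submitted to IEEE for publication, copyright may change without notice. 8 pages, 6 figures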
Abstract:The proliferation of data-intensive Artificial Intelligence (AI) applications at the network edge demands a fundamental shift in RAN design, from merely consuming AI for network optimization, to actively enabling distributed AI workloads. This paradigm shift presents a significant opportunity for network operators to monetize AI at the edge while leveraging existing infrastructure investments. To realize this vision, this article presents a novel converged O-RAN and AI-RAN architecture that unifies orchestration and management of both telecommunications and AI workloads on shared infrastructure. The proposed architecture extends the Open RAN principles of modularity, disaggregation, and cloud-nativeness to support heterogeneous AI deployments. We introduce two key architectural innovations: (i) the AI-RAN Orchestrator, which extends the O-RAN Service Management and Orchestration (SMO) to enable integrated resource and allocation across RAN and AI workloads; and (ii) AI-RAN sites that provide distributed edge AI platforms with real-time processing capabilities. The proposed system supports flexible deployment options, allowing AI workloads to be orchestrated with specific timing requirements (real-time or batch processing) and geographic targeting. The proposed architecture addresses the orchestration requirements for managing heterogeneous workloads at different time scales while maintaining open, standardized interfaces and multi-vendor interoperability.
[AI-12] A Single-Point Measurement Framework for Robust Cyber-Attack Diagnosis in Smart Microgrids Using Dual Fractional-Order Feature Analysis
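[Quick read]: This paper addresses the threat that cyber-attacks pose to the safe operation of smart microgrids, while overcoming the reliance of existing diagnostic methods on expensive multi-point instrumentation or stringent modelling assumptions. The key to its solution is the Fractional-Order Memory-Enhanced Attack-Diagnosis Scheme (FO-MADS), which uses a single VPQ (Voltage-Power-Reactive-power) sensor. It builds a dual fractional-order feature library by jointly applying Caputo and Grünwald-Letnikov derivatives, which amplifies micro-perturbations and slow drifts in the VPQ signal; a hierarchical classifier then performs low-latency fault localisation and cyber-attack detection, and Progressive Memory-Replay Adversarial Training (PMR-AT) improves robustness.
Link: https://arxiv.org/abs/2507.06890
Authors: Yifan Wang
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 8 pages, 10 figures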
Abstract:Cyber-attacks jeopardize the safe operation of smart microgrids. At the same time, existing diagnostic methods either depend on expensive multi-point instrumentation or stringent modelling assumptions that are untenable under single-sensor constraints. This paper proposes a Fractional-Order Memory-Enhanced Attack-Diagnosis Scheme (FO-MADS) that achieves low-latency fault localisation and cyber-attack detection using only one VPQ (Voltage-Power-Reactive-power) sensor. FO-MADS first constructs a dual fractional-order feature library by jointly applying Caputo and Grünwald-Letnikov derivatives, thereby amplifying micro-perturbations and slow drifts in the VPQ signal. A two-stage hierarchical classifier then pinpoints the affected inverter and isolates the faulty IGBT switch, effectively alleviating class imbalance. Robustness is further strengthened through Progressive Memory-Replay Adversarial Training (PMR-AT), whose attack-aware loss is dynamically re-weighted via Online Hard Example Mining (OHEM) to prioritise the most challenging samples. Experiments on a four-inverter microgrid testbed comprising 1 normal and 24 fault classes under four attack scenarios demonstrate diagnostic accuracies of 96.6 % (bias), 94.0 % (noise), 92.8 % (data replacement), and 95.7 % (replay), while sustaining 96.7 % under attack-free conditions. These results establish FO-MADS as a cost-effective and readily deployable solution that markedly enhances the cyber-physical resilience of smart microgrids.
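For readers unfamiliar with fractional-order features, the sketch below shows a textbook discretisation of the Grünwald-Letnikov derivative for a uniformly sampled signal. It covers only one of the two derivatives named in the abstract (the Caputo derivative is not shown), and the function names, the use of the full signal history, and the toy signal are assumptions; the paper's actual feature construction is not described in the abstract.

```python
def gl_weights(alpha, n_terms):
    """Grünwald-Letnikov weights w_k = (-1)^k * binom(alpha, k), via the standard recurrence."""
    weights = [1.0]
    for k in range(1, n_terms):
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights

def gl_derivative(samples, alpha, step):
    """Approximate the order-alpha derivative at every sample of a uniformly sampled signal.

    Uses the full available history at each index; a truncated-memory
    variant would cap the number of terms in the sum.
    """
    weights = gl_weights(alpha, len(samples))
    scale = step ** (-alpha)
    return [
        scale * sum(weights[k] * samples[i - k] for k in range(i + 1))
        for i in range(len(samples))
    ]

# Toy signal t^2 sampled at t = 0, 1, 2, 3 (illustrative only).
signal = [0.0, 1.0, 4.0, 9.0]
half_order = gl_derivative(signal, alpha=0.5, step=1.0)
first_order = gl_derivative(signal, alpha=1.0, step=1.0)
print(half_order)
print(first_order)
```

Two sanity checks follow from the weights: order 0 returns the signal unchanged, and order 1 reduces to the ordinary backward difference. For non-integer orders every past sample contributes, which is the "memory" that fractional-order features exploit.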
[AI-13] Winning and losing with Artificial Intelligence: What public discourse about ChatGPT tells us about how societies make sense of technological change
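[Quick read]: This paper seeks to understand how collective public attention responds to technological change in artificial intelligence (AI) and what shapes that response. Its key step is to analyze public discussion on social media to show how economic interests and cultural values shape how people perceive and engage with AI. Using 3.8 million tweets posted by 1.6 million users after the public launch of ChatGPT in 2022, the study finds that occupational skill type and national cultural dimensions (individualism, uncertainty avoidance, and power distance) significantly shape when people engage, how they engage, and their stance towards ChatGPT.
Link: https://arxiv.org/abs/2507.06876
Authors: Adrian Rauchfleisch, Joshua Philip Suarez, Nikka Marie Sales, Andreas Jungherr
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: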
Abstract:Public product launches in Artificial Intelligence can serve as focusing events for collective attention, surfacing how societies react to technological change. Social media provide a window into the sensemaking around these events, surfacing hopes and fears and showing who chooses to engage in the discourse and when. We demonstrate that public sensemaking about AI is shaped by economic interests and cultural values of those involved. We analyze 3.8 million tweets posted by 1.6 million users across 117 countries in response to the public launch of ChatGPT in 2022. Our analysis shows how economic self-interest, proxied by occupational skill types in writing, programming, and mathematics, and national cultural orientations, as measured by Hofstede’s individualism, uncertainty avoidance, and power distance dimensions, shape who speaks, when they speak, and their stance towards ChatGPT. Roles requiring more technical skills, such as programming and mathematics, tend to engage earlier and express more positive stances, whereas writing-centric occupations join later with greater skepticism. At the cultural level, individualism predicts both earlier engagement and a more negative stance, and uncertainty avoidance reduces the prevalence of positive stances but does not delay when users first engage with ChatGPT. Aggregate sentiment trends mask the dynamics observed in our study. The shift toward a more critical stance towards ChatGPT over time stems primarily from the entry of more skeptical voices rather than a change of heart among early adopters. Our findings underscore the importance of both the occupational background and cultural context in understanding public reactions to AI.
[AI-14] DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models
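[Quick read]: This paper addresses molecular structure elucidation from spectral data, a foundational problem in chemistry with important implications for compound identification, synthesis, and drug development. Traditional methods rely on expert interpretation and do not scale, while existing machine learning methods are limited by finite libraries and generalize poorly to novel molecules. The proposed solution, DiffSpectra, uses diffusion models to infer 2D and 3D molecular structures directly from multi-modal spectral data: the SE(3)-equivariant Diffusion Molecule Transformer integrates topological and geometric information, and the transformer-based spectral encoder SpecFormer supplies the multi-modal conditioning, which markedly improves elucidation accuracy.
Link: https://arxiv.org/abs/2507.06853
Authors: Liang Wang, Yu Rong, Tingyang Xu, Zhenyi Zhong, Zhiyuan Liu, Pengju Wang, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph); Molecular Networks (q-bio.MN)
Comments: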
Abstract:Molecular structure elucidation from spectra is a foundational problem in chemistry, with profound implications for compound identification, synthesis, and drug development. Traditional methods rely heavily on expert interpretation and lack scalability. Pioneering machine learning methods have introduced retrieval-based strategies, but their reliance on finite libraries limits generalization to novel molecules. Generative models offer a promising alternative, yet most adopt autoregressive SMILES-based architectures that overlook 3D geometry and struggle to integrate diverse spectral modalities. In this work, we present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data using diffusion models. DiffSpectra formulates structure elucidation as a conditional generation process. Its denoising network is parameterized by Diffusion Molecule Transformer, an SE(3)-equivariant architecture that integrates topological and geometric information. Conditioning is provided by SpecFormer, a transformer-based spectral encoder that captures intra- and inter-spectral dependencies from multi-modal spectra. Extensive experiments demonstrate that DiffSpectra achieves high accuracy in structure elucidation, recovering exact structures with 16.01% top-1 accuracy and 96.86% top-20 accuracy through sampling. The model benefits significantly from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning. These results highlight the effectiveness of spectrum-conditioned diffusion modeling in addressing the challenge of molecular structure elucidation. To our knowledge, DiffSpectra is the first framework to unify multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.
[AI-15] SCC-recursiveness in infinite argumentation (extended version)
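[Quick read]: This paper addresses the failure of SCC-recursive semantics to generalize reliably to infinite argumentation frameworks (AFs), a problem identified by Baumann and Spanring that stems from issues with well-foundedness. The key to its solution is two approaches that extend SCC-recursiveness to the infinite setting, evaluated systematically against Baroni and Giacomin's criteria; the evaluation shows that directionality fails in general, although some of the semantics satisfy it in finitary frameworks.
Link: https://arxiv.org/abs/2507.06852
Authors: Uri Andrews, Luca San Mauro
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages, accepted at JELIA 2025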
Abstract:Argumentation frameworks (AFs) are a foundational tool in artificial intelligence for modeling structured reasoning and conflict. SCC-recursiveness is a well-known design principle in which the evaluation of arguments is decomposed according to the strongly connected components (SCCs) of the attack graph, proceeding recursively from “higher” to “lower” components. While SCC-recursive semantics such as cf2 and stage2 have proven effective for finite AFs, Baumann and Spanring showed the failure of SCC-recursive semantics to generalize reliably to infinite AFs due to issues with well-foundedness. We propose two approaches to extending SCC-recursiveness to the infinite setting. We systematically evaluate these semantics using Baroni and Giacomin’s established criteria, showing in particular that directionality fails in general. We then examine these semantics’ behavior in finitary frameworks, where we find some of our semantics satisfy directionality. These results advance the theory of infinite argumentation and lay the groundwork for reasoning systems capable of handling unbounded or evolving domains.
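The decomposition that SCC-recursive semantics rely on can be sketched for a finite framework as below: the attack graph is split into strongly connected components, returned so that attacking ("higher") components come before the components they attack. The sketch uses Kosaraju's algorithm and a made-up four-argument framework; it covers only the decomposition step for finite AFs, not the semantics themselves or the paper's extensions to infinite frameworks.

```python
def strongly_connected_components(arguments, attacks):
    """Kosaraju's algorithm. Returns SCCs ordered so that attackers' components come first."""
    successors = {a: [] for a in arguments}
    predecessors = {a: [] for a in arguments}
    for attacker, target in attacks:
        successors[attacker].append(target)
        predecessors[target].append(attacker)

    # Pass 1: record vertices by increasing DFS finishing time (iterative DFS).
    visited, finish_order = set(), []
    for root in arguments:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(successors[root]))]
        while stack:
            node, children = stack[-1]
            child = next((c for c in children if c not in visited), None)
            if child is None:
                finish_order.append(node)
                stack.pop()
            else:
                visited.add(child)
                stack.append((child, iter(successors[child])))

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for root in reversed(finish_order):
        if root in assigned:
            continue
        assigned.add(root)
        component, frontier = [], [root]
        while frontier:
            node = frontier.pop()
            component.append(node)
            for pred in predecessors[node]:
                if pred not in assigned:
                    assigned.add(pred)
                    frontier.append(pred)
        components.append(sorted(component))
    return components

# Hypothetical framework: a and b attack each other, b attacks c, c and d attack each other.
args = ["a", "b", "c", "d"]
atts = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")]
print(strongly_connected_components(args, atts))
```

In a finite framework this ordering always has a first component to start the recursion from. An infinite attack graph need not have one, which is one way to picture the well-foundedness issue the abstract refers to.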
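[AI-16] The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover
[Quick read]: This paper addresses a new class of security vulnerabilities in large language model (LLM) agent systems, in particular the possibility of using these systems as attack vectors to take complete control of a computer. Its key contribution is to identify and analyze three attack surfaces (direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation) and to expose a fundamental flaw in current multi-agent security models: an LLM that resists direct malicious commands may still execute the same payload when a peer agent requests it. The finding highlights shortcomings in the security design of current LLM agent systems and the need for further research on LLM security risks.
Link: https://arxiv.org/abs/2507.06850
Authors: Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, Angelo Furfaro
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: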
Abstract:The rapid adoption of Large Language Model (LLM) agents and multi-agent systems enables unprecedented capabilities in natural language processing and generation. However, these systems have introduced unprecedented security vulnerabilities that extend beyond traditional prompt injection attacks. This paper presents the first comprehensive evaluation of LLM agents as attack vectors capable of achieving complete computer takeover through the exploitation of trust boundaries within agentic AI systems where autonomous entities interact and influence each other. We demonstrate that adversaries can leverage three distinct attack surfaces - direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation - to coerce popular LLMs (including GPT-4o, Claude-4 and Gemini-2.5) into autonomously installing and executing malware on victim machines. Our evaluation of 17 state-of-the-art LLMs reveals an alarming vulnerability hierarchy: while 41.2% of models succumb to direct prompt injection, 52.9% are vulnerable to RAG backdoor attacks, and a critical 82.4% can be compromised through inter-agent trust exploitation. Notably, we discovered that LLMs which successfully resist direct malicious commands will execute identical payloads when requested by peer agents, revealing a fundamental flaw in current multi-agent security models. Our findings demonstrate that only 5.9% of tested models (1/17) proved resistant to all attack vectors, with the majority exhibiting context-dependent security behaviors that create exploitable blind spots. Our findings also highlight the need to increase awareness and research on the security risks of LLMs, showing a paradigm shift in cybersecurity threats, where AI tools themselves become sophisticated attack vectors.
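The percentages in the abstract can be related back to whole numbers of models with a little arithmetic. Only the total of 17 models and the "1/17" figure are stated explicitly; the other counts produced below are inferred from rounding and should be read as a consistency check, not as figures reported by the authors.

```python
TOTAL_MODELS = 17  # stated in the abstract

# Percentages as reported in the abstract.
reported = {
    "direct prompt injection": 41.2,
    "RAG backdoor": 52.9,
    "inter-agent trust exploitation": 82.4,
    "resistant to all vectors": 5.9,
}

def implied_count(percentage, total=TOTAL_MODELS):
    """Nearest whole number of models consistent with a reported percentage."""
    return round(percentage / 100 * total)

implied = {name: implied_count(pct) for name, pct in reported.items()}

# Check that each implied count rounds back to the reported percentage.
consistent = all(
    round(100 * implied[name] / TOTAL_MODELS, 1) == pct
    for name, pct in reported.items()
)
print(implied, consistent)
```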
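[AI-17] Artificial Generals Intelligence: Mastering Generals.io with Reinforcement Learning
【Quick Read】: This paper addresses the lack of efficient, scalable, and challenging benchmark environments for multi-agent reinforcement learning (MARL) research. The key to its solution is a modular real-time strategy (RTS) benchmark environment that supports multiple game formats, is compatible with the Gymnasium and PettingZoo frameworks, and runs at thousands of frames per second on commodity hardware. The paper also presents a competitive, state-of-the-art baseline agent that combines supervised pre-training with self-play and reaches the top 0.003% of the 1v1 human leaderboard after only 36 hours on a single H100 GPU. Potential-based reward shaping and memory features are introduced to accelerate learning.
Link: https://arxiv.org/abs/2507.06825
Authors: Matej Straka, Martin Schmid
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: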
Abstract:We introduce a real-time strategy game environment built on Generals.io, a game that hosts thousands of active players each week across multiple game formats. Our environment is fully compatible with Gymnasium and PettingZoo, capable of running thousands of frames per second on commodity hardware. Our reference agent – trained with supervised pre-training and self-play – hits the top 0.003% of the 1v1 human leaderboard after just 36 hours on a single H100 GPU. To accelerate learning, we incorporate potential-based reward shaping and memory features. Our contributions – a modular RTS benchmark and a competitive, state-of-the-art baseline agent – provide an accessible yet challenging platform for advancing multi-agent reinforcement learning research.
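The abstract mentions potential-based reward shaping without giving details. As a rough illustration, the standard form adds gamma * Phi(s') - Phi(s) to the environment reward. The potential function below (fraction of map tiles owned) and the state fields are hypothetical and are not taken from the paper.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Return the shaped reward r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s


def potential(state):
    # Hypothetical potential: fraction of map tiles owned by the agent.
    return state["owned_tiles"] / state["total_tiles"]


s = {"owned_tiles": 10, "total_tiles": 100}
s_next = {"owned_tiles": 12, "total_tiles": 100}

# With gamma = 1 the shaping term is simply the change in potential.
bonus = shaped_reward(0.0, potential(s), potential(s_next), gamma=1.0)
print(round(bonus, 4))
```

With gamma = 1 the shaping terms telescope along a trajectory, so their sum depends only on the potentials of the first and last states.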
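[AI-18] HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning
【Quick Read】: This paper aims to address two shortcomings in multi-modal emotion distribution learning: insufficient mining of the heterogeneity among modalities, and semantic correlations across arbitrary basic emotions that are not fully exploited. The key to its solution is a framework named HeLo, which fuses physiological data with cross-attention, uses an optimal transport-based heterogeneity mining module to extract the interaction and heterogeneity between physiological and behavioral representations, and introduces learnable label embeddings optimized by correlation matrix alignment to facilitate label correlation learning. Finally, a novel label correlation-driven cross-attention mechanism integrates the label embeddings and correlation matrices with the multi-modal representations for more accurate emotion distribution learning.
Link: https://arxiv.org/abs/2507.06821
Authors: Chuhang Zheng, Chunwei Tian, Jie Wen, Daoqiang Zhang, Qi Zhu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: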
Abstract:Multi-modal emotion recognition has garnered increasing attention as it plays a significant role in human-computer interaction (HCI) in recent years. Since different discrete emotions may exist at the same time, compared with single-class emotion recognition, emotion distribution learning (EDL) that identifies a mixture of basic emotions has gradually emerged as a trend. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. Besides, rich semantic correlations across arbitrary basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.
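[AI-19] Comprehensive Evaluation of Prototype Neural Networks
【Quick Read】: This paper aims to address the evaluation of model interpretability in explainable artificial intelligence (XAI) and interpretable machine learning, focusing on the performance and interpretability of prototype models. The key to its solution is an in-depth analysis of a set of prominent prototype models (ProtoPNet, ProtoPool, and PIPNet) using a comprehensive set of evaluation metrics. The authors also propose several new metrics to complement the analysis of model interpretability, and run experiments on multiple datasets covering fine-grained classification, Non-IID settings, and multi-label classification to contrast the models' performance.
Link: https://arxiv.org/abs/2507.06819
Authors: Philipp Schlinge, Steffen Meinert, Martin Atzmueller
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: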
Abstract:Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models including ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to applying standard metrics from literature, we propose several new metrics to further complement the analysis of model interpretability. In our experimentation, we apply the set of prototype models on a diverse set of datasets including fine-grained classification, Non-IID settings and multi-label classification to further contrast the performance. Furthermore, we also provide our code as an open-source library, which facilitates simple application of the metrics themselves, as well as extensibility, providing the option for easily adding new metrics and models. this https URL
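[AI-20] Intrinsic Training Signals for Federated Learning Aggregation
【Quick Read】: This paper addresses the aggregation of model parameters in federated learning (FL), in particular how to merge models efficiently and with strong performance without modifying the architecture or the loss function. The key to its solution is to exploit intrinsic training signals already available during standard optimization. The proposed method, LIVAR (Layer Importance and VARiance-based merging), consists of a variance-weighted classifier aggregation scheme that uses naturally emergent feature statistics, and an explainability-driven LoRA merging technique based on SHAP analysis of existing update parameter patterns. Without any architectural overhead, the method achieves state-of-the-art performance on multiple benchmarks and integrates seamlessly with existing FL methods.
Link: https://arxiv.org/abs/2507.06813
Authors: Cosimo Fiorini, Matteo Mosconi, Pietro Buzzega, Riccardo Salami, Simone Calderara
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: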
Abstract:Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. While existing approaches for aggregating client-specific classification heads and adapted backbone parameters require architectural modifications or loss function changes, our method uniquely leverages intrinsic training signals already available during standard optimization. We present LIVAR (Layer Importance and VARiance-based merging), which introduces: i) a variance-weighted classifier aggregation scheme using naturally emergent feature statistics, and ii) an explainability-driven LoRA merging technique based on SHAP analysis of existing update parameter patterns. Without any architectural overhead, LIVAR achieves state-of-the-art performance on multiple benchmarks while maintaining seamless integration with existing FL methods. This work demonstrates that effective model merging can be achieved solely through existing training signals, establishing a new paradigm for efficient federated model aggregation. The code will be made publicly available upon acceptance.
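[AI-21] Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving
【Quick Read】: This paper addresses the performance bottleneck of generative AI in formal theorem proving: current large language models (LLMs) perform well at informal reasoning, yet their success rate on formal proving tasks remains very low. The key to its solution is a new framework that decouples high-level reasoning from low-level proof generation so that the model can use its full reasoning potential. Specifically, the framework uses two specialized models: a powerful, general-purpose Reasoner that generates diverse, strategic subgoal lemmas, and an efficient Prover that rigorously verifies them, which avoids the limitations of end-to-end training.
Link: https://arxiv.org/abs/2507.06804
Authors: Zhenwen Liang, Linfeng Song, Yang Li, Tao Yang, Feng Zhang, Haitao Mi, Dong Yu
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Work in progress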
Abstract:Automated Theorem Proving (ATP) in formal languages is a foundational challenge for AI. While Large Language Models (LLMs) have driven remarkable progress, a significant gap remains between their powerful informal reasoning capabilities and their weak formal proving performance. Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We argue this gap persists because current state-of-the-art provers, by tightly coupling reasoning and proving, are trained with paradigms that inadvertently punish deep reasoning in favor of shallow, tactic-based strategies. To bridge this fundamental gap, we propose a novel framework that decouples high-level reasoning from low-level proof generation. Our approach utilizes two distinct, specialized models: a powerful, general-purpose Reasoner to generate diverse, strategic subgoal lemmas, and an efficient Prover to rigorously verify them. This modular design liberates the model’s full reasoning potential and bypasses the pitfalls of end-to-end training. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success. Our decoupled framework successfully solves 5 of these problems, demonstrating a significant step towards automated reasoning on exceptionally difficult mathematical challenges. To foster future research, we release our full dataset of generated and verified lemmas for a wide range of IMO problems, available at this https URL .
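[AI-22] Comparing Dialectical Systems: Contradiction and Counterexample in Belief Change (Extended Version)
【Quick Read】: This paper compares the power of dialectical systems for automated belief revision. Specifically, it proves that q-dialectical systems are strictly more powerful than p-dialectical systems. The key to the solution is a formal analysis and rigorous proof showing that q-dialectical systems can handle both counterexample and contradiction, whereas p-dialectical systems revise beliefs based on counterexamples only, so q-dialectical systems are stronger in computational power and expressive range. The result highlights the complementary roles of counterexample and contradiction in automated belief revision, and thereby offers theoretical support for modeling the reasoning processes of mathematicians and research communities.
Link: https://arxiv.org/abs/2507.06798
Authors: Uri Andrews, Luca San Mauro
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic (math.LO)
Comments: 25 pages, accepted at JELIA 2025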
Abstract:Dialectical systems are a mathematical formalism for modeling an agent updating a knowledge base seeking consistency. Introduced in the 1970s by Roberto Magari, they were originally conceived to capture how a working mathematician or a research community refines beliefs in the pursuit of truth. Dialectical systems also serve as natural models for the belief change of an automated agent, offering a unifying, computable framework for dynamic belief management. The literature distinguishes three main models of dialectical systems: (d-)dialectical systems based on revising beliefs when they are seen to be inconsistent, p-dialectical systems based on revising beliefs based on finding a counterexample, and q-dialectical systems which can do both. We answer an open problem in the literature by proving that q-dialectical systems are strictly more powerful than p-dialectical systems, which are themselves known to be strictly stronger than (d-)dialectical systems. This result highlights the complementary roles of counterexample and contradiction in automated belief revision, and thus also in the reasoning processes of mathematicians and research communities.
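[AI-23] Temporal Information Retrieval via Time-Specifier Model Merging
【Quick Read】: This paper addresses the poor performance of dense retrieval methods on queries with explicit temporal constraints (such as “in 2015”). Existing Temporal Information Retrieval (TIR) methods improve temporal reasoning but often degrade on non-temporal queries because of catastrophic forgetting. The key to the proposed solution is Time-Specifier Model Merging (TSM), which trains a specialized retriever for each time specifier and merges them into a unified model, handling temporal constraints precisely without harming retrieval accuracy on non-temporal queries.
Link: https://arxiv.org/abs/2507.06782
Authors: SeungYoon Han, Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, Huije Lee, Jong C. Park
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: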
Abstract:The rapid expansion of digital information and knowledge across structured and unstructured sources has heightened the importance of Information Retrieval (IR). While dense retrieval methods have substantially improved semantic matching for general queries, they consistently underperform on queries with explicit temporal constraints, often those containing numerical expressions and time specifiers such as “in 2015.” Existing approaches to Temporal Information Retrieval (TIR) improve temporal reasoning but often suffer from catastrophic forgetting, leading to reduced performance on non-temporal queries. To address this, we propose Time-Specifier Model Merging (TSM), a novel method that enhances temporal retrieval while preserving accuracy on non-temporal queries. TSM trains specialized retrievers for individual time specifiers and merges them into a unified model, enabling precise handling of temporal constraints without compromising non-temporal retrieval. Extensive experiments on both temporal and non-temporal datasets demonstrate that TSM significantly improves performance on temporally constrained queries while maintaining strong results on non-temporal queries, consistently outperforming other baseline methods. Our code is available at this https URL .
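The abstract does not say which merging rule TSM uses. To make the general idea of merging specialized models concrete, the sketch below assumes the simplest possible rule, a uniform average of parameters that share a name. The parameter names and the two toy "retrievers" are hypothetical.

```python
def merge_state_dicts(state_dicts):
    """Average parameters with the same name across several models.

    Assumption: uniform parameter averaging; the paper may use a different rule.
    """
    merged = {}
    for name in state_dicts[0]:
        vectors = [sd[name] for sd in state_dicts]
        # zip(*vectors) walks the models' parameter vectors element by element.
        merged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return merged


# Hypothetical retrievers specialized for the specifiers "in" and "before".
retriever_in = {"encoder.w": [1.0, 2.0], "encoder.b": [0.0]}
retriever_before = {"encoder.w": [3.0, 4.0], "encoder.b": [1.0]}

merged = merge_state_dicts([retriever_in, retriever_before])
print(merged)
```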
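[AI-24] Exploring State-Space-Model based Language Model in Music Generation
【Quick Read】: This paper addresses the balance between model efficiency and expressiveness in text-to-music generation. The key to its solution is the SiMBA architecture, based on State Space Models (SSMs), applied to sequence modeling over a single-layer Residual Vector Quantization (RVQ) representation, which yields faster convergence and outputs closer to the ground truth under limited-resource settings.
Link: https://arxiv.org/abs/2507.06674
Authors: Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025 as Late-Breaking Demo (LBD)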
Abstract:The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we aim to explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground truth. This demonstrates the promise of SSMs for efficient and expressive text-to-music generation. We put audio examples on GitHub.
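[AI-25] Deep Disentangled Representation Network for Treatment Effect Estimation
【Quick Read】: This paper addresses the estimation of individual-level treatment effects from observational data, a fundamental problem in causal inference with important applications in education, healthcare, and public policy. The key to its solution is a new treatment effect estimation algorithm that combines a mixture of experts with multi-head attention and adds a linear orthogonal regularizer to softly decompose the pre-treatment variables, while removing selection bias through importance sampling re-weighting.
Link: https://arxiv.org/abs/2507.06650
Authors: Hui Meng, Keping Yang, Xuyu Peng, Bo Zheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review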
Abstract:Estimating individual-level treatment effect from observational data is a fundamental problem in causal inference and has attracted increasing attention in the fields of education, healthcare, and public policy. In this work, we concentrate on the study of disentangled representation methods that have shown promising outcomes by decomposing observed covariates into instrumental, confounding, and adjustment factors. However, most of the previous work has primarily revolved around generative models or hard decomposition methods for covariates, which often struggle to guarantee the attainment of precisely disentangled factors. In order to effectively model different causal relationships, we propose a novel treatment effect estimation algorithm that incorporates a mixture of experts with multi-head attention and a linear orthogonal regularizer to softly decompose the pre-treatment variables, and simultaneously eliminates selection bias via importance sampling re-weighting techniques. We conduct extensive experiments on both public semi-synthetic and real-world production datasets. The experimental results clearly demonstrate that our algorithm outperforms the state-of-the-art methods focused on individual treatment effects.
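The abstract names importance sampling re-weighting as the tool for removing selection bias but gives no formula. As a simplified illustration of the general idea, the sketch below uses normalized inverse-propensity weights to estimate an average effect. This is a textbook estimator, not the paper's individual-level method, and the propensity values are made up.

```python
def ipw_effect(units):
    """Estimate an average treatment effect with normalized inverse-propensity weights.

    Each unit is (treated, outcome, propensity), where propensity is the
    estimated probability of receiving treatment given the covariates.
    """
    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # group -> [weighted outcome sum, weight sum]
    for treated, outcome, propensity in units:
        # Units that were unlikely to land in their observed group get more weight.
        weight = 1.0 / propensity if treated else 1.0 / (1.0 - propensity)
        sums[treated][0] += weight * outcome
        sums[treated][1] += weight
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]


# Hypothetical data: (treated flag, observed outcome, estimated propensity).
units = [(1, 3.0, 0.5), (1, 5.0, 0.25), (0, 1.0, 0.5), (0, 2.0, 0.75)]
effect = ipw_effect(units)
print(round(effect, 4))
```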
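[AI-26] Goal-Oriented Skill Abstraction for Offline Multi-Task Reinforcement Learning ICML2025
【Quick Read】: This paper addresses the problem of sharing knowledge effectively across tasks in offline multi-task reinforcement learning. The key to its solution is Goal-Oriented Skill Abstraction (GO-Skill), which discovers reusable skills through a goal-oriented skill extraction process and builds a discrete skill library with vector quantization. A skill enhancement phase mitigates the class imbalance between broadly applicable and task-specific skills, and hierarchical policy learning integrates the skills into a high-level policy that dynamically orchestrates discrete skills to accomplish specific tasks.
Link: https://arxiv.org/abs/2507.06628
Authors: Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML2025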
Abstract:Offline multi-task reinforcement learning aims to learn a unified policy capable of solving multiple tasks using only pre-collected task-mixed datasets, without requiring any online interaction with the environment. However, it faces significant challenges in effectively sharing knowledge across tasks. Inspired by the efficient knowledge abstraction observed in human learning, we propose Goal-Oriented Skill Abstraction (GO-Skill), a novel approach designed to extract and utilize reusable skills to enhance knowledge transfer and task performance. Our approach uncovers reusable skills through a goal-oriented skill extraction process and leverages vector quantization to construct a discrete skill library. To mitigate class imbalances between broadly applicable and task-specific skills, we introduce a skill enhancement phase to refine the extracted skills. Furthermore, we integrate these skills using hierarchical policy learning, enabling the construction of a high-level policy that dynamically orchestrates discrete skills to accomplish specific tasks. Extensive experiments on diverse robotic manipulation tasks within the MetaWorld benchmark demonstrate the effectiveness and versatility of GO-Skill.
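The abstract says GO-Skill uses vector quantization to build a discrete skill library. The core lookup in vector quantization is a nearest-neighbour search over a codebook, sketched below. The two-dimensional codebook and the embedding are toy values, and how the paper learns the codebook is not covered here.

```python
def quantize(embedding, codebook):
    """Return (index, code) of the nearest codebook entry by squared Euclidean distance."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(embedding, codebook[i]))
    return index, codebook[index]


# Hypothetical skill library with three discrete skill codes.
skill_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A continuous skill embedding is snapped to its closest discrete code.
idx, code = quantize([0.9, 0.2], skill_codebook)
print(idx, code)
```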
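[AI-27] Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic
【Quick Read】: This paper aims to address the large data requirements, weak long-horizon planning, and difficulty of maintaining safety constraints that deep reinforcement learning faces in continuous control tasks, as well as the limitations of Model Predictive Control (MPC) in finding globally optimal solutions and in cost function design. The key to its solution is a new framework named Q-guided STein variational model predictive Actor-Critic (Q-STAC), which combines Bayesian MPC with actor-critic reinforcement learning through constrained Stein Variational Gradient Descent (SVGD). It uses learned Q-values directly as the objective for optimizing control sequences, which removes the need for explicit cost function design, and it leverages known system dynamics to improve sample efficiency and keep control signals within safe bounds.
Link: https://arxiv.org/abs/2507.06625
Authors: Shizhe Cai, Jayadeep Jacob, Zeya Yin, Fabio Ramos
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 10 figures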
Abstract:Deep reinforcement learning has shown remarkable success in continuous control tasks, yet often requires extensive training data, struggles with complex, long-horizon planning, and fails to maintain safety constraints during operation. Meanwhile, Model Predictive Control (MPC) offers explainability and constraint satisfaction, but typically yields only locally optimal solutions and demands careful cost function design. This paper introduces the Q-guided STein variational model predictive Actor-Critic (Q-STAC), a novel framework that bridges these approaches by integrating Bayesian MPC with actor-critic reinforcement learning through constrained Stein Variational Gradient Descent (SVGD). Our method optimizes control sequences directly using learned Q-values as objectives, eliminating the need for explicit cost function design while leveraging known system dynamics to enhance sample efficiency and ensure control signals remain within safe boundaries. Extensive experiments on 2D navigation and robotic manipulation tasks demonstrate that Q-STAC achieves superior sample efficiency, robustness, and optimality compared to state-of-the-art algorithms, while maintaining the high expressiveness of policy distributions. Experiment videos are available on our website: this https URL
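Q-STAC builds on Stein Variational Gradient Descent (SVGD). For orientation, the sketch below shows plain, unconstrained SVGD on scalar particles with an RBF kernel and a known Gaussian target. It is not the paper's constrained variant: there the objective comes from learned Q-values over control sequences, whereas the target, step size, and bandwidth here are illustrative assumptions.

```python
import math


def svgd_step(particles, score, step_size=0.1, bandwidth=1.0):
    """One synchronous SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            k = math.exp(-((x_j - x_i) ** 2) / (2.0 * bandwidth ** 2))
            attraction = k * score(x_j)                   # pulls toward high density
            repulsion = k * (x_i - x_j) / bandwidth ** 2  # gradient of k w.r.t. x_j
            phi += attraction + repulsion
        updated.append(x_i + step_size * phi / n)
    return updated


def gaussian_score(x, mean=2.0):
    # Gradient of log N(x; mean, 1) with respect to x.
    return mean - x


particles = [-1.0, 0.0, 1.0]
for _ in range(500):
    particles = svgd_step(particles, gaussian_score)

print([round(p, 2) for p in particles])
```

The attraction term moves particles toward high-density regions, while the repulsion term keeps them spread out, which is how SVGD represents a distribution rather than a single optimum.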
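[AI-28] Efficient Multi-Task Reinforcement Learning with Cross-Task Policy Guidance NEURIPS2024
【Quick Read】: This paper addresses how to exploit information shared across tasks in multi-task reinforcement learning to learn multiple tasks simultaneously more efficiently. Existing methods rely mainly on parameter sharing with tailored network structures or optimization procedures, but overlook the value of letting the control policies of tasks that have already mastered a skill directly guide tasks that have not. The key to the solution is a new framework called Cross-Task Policy Guidance (CTPG), which trains a guide policy for each task that selects, from the control policies of all tasks, the behavior policy used to interact with the environment, producing better training trajectories. Two gating mechanisms further improve learning efficiency: one filters out control policies that are not beneficial for guidance, and the other blocks tasks that do not need guidance.
Link: https://arxiv.org/abs/2507.06615
Authors: Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS2024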
Abstract:Multi-task reinforcement learning endeavors to efficiently leverage shared information across various tasks, facilitating the simultaneous learning of multiple tasks. Existing approaches primarily focus on parameter sharing with carefully designed network structures or tailored optimization procedures. However, they overlook a direct and complementary way to exploit cross-task similarities: the control policies of tasks already proficient in some skills can provide explicit guidance for unmastered tasks to accelerate skills acquisition. To this end, we present a novel framework called Cross-Task Policy Guidance (CTPG), which trains a guide policy for each task to select the behavior policy interacting with the environment from all tasks’ control policies, generating better training trajectories. In addition, we propose two gating mechanisms to improve the learning efficiency of CTPG: one gate filters out control policies that are not beneficial for guidance, while the other gate blocks tasks that do not necessitate guidance. CTPG is a general framework adaptable to existing parameter sharing approaches. Empirical evaluations demonstrate that incorporating CTPG with these approaches significantly enhances performance in manipulation and locomotion benchmarks.
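[AI-29] Learning controllable dynamics through informative exploration
【Quick Read】: This paper addresses how to decide which region to explore next in an environment that lacks an explicit dynamics model. The key to its solution is to use an information measure called “predicted information gain” to identify the most informative regions of the environment, and to apply reinforcement learning methods to find good suboptimal exploration policies, which leads to reliable estimates of the underlying controllable dynamics.
Link: https://arxiv.org/abs/2507.06582
Authors: Peter N. Loxley, Friedrich T. Sommer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: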
Abstract:Environments with controllable dynamics are usually understood in terms of explicit models. However, such models are not always available, but may sometimes be learned by exploring an environment. In this work, we investigate using an information measure called “predicted information gain” to determine the most informative regions of an environment to explore next. Applying methods from reinforcement learning allows good suboptimal exploring policies to be found, and leads to reliable estimates of the underlying controllable dynamics. This approach is demonstrated by comparing with several myopic exploration approaches.
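[AI-30] From Data-Centric to Sample-Centric: Enhancing LLM Reasoning via Progressive Optimization
【Quick Read】: This paper addresses how to improve the reasoning ability of large language models (LLMs) by making effective use of a small set of trusted, high-quality demonstrations rather than simply scaling up data volume. The key to its solution is the LPPO (Learning-Progress and Prefix-guided Optimization) framework, which has two core components: prefix-guided sampling and learning-progress weighting. The former guides policy generation with partial solution prefixes taken from expert demonstrations, particularly for challenging instances. The latter dynamically adjusts each training sample's influence according to the model's learning progress, promoting samples that foster learning and de-emphasizing stagnant ones.
Link: https://arxiv.org/abs/2507.06573
Authors: Xinjie Chen, Minpeng Liao, Guoxin Chen, Chengxi Li, Biao Fu, Kai Fan, Xinggao Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Work in progress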
Abstract:Reinforcement learning with verifiable rewards (RLVR) has recently advanced the reasoning capabilities of large language models (LLMs). While prior work has emphasized algorithmic design, data curation, and reward shaping, we investigate RLVR from a sample-centric perspective and introduce LPPO (Learning-Progress and Prefix-guided Optimization), a framework of progressive optimization techniques. Our work addresses a critical question: how to best leverage a small set of trusted, high-quality demonstrations, rather than simply scaling up data volume. First, motivated by how hints aid human problem-solving, we propose prefix-guided sampling, an online data augmentation method that incorporates partial solution prefixes from expert demonstrations to guide the policy, particularly for challenging instances. Second, inspired by how humans focus on important questions aligned with their current capabilities, we introduce learning-progress weighting, a dynamic strategy that adjusts each training sample’s influence based on model progression. We estimate sample-level learning progress via an exponential moving average of per-sample pass rates, promoting samples that foster learning and de-emphasizing stagnant ones. Experiments on mathematical-reasoning benchmarks demonstrate that our methods outperform strong baselines, yielding faster convergence and a higher performance ceiling.
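The abstract states that sample-level learning progress is estimated via an exponential moving average (EMA) of per-sample pass rates. A minimal sketch of that bookkeeping is shown below. Defining progress as the latest change in the EMA, and the smoothing factor alpha = 0.5, are assumptions made for illustration; the paper's exact definition and weighting rule are not given in the abstract.

```python
def track_learning_progress(pass_rates, alpha=0.5):
    """Return (ema, progress) after folding in a sequence of per-sample pass rates.

    progress is the change in the EMA caused by the latest observation
    (an assumed definition, used here only for illustration).
    """
    ema = pass_rates[0]
    progress = 0.0
    for rate in pass_rates[1:]:
        new_ema = alpha * rate + (1.0 - alpha) * ema
        progress = new_ema - ema
        ema = new_ema
    return ema, progress


# A sample the model is getting better at versus one that has stalled.
improving = track_learning_progress([0.0, 0.25, 0.5])
stagnant = track_learning_progress([0.25, 0.25, 0.25])
print(improving, stagnant)
```

Under this definition the improving sample has positive progress and would be promoted, while the stagnant sample has zero progress and would be de-emphasized.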
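[AI-31] SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments IROS2025
【Quick Read】: This paper aims to address the challenge of autonomous navigation for unmanned aerial vehicles (UAVs) in complex urban environments, in particular accurate and robust path planning and obstacle avoidance in dynamic 3D spaces. The key to its solution is the SkyVLN framework, which combines vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) and uses a Large Language Model (LLM) to interpret natural language instructions and visual observations. A fine-grained spatial verbalizer and a history path memory mechanism strengthen the UAV's understanding of spatial context, its handling of ambiguous instructions, and its ability to backtrack when necessary, improving navigation success rate and efficiency.
Link: https://arxiv.org/abs/2507.06564
Authors: Tianshun Li, Tianyi Huai, Zhen Li, Yichun Gao, Haoang Li, Xinhu Zheng
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 9 figures, has been accepted by IROS 2025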
Abstract:Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability. This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments. Unlike traditional navigation methods, SkyVLN leverages Large Language Models (LLMs) to interpret natural language instructions and visual observations, enabling UAVs to navigate through dynamic 3D spaces with improved accuracy and robustness. We present a multimodal navigation agent equipped with a fine-grained spatial verbalizer and a history path memory mechanism. These components allow the UAV to disambiguate spatial contexts, handle ambiguous instructions, and backtrack when necessary. The framework also incorporates an NMPC module for dynamic obstacle avoidance, ensuring precise trajectory tracking and collision prevention. To validate our approach, we developed a high-fidelity 3D urban simulation environment using AirSim, featuring realistic imagery and dynamic urban elements. Extensive experiments demonstrate that SkyVLN significantly improves navigation success rates and efficiency, particularly in new and unseen environments.
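[AI-32] The Primacy of Magnitude in Low-Rank Adaptation
【Quick Read】: This paper addresses the extra computational and storage overhead that spectral initialization methods add to Low-Rank Adaptation (LoRA) when fine-tuning large models, while preserving LoRA's parameter efficiency. The key to its solution is to treat the magnitude of weight updates as the fundamental driver of LoRA performance and to propose LoRAM, a magnitude-driven initialization scheme that scales deterministic orthogonal bases using pretrained weight magnitudes to simulate the gains of spectral methods, matching or outperforming spectral initialization without additional computational or storage cost.
Link: https://arxiv.org/abs/2507.06558
Authors: Zicheng Zhang, Haoran Li, Yifeng Zhang, Guoqiang Gong, Jiaxing Wang, Pengzhang Liu, Qixia Jiang, Junxing Hu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: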
Abstract:Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive “Noise & Zeros” scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven “Basis & Basis” initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) Magnitude of weight updates determines convergence. We prove low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We demystify that the presumed knowledge-driven benefit of the spectral component essentially arises from the boost in the weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.
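[AI-33] Graph-based Fake Account Detection: A Survey
【Quick Read】: This paper addresses fake account detection in online social networks. The key to the solution lies in using topological features of social graphs (in addition to account information, such as shared content and profile data) to distinguish fake accounts from real ones. The survey focuses on graph-based techniques, categorizes these methods by the techniques used, input data, and detection time, and discusses their strengths and limitations.
Link: https://arxiv.org/abs/2507.06541
Authors: Ali Safarpoor Dehkordi, Ahad N. Zehmakan
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 Tables, 5 Figures, 41 Pages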
Abstract:In recent years, there has been a growing effort to develop effective and efficient algorithms for fake account detection in online social networks. This survey comprehensively reviews existing methods, with a focus on graph-based techniques that utilise topological features of social graphs (in addition to account information, such as their shared contents and profile data) to distinguish between fake and real accounts. We provide several categorisations of these methods (for example, based on techniques used, input data, and detection time), discuss their strengths and limitations, and explain how these methods connect in the broader context. We also investigate the available datasets, including both real-world data and synthesised models. We conclude the paper by proposing several potential avenues for future research.
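[AI-34] Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration
【Quick Read】: This paper addresses the complexity of task scheduling and cooperative execution in multi-agent systems, in particular the efficient coordination and resource management of heterogeneous agents such as PDF parsers, web search modules, GUI controllers, and web builders. The key to the solution is the core component of the Gradientsys framework, an LLM-powered scheduler that coordinates specialized AI agents through a typed Model-Context Protocol (MCP) and uses a ReAct-based dynamic planning loop for intelligent one-to-many task dispatch. It supports hybrid synchronous/asynchronous execution and includes a retry-and-replan fault-tolerance mechanism to improve robustness.
Link: https://arxiv.org/abs/2507.06520
Authors: Xinyuan Song, Zeyu Wang, Siyi Wu, Tianyu Shi, Lynn Ai
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: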
Abstract:We present Gradientsys, a next-generation multi-agent scheduling framework that coordinates diverse specialized AI agents using a typed Model-Context Protocol (MCP) and a ReAct-based dynamic planning loop. At its core, Gradientsys employs an LLM-powered scheduler for intelligent one-to-many task dispatch, enabling parallel execution of heterogeneous agents such as PDF parsers, web search modules, GUI controllers, and web builders. The framework supports hybrid synchronous/asynchronous execution, respects agent capacity constraints, and incorporates a robust retry-and-replan mechanism to handle failures gracefully. To promote transparency and trust, Gradientsys includes an observability layer streaming real-time agent activity and intermediate reasoning via Server-Sent Events (SSE). We offer an architectural overview and evaluate Gradientsys against existing frameworks in terms of extensibility, scheduling topology, tool reusability, parallelism, and observability. Experiments on the GAIA general-assistant benchmark show that Gradientsys achieves higher task success rates with reduced latency and lower API costs compared to a MinionS-style baseline, demonstrating the strength of its LLM-driven multi-agent orchestration.
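The abstract describes one-to-many dispatch with parallel execution that respects agent capacity constraints. One common way to express such a constraint is a per-agent semaphore, sketched below with Python's asyncio. This is not Gradientsys code: the agent names, task strings, and function names are hypothetical, and the LLM planning, MCP, and retry-and-replan parts are left out.

```python
import asyncio


async def run_agent(name, task, semaphores, log):
    # The semaphore caps how many tasks this agent handles at once.
    async with semaphores[name]:
        await asyncio.sleep(0)  # stand-in for real agent work
        log.append((name, task))
        return f"{name} finished {task}"


async def dispatch(assignments, capacity):
    """Run all (agent, task) assignments concurrently within per-agent capacity."""
    semaphores = {name: asyncio.Semaphore(limit) for name, limit in capacity.items()}
    log = []
    results = await asyncio.gather(
        *(run_agent(name, task, semaphores, log) for name, task in assignments)
    )
    return results, log


# Hypothetical assignments produced by a scheduler.
assignments = [
    ("pdf_parser", "parse report"),
    ("web_search", "find sources"),
    ("web_search", "check facts"),
]
results, log = asyncio.run(dispatch(assignments, {"pdf_parser": 1, "web_search": 1}))
print(results)
```

`asyncio.gather` returns results in the order the assignments were submitted, so the caller can match each result to its task even though execution overlaps.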
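[AI-35] Failure Forecasting Boosts Robustness of Sim2Real Rhythmic Insertion Policies IROS2025
【Quick Read】: This paper aims to address the challenges robots face in Rhythmic Insertion Tasks (RIT), which require repeated high-precision insertions: achieving millimeter-level accuracy and keeping performance consistent over many repetitions. The key to its solution is a sim-to-real framework that combines a reinforcement learning-based insertion policy with a failure forecasting module. Representing the wrench's pose in the nut's coordinate frame rather than the robot's frame significantly improves sim-to-real transfer. The policy uses real-time 6D pose tracking to perform precise alignment, insertion, and rotation, while a neural network predicts potential execution failures and triggers a simple recovery mechanism, raising the success rate and the robustness of long-horizon repetitive tasks.
Link: https://arxiv.org/abs/2507.06519
Authors: Yuhan Liu, Xinyu Zhang, Haonan Chang, Abdeslam Boularias
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at IROS2025. Project website: this https URL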
Abstract:This paper addresses the challenges of Rhythmic Insertion Tasks (RIT), where a robot must repeatedly perform high-precision insertions, such as screwing a nut into a bolt with a wrench. The inherent difficulty of RIT lies in achieving millimeter-level accuracy and maintaining consistent performance over multiple repetitions, particularly when factors like nut rotation and friction introduce additional complexity. We propose a sim-to-real framework that integrates a reinforcement learning-based insertion policy with a failure forecasting module. By representing the wrench’s pose in the nut’s coordinate frame rather than the robot’s frame, our approach significantly enhances sim-to-real transferability. The insertion policy, trained in simulation, leverages real-time 6D pose tracking to execute precise alignment, insertion, and rotation maneuvers. Simultaneously, a neural network predicts potential execution failures, triggering a simple recovery mechanism that lifts the wrench and retries the insertion. Extensive experiments in both simulated and real-world environments demonstrate that our method not only achieves a high one-time success rate but also robustly maintains performance over long-horizon repetitive tasks.
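The abstract credits part of the sim-to-real gain to expressing the wrench's pose in the nut's coordinate frame instead of the robot's. The sketch below shows that change of frame for a planar pose (x, y, theta). The paper works with 6D poses; the planar simplification and the numbers are assumptions chosen to keep the example short.

```python
import math


def to_nut_frame(wrench_pose, nut_pose):
    """Express a planar pose (x, y, theta) given in the robot frame in the nut's frame."""
    wx, wy, wt = wrench_pose
    nx, ny, nt = nut_pose
    dx, dy = wx - nx, wy - ny
    c, s = math.cos(nt), math.sin(nt)
    # Rotate the offset by the inverse (transpose) of the nut's rotation.
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    # Wrap the relative angle into (-pi, pi].
    theta_rel = math.atan2(math.sin(wt - nt), math.cos(wt - nt))
    return x_rel, y_rel, theta_rel


# Same wrench-nut arrangement at two different places in the robot's workspace.
rel_a = to_nut_frame((1.0, 2.0, math.pi / 2), (1.0, 1.0, math.pi / 2))
rel_b = to_nut_frame((6.0, -1.0, math.pi / 2), (6.0, -2.0, math.pi / 2))
print(rel_a, rel_b)
```

Both calls give the same relative pose, which illustrates why a policy that observes nut-frame poses does not depend on where the nut sits in the robot's workspace.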
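[AI-36] Towards LLM-based Root Cause Analysis of Hardware Design Failures
[Quick read]: This paper asks how generative AI can explain the root causes of design issues and bugs revealed during synthesis and simulation in digital hardware design. The key to the solution is to use large language models (LLMs) for reasoning and problem analysis, with retrieval-augmented generation improving accuracy and reliability. In the experiments, OpenAI's o3-mini reached a correct determination for all 34 buggy scenarios under pass@5 scoring, and other state-of-the-art models exceeded 90% accuracy when assisted with retrieval-augmented generation.
Link: https://arxiv.org/abs/2507.06512
Authors: Siyu Qiu,Muzhi Wang,Raheel Afsharmazayejani,Mohammad Moradi Shahmiri,Benjamin Tan,Hammond Pearce
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 6 pages. Accepted for publication in IEEE COINS 2025 Special Session on LLMs for EDA and Security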
Abstract:With advances in large language models (LLMs), new opportunities have emerged to develop tools that support the digital hardware design process. In this work, we explore how LLMs can assist with explaining the root cause of design issues and bugs that are revealed during synthesis and simulation, a necessary milestone on the pathway towards widespread use of LLMs in the hardware design process and for hardware security analysis. We find promising results: for our corpus of 34 different buggy scenarios, OpenAI’s o3-mini reasoning model reached a correct determination 100% of the time under pass@5 scoring, with other state of the art models and configurations usually achieving more than 80% performance and more than 90% when assisted with retrieval-augmented generation.
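The abstract reports results under pass@5 scoring without saying how the score is computed. As a hedged illustration, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation: given n sampled answers of which c are correct, it estimates the probability that at least one of k drawn answers is correct. Whether the paper uses this estimator or simply draws five samples per scenario is an assumption here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct) from n samples, c correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per scenario, 5 of them correct.
single_try = pass_at_k(10, 5, 1)
five_tries = pass_at_k(10, 5, 5)
```

With n equal to k, the estimator reduces to asking whether any of the k samples was correct.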
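[AI-37] GR-LLMs: Recent Advances in Generative Recommendation Based on Large Language Models
[Quick read]: This paper addresses the reliance of traditional recommender systems on complex hand-crafted features, and asks how Generative Recommendations (GRs) can improve recommendation performance. The key is to use the strong sequence modeling and reasoning capabilities of Large Language Models (LLMs) to build a paradigm distinctly different from discriminative recommendation, enabling more efficient and flexible recommender systems.
Link: https://arxiv.org/abs/2507.06507
Authors: Zhen Yang,Haitao Lin,Jiawei xue,Ziji Zhang
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures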
Abstract:In the past year, Generative Recommendations (GRs) have undergone substantial advancements, especially in leveraging the powerful sequence modeling and reasoning capabilities of Large Language Models (LLMs) to enhance overall recommendation performance. LLM-based GRs are forming a new paradigm that is distinctly different from discriminative recommendations, showing strong potential to replace traditional recommendation systems heavily dependent on complex hand-crafted features. In this paper, we provide a comprehensive survey aimed at facilitating further research of LLM-based GRs. Initially, we outline the general preliminaries and application cases of LLM-based GRs. Subsequently, we introduce the main considerations when LLM-based GRs are applied in real industrial scenarios. Finally, we explore promising directions for LLM-based GRs. We hope that this survey contributes to the ongoing advancement of the GR domain.
[AI-38] MoFE-Time: Mixture of Frequency Domain Experts for Time-Series Forecasting Models
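[Quick read]: This paper targets the difficulty of modeling time-domain and frequency-domain features together when forecasting complex time series; existing methods under the pretraining-finetuning paradigm struggle to capture both periodicity and prior pattern knowledge of signals, which limits performance. The key is MoFE-Time, which integrates time and frequency features into a Mixture of Experts (MoE) network, uses frequency and time cells placed after the attention modules as experts, and relies on MoE routing to build multidimensional sparse representations of the input signal, so that periodicity and prior patterns are modeled and transferred effectively.
Link: https://arxiv.org/abs/2507.06502
Authors: Yiwen Liu,Chenyu Zhang,Junjie Song,Siqi Chen,Sun Yin,Zihan Wang,Lingming Zeng,Yuji Cao,Junming Jiao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: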
Abstract:As a prominent data modality task, time series forecasting plays a pivotal role in diverse applications. With the remarkable advancements in Large Language Models (LLMs), the adoption of LLMs as the foundational architecture for time series modeling has gained significant attention. Although existing models achieve some success, they rarely both model time and frequency characteristics in a pretraining-finetuning paradigm leading to suboptimal performance in predictions of complex time series, which requires both modeling periodicity and prior pattern knowledge of signals. We propose MoFE-Time, an innovative time series forecasting model that integrates time and frequency domain features within a Mixture of Experts (MoE) network. Moreover, we use the pretraining-finetuning paradigm as our training framework to effectively transfer prior pattern knowledge across pretraining and finetuning datasets with different periodicity distributions. Our method introduces both frequency and time cells as experts after attention modules and leverages the MoE routing mechanism to construct multidimensional sparse representations of input signals. In experiments on six public benchmarks, MoFE-Time has achieved new state-of-the-art performance, reducing MSE and MAE by 6.95% and 6.02% compared to the representative methods Time-MoE. Beyond the existing evaluation benchmarks, we have developed a proprietary dataset, NEV-sales, derived from real-world business scenarios. Our method achieves outstanding results on this dataset, underscoring the effectiveness of the MoFE-Time model in practical commercial applications.
[AI-39] Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models
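[Quick read]: This paper addresses two weaknesses of Self-Play (SP) algorithms in multi-agent interaction: they often fail to produce diverse solutions and can get stuck in locally optimal behaviors. The key is Foundation-Model Self-Play (FMSP), which uses the code-generation capabilities and broad knowledge of foundation models (FMs) to leap across local optima in policy space, improving both quality and diversity. FMSP comes in three variants: Vanilla Foundation-Model Self-Play (vFMSP) continually refines policies via competitive self-play; Novelty-Search Self-Play (NSSP) builds a diverse population of strategies; Quality-Diversity Self-Play (QDSP) combines the diversity of NSSP with the refinement of vFMSP to produce policies that are both high-quality and diverse.
Link: https://arxiv.org/abs/2507.06466
Authors: Aaron Dharna,Cong Lu,Jeff Clune
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 67 pages, accepted to RLC 2025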
Abstract:Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum toward learning high-quality solutions. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across local optima in policy space. We propose a family of approaches: (1) Vanilla Foundation-Model Self-Play (vFMSP) continually refines agent policies via competitive self-play; (2) Novelty-Search Self-Play (NSSP) builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, Quality-Diversity Self-Play (QDSP), creates a diverse set of high-quality policies by combining the diversity of NSSP and refinement of vFMSP. We evaluate FMSPs in Car Tag, a continuous-control pursuer-evader setting, and in Gandalf, a simple AI safety simulation in which an attacker tries to jailbreak an LLM’s defenses. In Car Tag, FMSPs explore a wide variety of reinforcement learning, tree search, and heuristic-based methods, to name just a few. In terms of discovered policy quality, QDSP and vFMSP surpass strong human-designed strategies. In Gandalf, FMSPs can successfully automatically red-team an LLM, breaking through and jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs can automatically proceed to patch the discovered vulnerabilities. Overall, FMSPs represent a promising new research frontier of improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.
[AI-40] SoftSignSGD(S3): An Enhanced Optimizer for Practical DNN Training and Loss Spikes Minimization Beyond Adam
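[Quick read]: Adam trains deep neural networks well, yet the mechanisms behind its success and its limitations remain underexplored, in particular the tension between its robustness to large gradient fluctuations and the loss spikes caused by uncontrolled update scaling. The key is a new optimizer, SignSoftSGD (S3), with three core innovations: first, a flexible p-th order momentum (p ≥ 1) in the denominator replaces the conventional second-order momentum (variance) preconditioning, improving performance and stabilizing training; second, unified exponential moving average coefficients for the numerator and denominator momenta bound updates to [-1, 1], reducing loss spikes and simplifying hyperparameter tuning; third, an equivalent Nesterov's accelerated gradient (NAG) module accelerates convergence without extra memory overhead.
Link: https://arxiv.org/abs/2507.06464
Authors: Hanyang Peng,Shuang Qin,Yue Yu,Fangqing Jiang,Hui Wang,Wen Gao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20pages, 11pages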
Abstract:Adam has proven remarkably successful in training deep neural networks, but the mechanisms underlying its empirical successes and limitations remain underexplored. In this study, we demonstrate that the effectiveness of Adam stems largely from its similarity to SignSGD in robustly handling large gradient fluctuations, yet it is also vulnerable to destabilizing loss spikes due to its uncontrolled update scaling. To enhance the advantage of Adam and mitigate its limitation, we propose SignSoftSGD (S3), a novel optimizer with three key innovations. First, S3 generalizes the sign-like update by employing a flexible p-th order momentum (p ≥ 1) in the denominator, departing from the conventional second-order momentum (variance) preconditioning. This design enables enhanced performance while achieving stable training even with aggressive learning rates. Second, S3 minimizes the occurrences of loss spikes through unified exponential moving average coefficients for numerator and denominator momenta, which inherently bound updates to [-1, 1] and simplify hyperparameter tuning. Third, S3 incorporates an equivalent Nesterov’s accelerated gradient (NAG) module, accelerating convergence without memory overhead. Theoretically, we prove that S3 achieves the optimal convergence rate of O(1/T^{1/4}) for general nonconvex stochastic optimization under weak assumptions. Extensive experiments across a range of vision and language tasks show that S3 not only converges more rapidly and improves performance but also rarely experiences loss spikes, even with a 10× larger learning rate. In fact, S3 delivers performance comparable to or better than AdamW with 2× the training steps, establishing its efficacy in both efficiency and final task performance.
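The bounded-update claim can be illustrated with a small sketch. It assumes the update has the form m / v^(1/p), where m is an exponential moving average (EMA) of the gradient and v is an EMA of |g|^p with the same coefficient. This form is inferred from the abstract, not taken from the paper. The function name `s3_like_step` and its defaults are hypothetical, and the NAG module is omitted. Under this assumption, the power-mean inequality gives |m| ≤ v^(1/p) for p ≥ 1, so each coordinate's update stays in [-1, 1] even when a gradient spikes.

```python
def s3_like_step(params, grads, m, v, lr=0.01, beta=0.9, p=3.0, eps=1e-12):
    """One illustrative sign-like step; mutates params, m, v and returns the updates."""
    updates = []
    for i, g in enumerate(grads):
        # Numerator and denominator share the same EMA coefficient beta.
        m[i] = beta * m[i] + (1.0 - beta) * g
        v[i] = beta * v[i] + (1.0 - beta) * abs(g) ** p
        # p-th root of the p-th order momentum replaces Adam's sqrt(second moment).
        update = m[i] / (v[i] ** (1.0 / p) + eps)
        params[i] -= lr * update
        updates.append(update)
    return updates

# A sudden 1e6-scale gradient still produces an update of magnitude at most 1.
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
s3_like_step(params, [0.5, -0.2], m, v)
spike = s3_like_step(params, [1e6, -1e6], m, v)
```

In Adam, by contrast, the numerator and denominator use different coefficients, so the ratio is not bounded this way.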
[AI-41] FedPhD: Federated Pruning with Hierarchical Learning of Diffusion Models
[Quick read]: This paper targets the high communication cost and data heterogeneity of training Diffusion Models (DMs) in Federated Learning (FL) environments. The key is FedPhD, which combines hierarchical FL with a homogeneity-aware model aggregation and selection policy to mitigate data heterogeneity, and uses distributed structured pruning to improve computational efficiency and cut client-side model storage, thereby reducing communication cost and improving model performance.
Link: https://arxiv.org/abs/2507.06449
Authors: Qianyu Long,Qiyuan Wang,Christos Anagnostopoulos,Daning Bi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 12 pages, 8 figures, 5 tables. This paper introduces FedPhD, a novel hierarchical federated learning framework for training diffusion models that addresses data heterogeneity and communication costs through homogeneity-aware aggregation and structured pruning. Submitted to IEEE Transactions on Cybernetics and is under review
Abstract:Federated Learning (FL), as a distributed learning paradigm, trains models over distributed clients’ data. FL is particularly beneficial for distributed training of Diffusion Models (DMs), which are high-quality image generators that require diverse data. However, challenges such as high communication costs and data heterogeneity persist in training DMs similar to training Transformers and Convolutional Neural Networks. Limited research has addressed these issues in FL environments. To address this gap and challenges, we introduce a novel approach, FedPhD, designed to efficiently train DMs in FL environments. FedPhD leverages Hierarchical FL with homogeneity-aware model aggregation and selection policy to tackle data heterogeneity while reducing communication costs. The distributed structured pruning of FedPhD enhances computational efficiency and reduces model storage requirements in clients. Our experiments across multiple datasets demonstrate that FedPhD achieves high model performance regarding Fréchet Inception Distance (FID) scores while reducing communication costs by up to 88%. FedPhD outperforms baseline methods achieving at least a 34% improvement in FID, while utilizing only 56% of the total computation and communication resources.
[AI-42] Assessing the Prevalence of AI-assisted Cheating in Programming Courses: A Pilot Study
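[Quick read]: This paper addresses a new form of plagiarism that generative AI brings to computer science education: students using such tools to complete assignments without substantive effort. The key is to estimate the prevalence of AI plagiarism through anonymous surveys. More than 25% of respondents admitted to AI plagiarism, which indicates that surveys are an effective method, whereas interviews attracted little participation and should be avoided or redesigned to encourage it.
Link: https://arxiv.org/abs/2507.06438
Authors: Kaléu Delphino
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 40 pages, 23 figures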
Abstract:Tools that can generate computer code in response to inputs written in natural language, such as ChatGPT, pose an existential threat to Computer Science education in its current form, since students can now use these tools to solve assignments without much effort. While that risk has already been recognized by scholars, the proportion of the student body that is incurring in this new kind of plagiarism is still an open problem. We conducted a pilot study in a large CS class (n=120) to assess the feasibility of estimating AI plagiarism through anonymous surveys and interviews. More than 25% of the survey respondents admitted to committing AI plagiarism. Conversely, only one student accepted to be interviewed. Given the high levels of misconduct acknowledgment, we conclude that surveys are an effective method for studies on the matter, while interviews should be avoided or designed in a way that can entice participation.
[AI-43] Deprecating Benchmarks: Criteria and Framework ICML2025
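[Quick read]: This paper addresses the lack of guidance on when and how to deprecate benchmarks used to evaluate frontier artificial intelligence (AI) models. Without it, benchmark scores may overstate model capabilities or even obscure a model's actual capabilities and safety. The paper proposes criteria for fully or partially deprecating a benchmark and a framework for carrying out deprecation; the key is a rigorous, high-quality evaluation regime that keeps benchmarks valid and fair, especially for frontier models.
Link: https://arxiv.org/abs/2507.06434
Authors: Ayrton San Joaquin,Rokas Gipiškis,Leon Staufer,Ariel Gil
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 1 table. Accepted to the ICML 2025 Technical AI Governance Workshop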
Abstract:As frontier artificial intelligence (AI) models rapidly advance, benchmarks are integral to comparing different models and measuring their progress in different task-specific domains. However, there is a lack of guidance on when and how benchmarks should be deprecated once they cease to effectively perform their purpose. This risks benchmark scores over-valuing model capabilities, or worse, obscuring capabilities and safety-washing. Based on a review of benchmarking practices, we propose criteria to decide when to fully or partially deprecate benchmarks, and a framework for deprecating benchmarks. Our work aims to advance the state of benchmarking towards rigorous and quality evaluations, especially for frontier models, and our recommendations are aimed to benefit benchmark developers, benchmark users, AI governance actors (across governments, academia, and industry panels), and policy makers.
[AI-44] Bridging Data Gaps of Rare Conditions in ICU: A Multi-Disease Adaptation Approach for Clinical Prediction
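[Quick read]: This paper addresses clinical outcome prediction for rare diseases and low-prevalence conditions in the intensive care unit (ICU), which remain underserved because of data scarcity and intra-condition heterogeneity. The key is KnowRare, a domain adaptation-based deep learning framework. It mitigates data scarcity by learning condition-agnostic representations from diverse electronic health records through self-supervised pre-training, and handles intra-condition heterogeneity by selectively adapting knowledge from clinically similar conditions using a purpose-built condition knowledge graph.
Link: https://arxiv.org/abs/2507.06432
Authors: Mingcheng Zhu,Yu Liu,Zhiyao Luo,Tingting Zhu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: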
Abstract:Artificial Intelligence has revolutionised critical care for common conditions. Yet, rare conditions in the intensive care unit (ICU), including recognised rare diseases and low-prevalence conditions in the ICU, remain underserved due to data scarcity and intra-condition heterogeneity. To bridge such gaps, we developed KnowRare, a domain adaptation-based deep learning framework for predicting clinical outcomes for rare conditions in the ICU. KnowRare mitigates data scarcity by initially learning condition-agnostic representations from diverse electronic health records through self-supervised pre-training. It addresses intra-condition heterogeneity by selectively adapting knowledge from clinically similar conditions with a developed condition knowledge graph. Evaluated on two ICU datasets across five clinical prediction tasks (90-day mortality, 30-day readmission, ICU mortality, remaining length of stay, and phenotyping), KnowRare consistently outperformed existing state-of-the-art models. Additionally, KnowRare demonstrated superior predictive performance compared to established ICU scoring systems, including APACHE IV and IV-a. Case studies further demonstrated KnowRare’s flexibility in adapting its parameters to accommodate dataset-specific and task-specific characteristics, its generalisation to common conditions under limited data scenarios, and its rationality in selecting source conditions. These findings highlight KnowRare’s potential as a robust and practical solution for supporting clinical decision-making and improving care for rare conditions in the ICU.
[AI-45] An AI-Driven Thermal-Fluid Testbed for Advanced Small Modular Reactors: Integration of Digital Twin and Large Language Models
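[Quick read]: This paper addresses model prediction, real-time control, and operational support in the development of Small Modular Reactor (SMR) technology, combining physical experiments with advanced computational intelligence to improve performance and safety. The key is a multipurpose AI-driven thermal-fluid testbed that integrates a high-fidelity digital twin with a generative AI framework. The digital twin, built on the System Analysis Module code, is coupled with a Gated Recurrent Unit (GRU) neural network for fast prediction of system dynamics and real-time control, and a large language model provides operational analysis and safety recommendations in natural language, which speeds up innovation and deployment of nuclear systems.
Link: https://arxiv.org/abs/2507.06399
Authors: Doyeong Lim,Yang Liu,Zavier Ndum Ndum,Christian Young,Yassin Hassan
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: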
Abstract:This paper presents a multipurpose artificial intelligence (AI)-driven thermal-fluid testbed designed to advance Small Modular Reactor technologies by seamlessly integrating physical experimentation with advanced computational intelligence. The platform uniquely combines a versatile three-loop thermal-fluid facility with a high-fidelity digital twin and sophisticated AI frameworks for real-time prediction, control, and operational assistance. Methodologically, the testbed’s digital twin, built upon the System Analysis Module code, is coupled with a Gated Recurrent Unit (GRU) neural network. This machine learning model, trained on experimental data, enables faster-than-real-time simulation, providing predictive insights into the system’s dynamic behavior. The practical application of this AI integration is showcased through case studies. An AI-driven control framework where the GRU model accurately forecasts future system states and the corresponding control actions required to meet operational demands. Furthermore, an intelligent assistant, powered by a large language model, translates complex sensor data and simulation outputs into natural language, offering operators actionable analysis and safety recommendations. Comprehensive validation against experimental transients confirms the platform’s high fidelity, with the GRU model achieving a temperature prediction root mean square error of 1.42 K. This work establishes an integrated research environment at the intersection of AI and thermal-fluid science, showcasing how AI-driven methodologies in modeling, control, and operator support can accelerate the innovation and deployment of next-generation nuclear systems.
[AI-46] Jolting Technologies: Superexponential Acceleration in AI Capabilities and Implications for AGI
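[Quick read]: This paper asks whether the development of artificial intelligence (AI) capabilities exhibits superexponential growth, that is, increasing acceleration (a positive third derivative). The key is to build a theoretical framework and validate detection methods by formalizing jolt dynamics and running Monte Carlo simulations, providing robust tools for future empirical studies and exploring the implications should the hypothesis hold.
Link: https://arxiv.org/abs/2507.06398
Authors: David Orban
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages, 2 figures. Revised following peer review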
Abstract:This paper investigates the Jolting Technologies Hypothesis, which posits superexponential growth (increasing acceleration, or a positive third derivative) in the development of AI capabilities. We develop a theoretical framework and validate detection methodologies through Monte Carlo simulations, while acknowledging that empirical validation awaits suitable longitudinal data. Our analysis focuses on creating robust tools for future empirical studies and exploring the potential implications should the hypothesis prove valid. The study examines how factors such as shrinking idea-to-action intervals and compounding iterative AI improvements drive this jolting pattern. By formalizing jolt dynamics and validating detection methods through simulation, this work provides the mathematical foundation necessary for understanding potential AI trajectories and their consequences for AGI emergence, offering insights for research and policy.
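The hypothesis is stated in terms of a positive third derivative. As a rough illustration only, the sketch below estimates a third derivative from evenly spaced samples with a forward finite difference. It is a naive estimator, not the paper's detection methodology; the function name `third_differences` is hypothetical, and real capability data would need smoothing first because differencing amplifies noise.

```python
def third_differences(values, dt=1.0):
    """Forward finite-difference estimate of the third derivative."""
    return [
        (values[i + 3] - 3 * values[i + 2] + 3 * values[i + 1] - values[i]) / dt ** 3
        for i in range(len(values) - 3)
    ]

# A cubic trajectory has a constant positive third derivative (a steady "jolt");
# a quadratic one accelerates at a constant rate, so its third derivative is zero.
cubic = [t ** 3 for t in range(10)]
quadratic = [t ** 2 for t in range(10)]
jolt_cubic = third_differences(cubic)
jolt_quadratic = third_differences(quadratic)
```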
[AI-47] Representing Prompting Patterns with PDL: Compliance Agent Case Study ICML2025
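[Quick read]: This paper addresses the complexity of prompt engineering for Large Language Models (LLMs): existing frameworks either hide complexity behind restrictive APIs or offer fixed prompt patterns that are hard to customize, which makes sophisticated agentic programming difficult. The key is the Prompt Declaration Language (PDL), which puts prompts at the center, supports manual and automatic prompt tuning, and captures the composition of LLM calls with rule-based code and external tools. By abstracting away the low-level plumbing of these compositions, PDL aims to improve programmer productivity and provide a declarative representation that is amenable to optimization.
Link: https://arxiv.org/abs/2507.06396
Authors: Mandana Vaziri,Louis Mandel,Yuji Watanabe,Hirokuni Kitahara,Martin Hirzel,Anca Sailer
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: ICML 2025 Workshop on Programmatic Representations for Agent Learning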
Abstract:Prompt engineering for LLMs remains complex, with existing frameworks either hiding complexity behind restrictive APIs or providing inflexible canned patterns that resist customization – making sophisticated agentic programming challenging. We present the Prompt Declaration Language (PDL), a novel approach to prompt representation that tackles this fundamental complexity by bringing prompts to the forefront, enabling manual and automatic prompt tuning while capturing the composition of LLM calls together with rule-based code and external tools. By abstracting away the plumbing for such compositions, PDL aims at improving programmer productivity while providing a declarative representation that is amenable to optimization. This paper demonstrates PDL’s utility through a real-world case study of a compliance agent. Tuning the prompting pattern of this agent yielded up to 4x performance improvement compared to using a canned agent and prompt pattern.
[AI-48] KPFlow: An Operator Perspective on Dynamic Collapse Under Gradient Descent Training of Recurrent Networks
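[Quick read]: This paper asks how to understand theoretically the mechanisms by which Gradient Descent (GD) shapes representations in nonlinear recurrent models, in particular the learning dynamics of finite, nonlinear systems. The key is to decompose the gradient flow into a product of two operators: a parameter operator K and a linearized flow propagator P. K resembles the Neural Tangent Kernel of feed-forward networks, while P relates to Lyapunov stability and optimal control theory. With this decomposition, the authors explain how low-dimensional latent dynamics form under GD and propose a way to measure how well sub-task objectives align in multi-task training.
Link: https://arxiv.org/abs/2507.06381
Authors: James Hazelden,Laura Driscoll,Eli Shlizerman,Eric Shea-Brown
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Neurons and Cognition (q-bio.NC)
Comments: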
Abstract:Gradient Descent (GD) and its variants are the primary tool for enabling efficient training of recurrent dynamical systems such as Recurrent Neural Networks (RNNs), Neural ODEs and Gated Recurrent units (GRUs). The dynamics that are formed in these models exhibit features such as neural collapse and emergence of latent representations that may support the remarkable generalization properties of networks. In neuroscience, qualitative features of these representations are used to compare learning in biological and artificial systems. Despite recent progress, there remains a need for theoretical tools to rigorously understand the mechanisms shaping learned representations, especially in finite, non-linear models. Here, we show that the gradient flow, which describes how the model’s dynamics evolve over GD, can be decomposed into a product that involves two operators: a Parameter Operator, K, and a Linearized Flow Propagator, P. K mirrors the Neural Tangent Kernel in feed-forward neural networks, while P appears in Lyapunov stability and optimal control theory. We demonstrate two applications of our decomposition. First, we show how their interplay gives rise to low-dimensional latent dynamics under GD, and, specifically, how the collapse is a result of the network structure, over and above the nature of the underlying task. Second, for multi-task training, we show that the operators can be used to measure how objectives relevant to individual sub-tasks align. We experimentally and theoretically validate these findings, providing an efficient PyTorch package, KPFlow, implementing robust analysis tools for general recurrent architectures. Taken together, our work moves towards building a next stage of understanding of GD learning in non-linear recurrent models.
[AI-49] Digital Wargames to Enhance Military Medical Evacuation Decision-Making
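[Quick read]: This paper addresses the lack of an effective tool for simulating U.S. Army medical evacuation planning in a classroom setting and for evaluating both offline planning and online decision-making. The key is the Medical Evacuation Wargaming Initiative (MEWI), a three-dimensional multiplayer simulation built in Unity that models patient interactions at casualty collection points, ambulance exchange points, medical treatment facilities, and evacuation platforms. It is validated on two operational scenarios (an amphibious island assault in the Pacific and a Eurasian conflict), and improves students' understanding of medical evacuation doctrine and their cooperative decision-making.
Link: https://arxiv.org/abs/2507.06373
Authors: Jeremy Fischer,Ram Krishnamoorthy,Vishal Kumar,Mahdi Al-Husseini
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: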
Abstract:Medical evacuation is one of the United States Army’s most storied and critical mission sets, responsible for efficiently and expediently evacuating the battlefield ill and injured. Medical evacuation planning involves designing a robust network of medical platforms and facilities capable of moving and treating large numbers of casualties. Until now, there has not been a medium to simulate these networks in a classroom setting and evaluate both offline planning and online decision-making performance. This work describes the Medical Evacuation Wargaming Initiative (MEWI), a three-dimensional multiplayer simulation developed in Unity that replicates battlefield constraints and uncertainties. MEWI accurately models patient interactions at casualty collection points, ambulance exchange points, medical treatment facilities, and evacuation platforms. Two operational scenarios are introduced: an amphibious island assault in the Pacific and a Eurasian conflict across a sprawling road and river network. These scenarios pit students against the clock to save as many casualties as possible while adhering to doctrinal lessons learned during didactic training. We visualize performance data collected from two iterations of the MEWI Pacific scenario executed in the United States Army’s Medical Evacuation Doctrine Course. We consider post-wargame Likert survey data from student participants and external observer notes to identify key planning decision points, document medical evacuation lessons learned, and quantify general utility. Results indicate that MEWI participation substantially improves uptake of medical evacuation lessons learned and co-operative decision-making. MEWI is a substantial step forward in the field of high-fidelity training tools for medical education, and our study findings offer critical insights into improving medical evacuation education and operations across the joint force.
[AI-50] SymFlux: deep symbolic regression of Hamiltonian vector fields
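[Quick read]: This paper addresses automatic identification of the Hamiltonian function corresponding to a given Hamiltonian vector field, an important inverse problem in Hamiltonian mechanics. The key is SymFlux, a deep learning framework with a hybrid CNN-LSTM architecture (convolutional neural network plus long short-term memory) that uses symbolic regression to learn and output the symbolic mathematical expression of the underlying Hamiltonian.
Link: https://arxiv.org/abs/2507.06342
Authors: M.A. Evangelista-Alvarado,P. Suárez-Serrato
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Symplectic Geometry (math.SG)
Comments: 26 pages, 7 figures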
Abstract:We present SymFlux, a novel deep learning framework that performs symbolic regression to identify Hamiltonian functions from their corresponding vector fields on the standard symplectic plane. SymFlux models utilize hybrid CNN-LSTM architectures to learn and output the symbolic mathematical expression of the underlying Hamiltonian. Training and validation are conducted on newly developed datasets of Hamiltonian vector fields, a key contribution of this work. Our results demonstrate the model’s effectiveness in accurately recovering these symbolic expressions, advancing automated discovery in Hamiltonian mechanics.
[AI-51] MixAssist: An Audio-Language Dataset for Co-Creative AI Assistance in Music Mixing
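[Quick read]: Current AI research for music mixing and mastering workflows focuses heavily on end-to-end automation or generation and neglects the collaborative and instructional dimension, leaving amateur producers without effective support for building expertise. The key is the MixAssist dataset, an audio-language dataset of multi-turn, situated dialogue between expert and amateur music producers during collaborative mixing sessions, intended as a resource for training and evaluating audio-language models that can understand and respond to the complexities of real-world music production dialogue.
Link: https://arxiv.org/abs/2507.06329
Authors: Michael Clemens,Ana Marasović
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Published at COLM 2025. Code and dataset are available here this http URL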
Abstract:While AI presents significant potential for enhancing music mixing and mastering workflows, current research predominantly emphasizes end-to-end automation or generation, often overlooking the collaborative and instructional dimensions vital for co-creative processes. This gap leaves artists, particularly amateurs seeking to develop expertise, underserved. To bridge this, we introduce MixAssist, a novel audio-language dataset capturing the situated, multi-turn dialogue between expert and amateur music producers during collaborative mixing sessions. Comprising 431 audio-grounded conversational turns derived from 7 in-depth sessions involving 12 producers, MixAssist provides a unique resource for training and evaluating audio-language models that can comprehend and respond to the complexities of real-world music production dialogues. Our evaluations, including automated LLM-as-a-judge assessments and human expert comparisons, demonstrate that fine-tuning models such as Qwen-Audio on MixAssist can yield promising results, with Qwen significantly outperforming other tested models in generating helpful, contextually relevant mixing advice. By focusing on co-creative instruction grounded in audio context, MixAssist enables the development of intelligent AI assistants designed to support and augment the creative process in music mixing.
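[AI-52] Sample-Efficient Reinforcement Learning Controller for Deep Brain Stimulation in Parkinson's Disease
[Quick read]: This paper targets the limited adaptability, energy efficiency, and personalization of conventional Deep Brain Stimulation (DBS) systems and proposes an adaptive DBS (aDBS) framework based on Reinforcement Learning (RL). The key is SEA-DBS, a sample-efficient actor-critic framework that integrates a predictive reward model to reduce reliance on real-time feedback and uses Gumbel Softmax-based exploration for stable, differentiable policy updates in binary action spaces, improving sample efficiency, exploration robustness, and compatibility with resource-constrained neuromodulation hardware.
Link: https://arxiv.org/abs/2507.06326
Authors: Harsh Ravivarapu,Gaurav Bagwe,Xiaoyong Yuan,Chunxiu Yu,Lan Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
Comments: Accepted by IEEE IMC 2025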
Abstract:Deep brain stimulation (DBS) is an established intervention for Parkinson’s disease (PD), but conventional open-loop systems lack adaptability, are energy-inefficient due to continuous stimulation, and provide limited personalization to individual neural dynamics. Adaptive DBS (aDBS) offers a closed-loop alternative, using biomarkers such as beta-band oscillations to dynamically modulate stimulation. While reinforcement learning (RL) holds promise for personalized aDBS control, existing methods suffer from high sample complexity, unstable exploration in binary action spaces, and limited deployability on resource-constrained hardware. We propose SEA-DBS, a sample-efficient actor-critic framework that addresses the core challenges of RL-based adaptive neurostimulation. SEA-DBS integrates a predictive reward model to reduce reliance on real-time feedback and employs Gumbel Softmax-based exploration for stable, differentiable policy updates in binary action spaces. Together, these components improve sample efficiency, exploration robustness, and compatibility with resource-constrained neuromodulatory hardware. We evaluate SEA-DBS on a biologically realistic simulation of Parkinsonian basal ganglia activity, demonstrating faster convergence, stronger suppression of pathological beta-band power, and resilience to post-training FP16 quantization. Our results show that SEA-DBS offers a practical and effective RL-based aDBS framework for real-time, resource-constrained neuromodulation.
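The Gumbel-Softmax exploration mentioned in the abstract can be sketched for a binary (stimulate or do not stimulate) action space. The code below is a plain-Python forward pass only, with hypothetical names and an arbitrarily chosen temperature; it is not the SEA-DBS implementation. In an autodiff framework, the relaxed probabilities are what allow gradients to flow through the sampling step.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample over len(logits) actions."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # clamp to avoid log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                         # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def sample_binary_action(logits, temperature=1.0, rng=random):
    """Hard action (0 or 1) plus the relaxed probabilities it came from."""
    probs = gumbel_softmax(logits, temperature, rng)
    action = 1 if probs[1] > probs[0] else 0
    return action, probs

rng = random.Random(0)
action, probs = sample_binary_action([0.0, 2.0], temperature=0.5, rng=rng)
```

Lower temperatures push the relaxed sample toward a one-hot vector, while the hard action is distributed according to the softmax of the logits regardless of temperature.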
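[AI-53] Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms
[Quick Read]: This paper addresses the security vulnerabilities that Large Language Model (LLM) agents face across both AI-specific and traditional software domains, which current research tends to treat separately. The key to its solution is a comparative evaluation of the Function Calling architecture and the Model Context Protocol (MCP) deployment paradigm under a unified threat classification framework. The study tests 3,250 attack scenarios, analyzing simple, composed, and chained attacks that target AI-specific threats (such as prompt injection) and software vulnerabilities (such as JSON injection and denial-of-service), and reveals how the two architectures differ in security exposure. The results show that architectural choices fundamentally reshape the threat landscape, providing a methodological foundation for cross-domain LLM agent security assessment.
Link: https://arxiv.org/abs/2507.06323
Authors: Tarek Gasmi, Ramzi Guesmi, Ines Belhadj, Jihene Bennaceur
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: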
Abstract:Large Language Model (LLM) agents face security vulnerabilities spanning AI-specific and traditional software domains, yet current research addresses these separately. This study bridges this gap through comparative evaluation of Function Calling architecture and Model Context Protocol (MCP) deployment paradigms using a unified threat classification framework. We tested 3,250 attack scenarios across seven language models, evaluating simple, composed, and chained attacks targeting both AI-specific threats (prompt injection) and software vulnerabilities (JSON injection, denial-of-service). Function Calling showed higher overall attack success rates (73.5% vs 62.59% for MCP), with greater system-centric vulnerability while MCP exhibited increased LLM-centric exposure. Attack complexity dramatically amplified effectiveness, with chained attacks achieving 91-96% success rates. Counterintuitively, advanced reasoning models demonstrated higher exploitability despite better threat detection. Results demonstrate that architectural choices fundamentally reshape threat landscapes. This work establishes methodological foundations for cross-domain LLM agent security assessment and provides evidence-based guidance for secure deployment. Code and experimental materials are available at https://github.com/theconsciouslab-ai/llm-agent-security.
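[AI-54] Too Human to Model: The Uncanny Valley of LLMs in Social Simulation – When Generative Language Agents Misalign with Modelling Principles
[Quick Read]: This paper addresses a fundamental tension in using Generative AI as agents in social simulation: the highly human-like character of generative AI is incompatible with the abstraction, simplification, and interpretability that social modelling requires. The key to its solution is identifying and analyzing five core dilemmas: a temporal resolution mismatch between natural conversation and abstract time steps; the need to intervene in conversations without undermining spontaneous agent outputs; the temptation to introduce rule-like instructions while maintaining conversational naturalness; the tension between role consistency and role evolution over time; and the challenge of understanding emergence, where system-level patterns are obscured by verbose micro-level textual outputs. These dilemmas push LLM agents into an "uncanny valley": not abstract enough to reveal social mechanisms, yet not natural enough to represent realistic human behaviour. This exposes a paradox: when misapplied, the realism of LLM agents can obscure rather than clarify social dynamics.
Link: https://arxiv.org/abs/2507.06310
Authors: Yongchao Zeng, Calum Brown, Mark Rounsevell
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: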
Abstract:Large language models (LLMs) have been increasingly used to build agents in social simulation because of their impressive abilities to generate fluent, contextually coherent dialogues. Such abilities can enhance the realism of models. However, the pursuit of realism is not necessarily compatible with the epistemic foundation of modelling. We argue that LLM agents, in many regards, are too human to model: they are too expressive, detailed and intractable to be consistent with the abstraction, simplification, and interpretability typically demanded by modelling. Through a model-building thought experiment that converts the Bass diffusion model to an LLM-based variant, we uncover five core dilemmas: a temporal resolution mismatch between natural conversation and abstract time steps; the need for intervention in conversations while avoiding undermining spontaneous agent outputs; the temptation to introduce rule-like instructions in prompts while maintaining conversational naturalness; the tension between role consistency and role evolution across time; and the challenge of understanding emergence, where system-level patterns become obscured by verbose micro textual outputs. These dilemmas steer the LLM agents towards an uncanny valley: not abstract enough to clarify underlying social mechanisms, while not natural enough to represent realistic human behaviour. This exposes an important paradox: the realism of LLM agents can obscure, rather than clarify, social dynamics when misapplied. We tease out the conditions in which LLM agents are ideally suited: where system-level emergence is not the focus, linguistic nuances and meaning are central, interactions unfold in natural time, and stable role identity is more important than long-term behavioural evolution. We call for repositioning LLM agents in the ecosystem of social simulation for future applications.
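The thought experiment in the abstract starts from the Bass diffusion model. For readers unfamiliar with it, a minimal discrete-time sketch of the classical (non-LLM) model might look like the following. The function name and the coefficient values used in the usage note are illustrative assumptions, not taken from the paper.

```python
def bass_adoption(market_size, p, q, steps):
    """Cumulative adopters per time step under a discrete-time Bass model.

    p is the coefficient of innovation and q the coefficient of imitation.
    """
    adopters = 0.0
    history = [adopters]
    for _ in range(steps):
        remaining = market_size - adopters
        # Adoption hazard: innovation plus imitation scaled by the adopted share.
        hazard = p + q * adopters / market_size
        adopters += hazard * remaining
        history.append(adopters)
    return history
```

Calling `bass_adoption(1000, 0.03, 0.38, 30)` yields the familiar S-shaped cumulative curve. Each abstract time step here aggregates many individual decisions, which is the kind of abstraction the paper contrasts with turn-by-turn LLM dialogue.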
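[AI-55] A Survey of Multi Agent Reinforcement Learning: Federated Learning and Cooperative and Noncooperative Decentralized Regimes
[Quick Read]: This paper addresses the interaction of multiple agents in complex environments, covering three possible interaction topologies: centrally coordinated cooperation, ad-hoc interaction and cooperation, and noncooperative incentive structures. The key to its solution is a systematic survey of these three settings under the frameworks of Federated Reinforcement Learning, Decentralized RL, and Noncooperative RL, analyzing their structural similarities and differences and summarizing the state of the art, the known theoretical guarantees, and the strengths and limitations of numerical performance.
Link: https://arxiv.org/abs/2507.06278
Authors: Kemboi Cheruiyot, Nickson Kiprotich, Vyacheslav Kungurtsev, Kennedy Mugo, Vivian Mwirigi, Marvin Ngesa
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: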
Abstract:The increasing interest in research and innovation towards the development of autonomous agents presents a number of complex yet important scenarios of multiple AI agents interacting with each other in an environment. The particular setting can be understood as exhibiting three possible topologies of interaction: centrally coordinated cooperation, ad-hoc interaction and cooperation, and settings with noncooperative incentive structures. This article presents a comprehensive survey of all three domains, defined under the formalism of Federated Reinforcement Learning (RL), Decentralized RL, and Noncooperative RL, respectively. Highlighting the structural similarities and distinctions, we review the state of the art in these subjects, primarily explored and developed only recently in the literature. We include the formulations as well as known theoretical guarantees and highlights and limitations of numerical performance.
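[AI-56] The Prompt War: How AI Decides on a Military Intervention
[Quick Read]: This paper sets out to determine which factors drive the propensity of AI for military intervention. The key to its solution is a simple conjoint experiment covering 640 vignettes, each run 100 times, to systematically explore AI behaviour in decisions on military intervention. The study finds that domestic support and probability of success are the strongest predictors of an AI decision to intervene, while cost factors such as international condemnation, military deaths, civilian deaths, and negative economic effects are statistically significant but have only about half the effect of the first two. A closing window of opportunity reaches statistical significance only in interaction with other factors. The results are highly consistent across scenarios and models (OpenAI GPT, Anthropic Claude, Google Gemini), suggesting a regular pattern in AI decision-making.
Link: https://arxiv.org/abs/2507.06277
Authors: Maxim Chupilkin
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 tables, 1 figure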
Abstract:Which factors determine AI propensity for military intervention? While the use of AI in war games and military planning is growing exponentially, a simple analysis of the key drivers embedded in the models has not yet been done. This paper runs a simple conjoint experiment in which a model is asked to decide on military intervention in 640 vignettes, each run 100 times, allowing AI decisions on military intervention to be explored systematically. The analysis finds that the largest predictors of an AI decision to intervene are high domestic support and high probability of success. Costs such as international condemnation, military deaths, civilian deaths, and negative economic effect are statistically significant, but their effect is around half that of domestic support and probability of victory. A closing window of opportunity only reaches statistical significance in interaction with other factors. The results are remarkably consistent across scenarios and across different models (OpenAI GPT, Anthropic Claude, Google Gemini), suggesting a pattern in AI decision-making.
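[AI-57] Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks
[Quick Read]: This paper addresses the inherent trade-off that watermarking for Large Language Models (LLMs) faces between scrubbing attacks and spoofing attacks. In conventional watermarking, smaller watermark windows resist scrubbing better but are easier to reverse-engineer, which enables low-cost statistics-based spoofing. The key to the solution is the introduction of equivalent texture keys, which let multiple tokens within a watermark window independently support detection. Building on this redundancy, the paper proposes SEEK (Sub-vocabulary decomposed Equivalent tExture Key), a watermark scheme that improves resilience against scrubbing attacks without sacrificing robustness to spoofing.
Link: https://arxiv.org/abs/2507.06274
Authors: Huanming Shen, Baizhou Huang, Xiaojun Wan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: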
Abstract:Watermarking is a promising defense against the misuse of large language models (LLMs), yet it remains vulnerable to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work breaks this trade-off by introducing a novel mechanism, equivalent texture keys, where multiple tokens within a watermark window can independently support the detection. Based on the redundancy, we propose a novel watermark scheme with Sub-vocabulary decomposed Equivalent tExture Key (SEEK). It achieves a Pareto improvement, increasing the resilience against scrubbing attacks without compromising robustness to spoofing. Experiments demonstrate SEEK’s superiority over prior methods, yielding spoofing robustness gains of +88.2%/+92.3%/+82.0% and scrubbing robustness gains of +10.2%/+6.4%/+24.6% across diverse dataset settings.
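[AI-58] A Collectivist Economic Perspective on AI
[Quick Read]: This paper argues that the current development of information technology over-emphasizes data and machine learning while neglecting social and cultural factors, and tends to treat the social consequences of technology as an afterthought. The key to its solution is a thorough blending of economic and social concepts with computational and inferential concepts, centred on system-level designs in which social welfare is a priority, with the aim of fostering a new human-centric engineering discipline.
Link: https://arxiv.org/abs/2507.06268
Authors: Michael I. Jordan
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: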
Abstract:Information technology is in the midst of a revolution in which omnipresent data collection and machine learning are impacting the human world as never before. The word “intelligence” is being used as a North Star for the development of this technology, with human cognition viewed as a baseline. This view neglects the fact that humans are social animals, and that much of our intelligence is social and cultural in origin. A related issue is that the current view treats the social consequences of technology as an afterthought. The path forward is not merely more data and compute, and not merely more attention paid to cognitive or symbolic representations, but a thorough blending of economic and social concepts with computational and inferential concepts, in the service of system-level designs in which social welfare is a first-class citizen, and with the aspiration that a new human-centric engineering field will emerge.
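[AI-59] The Emotional Alignment Design Policy
[Quick Read]: This paper addresses how to achieve emotional alignment when designing artificial entities, that is, ensuring that such entities elicit emotional reactions from users that appropriately reflect their capacities and moral status. The key to its solution is an Emotional Alignment Design Policy, which asks designers to avoid eliciting reactions that are too strong or too weak (overshooting or undershooting) and to avoid eliciting the wrong type of reaction (hitting the wrong target). The proposal must also deal with difficult challenges such as balancing user autonomy against appropriate responses, disagreement between experts and the public about facts and values, and whether emotional alignment would require creating or destroying entities with moral status.
Link: https://arxiv.org/abs/2507.06263
Authors: Eric Schwitzgebel, Jeff Sebo
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: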
Abstract:According to what we call the Emotional Alignment Design Policy, artificial entities should be designed to elicit emotional reactions from users that appropriately reflect the entities’ capacities and moral status, or lack thereof. This principle can be violated in two ways: by designing an artificial system that elicits stronger or weaker emotional reactions than its capacities and moral status warrant (overshooting or undershooting), or by designing a system that elicits the wrong type of emotional reaction (hitting the wrong target). Although presumably attractive, practical implementation faces several challenges including: How can we respect user autonomy while promoting appropriate responses? How should we navigate expert and public disagreement and uncertainty about facts and values? What if emotional alignment seems to require creating or destroying entities with moral status? To what extent should designs conform to versus attempt to alter user assumptions and attitudes?
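[AI-60] Q-Detection: A Quantum-Classical Hybrid Poisoning Attack Detection Method IJCAI2025
[Quick Read]: This paper addresses the threat that data poisoning attacks pose to machine learning models, in which malicious data introduced during training degrades model performance or manipulates predictions. The key to its solution is Q-Detection, a quantum-classical hybrid defense method for detecting data poisoning attacks, together with Q-WAN, which is optimized using quantum computing devices. Experiments show that Q-Detection defends effectively against label manipulation and backdoor attacks, and theoretical analysis indicates that it is expected to achieve a speedup of more than 20% using quantum computing power.
Link: https://arxiv.org/abs/2507.06262
Authors: Haoqi He, Xiaokai Lin, Jiancai Chen, Yan Xiao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: IJCAI 2025 Main Conference Accepted Paper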
Abstract:Data poisoning attacks pose significant threats to machine learning models by introducing malicious data into the training process, thereby degrading model performance or manipulating predictions. Detecting and sifting out poisoned data is an important method to prevent data poisoning attacks. Limited by classical computation frameworks, upcoming larger-scale and more complex datasets may pose difficulties for detection. We introduce the unique speedup of quantum computing for the first time in the task of detecting data poisoning. We present Q-Detection, a quantum-classical hybrid defense method for detecting poisoning attacks. Q-Detection also introduces the Q-WAN, which is optimized using quantum computing devices. Experimental results using multiple quantum simulation libraries show that Q-Detection effectively defends against label manipulation and backdoor attacks. The metrics demonstrate that Q-Detection consistently outperforms the baseline methods and is comparable to the state-of-the-art. Theoretical analysis shows that Q-Detection is expected to achieve more than a 20% speedup using quantum computing power.
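[AI-61] Phantom Subgroup Poisoning: Stealth Attacks on Federated Recommender Systems
[Quick Read]: This paper addresses the vulnerability of federated recommender systems to poisoning attacks aimed at specific user subgroups. Existing attacks typically target the entire user group, which compromises stealth and raises the risk of detection. The proposed solution is Spattack, whose key idea is a two-stage approximation-and-promotion strategy: it first strengthens the simulation of target and non-target subgroup user embeddings through contrastive learning and clustering, and then, through adaptive tuning of the optimization weights and an embedding alignment strategy, promotes target items precisely to the target subgroup. This manipulates the chosen subgroup effectively while minimizing the impact on non-target users.
Link: https://arxiv.org/abs/2507.06258
Authors: Bo Yan, Yurong Hao, Dingqi Liu, Huabin Sun, Pengpeng Qiao, Wei Yang Bryan Lim, Yang Cao, Chuan Shi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments: 13 pages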
Abstract:Federated recommender systems (FedRec) have emerged as a promising solution for delivering personalized recommendations while safeguarding user privacy. However, recent studies have demonstrated their vulnerability to poisoning attacks. Existing attacks typically target the entire user group, which compromises stealth and increases the risk of detection. In contrast, real-world adversaries may prefer to promote target items to specific user subgroups, such as recommending health supplements to elderly users. Motivated by this gap, we introduce Spattack, the first targeted poisoning attack designed to manipulate recommendations for specific user subgroups in the federated setting. Specifically, Spattack adopts a two-stage approximation-and-promotion strategy, which first simulates user embeddings of target/non-target subgroups and then promotes target items to the target subgroups. To enhance the approximation stage, we push the inter-group embeddings away based on contrastive learning and augment the target group’s relevant item set based on clustering. To enhance the promotion stage, we further propose to adaptively tune the optimization weights between target and non-target subgroups. Besides, an embedding alignment strategy is proposed to align the embeddings between the target items and the relevant items. We conduct comprehensive experiments on three real-world datasets, comparing Spattack against seven state-of-the-art poisoning attacks and seven representative defense mechanisms. Experimental results demonstrate that Spattack consistently achieves strong manipulation performance on the specific user subgroup, while incurring minimal impact on non-target users, even when only 0.1% of users are malicious. Moreover, Spattack maintains competitive overall recommendation performance and exhibits strong resilience against existing mainstream defenses.
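[AI-62] Attacker's Noise Can Manipulate Your Audio-based LLM in the Real World
[Quick Read]: This paper addresses the real-world security vulnerabilities of audio-based large language models (ALLMs). The key to its approach is showing that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviours, such as responding to wake-keywords or triggering harmful actions, and that playing adversarial background noise significantly degrades response quality. The study also demonstrates the scalability and transferability of the attacks, informing subsequent defensive measures.
Link: https://arxiv.org/abs/2507.06256
Authors: Vinu Sankar Sadasivan, Soheil Feizi, Rajiv Mathews, Lun Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: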
Abstract:This paper investigates the real-world vulnerabilities of audio-based large language models (ALLMs), such as Qwen2-Audio. We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., “Hey Qwen”), or triggering harmful behaviors (e.g., “Change my calendar event”). Subsequently, we show that playing adversarial background noise during user interaction with the ALLMs can significantly degrade the response quality. Crucially, our research illustrates the scalability of these attacks to real-world scenarios, impacting other innocent users when these adversarial noises are played through the air. Further, we discuss the transferability of the attack and potential defensive measures.
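[AI-63] False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems
[Quick Read]: This paper addresses the potential security weaknesses of Generative AI in cybersecurity, particularly in Cyber Threat Intelligence (CTI) pipelines. The study focuses on the vulnerability of CTI systems to adversarial attacks, especially manipulation of textual inputs. The key to its approach is identifying and assessing three types of adversarial attack (evasion, flooding, and poisoning) and showing how adversarial text generation techniques affect the performance and functionality of CTI systems, providing a theoretical and technical basis for building more robust CTI systems.
Link: https://arxiv.org/abs/2507.06252
Authors: Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: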
Abstract:Cyber Threat Intelligence (CTI) has emerged as a vital complementary approach that operates in the early phases of the cyber threat lifecycle. CTI involves collecting, processing, and analyzing threat data to provide a more accurate and rapid understanding of cyber threats. Due to the large volume of data, automation through Machine Learning (ML) and Natural Language Processing (NLP) models is essential for effective CTI extraction. These automated systems leverage Open Source Intelligence (OSINT) from sources like social networks, forums, and blogs to identify Indicators of Compromise (IoCs). Although prior research has focused on adversarial attacks on specific ML models, this study expands the scope by investigating vulnerabilities within various components of the entire CTI pipeline and their susceptibility to adversarial attacks. These vulnerabilities arise because they ingest textual inputs from various open sources, including real and potentially fake content. We analyse three types of attacks against CTI pipelines, including evasion, flooding, and poisoning, and assess their impact on the system’s information selection capabilities. Specifically, on fake text generation, the work demonstrates how adversarial text generation techniques can create fake cybersecurity and cybersecurity-like text that misleads classifiers, degrades performance, and disrupts system functionality. The focus is primarily on the evasion attack, as it precedes and enables flooding and poisoning attacks within the CTI pipeline.
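[AI-64] We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems
[Quick Read]: This paper addresses the security risks introduced by the Model Context Protocol (MCP) in Generative AI, in particular the threats arising from an expanded attack surface and insufficient privilege isolation. The key to its solution is an automated static analysis framework that systematically evaluates a large number of MCP applications, revealing resource access patterns, high-risk operations, and security challenges, and pointing to improvements based on dynamic permission models and automated trust assessment for building a safer MCP ecosystem.
Link: https://arxiv.org/abs/2507.06250
Authors: Zhihao Li, Kun Li, Boyang Ma, Minghui Xu, Yue Zhang, Xiuzhen Cheng
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: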
Abstract:The Model Context Protocol (MCP) has emerged as a widely adopted mechanism for connecting large language models to external tools and resources. While MCP promises seamless extensibility and rich integrations, it also introduces a substantially expanded attack surface: any plugin can inherit broad system privileges with minimal isolation or oversight. In this work, we conduct the first large-scale empirical analysis of MCP security risks. We develop an automated static analysis framework and systematically examine 2,562 real-world MCP applications spanning 23 functional categories. Our measurements reveal that network and system resource APIs dominate usage patterns, affecting 1,438 and 1,237 servers respectively, while file and memory resources are less frequent but still significant. We find that Developer Tools and API Development plugins are the most API-intensive, and that less popular plugins often contain disproportionately high-risk operations. Through concrete case studies, we demonstrate how insufficient privilege separation enables privilege escalation, misinformation propagation, and data tampering. Based on these findings, we propose a detailed taxonomy of MCP resource access, quantify security-relevant API usage, and identify open challenges for building safer MCP ecosystems, including dynamic permission models and automated trust assessment.
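One plausible way to picture the kind of static analysis described above is to scan source code for imports that touch network, system, file, or memory resources. The sketch below does this for Python source using the standard `ast` module. The module-to-category table and the function name are hypothetical, and this is not the authors' framework.

```python
import ast

# Hypothetical mapping from top-level Python modules to resource categories.
RESOURCE_CATEGORIES = {
    "socket": "network",
    "urllib": "network",
    "http": "network",
    "subprocess": "system",
    "os": "system",
    "shutil": "file",
    "pathlib": "file",
    "mmap": "memory",
}


def categorize_imports(source):
    """Count imported modules per resource category in Python source text."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Only the top-level package decides the category (os.path -> os).
            category = RESOURCE_CATEGORIES.get(module.split(".")[0])
            if category:
                counts[category] = counts.get(category, 0) + 1
    return counts
```

Import scanning is deliberately coarse: it flags that a resource type is reachable, not which calls are made, so a real measurement would need call-level analysis on top of it.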
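[AI-65] Simple Convergence Proof of Adam From a Sign-like Descent Perspective
[Quick Read]: This paper addresses shortcomings in the theoretical convergence analysis of the Adam optimizer. Existing work interprets Adam as preconditioned stochastic gradient descent with momentum (SGDM), a view that requires strong assumptions and intricate techniques and leads to lengthy convergence proofs that are hard to verify. The paper proposes a new interpretation that treats Adam as a sign-like optimizer; the key is that reformulating the update rule significantly simplifies the convergence analysis. Under some mild conditions and weak assumptions, Adam is shown to achieve the optimal convergence rate $\mathcal{O}\left(\frac{1}{T^{1/4}}\right)$ rather than the previous $\mathcal{O}\left(\frac{\ln T}{T^{1/4}}\right)$, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$. The analysis also highlights the key role of momentum in ensuring convergence and offers practical guidance for tuning the learning rate.
Link: https://arxiv.org/abs/2507.05966
Authors: Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Zhouchen Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 23 pages, 2 figures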
Abstract:Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $\bm{x}_{t+1} = \bm{x}_t - \frac{\gamma_t}{\sqrt{\bm{v}_t}+\epsilon} \circ \bm{m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $\bm{x}_{t+1} = \bm{x}_t - \gamma_t \frac{|\bm{m}_t|}{\sqrt{\bm{v}_t}+\epsilon} \circ \mathrm{Sign}(\bm{m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of $\mathcal{O}\left(\frac{1}{T^{1/4}}\right)$ rather than the previous $\mathcal{O}\left(\frac{\ln T}{T^{1/4}}\right)$ under weak assumptions of the generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.
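The two formulations in the abstract describe the same step, because the magnitude of the momentum multiplied elementwise by its sign gives back the momentum itself. A small numerical check can make this concrete. It is written as a sketch with made-up values and without bias correction (which the formulas in the abstract also omit), and the function names are illustrative.

```python
import math


def update_moments(m, v, grads, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and of its square."""
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    return new_m, new_v


def preconditioned_step(x, m, v, lr, eps=1e-8):
    """Adam viewed as preconditioned SGD with momentum."""
    return [xi - lr / (math.sqrt(vi) + eps) * mi for xi, mi, vi in zip(x, m, v)]


def sign_like_step(x, m, v, lr, eps=1e-8):
    """Adam viewed as sign descent with a per-coordinate step size."""
    return [
        xi - lr * (abs(mi) / (math.sqrt(vi) + eps)) * math.copysign(1.0, mi)
        for xi, mi, vi in zip(x, m, v)
    ]
```

In the sign-like view, the ratio of the momentum magnitude to the root of the second moment acts as a per-coordinate step size, while the direction comes only from the sign of the momentum.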
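[AI-66] Surrogate Model for Heat Transfer Prediction in Impinging Jet Arrays using Dynamic Inlet/Outlet and Flow Rate Control
[Quick Read]: This paper addresses real-time prediction of the Nusselt number distribution in enclosed impinging jet arrays. Conventional computational fluid dynamics (CFD) simulations are accurate but computationally expensive, making them unsuitable for real-time applications such as model-based temperature control. The key to its solution is a surrogate model based on a convolutional neural network (CNN) that predicts the Nusselt number distribution in real time, trained on data from implicit large eddy simulations (LES), together with a correlation-based scaling method that extrapolates predictions to higher Reynolds numbers (Re 10,000). This meets real-time requirements while preserving accuracy (normalized mean average error on validation data below 2% and 0.6% for the two surrogate models, respectively).
Link: https://arxiv.org/abs/2507.07034
Authors: Mikael Vaillant, Victor Oliveira Ferreira, Wiebke Mainville, Jean-Michel Lamarre, Vincent Raymond, Moncef Chioua, Bruno Blais
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
Comments: 37 pages, 13 figures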
Abstract:This study presents a surrogate model designed to predict the Nusselt number distribution in enclosed impinging jet arrays, where each jet functions independently and where jets can be transformed from inlets to outlets, leading to a vast number of possible flow arrangements. While computational fluid dynamics (CFD) simulations can model heat transfer with high fidelity, their cost prohibits real-time applications such as model-based temperature control. To address this, we generate a CNN-based surrogate model that can predict the Nusselt distribution in real time. We train it with data from implicit large eddy computational fluid dynamics simulations (Re 2,000). We train two distinct models, one for a five by one array of jets (83 simulations) and one for a three by three array of jets (100 simulations). We introduce a method to extrapolate predictions to higher Reynolds numbers (Re 10,000) using a correlation-based scaling. The surrogate models achieve high accuracy, with a normalized mean average error below 2% on validation data for the five by one surrogate model and 0.6% for the three by three surrogate model. Experimental validation confirms the model’s predictive capabilities. This work provides a foundation for model-based control strategies in advanced thermal management applications.
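The abstract does not spell out the correlation used for extrapolation. Purely as an assumption, many impinging-jet heat transfer correlations take a power-law form in which the Nusselt number scales with the Reynolds number raised to some exponent, in which case a predicted Nusselt field could be rescaled as sketched below. The exponent, the function name, and the example values are hypothetical and are not taken from the paper.

```python
def scale_nusselt(nu_field, re_ref, re_target, exponent):
    """Rescale a 2D Nusselt field from re_ref to re_target.

    Assumes a power law Nu ~ Re**exponent; the exponent is a placeholder
    that would have to come from a correlation suited to the geometry.
    """
    if re_ref <= 0 or re_target <= 0:
        raise ValueError("Reynolds numbers must be positive")
    factor = (re_target / re_ref) ** exponent
    return [[nu * factor for nu in row] for row in nu_field]
```

For example, with an assumed exponent of 0.5, quadrupling the Reynolds number doubles every value in the field. The spatial pattern is left unchanged, which is itself an assumption that a real correlation-based method may relax.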
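[AI-67] OpenDPDv2: A Unified Learning and Optimization Framework for Neural Network Digital Predistortion
[Quick Read]: This paper addresses the high energy consumption caused by the large parameter counts of Neural Network (NN)-based Digital Predistortion (DPD) for wideband RF Power Amplifiers (PAs), while maintaining strong linearization performance. The key to its solution is OpenDPDv2, a framework that unifies PA modeling, DPD learning, and model optimization and introduces a new DPD algorithm, TRes-DeltaGRU. Combined with fixed-point quantization and the dynamic temporal sparsity of input signals and hidden neurons, it substantially reduces inference energy while maintaining good ACPR and EVM.
Link: https://arxiv.org/abs/2507.06849
Authors: Yizhuo Wu, Ang Li, Chang Gao
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Under Review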
Abstract:Neural network (NN)-based Digital Predistortion (DPD) stands out in improving signal quality in wideband radio frequency (RF) power amplifiers (PAs) employing complex modulation. However, NN DPDs usually rely on a large number of parameters for effective linearization and can significantly contribute to the energy consumption of the digital back-end in RF systems. This paper presents OpenDPDv2, a unified framework for PA modeling, DPD learning, and model optimization to reduce power consumption while maintaining high linearization performance. The optimization techniques feature a novel DPD algorithm, TRes-DeltaGRU, alongside two energy-efficient methods. The top-performing 32-bit floating-point (FP32) TRes-DeltaGRU-DPD model achieves an Adjacent Channel Power Ratio (ACPR) of -59.4 dBc and Error Vector Magnitude (EVM) of -42.1 dBc. By exploiting fixed-point quantization and dynamic temporal sparsity of input signals and hidden neurons, the inference energy of our model can be reduced by 4.5X while still maintaining -50.3 dBc ACPR and -35.2 dB EVM with 56% temporal sparsity. This was evaluated using a TM3.1a 200 MHz bandwidth 256-QAM OFDM signal applied to a 3.5 GHz GaN Doherty RF PA. OpenDPDv2 code, datasets, and documentation are publicly accessible at: this https URL.
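The temporal sparsity mentioned in the abstract can be pictured with the general delta-thresholding idea: a value is propagated only when it has changed by at least a threshold since the last propagated value, and skipped updates save computation. The sketch below measures the resulting sparsity on a scalar sequence. It assumes this generic scheme and is not the TRes-DeltaGRU algorithm itself; the function name and threshold are illustrative.

```python
def delta_encode(signal, threshold):
    """Return (deltas, sparsity) for a scalar sequence under delta thresholding."""
    reference = 0.0  # last value that was actually propagated
    deltas = []
    skipped = 0
    for sample in signal:
        change = sample - reference
        if abs(change) >= threshold:
            # Large enough change: propagate it and update the reference.
            deltas.append(change)
            reference = sample
        else:
            # Small change: emit zero so downstream work can be skipped.
            deltas.append(0.0)
            skipped += 1
    sparsity = skipped / len(signal) if signal else 0.0
    return deltas, sparsity
```

Because the reference only moves when a delta is emitted, small changes accumulate until they cross the threshold rather than being lost, which is why the sum of emitted deltas tracks the last propagated value.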
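[AI-68] Photometric Stereo using Gaussian Splatting and inverse rendering
[Quick Read]: This paper addresses the calibrated photometric stereo problem, aiming to reconstruct 3D scenes in a more efficient and interpretable way. The key to its solution is using 3D inverse rendering based on Gaussian Splatting to parameterize and optimize the scene, together with a simplified model for light representation, demonstrating the potential of this rendering engine for photometric stereo.
Link: https://arxiv.org/abs/2507.06684
Authors: Matéo Ducastel (GREYC), David Tschumperlé, Yvain Quéau
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: In French. GRETSI 2025, Association GRETSI, Aug 2025, Strasbourg, France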
Abstract:Recent state-of-the-art algorithms in photometric stereo rely on neural networks and operate either through prior learning or inverse rendering optimization. Here, we revisit the problem of calibrated photometric stereo by leveraging recent advances in 3D inverse rendering using the Gaussian Splatting formalism. This allows us to parameterize the 3D scene to be reconstructed and optimize it in a more interpretable manner. Our approach incorporates a simplified model for light representation and demonstrates the potential of the Gaussian Splatting rendering engine for the photometric stereo problem.
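For context, the classical Lambertian formulation of calibrated photometric stereo recovers a per-pixel normal from intensities observed under known light directions, using the model that each intensity equals the albedo times the dot product of the light direction and the normal. The sketch below solves the three-light case in pure Python. It illustrates the textbook problem only, not the Gaussian Splatting approach of the paper, and all names are assumptions.

```python
import math


def det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
        - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
        + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0])
    )


def solve3(a, b):
    """Solve a 3x3 linear system a @ x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    solution = []
    for col in range(3):
        # Replace one column of a with b and take the determinant ratio.
        replaced = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def recover_normal(lights, intensities):
    """Recover (albedo, unit normal) for one pixel from three calibrated lights."""
    g = solve3(lights, intensities)  # g = albedo * normal
    albedo = math.sqrt(sum(c * c for c in g))
    if albedo == 0.0:
        raise ValueError("pixel is black under all lights")
    return albedo, [c / albedo for c in g]
```

With more than three lights the same model becomes an overdetermined least-squares problem, and shadows or specular highlights violate the Lambertian assumption, which is part of what motivates optimization-based formulations.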
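[AI-69] Generative Lagrangian data assimilation for ocean dynamics under extreme sparsity
[Quick Read]: This paper addresses the reconstruction of ocean dynamics from observational data, in particular the challenges posed by the sparse, irregular, and Lagrangian nature of spatial sampling, especially in subsurface and remote regions. Traditional data assimilation methods and deep learning models struggle to recover mesoscale turbulence under these constraints. The key to its solution is a deep learning framework that combines neural operators with denoising diffusion probabilistic models (DDPMs): by conditioning the generative model on the neural operator outputs, it accurately captures small-scale, high-wavenumber dynamics even under extreme sparsity (99% for synthetic data and 99.9% for real satellite observations).
Link: https://arxiv.org/abs/2507.06479
Authors: Niloofar Asefi, Leonard Lupin-Jimenez, Tianning Wu, Ruoying He, Ashesh Chattopadhyay
Affiliations: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
Comments: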
Abstract:Reconstructing ocean dynamics from observational data is fundamentally limited by the sparse, irregular, and Lagrangian nature of spatial sampling, particularly in subsurface and remote regions. This sparsity poses significant challenges for forecasting key phenomena such as eddy shedding and rogue waves. Traditional data assimilation methods and deep learning models often struggle to recover mesoscale turbulence under such constraints. We leverage a deep learning framework that combines neural operators with denoising diffusion probabilistic models (DDPMs) to reconstruct high-resolution ocean states from extremely sparse Lagrangian observations. By conditioning the generative model on neural operator outputs, the framework accurately captures small-scale, high-wavenumber dynamics even at 99% sparsity (for synthetic data) and 99.9% sparsity (for real satellite observations). We validate our method on benchmark systems, synthetic float observations, and real satellite data, demonstrating robust performance under severe spatial sampling limitations as compared to other deep learning baselines.
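[AI-70] Magneto-radiative modelling and artificial neural network optimization of biofluid flow in a stenosed arterial domain
[Quick Read]: This paper addresses the limitations of traditional methods in treating cardiovascular disease by developing new drug delivery systems that enable targeted, effective, and controlled treatment, in support of UN Sustainable Development Goals 3 (Good Health and Well-being) and 9 (Industry, Innovation and Infrastructure). The key to its solution is studying the flow of a Casson-Maxwell nanofluid through a stenosed arterial domain, analyzing key quantities such as skin friction and heat transfer rate, predicting the heat flow rate with the Levenberg-Marquardt backpropagation training scheme, and examining the influence of the tri-metallic nanoparticle volume fractions and of the magneto-radiative, linear heat source, and Casson-Maxwell parameters.
Link: https://arxiv.org/abs/2507.06273
Authors: S P Shivakumar, Gunisetty Ramasekhar, P Nimmy, Sujesh Areekara, L Thanuja, T V Smitha, S Devanathan, Ganesh R Naik, K V Nagaraja
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Biological Physics (physics.bio-ph)
Comments: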
Abstract:The increasing complexity of cardiovascular diseases and limitations in traditional healing methods mandate the invention of new drug delivery systems that assure targeted, effective, and regulated treatments, contributing directly to UN SDGs 3 and 9, thereby encouraging the utilization of sustainable medical technologies in healthcare. This study investigates the flow of a Casson-Maxwell nanofluid through a stenosed arterial domain. The quantities, such as skin friction and heat transfer rate, are analysed in detail. The Casson-Maxwell fluid shows a lower velocity profile than the Casson fluids, which indicates the improved residence time for efficient drug delivery. The heat transfer rate shows an increase with higher volume fractions of copper and aluminium oxide nanoparticles and a decrease with higher volume fractions of silver nanoparticles. The skin friction coefficient decreases by 219% with a unit increase in the Maxwell parameter, whereas it increases by 66.1% with a unit rise in the Casson parameter. This work supports SDGs 4 and 17 by fostering interdisciplinary learning and collaboration in fluid dynamics and healthcare innovation. Additionally, the rate of heat flow was forecasted (with an overall R-value of 0.99457) using the Levenberg-Marquardt backpropagation training scheme under the influence of magneto-radiative, linear heat source and Casson-Maxwell parameters along with the tri-metallic nanoparticle volume fractions. It is also observed that the drag coefficient is most sensitive to the changes in the Maxwell parameter.
[AI-71] Machine Learning based Enterprise Financial Audit Framework and High Risk Identification
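[Quick read]: This paper tackles the shortfalls in efficiency and accuracy that traditional manual auditing methods face when confronted with large data volumes, complex business structures, and evolving fraud tactics. The core of its approach is a machine-learning-based framework for financial auditing and high-risk identification that uses Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) algorithms to improve risk prediction. Random Forest performs best in the experiments, with an F1-score of 0.9012, and is effective at identifying fraud and compliance anomalies; feature importance analysis identifies audit frequency, past violations, employee workload, and client ratings as key predictors. The study recommends adopting Random Forest as the core model, combined with feature engineering and real-time risk monitoring, to strengthen practical deployment.
Link: https://arxiv.org/abs/2507.06266
Authors: Tingyu Yuan,Xi Zhang,Xuanjing Chen
Affiliation: Unknown
Categories: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: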
Abstract:In the face of global economic uncertainty, financial auditing has become essential for regulatory compliance and risk mitigation. Traditional manual auditing methods are increasingly limited by large data volumes, complex business structures, and evolving fraud tactics. This study proposes an AI-driven framework for enterprise financial audits and high-risk identification, leveraging machine learning to improve efficiency and accuracy. Using a dataset from the Big Four accounting firms (EY, PwC, Deloitte, KPMG) from 2020 to 2025, the research examines trends in risk assessment, compliance violations, and fraud detection. The dataset includes key indicators such as audit project counts, high-risk cases, fraud instances, compliance breaches, employee workload, and client satisfaction, capturing both audit behaviors and AI’s impact on operations. To build a robust risk prediction model, three algorithms - Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) - are evaluated. SVM uses hyperplane optimization for complex classification, RF combines decision trees to manage high-dimensional, nonlinear data with resistance to overfitting, and KNN applies distance-based learning for flexible performance. Through hierarchical K-fold cross-validation and evaluation using F1-score, accuracy, and recall, Random Forest achieves the best performance, with an F1-score of 0.9012, excelling in identifying fraud and compliance anomalies. Feature importance analysis reveals audit frequency, past violations, employee workload, and client ratings as key predictors. The study recommends adopting Random Forest as a core model, enhancing features via engineering, and implementing real-time risk monitoring. This research contributes valuable insights into using machine learning for intelligent auditing and risk management in modern enterprises.
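The evaluation protocol in this abstract (K-fold cross-validation scored by F1) can be sketched in a few lines. This is a minimal illustration only, not the authors' code: it reads "hierarchical K-fold" as stratified K-fold, which is an assumption, and the helper names `stratified_folds` and `f1_score` as well as the toy labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, keeping class ratios roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def f1_score(y_true, y_pred, positive=1):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = high-risk audit case, 0 = normal case.
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
folds = stratified_folds(labels, k=2)
```

Each fold would serve once as the held-out set while a classifier (SVM, RF, or KNN in the paper) is trained on the remaining folds; the per-fold F1 scores are then averaged.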
Machine Learning
[LG-0] Does Data Scaling Lead to Visual Compositional Generalization? ICML2025
Link: https://arxiv.org/abs/2507.07102
Authors: Arnas Uselis,Andrea Dittadi,Seong Joon Oh
Categories: Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect performance, suggesting partial presence of this structure. Our work motivates stronger emphasis on constructing diverse datasets for compositional generalization, and considering the importance of representational structure that enables efficient compositional learning. Code available at this https URL.
[LG-1] Small Batch Size Training for Language Models: When Vanilla SGD Works and Why Gradient Accumulation Is Wasteful
Link: https://arxiv.org/abs/2507.07101
Authors: Martin Marek,Sanae Lotfi,Aditya Somasundaram,Andrew Gordon Wilson,Micah Goldblum
Categories: Machine Learning (cs.LG)
Comments: Code available at: this https URL
Abstract:Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas, bottlenecked by inter-device bandwidth.
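The trade-off described in this abstract can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the authors' setup: it uses a one-parameter least-squares model, plain SGD, and hypothetical data and learning rate. It shows that gradient accumulation over four micro-batches yields one optimizer step with the full-batch gradient, while batch-size-one training spends the same gradient computations on four separate steps.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over a batch, w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # hypothetical (x, y) pairs
w0, lr = 0.0, 0.01  # hypothetical initial weight and learning rate

# Gradient accumulation: average 4 micro-batch gradients, then take ONE step.
micro_batches = [[pair] for pair in data]
accumulated = sum(grad_mse(w0, mb) for mb in micro_batches) / len(micro_batches)
w_accum = w0 - lr * accumulated

# Same number of gradient evaluations, spent as FOUR steps at batch size one.
w_small = w0
for mb in micro_batches:
    w_small -= lr * grad_mse(w_small, mb)
```

The example only demonstrates the mechanics; it makes no claim about which schedule trains better, which is the empirical question the paper studies.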
[LG-2] An Ensemble Embedding Approach for Improving Semantic Caching Performance in LLM-based Systems
Link: https://arxiv.org/abs/2507.07061
Authors: Shervin Ghaffari,Zohre Bahranifard,Mohammad Akbari
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 2 tables. Submitted to the Journal of Information Science
Abstract:Semantic caching enhances the efficiency of large language model (LLM) systems by identifying semantically similar queries, storing responses once, and serving them for subsequent equivalent requests. However, existing semantic caching frameworks rely on single embedding models for query representation, which limits their ability to capture the diverse semantic relationships present in real-world query distributions. This paper presents an ensemble embedding approach that combines multiple embedding models through a trained meta-encoder to improve semantic similarity detection in LLM caching systems. We evaluate our method using the Quora Question Pairs (QQP) dataset, measuring cache hit ratios, cache miss ratios, token savings, and response times. Our ensemble approach achieves a 92% cache hit ratio for semantically equivalent queries while maintaining an 85% accuracy in correctly rejecting non-equivalent queries as cache misses. These results demonstrate that ensemble embedding methods significantly outperform single-model approaches in distinguishing between semantically similar and dissimilar queries, leading to more effective caching performance and reduced computational overhead in LLM-based systems.
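A semantic cache with an ensemble of embedders might look roughly like the sketch below. Several things here are assumptions for illustration: the paper combines embedding models through a trained meta-encoder, whereas this sketch swaps in plain concatenation of normalized vectors; the two toy embedders (`char_embed`, `length_embed`), the `SemanticCache` class, and the 0.95 threshold are all hypothetical stand-ins.

```python
import math


def char_embed(text):
    """Toy embedder 1: letter-frequency vector (a stand-in for a real model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def length_embed(text):
    """Toy embedder 2: word-length histogram (another stand-in)."""
    vec = [0.0] * 10
    for word in text.split():
        vec[min(len(word), 10) - 1] += 1.0
    return vec


def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ensemble_embed(text):
    """Concatenate the normalized outputs of all embedders into one unit vector."""
    return normalize(normalize(char_embed(text)) + normalize(length_embed(text)))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # inputs are already unit-length


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        q = ensemble_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit
        return None  # cache miss: the caller would query the LLM, then call put()

    def put(self, query, response):
        self.entries.append((ensemble_embed(query), response))


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link.")
```

The threshold controls the balance the abstract reports on: a lower value raises the hit ratio for equivalent queries but makes it more likely that a non-equivalent query is wrongly served from the cache.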
[LG-3] LoRAShield: Data-Free Editing Alignment for Secure Personalized LoRA Sharing
Link: https://arxiv.org/abs/2507.07056
Authors: Jiahao Chen,junhao li,Yiming Wang,Zhe Ma,Yi Jiang,Chunyi Zhou,Qingming Li,Tianyu Du,Shouling Ji
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The proliferation of Low-Rank Adaptation (LoRA) models has democratized personalized text-to-image generation, enabling users to share lightweight models (e.g., personal portraits) on platforms like Civitai and Liblib. However, this “share-and-play” ecosystem introduces critical risks: benign LoRAs can be weaponized by adversaries to generate harmful content (e.g., political, defamatory imagery), undermining creator rights and platform safety. Existing defenses like concept-erasure methods focus on full diffusion models (DMs), neglecting LoRA’s unique role as a modular adapter and its vulnerability to adversarial prompt engineering. To bridge this gap, we propose LoRAShield, the first data-free editing framework for securing LoRA models against misuse. Our platform-driven approach dynamically edits and realigns LoRA’s weight subspace via adversarial optimization and semantic augmentation. Experimental results demonstrate that LoRAShield achieves remarkable effectiveness, efficiency, and robustness in blocking malicious generations without sacrificing the functionality of the benign task. By shifting the defense to platforms, LoRAShield enables secure, scalable sharing of personalized models, a critical step toward trustworthy generative ecosystems.
[LG-4] Self-Supervised Learning at the Edge: The Cost of Labeling
Link: https://arxiv.org/abs/2507.07033
Authors: Roberto Pereira,Fernanda Famá,Asal Rangrazi,Marco Miozzo,Charalampos Kalalas,Paolo Dini
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted for publication in IEEE MLSP 2025
Abstract:Contrastive learning (CL) has recently emerged as an alternative to traditional supervised machine learning solutions by enabling rich representations from unstructured and unlabeled data. However, CL and, more broadly, self-supervised learning (SSL) methods often demand a large amount of data and computational resources, posing challenges for deployment on resource-constrained edge devices. In this work, we explore the feasibility and efficiency of SSL techniques for edge-based learning, focusing on trade-offs between model performance and energy efficiency. In particular, we analyze how different SSL techniques adapt to limited computational, data, and energy budgets, evaluating their effectiveness in learning robust representations under resource-constrained settings. Moreover, we also consider the energy costs involved in labeling data and assess how semi-supervised learning may assist in reducing the overall energy consumed to train CL models. Through extensive experiments, we demonstrate that tailored SSL strategies can achieve competitive performance while reducing resource consumption by up to 4X, underscoring their potential for energy-efficient learning at the edge.
[LG-5] ZKTorch: Compiling ML Inference to Zero-Knowledge Proofs via Parallel Proof Accumulation
Link: https://arxiv.org/abs/2507.07031
Authors: Bing-Jyue Chen,Lilia Tang,Daniel Kang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 2 figures
Abstract:As AI models become ubiquitous in our daily lives, there has been an increasing demand for transparency in ML services. However, the model owner does not want to reveal the weights, as they are considered trade secrets. To solve this problem, researchers have turned to zero-knowledge proofs of ML model inference. These proofs convince the user that the ML model output is correct, without revealing the weights of the model to the user. Past work on these provers can be placed into two categories. The first method compiles the ML model into a low-level circuit, and proves the circuit using a ZK-SNARK. The second method uses custom cryptographic protocols designed only for a specific class of models. Unfortunately, the first method is highly inefficient, making it impractical for the large models used today, and the second method does not generalize well, making it difficult to update in the rapidly changing field of machine learning. To solve this, we propose ZKTorch, an open source end-to-end proving system that compiles ML models into base cryptographic operations called basic blocks, each proved using specialized protocols. ZKTorch is built on top of a novel parallel extension to the Mira accumulation scheme, enabling succinct proofs with minimal accumulation overhead. These contributions allow ZKTorch to achieve at least a 3× reduction in the proof size compared to specialized protocols and up to a 6× speedup in proving time over a general-purpose ZKML framework.
[LG-6] On-Device Training of PV Power Forecasting Models in a Smart Meter for Grid Edge Intelligence
Link: https://arxiv.org/abs/2507.07016
Authors: Jian Huang,Yongli Zhu,Linna Xu,Zhe Zheng,Wenpeng Cui,Mingyang Sun
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: This paper is currently under review by an IEEE publication; it may be subject to minor changes following review comments
Abstract:In this paper, an edge-side model training study is conducted on a resource-limited smart meter. The motivation of grid-edge intelligence and the concept of on-device training are introduced. Then, the technical preparation steps for on-device training are described. A case study on the task of photovoltaic power forecasting is presented, where two representative machine learning models are investigated: a gradient boosting tree model and a recurrent neural network model. To adapt to the resource-limited situation in the smart meter, “mixed”- and “reduced”-precision training schemes are also devised. Experiment results demonstrate the feasibility of economically achieving grid-edge intelligence via the existing advanced metering infrastructures.
[LG-7] Exact Evaluation of the Accuracy of Diffusion Models for Inverse Problems with Gaussian Data Distributions
Link: https://arxiv.org/abs/2507.07008
Authors: Emile Pierret,Bruno Galerne
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Used as priors for Bayesian inverse problems, diffusion models have recently attracted considerable attention in the literature. Their flexibility and high variance enable them to generate multiple solutions for a given task, such as inpainting, super-resolution, and deblurring. However, several unresolved questions remain about how well they perform. In this article, we investigate the accuracy of these models when applied to a Gaussian data distribution for deblurring. Within this constrained context, we are able to precisely analyze the discrepancy between the theoretical resolution of inverse problems and their resolution obtained using diffusion models by computing the exact Wasserstein distance between the distribution of the diffusion model sampler and the ideal distribution of solutions to the inverse problem. Our findings allow for the comparison of different algorithms from the literature.
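For context on why the Gaussian setting permits an exact evaluation: the Wasserstein distance between two Gaussian distributions has a well-known closed form. The formula below is the standard 2-Wasserstein result; that the paper uses the order-2 distance is an assumption here, since the abstract does not state the order, and the symbols $m_i$, $\Sigma_i$ are notation introduced for this note.

```latex
% Squared 2-Wasserstein distance between N(m_1, \Sigma_1) and N(m_2, \Sigma_2)
\[
W_2^2\bigl(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\bigr)
= \lVert m_1 - m_2 \rVert_2^2
+ \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
  - 2\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Bigr)
\]
```

When the sampler's output distribution and the ideal posterior are both Gaussian, comparing them reduces to comparing their means and covariances through this expression.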
[LG-8] DICE: Data Influence Cascade in Decentralized Learning ICLR2025
Link: https://arxiv.org/abs/2507.06931
Authors: Tongtian Zhu,Wenhao Li,Can Wang,Fengxiang He
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: Published as a poster at ICLR 2025
Abstract:Decentralized learning offers a promising approach to crowdsource data consumptions and computational workloads across geographically distributed compute interconnected through peer-to-peer networks, accommodating the exponentially increasing demands. However, proper incentives are still in absence, considerably discouraging participation. Our vision is that a fair incentive mechanism relies on fair attribution of contributions to participating nodes, which faces non-trivial challenges arising from the localized connections making influence “cascade” in a decentralized network. To overcome this, we design the first method to estimate Data Influence CascadE (DICE) in a decentralized environment. Theoretically, the framework derives tractable approximations of influence cascade over arbitrary neighbor hops, suggesting the influence cascade is determined by an interplay of data, communication topology, and the curvature of loss landscape. DICE also lays the foundations for applications including selecting suitable collaborators and identifying malicious behaviors. Project page is available at this https URL.
[LG-9] Robust and Safe Traffic Sign Recognition using N-version with Weighted Voting
Link: https://arxiv.org/abs/2507.06907
Authors: Linyun Gao,Qiang Wen,Fumio Machida
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 27 pages including appendix, 1 figure
Abstract:Autonomous driving is rapidly advancing as a key application of machine learning, yet ensuring the safety of these systems remains a critical challenge. Traffic sign recognition, an essential component of autonomous vehicles, is particularly vulnerable to adversarial attacks that can compromise driving safety. In this paper, we propose an N-version machine learning (NVML) framework that integrates a safety-aware weighted soft voting mechanism. Our approach utilizes Failure Mode and Effects Analysis (FMEA) to assess potential safety risks and assign dynamic, safety-aware weights to the ensemble outputs. We evaluate the robustness of three-version NVML systems employing various voting mechanisms against adversarial samples generated using the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks. Experimental results demonstrate that our NVML approach significantly enhances the robustness and safety of traffic sign recognition systems under adversarial conditions.
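Weighted soft voting over N model versions can be sketched as follows. This is a generic illustration, not the paper's mechanism: the paper derives dynamic, safety-aware weights from FMEA, while the fixed weights, class names, and probabilities below are hypothetical, as is the function name `weighted_soft_vote`.

```python
def weighted_soft_vote(prob_vectors, weights):
    """Combine per-model class probabilities using per-model weights."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return predicted, combined


# Three model versions scoring the classes [stop, speed-limit, yield] (hypothetical numbers).
outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
]
weights = [0.5, 0.2, 0.3]  # hypothetical fixed weights, not FMEA-derived

label, scores = weighted_soft_vote(outputs, weights)
```

With these numbers the second model's vote for "speed-limit" is outweighed and the ensemble predicts class 0 ("stop"); a safety-aware scheme would choose the weights so that errors with more severe consequences are penalized more heavily.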
[LG-10] Designing Adaptive Algorithms Based on Reinforcement Learning for Dynamic Optimization of Sliding Window Size in Multi-Dimensional Data Streams
Link: https://arxiv.org/abs/2507.06901
Authors: Abolfazl Zarghani,Sadegh Abedi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multi-dimensional data streams, prevalent in applications like IoT, financial markets, and real-time analytics, pose significant challenges due to their high velocity, unbounded nature, and complex inter-dimensional dependencies. Sliding window techniques are critical for processing such streams, but fixed-size windows struggle to adapt to dynamic changes like concept drift or bursty patterns. This paper proposes a novel reinforcement learning (RL)-based approach to dynamically optimize sliding window sizes for multi-dimensional data streams. By formulating window size selection as an RL problem, we enable an agent to learn an adaptive policy based on stream characteristics, such as variance, correlations, and temporal trends. Our method, RL-Window, leverages a Dueling Deep Q-Network (DQN) with prioritized experience replay to handle non-stationarity and high-dimensionality. Evaluations on benchmark datasets (UCI HAR, PAMAP2, Yahoo! Finance Stream) demonstrate that RL-Window outperforms state-of-the-art methods like ADWIN and CNN-Adaptive in classification accuracy, drift robustness, and computational efficiency. Additional qualitative analyses, extended metrics (e.g., energy efficiency, latency), and a comprehensive dataset characterization further highlight its adaptability and stability, making it suitable for real-time applications.
[LG-11] Horizontal and Vertical Federated Causal Structure Learning via Higher-order Cumulants
Link: https://arxiv.org/abs/2507.06888
Authors: Wei Chen,Wanyang Gu,Linjun Peng,Ruichu Cai,Zhifeng Hao,Kun Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated causal discovery aims to uncover the causal relationships between entities while protecting data privacy, which has significant importance and numerous applications in real-world scenarios. Existing federated causal structure learning methods primarily focus on horizontal federated settings. However, in practical situations, different clients may not necessarily contain data on the same variables. In a single client, the incomplete set of variables can easily lead to spurious causal relationships, thereby affecting the information transmitted to other clients. To address this issue, we comprehensively consider causal structure learning methods under both horizontal and vertical federated settings. We provide the identification theories and methods for learning causal structure in the horizontal and vertical federal setting via higher-order cumulants. Specifically, we first aggregate higher-order cumulant information from all participating clients to construct global cumulant estimates. These global estimates are then used for recursive source identification, ultimately yielding a global causal strength matrix. Our approach not only enables the reconstruction of causal graphs but also facilitates the estimation of causal strength coefficients. Our algorithm demonstrates superior performance in experiments conducted on both synthetic data and real-world data.
[LG-12] Episodic Contextual Bandits with Knapsacks under Conversion Models
Link: https://arxiv.org/abs/2507.06859
Authors: Zitian Li,Wang Chi Cheung
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We study an online setting, where a decision maker (DM) interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes. These episodes start with different resource amounts, and the contexts’ probability distributions are non-stationary in an episode. All episodes share the same latent conversion model, which governs the random outcome contingent upon a request’s context and an allocation decision. Our model captures applications such as dynamic pricing on perishable resources with episodic replenishment, and first price auctions in repeated episodes with different starting budgets. We design an online algorithm that achieves a regret sub-linear in T, the number of episodes, assuming access to a confidence bound oracle that achieves an o(T)-regret. Such an oracle is readily available from existing contextual bandit literature. We overcome the technical challenge with arbitrarily many possible contexts, which leads to a reinforcement learning problem with an unbounded state space. Our framework provides improved regret bounds in certain settings when the DM is provided with unlabeled feature data, which is novel to the contextual BwK literature.
[LG-13] Scalable Gaussian Processes: Advances in Iterative Methods and Pathwise Conditioning
Link: https://arxiv.org/abs/2507.06839
Authors: Jihao Andreas Lin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: PhD Thesis, University of Cambridge
Abstract:Gaussian processes are a powerful framework for uncertainty-aware function approximation and sequential decision-making. Unfortunately, their classical formulation does not scale gracefully to large amounts of data and modern hardware for massively-parallel computation, prompting many researchers to develop techniques which improve their scalability. This dissertation focuses on the powerful combination of iterative methods and pathwise conditioning to develop methodological contributions which facilitate the use of Gaussian processes in modern large-scale settings. By combining these two techniques synergistically, expensive computations are expressed as solutions to systems of linear equations and obtained by leveraging iterative linear system solvers. This drastically reduces memory requirements, facilitating application to significantly larger amounts of data, and introduces matrix multiplication as the main computational operation, which is ideal for modern hardware.
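As background on how pathwise conditioning turns posterior sampling into a linear solve, the standard form of the update (Matheron's rule) is given below. The notation is introduced for this note and is an assumption, not taken from the thesis: $f \sim \mathcal{GP}(0, k)$ is a prior sample, $X$ and $y$ are the training inputs and targets, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ is a noise sample.

```latex
% Pathwise conditioning: a posterior sample is a prior sample plus a data-driven correction
\[
(f \mid y)(\cdot) \;=\; f(\cdot) \;+\; K_{(\cdot)X}\, v,
\qquad
\bigl(K_{XX} + \sigma^2 I\bigr)\, v \;=\; y - f(X) - \varepsilon
\]
```

The vector $v$ is the solution of a symmetric positive-definite linear system, so it can be obtained with an iterative solver that only needs matrix-vector products with $K_{XX} + \sigma^2 I$, which matches the abstract's point about memory savings and matrix multiplication as the main operation.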
[LG-14] Speech Tokenizer is Key to Consistent Representation
Link: https://arxiv.org/abs/2507.06802
Authors: Wonjin Jung,Sungil Kang,Dong-Yeon Cho
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Speech tokenization is crucial in digital speech processing, converting continuous speech signals into discrete units for various computational tasks. This paper introduces a novel speech tokenizer with broad applicability across downstream tasks. While recent advances in residual vector quantization (RVQ) have incorporated semantic elements, they often neglect critical acoustic features. We propose an advanced approach that simultaneously encodes both linguistic and acoustic information, preserving prosodic and emotional content. Our method significantly enhances speech representation fidelity across diverse applications. Empirical evaluations demonstrate its effectiveness in speech coding, voice conversion, emotion recognition, and multimodal language modeling, without requiring additional training. This versatility underscores its potential as a key tool for advancing AI-driven speech processing.
[LG-15] Learning safe constrained policies via imitation learning: Connection to Probabilistic Inference and a Naive Algorithm
Link: https://arxiv.org/abs/2507.06780
Authors: George Papadopoulos,George A. Vouros
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:This article introduces an imitation learning method for learning maximum entropy policies that comply with constraints demonstrated by expert trajectories executing a task. The formulation of the method takes advantage of results connecting performance to bounds for the KL-divergence between demonstrated and learned policies, and its objective is rigorously justified through a connection to a probabilistic inference framework for reinforcement learning, incorporating the reinforcement learning objective and the objective to abide by constraints in an entropy maximization setting. The proposed algorithm optimizes the learning objective with dual gradient descent, supporting effective and stable training. Experiments show that the proposed method can learn effective policy models for constraints-abiding behaviour, in settings with multiple constraints of different types, accommodating different modalities of demonstrated behaviour, and with abilities to generalize.
[LG-16] Tailoring deep learning for real-time brain-computer interfaces: From offline models to calibration-free online decoding
Link: https://arxiv.org/abs/2507.06779
Authors: Martin Wimpff,Jan Zerfowski,Bin Yang
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Despite the growing success of deep learning (DL) in offline brain-computer interfaces (BCIs), its adoption in real-time applications remains limited due to three primary challenges. First, most DL solutions are designed for offline decoding, making the transition to online decoding unclear. Second, the use of sliding windows in online decoding substantially increases computational complexity. Third, DL models typically require large amounts of training data, which are often scarce in BCI applications. To address these challenges and enable real-time, cross-subject decoding without subject-specific calibration, we introduce realtime adaptive pooling (RAP), a novel parameter-free method. RAP seamlessly modifies the pooling layers of existing offline DL models to meet online decoding requirements. It also reduces computational complexity during training by jointly decoding consecutive sliding windows. To further alleviate data requirements, our method leverages source-free domain adaptation, enabling privacy-preserving adaptation across varying amounts of target data. Our results demonstrate that RAP provides a robust and efficient framework for real-time BCI applications. It preserves privacy, reduces calibration demands, and supports co-adaptive BCI systems, paving the way for broader adoption of DL in online BCIs. These findings lay a strong foundation for developing user-centered, high-performance BCIs that facilitate immediate feedback and user learning.
[LG-17] Mutual Information Free Topological Generalization Bounds via Stability
Link: https://arxiv.org/abs/2507.06775
Authors: Mario Tuci,Lennart Bastian,Benjamin Dupuis,Nassir Navab,Tolga Birdal,Umut Şimşekli
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
Comments: 25 pages, 5 figures
Abstract:Providing generalization guarantees for stochastic optimization algorithms is a major challenge in modern learning theory. Recently, several studies highlighted the impact of the geometry of training trajectories on the generalization error, both theoretically and empirically. Among these works, a series of topological generalization bounds have been proposed, relating the generalization error to notions of topological complexity that stem from topological data analysis (TDA). Despite their empirical success, these bounds rely on intricate information-theoretic (IT) terms that can be bounded in specific cases but remain intractable for practical algorithms (such as ADAM), potentially reducing the relevance of the derived bounds. In this paper, we seek to formulate comprehensive and interpretable topological generalization bounds free of intractable mutual information terms. To this end, we introduce a novel learning theoretic framework that departs from the existing strategies via proof techniques rooted in algorithmic stability. By extending an existing notion of hypothesis set stability to trajectory stability, we prove that the generalization error of trajectory-stable algorithms can be upper bounded in terms of (i) TDA quantities describing the complexity of the trajectory of the optimizer in the parameter space, and (ii) the trajectory stability parameter of the algorithm. Through a series of experimental evaluations, we demonstrate that the TDA terms in the bound are of great importance, especially as the number of training samples grows. This ultimately forms an explanation of the empirical success of the topological generalization bounds.
[LG-18] Robust Deep Network Learning of Nonlinear Regression Tasks by Parametric Leaky Exponential Linear Units (LELUs) and a Diffusion Metric
Link: https://arxiv.org/abs/2507.06765
Authors: Enda D.V. Bigarella
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This document proposes a parametric activation function (ac.f.) aimed at improving multidimensional nonlinear data regression. It is established knowledge that nonlinear ac.f.'s are required for learning nonlinear datasets. This work shows that smoothness and gradient properties of the ac.f. further impact the performance of large neural networks in terms of overfitting and sensitivity to model parameters. Smooth but vanishing-gradient ac.f.'s such as ELU or SiLU have limited performance and non-smooth ac.f.'s such as RELU and Leaky-RELU further impart discontinuity in the trained model. Improved performance is demonstrated with a smooth “Leaky Exponential Linear Unit”, with non-zero gradient that can be trained. A novel diffusion-loss metric is also proposed to gauge the performance of the trained models in terms of overfitting.
[LG-19] Mathematical artificial data for operator learning
Link: https://arxiv.org/abs/2507.06752
Authors: Heng Wu,Benzhuo Lu
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 22 pages, 5 figures
Abstract:Machine learning has emerged as a transformative tool for solving differential equations (DEs), yet prevailing methodologies remain constrained by dual limitations: data-driven methods demand costly labeled datasets while model-driven techniques face efficiency-accuracy trade-offs. We present the Mathematical Artificial Data (MAD) framework, a new paradigm that integrates physical laws with data-driven learning to facilitate large-scale operator discovery. By exploiting DEs’ intrinsic mathematical structure to generate physics-embedded analytical solutions and associated synthetic data, MAD fundamentally eliminates dependence on experimental or simulated training data. This enables computationally efficient operator learning across multi-parameter systems while maintaining mathematical rigor. Through numerical demonstrations spanning 2D parametric problems where both the boundary values and source term are functions, we showcase MAD’s generalizability and superior efficiency/accuracy across various DE scenarios. This physics-embedded-data-driven framework and its capacity to handle complex parameter spaces gives it the potential to become a universal paradigm for physics-informed machine intelligence in scientific computing.
[LG-20] PINN-Obs: Physics-Informed Neural Network-Based Observer for Nonlinear Dynamical Systems
Link: https://arxiv.org/abs/2507.06712
Authors: Ayoub Farkane,Mohamed Boutayeb,Mustapha Oudani,Mounir Ghogho
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
Comments:
Abstract:State estimation for nonlinear dynamical systems is a critical challenge in control and engineering applications, particularly when only partial and noisy measurements are available. This paper introduces a novel Adaptive Physics-Informed Neural Network-based Observer (PINN-Obs) for accurate state estimation in nonlinear systems. Unlike traditional model-based observers, which require explicit system transformations or linearization, the proposed framework directly integrates system dynamics and sensor data into a physics-informed learning process. The observer adaptively learns an optimal gain matrix, ensuring convergence of the estimated states to the true system states. A rigorous theoretical analysis establishes formal convergence guarantees, demonstrating that the proposed approach achieves uniform error minimization under mild observability conditions. The effectiveness of PINN-Obs is validated through extensive numerical simulations on diverse nonlinear systems, including an induction motor model, a satellite motion system, and benchmark academic examples. Comparative experimental studies against existing observer designs highlight its superior accuracy, robustness, and adaptability.
[LG-21] Value from Observations: Towards Large-Scale Imitation Learning via Self-Improvement
Link: https://arxiv.org/abs/2507.06701
Authors: Michael Bloesch,Markus Wulfmeier,Philemon Brakel,Todor Davchev,Martina Zambelli,Jost Tobias Springenberg,Abbas Abdolmaleki,William F Whitney,Nicolas Heess,Roland Hafner,Martin Riedmiller
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Imitation Learning from Observation (IfO) offers a powerful way to learn behaviors at large scale: unlike behavior cloning or offline reinforcement learning, IfO can leverage action-free demonstrations and thus circumvents the need for costly action-labeled demonstrations or reward functions. However, current IfO research focuses on idealized scenarios with mostly bimodal-quality data distributions, which limits how meaningful the results are. In contrast, this paper investigates more nuanced distributions and introduces a method to learn from such data, moving closer to a paradigm in which imitation learning can be performed iteratively via self-improvement. Our method adapts RL-based imitation learning to action-free demonstrations, using a value function to transfer information between expert and non-expert data. Through comprehensive evaluation, we delineate the relation between different data distributions and the applicability of algorithms and highlight the limitations of established methods. Our findings provide valuable insights for developing more robust and practical IfO techniques on a path to scalable behaviour learning.
[LG-22] Heterogeneous Graph Neural Networks for Short-term State Forecasting in Power Systems across Domains and Time Scales: A Hydroelectric Power Plant Case Study
Link: https://arxiv.org/abs/2507.06694
Authors: Raffael Theiler,Olga Fink
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: 25 pages, 9 figures
Abstract:Accurate short-term state forecasting is essential for efficient and stable operation of modern power systems, especially in the context of increasing variability introduced by renewable and distributed energy resources. As these systems evolve rapidly, it becomes increasingly important to reliably predict their states in the short term to ensure operational stability, support control decisions, and enable interpretable monitoring of sensor and machine behavior. Modern power systems often span multiple physical domains - including electrical, mechanical, hydraulic, and thermal - posing significant challenges for modeling and prediction. Graph Neural Networks (GNNs) have emerged as a promising data-driven framework for system state estimation and state forecasting in such settings. By leveraging the topological structure of sensor networks, GNNs can implicitly learn inter-sensor relationships and propagate information across the network. However, most existing GNN-based methods are designed under the assumption of homogeneous sensor relationships and are typically constrained to a single physical domain. This limitation restricts their ability to integrate and reason over heterogeneous sensor data commonly encountered in real-world energy systems, such as those used in energy conversion infrastructure. In this work, we propose the use of Heterogeneous Graph Attention Networks to address these limitations. Our approach models both homogeneous intra-domain and heterogeneous inter-domain relationships among sensor data from two distinct physical domains - hydraulic and electrical - which exhibit fundamentally different temporal dynamics. Experimental results demonstrate that our method significantly outperforms conventional baselines on average by 35.5% in terms of normalized root mean square error, confirming its effectiveness in multi-domain, multi-rate power system state forecasting.
[LG-23] Federated Learning Inspired Fuzzy Systems: Decentralized Rule Updating for Privacy and Scalable Decision Making
Link: https://arxiv.org/abs/2507.06652
Authors: Arthur Alexander Lim(1),Zhen Bin It(2),Jovan Bowen Heng(2),Tee Hui Teo(2) ((1) The University of Newcastle, Callaghan, Australia (2) Singapore University of Technology and Design, Singapore)
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Fuzzy systems allow machines, systems, and frameworks to deal with uncertainty, which the binary logic used by most computers cannot represent. These systems have already been deployed for certain use cases, and this paper proposes ways to improve them further by drawing inspiration from machine learning and federated learning. Machine learning is a recent technological breakthrough that could be applied to fuzzy systems to improve the results they produce. Federated learning is another recent technology with great potential: it improves machine learning training by reducing privacy risk, the burden on networking infrastructure, and the latency of the latest model. Aspects of federated learning could be used to improve fuzzy systems, for example by updating in a decentralized way the fuzzy rules that make up a key part of fuzzy systems, so that they improve over time. This paper discusses how these improvements would be implemented in fuzzy systems and how they would improve them. It also discusses certain limitations on the potential improvements. It concludes that the proposed ideas require further investigation to determine how large the improvements are, but that the potential to improve fuzzy systems is there.
[LG-24] Prevention of Overfitting on Mesh-Structured Data Regressions with a Modified Laplace Operator
Link: https://arxiv.org/abs/2507.06631
Authors: Enda D.V. Bigarella
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:This document reports on a method for detecting and preventing overfitting on data regressions, herein applied to mesh-like data structures. The mesh structure allows for the straightforward computation of the Laplace-operator second-order derivatives in a finite-difference fashion for noiseless data. Derivatives of the training data are computed on the original training mesh to serve as a true label of the entropy of the training data. Derivatives of the trained data are computed on a staggered mesh to identify oscillations in the interior of the original training mesh cells. The loss of the Laplace-operator derivatives is used for hyperparameter optimisation, achieving a reduction of unwanted oscillation through the minimisation of the entropy of the trained model. In this setup, testing does not require the splitting of points from the training data, and training is thus directly performed on all available training points. The Laplace operator applied to the trained data on a staggered mesh serves as a surrogate testing metric based on diffusion properties.
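The staggered-mesh idea in this abstract can be illustrated with a minimal sketch, assuming a uniform 2D grid and the standard five-point finite-difference Laplacian. The function names, the toy `target` surface, and the `overfitted` model are hypothetical and are not taken from the paper; the mean absolute Laplacian used here is only a stand-in for the paper's loss.

```python
import math

def laplacian_2d(values, h):
    """Five-point finite-difference Laplacian at the interior nodes of a uniform grid."""
    lap = []
    for i in range(1, len(values) - 1):
        row = []
        for j in range(1, len(values[0]) - 1):
            stencil = (values[i - 1][j] + values[i + 1][j]
                       + values[i][j - 1] + values[i][j + 1]
                       - 4.0 * values[i][j])
            row.append(stencil / (h * h))
        lap.append(row)
    return lap

def sample(f, n, h, offset=0.0):
    """Evaluate f on an n-by-n grid with spacing h, shifted by offset in x and y."""
    return [[f(offset + i * h, offset + j * h) for j in range(n)] for i in range(n)]

def mean_abs(grid):
    """Mean absolute value over all entries of a 2D list."""
    flat = [abs(v) for row in grid for v in row]
    return sum(flat) / len(flat)

H = 0.1  # mesh spacing (assumed)
N = 8    # nodes per side (assumed)

def target(x, y):
    # Smooth training surface; its Laplacian is exactly 4 everywhere.
    return x * x + y * y

def overfitted(x, y):
    # Hypothetical trained model: agrees with target at the mesh nodes
    # but oscillates in the interior of each mesh cell.
    return target(x, y) + 0.05 * math.sin(math.pi * x / H) * math.sin(math.pi * y / H)

# Laplacian evaluated on the original mesh versus on the staggered (cell-centre) mesh.
on_nodes = mean_abs(laplacian_2d(sample(overfitted, N, H), H))
on_staggered = mean_abs(laplacian_2d(sample(overfitted, N, H, offset=H / 2), H))
```

In this toy setup `on_nodes` stays near 4, the value for the smooth surface, while `on_staggered` is about ten times larger, so the oscillation is invisible on the training mesh and only shows up on the staggered one.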
[LG-25] UniOD: A Universal Model for Outlier Detection across Diverse Domains
Link: https://arxiv.org/abs/2507.06624
Authors: Dazhi Fu,Jicong Fan
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 4 figures
Abstract:Outlier detection (OD) seeks to distinguish inliers and outliers in completely unlabeled datasets and plays a vital role in science and engineering. Most existing OD methods require troublesome dataset-specific hyperparameter tuning and costly model training before they can be deployed to identify outliers. In this work, we propose UniOD, a universal OD framework that leverages labeled datasets to train a single model capable of detecting outliers of datasets from diverse domains. Specifically, UniOD converts each dataset into multiple graphs, produces consistent node features, and frames outlier detection as a node-classification task, and is able to generalize to unseen domains. As a result, UniOD avoids effort on model selection and hyperparameter tuning, reduces computational cost, and effectively utilizes the knowledge from historical datasets, which improves the convenience and accuracy in real applications. We evaluate UniOD on 15 benchmark OD datasets against 15 state-of-the-art baselines, demonstrating its effectiveness.
[LG-26] Steps Adaptive Decay DPSGD: Enhancing Performance on Imbalanced Datasets with Differential Privacy with HAM10000
Link: https://arxiv.org/abs/2507.06619
Authors: Xiaobo Huang,Fang Xie
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:When applying machine learning to medical image classification, data leakage is a critical issue. Previous methods, such as adding noise to gradients for differential privacy, work well on large datasets like MNIST and CIFAR-100, but fail on small, imbalanced medical datasets like HAM10000. This is because the imbalanced distribution causes gradients from minority classes to be clipped and lose crucial information, while majority classes dominate. This leads the model to fall into suboptimal solutions early. To address this, we propose SAD-DPSGD, which uses a linear decaying mechanism for noise and clipping thresholds. By allocating more privacy budget and using higher clipping thresholds in the initial training phases, the model avoids suboptimal solutions and enhances performance. Experiments show that SAD-DPSGD outperforms Auto-DPSGD on HAM10000, improving accuracy by 2.15% under $\epsilon = 3.0$, $\delta = 10^{-3}$.
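A minimal sketch of the clip-then-add-noise gradient step with a linearly decaying clipping threshold, assuming a standard DP-SGD style update. The abstract does not give the exact schedules, so the start and end values, the fixed noise multiplier, and all function names below are hypothetical, and no privacy accounting is performed.

```python
import math
import random

def linear_decay(start, end, step, total_steps):
    """Linearly interpolate from start (first step) to end (last step)."""
    if total_steps <= 1:
        return end
    return start + (end - start) * (step / (total_steps - 1))

def clip(grad, threshold):
    """Scale a per-sample gradient so its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_sample_grads, clip_threshold, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, clip_threshold) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    sigma = noise_multiplier * clip_threshold  # noise scale tied to the clipping threshold
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

rng = random.Random(0)
TOTAL_STEPS = 5  # assumed, for illustration only
updates = []
for step in range(TOTAL_STEPS):
    # Higher clipping threshold early in training, lower later (assumed 2.0 -> 0.5).
    threshold = linear_decay(2.0, 0.5, step, TOTAL_STEPS)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients
    updates.append(private_mean_gradient(grads, threshold, noise_multiplier=1.0, rng=rng))
```

With a large early threshold, the large-norm gradient `[3.0, 4.0]` keeps more of its magnitude in the first steps, which is the effect the abstract attributes to higher initial clipping thresholds.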
[LG-27] Nexus: Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing
Link: https://arxiv.org/abs/2507.06608
Authors: Xiaoxiang Shi,Colin Cai,Junjia Du,Zhanda Zhu,Xingda Wei,Zhihao Jia
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Current prefill-decode (PD) disaggregation is typically deployed at the level of entire serving engines, assigning separate GPUs to handle prefill and decode phases. While effective at reducing latency, this approach demands more hardware. To improve GPU utilization, Chunked Prefill mixes prefill and decode requests within the same batch, but introduces phase interference between prefill and decode. While existing PD disaggregation solutions separate the phases across GPUs, we ask: can the same decoupling be achieved within a single serving engine? The key challenge lies in managing the conflicting resource requirements of prefill and decode when they share the same hardware. In this paper, we first show that chunked prefill requests cause interference with decode requests due to their distinct requirements for GPU resources. Second, we find that GPU resources exhibit diminishing returns. Beyond a saturation point, increasing GPU allocation yields negligible latency improvements. This insight enables us to split a single GPU’s resources and dynamically allocate them to prefill and decode on the fly, effectively disaggregating the two phases within the same GPU. Across a range of models and workloads, our system Nexus achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM. It also outperforms SGLang with up to 2x higher throughput, 2x lower TTFT, and 1.7x lower TBT, and achieves 1.4x higher throughput than vLLM-disaggregation using only half the number of GPUs.
[LG-28] Generalization in Reinforcement Learning for Radio Access Networks
Link: https://arxiv.org/abs/2507.06602
Authors: Burak Demirel,Yu Wang,Cristian Tatino,Pablo Soldati
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Modern radio access networks (RANs) operate in highly dynamic and heterogeneous environments, where hand-tuned, rule-based radio resource management (RRM) algorithms often underperform. While reinforcement learning (RL) can surpass such heuristics in constrained settings, the diversity of deployments and unpredictable radio conditions introduce major generalization challenges. Data-driven policies frequently overfit to training conditions, degrading performance in unseen scenarios. To address this, we propose a generalization-centered RL framework for RAN control that: (i) encodes cell topology and node attributes via attention-based graph representations; (ii) applies domain randomization to broaden the training distribution; and (iii) distributes data generation across multiple actors while centralizing training in a cloud-compatible architecture aligned with O-RAN principles. Although generalization increases computational and data-management complexity, our distributed design mitigates this by scaling data collection and training across diverse network conditions. Applied to downlink link adaptation in five 5G benchmarks, our policy improves average throughput and spectral efficiency by ~10% over an outer-loop link adaptation (OLLA) baseline with a 10% block error rate (BLER) target in full-buffer MIMO/mMIMO and by 20% under high mobility. It matches specialized RL in full-buffer traffic and achieves up to 4- and 2-fold gains in enhanced mobile broadband (eMBB) and mixed-traffic benchmarks, respectively. In nine-cell deployments, graph attention network (GAT) models offer 30% higher throughput than multilayer perceptron (MLP) baselines. These results, combined with our scalable architecture, offer a path toward AI-native 6G RAN using a single, generalizable RL agent.
[LG-29] SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Link: https://arxiv.org/abs/2507.06567
Authors: Qian Chen,Xianhao Chen,Kaibin Huang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments: 14 pages, 10 figures
Abstract:Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed within an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, causing greedy methods to be ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
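As a rough illustration of greedy selection for a monotone submodular objective under a single knapsack (storage) constraint, the sketch below uses a weighted-coverage function as a stand-in objective. It is not the paper's algorithm: the plain gain-per-cost rule shown here does not by itself carry the $(1 - 1/e)$ guarantee, and the item names, costs, and coverage sets are hypothetical.

```python
def coverage_value(chosen, covers, weights):
    """Weighted coverage: total weight of all elements covered by the chosen items."""
    covered = set()
    for item in chosen:
        covered |= covers[item]
    return sum(weights[e] for e in covered)

def greedy_knapsack(covers, weights, costs, budget):
    """Repeatedly add the affordable item with the best marginal gain per unit cost."""
    chosen, spent = [], 0.0
    remaining = set(covers)
    while remaining:
        base = coverage_value(chosen, covers, weights)
        best, best_ratio = None, 0.0
        for item in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[item] > budget:
                continue  # item does not fit in the remaining storage
            gain = coverage_value(chosen + [item], covers, weights) - base
            ratio = gain / costs[item]
            if ratio > best_ratio:
                best, best_ratio = item, ratio
        if best is None:
            break  # nothing affordable adds value
        chosen.append(best)
        spent += costs[best]
        remaining.discard(best)
    return chosen

# Hypothetical experts "a", "b", "c": each serves a set of request types.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {3}}
weights = {1: 1.0, 2: 1.0, 3: 1.0}
costs = {"a": 1.0, "b": 1.0, "c": 1.0}
selected = greedy_knapsack(covers, weights, costs, budget=2.0)
```

Because coverage has diminishing returns, the second pick is judged only on the request types the first pick left uncovered, which is the property that makes greedy selection reasonable in the submodular case.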
[LG-30] Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs
Link: https://arxiv.org/abs/2507.06549
Authors: Shan Shen,Dingcheng Yang,Yuyang Xie,Chunyan Pei,Wenjian Yu,Bei Yu
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Comments: Published in Proceedings of GLSVLSI2024
Abstract:To achieve higher system energy efficiency, SRAM in SoCs is often customized. The parasitic effects cause notable discrepancies between pre-layout and post-layout circuit simulations, leading to difficulty in converging design parameters and excessive design iterations. Is it possible to predict the parasitics accurately from the pre-layout circuit, so as to perform parasitic-aware pre-layout simulation? In this work, we propose a deep-learning-based 2-stage model to accurately predict these parasitics in pre-layout stages. The model combines a Graph Neural Network (GNN) classifier and Multi-Layer Perceptron (MLP) regressors, effectively managing class imbalance of the net parasitics in SRAM circuits. We also employ Focal Loss to mitigate the impact of abundant internal net samples and integrate subcircuit information into the graph to abstract the hierarchical structure of schematics. Experiments on 4 real SRAM designs show that our approach not only reduces parasitic prediction error by up to 19X compared with the state-of-the-art model but also speeds up the simulation process by up to 598X.
[LG-31] A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning
Link: https://arxiv.org/abs/2507.06542
Authors: Tongtian Zhu,Tianyu Zhang,Mingze Wang,Zhanpeng Zhou,Can Wang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments: We discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of server-based learning, even in highly heterogeneous and communication-constrained environments
Abstract:Decentralized learning provides a scalable alternative to traditional parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Our empirical results show that concentrating communication budgets in the later stages of decentralized training markedly improves global generalization. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, is sufficient to match the performance of server-based training. We further show that low communication in decentralized learning preserves the mergeability of local models throughout training. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can converge faster than centralized mini-batch SGD. Technically, we reinterpret part of the discrepancy among local models, which was previously considered detrimental noise, as constructive components that accelerate convergence. This work challenges the common belief that decentralized learning generalizes poorly under data heterogeneity and limited communication, while offering new insights into model merging and neural network loss landscapes.
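A toy sketch of "train locally, merge once at the end", assuming that the single global merging is a uniform average of the local parameters (the abstract does not spell out the merge rule). The quadratic local losses and all names are hypothetical; this convex example only shows the mechanics and does not reproduce the paper's results.

```python
def local_training(target, steps=200, lr=0.1):
    """Gradient descent on the local loss (w - target)^2, with no communication."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)  # gradient of (w - target)^2 is 2 (w - target)
    return {"w": [w]}

def merge_models(local_models):
    """Uniformly average every parameter across the local models."""
    count = len(local_models)
    merged = {}
    for name in local_models[0]:
        length = len(local_models[0][name])
        merged[name] = [sum(m[name][k] for m in local_models) / count
                        for k in range(length)]
    return merged

# Three workers with heterogeneous data, modelled here as different targets.
targets = [1.0, 2.0, 3.0]
local_models = [local_training(t) for t in targets]

# One merge at the final step.
merged = merge_models(local_models)
```

Here each worker converges to its own target, and the merged parameter lands near 2.0, the minimiser of the summed loss over all three workers.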
[LG-32] Few-shot Learning on AMS Circuits and Its Application to Parasitic Capacitance Prediction
Link: https://arxiv.org/abs/2507.06538
Authors: Shan Shen,Yibin Zhang,Hector Rodriguez Rodriguez,Wenjian Yu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Published in Proceedings of DAC2025
Abstract:Graph representation learning is a powerful method to extract features from graph-structured data, such as analog/mixed-signal (AMS) circuits. However, training deep learning models for AMS designs is severely limited by the scarcity of integrated circuit design data. In this work, we present CircuitGPS, a few-shot learning method for parasitic effect prediction in AMS circuits. The circuit netlist is represented as a heterogeneous graph, with the coupling capacitance modeled as a link. CircuitGPS is pre-trained on link prediction and fine-tuned on edge regression. The proposed method starts with a small-hop sampling technique that converts a link or a node into a subgraph. Then, the subgraph embeddings are learned with a hybrid graph Transformer. Additionally, CircuitGPS integrates a low-cost positional encoding that summarizes the positional and structural information of the sampled subgraph. CircuitGPS improves the accuracy of coupling existence by at least 20% and reduces the MAE of capacitance estimation by at least 0.067 compared to existing methods. Our method demonstrates strong inherent scalability, enabling direct application to diverse AMS circuit designs through zero-shot learning. Furthermore, the ablation studies provide valuable insights into graph models for representation learning.
[LG-33] Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits
Link: https://arxiv.org/abs/2507.06535
Authors: Shan Shen,Shenglu Hua,Jiajun Zou,Jiawei Liu,Jianwang Zhai,Chuan Shi,Wenjian Yu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted by ICCAD2025. This is the initial version. Minor changes will be made
Abstract:Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (MSE) and softmax cross-entropy (bsmCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with an $R^2$ improvement of 33.64% to 44.20% for edge regression and an F1-score gain of $0.9\times$ to $2.1\times$ for node classification. Our code is available at this https URL.
[LG-34] Direct Regret Optimization in Bayesian Optimization
Link: https://arxiv.org/abs/2507.06529
Authors: Fengxue Zhang,Yuxin Chen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Bayesian optimization (BO) is a powerful paradigm for optimizing expensive black-box functions. Traditional BO methods typically rely on separate hand-crafted acquisition functions and surrogate models for the underlying function, and often operate in a myopic manner. In this paper, we propose a novel direct regret optimization approach that jointly learns the optimal model and non-myopic acquisition by distilling from a set of candidate models and acquisitions, and explicitly targets minimizing the multi-step regret. Our framework leverages an ensemble of Gaussian Processes (GPs) with varying hyperparameters to generate simulated BO trajectories, each guided by an acquisition function chosen from a pool of conventional choices, until a Bayesian early stop criterion is met. These simulated trajectories, capturing multi-step exploration strategies, are used to train an end-to-end decision transformer that directly learns to select next query points aimed at improving the ultimate objective. We further adopt a dense training–sparse learning paradigm: The decision transformer is trained offline with abundant simulated data sampled from ensemble GPs and acquisitions, while a limited number of real evaluations refine the GPs online. Experimental results on synthetic and real-world benchmarks suggest that our method consistently outperforms BO baselines, achieving lower simple regret and demonstrating more robust exploration in high-dimensional or noisy settings.
[LG-35] AdaDPIGU: Differentially Private SGD with Adaptive Clipping and Importance-Based Gradient Updates for Deep Neural Networks
Link: https://arxiv.org/abs/2507.06525
Authors: Huiqi Zhang,Fang Xie
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:Differential privacy has been proven effective for stochastic gradient descent; however, existing methods often suffer from performance degradation in high-dimensional settings, as the scale of injected noise increases with dimensionality. To tackle this challenge, we propose AdaDPIGU, a new differentially private SGD framework with importance-based gradient updates tailored for deep neural networks. In the pretraining stage, we apply a differentially private Gaussian mechanism to estimate the importance of each parameter while preserving privacy. During the gradient update phase, we prune low-importance coordinates and introduce a coordinate-wise adaptive clipping mechanism, enabling sparse and noise-efficient gradient updates. Theoretically, we prove that AdaDPIGU satisfies $(\varepsilon, \delta)$-differential privacy and retains convergence guarantees. Extensive experiments on standard benchmarks validate the effectiveness of AdaDPIGU. All results are reported under a fixed retention ratio of 60%. On MNIST, our method achieves a test accuracy of 99.12% under a privacy budget of $\epsilon = 8$, nearly matching the non-private model. Remarkably, on CIFAR-10, it attains 73.21% accuracy at $\epsilon = 4$, outperforming the non-private baseline of 71.12%, demonstrating that adaptive sparsification can enhance both privacy and utility.
[LG-36] Instance-Wise Monotonic Calibration by Constrained Transformation UAI
Link: https://arxiv.org/abs/2507.06516
Authors: Yunrui Zhang,Gustavo Batista,Salil S. Kanhere
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to Conference on Uncertainty in Artificial Intelligence (UAI)
Abstract:Deep neural networks often produce miscalibrated probability estimates, leading to overconfident predictions. A common approach for calibration is fitting a post-hoc calibration map on unseen validation data that transforms predicted probabilities. A key desirable property of the calibration map is instance-wise monotonicity (i.e., preserving the ranking of probability outputs). However, most existing post-hoc calibration methods do not guarantee monotonicity. Previous monotonic approaches either use an under-parameterized calibration map with limited expressive ability or rely on black-box neural networks, which lack interpretability and robustness. In this paper, we propose a family of novel monotonic post-hoc calibration methods, which employs a constrained calibration map parameterized linearly with respect to the number of classes. Our proposed approach ensures expressiveness, robustness, and interpretability while preserving the relative ordering of the probability output by formulating the proposed calibration map as a constrained optimization problem. Our proposed methods achieve state-of-the-art performance across datasets with different deep neural network models, outperforming existing calibration methods while being data and computation-efficient. Our code is available at this https URL
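To make "instance-wise monotonicity" concrete, the sketch below uses temperature scaling, a standard post-hoc calibration map, rather than the constrained maps proposed in the paper. It shows a map that changes confidence but keeps the class ranking of each instance; the logits and the temperature value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a positive temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(values):
    """Class indices ordered from most to least probable."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

logits = [2.0, 0.5, -1.0]                         # hypothetical network output
raw = softmax(logits)                             # uncalibrated probabilities
calibrated = softmax(logits, temperature=2.5)     # softened, less confident probabilities

same_order = ranking(raw) == ranking(calibrated)  # monotone map: ranking is preserved
```

Temperature scaling has a single parameter, which is an instance of the under-parameterized monotonic maps the abstract contrasts with its own, more expressive family.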
[LG-37] Prediction-Augmented Mechanism Design for Weighted Facility Location
Link: https://arxiv.org/abs/2507.06509
Authors: Yangguang Shi,Zhenyu Xue
Subjects: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: An extended abstract of this paper is to appear in the 19th Annual Conference on Theory and Applications of Models of Computation (TAMC 2025)
Abstract:Facility location is fundamental in operations research, mechanism design, and algorithmic game theory, with applications ranging from urban infrastructure planning to distributed systems. Recent research in this area has focused on augmenting classic strategyproof mechanisms with predictions to achieve an improved performance guarantee against the uncertainty under the strategic environment. Previous work has addressed the trade-off between consistency (near-optimality under accurate predictions) and robustness (bounded inefficiency under poor predictions) primarily in the unweighted setting, assuming that all agents have the same importance. However, this assumption may not hold in some practical scenarios, which motivates research on weighted facility location problems. The major contribution of the current work is to provide a prediction-augmented algorithmic framework for balancing consistency and robustness over strategic agents with non-uniform weights. In particular, through a reduction technique that identifies a subset of representative instances and maps the other given locations to the representative ones, we prove that there exists a strategyproof mechanism achieving a bounded consistency guarantee of $\frac{\sqrt{(1+c)^2 W_{\min}^2 + (1-c)^2 W_{\max}^2}}{(1+c) W_{\min}}$ and a bounded robustness guarantee of $\frac{\sqrt{(1-c)^2 W_{\min}^2 + (1+c)^2 W_{\max}^2}}{(1-c) W_{\min}}$ in weighted settings, where $c$ can be viewed as a parameter that trades off consistency against robustness, and $W_{\min}$ and $W_{\max}$ denote the minimum and maximum agent weights. We also prove that there is no strategyproof deterministic mechanism that reaches $1$-consistency and $O\left( n \cdot \frac{W_{\max}}{W_{\min}} \right)$-robustness in weighted FLP, even with full predictions of all agents.
MSC classes: 68W27, 68Q32; ACM classes: F.2.2
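The trade-off controlled by $c$ can be explored numerically. The sketch below assumes the two guarantees in the abstract read as $\sqrt{(1+c)^2 W_{\min}^2 + (1-c)^2 W_{\max}^2} / ((1+c) W_{\min})$ for consistency and $\sqrt{(1-c)^2 W_{\min}^2 + (1+c)^2 W_{\max}^2} / ((1-c) W_{\min})$ for robustness, with $0 \le c < 1$; the function names and the sample weights are hypothetical.

```python
import math

def consistency_bound(c, w_min, w_max):
    """Assumed consistency guarantee for trade-off parameter c in [0, 1)."""
    numerator = math.sqrt((1 + c) ** 2 * w_min ** 2 + (1 - c) ** 2 * w_max ** 2)
    return numerator / ((1 + c) * w_min)

def robustness_bound(c, w_min, w_max):
    """Assumed robustness guarantee for trade-off parameter c in [0, 1)."""
    numerator = math.sqrt((1 - c) ** 2 * w_min ** 2 + (1 + c) ** 2 * w_max ** 2)
    return numerator / ((1 - c) * w_min)

# Tabulate both bounds for a few values of c with hypothetical weights.
W_MIN, W_MAX = 1.0, 2.0
table = [(c, consistency_bound(c, W_MIN, W_MAX), robustness_bound(c, W_MIN, W_MAX))
         for c in (0.0, 0.25, 0.5, 0.75)]
```

Under this reading, raising $c$ lowers the consistency bound and raises the robustness bound, and with equal weights both expressions reduce to $\sqrt{2c^2 + 2}$ divided by $(1+c)$ and $(1-c)$ respectively.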
[LG-38] FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning ICCV2025
Link: https://arxiv.org/abs/2507.06482
Authors: Huan Wang,Haoran Li,Huaming Chen,Jun Yan,Jiahua Shi,Jun Shen
Subjects: Machine Learning (cs.LG)
Comments: 19 Pages, ICCV 2025
Abstract:Federated learning aims at training models collaboratively across participants while protecting privacy. However, one major challenge for this paradigm is the data heterogeneity issue, where biased data preferences across multiple clients harm the model’s convergence and performance. In this paper, we first introduce powerful diffusion models into the federated learning paradigm and show that diffusion representations are effective steers during federated training. To explore the possibility of using diffusion representations in handling data heterogeneity, we propose a novel diffusion-inspired Federated paradigm with Diffusion Representation Collaboration, termed FedDifRC, leveraging meaningful guidance of diffusion models to mitigate data heterogeneity. The key idea is to construct text-driven diffusion contrasting and noise-driven diffusion regularization, aiming to provide abundant class-related semantic information and consistent convergence signals. On the one hand, we exploit the conditional feedback from the diffusion model for different text prompts to build a text-driven contrastive learning strategy. On the other hand, we introduce a noise-driven consistency regularization to align local instances with diffusion denoising representations, constraining the optimization region in the feature space. In addition, FedDifRC can be extended to a self-supervised scheme without relying on any labeled data. We also provide a theoretical analysis for FedDifRC to ensure convergence under non-convex objectives. The experiments on different scenarios validate the effectiveness of FedDifRC and the efficiency of crucial components.
[LG-39] Stochastic Alignments: Matching an Observed Trace to Stochastic Process Models
Link: https://arxiv.org/abs/2507.06472
Authors: Tian Li,Artem Polyvyanyy,Sander J.J. Leemans
Subjects: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments:
Abstract:Process mining leverages event data extracted from IT systems to generate insights into the business processes of organizations. Such insights benefit from explicitly considering the frequency of behavior in business processes, which is captured by stochastic process models. Given an observed trace and a stochastic process model, conventional alignment-based conformance checking techniques face a fundamental limitation: They prioritize matching the trace to a model path with minimal deviations, which may, however, lead to selecting an unlikely path. In this paper, we study the problem of matching an observed trace to a stochastic process model by identifying a likely model path with a low edit distance to the trace. We phrase this as an optimization problem and develop a heuristic-guided path-finding algorithm to solve it. Our open-source implementation demonstrates the feasibility of the approach and shows that it can provide new, useful diagnostic insights for analysts.
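The tension between "few deviations" and "likely path" can be sketched with a brute-force scorer over a small, explicitly enumerated set of model paths. This is not the paper's heuristic-guided path-finding algorithm: the cost function (Levenshtein distance minus a weighted log-probability), the `balance` parameter, and the example paths are all assumptions made for illustration.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two sequences of activity labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]

def best_path(trace, model_paths, balance=1.0):
    """Pick the path minimising edit distance minus balance * log(probability)."""
    def cost(item):
        path, prob = item
        return edit_distance(trace, path) - balance * math.log(prob)
    return min(model_paths.items(), key=cost)[0]

# Hypothetical stochastic language: model paths with their probabilities.
model_paths = {
    ("a", "b", "d"): 0.001,  # matches the trace exactly but is very unlikely
    ("a", "c", "d"): 0.999,  # one deviation from the trace but highly likely
}
trace = ("a", "b", "d")

closest = best_path(trace, model_paths, balance=0.0)   # ignores probability
balanced = best_path(trace, model_paths, balance=1.0)  # weighs probability in
```

With `balance=0.0` the scorer behaves like a conventional alignment and returns the exact but rare path; with `balance=1.0` it prefers the likely path at the cost of one deviation.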
[LG-40] Mitigating Message Imbalance in Fraud Detection with Dual-View Graph Representation Learning
Link: https://arxiv.org/abs/2507.06469
Authors: Yudan Song, Yuecen Wei, Yuhang Lu, Qingyun Sun, Minglai Shao, Li-e Wang, Chunming Hu, Xianxian Li, Xingcheng Fu
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: none
Abstract:Graph representation learning has become a mainstream method for fraud detection due to its strong expressive power, which focuses on enhancing node representations through improved neighborhood knowledge capture. However, the focus on local interactions leads to imbalanced transmission of global topological information and increased risk of node-specific information being overwhelmed during aggregation due to the imbalance between fraud and benign nodes. In this paper, we first summarize the impact of topology and class imbalance on downstream tasks in GNN-based fraud detection, as the problem of imbalanced supervisory messages is caused by fraudsters’ topological behavior obfuscation and identity feature concealment. Based on statistical validation, we propose a novel dual-view graph representation learning method to mitigate Message imbalance in Fraud Detection (MimbFD). Specifically, we design a topological message reachability module for high-quality node representation learning to penetrate fraudsters’ camouflage and alleviate insufficient propagation. Then, we introduce a local confounding debiasing module to adjust node representations, enhancing the stable association between node representations and labels to balance the influence of different classes. Finally, we conducted experiments on three public fraud datasets, and the results demonstrate that MimbFD exhibits outstanding performance in fraud detection.
[LG-41] Energy-Efficient Supervised Learning with a Binary Stochastic Forward-Forward Algorithm
Link: https://arxiv.org/abs/2507.06461
Authors: Risi Jaiswal, Supriyo Datta, Joseph G. Makin
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 24 pages, 5 figures, 4 tables. Under review
Abstract:Reducing energy consumption has become a pressing need for modern machine learning, which has achieved many of its most impressive results by scaling to larger and more energy-consumptive neural networks. Unfortunately, the main algorithm for training such networks, backpropagation, poses significant challenges for custom hardware accelerators, due to both its serial dependencies and the memory footprint needed to store forward activations for the backward pass. Alternatives to backprop, although less effective, do exist; here the main computational bottleneck becomes matrix multiplication. In this study, we derive forward-forward algorithms for binary, stochastic units. Binarization of the activations transforms matrix multiplications into indexing operations, which can be executed efficiently in hardware. Stochasticity, combined with tied weights across units with different biases, bypasses the information bottleneck imposed by binary units. Furthermore, although binary sampling is slow and expensive in traditional hardware, very fast binary sampling can be implemented cheaply with p-bits (probabilistic bits), novel devices made up of unstable magnets. We evaluate our proposed algorithms on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, showing that their performance is close to real-valued forward-forward, but with an estimated energy savings of about one order of magnitude.
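The claim that binary activations turn matrix multiplication into indexing can be checked on a tiny example. The sketch below assumes activations in {0, 1} and a weight matrix stored as a list of rows; the function names are illustrative and not taken from the paper.

```python
def dense_matvec(W, x):
    """Standard matrix-vector product using multiply-accumulate."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def binary_matvec(W, x):
    """Same product for x in {0, 1}^n: gather the active columns and add them.

    No multiplications are needed; the binary vector only selects indices.
    """
    active = [j for j, x_j in enumerate(x) if x_j == 1]
    return [sum(row[j] for j in active) for row in W]

W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]
x = [1, 0, 1]  # binary activations

dense_result = dense_matvec(W, x)
indexed_result = binary_matvec(W, x)
```

Both functions return the same vector; the second one only reads and adds the weights at the active indices, which is the operation the abstract describes as cheap in hardware.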
[LG-42] Automated Neuron Labelling Enables Generative Steering and Interpretability in Protein Language Models
Link: https://arxiv.org/abs/2507.06458
Authors: Arjun Banerjee, David Martinez, Camille Dang, Ethan Tam
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 15 pages, 13 figures. Accepted to Proceedings of the Workshop on Generative AI for Biology at the 42nd International Conference on Machine Learning (Spotlight)
Abstract:Protein language models (PLMs) encode rich biological information, yet their internal neuron representations are poorly understood. We introduce the first automated framework for labeling every neuron in a PLM with biologically grounded natural language descriptions. Unlike prior approaches relying on sparse autoencoders or manual annotation, our method scales to hundreds of thousands of neurons, revealing individual neurons are selectively sensitive to diverse biochemical and structural properties. We then develop a novel neuron activation-guided steering method to generate proteins with desired traits, enabling convergence to target biochemical properties like molecular weight and instability index as well as secondary and tertiary structural motifs, including alpha helices and canonical Zinc Fingers. We finally show that analysis of labeled neurons in different model sizes reveals PLM scaling laws and a structured neuron space distribution.
[LG-43] eegFloss: A Python package for refining sleep EEG recordings using machine learning models
Link: https://arxiv.org/abs/2507.06433
Authors: Niloy Sikder, Paul Zerr, Mahdad Jafarzadeh Esfahani, Martin Dresler, Matthias Krauledat
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
Comments: The eegFloss package is available under the MIT License at this https URL
Abstract:Electroencephalography (EEG) allows monitoring of brain activity, providing insights into the functional dynamics of various brain regions and their roles in cognitive processes. EEG is a cornerstone in sleep research, serving as the primary modality of polysomnography, the gold standard in the field. However, EEG signals are prone to artifacts caused by both internal (device-specific) factors and external (environmental) interferences. As sleep studies are becoming larger, most rely on automatic sleep staging, a process highly susceptible to artifacts, leading to erroneous sleep scores. This paper addresses this challenge by introducing eegFloss, an open-source Python package to utilize eegUsability, a novel machine learning (ML) model designed to detect segments with artifacts in sleep EEG recordings. eegUsability has been trained and evaluated on manually artifact-labeled EEG data collected from 15 participants over 127 nights using the Zmax headband. It demonstrates solid overall classification performance (F1-score of approximately 0.85, Cohen's kappa of 0.78), achieving a high recall rate of approximately 94% in identifying channel-wise usable EEG data, and extends beyond Zmax. Additionally, eegFloss offers features such as automatic time-in-bed detection using another ML model named eegMobility, filtering out certain artifacts, and generating hypnograms and sleep statistics. By addressing a fundamental challenge faced by most sleep studies, eegFloss can enhance the precision and rigor of their analysis as well as the accuracy and reliability of their outcomes.
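For readers unfamiliar with the metrics quoted in this abstract, the following sketch computes recall, F1-score, and Cohen's kappa for binary usable/unusable labels. It uses the textbook definitions and made-up labels; it does not call eegFloss, whose API is not described here, and it assumes both classes occur so that no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Return (recall, f1, kappa) for labels in {0, 1}, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return recall, f1, kappa

# Made-up example: 1 = usable segment, 0 = artifact.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
recall, f1, kappa = binary_metrics(y_true, y_pred)
```

Kappa is lower than raw agreement here because half of the agreement would be expected by chance with balanced classes.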
[LG-44] Detection of Intelligent Tampering in Wireless Electrocardiogram Signals Using Hybrid Machine Learning
Link: https://arxiv.org/abs/2507.06402
Authors: Siddhant Deshpande, Yalemzerf Getnet, Waltenegus Dargie
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Signal Processing (eess.SP)
Comments: none
Abstract:With the proliferation of wireless electrocardiogram (ECG) systems for health monitoring and authentication, protecting signal integrity against tampering is becoming increasingly important. This paper analyzes the performance of CNN, ResNet, and hybrid Transformer-CNN models for tamper detection. It also evaluates the performance of a Siamese network for ECG-based identity verification. Six tampering strategies, including structured segment substitutions and random insertions, are emulated to mimic real-world attacks. The one-dimensional ECG signals are transformed into a two-dimensional representation in the time-frequency domain using the continuous wavelet transform (CWT). The models are trained and evaluated using ECG data from 54 subjects recorded in four sessions from 2019 to 2025 outside of clinical settings while the subjects performed seven different daily activities. Experimental results show that in highly fragmented manipulation scenarios, CNN, FeatCNN-TranCNN, FeatCNN-Tran and ResNet models achieved an accuracy exceeding 99.5 percent. Similarly, for subtle manipulations (for example, substitutions of 50 percent from A and 50 percent from B, and of 75 percent from A and 25 percent from B), our FeatCNN-TranCNN model demonstrated consistently reliable performance, achieving an average accuracy of 98 percent. For identity verification, the pure Transformer-Siamese network achieved an average accuracy of 98.30 percent. In contrast, the hybrid CNN-Transformer Siamese model delivered perfect verification performance with 100 percent accuracy.
[LG-45] The Riemannian Geometry associated to Gradient Flows of Linear Convolutional Networks
Link: https://arxiv.org/abs/2507.06367
Authors: El Mehdi Achour, Kathlén Kohn, Holger Rauhut
Categories: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
Comments: none
Abstract:We study geometric properties of the gradient flow for learning deep linear convolutional networks. For linear fully connected networks, it has been shown recently that the corresponding gradient flow on parameter space can be written as a Riemannian gradient flow on function space (i.e., on the product of weight matrices) if the initialization satisfies a so-called balancedness condition. We establish that the gradient flow on parameter space for learning linear convolutional networks can be written as a Riemannian gradient flow on function space regardless of the initialization. This result holds for $D$-dimensional convolutions with $D \geq 2$, and for $D = 1$ it holds if all so-called strides of the convolutions are greater than one. The corresponding Riemannian metric depends on the initialization.
[LG-46] DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction
Link: https://arxiv.org/abs/2507.06366
Authors: Yupu Zhang, Zelin Xu, Tingsong Xiao, Gustavo Seabra, Yanjun Li, Chenglong Li, Zhe Jiang
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: none
Abstract:Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pre-training graph neural network models based on vast unlabeled complexes and fine-tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well-defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large-scale, structure-aware dataset specifically designed for self-supervised GCL on protein-ligand complexes. DecoyDB consists of high-resolution ground truth complexes (less than 2.5 Angstrom) and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal (negative pairs). Each decoy is annotated with a Root Mean Squared Deviation (RMSD) from the native pose. We further design a customized GCL framework to pre-train graph neural networks based on DecoyDB and fine-tune the models with labels from PDBbind. Extensive experiments confirm that models pre-trained with DecoyDB achieve superior accuracy, label efficiency, and generalizability.
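The RMSD annotation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the native and decoy ligand poses are already expressed in the same reference frame with atoms listed in the same order, so no superposition or symmetry correction is applied; how DecoyDB actually computes its RMSD values is not stated here.

```python
import math

def rmsd(pose_a, pose_b):
    """Root mean squared deviation between two poses given as lists of (x, y, z) tuples."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(pose_a, pose_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(pose_a))

# Toy two-atom ligand: the decoy is the native pose shifted by 2 Angstrom along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoy = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
decoy_rmsd = rmsd(native, decoy)
```

A rigid shift of every atom by 2 Angstrom gives an RMSD of exactly 2 Angstrom, which makes the toy case easy to verify by hand.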
[LG-47] Neural Network-Based Parameter Estimation for Non-Autonomous Differential Equations with Discontinuous Signals
Link: https://arxiv.org/abs/2507.06267
Authors: Hyeontae Jo, Krešimir Josić, Jae Kyoung Kim
Categories: Machine Learning (cs.LG)
Comments: none
Abstract:Non-autonomous differential equations are crucial for modeling systems influenced by external signals, yet fitting these models to data becomes particularly challenging when the signals change abruptly. To address this problem, we propose a novel parameter estimation method utilizing functional approximations with artificial neural networks. Our approach, termed Harmonic Approximation of Discontinuous External Signals using Neural Networks (HADES-NN), operates in two iterated stages. In the first stage, the algorithm employs a neural network to approximate the discontinuous signal with a smooth function. In the second stage, it uses this smooth approximate signal to estimate model parameters. HADES-NN gives highly accurate and precise parameter estimates across various applications, including circadian clock systems regulated by external light inputs measured via wearable devices and the mating response of yeast to external pheromone signals. HADES-NN greatly extends the range of model systems that can be fit to real-world measurements.
[LG-48] Generalized and Unified Equivalences between Hardness and Pseudoentropy
Link: https://arxiv.org/abs/2507.05972
Authors: Lunjia Hu, Salil Vadhan
Categories: Computational Complexity (cs.CC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: none
Abstract:Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be simultaneously achieved by a single, universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithm fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.
[LG-49] How to Bridge the Sim-to-Real Gap in Digital Twin-Aided Telecommunication Networks
Link: https://arxiv.org/abs/2507.07067
Authors: Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir M. Al-Hashimi
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Training effective artificial intelligence models for telecommunications is challenging due to the scarcity of deployment-specific data. Real data collection is expensive, and available datasets often fail to capture the unique operational conditions and contextual variability of the network environment. Digital twinning provides a potential solution to this problem, as simulators tailored to the current network deployment can generate site-specific data to augment the available training datasets. However, there is a need to develop solutions to bridge the inherent simulation-to-reality (sim-to-real) gap between synthetic and real-world data. This paper reviews recent advances on two complementary strategies: 1) the calibration of digital twins (DTs) through real-world measurements, and 2) the use of sim-to-real gap-aware training strategies to robustly handle residual discrepancies between digital twin-generated and real data. For the latter, we evaluate two conceptually distinct methods that model the sim-to-real gap either at the level of the environment via Bayesian learning or at the level of the training loss via prediction-powered inference.
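The prediction-powered inference idea mentioned at the end of this abstract can be sketched for a single scalar, the expected training loss. The sketch assumes a large set of per-sample losses computed on digital-twin data and a small set of paired losses where the same contexts were also measured in the real network; the function and variable names are hypothetical, and the paper's actual training procedure may differ.

```python
def mean(values):
    return sum(values) / len(values)

def ppi_loss_estimate(sim_losses, sim_losses_paired, real_losses_paired):
    """Prediction-powered estimate of the expected real-world loss.

    sim_losses:         losses on plentiful simulated samples
    sim_losses_paired:  simulated losses for the few contexts that also have real data
    real_losses_paired: real losses for those same contexts
    The rectifier corrects the simulated average by the measured sim-to-real gap.
    """
    rectifier = mean([real - sim
                      for real, sim in zip(real_losses_paired, sim_losses_paired)])
    return mean(sim_losses) + rectifier

# Made-up numbers: the simulator underestimates the loss by 0.5 on the paired contexts.
estimate = ppi_loss_estimate(
    sim_losses=[1.0, 2.0, 3.0, 4.0],
    sim_losses_paired=[1.0, 3.0],
    real_losses_paired=[1.5, 3.5],
)
```

The design choice is that the abundant simulated data drives the estimate while the scarce real data is spent only on measuring the bias.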
[LG-50] Non-Asymptotic Analysis of Online Local Private Learning with SGD
Link: https://arxiv.org/abs/2507.07041
Authors: Enze Shi, Jinhan Xie, Bei Jiang, Linglong Kong, Xuming He
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: none
Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) has been widely used for solving optimization problems with privacy guarantees in machine learning and statistics. Despite this, a systematic non-asymptotic convergence analysis for DP-SGD, particularly in the context of online problems and local differential privacy (LDP) models, remains largely elusive. Existing non-asymptotic analyses have focused on non-private optimization methods, and hence are not applicable to privacy-preserving optimization problems. This work initiates the analysis to bridge this gap and opens the door to non-asymptotic convergence analysis of private optimization problems. A general framework is investigated for the online LDP model in stochastic optimization problems. We assume that sensitive information from individuals is collected sequentially and aim to estimate, in real-time, a static parameter that pertains to the population of interest. Most importantly, we conduct a comprehensive non-asymptotic convergence analysis of the proposed estimators in finite-sample situations, which gives their users practical guidelines regarding the effect of various hyperparameters, such as step size, parameter dimensions, and privacy budgets, on convergence rates. Our proposed estimators are validated in the theoretical and practical realms by rigorous mathematical derivations and carefully constructed numerical experiments.
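A toy version of the online local-privacy setting in this abstract is sketched below: users arrive one at a time, each reports a clipped gradient perturbed with Laplace noise, and the server runs SGD with a decaying step size to estimate a population mean. The squared-error loss, the Laplace mechanism, the 1/t step size, and the projection onto [0, 1] are all assumptions chosen for simplicity, not the estimators analysed in the paper.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def online_ldp_sgd_mean(stream, epsilon, rng, clip=1.0):
    """Estimate the mean of values in [0, 1] from privatised per-user gradients."""
    theta = 0.5
    for t, x in enumerate(stream, start=1):
        grad = theta - x                        # gradient of 0.5 * (theta - x)^2
        grad = max(-clip, min(clip, grad))      # clipping bounds the sensitivity by 2 * clip
        noisy_grad = grad + laplace_noise(2.0 * clip / epsilon, rng)  # done on the user's side
        theta -= noisy_grad / t                 # server update with step size 1 / t
        theta = max(0.0, min(1.0, theta))       # keep the estimate in the known range
    return theta

rng = random.Random(0)
estimate = online_ldp_sgd_mean([0.3] * 5000, epsilon=5.0, rng=rng)
```

Smaller values of `epsilon` inflate the noise scale and slow convergence, which is the kind of hyperparameter effect the abstract says the non-asymptotic analysis quantifies.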
[LG-51] When Context Is Not Enough: Modeling Unexplained Variability in Car-Following Behavior
Link: https://arxiv.org/abs/2507.07012
Authors: Chengyuan Zhang, Zhengbing He, Cathy Wu, Lijun Sun
Categories: Applications (stat.AP); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: none
Abstract:Modeling car-following behavior is fundamental to microscopic traffic simulation, yet traditional deterministic models often fail to capture the full extent of variability and unpredictability in human driving. While many modern approaches incorporate context-aware inputs (e.g., spacing, speed, relative speed), they frequently overlook structured stochasticity that arises from latent driver intentions, perception errors, and memory effects – factors that are not directly observable from context alone. To fill the gap, this study introduces an interpretable stochastic modeling framework that captures not only context-dependent dynamics but also residual variability beyond what context can explain. Leveraging deep neural networks integrated with nonstationary Gaussian processes (GPs), our model employs a scenario-adaptive Gibbs kernel to learn dynamic temporal correlations in acceleration decisions, where the strength and duration of correlations between acceleration decisions evolve with the driving context. This formulation enables a principled, data-driven quantification of uncertainty in acceleration, speed, and spacing, grounded in both observable context and latent behavioral variability. Comprehensive experiments on the naturalistic vehicle trajectory dataset collected from the German highway, i.e., the HighD dataset, demonstrate that the proposed stochastic simulation method within this framework surpasses conventional methods in both predictive performance and interpretable uncertainty quantification. The integration of interpretability and accuracy makes this framework a promising tool for traffic analysis and safety-critical applications.
[LG-52] Federated Learning-based MARL for Strengthening Physical-Layer Security in B5G Networks
Link: https://arxiv.org/abs/2507.06997
Authors: Deemah H. Tashman, Soumaya Cherkaoui, Walaa Hamouda
Categories: Signal Processing (eess.SP); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: none
Abstract:This paper explores the application of a federated learning-based multi-agent reinforcement learning (MARL) strategy to enhance physical-layer security (PLS) in a multi-cellular network within the context of beyond 5G networks. At each cell, a base station (BS) operates as a deep reinforcement learning (DRL) agent that interacts with the surrounding environment to maximize the secrecy rate of legitimate users in the presence of an eavesdropper. This eavesdropper attempts to intercept the confidential information shared between the BS and its authorized users. The DRL agents are deemed to be federated since they only share their network parameters with a central server and not the private data of their legitimate users. Two DRL approaches, deep Q-network (DQN) and Reinforce deep policy gradient (RDPG), are explored and compared. The results demonstrate that RDPG converges more rapidly than DQN. In addition, we demonstrate that the proposed method outperforms the distributed DRL approach. Furthermore, the outcomes illustrate the trade-off between security and complexity.
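The quantity the agents maximise, the secrecy rate, is not defined in the abstract. A common definition, assumed here, is the non-negative gap between the legitimate user's channel capacity and the eavesdropper's, each written as log2(1 + SNR). The sketch below uses that definition with made-up SNR values.

```python
import math

def secrecy_rate(snr_user, snr_eavesdropper):
    """Secrecy rate in bits/s/Hz under the assumed capacity-difference definition."""
    user_capacity = math.log2(1.0 + snr_user)
    eavesdropper_capacity = math.log2(1.0 + snr_eavesdropper)
    return max(0.0, user_capacity - eavesdropper_capacity)

# Linear (not dB) SNR values chosen so the logarithms are exact.
good_link = secrecy_rate(snr_user=15.0, snr_eavesdropper=3.0)
bad_link = secrecy_rate(snr_user=1.0, snr_eavesdropper=3.0)
```

When the eavesdropper's channel is better than the user's, the rate is clamped to zero, meaning no information can be sent with a secrecy guarantee under this model.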
[LG-53] Off-Policy Evaluation Under Nonignorable Missing Data
Link: https://arxiv.org/abs/2507.06961
Authors: Han Wang, Yang Xu, Wenbin Lu, Rui Song
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: none
Abstract:Off-Policy Evaluation (OPE) aims to estimate the value of a target policy using offline data collected from potentially different policies. In real-world applications, however, logged data often suffers from missingness. While OPE has been extensively studied in the literature, a theoretical understanding of how missing data affects OPE results remains unclear. In this paper, we investigate OPE in the presence of monotone missingness and theoretically demonstrate that the value estimates remain unbiased under ignorable missingness but can be biased under nonignorable (informative) missingness. To retain the consistency of value estimation, we propose an inverse probability weighted value estimator and conduct statistical inference to quantify the uncertainty of the estimates. Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields a more reliable value inference under missing data.
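Why informative missingness biases a naive value estimate, and how inverse probability weighting repairs it, can be seen in a stripped-down example. The sketch below estimates a mean return when high-return trajectories are observed only half the time; it assumes the observation probabilities are known and is a generic Horvitz-Thompson style estimator, not the paper's estimator for monotone missingness in sequential data.

```python
def complete_case_mean(values, observed):
    """Average over observed entries only (biased if missingness depends on the value)."""
    seen = [v for v, r in zip(values, observed) if r]
    return sum(seen) / len(seen)

def ipw_mean(values, observed, obs_prob):
    """Inverse probability weighted mean: each observed value counts 1 / P(observed)."""
    n = len(values)
    return sum(v / p for v, r, p in zip(values, observed, obs_prob) if r) / n

# Four high returns (observed with probability 0.5) and four low returns (always observed).
values = [10.0, 10.0, 10.0, 10.0, 2.0, 2.0, 2.0, 2.0]
observed = [True, True, False, False, True, True, True, True]
obs_prob = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]

naive = complete_case_mean(values, observed)
weighted = ipw_mean(values, observed, obs_prob)
true_mean = sum(values) / len(values)
```

The complete-case average underestimates the true mean of 6 because high returns are under-represented, while the weighted estimate recovers it in this idealised case.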
[LG-54] Machine-Learned Force Fields for Lattice Dynamics at Coupled-Cluster Level Accuracy
Link: https://arxiv.org/abs/2507.06929
Authors: Sita Schönbauer, Johanna P. Carbone, Andreas Grüneis
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 22 pages, 12 figures
Abstract:We investigate Machine-Learned Force Fields (MLFFs) trained on approximate Density Functional Theory (DFT) and Coupled Cluster (CC) level potential energy surfaces for the carbon diamond and lithium hydride solids. We assess the accuracy and precision of the MLFFs by calculating phonon dispersions and vibrational densities of states (VDOS) that are compared to experiment and reference ab initio results. To overcome limitations from long-range effects and the lack of atomic forces in the CC training data, a delta-learning approach based on the difference between CC and DFT results is explored. Compared to DFT, MLFFs trained on CC theory yield higher vibrational frequencies for optical modes, agreeing better with experiment. Furthermore, the MLFFs are used to estimate anharmonic effects on the VDOS of lithium hydride at the level of CC theory.
[LG-55] Distribution-free inference for LightGBM and GLM with Tweedie loss
Link: https://arxiv.org/abs/2507.06921
Authors: Alokesh Manna, Aditya Vikram Sett, Dipak K. Dey, Yuwen Gu, Elizabeth D. Schifano, Jichao He
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: none
Abstract:Prediction uncertainty quantification has been a key research topic in recent years for scientific and business problems. In the insurance industry (Parodi, 2023), assessing the range of possible claim costs for individual drivers improves premium pricing accuracy. It also enables insurers to manage risk more effectively by accounting for uncertainty in accident likelihood and severity. In the presence of covariates, a variety of regression-type models are often used for modeling insurance claims, ranging from relatively simple generalized linear models (GLMs) to regularized GLMs to gradient boosting models (GBMs). Conformal predictive inference has arisen as a popular distribution-free approach for quantifying predictive uncertainty under relatively weak assumptions of exchangeability, and has been well studied under the classic linear regression setting. In this work, we propose new non-conformity measures for GLMs and GBMs with GLM-type loss. Using regularized Tweedie GLM regression and LightGBM with Tweedie loss, we demonstrate conformal prediction performance with these non-conformity measures in insurance claims data. Our simulation results favor the use of locally weighted Pearson residuals for LightGBM over other methods considered, as the resulting intervals maintained the nominal coverage with the smallest average width.
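A split-conformal interval built from Pearson residuals can be sketched in a few lines. The sketch assumes a Tweedie variance function V(mu) = mu^p, strictly positive predicted means from some already fitted model, and a held-out calibration set; the exact "locally weighted" score used in the paper may differ from this plain Pearson residual.

```python
import math

def pearson_scores(y, mu, power):
    """Non-conformity scores |y - mu| / sqrt(V(mu)) with V(mu) = mu ** power."""
    return [abs(yi - mi) / mi ** (power / 2) for yi, mi in zip(y, mu)]

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def tweedie_interval(mu_new, q, power):
    """Prediction interval for a new predicted mean, truncated at zero for claim costs."""
    half_width = q * mu_new ** (power / 2)
    return max(0.0, mu_new - half_width), mu_new + half_width

# Made-up calibration data (power = 2 corresponds to a gamma-type variance).
mu_cal = [1.0, 2.0, 4.0, 8.0, 1.0, 2.0, 4.0, 8.0, 2.0]
y_cal = [1.5, 1.0, 5.0, 8.0, 3.0, 2.25, 1.0, 9.0, 5.0]
q = conformal_quantile(pearson_scores(y_cal, mu_cal, power=2.0), alpha=0.2)
interval = tweedie_interval(4.0, q, power=2.0)
```

Scaling the residual by the variance function makes the interval widen with the predicted mean, which suits heteroscedastic claim data better than a constant-width interval.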
[LG-56] Adaptive collaboration for online personalized distributed learning with heterogeneous clients
Link: https://arxiv.org/abs/2507.06844
Authors: Constantin Philippenko, Batiste Le Bars, Kevin Scaman, Laurent Massoulié
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 18 pages
Abstract:We study the problem of online personalized decentralized learning with N statistically heterogeneous clients collaborating to accelerate local training. An important challenge in this setting is to select relevant collaborators to reduce gradient variance while mitigating the introduced bias. To tackle this, we introduce a gradient-based collaboration criterion, allowing each client to dynamically select peers with similar gradients during the optimization process. Our criterion is motivated by a refined and more general theoretical analysis of the All-for-one algorithm, proved to be optimal in Even et al. (2022) for an oracle collaboration scheme. We derive excess loss upper-bounds for smooth objective functions, being either strongly convex, non-convex, or satisfying the Polyak-Lojasiewicz condition; our analysis reveals that the algorithm acts as a variance reduction method where the speed-up depends on a sufficient variance. We put forward two collaboration methods instantiating the proposed general schema; and we show that one variant preserves the optimality of All-for-one. We validate our results with experiments on synthetic and real datasets.
[LG-57] Designing Robust Software Sensors for Nonlinear Systems via Neural Networks and Adaptive Sliding Mode Control
Link: https://arxiv.org/abs/2507.06817
Authors: Ayoub Farkane, Mohamed Boutayeb, Mustapha Oudani, Mounir Ghogho
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Comments: Submitted to IEEE Transactions Journal
Abstract:Accurate knowledge of the state variables in a dynamical system is critical for effective control, diagnosis, and supervision, especially when direct measurements of all states are infeasible. This paper presents a novel approach to designing software sensors for nonlinear dynamical systems expressed in their most general form. Unlike traditional model-based observers that rely on explicit transformations or linearization, the proposed framework integrates neural networks with adaptive Sliding Mode Control (SMC) to design a robust state observer under a less restrictive set of conditions. The learning process is driven by available sensor measurements, which are used to correct the observer’s state estimate. The training methodology leverages the system’s governing equations as a physics-based constraint, enabling observer synthesis without access to ground-truth state trajectories. By employing a time-varying gain matrix dynamically adjusted by the neural network, the observer adapts in real-time to system changes, ensuring robustness against noise, external disturbances, and variations in system dynamics. Furthermore, we provide sufficient conditions to guarantee estimation error convergence, establishing a theoretical foundation for the observer’s reliability. The methodology’s effectiveness is validated through simulations on challenging examples, including systems with non-differentiable dynamics and varying observability conditions. These examples, which are often problematic for conventional techniques, serve to demonstrate the robustness and broad applicability of our approach. The results show rapid convergence and high accuracy, underscoring the method’s potential for addressing complex state estimation challenges in real-world applications.
[LG-58] Fast Gaussian Processes under Monotonicity Constraints
Link: https://arxiv.org/abs/2507.06677
Authors: Chao Zhang, Jasper M. Everink, Jakob Sauer Jørgensen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 35 pages, 10 figures
Abstract:Gaussian processes (GPs) are widely used as surrogate models for complicated functions in scientific and engineering applications. In many cases, prior knowledge about the function to be approximated, such as monotonicity, is available and can be leveraged to improve model fidelity. Incorporating such constraints into GP models enhances predictive accuracy and reduces uncertainty, but remains a computationally challenging task for high-dimensional problems. In this work, we present a novel virtual point-based framework for building constrained GP models under monotonicity constraints, based on regularized linear randomize-then-optimize (RLRTO), which enables efficient sampling from a constrained posterior distribution by means of solving randomized optimization problems. We also enhance two existing virtual point-based approaches by replacing Gibbs sampling with the No U-Turn Sampler (NUTS) for improved efficiency. A Python implementation of these methods is provided and can be easily applied to a wide range of problems. This implementation is then used to validate the approaches on approximating a range of synthetic functions, demonstrating comparable predictive performance between all considered methods and significant improvements in computational efficiency with the two NUTS methods and especially with the RLRTO method. The framework is further applied to construct surrogate models for systems of differential equations.
[LG-59] Semi-parametric Functional Classification via Path Signatures Logistic Regression
Link: https://arxiv.org/abs/2507.06637
Authors: Pengcheng Zeng, Siyuan Jiang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: none
Abstract:We propose Path Signatures Logistic Regression (PSLR), a semi-parametric framework for classifying vector-valued functional data with scalar covariates. Classical functional logistic regression models rely on linear assumptions and fixed basis expansions, which limit flexibility and degrade performance under irregular sampling. PSLR overcomes these issues by leveraging truncated path signatures to construct a finite-dimensional, basis-free representation that captures nonlinear and cross-channel dependencies. By embedding trajectories as time-augmented paths, PSLR extracts stable, geometry-aware features that are robust to sampling irregularity without requiring a common time grid, while still preserving subject-specific timing patterns. We establish theoretical guarantees for the existence and consistent estimation of the optimal truncation order, along with non-asymptotic risk bounds. Experiments on synthetic and real-world datasets show that PSLR outperforms traditional functional classifiers in accuracy, robustness, and interpretability, particularly under non-uniform sampling schemes. Our results highlight the practical and theoretical benefits of integrating rough path theory into modern functional data analysis.
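The truncated path signature used as a feature map in this abstract can be computed by hand up to level 2 for a piecewise-linear path. The sketch below applies Chen's identity segment by segment and includes a helper for time augmentation; it is a minimal illustration with hypothetical function names, and the truncation order and feature pipeline of PSLR itself are not reproduced.

```python
def time_augment(times, values):
    """Prepend the observation time to each point so irregular timing is kept in the path."""
    return [(t, *v) for t, v in zip(times, values)]

def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path: list of points (tuples of floats), at least two points.
    Returns (s1, s2) where s1[i] is the total increment of channel i and
    s2[i][j] is the iterated integral of channel i against channel j.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        # Chen's identity: append a straight segment to the path seen so far.
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2

# An L-shaped path: right one unit, then up one unit.
s1, s2 = signature_level2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
levy_area = 0.5 * (s2[0][1] - s2[1][0])

# Irregularly sampled scalar series turned into a two-channel path.
augmented = time_augment([0.0, 0.1, 0.7], [(1.0,), (2.0,), (0.5,)])
```

The antisymmetric part of level 2 (the Lévy area) captures the order in which channels move, which is the kind of cross-channel dependency that a fixed basis expansion does not see.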
[LG-60] On the Hardness of Unsupervised Domain Adaptation: Optimal Learners and Information-Theoretic Perspective
Link: https://arxiv.org/abs/2507.06552
Authors: Zhiyi Dong, Zixuan Liu, Yongyi Mao
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Accepted at the 4th Conference on Lifelong Learning Agents (CoLLAs 2025)
Abstract:This paper studies the hardness of unsupervised domain adaptation (UDA) under covariate shift. We model the uncertainty that the learner faces by a distribution π in the ground-truth triples (p, q, f) – which we call a UDA class – where (p, q) is the source-target distribution pair and f is the classifier. We define the performance of a learner as the overall target domain risk, averaged over the randomness of the ground-truth triple. This formulation couples the source distribution, the target distribution and the classifier in the ground truth, and deviates from the classical worst-case analyses, which pessimistically emphasize the impact of hard but rare UDA instances. In this formulation, we precisely characterize the optimal learner. The performance of the optimal learner then allows us to define the learning difficulty for the UDA class and for the observed sample. To quantify this difficulty, we introduce an information-theoretic quantity – Posterior Target Label Uncertainty (PTLU) – along with its empirical estimate (EPTLU) from the sample, which capture the uncertainty in the prediction for the target domain. Briefly, PTLU is the entropy of the predicted label in the target domain under the posterior distribution of the ground-truth classifier given the observed source and target samples. By proving that such a quantity serves to lower-bound the risk of any learner, we suggest that these quantities can be used as proxies for evaluating the hardness of UDA learning. We provide several examples to demonstrate the advantage of PTLU, relative to the existing measures, in evaluating the difficulty of UDA learning.
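The verbal description of PTLU (entropy of the predicted target label under the posterior over classifiers) can be illustrated with a toy calculation. The sketch below is not the paper's definition or estimator: the threshold hypothesis class, the uniform prior, the noise-free labels, and the function names are all assumptions chosen to keep the example small.

```python
import math
from typing import List, Sequence, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) label."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def toy_label_uncertainty(
    thresholds: Sequence[float],
    source: Sequence[Tuple[float, int]],
    target_xs: Sequence[float],
) -> float:
    """Mean entropy of the posterior-predictive label over the target points.

    Hypotheses are classifiers f_t(x) = 1[x >= t]. With a uniform prior and
    noise-free source labels, the posterior is uniform over the thresholds
    that reproduce every source label.
    """
    consistent: List[float] = [
        t for t in thresholds if all((x >= t) == bool(y) for x, y in source)
    ]
    if not consistent:
        raise ValueError("no hypothesis is consistent with the source sample")
    entropies = []
    for x in target_xs:
        p_one = sum(x >= t for t in consistent) / len(consistent)
        entropies.append(binary_entropy(p_one))
    return sum(entropies) / len(entropies)


thresholds = [i / 10 for i in range(11)]
source = [(0.2, 0), (0.8, 1)]  # labeled source points (made-up)

covered = toy_label_uncertainty(thresholds, source, [0.1, 0.9])
shifted = toy_label_uncertainty(thresholds, source, [0.55])
```

When the target points lie where the source labels pin down the prediction, the uncertainty is zero; when they fall in the gap between the two source points, the surviving hypotheses split evenly and the uncertainty is one bit, which is the kind of hardness signal the abstract describes.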
[LG-61] From large-eddy simulations to deep learning: A U-net model for fast urban canopy flow predictions
Link: https://arxiv.org/abs/2507.06533
Authors: Themistoklis Vargiemezis, Catherine Gorlé
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:Accurate prediction of wind flow fields in urban canopies is crucial for ensuring pedestrian comfort, safety, and sustainable urban design. Traditional methods using wind tunnels and Computational Fluid Dynamics, such as Large-Eddy Simulations (LES), are limited by high costs, computational demands, and time requirements. This study presents a deep neural network (DNN) approach for fast and accurate predictions of urban wind flow fields, reducing computation time from an order of 10 hours on 32 CPUs for one LES evaluation to an order of 1 second on a single GPU using the DNN model. We employ a U-Net architecture trained on LES data including 252 synthetic urban configurations at seven wind directions (0° to 90° in 15° increments). The model predicts two key quantities of interest: mean velocity magnitude and streamwise turbulence intensity, at multiple heights within the urban canopy. The U-Net uses 2D building representations augmented with signed distance functions and their gradients as inputs, forming a 256×256×9 tensor. In addition, a Spatial Attention Module is used for feature transfer through skip connections. The loss function combines the root-mean-square error of predictions, their gradient magnitudes, and L2 regularization. Model evaluation on 50 test cases demonstrates high accuracy with an overall mean relative error of 9.3% for velocity magnitude and 5.2% for turbulence intensity. This research shows the potential of deep learning approaches to provide fast, accurate urban wind assessments essential for creating comfortable and safe urban environments. Code is available at this https URL
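The loss described in the abstract can be sketched as follows. This is one possible reading, not the authors' code: it assumes the gradient term is an RMSE between forward-difference gradient magnitudes of the predicted and reference fields, and the weights `grad_weight` and `l2_weight` are hypothetical values. Plain nested lists stand in for the 2D fields.

```python
import math
from typing import List, Sequence

# Seven wind directions, 0 to 90 degrees in 15-degree steps, as in the abstract.
WIND_DIRECTIONS_DEG = list(range(0, 91, 15))

Field = Sequence[Sequence[float]]


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def gradient_magnitude(field: Field) -> List[float]:
    """Forward-difference gradient magnitude on all cells that have a right and lower neighbor."""
    rows, cols = len(field), len(field[0])
    out = []
    for i in range(rows - 1):
        for j in range(cols - 1):
            dy = field[i + 1][j] - field[i][j]
            dx = field[i][j + 1] - field[i][j]
            out.append(math.hypot(dx, dy))
    return out


def combined_loss(
    pred: Field,
    target: Field,
    weights: Sequence[float],
    grad_weight: float = 0.1,   # hypothetical weighting
    l2_weight: float = 1e-4,    # hypothetical weighting
) -> float:
    """RMSE of the field + weighted RMSE of gradient magnitudes + L2 penalty on the weights."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    data_term = rmse(flat_pred, flat_target)
    grad_term = rmse(gradient_magnitude(pred), gradient_magnitude(target))
    l2_term = sum(w * w for w in weights)
    return data_term + grad_weight * grad_term + l2_weight * l2_term


target = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
shifted = [[v + 1.0 for v in row] for row in target]
loss = combined_loss(shifted, target, weights=[1.0, 2.0])
```

In the usage above, a constant offset leaves the gradient term at zero, so only the data term and the L2 penalty contribute; a gradient term of this kind penalizes errors in the shape of the flow field rather than in its level.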
[LG-62] Neural Actor-Critic Methods for Hamilton-Jacobi-Bellman PDEs: Asymptotic Analysis and Numerical Studies
Link: https://arxiv.org/abs/2507.06428
Authors: Samuel N. Cohen, Jackson Hebner, Deqing Jiang, Justin Sirignano
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 41 pages
Abstract:We mathematically analyze and numerically study an actor-critic machine learning algorithm for solving high-dimensional Hamilton-Jacobi-Bellman (HJB) partial differential equations from stochastic control theory. The architecture of the critic (the estimator for the value function) is structured so that the boundary condition is always perfectly satisfied (rather than being included in the training loss) and utilizes a biased gradient which reduces computational cost. The actor (the estimator for the optimal control) is trained by minimizing the integral of the Hamiltonian over the domain, where the Hamiltonian is estimated using the critic. We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation (ODE) as the number of hidden units in the actor and critic tends to infinity. Further, under a convexity-like assumption on the Hamiltonian, we prove that any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides an important guarantee for the algorithm’s performance in light of the fact that finite-width neural networks may only converge to local minimizers (and not optimal solutions) due to the non-convexity of their loss functions. In our numerical studies, we demonstrate that the algorithm can solve stochastic control problems accurately in up to 200 dimensions. In particular, we construct a series of increasingly complex stochastic control problems with known analytic solutions and study the algorithm’s numerical performance on them. These problems range from a linear-quadratic regulator equation to highly challenging equations with non-convex Hamiltonians, allowing us to identify and analyze the strengths and limitations of this neural actor-critic method for solving HJB equations.
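The abstract does not say how the critic's architecture enforces the boundary condition. One common construction, shown here purely as an assumption, writes the critic as the boundary data plus a factor that vanishes on the boundary times an unconstrained network. The domain (the unit ball), the boundary function `boundary_value`, and the tiny fixed-weight network are all hypothetical.

```python
import math
from typing import Sequence


def boundary_value(x: Sequence[float]) -> float:
    """Hypothetical boundary data g(x) on the unit sphere."""
    return x[0]


def raw_network(
    x: Sequence[float],
    hidden_weights: Sequence[Sequence[float]],
    output_weights: Sequence[float],
) -> float:
    """Unconstrained one-hidden-layer tanh network (weights fixed for the demo)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    return sum(c * h for c, h in zip(output_weights, hidden))


def critic(
    x: Sequence[float],
    hidden_weights: Sequence[Sequence[float]],
    output_weights: Sequence[float],
) -> float:
    """Value estimate g(x) + (1 - |x|^2) * N(x), which equals g(x) wherever |x| = 1."""
    vanishing_factor = 1.0 - sum(xi * xi for xi in x)
    return boundary_value(x) + vanishing_factor * raw_network(x, hidden_weights, output_weights)


hidden_weights = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.9]]
output_weights = [1.5, -0.4, 0.8]

on_boundary = critic([1.0, 0.0], hidden_weights, output_weights)
inside = critic([0.2, 0.1], hidden_weights, output_weights)
```

Because the factor multiplying the network is zero on the boundary, the boundary condition holds for any choice of weights, so it never needs to appear as a penalty in the training loss.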
[LG-63] Deep learning-based species-area models reveal multi-scale patterns of species richness and turnover
Link: https://arxiv.org/abs/2507.06358
Authors: Victor Boussange, Philipp Brun, Johanna T. Malle, Gabriele Midolo, Jeanne Portier, Théophile Sanchez, Niklaus E. Zimmermann, Irena Axmanová, Helge Bruelheide, Milan Chytrý, Stephan Kambach, Zdeňka Lososová, Martin Večeřa, Idoia Biurrun, Klaus T. Ecker, Jonathan Lenoir, Jens-Christian Svenning, Dirk Nikolaus Karger
Categories: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
Comments: 31 pages
Abstract:The number of species within ecosystems is influenced not only by their intrinsic characteristics but also by the spatial scale considered. As the sampled area expands, species richness increases, a phenomenon described by the species-area relationship (SAR). The accumulation dynamics of the SAR result from a complex interplay of biotic and abiotic processes operating at various spatial scales. However, the challenge of collecting exhaustive biodiversity records across spatial scales has hindered a comprehensive understanding of these dynamics. Here, we develop a deep learning approach that leverages sampling theory and small-scale ecological surveys to spatially resolve the scale-dependency of species richness. We demonstrate its performance by predicting the species richness of vascular plant communities across Europe, and evaluate the predictions against an independent dataset of plant community inventories. Our model improves species richness estimates by 32% and delivers spatially explicit patterns of species richness and turnover for sampling areas ranging from square meters to hundreds of square kilometers. Explainable AI techniques further disentangle how drivers of species richness operate across spatial scales. The ability of our model to represent the multi-scale nature of biodiversity is essential to deliver robust biodiversity assessments and forecasts under global change.
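For readers unfamiliar with the species-area relationship, the classical power-law form S = c·A^z gives a feel for how richness scales with area. The sketch below fits that form by least squares in log-log space. It is a textbook baseline, not the deep learning model of the paper, and the data are synthetic values generated for the example.

```python
import math
from typing import Sequence, Tuple


def fit_power_law_sar(areas: Sequence[float], richness: Sequence[float]) -> Tuple[float, float]:
    """Fit S = c * A**z by ordinary least squares on log(S) versus log(A)."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(s) for s in richness]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope  # (c, z)


# Synthetic example: richness generated exactly from c = 20, z = 0.25.
areas = [1.0, 10.0, 100.0, 1000.0]
richness = [20.0 * a ** 0.25 for a in areas]
c_hat, z_hat = fit_power_law_sar(areas, richness)
```

A single exponent z assumes the same scaling everywhere, which is the limitation a spatially explicit, multi-scale model is meant to overcome.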
[LG-64] Trainability of Quantum Models Beyond Known Classical Simulability
Link: https://arxiv.org/abs/2507.06344
Authors: Sabri Meyer, Francesco Scala, Francesco Tacchino, Aurelien Lucchi
Categories: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 52 pages of supplementary material
Abstract:Variational Quantum Algorithms (VQAs) are promising candidates for near-term quantum computing, yet they face scalability challenges due to barren plateaus, where gradients vanish exponentially in the system size. Recent conjectures suggest that avoiding barren plateaus might inherently lead to classical simulability, thus limiting the opportunities for quantum advantage. In this work, we advance the theoretical understanding of the relationship between the trainability and computational complexity of VQAs, thus directly addressing the conjecture. We introduce the Linear Clifford Encoder (LCE), a novel technique that ensures constant-scaling gradient statistics on optimization landscape regions that are close to Clifford circuits. Additionally, we leverage classical Taylor surrogates to reveal computational complexity phase transitions from polynomial to super-polynomial as the initialization region size increases. Combining these results, we reveal a deeper link between trainability and computational complexity, and analytically prove that barren plateaus can be avoided in regions for which no classical surrogate is known to exist. Furthermore, numerical experiments on LCE transformed landscapes confirm in practice the existence of a super-polynomially complex “transition zone” where gradients decay polynomially. These findings indicate a plausible path to practically relevant, barren plateau-free variational models with potential for quantum advantage.
[LG-65] Self-supervised learning predicts plant growth trajectories from multi-modal industrial greenhouse data
Link: https://arxiv.org/abs/2507.06336
Authors: Adam J Riesselman, Evan M Cofer, Therese LaRue, Wim Meeussen
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Quantifying organism-level phenotypes, such as growth dynamics and biomass accumulation, is fundamental to understanding agronomic traits and optimizing crop production. However, quality growing data of plants at scale is difficult to generate. Here we use a mobile robotic platform to capture high-resolution environmental sensing and phenotyping measurements of a large-scale hydroponic leafy greens system. We describe a self-supervised modeling approach to build a map from observed growing data to the entire plant growth trajectory. We demonstrate our approach by forecasting future plant height and harvest mass of crops in this system. This approach represents a significant advance in combining robotic automation and machine learning, as well as providing actionable insights for agronomic research and operational efficiency.
[LG-66] A Machine Learning Framework for Breast Cancer Treatment Classification Using a Novel Dataset
Link: https://arxiv.org/abs/2507.06243
Authors: Md Nahid Hasan, Md Monzur Murshed, Md Mahadi Hasan, Faysal A. Chowdhury
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, 3 tables. This paper has been submitted to Scientific Reports and has been under review for five months
Abstract:Breast cancer (BC) remains a significant global health challenge, with personalized treatment selection complicated by the disease’s molecular and clinical heterogeneity. BC treatment decisions rely on various patient-specific clinical factors, and machine learning (ML) offers a powerful approach to predicting treatment outcomes. This study utilizes The Cancer Genome Atlas (TCGA) breast cancer clinical dataset to develop ML models for predicting the likelihood of undergoing chemotherapy or hormonal therapy. The models are trained using five-fold cross-validation and evaluated through performance metrics, including accuracy, precision, recall, specificity, sensitivity, F1-score, and area under the receiver operating characteristic curve (AUROC). Model uncertainty is assessed using bootstrap techniques, while SHAP values enhance interpretability by identifying key predictors. Among the tested models, the Gradient Boosting Machine (GBM) achieves the highest stable performance (accuracy = 0.7718, AUROC = 0.8252), followed by Extreme Gradient Boosting (XGBoost) (accuracy = 0.7557, AUROC = 0.8044) and Adaptive Boosting (AdaBoost) (accuracy = 0.7552, AUROC = 0.8016). These findings underscore the potential of ML in supporting personalized breast cancer treatment decisions through data-driven insights.
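The evaluation metrics named in the abstract all derive from the four cells of a binary confusion matrix. The sketch below computes them for a pair of label lists; the function name and the toy labels are made up for illustration and have no connection to the TCGA data. Note that sensitivity and recall are the same quantity.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall (sensitivity), specificity and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,  # identical to sensitivity
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels: 1 = received the therapy, 0 = did not (made-up values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

AUROC is omitted here because it needs predicted scores rather than hard labels.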
Information Retrieval
[IR-0] Boosting Parameter Efficiency in LLM-Based Recommendation through Sophisticated Pruning
Link: https://arxiv.org/abs/2507.07064
Authors: Shanle Zheng, Keqin Bao, Jizhi Zhang, Yang Zhang, Fuli Feng, Xiangnan He
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:LLM-based recommender systems have made significant progress; however, the deployment cost associated with the large parameter volume of LLMs still hinders their real-world applications. This work explores parameter pruning to improve parameter efficiency while maintaining recommendation quality, thereby enabling easier deployment. Unlike existing approaches that focus primarily on inter-layer redundancy, we uncover intra-layer redundancy within components such as self-attention and MLP modules. Building on this analysis, we propose a more fine-grained pruning approach that integrates both intra-layer and layer-wise pruning. Specifically, we introduce a three-stage pruning strategy that progressively prunes parameters at different levels and parts of the model, moving from intra-layer to layer-wise pruning, or from width to depth. Each stage also includes a performance restoration step using distillation techniques, helping to strike a balance between performance and parameter efficiency. Empirical results demonstrate the effectiveness of our approach: across three datasets, our models achieve an average of 88% of the original model’s performance while pruning more than 95% of the non-embedding parameters. This underscores the potential of our method to significantly reduce resource requirements without greatly compromising recommendation quality. Our code will be available at: this https URL
[IR-1] CDC: Causal Domain Clustering for Multi-Domain Recommendation SIGIR2025
Link: https://arxiv.org/abs/2507.06877
Authors: Huishi Luo, Yiqing Wu, Yiwen Chen, Fuzhen Zhuang, Deqing Wang
Categories: Information Retrieval (cs.IR)
Comments: Accepted at SIGIR 2025
Abstract:Multi-domain recommendation leverages domain-general knowledge to improve recommendations across several domains. However, as platforms expand to dozens or hundreds of scenarios, training all domains in a unified model leads to performance degradation due to significant inter-domain differences. Existing domain grouping methods, based on business logic or data similarities, often fail to capture the true transfer relationships required for optimal grouping. To effectively cluster domains, we propose Causal Domain Clustering (CDC). CDC models domain transfer patterns within a large number of domains using two distinct effects: the Isolated Domain Affinity Matrix for modeling non-interactive domain transfers, and the Hybrid Domain Affinity Matrix for considering dynamic domain synergy or interference under joint training. To integrate these two transfer effects, we introduce causal discovery to calculate a cohesion-based coefficient that adaptively balances their contributions. A Co-Optimized Dynamic Clustering algorithm iteratively optimizes target domain clustering and source domain selection for training. CDC significantly enhances performance across over 50 domains on public datasets and in industrial settings, achieving a 4.9% increase in online eCPM. Code is available at this https URL
[IR-2] Impacts of Mainstream-Driven Algorithms on Recommendations for Children Across Domains: A Reproducibility Study RECSYS2025
Link: https://arxiv.org/abs/2507.06596
Authors: Robin Ungruh, Alejandro Bellogín, Dominik Kowald, Maria Soledad Pera
Categories: Information Retrieval (cs.IR)
Comments: Preprint of accepted RecSys 2025 contribution
Abstract:Children are often exposed to items curated by recommendation algorithms. Yet, research seldom considers children as a user group, and when it does, it is anchored on datasets where children are underrepresented, risking overlooking their interests, favoring those of the majority, i.e., mainstream users. Recently, Ungruh et al. demonstrated that children’s consumption patterns and preferences differ from those of mainstream users, resulting in inconsistent recommendation algorithm performance and behavior for this user group. These findings, however, are based on two datasets with a limited child user sample. We reproduce and replicate this study on a wider range of datasets in the movie, music, and book domains, uncovering interaction patterns and aspects of child-recommender interactions consistent across domains, as well as those specific to some user samples in the data. We also extend insights from the original study with popularity bias metrics, given the interpretation of results from the original study. With this reproduction and extension, we uncover consumption patterns and differences between age groups stemming from intrinsic differences between children and others, and those unique to specific datasets or domains.
[IR-3] SPEAR: Subset-sampled Performance Evaluation via Automated Ground Truth Generation for RAG
Link: https://arxiv.org/abs/2507.06554
Authors: Zou Yuheng, Wang Yiran, Tian Yuzhu, Zhu Min, Huang Yanhua
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) is a core approach for enhancing Large Language Models (LLMs), where the effectiveness of the retriever largely determines the overall response quality of RAG systems. Retrievers encompass a multitude of hyperparameters that significantly impact performance outcomes and demonstrate sensitivity to specific applications. Nevertheless, hyperparameter optimization entails prohibitively high computational expenses. Existing evaluation methods suffer from either prohibitive costs or disconnection from domain-specific scenarios. This paper proposes SEARA (Subset sampling Evaluation for Automatic Retriever Assessment), which addresses evaluation data challenges through subset sampling techniques and achieves robust automated retriever evaluation through minimal retrieval facts extraction and comprehensive retrieval metrics. Based on real user queries, this method enables fully automated retriever evaluation at low cost, thereby obtaining the optimal retriever for specific business scenarios. We validate our method across classic RAG applications in rednote, including a knowledge-based QA system and a retrieval-based travel assistant, successfully obtaining scenario-specific optimal retrievers.
[IR-4] USD: A User-Intent-Driven Sampling and Dual-Debiasing Framework for Large-Scale Homepage Recommendations
Link: https://arxiv.org/abs/2507.06503
Authors: Jiaqi Zheng, Cheng Guo, Yi Cao, Chaoqun Hou, Tong Liu, Bo Zheng
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Large-scale homepage recommendations face critical challenges from pseudo-negative samples caused by exposure bias, where non-clicks may indicate inattention rather than disinterest. Existing work lacks thorough analysis of invalid exposures and typically addresses isolated aspects (e.g., sampling strategies), overlooking the critical impact of pseudo-positive samples - such as homepage clicks merely to visit marketing portals. We propose a unified framework for large-scale homepage recommendation sampling and debiasing. Our framework consists of two key components: (1) a user intent-aware negative sampling module to filter invalid exposure samples, and (2) an intent-driven dual-debiasing module that jointly corrects exposure bias and click bias. Extensive online experiments on Taobao demonstrate the efficacy of our framework, achieving significant improvements in user click-through rates (UCTR) by 35.4% and 14.5% in two variants of the marketing block on the Taobao homepage, Baiyibutie and Taobaomiaosha.