This post contains the latest paper list retrieved from Arxiv.org on 2025-07-16. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: Paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each morning.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-07-16)

A total of 502 papers are updated today, including:

  • Natural Language Processing: 63 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 158 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 147 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air

[Quick Read]: This paper targets the limited communication bandwidth and the high computation and memory costs of running Large Language Models (LLMs) on edge devices, as well as the inefficient parameter transmission of existing Low-Rank Adaptation (LoRA) methods. The key to its solution is AirLLM, a hierarchical diffusion policy framework for communication-aware LoRA adaptation: the rank configuration is modeled as a structured action vector, and Proximal Policy Optimization (PPO) is combined with Denoising Diffusion Implicit Models (DDIM) to generate accurate, task- and channel-adaptive rank vectors, improving fine-tuning performance while significantly reducing transmission costs.

Link: https://arxiv.org/abs/2507.11515
Authors: Shiyi Yang, Xiaoxue Yu, Rongpeng Li, Jianhang Zhu, Zhifeng Zhao, Honggang Zhang
Affiliations: Zhejiang University; Zhejiang Lab
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 8 figures

Abstract:Operating Large Language Models (LLMs) on edge devices is increasingly challenged by limited communication bandwidth and strained computational and memory costs. Thus, cloud-assisted remote fine-tuning becomes indispensable. Nevertheless, existing Low-Rank Adaptation (LoRA) approaches typically employ fixed or heuristic rank configurations, and the subsequent over-the-air transmission of all LoRA parameters could be rather inefficient. To address this limitation, we develop AirLLM, a hierarchical diffusion policy framework for communication-aware LoRA adaptation. Specifically, AirLLM models the rank configuration as a structured action vector that spans all LoRA-inserted projections. To solve the underlying high-dimensional sequential decision-making problem, a Proximal Policy Optimization (PPO) agent generates coarse-grained decisions by jointly observing wireless states and linguistic complexity, which are then refined via Denoising Diffusion Implicit Models (DDIM) to produce high-resolution, task- and channel-adaptive rank vectors. The two modules are optimized alternatively, with the DDIM trained under the Classifier-Free Guidance (CFG) paradigm to maintain alignment with PPO rewards. Experiments under varying signal-to-noise ratios demonstrate that AirLLM consistently enhances fine-tuning performance while significantly reducing transmission costs, highlighting the effectiveness of reinforcement-driven, diffusion-refined rank adaptation for scalable and efficient remote fine-tuning over the air.

[NLP-1] Real-World Summarization: When Evaluation Reaches Its Limits

[Quick Read]: This paper studies how to evaluate the faithfulness of generative AI to its input data on the hotel-highlights task. The key to its approach is a human evaluation campaign combining categorical error assessment with span-level annotation, comparing traditional metrics, trainable methods, and LLM-as-a-judge evaluation. The results show that simple metrics such as word overlap remain effective on out-of-domain data, while LLMs prove unreliable as evaluators.

Link: https://arxiv.org/abs/2507.11508
Authors: Patrícia Schmidtová, Ondřej Dušek, Saad Mahamood
Affiliations: Charles University; trivago
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
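
The finding that a simple word-overlap metric tracks human judgments is easy to reproduce in code. The sketch below is a minimal, hypothetical illustration (not the paper's evaluation script, and with made-up example data): it scores each highlight by the fraction of its word types that appear in the source text, then correlates those scores with human ratings using scipy.

```python
import re
from scipy.stats import spearmanr

def word_overlap(summary: str, source: str) -> float:
    """Fraction of summary word types that also occur in the source text."""
    summ = set(re.findall(r"[a-z']+", summary.lower()))
    src = set(re.findall(r"[a-z']+", source.lower()))
    return len(summ & src) / len(summ) if summ else 0.0

# Hypothetical data: (highlight, source description, human faithfulness rating)
examples = [
    ("rooftop pool and free breakfast",
     "The hotel offers a rooftop pool, gym, and free breakfast.", 5),
    ("private beach access",
     "Located downtown, far from the coast, with a small indoor pool.", 1),
]

metric_scores = [word_overlap(s, src) for s, src, _ in examples]
human_scores = [h for _, _, h in examples]
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```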

[NLP-2] HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong

[Quick Read]: This paper addresses how to build a sovereign large language model (LLM) for a specific region (Hong Kong) that conforms to local culture, law, and values, meeting Hong Kong's unique multilingual environment, socio-legal framework, and cultural value requirements. The key to its solution is to build on the DeepSeek architecture and systematically align the model with regional norms through a multifaceted full-parameter fine-tuning process, integrating a retrieval-augmented generation (RAG) system to ensure timely and accurate information. The paper also proposes a proprietary Adversarial Hong Kong Value Benchmark to evaluate the model's alignment with local ethical and legal standards under challenging conditions, forming a comprehensive regional AI alignment and safety framework.

Link: https://arxiv.org/abs/2507.11502
Authors: Sirui Han, Junqi Zhu, Ruiyuan Zhang, Yike Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents the development of HKGAI-V1, a foundational sovereign large language model (LLM), developed as part of an initiative to establish value-aligned AI infrastructure specifically tailored for Hong Kong. Addressing the region’s unique multilingual environment (Cantonese, Mandarin, and English), its distinct socio-legal context under the “one country, two systems” framework, and specific local cultural and value considerations, the model is built upon the DeepSeek architecture and systematically aligned with regional norms through a multifaceted full parameter fine-tuning process. It is further integrated with a retrieval-augmented generation (RAG) system to ensure timely and factually grounded information access. The core contribution lies in the design and implementation of a comprehensive, region-specific AI alignment and safety framework, demonstrated through two key achievements: 1) The successful development of HKGAI-V1 itself - which outperforms general-purpose models in handling Hong Kong-specific culturally sensitive queries, and embodies a “governance-embedded” approach to digital sovereignty - empowers Hong Kong to exercise control over AI applications in critical sectors including public services, legal systems, and education. 2) The development of the proprietary Adversarial HK Value Benchmark, a rigorous tool for evaluating model alignment with local ethical and legal standards under challenging conditions. By documenting these achievements, the paper provides not only a technological artifact but also a replicable blueprint for developing advanced, regionally focused AI systems deeply rooted in their local identities.

[NLP-3] Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?

[Quick Read]: This paper addresses the concern that Large Language Models (LLMs) tend to favor a single reasoning strategy when facing diverse reasoning challenges, potentially limiting their effectiveness. It proposes that prompting can control the reasoning strategies of LLMs and assesses the impact on logical problem-solving. The key to the solution is developing methods that guide LLMs to adaptively choose the optimal reasoning strategy across tasks, improving overall model performance.

Link: https://arxiv.org/abs/2507.11423
Authors: Yanjian Zhang, Guillaume Wisniewski, Nadi Tomeh, Thierry Charnois
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their effectiveness in diverse reasoning challenges. In this work, we investigate whether prompting can control LLMs’ reasoning strategies and assess its impact on logical problem-solving. While our experiments show that no single strategy consistently improves accuracy, performance could be enhanced if models could adaptively choose the optimal strategy. We propose methods to guide LLMs in strategy selection, highlighting new ways to refine their reasoning abilities.

[NLP-4] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

[Quick Read]: This paper addresses the inconsistency of prior encoder-versus-decoder comparisons in the LLM community: previous studies differ in parameter counts, training techniques, and datasets, preventing a fair comparison of the two architectures. The key to its solution is Ettin, a suite of paired encoder-only and decoder-only models ranging from 17 million to 1 billion parameters trained with the same recipe, achieving state-of-the-art performance at their respective sizes. This enables a more accurate assessment of how the two architectures perform across tasks.

Link: https://arxiv.org/abs/2507.11412
Authors: Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme
Affiliations: Johns Hopkins University; LightOn
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

[NLP-5] KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?

[Quick Read]: This paper tackles the open question of how chain-of-thought (CoT) prompting improves the performance of Large Language Models (LLMs) on reasoning tasks. The key to its solution is Causal CoT Graphs (CCGs), directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in language model output. By compiling KisMATH, a dataset of 1,671 mathematical reasoning problems with their CCGs, and empirically analyzing 15 open-weight LLMs, the study shows that reasoning nodes in a CCG mediate the final answer and that LLMs internally realise structures akin to these graphs, offering a new lens on the role of chain-of-thought in LLM reasoning.

Link: https://arxiv.org/abs/2507.11408
Authors: Soumadeep Saha, Akshay Chaturvedi, Saptarshi Saha, Utpal Garain, Nicholas Asher
Affiliations: ISI Kolkata; IRIT Toulouse
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures

Abstract:Chain-of-thought traces have been shown to improve performance of large language models in a plethora of reasoning tasks, yet there is no consensus on the mechanism through which this performance boost is achieved. To shed more light on this, we introduce Causal CoT Graphs (CCGs), which are directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in the language model output. A collection of 1671 mathematical reasoning problems from MATH500, GSM8K and AIME, and their associated CCGs are compiled into our dataset, KisMATH. Our detailed empirical analysis with 15 open-weight LLMs shows that (i) reasoning nodes in the CCG are mediators for the final answer, a condition necessary for reasoning; and (ii) LLMs emphasise reasoning paths given by the CCG, indicating that models internally realise structures akin to our graphs. KisMATH enables controlled, graph-aligned interventions and opens up avenues for further investigation into the role of chain-of-thought in LLM reasoning.
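
A CCG is a directed acyclic graph over spans of the reasoning trace, so the mediation property described in the abstract can be checked with standard graph tooling. The following sketch (our illustration, not the KisMATH code; node names are hypothetical) builds a small CCG with networkx and tests whether every path from question to answer passes through a reasoning node by deleting the reasoning nodes and checking reachability.

```python
import networkx as nx

# Hypothetical causal CoT graph: question -> reasoning steps -> answer
ccg = nx.DiGraph()
ccg.add_edges_from([
    ("question", "step_1"),
    ("question", "step_2"),
    ("step_1", "step_3"),
    ("step_2", "step_3"),
    ("step_3", "answer"),
])
assert nx.is_directed_acyclic_graph(ccg)

def reasoning_mediates(graph, source, target, reasoning_nodes):
    """True if removing the reasoning nodes disconnects source from target,
    i.e. every causal path to the answer is mediated by reasoning."""
    pruned = graph.copy()
    pruned.remove_nodes_from(reasoning_nodes)
    return not nx.has_path(pruned, source, target)

print(reasoning_mediates(ccg, "question", "answer",
                         ["step_1", "step_2", "step_3"]))  # True
```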

[NLP-6] EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

[Quick Read]: This paper addresses the trade-off between usability and reasoning ability in conventional AI models, and aims to advance agentic AI. The key to the solution is EXAONE 4.0's dual-mode design, which integrates a Non-reasoning mode and a Reasoning mode to combine the excellent usability of EXAONE 3.5 with the advanced reasoning of EXAONE Deep, while adding agentic tool use to support more complex tasks.

Link: https://arxiv.org/abs/2507.11407
Authors: LG AI Research: Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
Affiliations: LG AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report, 30 Pages

Abstract:This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. The EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via this https URL.

[NLP-7] DCR: Quantifying Data Contamination in LLMs Evaluation

[Quick Read]: This paper addresses benchmark data contamination (BDC) in Large Language Models (LLMs), where models inadvertently memorize evaluation data, inflating performance metrics and distorting assessments of genuine generalization. The key to its solution is the lightweight, interpretable Data Contamination Risk (DCR) framework, which detects and quantifies contamination at four granular levels (semantic, informational, data, and label) and synthesizes the contamination scores via a fuzzy inference system into a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance.

Link: https://arxiv.org/abs/2507.11405
Authors: Cheng Xu, Nan Yan, Shuhao Guan, Changhong Jin, Yuke Mei, Yibing Guo, M-Tahar Kechadi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted using the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
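
The paper's fuzzy inference system is not spelled out in the abstract, but the idea of collapsing four level-wise contamination scores into a single factor that rescales raw accuracy can be sketched as follows. This is a simplified stand-in (a weighted average instead of actual fuzzy rules, with made-up weights and scores):

```python
def dcr_factor(semantic: float, informational: float, data: float, label: float) -> float:
    """Aggregate four contamination scores (each in [0, 1]) into one
    DCR factor; higher contamination yields a smaller factor."""
    # Stand-in for the fuzzy inference system: a fixed weighted average.
    weights = {"semantic": 0.2, "informational": 0.2, "data": 0.3, "label": 0.3}
    contamination = (weights["semantic"] * semantic
                     + weights["informational"] * informational
                     + weights["data"] * data
                     + weights["label"] * label)
    return 1.0 - contamination

raw_accuracy = 0.90
factor = dcr_factor(semantic=0.4, informational=0.2, data=0.1, label=0.05)
adjusted_accuracy = raw_accuracy * factor  # contamination-aware performance
print(f"DCR factor = {factor:.3f}, adjusted accuracy = {adjusted_accuracy:.3f}")
```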

[NLP-8] Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss SEMEVAL2025

[Quick Read]: This paper addresses data imbalance in multi-label emotion detection, in particular the poor recognition of minority emotion classes in SemEval-2025 Shared Task 11. The key to its solution is a simple weighted loss function that dynamically adjusts class weights to improve performance on minority emotion classes, without the computational burden of traditional resampling methods.

Link: https://arxiv.org/abs/2507.11384
Authors: Xia Cui
Affiliations: Manchester Metropolitan University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 1 figure, SemEval 2025

Abstract:This paper explores the application of a simple weighted loss function to Transformer-based models for multi-label emotion detection in SemEval-2025 Shared Task 11. Our approach addresses data imbalance by dynamically adjusting class weights, thereby enhancing performance on minority emotion classes without the computational burden of traditional resampling methods. We evaluate BERT, RoBERTa, and BART on the BRIGHTER dataset, using evaluation metrics such as Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity coefficients. The results demonstrate that the weighted loss function improves performance on high-frequency emotion classes but shows limited impact on minority classes. These findings underscore both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection.
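
One common way to realize a weighted loss for multi-label classification of this kind is PyTorch's BCEWithLogitsLoss with per-class pos_weight derived from label frequencies. The sketch below is our illustration of that general recipe (the paper's exact weighting scheme may differ):

```python
import torch
import torch.nn as nn

# Hypothetical multi-label targets: (num_samples, num_emotions)
labels = torch.tensor([[1., 0., 0.], [1., 1., 0.], [1., 0., 0.], [0., 0., 1.]])

# Per-class positive weight: rarer emotions get larger weights.
pos_counts = labels.sum(dim=0)                       # positives per class
neg_counts = labels.shape[0] - pos_counts            # negatives per class
pos_weight = neg_counts / pos_counts.clamp(min=1.0)  # e.g. tensor([0.33, 3., 3.])

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(4, 3)          # stand-in for transformer outputs
loss = criterion(logits, labels)    # minority classes contribute more
print(loss.item())
```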

[NLP-9] What is the Best Process Model Representation? A Comparative Analysis for Process Modeling with Large Language Models

[Quick Read]: This paper addresses the lack of a systematic comparison of Process Model Representations (PMRs) for LLM-based Process Modeling (PMo). Existing PMRs differ substantially in structure, complexity, and usability, and Process Model Generation (PMG) approaches use different evaluation strategies and generation techniques, making comparison difficult. The key to the solution is the PMo Dataset, a new dataset of 55 process descriptions paired with models in nine PMRs, together with a comprehensive evaluation of PMRs along two dimensions, suitability for LLM-based PMo and PMG performance, providing a benchmark and reference for future work.

Link: https://arxiv.org/abs/2507.11356
Authors: Alexis Brissard, Frédéric Cuppens, Amal Zouaq
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures, to be published in AI4BPM 2025 Proceedings

Abstract:Large Language Models (LLMs) are increasingly applied for Process Modeling (PMo) tasks such as Process Model Generation (PMG). To support these tasks, researchers have introduced a variety of Process Model Representations (PMRs) that serve as model abstractions or generation targets. However, these PMRs differ widely in structure, complexity, and usability, and have never been systematically compared. Moreover, recent PMG approaches rely on distinct evaluation strategies and generation techniques, making comparison difficult. This paper presents the first empirical study that evaluates multiple PMRs in the context of PMo with LLMs. We introduce the PMo Dataset, a new dataset containing 55 process descriptions paired with models in nine different PMRs. We evaluate PMRs along two dimensions: suitability for LLM-based PMo and performance on PMG. Mermaid achieves the highest overall score across six PMo criteria, whereas BPMN text delivers the best PMG results in terms of process element similarity.

[NLP-10] Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge

[Quick Read]: This paper addresses the limitations of novelty assessment for academic papers: traditional approaches rely on expert judgment or unique citation combinations, but experts have limited knowledge and the effectiveness of the combination method is uncertain. The key to its solution is combining the knowledge of a large language model (LLM) with the judgment abilities of human experts: sentences related to novelty are extracted from peer review reports, the LLM summarizes the paper's methodology section, and both are used to fine-tune pretrained language models (PLMs). A text-guided fusion module with novel Sparse-Attention is designed to better integrate human and LLM knowledge.

Link: https://arxiv.org/abs/2507.11330
Authors: Wenqing Wu, Chengzhi Zhang, Yi Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
Comments: Journal of the Association for Information Science and Technology, 2025

Abstract:Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it’s judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it’s unclear if unique citations truly measure novelty. The large language model (LLM) possesses a wealth of knowledge, while human experts possess judgment abilities that the LLM does not possess. Therefore, our research integrates the knowledge and abilities of LLM and human experts to address the limitations of novelty assessment. The most common novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLM to assist pretrained language models (PLMs, e.g. BERT etc.) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use LLM to summarize the methodology section of the academic paper, which are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared the method we proposed with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.

[NLP-11] Internal Value Alignment in Large Language Models through Controlled Value Vector Activation ACL2025

[Quick Read]: This paper addresses aligning Large Language Models (LLMs) with human values, aiming to improve transparency, interpretability, and adaptability to evolving scenarios. The key to its solution is the Controlled Value Vector Activation (ConVA) method, which interprets how a value is encoded in the model's latent representations and modifies the relevant activations to ensure consistent internal values. To achieve accurate and unbiased value vector identification, it introduces a context-controlled identification method; to control values stably without sacrificing model performance, it introduces a gated value vector activation mechanism.

Link: https://arxiv.org/abs/2507.11316
Authors: Haoran Jin, Meng Li, Xiting Wang, Zhihao Xu, Minlie Huang, Yantao Jia, Defu Lian
Affiliations: University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China; Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Tsinghua University; Huawei Technologies Co. Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 14 figures. Accepted by ACL 2025 (main conference)

Abstract:Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifies relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective and minimum degree of value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at this https URL.
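
At a high level, activation steering of this kind adds a (gated) value direction to a chosen layer's hidden states at inference time. The sketch below is a generic illustration of that mechanism on raw tensors, with an arbitrary sigmoid gate; it is not the ConVA implementation, whose vector-identification and gating details are in the paper.

```python
import torch

def gated_value_activation(hidden: torch.Tensor, value_vec: torch.Tensor,
                           strength: float = 4.0) -> torch.Tensor:
    """Shift hidden states (batch, seq, dim) along a unit value direction.
    The gate pushes harder when the current alignment is weak, so inputs
    already aligned with the value are left mostly untouched."""
    v = value_vec / value_vec.norm()
    projection = hidden @ v                 # (batch, seq): current alignment
    gate = torch.sigmoid(-projection)       # weaker alignment -> stronger push
    return hidden + strength * gate.unsqueeze(-1) * v

hidden = torch.randn(2, 8, 64)   # stand-in for a transformer layer's output
value_vec = torch.randn(64)      # a previously identified value direction
steered = gated_value_activation(hidden, value_vec)
print((steered @ (value_vec / value_vec.norm())).mean())  # projection increases
```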

[NLP-12] Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

[Quick Read]: This paper addresses the problem that, in text-based telemedicine, the communication quality of doctors' medical advice is often lacking, and presentation frequently shapes how advice is judged more than clinical accuracy does. The key to the solution is a multi-agent large language model (LLM) system that evaluates and improves the presentation quality of Romanian-speaking doctors' written responses along 17 interpretable axes, rather than judging medical correctness. The system consists of three LLM agents whose prompts are automatically optimized with DSPy; it is built with low-resource Romanian data, deployed with open-weight models, and delivers real-time targeted feedback within a telemedicine platform.

Link: https://arxiv.org/abs/2507.11299
Authors: Andrei Niculae, Adrian Cosma, Cosmin Dumitrache, Emilian Rǎdoi
Affiliations: National University of Science and Technology POLITEHNICA Bucharest; Dalle Molle Institute for Artificial Intelligence Research (IDSIA); MedicChat
Categories: Computation and Language (cs.CL)
Comments: 10 figures, 2 tables, 2 listings

Abstract:Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated rather than its clinical accuracy. To address this, we introduce Dr.Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr.Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.

[NLP-13] Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources Coded Term Lexicon and Enhanced Detection Frameworks

[Quick Read]: This paper addresses two major problems in Chinese hate speech detection: first, the scarcity of fine-grained span-level annotated data limits models' deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate terms limits model explainability in complex real-world scenarios. The key to its solution is constructing STATE ToxiCN, the first span-level Chinese hate speech dataset, conducting the first systematic study of how well large language models interpret Chinese coded hate terms, and proposing a method to integrate an annotated lexicon into models, which significantly improves hate speech detection performance.

Link: https://arxiv.org/abs/2507.11292
Authors: Zewen Bai, Liang Yang, Shengdi Yin, Yuanyuan Sun, Hongfei Lin
Affiliations: Dalian University of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The proliferation of hate speech has inflicted significant societal harm, with its intensity and directionality closely tied to specific targets and arguments. In recent years, numerous machine learning-based methods have been developed to detect hateful comments on online platforms automatically. However, research on Chinese hate speech detection lags behind, and interpretability studies face two major challenges: first, the scarcity of span-level fine-grained annotated datasets limits models’ deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate speech restricts model explainability in complex real-world scenarios. To address these, we make the following contributions: (1) We introduce the Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), the first span-level Chinese hate speech dataset, and evaluate the hate semantic understanding of existing models using it. (2) We conduct the first comprehensive study on Chinese coded hate terms and LLMs’ ability to interpret hate semantics. (3) We propose a method to integrate an annotated lexicon into models, significantly enhancing hate speech detection performance. Our work provides valuable resources and insights to advance the interpretability of Chinese hate speech detection research.
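
The abstract does not detail how the annotated lexicon is injected into the model, but a simple, commonly used variant is to tag matched lexicon entries directly in the input text before tokenization, so the encoder sees their glosses. A hypothetical sketch (made-up entries and marker format):

```python
# Hypothetical coded-term lexicon: surface form -> plain-language gloss
LEXICON = {
    "term_a": "slur targeting group X",
    "term_b": "coded insult targeting group Y",
}

def annotate_with_lexicon(text: str) -> str:
    """Append glosses for any lexicon terms found, so the classifier's
    encoder receives the decoded meaning alongside the raw comment."""
    hits = [f"{term} = {gloss}" for term, gloss in LEXICON.items() if term in text]
    if not hits:
        return text
    return text + " [LEX] " + " ; ".join(hits)

comment = "some sentence containing term_a"
print(annotate_with_lexicon(comment))
# -> "some sentence containing term_a [LEX] term_a = slur targeting group X"
```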

[NLP-14] FMC: Formalization of Natural Language Mathematical Competition Problems ICML2025

[Quick Read]: This paper addresses how to automatically and accurately translate natural language mathematical problems into formal language, advancing formal mathematical reasoning. The key to its solution is an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic, training-free formalization approach. Using this pipeline, the authors curate an Olympiad-level dataset aligning natural language problems with Lean formalizations, providing a challenging and practical benchmark for automated theorem provers.

Link: https://arxiv.org/abs/2507.11275
Authors: Jiaxuan Xie, Chengwu Liu, Ye Yuan, Siqi Li, Zhiping Xiao, Ming Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted in ICML 2025 AI4MATH Workshop

Abstract:Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. The dataset comprises 3,922 mathematical problems in natural language and 9,787 in Lean, of which 64.46% were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error feedback, and increasing sampling numbers enhance the autoformalization process. Experiments with three automated theorem provers on the FMC dataset also highlight its challenging nature and its value as a benchmark for formal reasoning tasks.

[NLP-15] KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding ACL2025

[Quick Read]: This paper addresses the excessive memory footprint of the Key-Value (KV) cache and the data-transfer bandwidth bottleneck during inference in Transformer-decoder-based large language models (LLMs). The key to its solution is a paradigm called KV-Latent, which down-samples the Key-Value vector dimensions into a latent space, significantly shrinking the KV cache and improving inference speed with only a small amount of extra training (less than 1% of pre-training). It also modifies the frequency-sampling mechanism of Rotary Positional Embedding to improve stability on lower-dimensional vectors, avoiding noise introduced by higher frequencies.

Link: https://arxiv.org/abs/2507.11273
Authors: Luohe Shi, Zuchao Li, Lefei Zhang, Guoming Liu, Baoyuan Qi, Hai Zhao
Affiliations: Wuhan University; Xiaomi; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: To be published in The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Abstract:Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, both in aspects of memory consumption and data transfer bandwidth limitations. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed, with only a small amount of extra training, less than 1% of pre-training. Besides, we enhanced the stability of Rotary Positional Embedding applied on lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, including both models with Grouped Query Attention and those without, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing Key and Value components on model’s performance. Our approach allows for the construction of more efficient language model systems, and opens new possibilities for KV Cache saving and efficient LLMs. Our code is available at this https URL.
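
The core memory saving is easy to see in isolation: if keys and values are projected into a lower-dimensional latent space before caching, the cache shrinks in proportion to the dimension ratio. The sketch below is a schematic single-head illustration of down-sampled KV caching (it omits RoPE and the paper's frequency-sampling change, and is not the KV-Latent code; all projection matrices are made up):

```python
import torch
import torch.nn.functional as F

d_model, d_latent, seq = 64, 16, 10   # cache shrinks by d_latent / d_model

# Learned down-projections (single head); queries share the keys' latent space.
W_qk = torch.randn(d_model, d_latent) * d_model ** -0.5
W_v = torch.randn(d_model, d_latent) * d_model ** -0.5
W_out = torch.randn(d_latent, d_model) * d_latent ** -0.5  # back to model width

x = torch.randn(1, seq, d_model)       # hidden states
k_cache = x @ W_qk                     # (1, seq, d_latent) -- cached
v_cache = x @ W_v                      # (1, seq, d_latent) -- cached

q = x[:, -1:, :] @ W_qk                # query for the newest token
scores = q @ k_cache.transpose(-1, -2) / d_latent ** 0.5
attn = F.softmax(scores, dim=-1)
out = (attn @ v_cache) @ W_out         # up-project attention output
print(out.shape)                       # torch.Size([1, 1, 64])
```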

[NLP-16] Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

[Quick Read]: This paper addresses the challenge of understanding multilingual mechanisms in large language models (LLMs), in particular identifying language-specific features within cross-lingual representations. Prior work mostly focuses on individual neurons, whose polysemantic nature makes it hard to isolate language-specific units. The key to the solution is SAE-LAPE, a method based on feature activation probability that uses sparse autoencoders (SAEs) to identify language-specific features in the feed-forward network. These features appear predominantly in the middle to final layers, are interpretable, influence the model's multilingual performance and language output, and can be used for language identification with performance comparable to fastText and greater interpretability.

Link: https://arxiv.org/abs/2507.11230
Authors: Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output and can be used for language identification with performance comparable to fastText along with more interpretability. Our code is available at this https URL .
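
SAE-LAPE scores features by how skewed their activation probabilities are across languages: a low entropy over the per-language activation probabilities indicates a language-specific feature. A small numpy sketch of that scoring, based on our reading of the abstract and with made-up activation data:

```python
import numpy as np

def lape_score(activations_by_lang: dict[str, np.ndarray]) -> float:
    """Entropy over per-language activation probabilities of one SAE feature.
    Low entropy -> the feature fires almost exclusively for one language."""
    probs = np.array([(acts > 0).mean() for acts in activations_by_lang.values()])
    probs = probs / probs.sum()                      # normalize over languages
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

rng = np.random.default_rng(0)
feature_acts = {                                     # hypothetical SAE activations
    "en": rng.random(1000) * (rng.random(1000) < 0.02),  # rarely fires
    "id": rng.random(1000) * (rng.random(1000) < 0.60),  # fires often
}
print(lape_score(feature_acts))  # low entropy => language-specific feature
```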

[NLP-17] An Agentic Flow for Finite State Machine Extraction using Prompt Chaining

[Quick Read]: This paper addresses the limitations of existing finite-state machine (FSM) extraction techniques in scalability, coverage, and the ambiguity of natural language specifications. The key to its solution is FlowFSM, an agentic framework that combines a large language model (LLM) with prompt chaining and chain-of-thought reasoning to accurately extract FSMs from raw RFC documents. The approach systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books, achieving high extraction precision while reducing hallucinated transitions.

Link: https://arxiv.org/abs/2507.11222
Authors: Fares Wael, Youssef Maklad, Ali Hamdi, Wael Elsersy
Affiliations: MSA University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Finite-State Machines (FSMs) are critical for modeling the operational logic of network protocols, enabling verification, analysis, and vulnerability discovery. However, existing FSM extraction techniques face limitations such as scalability, incomplete coverage, and ambiguity in natural language specifications. In this paper, we propose FlowFSM, a novel agentic framework that leverages Large Language Models (LLMs) combined with prompt chaining and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. FlowFSM systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs. Experimental evaluation across FTP and RTSP protocols demonstrates that FlowFSM achieves high extraction precision while minimizing hallucinated transitions, showing promising results. Our findings highlight the potential of agent-based LLM systems in the advancement of protocol analysis and FSM inference for cybersecurity and reverse engineering applications.

[NLP-18] EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

[Quick Read]: This paper addresses the lack of resources for evaluating social bias in Large Language Models (LLMs) outside English and the United States social context. The key to the solution is the Spanish and Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ), two parallel datasets based on the original BBQ that assess social bias across 10 categories in a multiple-choice QA setting, adapted to Spanish and Catalan and to the social context of Spain.

Link: https://arxiv.org/abs/2507.11216
Authors: Valle Ruiz-Fernández, Mario Mina, Júlia Falcão, Luis Vasquez-Reina, Anna Sallés, Aitor Gonzalez-Agirre, Olatz Perez-de-Viñaspre
Affiliations: Barcelona Supercomputing Center (BSC-CNS); HiTZ Center – IXA, University of the Basque Country
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.

[NLP-19] Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

[Quick Read]: This paper addresses the efficiency and accuracy of generative-AI-based coding and data annotation in large-scale qualitative research, in particular whether multi-agent systems (MAS) that emulate human coding workflows offer advantages over single-agent coding. The key to the solution is an open-source multi-agent system that mirrors deductive human coding through structured agent discussion and consensus arbitration, together with experiments analyzing how agent personas (neutral, assertive, or empathetic) and temperature affect consensus-building and coding accuracy for dialog segments.

Link: https://arxiv.org/abs/2507.11198
Authors: Conrad Borchers, Bahar Shahrokhian, Francesco Balzan, Elham Tajik, Sreecharan Sankaranarayanan, Sebastian Simon
Affiliations: Carnegie Mellon University; Arizona State University; University of Bologna; University at Albany; Amazon.com Inc.; University of Copenhagen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Manuscript submitted for review

Abstract:Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy of dialog segments based on a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Only one model (OpenHermesV2:7B) and code category showed above-chance gains from MAS deliberation when temperature was 0.5 or lower and especially when the agents included at least one assertive persona. Qualitative analysis of MAS collaboration for these configurations suggests that MAS may nonetheless aid in narrowing ambiguous code applications that could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.

[NLP-20] What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests ECML PKDD 2025

[Quick Read]: This paper addresses the problem that Large Language Models (LLMs) may memorize and reveal personal information, raising compliance challenges under the EU's GDPR, especially the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known, but do not address how to identify which individual-fact associations are stored in the model. The key to its solution is the WikiMem dataset together with a model-agnostic metric that quantifies human-fact associations in LLMs by ranking ground-truth values against counterfactuals using calibrated negative log-likelihood over paraphrased prompts, providing a foundation for identifying memorized personal data at the individual level.

Link: https://arxiv.org/abs/2507.11128
Authors: Dimitri Staufer
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 16 pages, 3 figures. Accepted at the 7th Workshop on eXplainable Knowledge Discovery in Data Mining (XKDD 2025), ECML PKDD 2025, Porto, Portugal

Abstract:Large Language Models (LLMs) can memorize and reveal personal information, raising concerns regarding compliance with the EU’s GDPR, particularly the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known but do not address how to identify which individual-fact associations are stored in the model. Privacy auditing techniques typically operate at the population level or target a small set of identifiers, limiting applicability to individual-level data inquiries. We introduce WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related properties from Wikidata, and a model-agnostic metric to quantify human-fact associations in LLMs. Our approach ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts. We evaluate 200 individuals across 15 LLMs (410M-70B parameters), showing that memorization correlates with subject web presence and model scale. We provide a foundation for identifying memorized personal data in LLMs at the individual level, enabling the dynamic construction of forget sets for machine unlearning and RTBF requests.
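
The metric ranks the ground-truth value against counterfactual values by negative log-likelihood under the model, averaged over paraphrased prompts. Below is a minimal Hugging Face sketch of that ranking for one individual-property pair; the prompts, values, and choice of gpt2 are illustrative, and the paper's calibration step is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def value_nll(prompt: str, value: str) -> float:
    """Mean negative log-likelihood of the value tokens given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + value, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # score only the value tokens
    with torch.no_grad():
        return model(full_ids, labels=labels).loss.item()

prompts = ["The birthplace of Jane Doe is",        # hypothetical paraphrases
           "Jane Doe was born in"]
candidates = ["Paris", "Berlin", "Madrid"]         # ground truth vs counterfactuals

scores = {v: sum(value_nll(p, v) for p in prompts) / len(prompts) for v in candidates}
ranked = sorted(scores, key=scores.get)            # lowest NLL first
print(ranked)  # ground truth ranked first suggests memorization
```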

[NLP-21] MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models

[Quick Read]: This paper addresses the challenges of multilingual multimodal reasoning, in particular achieving highly accurate cross-lingual understanding and answering in the ImageCLEF 2025 EXAMS V challenge. The key to its solution is an ensemble-based system combining several large models (Gemini 2.5 Flash, Gemini 1.5 Pro, and Gemini 2.5 Pro) coordinated through carefully engineered few-shot and zero-shot prompts. The study also highlights the importance of prompt engineering: concise, language-normalized formats significantly improve accuracy on the English validation set, and cross-lingual augmentation further improves multilingual performance.

Link: https://arxiv.org/abs/2507.11114
Authors: Seif Ahmed, Mohamed T. Younes, Abdelrahman Moustafa, Abdelrahman Allam, Hamza Moustafa
Affiliations: October University for Modern Sciences and Arts (MSA)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as a reasoner which handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi 4, Gemma 3, Mistral) on an English dataset and its multilingual augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy, and led 11 out of 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR-VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings.

[NLP-22] Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

[Quick Read]: This paper addresses the security of Large Language Models (LLMs) under data poisoning attacks, in particular the poorly understood mechanism by which multiple backdoor triggers coexist without interfering with one another. It presents a framework for studying LLM poisoning, showing that multiple distinct backdoor triggers can coexist within a single model, allowing adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, the study shows that poisoned triggers achieve robust activation even when tokens are substituted or separated by long spans, exposing a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, the authors propose a post hoc recovery method that selectively retrains specific model components based on layer-wise weight difference analysis, removing trigger behaviour with minimal parameter updates.

Link: https://arxiv.org/abs/2507.11112
Authors: Sanhanat Sivapiromrat, Caiqi Zhang, Marco Basaldella, Nigel Collier
Affiliations: University of Cambridge; Trismik
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a phrase and focus on the attack’s effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
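
The recovery step compares a (potentially poisoned) fine-tuned model against a clean reference checkpoint, ranks components by how much their weights moved, and retrains only the most-shifted ones. A schematic PyTorch sketch of that selection, as our simplification of the method described in the abstract:

```python
import torch

def rank_shifted_params(model, reference_state: dict, top_k: int = 5):
    """Rank parameters by relative weight change versus a clean reference."""
    shifts = {}
    for name, param in model.named_parameters():
        ref = reference_state[name]
        shifts[name] = (param.detach() - ref).norm().item() / (ref.norm().item() + 1e-8)
    return sorted(shifts, key=shifts.get, reverse=True)[:top_k]

def freeze_all_but(model, trainable_names):
    """Selectively retrain only the most-shifted components."""
    for name, param in model.named_parameters():
        param.requires_grad = name in trainable_names

# Usage sketch: suspicious = rank_shifted_params(poisoned_model, clean_state_dict)
#               freeze_all_but(poisoned_model, set(suspicious))
#               ...then fine-tune on clean data to remove the trigger behaviour.
```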

[NLP-23] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

[Quick Read]: This paper addresses a fundamental safety-alignment problem in diffusion-based large language models (dLLMs): existing alignment mechanisms fail to safeguard against context-aware, masked-input adversarial prompts, exposing new vulnerabilities. The key to its solution is the DIJA framework, which constructs adversarial interleaved mask-text prompts that exploit the bidirectional modeling and parallel decoding mechanisms of dLLMs, making it difficult for the model to detect and filter harmful content during generation and thereby jailbreaking it.

Link: https://arxiv.org/abs/2507.11097
Authors: Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 9 figures, work in progress

Abstract:Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at this https URL.

[NLP-24] Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification

[Quick Read]: This paper addresses the accurate identification and classification of foreign entities in cross-border financial activities, which is essential for risk management and regulatory compliance in the Spanish financial system. Traditional entity-matching algorithms such as Jaccard, cosine, and Levenshtein distances perform poorly under linguistic variation, special characters, outdated names, and changes of legal form, and cannot handle contextual and semantic relationships, leading to mismatches. The key to the solution is exploring Large Language Models (LLMs) as a flexible alternative: they can interpret context, handle abbreviations, and adapt to legal changes, substantially improving matching accuracy and reducing false positives.

Link: https://arxiv.org/abs/2507.11086
Authors: Andres Azqueta-Gavaldón, Joaquin Ramos Cosgrove
Affiliations: Banco de España
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The growing prevalence of cross-border financial activities in global markets has underscored the necessity of accurately identifying and classifying foreign entities. This practice is essential within the Spanish financial system for ensuring robust risk management, regulatory adherence, and the prevention of financial misconduct. This process involves a labor-intensive entity-matching task, where entities need to be validated against available reference sources. Challenges arise from linguistic variations, special characters, outdated names, and changes in legal forms, complicating traditional matching algorithms like Jaccard, cosine, and Levenshtein distances. These methods struggle with contextual nuances and semantic relationships, leading to mismatches. To address these limitations, we explore Large Language Models (LLMs) as a flexible alternative. LLMs leverage extensive training to interpret context, handle abbreviations, and adapt to legal transitions. We evaluate traditional methods, Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba’s Qwen 2.5) using a dataset of 65 Portuguese company cases. Results show traditional methods achieve accuracies over 92% but suffer high false positive rates (20-40%). Interface-based LLMs outperform, achieving accuracies above 93%, F1 scores exceeding 96%, and lower false positives (40-80%).
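
For reference, the traditional baseline the paper compares against can be written in a few lines: a normalized Levenshtein similarity with a match threshold. The sketch below is a generic illustration (the threshold and entity names are made up), showing why surface metrics stumble on legal-form changes:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

# A legal-form change defeats the surface metric despite identical entities.
print(similarity("Empresa Exemplo Lda", "Empresa Exemplo S.A."))  # below a 0.9 threshold
```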

[NLP-25] Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach ECAI2025

[Quick Read]: This paper addresses analyzing public sentiment from social media comments in a low-resource language (Bangla), in particular interpreting public opinion during and after the July Revolution in Bangladesh. The key to its solution is a hybrid transformer-based sentiment analysis framework that extracts features with BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, applies Principal Component Analysis (PCA) for dimensionality reduction and computational efficiency, and achieves a high accuracy of 83.7% with a voting classifier.

Link: https://arxiv.org/abs/2507.11084
Authors: Md. Sabbir Hossen, Md. Saiduzzaman, Pabon Shaha
Affiliations: Bangladesh University; Mawlana Bhashani Science & Technology University
Categories: Computation and Language (cs.CL)
Comments: This paper has been accepted and presented at the IEEE ECAI 2025. The final version will be available in the IEEE Xplore Digital Library

Abstract:The July Revolution in Bangladesh marked a significant student-led mass uprising, uniting people across the nation to demand justice, accountability, and systemic reform. Social media platforms played a pivotal role in amplifying public sentiment and shaping discourse during this historic mass uprising. In this study, we present a hybrid transformer-based sentiment analysis framework to decode public opinion expressed in social media comments during and after the revolution. We used a brand new dataset of 4,200 Bangla comments collected from social media. The framework employs advanced transformer-based feature extraction techniques, including BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, to capture nuanced patterns in textual data. Principal Component Analysis (PCA) was utilized for dimensionality reduction to enhance computational efficiency. We explored eleven traditional and advanced machine learning classifiers for identifying sentiments. The proposed hybrid XMB-BERT with the voting classifier achieved an exceptional accuracy of 83.7% and outperformed other model-classifier combinations. This study underscores the potential of machine learning techniques to analyze social sentiment in low-resource languages like Bangla.
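
The pipeline described above (transformer embeddings, then PCA, then voting over classical classifiers) maps directly onto scikit-learn. A minimal sketch with random stand-in features; the real system feeds BanglaBERT/XMB-BERT embeddings into eleven classifiers rather than the three used here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))      # stand-in for transformer [CLS] embeddings
y = rng.integers(0, 3, size=200)     # stand-in sentiment labels

clf = make_pipeline(
    PCA(n_components=50),            # dimensionality reduction for efficiency
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("svm", SVC(probability=True)),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="soft",               # average predicted probabilities
    ),
)
clf.fit(X[:150], y[:150])
print("accuracy:", clf.score(X[150:], y[150:]))
```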

[NLP-26] SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

[Quick Read]: This paper addresses severe data contamination in existing software engineering benchmarks such as SWE-bench, which makes model evaluation unreliable. The key to its solution is SWE-MERA, a dynamic, continuously updated benchmark that automatically collects real-world GitHub issues and applies rigorous quality validation, building a high-quality, low-contamination evaluation environment. The pipeline minimizes data-leakage risks while providing roughly 10,000 potential tasks, of which 300 samples are currently available.

Link: https://arxiv.org/abs/2507.11059
Authors: Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh
Affiliations: SberAI; ITMO University; MWS AI
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.

[NLP-27] LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP

[Quick Read]: This paper addresses the timely identification and accurate risk stratification of cardiovascular disease (CVD) to reduce global mortality. Existing prediction models rely mainly on structured data, while unstructured clinical notes contain valuable early indicators. The key to the solution is an LLM-augmented clinical natural language processing (NLP) pipeline that uses domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports, combining cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning. The approach improves precision, recall, F1, and AUROC with high clinical relevance (kappa = 0.82) as assessed by cardiologists, and challenges such as contextual hallucination and temporal ambiguity are addressed via prompt engineering and hybrid rule-based verification.

Link: https://arxiv.org/abs/2507.11052
Authors: Haowei Yang, Ziyu Shen, Junli Shao, Luyao Men, Xinyue Han, Jing Dong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Timely identification and accurate risk stratification of cardiovascular disease (CVD) remain essential for reducing global mortality. While existing prediction models primarily leverage structured data, unstructured clinical notes contain valuable early indicators. This study introduces a novel LLM-augmented clinical NLP pipeline that employs domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports. Our approach integrates cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning. Evaluations on MIMIC-III and CARDIO-NLP datasets demonstrate improved performance in precision, recall, F1-score, and AUROC, with high clinical relevance (kappa = 0.82) assessed by cardiologists. Challenges such as contextual hallucination, which occurs when plausible information contradicts the provided source, and temporal ambiguity, which arises when models struggle with the chronological ordering of events, are addressed using prompt engineering and hybrid rule-based verification. This work underscores the potential of LLMs in clinical decision support systems (CDSS), advancing early warning systems and enhancing the translation of patient narratives into actionable risk assessments.

[NLP-28] Journalism-Guided Agentic In-Context Learning for News Stance Detection

[Quick Read]: This paper addresses the risk that personalized recommendation systems for online news reinforce filter bubbles and political polarization, tackling it through stance detection (identifying a text's position on a target) to enable viewpoint-aware recommendation and media-bias analysis. The key to the solution is JoA-ICL, a journalism-guided agentic in-context learning framework in which a language model agent predicts the stances of key structural segments of a news article (e.g., leads, quotes) and aggregates these segment stances to infer the overall article stance, capturing the overall position of long-form news articles more accurately.

Link: https://arxiv.org/abs/2507.11049
Authors: Dahyun Lee, Jonghyeon Choi, Jiyoung Han, Kunwoo Park
Affiliations: Soongsil University; KAIST
Categories: Computation and Language (cs.CL)
Comments: Preprint. 24 pages

Abstract:As online news consumption grows, personalized recommendation systems have become integral to digital journalism. However, these systems risk reinforcing filter bubbles and political polarization by failing to incorporate diverse perspectives. Stance detection – identifying a text’s position on a target – can help mitigate this by enabling viewpoint-aware recommendations and data-driven analyses of media bias. Yet, existing stance detection research remains largely limited to short texts and high-resource languages. To address these gaps, we introduce K-News-Stance, the first Korean dataset for article-level stance detection, comprising 2,000 news articles with article-level and 19,650 segment-level stance annotations across 47 societal issues. We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework that employs a language model agent to predict the stances of key structural segments (e.g., leads, quotes), which are then aggregated to infer the overall article stance. Experiments show that JoA-ICL outperforms existing stance detection methods, highlighting the benefits of segment-level agency in capturing the overall position of long-form news articles. Two case studies further demonstrate its broader utility in promoting viewpoint diversity in news recommendations and uncovering patterns of media bias.
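
The aggregation step can be as simple as a weighted vote over segment-level stances, with journalism-informed weights (e.g., leads counting more than quotes). A hypothetical sketch of that final step; the weights and labels are invented for illustration and are not the paper's actual aggregation rule:

```python
from collections import defaultdict

# Journalism-informed weights: structural role -> influence on article stance.
SEGMENT_WEIGHTS = {"lead": 3.0, "quote": 1.0, "body": 1.5}

def aggregate_article_stance(segments: list[tuple[str, str]]) -> str:
    """segments: (segment_role, predicted_stance) pairs from the LM agent."""
    votes: dict[str, float] = defaultdict(float)
    for role, stance in segments:
        votes[stance] += SEGMENT_WEIGHTS.get(role, 1.0)
    return max(votes, key=votes.get)

predicted = [("lead", "oppose"), ("quote", "support"),
             ("quote", "support"), ("body", "oppose")]
print(aggregate_article_stance(predicted))  # "oppose": lead + body outweigh quotes
```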

[NLP-29] First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

[Quick Read]: This paper addresses insufficient quantization error compensation in post-training quantization (PTQ): existing compensation-based weight calibration methods model quantization error with a second-order Taylor expansion, assuming the first-order term of a well-trained full-precision model is negligible. The study finds, however, that the progressive compensation process accumulates first-order deviations between latent weights and full-precision weights, making this assumption fundamentally flawed. The key to the solution is FOEM, which explicitly incorporates first-order gradient terms into quantization error compensation, approximating the gradient directly as the difference between latent and full-precision weights. This avoids the high cost and limited generalization of backpropagation-based gradients while improving quantization quality at minimal extra computational overhead.

Link: https://arxiv.org/abs/2507.11017
Authors: Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu
Affiliations: Beihang University; ETH Zürich; Xidian University; Zhongguancun Laboratory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at this https URL.
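
GPTQ-style compensation quantizes one column at a time and updates the not-yet-quantized columns via the Hessian inverse; FOEM's contribution is to add a first-order term approximated by the latent-minus-full-precision weight difference. The sketch below is a toy, runnable GPTQ-like loop showing where such a first-order correction could enter; the placement of the term is our guess from the abstract, the Hessian and quantizer are stand-ins, and this is not the released FOEM code.

```python
import torch

def quantize(col: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    """Toy symmetric round-to-nearest quantizer for one weight column."""
    scale = col.abs().max() / (2 ** (n_bits - 1) - 1)
    return torch.round(col / scale).clamp(-(2 ** (n_bits - 1)),
                                          2 ** (n_bits - 1) - 1) * scale

torch.manual_seed(0)
W_fp = torch.randn(8, 8)             # full-precision weights (frozen reference)
W = W_fp.clone()                     # latent weights, updated by compensation
H = torch.randn(8, 8)
H_inv = torch.linalg.inv(H @ H.T + 0.1 * torch.eye(8))  # SPD Hessian-inverse stand-in

Q = torch.zeros_like(W)
for j in range(W.shape[1]):
    Q[:, j] = quantize(W[:, j])
    # First-order deviation accumulated by earlier compensation (FOEM's term):
    g = W[:, j] - W_fp[:, j]
    err = (W[:, j] - Q[:, j] + g) / H_inv[j, j]
    # Second-order compensation of the remaining columns, as in GPTQ:
    if j + 1 < W.shape[1]:
        W[:, j + 1:] -= err.unsqueeze(1) * H_inv[j, j + 1:].unsqueeze(0)

print("quantization error:", (Q - W_fp).norm().item())
```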

[NLP-30] Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification ACL2025

【Quick Read】: This paper targets evidence quality and system efficiency in fact verification. The key elements of the solution are improving evidence quality through document summarization and answer reformulation, optimizing veracity prediction via post-training quantization under computational constraints, and strengthening overall system performance by integrating updated language model (LM) backbones.

Link: https://arxiv.org/abs/2507.11004
Authors: Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
Affiliations: Soongsil University; MAUM AI Inc.; Adobe Research, USA
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Workshop (FEVER)

Abstract:This paper presents HerO 2, Team HUMANE’s system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year’s challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at this https URL.

[NLP-31] Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection

【Quick Read】: This paper addresses text-based sexism detection in English and Spanish tweets by fine-tuning Llama 3.1 8B with hierarchical Low-Rank Adaptation (LoRA). The key to the solution is a conditional adapter-routing mechanism that explicitly models label dependencies across three hierarchically structured subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. Unlike conventional LoRA, which adapts only the attention layers, the method adapts all linear transformations, strengthening the model's ability to capture task-specific patterns, and uses a unified multilingual training strategy that exploits Llama 3.1's native bilingual capabilities for cross-lingual transfer, cutting trainable parameters and compute while maintaining strong detection performance.

Link: https://arxiv.org/abs/2507.10996
Authors: Lin Tian, Johanne R. Trippas, Marian-Andrei Rizoiu
Affiliations: University of Technology Sydney; RMIT University
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 5 tables, CLEF 2025

Abstract:This paper presents our approach to EXIST 2025 Task 1, addressing text-based sexism detection in English and Spanish tweets through hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Our method introduces conditional adapter routing that explicitly models label dependencies across three hierarchically structured subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. Unlike conventional LoRA applications that target only attention layers, we apply adaptation to all linear transformations, enhancing the model’s capacity to capture task-specific patterns. In contrast to complex data processing and ensemble approaches, we show that straightforward parameter-efficient fine-tuning achieves strong performance. We train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified multilingual training that leverages Llama 3.1’s native bilingual capabilities. The method requires minimal preprocessing and uses standard supervised learning. Our multilingual training strategy eliminates the need for separate language-specific models, achieving 1.7-2.4% F1 improvements through cross-lingual transfer. With only 1.67% trainable parameters compared to full fine-tuning, our approach reduces training time by 75% and model storage by 98%, while achieving competitive performance across all subtasks (ICM-Hard: 0.6774 for binary classification, 0.4991 for intention detection, 0.6519 for multilabel categorization).
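As an illustration of the setup the abstract describes (4-bit QLoRA with rank-16 adapters on all linear transformations), here is a minimal sketch using Hugging Face transformers and peft. The model name and every hyperparameter other than r=16 and 4-bit loading are assumptions.

```python
# Minimal QLoRA-style sketch: 4-bit base model, rank-16 LoRA on all linear
# layers rather than attention projections only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules="all-linear",  # adapt every linear layer,
                  task_type="CAUSAL_LM")        # not just attention
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction is trainable
```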

[NLP-32] Teach Me Sign: Stepwise Prompting LLM for Sign Language Production ICIP2025

【Quick Read】: This paper tackles the complexity and unique rules that make it hard to apply generative AI to sign language production. The key to the solution is to treat sign language as another natural language: an LLM is fine-tuned to learn the correspondence between text and sign language, and a stepwise prompting strategy extracts the sign-language knowledge implicit in the LLM to support learning and generation. Experiments show the approach effectively leverages the LLM's sign-language knowledge and reasoning to align the distributional differences and grammatical rules between sign and spoken language.

Link: https://arxiv.org/abs/2507.10972
Authors: Zhaoyi An, Rei Kawakami
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by IEEE ICIP 2025

Abstract:Large language models, with their strong reasoning ability and rich knowledge, have brought revolution to many tasks of AI, but their impact on sign language generation remains limited due to its complexity and unique rules. In this paper, we propose TEAch Me Sign (TEAM-Sign), treating sign language as another natural language. By fine-tuning an LLM, we enable it to learn the correspondence between text and sign language, and facilitate generation. Considering the differences between sign and spoken language, we employ a stepwise prompting strategy to extract the inherent sign language knowledge within the LLM, thereby supporting the learning and generation process. Experimental results on How2Sign and Phoenix14T datasets demonstrate that our approach effectively leverages both the sign language knowledge and reasoning capabilities of LLM to align the different distribution and grammatical rules between sign and spoken language.

[NLP-33] DS@GT at eRisk 2025: From prompts to predictions benchmarking early depression detection with conversational agent based assessments and temporal attention models

【Quick Read】: This paper addresses conversational depression detection, specifically using large language models (LLMs) for assessments based on the Beck Depression Inventory-II (BDI-II) when ground-truth labels are unavailable. The key to the solution is a prompt-engineering strategy: purpose-built prompt templates lead different LLMs to produce structured JSON outputs aligned with BDI-II criteria, which are then evaluated via cross-model agreement and internal consistency to analyze the conversational cues that influence symptom prediction.

Link: https://arxiv.org/abs/2507.10958
Authors: Anthony Miyaguchi, David Guecha, Yuwen Chiu, Sidharth Gaur
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language-models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled the analysis of conversational cues that influenced the prediction of symptoms. Our best submission, second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.
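The prompting pattern can be sketched as follows: one BDI-II item is rated per call, and the model is asked to return structured JSON. The prompt wording, output schema, and model name are our assumptions, not the team's exact setup.

```python
# Illustrative sketch of BDI-II-style structured assessment via prompting.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are assisting with a BDI-II-based assessment.
Conversation:
{conversation}

Rate the item "{item}" on a 0-3 scale following BDI-II criteria.
Respond with JSON only: {{"item": str, "score": int, "evidence": str}}"""

def rate_item(conversation: str, item: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": PROMPT.format(conversation=conversation, item=item)}],
    )
    return json.loads(resp.choices[0].message.content)

# Cross-model agreement can then be estimated by repeating rate_item with
# different models and comparing the returned scores, since no gold labels exist.
```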

[NLP-34] Modeling Understanding of Story-Based Analogies Using Large Language Models

【Quick Read】: This paper examines how well large language models (LLMs) align with human performance on analogical reasoning, specifically their ability to detect and map analogies. The key is a fine-grained evaluation at the level of individual analogies rather than overall accuracy alone: probing whether LLM sentence embeddings capture the similarity between the source and target texts of an analogy and the dissimilarity between source and distractor texts, and testing the effect of explicitly prompting LLMs to explain analogies. The study also compares model sizes (8B vs. 70B parameters) and state-of-the-art architectures such as GPT-4 and LLaMA3 on the analogical reasoning task.

Link: https://arxiv.org/abs/2507.10957
Authors: Kalit Inani, Keshav Kabra, Vijay Marupudi, Sashank Varma
Affiliations: Georgia Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear at CogSci 2025

Abstract:Recent advancements in Large Language Models (LLMs) have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies? Prior research has shown that LLMs can extract similarities from analogy problems but lack robust human-like reasoning. Building on Webb, Holyoak, and Lu (2023), the current study focused on a story-based analogical mapping task and conducted a fine-grained evaluation of LLM reasoning abilities compared to human performance. First, it explored the semantic representation of analogies in LLMs, using sentence embeddings to assess whether they capture the similarity between the source and target texts of an analogy, and the dissimilarity between the source and distractor texts. Second, it investigated the effectiveness of explicitly prompting LLMs to explain analogies. Throughout, we examine whether LLMs exhibit similar performance profiles to those observed in humans by evaluating their reasoning at the level of individual analogies, and not just at the level of overall accuracy (as prior studies have done). Our experiments include evaluating the impact of model size (8B vs. 70B parameters) and performance variation across state-of-the-art model architectures such as GPT-4 and LLaMA3. This work advances our understanding of the analogical reasoning abilities of LLMs and their potential as models of human reasoning.
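A minimal sketch of the embedding analysis: encode the source, target, and distractor texts and compare cosine similarities. The encoder and the example sentences (a variant of the classic fortress/tumor analogy) are illustrative assumptions.

```python
# Check whether a sentence encoder places an analogy's source closer to its
# relationally matched target than to a superficially similar distractor.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

source = "The general deployed his troops to surround the enemy fortress."
target = "The surgeon directed beams of radiation at the tumor from all sides."
distractor = "The general gave a speech at the military academy graduation."

emb = model.encode([source, target, distractor], convert_to_tensor=True)
sim_target = util.cos_sim(emb[0], emb[1]).item()
sim_distractor = util.cos_sim(emb[0], emb[2]).item()
print(f"source-target: {sim_target:.3f}, source-distractor: {sim_distractor:.3f}")
# A relationally faithful representation would yield sim_target > sim_distractor,
# even though the distractor shares more surface vocabulary with the source.
```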

[NLP-35] HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training

【Quick Read】: This paper addresses the weak performance of large language models (LLMs) in low-resource languages such as Korean, in particular the semantic ambiguity caused by homophonous Sino-Korean words that are indistinguishable in Hangul script. The key to the solution is HanjaBridge, a meaning-injection technique integrated into a continual pre-training (CPT) framework: instead of deterministically mapping a word to a single Hanja (Chinese character), it presents the model with all plausible Hanja candidates for a given homograph, encouraging contextual disambiguation, and pairs this with token-level knowledge distillation to prevent catastrophic forgetting.

Link: https://arxiv.org/abs/2507.10920
Authors: Seungho Choi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic alignment between Korean and Chinese through shared Hanja, we observe a strong positive cross-lingual transfer. Furthermore, these gains persist even when Hanja augmentation is omitted at inference time, ensuring practical efficiency with no additional run-time cost.
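A toy sketch of the augmentation idea as we read it from the abstract: attach all Hanja candidates to a homographic Sino-Korean word instead of committing to one. The candidate dictionary and the parenthesized output format are hypothetical.

```python
# Toy HanjaBridge-style augmentation: present every Hanja candidate for a
# homograph and let pre-training learn contextual disambiguation.
HANJA_CANDIDATES = {
    # the Korean homograph "사기" can correspond to several Hanja words
    "사기": ["詐欺", "士氣", "史記", "沙器"],
}

def hanja_bridge(sentence: str) -> str:
    tokens = []
    for tok in sentence.split():
        cands = HANJA_CANDIDATES.get(tok)
        # append all candidates rather than picking one deterministically
        tokens.append(f"{tok}({'/'.join(cands)})" if cands else tok)
    return " ".join(tokens)

print(hanja_bridge("사기 가 하락했다"))
# -> 사기(詐欺/士氣/史記/沙器) 가 하락했다
```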

[NLP-36] How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations SIGDIAL2025

【Quick Read】: This paper investigates how user preference for system responses in open-domain dialogue relates to subjective and objective stylistic similarity, and in particular how the two differ. The key to the solution is a novel dataset combining users' preferences, subjective stylistic similarity based on users' own perceptions, and objective stylistic similarity annotated by third-party evaluators. Analysis reveals a strong positive correlation between subjective stylistic similarity and user preference, and shows that subjective similarity diverges from third-party objective similarity, underscoring the importance of distinguishing the two.

Link: https://arxiv.org/abs/2507.10918
Authors: Ikumi Numaya, Shoji Moriya, Shiki Sato, Reina Akama, Jun Suzuki
Affiliations: Tohoku University; CyberAgent; NINJAL; RIKEN
Subjects: Computation and Language (cs.CL)
Comments: Accepted to SIGDIAL 2025 (long)

Abstract:Recent advancements in dialogue generation have broadened the scope of human-bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users’ own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.

[NLP-37] LiLM-RDB-SFC: Lightweight Language Model with Relational Database-Guided DRL for Optimized SFC Provisioning

【Quick Read】: This paper addresses effective management of Service Function Chains (SFCs) and optimal Virtual Network Function (VNF) placement in modern Software-Defined Networking (SDN) and Network Function Virtualization (NFV) environments. The key to the solution is LiLM-RDB-SFC, a novel approach pairing a Lightweight Language Model (LiLM) with a Relational Database (RDB): the LiLM answers network-state queries that then guide a Deep Reinforcement Learning (DRL) model toward efficient SFC resource provisioning.

Link: https://arxiv.org/abs/2507.10903
Authors: Parisa Fard Moshiri, Xinyu Zhu, Poonam Lohan, Burak Kantarci, Emil Janulewicz
Affiliations: University of Ottawa; Ciena
Subjects: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, Accepted to IEEE 16th International Conference on Network of the Future (NoF) 2025

Abstract:Effective management of Service Function Chains (SFCs) and optimal Virtual Network Function (VNF) placement are critical challenges in modern Software-Defined Networking (SDN) and Network Function Virtualization (NFV) environments. Although Deep Reinforcement Learning (DRL) is widely adopted for dynamic network decision-making, its inherent dependency on structured data and fixed action rules often limits adaptability and responsiveness, particularly under unpredictable network conditions. This paper introduces LiLM-RDB-SFC, a novel approach combining Lightweight Language Model (LiLM) with Relational Database (RDB) to answer network state queries to guide DRL model for efficient SFC provisioning. Our proposed approach leverages two LiLMs, Bidirectional and Auto-Regressive Transformers (BART) and the Fine-tuned Language Net T5 (FLAN-T5), to interpret network data and support diverse query types related to SFC demands, data center resources, and VNF availability. Results demonstrate that FLAN-T5 outperforms BART with a lower test loss (0.00161 compared to 0.00734), higher accuracy (94.79% compared to 80.2%), and less processing time (2h 2min compared to 2h 38min). Moreover, when compared to the large language model SQLCoder, FLAN-T5 matches the accuracy of SQLCoder while cutting processing time by 96% (SQLCoder: 54 h 43 min; FLAN-T5: 2 h 2 min).
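A minimal sketch of the LiLM query-answering step: FLAN-T5 answers a natural-language question against a serialized snapshot of the relational database. The prompt format, serialized schema, and model size are assumptions; the paper's fine-tuned checkpoint is not reproduced here.

```python
# FLAN-T5 answering a network-state query grounded in a serialized DB snapshot.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

db_snapshot = "DC1: cpu_free=12, mem_free=64GB; DC2: cpu_free=2, mem_free=8GB"
question = "Which data center has enough free CPU for a VNF needing 8 cores?"
prompt = f"Answer using the database. {db_snapshot}\nQuestion: {question}"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))  # e.g. "DC1"
```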

[NLP-38] NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization

【Quick Read】: This paper addresses the shortage of high-quality navigation instructions for language-guided navigation: expert-provided instructions are scarce, and synthesized annotations are often of low quality, making them insufficient for large-scale research. The key to the solution is NavComposer, a framework for automatically generating high-quality navigation instructions by explicitly decomposing semantic entities such as actions, scenes, and objects and recomposing them into natural language. Its modular architecture allows flexible integration of state-of-the-art techniques, while the explicit use of semantic entities improves both the richness and the accuracy of the instructions.

Link: https://arxiv.org/abs/2507.10894
Authors: Zongtao He, Liuyi Wang, Lu Chen, Chengju Liu, Qijun Chen
Affiliations: Tongji University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Language-guided navigation is a cornerstone of embodied AI, enabling agents to interpret language instructions and navigate complex environments. However, expert-provided instructions are limited in quantity, while synthesized annotations often lack quality, making them insufficient for large-scale research. To address this, we propose NavComposer, a novel framework for automatically generating high-quality navigation instructions. NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions. Its modular architecture allows flexible integration of state-of-the-art techniques, while the explicit use of semantic entities enhances both the richness and accuracy of instructions. Moreover, it operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training. Complementing NavComposer, we introduce NavInstrCritic, a comprehensive annotation-free evaluation system that assesses navigation instructions on three dimensions: contrastive matching, semantic consistency, and linguistic diversity. NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations. By decoupling instruction generation and evaluation from specific navigation agents, our method enables more scalable and generalizable research. Extensive experiments provide direct and practical evidence for the effectiveness of our method.

[NLP-39] Domain-Adaptive Small Language Models for Structured Tax Code Prediction

【Quick Read】: This paper addresses how multinational firms, which process large volumes of transactions, can accurately determine product and service tax codes (such as HSN or SAC) under complex, jurisdiction-specific tax regulations. The key to the solution is a domain-adaptive small language model (SLM) with an encoder-decoder architecture that predicts hierarchical tax-code sequences from unstructured product and service data, generating codes sequentially to capture the hierarchical dependencies within them. Experiments show this approach outperforms flat classifiers on HSN classification, and outperforms decoder-only and encoder-only architectures on structured tax-code sequence generation.

Link: https://arxiv.org/abs/2507.10880
Authors: Souvik Nath, Sumit Wadhwa, Luiz Perez
Affiliations: Dell Technologies
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 3 figures

Abstract:Every day, multinational firms process thousands of transactions, each of which must adhere to tax regulations that vary by jurisdiction and are often nuanced. The determination of product and service tax codes, such as HSN or SAC is a major use case in Tax compliance. An accurate determination of such codes is imperative to avoid any tax penalties. This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. In this approach, we address the problem of predicting hierarchical tax code sequences using unstructured product and services data. We employ an SLM based upon encoder-decoder architecture as this enables sequential generation of tax codes to capture the hierarchical dependencies present within the tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes, a domain that remains comparatively unexplored in current NLP research. In this paper, we demonstrate the superior performance of the domain-adaptive encoder-decoder SLMs over flat classifiers when applied to the Harmonized System of Nomenclature (HSN), and achieve superior results compared to decoder-only and encoder-only architectures for structured sequence generation tasks. This approach can also be scaled to other government-mandated tax commodity codes, such as United Nations Standard Products and Services Codes (UNSPSC), or Brazil’s Nomenclatura Comum do Mercosul (NCM).
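The seq2seq framing can be sketched as follows: an encoder-decoder model emits the HSN code as a token sequence (chapter, then heading, then subheading), so each level is conditioned on the ones above it. The base model, output format, and example code are illustrative assumptions; the required fine-tuning step on labeled pairs is implied but omitted.

```python
# Tax-code assignment as sequence generation rather than flat classification.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # needs fine-tuning

desc = "stainless steel kitchen sink, single bowl, brushed finish"
inputs = tok(f"predict HSN code: {desc}", return_tensors="pt")

# After fine-tuning on (description, "73 24 10") pairs, generation yields the
# hierarchical code as a token sequence, one digit group per level:
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```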

[NLP-40] Overview of the TREC 2022 deep learning track

【Quick Read】: This paper aims to improve document and passage retrieval, focusing on deep learning methods. The key is large-scale pretrained deep neural ranking models trained on the extensive human-annotated labels in the MS MARCO datasets. In 2022 the emphasis shifted to building a more complete test collection for passage retrieval, with document ranking as a secondary task whose document-level labels were inferred from passage-level labels. The results show that, although some top-performing runs did not use dense retrieval, deep learning models with large-scale pretraining continued to outperform traditional retrieval methods.

Link: https://arxiv.org/abs/2507.10865
Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, Ian Soboroff
Affiliations: Microsoft; University College London; Amazon; University of Illinois Urbana-Champaign; Neural Magic Inc; University of Waterloo; NIST
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2507.08191, arXiv:2507.08890

Abstract:This is the fourth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human annotated training labels available for both passage and document ranking tasks. In addition, this year we also leverage both the refreshed passage and document collections that were released last year leading to a nearly 16 times increase in the size of the passage collection and nearly four times increase in the document collection size. Unlike previous years, in 2022 we mainly focused on constructing a more complete test collection for the passage retrieval task, which has been the primary focus of the track. The document ranking task was kept as a secondary task, where document-level labels were inferred from the passage-level labels. Our analysis shows that similar to previous years, deep neural ranking models that employ large scale pretraining continued to outperform traditional retrieval methods. Due to the focusing our judging resources on passage judging, we are more confident in the quality of this year’s queries and judgments, with respect to our ability to distinguish between runs and reuse the dataset in future. We also see some surprises in overall outcomes. Some top-performing runs did not do dense retrieval. Runs that did single-stage dense retrieval were not as competitive this year as they were last year.

[NLP-41] MultiVox: Benchmarking Voice Assistants for Multimodal Interactions

【Quick Read】: This paper addresses the shortcomings of current benchmarks in evaluating voice assistants' ability to generate context-aware responses, especially implicit understanding of fine-grained speech characteristics (such as pitch, emotion, timbre, and volume) and environmental acoustic context, as well as models' ability to align paralinguistic cues with complementary visual signals. The key to the solution is MultiVox, the first omni voice-assistant benchmark for evaluating the integration of spoken and visual cues, including paralinguistic speech features, for truly multimodal understanding; it comprises 1,000 human-annotated and recorded speech dialogues covering diverse paralinguistic features and a range of visual cues.

Link: https://arxiv.org/abs/2507.10859
Authors: Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Affiliations: University of Maryland, College Park
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Work In Progress

Abstract:The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.

[NLP-42] LLMs on Trial: Evaluating Judicial Fairness for Large Language Models

【Quick Read】: This paper addresses the fairness of LLMs in the judicial domain: their decisions can affect rights and equity, yet their judicial fairness and implications for social justice remain underexplored. The key to the solution is a comprehensive framework, grounded in theories of judicial fairness, for measuring LLM fairness, quantified through 65 labels and 161 corresponding values. The authors further develop three evaluation metrics (inconsistency, bias, and imbalanced inaccuracy) and a method for assessing the overall fairness of multiple LLMs across labels, revealing pervasive inconsistency, bias, and imbalanced inaccuracy in current LLMs on judicial tasks.

Link: https://arxiv.org/abs/2507.10852
Authors: Yiran Hu, Zongyue Xue, Haitao Li, Siyuan Zheng, Qingjing Chen, Shaochun Wang, Xihan Zhang, Ning Zheng, Yun Liu, Qingyao Ai, Yiqun Liu, Charles L.A. Clarke, Weixing Shen
Affiliations: Tsinghua University; University of Waterloo; Yale Law School; Shanghai Jiao Tong University; University of Bologna
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs’ judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
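The abstract names the three metrics without defining them here, so the following is a purely illustrative stand-in for one of them: a simple notion of inconsistency as the fraction of disagreeing prediction pairs across repeated runs on the same case. The paper's actual definitions may differ.

```python
# Toy inconsistency measure: how often repeated model runs on the same case
# disagree with each other.
from itertools import combinations

def inconsistency(predictions: list[str]) -> float:
    """Fraction of prediction pairs (same case, repeated runs) that disagree."""
    pairs = list(combinations(predictions, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

runs = ["guilty", "guilty", "not guilty", "guilty"]  # 4 sampled verdicts
print(f"inconsistency = {inconsistency(runs):.2f}")   # 0.50
```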

[NLP-43] Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler

【Quick Read】: This paper examines what motivates online hate, specifically how social approval drives the production of hate speech. The key is testing two central tenets of Walther's (2024) social approval theory: (H1a) more signals of social approval on hate messages predict more subsequent hate messages, and (H1b) as social approval increases, hate messages become more extreme. Analyzing over 110 million posts from Parler (2018-2021), the study finds that the relationship between social approval and subsequent hate speech is inconsistent, with mixed results across time scales, suggesting that social-approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.

Link: https://arxiv.org/abs/2507.10810
Authors: David M. Markowitz, Samuel Hardman Taylor
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:

Abstract:In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther’s (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predicts more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was unassociated with the amount of hate speech in their next post and posts during the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.

[NLP-44] Automated Thematic Analyses Using LLM s: Xylazine Wound Management Social Media Chatter Use Case

【Quick Read】: This paper asks whether large language models (LLMs) can replicate expert-driven inductive thematic analysis of social media data, a task requiring deep interpretive and domain-specific expertise. The key to the solution is to model the task as a series of binary classifications rather than a single multi-label classification, using zero-, single-, and few-shot prompting strategies and measuring performance via accuracy, precision, recall, and F1-score, enabling efficient automated analysis of high-prevalence themes.

Link: https://arxiv.org/abs/2507.10803
Authors: JaMor Hairston, Ritvik Ranjan, Sahithi Lakamana, Anthony Spadaro, Selen Bozkurt, Jeanmarie Perrone, Abeed Sarker
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments: Pages: 19, Abstract word count: 151 words, Manuscript word count: 2185 words, References: 14, Figures: 3, Tables: 2

Abstract:Background Large language models (LLMs) face challenges in inductive thematic analysis, a task requiring deep interpretive and domain-specific expertise. We evaluated the feasibility of using LLMs to replicate expert-driven thematic analysis of social media data. Methods Using two temporally non-intersecting Reddit datasets on xylazine (n=286 and n=686, for model optimization and validation, respectively) with twelve expert-derived themes, we evaluated five LLMs against expert coding. We modeled the task as a series of binary classifications, rather than a single, multi-label classification, employing zero-, single-, and few-shot prompting strategies and measuring performance via accuracy, precision, recall, and F1-score. Results On the validation set, GPT-4o with two-shot prompting performed best (accuracy: 90.9%; F1-score: 0.71). For high-prevalence themes, model-derived thematic distributions closely mirrored expert classifications (e.g., xylazine use: 13.6% vs. 17.8%; MOUD use: 16.5% vs. 17.8%). Conclusions Our findings suggest that few-shot LLM-based approaches can automate thematic analyses, offering a scalable supplement for qualitative research. Keywords: thematic analysis, large language models, natural language processing, qualitative analysis, social media, prompt engineering, public health
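The binary decomposition can be sketched as a loop over themes, each posed as a yes/no few-shot prompt. The client, model, prompt wording, and themes shown are assumptions (the paper used twelve expert-derived themes with up to two shots).

```python
# One binary yes/no classification per theme, instead of a single
# multi-label prompt over all twelve themes.
from openai import OpenAI

client = OpenAI()
THEMES = ["xylazine use", "MOUD use", "wound care"]  # illustrative subset

def classify(post: str, theme: str, examples: list[tuple[str, str]]) -> bool:
    shots = "\n".join(f"Post: {p}\nLabel: {y}" for p, y in examples)
    prompt = (f"Does the post discuss the theme '{theme}'? Answer yes or no.\n"
              f"{shots}\nPost: {post}\nLabel:")
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip().lower().startswith("yes")

post = "Started using lidocaine on the wound, it is finally healing."
labels = {t: classify(post, t, examples=[]) for t in THEMES}  # paper: two-shot
print(labels)
```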

[NLP-45] Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers ACL2025

【Quick Read】: This paper addresses how to evaluate models' ability to interpret schematic diagrams in scientific literature, introducing MISS-QA, a benchmark that tasks models with interpreting such diagrams and answering related information-seeking questions. The key to the solution is a large expert-annotated dataset of 1,500 examples spanning 465 scientific papers, which requires models to ground their interpretation of a diagram in the broader context of the paper, enabling a comprehensive assessment of multimodal understanding.

Link: https://arxiv.org/abs/2507.10787
Authors: Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2025 Findings

Abstract:This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

[NLP-46] Theory of Mind and Self-Disclosure to CUIs

【Quick Read】: This paper addresses the difficulty users face in self-disclosing to conversational user interfaces (CUIs), a difficulty that often stems from worries about how others will react. The key idea is that expressing uncertainty or representing the CUI's reasoning can make the CUI's "theory of mind" more transparent to users and thereby encourage self-disclosure.

Link: https://arxiv.org/abs/2507.10773
Authors: Samuel Rhys Cox
Affiliations: Aalborg University
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Workshop paper presented at ToMinHAI at CUI'2025: Theory of Mind in Human-CUI Interaction, held in conjunction with the 2025 ACM conference on Conversational User Interfaces, July 8th, 2025. 4 pages. 3 figures

Abstract:Self-disclosure is important to help us feel better, yet is often difficult. This difficulty can arise from how we think people are going to react to our self-disclosure. In this workshop paper, we briefly discuss self-disclosure to conversational user interfaces (CUIs) in relation to various social cues. We then, discuss how expressions of uncertainty or representation of a CUI’s reasoning could help encourage self-disclosure, by making a CUI’s intended “theory of mind” more transparent to users.

[NLP-47] Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs

【Quick Read】: This paper addresses how to exploit the rich textual attributes in labeled property graphs to improve analytical tasks, where the core challenge is fusing textual semantics with graph-structured analysis. The key to the solution is to embed node and edge text properties with pretrained text embedding models, enriching downstream tasks such as node classification and relation prediction with contextual understanding without altering the structure of the graph pipeline, thereby improving the accuracy and interpretability of property graph analysis.

Link: https://arxiv.org/abs/2507.10772
Authors: Michal Podstawski
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
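A small sketch of the pipeline: embed each node's text property with a pretrained encoder and feed the vectors to an ordinary classifier for node classification, leaving the graph structure untouched. The toy "graph", labels, and model choices are assumptions.

```python
# Node classification from embedded text properties of a property graph.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

nodes = {
    "n1": "Invoice for consulting services rendered in March",
    "n2": "Employment contract for a senior engineer",
    "n3": "Invoice for cloud hosting, billed monthly",
    "n4": "Contract amendment extending the lease term",
}
labels = ["invoice", "contract", "invoice", "contract"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(list(nodes.values()))           # node property embeddings
clf = LogisticRegression(max_iter=1000).fit(X, labels)

new_node = "Invoice for office furniture purchase"
print(clf.predict(encoder.encode([new_node]))[0])  # -> "invoice"
```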

[NLP-48] Language Models for Adult Service Website Text Analysis

【Quick Read】: This paper addresses a key challenge in turning adult service website (ASW) data into actionable insight: analyzing ASW ad text, which is difficult because of pervasive emojis, poor grammar, and deliberate obfuscation intended to evade law enforcement scrutiny. The key to the solution is custom transformer models that can be trained efficiently with relatively modest GPU resources and run inference on consumer hardware, outperforming fine-tuned variants of well-known encoder-only transformers (including BERT-base, RoBERTa, and ModernBERT) on accuracy, recall, F1 score, and ROC AUC.

Link: https://arxiv.org/abs/2507.10743
Authors: Nickolas Freeman, Thanh Nguyen, Gregory Bott, Jason Parton, Collin Francel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 32 pages, 12 figures, 1 table

Abstract:Sex trafficking refers to the use of force, fraud, or coercion to compel an individual to perform in commercial sex acts against their will. Adult service websites (ASWs) have and continue to be linked to sex trafficking, offering a platform for traffickers to advertise their victims. Thus, organizations involved in the fight against sex trafficking often use ASW data when attempting to identify potential sex trafficking victims. A critical challenge in transforming ASW data into actionable insight is text analysis. Previous research using ASW data has shown that ASW ad text is important for linking ads. However, working with this text is challenging due to its extensive use of emojis, poor grammar, and deliberate obfuscation to evade law enforcement scrutiny. We conduct a comprehensive study of language modeling approaches for this application area, including simple information retrieval methods, pre-trained transformers, and custom transformer models. We demonstrate that characteristics of ASW text data allow efficient custom transformer models to be trained with relatively small GPU resources and used efficiently for inference on consumer hardware. Our custom models outperform fine-tuned variants of well-known encoder-only transformer models, including BERT-base, RoBERTa, and ModernBERT, on accuracy, recall, F1 score, and ROC AUC. We demonstrate the use of our best-performing custom configuration on three tasks related to ASW data analysis: (i) decomposing the giant component in a graph representation of ASW data, (ii) clustering ASW ad text, and (iii) using the learned token embeddings to understand the use of emojis in the illicit context we study. The models we develop represent a significant advancement in ASW text analysis, which can be leveraged in a variety of downstream applications and research.

[NLP-49] From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents

【Quick Read】: This paper addresses the fragmentation of current Web of Agents (WoA) research, where different communities lack a unified perspective and systematic integration. The key to the solution is a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) that systematizes the evolution of the WoA and shows how modern protocols such as A2A and MCP respond directly to the limitations of earlier standards such as the FIPA standards and OWL-based semantic agents. The paper further argues that the paradigm shift of the locus of intelligence, from external data or platforms into the agent's core model (the LLM), is foundational to modern Agentic AI and enables the scalable, adaptive systems the WoA has long envisioned.

Link: https://arxiv.org/abs/2507.10644
Authors: Tatiana Petrova, Aleksandr Puzikov, Boris Bliznukov, Radu State (SEDAN SnT, University of Luxembourg, Luxembourg, Luxembourg)
Affiliations: University of Luxembourg
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 33 pages, 9 figures, 8 tables

Abstract:The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users’ behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field’s trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and the MCP, are direct evolutionary responses to the well-documented limitations of earlier standards like FIPA standards and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the ‘locus of intelligence’: from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent’s core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.

[NLP-50] Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities SFT Replaces Them

【Quick Read】: This paper studies how reinforcement learning (RL) and supervised fine-tuning (SFT), the two main post-training approaches for teaching LLMs reasoning with maths and code datasets, differ in their training dynamics and effects on model performance. The key to the solution is a controlled comparison of RL and SFT on the same maths problems with the same model and similar hyperparameters, together with a close analysis of parameter changes across checkpoints, revealing the distinct mechanisms behind their in-domain and out-of-domain behaviour.

Link: https://arxiv.org/abs/2507.10616
Authors: Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.
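The freezing experiment can be sketched as follows: mark the parameters of the mid-layer MLPs as non-trainable before fine-tuning. The model choice and the definition of "mid" as the middle third of the layer stack are assumptions for illustration.

```python
# Freeze mid-layer MLPs (which SFT was observed to update most) before
# fine-tuning, leaving the rest of the model trainable.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
n_layers = model.config.num_hidden_layers
mid = range(n_layers // 3, 2 * n_layers // 3)      # middle third of the stack

for i, layer in enumerate(model.model.layers):
    if i in mid:
        for p in layer.mlp.parameters():
            p.requires_grad = False                # freeze mid-layer MLPs

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```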

[NLP-51] Emergence of Hierarchical Emotion Organization in Large Language Models

【Quick Read】: This paper asks how large language models (LLMs) model users' emotional states in conversational agents, a question critical for ethical deployment. The key to the solution is to borrow the emotion wheel, a psychological framework in which emotions organize hierarchically, and analyze probabilistic dependencies between emotional states in model outputs, revealing whether LLMs form hierarchical emotion structures consistent with human psychological models and probing systematic biases in emotion recognition across socioeconomic personas.

Link: https://arxiv.org/abs/2507.10599
Authors: Bo Zhao, Maya Okawa, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka
Affiliations: CBS-NTT Program in Physics of Intelligence, Harvard University; University of California, San Diego; Physics of Artificial Intelligence Laboratories, NTT Research, Inc.; Department of Psychology, Harvard University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As large language models (LLMs) increasingly power conversational agents, understanding how they model users’ emotional states is critical for ethical deployment. Inspired by emotion wheels – a psychological framework that argues emotions organize hierarchically – we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

[NLP-52] PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification

【Quick Read】: This paper addresses the lack of interpretability of large language models (LLMs) in text classification: traditional XAI methods such as LIME and SHAP provide local explanations but rely on computationally expensive perturbations. The key to the solution is PLEX, a perturbation-free local explanation method that leverages contextual embeddings extracted from the LLM and a "Siamese network"-style neural network trained once to align with feature-importance scores, eliminating subsequent perturbations and enabling efficient explanations for any new sentence.

Link: https://arxiv.org/abs/2507.10596
Authors: Yogachandran Rahulamathavan, Misbah Farooq, Varuna De Silva
Affiliations: Institute for Digital Technologies, Loughborough University London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) excel in text classification, but their complexity hinders interpretability, making it difficult to understand the reasoning behind their predictions. Explainable AI (XAI) methods like LIME and SHAP offer local explanations by identifying influential words, but they rely on computationally expensive perturbations. These methods typically generate thousands of perturbed sentences and perform inferences on each, incurring a substantial computational burden, especially with LLMs. To address this, we propose Perturbation-free Local Explanation (PLEX), a novel method that leverages the contextual embeddings extracted from the LLM and a "Siamese network"-style neural network trained to align with feature importance scores. This one-off training eliminates the need for subsequent perturbations, enabling efficient explanations for any new sentence. We demonstrate PLEX's effectiveness on four different classification tasks (sentiment, fake news, fake COVID-19 news and depression), showing more than 92% agreement with LIME and SHAP. Our evaluation using a "stress test" reveals that PLEX accurately identifies influential words, leading to a similar decline in classification accuracy as observed with LIME and SHAP when these words are removed. Notably, in some cases, PLEX demonstrates superior performance in capturing the impact of key features. PLEX dramatically accelerates explanation, reducing time and computational overhead by two and four orders of magnitude, respectively. This work offers a promising solution for explainable LLM-based text classification.
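A rough sketch of the PLEX idea as described: a small head maps the LLM's contextual token embeddings to per-token importance scores and is trained once against precomputed importance targets (e.g., from LIME), so no perturbations are needed at explanation time. The architecture below simplifies the paper's "Siamese network"-style design to a plain scoring head; sizes and the loss are our assumptions.

```python
# One-off training of an importance-scoring head on LLM token embeddings.
import torch
import torch.nn as nn

class PlexHead(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, seq_len, dim) -> (batch, seq_len) importances
        return self.scorer(token_embs).squeeze(-1)

head = PlexHead()
embs = torch.randn(4, 16, 768)            # stand-in for LLM embeddings
targets = torch.randn(4, 16)              # LIME/SHAP-style importance targets
loss = nn.MSELoss()(head(embs), targets)  # one-off alignment training step
loss.backward()
print(loss.item())
```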

[NLP-53] Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing

【Quick Read】: This paper addresses the overconfidence of large language model (LLM) outputs, which can undermine users' judgments of trustworthiness and legitimacy. The key to the solution is anthropomimetic uncertainty: expressing uncertainty through linguistic means so that machine uncertainty communication mirrors human communication, with an emphasis on linguistic authenticity and personalization to the user, in order to improve credibility and trust in natural-language interaction.

Link: https://arxiv.org/abs/2507.10587
Authors: Dennis Ulmer, Alexandra Lorson, Ivan Titov, Christian Hardmeier
Affiliations: ILLC, University of Amsterdam; CLCG, University of Groningen; ILCC, University of Edinburgh; IT University of Copenhagen; Pioneer Centre for Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human users increasingly rely on natural language interactions with large language models (LLMs) in order to receive help on a large variety of tasks and problems. However, the trustworthiness and perceived legitimacy of LLMs is undermined by the fact that their output is frequently stated in very confident terms, even when its accuracy is questionable. Therefore, there is a need to signal the confidence of the language model to a user in order to reap the benefits of human-machine collaboration and mitigate potential harms. Verbalized uncertainty is the expression of confidence with linguistic means, an approach that integrates perfectly into language-based interfaces. Nevertheless, most recent research in natural language processing (NLP) overlooks the nuances surrounding human uncertainty communication and the data biases that influence machine uncertainty communication. We argue for anthropomimetic uncertainty, meaning that intuitive and trustworthy uncertainty communication requires a degree of linguistic authenticity and personalization to the user, which could be achieved by emulating human communication. We present a thorough overview over the research in human uncertainty communication, survey ongoing research, and perform additional analyses to demonstrate so-far overlooked biases in verbalized uncertainty. We conclude by pointing out unique factors in human-machine communication of uncertainty and deconstruct anthropomimetic uncertainty into future research directions for NLP.

[NLP-54] AutoRAG-LoRA: Hallucination-Triggered Knowledge Retuning via Lightweight Adapters

【Quick Read】: This paper aims to reduce hallucination in large language models (LLMs), the factual inaccuracies that undermine trust in real-world deployment. The key to the solution is AutoRAG-LoRA, a modular Retrieval-Augmented Generation (RAG) framework that mitigates hallucination through lightweight LoRA adapters and KL-regularized training; its core mechanisms include automated prompt rewriting, hybrid retrieval, low-rank adapter tuning, and a hallucination detection module combining classifier-based and self-evaluation signals, with a contrastive KL loss and adapter fine-tuning enforcing factual alignment.

Link: https://arxiv.org/abs/2507.10586
Authors: Kaushik Dwivedi, Padmanabh Patanjali Mishra
Affiliations: BITS Pilani; University of Adelaide
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable fluency across a range of natural language tasks, yet remain vulnerable to hallucinations - factual inaccuracies that undermine trust in real world deployment. We present AutoRAG-LoRA, a modular framework for Retrieval-Augmented Generation (RAG) that tackles hallucination in large language models through lightweight LoRA-based adapters and KL-regularized training. Our pipeline integrates automated prompt rewriting, hybrid retrieval, and low-rank adapter tuning to ground responses in retrieved evidence. A hallucination detection module, using both classifier-based and self-evaluation techniques, assigns confidence scores to generated outputs, triggering an optional feedback correction loop. This loop enforces factual alignment via contrastive KL loss and adapter fine tuning. We demonstrate that AutoRAG-LoRA significantly reduces the factual drift while preserving the efficiency and modularity of the model.
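One plausible form of the KL-regularized objective is sketched below: penalize the divergence between the adapter-tuned model's token distribution and an evidence-conditioned reference distribution. Which distributions are paired, and the weighting, are assumptions based on the abstract rather than the authors' exact loss.

```python
# Illustrative KL alignment term between a tuned model and an
# evidence-conditioned reference distribution.
import torch
import torch.nn.functional as F

def kl_alignment_loss(student_logits: torch.Tensor,
                      grounded_logits: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """KL(student || grounded), averaged over the batch, scaled by beta."""
    log_p = F.log_softmax(student_logits, dim=-1)   # student (adapter-tuned)
    log_q = F.log_softmax(grounded_logits, dim=-1)  # evidence-conditioned ref
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    return beta * kl

student = torch.randn(2, 8, 32000, requires_grad=True)  # (batch, seq, vocab)
grounded = torch.randn(2, 8, 32000)
loss = kl_alignment_loss(student, grounded)
loss.backward()
print(loss.item())
```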

[NLP-55] A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations

【Quick Read】: This paper addresses how to systematically characterize prompt-based natural language explanations (NLEs) generated by large language models and their governance implications for AI systems. The key to the solution is an updated XAI taxonomy, grounded in the Explainable AI literature and adapted to prompt-based NLEs, spanning three dimensions: context (task, data, audience, and goals); generation and presentation (generation methods, inputs, interactivity, outputs, and forms); and evaluation (content, presentation, and user-centered properties, as well as the evaluation setting). The taxonomy gives researchers, auditors, and policymakers a framework for characterizing, designing, and improving NLEs for transparent AI systems.

Link: https://arxiv.org/abs/2507.10585
Authors: Isar Nejadgholi, Mona Omidyeganeh, Marc-Antoine Drouin, Jonathan Boisvert
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Presented at the Workshop of Technical AI Governance, 5 pages, 2 figures

Abstract:Effective AI governance requires structured approaches for stakeholders to access and verify AI system behavior. With the rise of large language models, Natural Language Explanations (NLEs) are now key to articulating model behavior, which necessitates a focused examination of their characteristics and governance implications. We draw on Explainable AI (XAI) literature to create an updated XAI taxonomy, adapted to prompt-based NLEs, across three dimensions: (1) Context, including task, data, audience, and goals; (2) Generation and Presentation, covering generation methods, inputs, interactivity, outputs, and forms; and (3) Evaluation, focusing on content, presentation, and user-centered properties, as well as the setting of the evaluation. This taxonomy provides a framework for researchers, auditors, and policymakers to characterize, design, and enhance NLEs for transparent AI systems.

[NLP-56] Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis

【Quick Read】: This paper addresses the difficulty of using unstructured text from legal, medical, and administrative sources in public health and social science research, where the main obstacles are the presence of sensitive personal information and substantial heterogeneity in structure and language. The key to the solution is a modular toolchain built entirely on open-weight large language models (LLMs) running on local hardware: LLM prompting standardizes, summarizes, and, when needed, translates texts into English to improve comparability, while LLM-based redaction combined with named entity recognition and rule-based methods minimizes disclosure risk, making privacy-sensitive research and large-scale analysis feasible.

Link: https://arxiv.org/abs/2507.10582
Authors: Anders Ledberg, Anna Thalén
Affiliations: Stockholm University
Subjects: Computation and Language (cs.CL); Methodology (stat.ME)
Comments:

Abstract:Unstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research. The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure. We demonstrate the toolchain on a corpus of 10,842 Swedish court decisions under the Care of Abusers Act (LVM), comprising over 56,000 pages. Each document is processed into an anonymized, standardized summary and transformed into a document-level embedding. Validation, including manual review, automated scanning, and predictive evaluation shows the toolchain effectively removes identifying information while retaining semantic content. As an illustrative application, we train a predictive model using embedding vectors derived from a small set of manually labeled summaries, demonstrating the toolchain's capacity for semi-automated content analysis at scale. By enabling structured, privacy-conscious analysis of sensitive documents, our toolchain opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints.

[NLP-57] An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation

【Quick Read】: This paper addresses the limited user accessibility, unreliable internet connectivity, and data privacy concerns facing digital mental health and support platforms. The key to the solution is EmoSApp, a fully offline smartphone conversational app for mental health and emotional support that uses large language models (LLMs) fine-tuned, quantized, and deployed on resource-constrained devices so that all inference runs locally on the phone, preserving data privacy and availability.

Link: https://arxiv.org/abs/2507.10580
Authors: Vimaleswar A, Prabhu Nandan Sahu, Nilesh Kumar Sahu, Haroon R Lone
Affiliations: IISER Bhopal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have been increasingly used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed for mental health and emotional support. The system leverages Large Language Models (LLMs), specifically fine-tuned, quantized and deployed using Torchtune and Executorch for resource-constrained devices, allowing all inferences to occur on the smartphone. To equip EmoSApp with robust domain expertise, we fine-tuned the LLaMA-3.2-1B-Instruct model on our custom curated "Knowledge dataset" of 14,582 mental-health QA pairs, along with the multi-turn conversational data. Through qualitative human evaluation with the student population, we demonstrate that EmoSApp has the ability to respond coherently, empathetically, maintain interactive dialogue, and provide relevant suggestions to user's mental health problems. Additionally, quantitative evaluations on nine standard commonsense and reasoning benchmarks demonstrate the efficacy of our fine-tuned, quantized model in low-resource settings. By prioritizing on-device deployment and specialized domain adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health solutions.

[NLP-58] Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors

【Quick Read】: This paper addresses how to assess the pedagogical ability of AI tutors powered by large language models (LLMs) to remediate student mistakes in educational dialogues. The key to the solution is a set of evaluation tracks that automatically measure tutor performance on mistake identification, precise location of the mistake, providing guidance, and feedback actionability, with the criteria for good and effective tutor responses grounded in learning-science principles, plus a track for detecting tutor identity. Submissions are compared against gold-standard human annotations to analyze AI tutor performance and advance this critical domain.

Link: https://arxiv.org/abs/2507.10579
Authors: Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, Justin Vasselli
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); KU Leuven; Nara Institute of Science and Technology
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications

Abstract:This shared task has aimed to assess pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on evaluating the quality of tutor responses aimed at student’s mistake remediation within educational dialogues. The task consisted of five tracks designed to automatically evaluate the AI tutor’s performance across key dimensions of mistake identification, precise location of the mistake, providing guidance, and feedback actionability, grounded in learning science principles that define good and effective tutor responses, as well as the track focusing on detection of the tutor identity. The task attracted over 50 international teams across all tracks. The submitted models were evaluated against gold-standard human annotations, and the results, while promising, show that there is still significant room for improvement in this domain: the best results for the four pedagogical ability assessment tracks range between macro F1 scores of 58.34 (for providing guidance) and 71.81 (for mistake identification) on three-class problems, with the best F1 score in the tutor identification track reaching 96.98 on a 9-class task. In this paper, we overview the main findings of the shared task, discuss the approaches taken by the teams, and analyze their performance. All resources associated with this task are made publicly available to support future research in this critical domain.

[NLP-59] Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions

【Quick Read】: This paper addresses the rapid spread of misinformation in the digital world, particularly in video content on platforms such as YouTube. The key to the solution is an AI-powered system with two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video and fact-checks them accurately with a Retrieval-Augmented Generation (RAG) approach, producing a nuanced, comprehensive report; Trend Bender uses that report together with a curated corpus of relevant articles to generate insightful, persuasive comments that stimulate productive debate, iteratively refining its style and output through a self-evaluation loop.

Link: https://arxiv.org/abs/2507.10577
Authors: Logé Cécile, Ghori Rehan
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Misinformation poses a significant threat in today's digital world, often spreading rapidly through platforms like YouTube. This paper introduces a novel approach to combating misinformation by developing an AI-powered system that not only fact-checks claims made in YouTube videos but also actively engages users in the comment section and challenge misleading narratives. Our system comprises two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video, uses a Retrieval-Augmented Generation (RAG) approach - drawing on sources like Wikipedia, Google Search, Google FactCheck - to accurately assess their veracity and generates a nuanced and comprehensive report. Through rigorous prompt engineering, Trend Bender leverages this report along with a curated corpus of relevant articles to generate insightful and persuasive comments designed to stimulate a productive debate. With a carefully set up self-evaluation loop, this agent is able to iteratively improve its style and refine its output. We demonstrate the system's capabilities through experiments on established benchmark datasets and a real-world deployment on YouTube, showcasing its potential to engage users and potentially influence perspectives. Our findings highlight the high accuracy of our fact-checking agent, and confirm the potential of AI-driven interventions in combating misinformation and fostering a more informed online space.
zh
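
下面用 Python 勾勒一个 RAG 式事实核查流程的最小骨架,帮助理解 Truth Sleuth 的数据流。注意:`extract_claims`、`retrieve`、`judge` 均为假设性占位函数(论文实际使用 Wikipedia、Google Search、Google FactCheck 等检索源与 LLM),此处仅示意流程结构,并非官方实现:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    label: str        # supported / refuted / not-enough-evidence
    evidence: list

def extract_claims(transcript: str) -> list:
    """占位:实际应由 LLM 从视频转写稿中抽取可核查的事实性声明。"""
    return [s.strip() for s in transcript.split(".") if s.strip()]

def retrieve(claim: str) -> list:
    """占位:实际应检索 Wikipedia / Google Search / Google FactCheck。"""
    return [f"evidence for: {claim}"]

def judge(claim: str, evidence: list) -> str:
    """占位:实际应将 (声明, 证据) 交给 LLM 判定并生成报告。"""
    return "not-enough-evidence"

def fact_check(transcript: str) -> list:
    results = []
    for claim in extract_claims(transcript):
        evidence = retrieve(claim)
        results.append(Verdict(claim, judge(claim, evidence), evidence))
    return results

for v in fact_check("The moon is made of cheese. Water boils at 100 C"):
    print(v.label, "|", v.claim)
```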

[NLP-60] Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?

【速读】: 该论文试图解决当前生成式 AI 在专利法律领域实际应用中的性能评估与局限性问题,特别是针对未来欧洲专利律师的欧洲资格考试(EQE)进行量化分析。其关键在于通过实验对比不同开源和专有大型语言模型(LLMs)在特定任务中的表现,揭示模型在准确性和逻辑一致性方面的不足,并强调自动评估指标与专业法律专家判断之间的偏差。研究指出,尽管近期大模型表现出色,但距离达到人类水平的专利专业能力仍有显著差距,未来需重点提升逻辑一致性、多模态处理能力和自适应提示技术。

链接: https://arxiv.org/abs/2507.10576
作者: Bhakti Khera,Rezvan Alamian,Pascal A. Scherz,Stephan M. Goetz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: 39 pages, 21 figures

点击查看摘要

Abstract:The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and reasons for it are underexplored. We evaluated several open-source and proprietary LLMs – including GPT-series, Anthropic, DeepSeek, and Llama-3 variants – on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas Amazon Web Services (AWS) Llama 3.1 8B lagged at 0.50 accuracy, and a Python-deployed Llama 3.1 8B scored 0.55. The latter two are within the range of mere guessing for the two-answer forced-choice design. None of the evaluated models could have passed the examination fully, as accuracy never exceeded the average threshold of 0.90 required for professional-level standards – also not models that are regularly promoted for their assumed beyond-PhD- and bar-admitted-lawyer-level performance. GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often lost formatting coherence. Human patent experts evaluated the textual justifications and uncovered various critical shortcomings of each model. They valued clarity and legal rationale over the raw correctness of the answers, which revealed misalignment between automatic metrics and expert judgment. Model outputs were sensitive to modest temperature changes and prompt wording, which underscores the remaining necessity of expert oversight. Future work should target logical consistency, robust multimodality, and adaptive prompting to approach human-level patent proficiency. In summary, despite the outstanding performance of recent large models, the general public might overestimate their performance. The field has a long way to go to develop a virtual patent attorney. This paper wants to point out several specific limitations that need solutions.
zh

[NLP-61] Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

【速读】: 该论文试图解决多智能体架构在零样本设置下缺乏信任问题,尤其是在未进行微调的情况下如何确保其可靠性。解决方案的关键在于引入一种模块化Agentic AI视觉分类框架,该框架结合了通用多模态代理、非视觉推理协调器以及检索增强生成(RAG)模块。通过信任校准机制和基于CLIP的图像检索与再评估循环,系统能够动态调节不同代理之间的信任度,从而提升模型在零样本场景下的准确性和鲁棒性。

链接: https://arxiv.org/abs/2507.10571
作者: Konstantinos I. Roumeliotis,Ranjan Sapkota,Manoj Karkee,Nikolaos D. Tselikas
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents, especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components, including the complete software source code, are openly released to support reproducibility, transparency, and community benchmarking at Github: this https URL
zh
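
摘要提到用 ECE 等校准指标来调节对各代理的信任。作为参考,下面给出基于等宽分箱的期望校准误差(Expected Calibration Error)的一个极简 Python 实现(通用写法,与论文代码无关):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """等宽分箱 ECE:各箱内 |平均置信度 - 准确率| 按样本占比加权求和。"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()         # 箱内实际准确率
            conf = confidences[in_bin].mean()    # 箱内平均置信度
            ece += in_bin.mean() * abs(acc - conf)
    return ece

# 置信度普遍高于实际准确率(过度自信,如摘要中的 Qwen-2.5-VL)时 ECE 偏大
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [1, 0, 1, 0]))
```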

[NLP-62] NLP Meets the World: Toward Improving Conversations With the Public About Natural Language Processing Research

【速读】: 该论文试图解决如何有效地向公众传达生成式 AI (Generative AI) 的能力与局限性的问题,以促进公众理解并支持相关研究。其解决方案的关键在于识别并应对三个主要障碍:模糊术语对公众理解的阻碍、不切实际的期望对可持续发展的阻碍以及伦理失败对持续支持的阻碍。通过引用已发表的 NLP 研究和大众媒体报道,论文提出了针对这些障碍的沟通建议,旨在实现与公众的高效、透明交流。

链接: https://arxiv.org/abs/2507.10559
作者: Shomir Wilson
机构: Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent developments in large language models (LLMs) have been accompanied by rapidly growing public interest in natural language processing (NLP). This attention is reflected by major news venues, which sometimes invite NLP researchers to share their knowledge and views with a wide audience. Recognizing the opportunities of the present, for both the research field and for individual researchers, this paper shares recommendations for communicating with a general audience about LLMs’ capabilities and limitations. These recommendations cover three themes: vague terminology as an obstacle to public understanding, unreasonable expectations as obstacles to sustainable growth, and ethical failures as obstacles to continued support. Published NLP research and popular news coverage are cited to illustrate these themes with examples. The recommendations promote effective, transparent communication with the general public about NLP, in order to strengthen public understanding and encourage support for research.
zh

计算机视觉

[CV-0] Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation

【速读】:该论文试图解决深度估计在实际应用中的局限性问题,特别是传统方法因硬件传感器(如LiDAR)的高成本、低分辨率和环境敏感性而受限,以及基于视觉的方法在泛化能力和稳定性上的不足。解决方案的关键在于开发“深度基础模型”(depth foundation models),即通过大规模数据集训练具有强零样本泛化能力的深度神经网络,以提升模型在不同场景下的适应性和鲁棒性。

链接: https://arxiv.org/abs/2507.11540
作者: Zhen Xu,Hongyu Zhou,Sida Peng,Haotong Lin,Haoyu Guo,Jiahao Shao,Peishan Yang,Qinglin Yang,Sheng Miao,Xingyi He,Yifan Wang,Yue Wang,Ruizhen Hu,Yiyi Liao,Xiaowei Zhou,Hujun Bao
机构: Shenzhen University (深圳大学); Shanghai AI Lab (上海人工智能实验室); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies. Traditional methods relying on hardware sensors like LiDAR are often limited by high costs, low resolution, and environmental sensitivity, limiting their applicability in real-world scenarios. Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either the low-capacity model architectures or the reliance on domain-specific and small-scale datasets. The emergence of scaling laws and foundation models in other domains has inspired the development of “depth foundation models”: deep neural networks trained on large datasets with strong zero-shot generalization capabilities. This paper surveys the evolution of deep learning architectures and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings. We explore the potential of these models to address existing challenges and provide a comprehensive overview of large-scale datasets that can facilitate their development. By identifying key architectures and training strategies, we aim to highlight the path towards robust depth foundation models, offering insights into their future research and applications.
zh

[CV-1] Streaming 4D Visual Geometry Transformer

【速读】:该论文旨在解决从视频中感知和重建4D时空几何的难题,这是一个基础但具有挑战性的计算机视觉任务。其关键解决方案是提出一种流式4D视觉几何变换器,该模型借鉴了自回归大语言模型的哲学思想,采用因果Transformer架构以在线方式处理输入序列,并通过时间因果注意力机制和缓存历史键值作为隐式记忆,实现高效的长期4D重建。此设计能够在实时场景中逐步整合历史信息,同时保持高质量的空间一致性。

链接: https://arxiv.org/abs/2507.11539
作者: Dong Zhuo,Wenzhao Zheng,Jiahe Guo,Yuqi Wu,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operator (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: this https URL.
zh
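
该模型的核心机制是时间因果注意力加 KV 缓存:流式推理时,每个新时间步只与缓存下来的历史 key/value 交互。下面是单头因果注意力配合 KV 缓存的一个 PyTorch 极简示意(假设性实现,非论文官方代码):

```python
import torch
import torch.nn.functional as F

class CachedCausalAttention(torch.nn.Module):
    """单头注意力,推理时以流式方式逐步追加 KV 缓存作为隐式记忆。"""
    def __init__(self, dim):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.cache_k, self.cache_v = None, None

    @torch.no_grad()
    def step(self, x):  # x: (B, 1, dim),每次只处理一个新时间步
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.cache_k is None:
            self.cache_k, self.cache_v = k, v
        else:
            self.cache_k = torch.cat([self.cache_k, k], dim=1)
            self.cache_v = torch.cat([self.cache_v, v], dim=1)
        attn = (q @ self.cache_k.transpose(1, 2)) / q.shape[-1] ** 0.5
        return F.softmax(attn, dim=-1) @ self.cache_v  # 只看历史,天然满足因果性

layer = CachedCausalAttention(64)
for t in range(4):                        # 模拟流式输入 4 帧的 token
    out = layer.step(torch.randn(1, 1, 64))
print(out.shape)  # torch.Size([1, 1, 64])
```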

[CV-2] CharaConsist: Fine-Grained Consistent Character Generation ICCV2025

【速读】:该论文旨在解决文本到图像生成中保持生成内容一致性的问题,特别是在连续或跨场景的图像序列中维持主体身份和细节的一致性。其解决方案的关键在于提出CharaConsist,该方法结合了点追踪注意力机制、自适应标记合并以及前景与背景的解耦控制,从而实现对前景和背景的细粒度一致性管理。

链接: https://arxiv.org/abs/2507.11533
作者: Mengyu Wang,Henghui Ding,Jianing Peng,Yao Zhao,Yunpeng Chen,Yunchao Wei
机构: Beijing Jiaotong University (北京交通大学); Fudan University (复旦大学); Alkaid Pte. Ltd. (Alkaid有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 accepted paper, project page: this https URL

点击查看摘要

Abstract:In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT model. Its ability to maintain fine-grained consistency, combined with the larger capacity of latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at this https URL
zh

[CV-3] CATVis: Context-Aware Thought Visualization MICCAI2025

【速读】:该论文试图解决从脑电图(EEG)信号中解码视觉表征的挑战,这一过程由于EEG信号的复杂性和噪声而具有较高难度。其解决方案的关键在于提出了一种五阶段框架,包括EEG编码器用于概念分类、在CLIP特征空间中进行EEG与文本嵌入的跨模态对齐、通过重新排序进行标题优化、概念与标题嵌入的加权插值以获得更丰富的语义,以及使用预训练的Stable Diffusion模型进行图像生成。该方法通过跨模态对齐和重新排序实现了上下文感知的EEG到图像生成,实验结果表明其在分类准确率、生成准确率和Fréchet Inception Distance指标上均优于当前最先进方法。

链接: https://arxiv.org/abs/2507.11522
作者: Tariq Mehmood,Hamza Ahmad,Muhammad Haroon Shakeel,Murtaza Taj
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at MICCAI 2025. This is the submitted version prior to peer review. The final Version of Record will appear in the MICCAI 2025 proceedings (Springer LNCS)

点击查看摘要

Abstract:EEG-based brain-computer interfaces (BCIs) have shown promise in various applications, such as motor imagery and cognitive state monitoring. However, decoding visual representations from EEG signals remains a significant challenge due to their complex and noisy nature. We thus propose a novel 5-stage framework for decoding visual representations from EEG signals: (1) an EEG encoder for concept classification, (2) cross-modal alignment of EEG and text embeddings in CLIP feature space, (3) caption refinement via re-ranking, (4) weighted interpolation of concept and caption embeddings for richer semantics, and (5) image generation using a pre-trained Stable Diffusion model. We enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking. Experimental results demonstrate that our method generates high-quality images aligned with visual stimuli, outperforming SOTA approaches by 13.43% in Classification Accuracy, 15.21% in Generation Accuracy and reducing Fréchet Inception Distance by 36.61%, indicating superior semantic alignment and image quality.
zh
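
CATVis 第 3、4 阶段的“重排序 + 加权插值”可以用很少的代码说明。以下为 CLIP 特征空间中余弦重排序与嵌入插值的极简示意(假设嵌入均已 L2 归一化,`alpha` 为假设的插值权重,非官方实现):

```python
import numpy as np

def rerank(query_emb, caption_embs):
    """按与查询嵌入(EEG 对齐到 CLIP 空间后)的余弦相似度降序重排候选标题。"""
    return np.argsort(-(caption_embs @ query_emb))   # 嵌入均已 L2 归一化

def interpolate(concept_emb, caption_emb, alpha=0.7):
    """概念嵌入与最佳标题嵌入加权插值,再归一化回单位球面。"""
    z = alpha * concept_emb + (1 - alpha) * caption_emb
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
q = rng.normal(size=512); q /= np.linalg.norm(q)
caps = rng.normal(size=(5, 512))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)

order = rerank(q, caps)
z = interpolate(q, caps[order[0]])       # 作为 Stable Diffusion 的条件向量
print(order, round(float(np.linalg.norm(z)), 3))
```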

[CV-4] COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation

【速读】:该论文试图解决计算机难以模仿人类色彩感知的问题,其核心挑战在于如何构建一种能够反映人类视觉感知的计算色彩表示模型。解决方案的关键在于提出了一种基于人类感知的模糊色彩模型COLIBRI(Color Linguistic-Based Representation and Interpretation),该模型利用模糊集和逻辑来建立色彩分类框架,并通过三阶段实验方法,包括初步实验识别可区分的色彩刺激、大规模人类分类调查以及基于反馈和上下文变化的自适应机制,从而提取模糊划分并生成反映现实感知不确定性的隶属函数。

链接: https://arxiv.org/abs/2507.11488
作者: Pakizar Shamoi,Nuray Toganas,Muragul Muratbekova,Elnara Kadyrgali,Adilet Yerkin,Ayan Igali,Malika Ziyada,Ayana Adilova,Aron Karatayev,Yerdauit Torekhan
机构: Kazakh-British Technical University (哈萨克-英国技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: submitted to IEEE for consideration

点击查看摘要

Abstract:Colors are omnipresent in today’s world and play a vital role in how humans perceive and interact with their surroundings. However, it is challenging for computers to imitate human color perception. This paper introduces the Human Perception-Based Fuzzy Color Model, COLIBRI (Color Linguistic-Based Representation and Interpretation), designed to bridge the gap between computational color representations and human visual perception. The proposed model uses fuzzy sets and logic to create a framework for color categorization. Using a three-phase experimental approach, the study first identifies distinguishable color stimuli for hue, saturation, and intensity through preliminary experiments, followed by a large-scale human categorization survey involving more than 1000 human subjects. The resulting data are used to extract fuzzy partitions and generate membership functions that reflect real-world perceptual uncertainty. The model incorporates a mechanism for adaptation that allows refinement based on feedback and contextual changes. Comparative evaluations demonstrate the model’s alignment with human perception compared to traditional color models, such as RGB, HSV, and LAB. To the best of our knowledge, no previous research has documented the construction of a model for color attribute specification based on a sample of this size or a comparable sample of the human population (n = 2496). Our findings are significant for fields such as design, artificial intelligence, marketing, and human-computer interaction, where perceptually relevant color representation is critical.
zh
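
COLIBRI 的基础构件是模糊隶属函数:同一个颜色刺激可以以不同程度同时属于多个语言类别。下面以 HSV 色相上的“红色”为例给出梯形隶属函数的极简示意(区间端点为随意假设的示例值,并非论文通过 2496 名被试标定的结果):

```python
def trapezoid(x, a, b, c, d):
    """梯形隶属函数:[a,b] 线性上升,[b,c] 恒为 1,[c,d] 线性下降。"""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# 假设性示例:色相(0-360 度)上“红色”的隶属度;红色跨越 0 度,需分两段处理
def red_membership(hue):
    return max(trapezoid(hue, -20, -5, 15, 30),      # 0 度附近
               trapezoid(hue, 340, 355, 375, 390))   # 360 度附近(环绕)

for h in (0, 20, 40, 350):
    print(h, round(red_membership(h), 3))
```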

[CV-5] 3C-FBI: A Combinatorial method using Convolutions for Circle Fitting in Blurry Images

【速读】:该论文旨在解决在退化成像条件下鲁棒的圆形检测与拟合这一基础计算机视觉挑战。其解决方案的关键在于提出了一种名为Combinatorial Convolution-based Circle Fitting for Blurry Images (3C-FBI)的算法,该算法通过结合高效的组合边缘像素(edgel)采样和参数空间中的卷积密度估计,弥合了圆形检测与精确参数拟合之间的差距。

链接: https://arxiv.org/abs/2507.11476
作者: Esteban Román Catafau,Torbjörn E.M. Nordling
机构: National Cheng Kung University (国立成功大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 16 figures

点击查看摘要

Abstract:This paper addresses the fundamental computer vision challenge of robust circle detection and fitting in degraded imaging conditions. We present Combinatorial Convolution-based Circle Fitting for Blurry Images (3C-FBI), an algorithm that bridges the gap between circle detection and precise parametric fitting by combining (1) efficient combinatorial edge pixel (edgel) sampling and (2) convolution-based density estimation in parameter space. We evaluate 3C-FBI across three experimental frameworks: (1) real-world medical data from Parkinson’s disease assessments (144 frames from 36 videos), (2) controlled synthetic data following established circle-fitting benchmarks, and (3) systematic analysis across varying spatial resolutions and outlier contamination levels. Results show that 3C-FBI achieves state-of-the-art accuracy (Jaccard index 0.896) while maintaining real-time performance (40.3 fps), significantly outperforming classical methods like RCD (6.8 fps) on a standard CPU (i7-10875H). It maintains near-perfect accuracy (Jaccard almost 1.0) at high resolutions (480x480) and reliable performance (Jaccard higher than 0.95) down to 160x160 with up to 20% outliers. In extensive synthetic testing, 3C-FBI achieves a mean Jaccard Index of 0.989 across contamination levels, comparable to modern methods like Qi et al. (2024, 0.991), and surpassing RHT (0.964). This combination of accuracy, speed, and robustness makes 3C-FBI ideal for medical imaging, robotics, and industrial inspection under challenging conditions.
zh
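
3C-FBI 的组合采样一步可以这样理解:任意三个不共线的边缘点唯一确定一个圆,随后在参数空间统计支持度。下面给出“三点定圆 + 内点计数”的极简 NumPy 示意(仅演示几何原理,非论文的卷积密度估计实现):

```python
import numpy as np

def circle_from_3pts(p1, p2, p3):
    """三点定圆:由垂直平分线条件列线性方程组解出圆心,再求半径。"""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    A = np.array([[x2 - x1, y2 - y1],
                  [x3 - x1, y3 - y1]], dtype=float)
    b = 0.5 * np.array([x2**2 - x1**2 + y2**2 - y1**2,
                        x3**2 - x1**2 + y3**2 - y1**2])
    if abs(np.linalg.det(A)) < 1e-12:
        return None                       # 三点近似共线,跳过该组合
    cx, cy = np.linalg.solve(A, b)
    return cx, cy, float(np.hypot(x1 - cx, y1 - cy))

def count_inliers(edgels, cx, cy, r, tol=1.0):
    d = np.abs(np.hypot(edgels[:, 0] - cx, edgels[:, 1] - cy) - r)
    return int((d < tol).sum())

t = np.linspace(0, 2 * np.pi, 200)
pts = np.c_[50 + 20 * np.cos(t), 60 + 20 * np.sin(t)]
pts += np.random.normal(0, 0.3, pts.shape)            # 模糊图像中的噪声边缘点
cx, cy, r = circle_from_3pts(pts[0], pts[70], pts[140])
print(round(cx, 1), round(cy, 1), round(r, 1), count_inliers(pts, cx, cy, r))
```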

[CV-6] HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing

【速读】:该论文旨在解决传统统计形状模型(Statistical Shape Modeling, SSM)在复杂血管结构如多分支血管中的表达能力和可扩展性受限的问题。其解决方案的关键在于提出HUG-VAS,一种结合非均匀有理B样条(NURBS)表面参数化与基于扩散的生成建模的分层生成模型,以合成高精度、细粒度的主动脉几何结构,并通过分层架构捕捉解剖变异性的两个层面。

链接: https://arxiv.org/abs/2507.11474
作者: Pan Du,Mingqi Xu,Xiaozhi Zhu,Jian-xun Wang
机构: University of Notre Dame(圣母大学); Cornell University(康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 59 pages, 9 figures

点击查看摘要

Abstract:Accurate characterization of vascular geometry is essential for cardiovascular diagnosis and treatment planning. Traditional statistical shape modeling (SSM) methods rely on linear assumptions, limiting their expressivity and scalability to complex topologies such as multi-branch vascular structures. We introduce HUG-VAS, a Hierarchical NURBS Generative model for Vascular geometry Synthesis, which integrates NURBS surface parameterization with diffusion-based generative modeling to synthesize realistic, fine-grained aortic geometries. Trained with 21 patient-specific samples, HUG-VAS generates anatomically faithful aortas with supra-aortic branches, yielding biomarker distributions that closely match those of the original dataset. HUG-VAS adopts a hierarchical architecture comprising a denoising diffusion model that generates centerlines and a guided diffusion model that synthesizes radial profiles conditioned on those centerlines, thereby capturing two layers of anatomical variability. Critically, the framework supports zero-shot conditional generation from image-derived priors, enabling practical applications such as interactive semi-automatic segmentation, robust reconstruction under degraded imaging conditions, and implantable device optimization. To our knowledge, HUG-VAS is the first SSM framework to bridge image-derived priors with generative shape modeling via a unified integration of NURBS parameterization and hierarchical diffusion processes.
zh

[CV-7] Elevating 3D Models: High-Quality Texture and Geometry Refinement from a Low-Quality Model SIGGRAPH2025

【速读】:该论文试图解决高质量3D资产在计算机图形学和3D视觉应用中稀缺的问题,这一问题主要由获取成本高昂所致。解决方案的关键在于提出Elevate3D框架,其核心是HFS-SDEdit,这是一种专门的纹理增强方法,能够在保持外观和几何结构的同时显著提升纹理质量并修复其退化问题。此外,Elevate3D通过逐视图的方式交替进行纹理和几何优化,并利用基于HFS-SDEdit优化后的图像中的几何线索,结合先进的单目几何预测器,从而实现与增强纹理无缝对齐的详细且准确的几何结构。

链接: https://arxiv.org/abs/2507.11465
作者: Nuri Ryu,Jiyun Won,Jooeun Son,Minsu Gong,Joo-Haeng Lee,Sunghyun Cho
机构: POSTECH(浦项科技大学); Pebblous(佩布洛斯); South Korea(韩国)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SIGGRAPH 2025. For the project page, see this https URL

点击查看摘要

Abstract:High-quality 3D assets are essential for various applications in computer graphics and 3D vision but remain scarce due to significant acquisition costs. To address this shortage, we introduce Elevate3D, a novel framework that transforms readily accessible low-quality 3D assets into higher quality. At the core of Elevate3D is HFS-SDEdit, a specialized texture enhancement method that significantly improves texture quality while preserving the appearance and geometry and fixing its degradations. Furthermore, Elevate3D operates in a view-by-view manner, alternating between texture and geometry refinement. Unlike previous methods that have largely overlooked geometry refinement, our framework leverages geometric cues from images refined with HFS-SDEdit by employing state-of-the-art monocular geometry predictors. This approach ensures detailed and accurate geometry that aligns seamlessly with the enhanced texture. Elevate3D outperforms recent competitors by achieving state-of-the-art quality in 3D model refinement, effectively addressing the scarcity of high-quality open-source 3D assets.
zh

[CV-8] COLI: A Hierarchical Efficient Compressor for Large Images

【速读】:该论文旨在解决高分辨率、大视场图像压缩中传统方法难以保留关键细节以及数据驱动方法泛化能力有限的问题,同时针对基于隐式神经表示(INR)的大型图像压缩所面临的压缩速度慢和压缩比不足的问题。其解决方案的关键在于引入COLI框架,通过预训练-微调范式、混合精度训练以及序列损失的并行化重构加速INR压缩过程的收敛,并利用INR将图像存储约束转化为权重存储的优势,提出一种名为Hyper-Compression的后训练技术以显著提升压缩比并保持最小输出失真。

链接: https://arxiv.org/abs/2507.11443
作者: Haoran Wang,Hanyu Pei,Yang Lyu,Kai Zhang,Li Li,Feng-Lei Fan
机构: Frontier of Artificial Networks (FAN) Lab, Department of Data Science, City University of Hong Kong (城市大学); Shanghai United Imaging Healthcare Co., Ltd (上海联影医疗有限公司); MoE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs’ transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
zh
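
INR 压缩的本质是“以权重换像素”:训练一个从归一化坐标到像素强度的小网络去过拟合单张图像,最终存储的是网络权重而非原始像素。下面是一个坐标 MLP 过拟合灰度图的 PyTorch 极简示意(网络结构与超参均为示例性假设;COLI 实际基于 NeRV,并包含预训练-微调等加速手段):

```python
import torch

img = torch.rand(64, 64)                         # 用随机灰度图代替待压缩图像
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 64), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (4096, 2) 归一化坐标
target = img.reshape(-1, 1)

model = torch.nn.Sequential(                     # 这些权重就是“压缩后的码流”
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):                          # 对单张图像做过拟合训练
    loss = torch.nn.functional.mse_loss(model(coords), target)
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))   # 解码时只需用坐标网格再前向一遍即可还原图像
```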

[CV-9] Implementing Adaptations for Vision AutoRegressive Model ICML2025

【速读】:该论文试图解决生成式模型在特定下游任务(如医学数据生成)中的适应性问题,以及在保护适应数据隐私方面的挑战。其关键解决方案在于对视觉自回归模型(VAR)的多种适应策略进行实现与基准测试,并将其与最先进的扩散模型(DM)适应策略进行比较,以探索VAR在非差分隐私(non-DP)和差分隐私(DP)条件下的性能表现。

链接: https://arxiv.org/abs/2507.11441
作者: Kaif Shaikh,Antoni Kowalczuk,Franziska Boenisch,Adam Dziedzic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at DIG-BUGS: Data in Generative Models Workshop @ ICML 2025

点击查看摘要

Abstract:Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in the image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While for DMs there exist many techniques, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations - ones that aim to preserve privacy of the adaptation data - have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR, and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations; however, the performance of DP adaptations suffers, which necessitates further research in private adaptations for VAR. Code is available at this https URL.
zh

[CV-10] Attributes Shape the Embedding Space of Face Recognition Models

【速读】:该论文试图解决人脸识别(Face Recognition, FR)模型在嵌入空间中对不同属性(如面部特征和图像属性)的依赖性或不变性问题。其解决方案的关键在于提出一种几何方法,用于描述FR模型对这些属性的依赖关系,并引入一种受物理学启发的对齐度量,以评估模型在不同属性上的不变性程度。通过在受控简化模型和广泛使用的FR模型上进行实验,该方法揭示了模型在不同属性上的变化程度,从而提供了对其性能优势和局限性的深入理解。

链接: https://arxiv.org/abs/2507.11372
作者: Pierrick Leroy,Antonio Mastropietro,Marco Nurisso,Francesco Vaccarino
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability. Code available here: this https URL
zh

[CV-11] UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

【速读】:该论文旨在解决现有视频描述生成基准和模型主要以视觉为中心,忽视音频在传达场景动态、说话者意图和叙事背景中的关键作用的问题。其解决方案的关键在于提出UGC-VideoCap,一个专注于短格式用户生成视频的多模态描述生成基准和模型框架,强调音频与视觉模态的平衡整合,并通过结构化的三阶段人工介入标注流程确保数据质量,同时引入UGC-VideoCaptioner(3B)模型,采用两阶段训练策略(监督微调与组相对策略优化)实现从有限数据中的高效适应。

链接: https://arxiv.org/abs/2507.11336
作者: Peiran Wu,Yunze Liu,Zhengdong Zhu,Enmin Zhou,Shawn Shen
机构: University of Bristol(布里斯托大学); Memories.ai Research(记忆人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omni datasets and lightweight, capable models hampers progress in fine-grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.
zh
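
摘要中的 GRPO 以组为单位计算相对优势:对同一提示采样出的一组回复,用组内奖励的均值与标准差做归一化,即 A_i = (r_i - mean(r)) / (std(r) + eps)。下面是该优势计算的极简示意(假设性实现,仅展示核心公式):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (G,) 同一提示下 G 个采样回复的奖励;返回组内相对优势。"""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.2, 0.9, 0.4, 0.9])   # 例如 4 条候选字幕的打分
print(grpo_advantages(rewards))  # 高于组均值的样本获得正优势,用于加权策略梯度
```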

[CV-12] MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network ICCV2025

【速读】:该论文试图解决多视角立体(Multi-View Stereo, MVS)方法在处理无纹理区域和反射表面等挑战性区域时,由于特征匹配失败而导致的深度图预测性能下降问题。其解决方案的关键在于引入单目深度估计的先验信息,通过注意力机制将参考视图的单目特征融合到源视图特征中,并利用参考视图的单目深度动态更新边缘区域的深度候选,同时设计基于单目深度的相对一致性损失来监督深度预测,从而提升MVS方法在复杂场景下的鲁棒性。

链接: https://arxiv.org/abs/2507.11333
作者: Jianfei Jiang,Qiankun Liu,Haochen Yu,Hongyuan Liu,Liyong Wang,Jiansheng Chen,Huimin Ma
机构: University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at this https URL.
zh
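
单目深度只有相对(尺度/平移不定)意义,因此用它监督 MVS 深度前通常要先做对齐。下面给出一种常见做法的极简示意:最小二乘求解尺度 s 与平移 t 使 s*mono+t 逼近预测深度,再在有效像素上取 L1 残差(这是对“相对一致性损失”思想的假设性还原,具体定义以论文为准):

```python
import torch

def relative_consistency_loss(pred, mono, mask):
    """pred/mono: (N,) 深度;先最小二乘解 s,t 使 s*mono+t ≈ pred,再取 L1 残差。"""
    p, m = pred[mask], mono[mask]
    A = torch.stack([m, torch.ones_like(m)], dim=1)        # (N, 2)
    sol = torch.linalg.lstsq(A, p.unsqueeze(1)).solution   # [s, t]
    return (A @ sol).squeeze(1).sub(p).abs().mean()

pred = torch.rand(1000) * 10
mono = (pred - 2.0) / 3.0 + 0.01 * torch.randn(1000)   # 与真值仅差尺度/平移的单目深度
mask = torch.ones(1000, dtype=torch.bool)
print(relative_consistency_loss(pred, mono, mask))      # 对齐后残差应很小
```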

[CV-13] HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging

【速读】:该论文旨在解决腹部CT图像中肝脏及肿瘤分割的准确性问题,这一任务对于可靠的诊断和治疗计划至关重要,但因解剖结构复杂、肿瘤表现多样以及标注数据有限而面临挑战。其解决方案的关键在于提出HANS-Net框架,该框架融合了双曲卷积用于分层几何表示、类小波分解模块用于多尺度纹理学习、生物启发的突触可塑性机制用于自适应特征增强,以及隐式神经表示分支以建模精细且连续的解剖边界,同时引入不确定性感知的蒙特卡洛丢弃和轻量级时间注意力机制以提升预测置信度和跨切片一致性。

链接: https://arxiv.org/abs/2507.11325
作者: Arefin Ittesafun Abian,Ripon Kumar Debnath,Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Md Rafiqul Islam,Asif Karim,Reem E. Mohamed,Sami Azam
机构: United International University(联合国际大学); Charles Darwin University(查尔斯·达尔文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 figures. Will be submitted to IEEE Transactions on Radiation and Plasma Medical Sciences

点击查看摘要

Abstract:Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. Extensive evaluations on the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the 3D-IRCADb-01 dataset obtains an average Dice of 87.45%, IoU of 80.30%, ASSD of 1.525 mm, and VOE of 19.71%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.
zh
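
HANS-Net 用蒙特卡洛 dropout 量化不确定性:推理时保持 dropout 激活,多次随机前向,以均值作预测、方差作不确定性图。极简 PyTorch 示意(网络仅为占位的小卷积模型,非论文结构):

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """推理时保持 dropout 激活,多次随机前向:均值作预测,方差作不确定性。"""
    model.train()                        # 注意:这里 train() 只为启用 Dropout
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout2d(0.2),             # 随机丢弃特征图通道,构成采样来源
    torch.nn.Conv2d(8, 1, 1))
mean, var = mc_dropout_predict(model, torch.randn(1, 1, 64, 64))
print(mean.shape, float(var.mean()))     # 方差大的区域即低置信度区域
```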

[CV-14] A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction

【速读】:该论文试图解决现有基于高斯散射(Gaussian Splatting, GS)的方法在表面重建过程中仅使用单一类型的散射基元(如高斯椭圆或高斯椭球)进行物体表面表示,导致重建质量不足的问题。其解决方案的关键在于提出一种新型框架,首次实现高斯散射在表面重建过程中融合多种几何基元的组合式散射策略,并结合混合基元初始化策略与顶点剪枝机制,以提升表面表示的学习效果和重建精度。

链接: https://arxiv.org/abs/2507.11321
作者: Haoxuan Qu,Yujun Cai,Hossein Rahmani,Ajay Kumar,Junsong Yuan,Jun Liu
机构: Lancaster University (兰卡斯特大学); The University of Queensland (昆士兰大学); The Hong Kong Polytechnic University (香港理工大学); University at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, Gaussian Splatting (GS) has received a lot of attention in surface reconstruction. However, while 3D objects can have complex and diverse shapes in the real world, existing GS-based methods use only a single type of splatting primitive (Gaussian ellipse or Gaussian ellipsoid) to represent object surfaces during reconstruction. In this paper, we highlight that this can be insufficient for representing object surfaces in high quality. Thus, we propose a novel framework that, for the first time, enables Gaussian Splatting to incorporate multiple types of (geometrical) primitives during its surface reconstruction process. Specifically, in our framework, we first propose a compositional splatting strategy, enabling the splatting and rendering of different types of primitives in the Gaussian Splatting pipeline. In addition, we design our framework with a mixed-primitive-based initialization strategy and a vertex pruning mechanism so that surface representation learning can make full use of the different primitive types. Extensive experiments show the efficacy of our framework and its accurate surface reconstruction performance.
zh

[CV-15] All Eyes no IMU: Learning Flight Attitude from Vision Alone

【速读】:该论文试图解决飞行机器人在缺乏惯性测量单元(Inertial Measurement Unit, IMU)的情况下实现姿态控制的问题,特别是针对没有专门重力感知能力的飞行生物所具有的视觉依赖性。解决方案的关键在于使用一种仅依赖视觉信息的飞行控制方法,具体而言是利用向下视角的事件相机(event camera)生成的事件流,通过一个小型循环卷积神经网络进行监督学习训练,从而估计飞行器的姿态和角速度,实现无需惯性传感器的飞行控制。

链接: https://arxiv.org/abs/2507.11302
作者: Jesse J. Hagenaars,Stein Stroobants,Sander M. Bohte,Guido C.H.E. De Croon
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision is an essential part of attitude control for many flying animals, some of which have no dedicated sense of gravity. Flying robots, on the other hand, typically depend heavily on accelerometers and gyroscopes for attitude stabilization. In this work, we present the first vision-only approach to flight control for use in generic environments. We show that a quadrotor drone equipped with a downward-facing event camera can estimate its attitude and rotation rate from just the event stream, enabling flight control without inertial sensors. Our approach uses a small recurrent convolutional neural network trained through supervised learning. Real-world flight tests demonstrate that our combination of event camera and low-latency neural network is capable of replacing the inertial measurement unit in a traditional flight control loop. Furthermore, we investigate the network’s generalization across different environments, and the impact of memory and different fields of view. While networks with memory and access to horizon-like visual cues achieve best performance, variants with a narrower field of view achieve better relative generalization. Our work showcases vision-only flight control as a promising candidate for enabling autonomous, insect-scale flying robots.
zh

[CV-16] Detección y Cuantificación de Erosión Fluvial con Visión Artificial

【速读】:该论文试图解决河流侵蚀现象的自动检测与量化问题,传统方法依赖于摄影测量技术和地理信息系统分析,需要专业知识和大量人工处理。解决方案的关键在于采用基于人工智能的方法,利用经过微调的YOLOv11计算机视觉模型,结合照片和LiDAR图像进行训练,通过Roboflow平台对数据集进行分割和标注,从而实现侵蚀区域的自动识别与面积估算。

链接: https://arxiv.org/abs/2507.11301
作者: Paúl Maji,Marlon Túquerres,Stalin Valencia,Marcela Valenzuela,Christian Mejia-Escobar
机构: Facultad de Ingeniería en Geología, Minas, Petróleos y Ambiental (FIGEMPA); Universidad Central del Ecuador (Central University of Ecuador)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, in Spanish language, 13 figures, 4 tables

点击查看摘要

Abstract:Fluvial erosion is a natural process that can generate significant impacts on soil stability and strategic infrastructures. The detection and monitoring of this phenomenon is traditionally addressed by photogrammetric methods and analysis in geographic information systems. These tasks require specific knowledge and intensive manual processing. This study proposes an artificial intelligence-based approach for automatic identification of eroded zones and estimation of their area. The state-of-the-art computer vision model YOLOv11, adjusted by fine-tuning and trained with photographs and LiDAR images, is used. This combined dataset was segmented and labeled using the Roboflow platform. Experimental results indicate efficient detection of erosion patterns with an accuracy of 70%, precise identification of eroded areas and reliable calculation of their extent in pixels and square meters. As a final product, the EROSCAN system has been developed, an interactive web application that allows users to upload images and obtain automatic segmentations of fluvial erosion, together with the estimated area. This tool optimizes the detection and quantification of the phenomenon, facilitating decision making in risk management and territorial planning.
zh

[CV-17] 3D Magnetic Inverse Routine for Single-Segment Magnetic Field Images

【速读】:该论文旨在解决半导体封装中非破坏性检测(NDT)对精确恢复三维电流分布信息的需求,以定位电路缺陷。其解决方案的关键在于提出一种名为3D Magnetic Inverse Routine (3D MIR) 的新方法,该方法结合了基于深度学习(DL)的卷积神经网络(CNN)、基于空间物理的约束条件以及优化技术,通过处理磁场图像(MFI)数据,实现对单一段电流的三维参数(如位置、长度、电流强度及方向)的高精度恢复。

链接: https://arxiv.org/abs/2507.11293
作者: J. Senthilnath,Chen Hao,F. C. Wellstood
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:In semiconductor packaging, accurately recovering 3D information is crucial for non-destructive testing (NDT) to localize circuit defects. This paper presents a novel approach called the 3D Magnetic Inverse Routine (3D MIR), which leverages Magnetic Field Images (MFI) to retrieve the parameters for the 3D current flow of a single segment. The 3D MIR integrates a deep learning (DL)-based Convolutional Neural Network (CNN), spatial-physics-based constraints, and optimization techniques. The method operates in three stages: i) The CNN model processes the MFI data to predict ( \ell/z_o ), where \ell is the wire length and z_o is the wire’s vertical depth beneath the magnetic sensors, and to classify segment type ( c ). ii) By leveraging spatial-physics-based constraints, the routine provides initial estimates for the position ( x_o , y_o , z_o ), length ( \ell ), current ( I ), and current flow direction (positive or negative) of the current segment. iii) An optimizer then adjusts these five parameters ( x_o , y_o , z_o , \ell , I ) to minimize the difference between the reconstructed MFI and the actual MFI. The results demonstrate that the 3D MIR method accurately recovers 3D information with high precision, setting a new benchmark for magnetic image reconstruction in semiconductor packaging. This method highlights the potential of combining DL and physics-driven optimization in practical applications.
zh
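
3D MIR 第三阶段的优化需要一个正向模型:给定单段电流参数重建 MFI,再与实测 MFI 比较。直导线段在任意观测点的磁场可由 Biot-Savart 定律数值积分得到,以下为极简 NumPy 示意(离散化积分,参数均为示例值,非官方实现):

```python
import numpy as np

MU0 = 4e-7 * np.pi  # 真空磁导率 (T*m/A)

def segment_field(r_obs, p0, p1, current, n=2000):
    """对直导线段 p0->p1 离散化做 Biot-Savart 数值积分,返回 r_obs 处的 B (T)。"""
    ts = np.linspace(0.0, 1.0, n)
    pts = p0[None] + ts[:, None] * (p1 - p0)[None]   # 导线上的离散点
    dl = (p1 - p0) / n                               # 每个电流元的方向向量
    r = r_obs[None] - pts                            # 电流元指向观测点的矢量
    dist = np.linalg.norm(r, axis=1, keepdims=True)
    dB = MU0 * current / (4 * np.pi) * np.cross(dl, r) / dist**3
    return dB.sum(axis=0)

# 沿 x 轴、长 2 mm、载流 1 A 的导线段,在中点正上方 100 um 处:
B = segment_field(np.array([0.0, 0.0, 100e-6]),
                  np.array([-1e-3, 0.0, 0.0]), np.array([1e-3, 0.0, 0.0]), 1.0)
print(B)  # 幅值应接近无限长导线近似 mu0*I/(2*pi*d) = 2e-3 T
```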

[CV-18] Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers ICCV2025

【速读】:该论文试图解决任务导向的人类抓取合成问题,该问题要求同时具备任务和上下文意识。解决方案的关键在于任务感知的接触图(task-aware contact maps),与传统仅考虑操作物体及其与手的关系的接触图不同,该方法引入了场景和任务信息,从而更全面地描述手与物体的交互,以生成符合任务需求的精确抓取姿态。

链接: https://arxiv.org/abs/2507.11287
作者: An-Lun Liu,Yu-Wei Chao,Yi-Ting Chen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: this https URL
zh

[CV-19] Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping

【速读】:该论文旨在解决传统植物表型分析方法中存在的观察者偏差和不一致性问题,从而提高细粒度植物分析的准确性和可重复性。其解决方案的关键在于开发了TomatoMAP数据集,该数据集基于物联网(IoT)成像系统,采用标准化的数据采集协议,包含64,464张RGB图像以及7个感兴趣区域(ROIs)的手动标注边界框,并提供了3,616张高分辨率图像的像素级语义和实例分割标注,以支持细粒度表型分析。此外,通过结合MobileNetv3、YOLOv11和MaskRCNN的级联深度学习框架对数据集进行了验证,证明了模型在准确性与速度上可与领域专家相媲美。

链接: https://arxiv.org/abs/2507.11279
作者: Yujie Zhang,Sabine Struckmeyer,Andreas Kolb,Sven Reichardt
机构: Institute for Breeding Research on Horticultural Crops, Julius Kuehn-Institute (园艺作物育种研究所,朱利叶斯·库恩研究所); Computer Graphics Group, Center for Sensor Systems (ZESS), University of Siegen (计算机图形学组,传感器系统中心(ZESS),锡根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Observer bias and inconsistencies in traditional plant phenotyping methods limit the accuracy and reproducibility of fine-grained plant analysis. To overcome these challenges, we developed TomatoMAP, a comprehensive dataset for Solanum lycopersicum using an Internet of Things (IoT) based imaging system with standardized data acquisition protocols. Our dataset contains 64,464 RGB images that capture 12 different plant poses from four camera elevation angles. Each image includes manually annotated bounding boxes for seven regions of interest (ROIs), including leaves, panicle, batch of flowers, batch of fruits, axillary shoot, shoot and whole plant area, along with 50 fine-grained growth stage classifications based on the BBCH scale. Additionally, we provide 3,616 high-resolution image subset with pixel-wise semantic and instance segmentation annotations for fine-grained phenotyping. We validated our dataset using a cascading model deep learning framework combining MobileNetv3 for classification, YOLOv11 for object detection, and MaskRCNN for segmentation. Through AI vs. Human analysis involving five domain experts, we demonstrate that the models trained on our dataset achieve accuracy and speed comparable to the experts. Cohen’s Kappa and inter-rater agreement heatmap confirm the reliability of automated fine-grained phenotyping using our approach.
zh

[CV-20] YOLOatr: Deep Learning Based Automatic Target Detection and Localization in Thermal Infrared Imagery

【速读】:该论文旨在解决热红外(Thermal Infrared, TI)图像中自动目标检测(Automatic Target Detection, ATD)与识别(Automatic Target Recognition, ATR)在国防和监控领域中的挑战性问题。由于数据集有限、领域特定问题、TI模态特有的挑战(如硬件限制、尺度不变性问题、战术车辆故意遮挡、传感器分辨率低导致的目标结构信息缺失、天气、温度和昼夜变化的影响以及目标与背景比例的变化),使得同类目标内部变异增加,异类目标之间相似性提高,从而导致准确的实时ATR成为具有挑战性的计算机视觉(Computer Vision, CV)任务。为应对这些挑战,本文提出了一种改进的基于锚点的单阶段检测器YOLOatr,其基于改进的YOLOv5s架构,关键在于对检测头、颈部特征融合以及自定义增强策略进行了优化调整。

链接: https://arxiv.org/abs/2507.11267
作者: Aon Safdar,Usman Akram,Waseem Anwar,Basit Malik,Mian Ibad Ali
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in 25th Irish Machine Vision and Image Processing Conf., Galway, Ireland, Aug 30-Sep 1 2023 Also available at this https URL

点击查看摘要

Abstract:Automatic Target Detection (ATD) and Recognition (ATR) from Thermal Infrared (TI) imagery in the defense and surveillance domain is a challenging computer vision (CV) task in comparison to the commercial autonomous vehicle perception domain. Limited datasets, peculiar domain-specific and TI modality-specific challenges, i.e., limited hardware, scale invariance issues due to greater distances, deliberate occlusion by tactical vehicles, lower sensor resolution and resultant lack of structural information in targets, effects of weather, temperature, and time of day variations, and varying target to clutter ratios all result in increased intra-class variability and higher inter-class similarity, making accurate real-time ATR a challenging CV task. Resultantly, contemporary state-of-the-art (SOTA) deep learning architectures underperform in the ATR domain. We propose a modified anchor-based single-stage detector, called YOLOatr, based on a modified YOLOv5s, with optimal modifications to the detection heads, feature fusion in the neck, and a custom augmentation profile. We evaluate the performance of our proposed model on a comprehensive DSIAC MWIR dataset for real-time ATR over both correlated and decorrelated testing protocols. The results demonstrate that our proposed model achieves state-of-the-art ATR performance of up to 99.6%.
zh

[CV-21] ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition ICCV2025

【速读】:该论文旨在解决3D视觉定位(3D visual grounding)中多锚点查询下目标与锚点的解耦困难以及视角变化导致的空间描述不一致问题。其解决方案的关键在于提出ViewSRD框架,该框架将3D视觉定位建模为结构化的多视角分解过程,通过Simple Relation Decoupling(SRD)模块将复杂多锚点查询重构为一系列针对性单锚点表述,生成具有视角感知的结构化描述;随后利用Multi-view Textual-Scene Interaction(Multi-TSI)模块通过共享的跨模态一致视角令牌(CCVTs)融合多视角的文本与场景特征,最终通过Textual-Scene Reasoning模块整合多视角预测结果,实现统一且鲁棒的3D视觉定位。

链接: https://arxiv.org/abs/2507.11261
作者: Ronggang Huang,Haoxin Yang,Yan Cai,Xuemiao Xu,Huaidong Zhang,Shengfeng He
机构: South China University of Technology (华南理工大学); Guangdong Engineering Center for Large Model and GenAI Technology (广东省大模型与生成式人工智能技术工程中心); State Key Laboratory of Subtropical Building and Urban Science (亚热带建筑科学国家重点实验室); Ministry of Education Key Laboratory of Big Data and Intelligent Robot (教育部大数据与智能机器人重点实验室); Guangdong Provincial Key Lab of Computational Intelligence and Cyberspace Information (广东省计算智能与网络空间信息重点实验室); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.
zh

[CV-22] MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection

【速读】:该论文试图解决森林火灾烟雾图像数据稀缺的问题,以及现有修复模型在生成高质量烟雾表示时存在的合成烟雾与背景上下文不一致的问题。解决方案的关键在于提出一个综合框架,通过预训练分割模型和多模态模型获取烟雾掩码和图像特征,并引入一种基于掩码和掩码图像特征引导的网络架构;同时,设计了一种新的损失函数——掩码随机差异损失,通过随机扩展和侵蚀掩码来增强生成效果的一致性;此外,还利用烟雾特征和多模态大语言模型作为过滤工具,构建高质量的烟雾图像数据集,从而有效提升森林火灾烟雾检测模型的性能。

链接: https://arxiv.org/abs/2507.11252
作者: Guanghao Wu,Chen Xu,Hai Song,Chong Wang,Qixing Zhang
机构: State Key Laboratory of Fire Science, University of Science and Technology of China (国家火灾科学重点实验室,中国科学技术大学); Institute of Advanced Technology, University of Science and Technology of China (先进技术研究院,中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 18 pages, 11 figures

点击查看摘要

Abstract:Smoke is the first visible indicator of a fire. With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we proposed a comprehensive framework for generating forest fire smoke images. Firstly, we employed the pre-trained segmentation model and the multimodal model to obtain smoke masks and image features. Secondly, to address the insufficient utilization of masks and masked images by inpainting models, we introduced a network architecture guided by mask and masked image features. We also proposed a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask. Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporated smoke characteristics and used a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. Experiments showed that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at this https URL.
zh
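
论文提出的“掩码随机差异损失”围绕随机扩张/侵蚀后的掩码约束边缘区域的一致性,其精确定义以论文为准。下面给出一种假设性还原:用随机核的最大池化实现膨胀/腐蚀,在二者之差构成的边缘环带内对生成图与参考图取 L1 约束:

```python
import torch
import torch.nn.functional as F

def random_dilate_erode(mask, max_k=7):
    """随机奇数核的最大池化实现膨胀;对反掩码膨胀即得到腐蚀。"""
    k = 2 * torch.randint(1, max_k // 2 + 1, (1,)).item() + 1
    dilated = F.max_pool2d(mask, k, stride=1, padding=k // 2)
    eroded = 1 - F.max_pool2d(1 - mask, k, stride=1, padding=k // 2)
    return dilated, eroded

def mask_boundary_loss(gen, ref, mask):
    """在随机宽度的掩码边缘环带内约束生成图与参考图的一致性(假设性定义)。"""
    dilated, eroded = random_dilate_erode(mask)
    band = dilated - eroded                        # 掩码边缘的环形区域
    return (band * (gen - ref).abs()).sum() / band.sum().clamp(min=1.0)

mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:40, 20:40] = 1.0                      # 方形烟雾掩码示例
gen, ref = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(mask_boundary_loss(gen, ref, mask))
```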

[CV-23] Fairness-Aware Grouping for Continuous Sensitive Variables: Application for Debiasing Face Analysis with respect to Skin Tone

【速读】:该论文试图解决在敏感属性为连续变量(如皮肤颜色)时,传统基于预定义分组的公平性评估方法可能忽略或掩盖少数子群体所经历的歧视问题。解决方案的关键在于提出一种基于公平性的分组方法,通过根据观察到的歧视水平对数据进行分组,从而识别出能够最大化基于歧视组间方差的新准则的划分,进而隔离出最关键的子群体。该方法在多个合成数据集上得到验证,并展示了其在人口分布变化下的鲁棒性,同时在皮肤颜色的单调公平性场景下进行了深入分析。

链接: https://arxiv.org/abs/2507.11247
作者: Veronika Shilova,Emmanuel Malherbe,Giovanni Palma,Laurent Risser,Jean-Michel Loubes
机构: Artefact Research Center (Artefact 研究中心); Institut de Mathématiques de Toulouse (UMR 5219) (图卢兹数学研究所(UMR 5219)); CNRS (法国国家科学研究中心); Université de Toulouse (图卢兹大学); L’Oréal Research and Innovation (欧莱雅研究与创新中心); INRIA (法国国家信息与自动化研究所); ANITI (ANITI)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Within a legal framework, fairness in datasets and models is typically assessed by dividing observations into predefined groups and then computing fairness measures (e.g., Disparate Impact or Equality of Odds with respect to gender). However, when sensitive attributes such as skin color are continuous, dividing into default groups may overlook or obscure the discrimination experienced by certain minority subpopulations. To address this limitation, we propose a fairness-based grouping approach for continuous (possibly multidimensional) sensitive attributes. By grouping data according to observed levels of discrimination, our method identifies the partition that maximizes a novel criterion based on inter-group variance in discrimination, thereby isolating the most critical subgroups. We validate the proposed approach using multiple synthetic datasets and demonstrate its robustness under changing population distributions - revealing how discrimination is manifested within the space of sensitive attributes. Furthermore, we examine a specialized setting of monotonic fairness for the case of skin color. Our empirical results on both CelebA and FFHQ, leveraging the skin tone as predicted by an industrial proprietary algorithm, show that the proposed segmentation uncovers more nuanced patterns of discrimination than previously reported, and that these findings remain stable across datasets for a given model. Finally, we leverage our grouping model for debiasing purposes, aiming at predicting fair scores with group-by-group post-processing. The results demonstrate that our approach improves fairness while having minimal impact on accuracy, thus confirming our partition method and opening the door for industrial deployment.
zh
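
论文分组准则的直觉是:好的划分应使“组间歧视水平的方差”尽可能大,从而把受歧视最重的子群体单独隔离出来。下面在一维连续敏感属性上用单切分点搜索演示该准则(玩具数据;相对论文的多维情形做了大幅简化):

```python
import numpy as np

def intergroup_variance(disc, groups):
    """组间方差:各组平均歧视水平对总体均值的加权离差平方和。"""
    overall = disc.mean()
    return sum((groups == g).mean() * (disc[groups == g].mean() - overall) ** 2
               for g in np.unique(groups))

rng = np.random.default_rng(0)
tone = rng.uniform(0, 1, 5000)                         # 连续敏感属性(如肤色深浅)
disc = 0.3 * (tone > 0.8) + rng.normal(0, 0.05, 5000)  # 某深色子群体受到更高歧视

cuts = np.linspace(0.1, 0.9, 17)                       # 简化:只搜索单个切分点的二分
best = max(cuts, key=lambda c: intergroup_variance(disc, tone > c))
print(best)  # 期望接近 0.8,即准则自动隔离出受歧视最重的子群体
```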

[CV-24] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

【速读】:该论文试图解决当前长视频生成模型在叙事内容表达能力评估方面缺乏专门的评价基准的问题,现有评估主要依赖于简单叙事提示的基准(如VBench),无法全面反映模型在长视频中表达丰富叙事内容的能力。其解决方案的关键在于提出首个综合性评估基准NarrLV,通过引入时间叙事原子(Temporal Narrative Atom, TNA)作为基本叙事单元,结合电影叙事理论中的三个关键元素构建自动提示生成管道,以量化衡量叙事丰富性,并基于叙事内容表达的三个渐进层次设计基于多模态大语言模型(MLLM)的问题生成与回答框架,从而实现对模型叙事能力的有效评估。

链接: https://arxiv.org/abs/2507.11245
作者: X. Feng,H. Yu,M. Wu,S. Hu,J. Chen,C. Zhu,J. Wu,X. Chu,K. Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.
zh

[CV-25] A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition

【速读】:该论文试图解决多模态情感识别(Multimodal Emotion Recognition, MER)在实际应用中因传感器故障或隐私保护需求导致的多模态数据不完整问题。现有方法通过平衡不同模态组合的训练来应对这一问题,但存在训练梯度冲突导致最终预测模型性能下降的局限性。论文提出的解决方案是基于模态组合的单模态解耦动态低秩适应方法(MCULoRA),其关键在于两个核心模块:模态组合感知的低秩适应(MCLA)和动态参数微调(DPFT)。MCLA模块有效解耦了不同模态组合的共享信息与独特特征,DPFT模块则根据各模态表示空间的可分性调整模态组合的训练比例,从而优化不同模态组合下的学习效率。

链接: https://arxiv.org/abs/2507.11202
作者: Xinkui Zhao,Jinsong Shu,Yangyang Wu,Guanjie Cheng,Zihe Liu,Naibo Wang,Shuiguang Deng,Zhongle Xie,Jianwei Yin
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality’s representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
zh
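
下面给出一个通用 LoRA 线性层的极简 PyTorch 草图,用于说明“冻结预训练权重、仅训练低秩增量”的参数高效微调思路;若沿 MCULoRA 的思路,可为每种模态组合各维护一组这样的低秩分支。类名、秩 r 与维度均为演示假设,并非论文官方实现:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在冻结的线性层上叠加可训练的低秩增量 B @ A(r 远小于输入/输出维度)。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 初始为零,训练前不改变输出
        self.scaling = alpha / r

    def forward(self, x):
        # y = base(x) + scaling * (x A^T) B^T
        return self.base(x) + (x @ self.A.t()) @ self.B.t() * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(4, 768))                       # 输出形状 [4, 768]
```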

[CV-26] How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study ALT

【速读】:该论文试图解决当前视觉-语言模型(Vision-Language Models, VLMs)在医疗任务中的能力评估与适用性问题,特别是其在医学图像理解与推理方面的表现尚未得到充分探索。解决方案的关键在于对开源通用型和医学专用型VLMs进行多基准的全面评估,涵盖从3B到72B参数规模的模型,并通过分解模型性能为理解与推理两个组件,揭示其在不同任务中的表现差异及存在的关键障碍。

链接: https://arxiv.org/abs/2507.11200
作者: Che Liu,Jiazhen Pan,Weixiang Shen,Wenjia Bai,Daniel Rueckert,Rossella Arcucci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the International Conference on AI in Healthcare 2025

点击查看摘要

Abstract:Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.
zh

[CV-27] Clustering-Guided Multi-Layer Contrastive Representation Learning for Citrus Disease Classification

【速读】:该论文旨在解决柑橘作物在疾病检测与分类中的准确性不足问题,特别是在缺乏大量高质量标注数据的情况下。其解决方案的关键在于提出一种基于聚类引导的自监督多层对比表示学习算法(Clustering-Guided Self-Supervised Multi-Layer Contrastive Representation Learning, CMCRL),通过引入与聚类中心对比和多层对比训练(Multi-Layer Contrastive Training, MCT)机制,实现了在大规模未标注样本上的优化、对不同柑橘病害症状相似性的有效适应以及分层特征表示学习。

链接: https://arxiv.org/abs/2507.11171
作者: Jun Chen,Yonghua Yu,Weifu Li,Yaohui Chen,Hong Chen
机构: Huazhong Agricultural University (华中农业大学); College of Informatics (信息学院); Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education (教育部智能农业技术工程研究中心); College of Engineering (工程学院); National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops (园艺作物种质创新与利用国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Citrus, as one of the most economically important fruit crops globally, suffers severe yield depressions due to various diseases. Accurate disease detection and classification serve as critical prerequisites for implementing targeted control measures. Recent advancements in artificial intelligence, particularly deep learning-based computer vision algorithms, have substantially decreased time and labor requirements while maintaining the accuracy of detection and classification. Nevertheless, these methods predominantly rely on massive, high-quality annotated training examples to attain promising performance. By introducing two key designs: contrasting with cluster centroids and a multi-layer contrastive training (MCT) paradigm, this paper proposes a novel clustering-guided self-supervised multi-layer contrastive representation learning (CMCRL) algorithm. The proposed method demonstrates several advantages over existing counterparts: (1) optimizing with massive unannotated samples; (2) effective adaptation to the symptom similarity across distinct citrus diseases; (3) hierarchical feature representation learning. The proposed method achieves state-of-the-art performance on the public citrus image set CDD, outperforming existing methods by 4.5%-30.1% accuracy. Remarkably, our method narrows the performance gap with fully supervised counterparts (all samples are labeled). Beyond classification accuracy, our method shows great performance on other evaluation metrics (F1 score, precision, and recall), highlighting the robustness against the class imbalance challenge.
zh
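
作为参考,下面给出“与聚类中心对比”这一思路的最小化 PyTorch 草图:以样本所属簇的中心为正例、其余簇中心为负例,构成 InfoNCE 形式的损失;函数名与温度参数均为演示假设,并非论文原始实现:

```python
import torch
import torch.nn.functional as F

def centroid_contrastive_loss(feats, centroids, assignments, tau=0.1):
    """feats: [N, D] 样本特征;centroids: [K, D] 聚类中心;
    assignments: [N] 每个样本所属簇的索引。"""
    feats = F.normalize(feats, dim=1)
    centroids = F.normalize(centroids, dim=1)
    logits = feats @ centroids.t() / tau     # [N, K]:与所有簇中心的相似度
    return F.cross_entropy(logits, assignments)

loss = centroid_contrastive_loss(
    torch.randn(32, 128), torch.randn(10, 128), torch.randint(0, 10, (32,)))
```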

[CV-28] Assessing Color Vision Test in Large Vision-language Models

【速读】:该论文试图解决大型视觉语言模型在颜色视觉能力方面尚未被充分探索的问题,旨在提升这些模型在颜色相关任务中的表现。解决方案的关键在于定义了一个针对大型视觉语言模型的颜色视觉测试任务,并构建了一个涵盖多种测试问题类别和不同难度等级的数据集,同时分析了模型在颜色视觉任务中出现的错误类型,并提出了微调策略以增强其性能。

链接: https://arxiv.org/abs/2507.11153
作者: Hongfei Ye,Bin Chen,Wenxi Liu,Yu Zhang,Zhao Li,Dandan Ni,Hongyang Chen
机构: University of Chinese Academy of Sciences(中国科学院大学); Zhejiang Lab(浙江省实验室); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large visual-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (an anonymous GitHub repository showing some of the data is available at this https URL) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.
zh

[CV-29] Latent Space Consistency for Sparse-View CT Reconstruction

【速读】:该论文旨在解决CT重建中因稀疏视角X射线图像导致的重建质量下降问题,以及传统生成式模型在跨模态(2D X-ray与3D CT)潜在空间对齐上的不足。其解决方案的关键在于提出了一种一致性潜在空间扩散模型(Consistent Latent Space Diffusion Model, CLS-DM),通过引入跨模态特征对比学习,有效从2D X射线图像中提取3D潜在信息,并实现不同模态间的潜在空间对齐。

链接: https://arxiv.org/abs/2507.11152
作者: Duoyou Chen,Yunqing Chen,Can Zhang,Zhou Wang,Cheng Chen,Ruoxiu Xiao
机构: University of Science and Technology Beijing(北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ACMMM2025 Accepted

点击查看摘要

Abstract:Computed Tomography (CT) is a widely utilized imaging modality in clinical settings. Using densely acquired rotational X-ray arrays, CT can capture 3D spatial features. However, it is confronted with challenges such as significant time consumption and high radiation exposure. CT reconstruction methods based on sparse-view X-ray images have garnered substantial attention from researchers as they present a means to mitigate costs and risks. In recent years, diffusion models, particularly the Latent Diffusion Model (LDM), have demonstrated promising potential in the domain of 3D CT reconstruction. Nonetheless, due to the substantial differences between the 2D latent representation of X-ray modalities and the 3D latent representation of CT modalities, the vanilla LDM is incapable of achieving effective alignment within the latent space. To address this issue, we propose the Consistent Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature contrastive learning to efficiently extract latent 3D information from 2D X-ray images and achieve latent space alignment between modalities. Experimental results indicate that CLS-DM outperforms classical and state-of-the-art generative models in terms of standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. This methodology not only aids in enhancing the effectiveness and economic viability of sparse X-ray reconstructed CT but can also be generalized to other cross-modal transformation tasks, such as text-to-image synthesis. We have made our code publicly available at this https URL to facilitate further research and applications in other domains.
zh

[CV-30] RMAU-NET: A Residual-Multihead-Attention U-Net Architecture for Landslide Segmentation and Detection from Remote Sensing Images

【速读】:该论文试图解决由于极端天气事件或人类活动导致的滑坡灾害频繁发生,而自动观测滑坡在大范围和复杂地形条件下具有挑战性的问题。解决方案的关键在于提出一种端到端的深度学习模型,该模型通过遥感图像实现滑坡事件的自动观测。该模型采用一种新的神经网络架构,用于同时完成滑坡检测和滑坡分割任务,并在三个不同的基准数据集LandSlide4Sense、Bijie和Nepal上进行了评估,取得了较高的F1分数和mIoU分数,验证了其在实际滑坡监测系统中的潜在应用价值。

链接: https://arxiv.org/abs/2507.11143
作者: Lam Pham,Cam Le,Hieu Tang,Khang Truong,Truong Nguyen,Jasmin Lampert,Alexander Schindler,Martin Boyer,Son Phan
机构: Van Lang University (范朗大学); Austrian Institute of Technology (奥地利技术研究所); Troyes University (特鲁瓦大学); HCM University of Technology (胡志明市科技大学); Ton Duc Thang University (孙德胜大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, landslide disasters have been reported frequently due to extreme weather events such as droughts, floods, and storms, or as a consequence of human activities such as deforestation and excessive exploitation of natural resources. However, automatically observing landslides is challenging due to the extremely large observation area and rugged topography such as mountains or highlands. This motivates us to propose an end-to-end deep-learning-based model which explores remote sensing images for automatically observing landslide events. By considering remote sensing images as the input data, we can obtain a free resource and observe large, rough terrains over time. To explore the remote sensing images, we propose a novel neural network architecture for the two tasks of landslide detection and landslide segmentation. We evaluated our proposed model on three different benchmark datasets: LandSlide4Sense, Bijie, and Nepal. By conducting extensive experiments, we achieve F1 scores of 98.23 and 93.83 for the landslide detection task on the LandSlide4Sense and Bijie datasets, and mIoU scores of 63.74 and 76.88 on the segmentation tasks for the LandSlide4Sense and Nepal datasets. These experimental results prove the potential of integrating our proposed model into real-life landslide observation systems.
zh

[CV-31] MMOne: Representing Multiple Modalities in One Scene ICCV2025

【速读】:该论文旨在解决多模态场景表示中的模态冲突问题,具体包括属性差异和粒度差异。其解决方案的关键在于提出一种通用框架MMOne,通过引入新颖的模态指示器来捕捉每种模态的独特属性,并设计多模态分解机制,将多模态高斯分布分解为单模态高斯分布,从而将多模态信息解耦为共享和模态特定组件,实现更紧凑高效的多模态场景表示。

链接: https://arxiv.org/abs/2507.11129
作者: Zhifeng Gu,Bing Wang
机构: Spatial Intelligence Group, The Hong Kong Polytechnic University (香港理工大学智能空间研究组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at this https URL.
zh

[CV-32] ry Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID

【速读】:该论文旨在解决服装变化人员重识别(CC-ReID)任务中硬样本(hard samples)带来的挑战,这些样本由于内在的模糊性或相似性且缺乏明确定义,成为影响模型鲁棒性和学习策略设计的根本瓶颈。解决方案的关键在于提出一种多模态引导的硬样本生成与学习框架(HSGL),其核心是通过统一文本和视觉模态,显式定义、生成并优化硬样本。该框架包含两个关键组件:双粒度硬样本生成(DGHSG)利用多模态线索合成语义一致的样本以提升训练数据的难度和多样性,以及硬样本自适应学习(HSAL)引入基于文本语义标签的硬度感知优化策略,调整特征距离以增强模型的判别能力和对硬样本的鲁棒性。

链接: https://arxiv.org/abs/2507.11119
作者: Hankun Liu,Yujian Zhao,Guanglin Niu
机构: Beihang University (北京航空航天大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model’s robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model’s discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at this https URL.
zh

[CV-33] Jellyfish Species Identification: A CNN Based Artificial Neural Network Approach

【速读】:该论文试图解决海洋生物多样性保护中 jellyfish(水母)物种准确识别的问题,这一问题对于生态监测和管理至关重要。解决方案的关键在于提出一种基于深度学习的框架,该框架结合了先进的特征提取技术(如MobileNetV3、ResNet50、EfficientNetV2-B0和VGG16)与多种传统机器学习分类器及前馈神经网络分类器,并通过激活softmax函数实现直接物种分类。其中,人工神经网络与MobileNetV3的组合模型表现最佳,达到了98%的高准确率,显著优于其他特征提取器-分类器组合。

链接: https://arxiv.org/abs/2507.11116
作者: Md. Sabbir Hossen,Md. Saiduzzaman,Pabon Shaha,Mostofa Kamal Nasir
机构: Bangladesh University(孟加拉国大学); Mawlana Bhashani Science & Technology University(马尔纳·巴沙尼科学与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at the IEEE QPAIN 2025. The final version will be available in the IEEE Xplore Digital Library

点击查看摘要

Abstract:Jellyfish, a diverse group of gelatinous marine organisms, play a crucial role in maintaining marine ecosystems but pose significant challenges for biodiversity and conservation due to their rapid proliferation and ecological impact. Accurate identification of jellyfish species is essential for ecological monitoring and management. In this study, we proposed a deep learning framework for jellyfish species detection and classification using an underwater image dataset. The framework integrates advanced feature extraction techniques, including MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16, combined with seven traditional machine learning classifiers and three Feedforward Neural Network classifiers for precise species identification. Additionally, we activated the softmax function to directly classify jellyfish species using the convolutional neural network models. The combination of the Artificial Neural Network with MobileNetV3 is our best-performing model, achieving an exceptional accuracy of 98%, significantly outperforming other feature extractor-classifier combinations. This study demonstrates the efficacy of deep learning and hybrid frameworks in addressing biodiversity challenges and advancing species detection in marine environments.
zh
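
下面是“预训练特征提取器 + 前馈神经网络分类头”这一组合(对应文中表现最佳的 MobileNetV3 + ANN)的一个可运行 PyTorch/torchvision 草图;类别数、隐藏层宽度等超参数为演示假设:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                     # 假设的水母物种数
backbone = models.mobilenet_v3_large(
    weights=models.MobileNet_V3_Large_Weights.DEFAULT)
extractor = nn.Sequential(backbone.features, backbone.avgpool, nn.Flatten())
for p in extractor.parameters():
    p.requires_grad = False                          # 仅作为冻结的特征提取器

head = nn.Sequential(                                # 简单的前馈 ANN 分类头
    nn.Linear(960, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, num_classes))

with torch.no_grad():
    feats = extractor(torch.randn(2, 3, 224, 224))   # [2, 960]
logits = head(feats)                                 # [2, num_classes]
```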

[CV-34] KptLLM : Towards Generic Keypoint Comprehension with Large Language Model

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在捕捉细粒度语义信息方面的不足,尤其是对象关键点(keypoints)的精确定位与分析问题。其解决方案的关键在于提出KptLLM++,一种专门针对通用关键点理解的多模态大语言模型,通过整合多种输入模态并在用户定义的指令指导下进行操作,实现了跨不同场景的关键点检测统一。该模型采用“识别-检测”范式,首先解析关键点语义,再通过结构化思维链推理机制精确定位关键点位置,从而提升模型在细粒度图像理解任务中的性能与泛化能力。

链接: https://arxiv.org/abs/2507.11102
作者: Jie Yang,Wang Zeng,Sheng Jin,Lumin Xu,Wentao Liu,Chen Qian,Zhen Li,Ruimao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended Version of KptLLM. arXiv admin note: text overlap with arXiv:2411.01846

点击查看摘要

Abstract:The emergence of Multimodal Large Language Models (MLLMs) has revolutionized image understanding by bridging textual and visual modalities. However, these models often struggle with capturing fine-grained semantic information, such as the precise identification and analysis of object keypoints. Keypoints, as structure-aware, pixel-level, and compact representations of objects, particularly articulated ones, play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition. In this paper, we propose KptLLM++, a novel multimodal large language model that specifically designed for generic keypoint comprehension through the integration of diverse input modalities guided by user-defined instructions. By unifying keypoint detection across varied contexts, KptLLM++ establishes itself as an advanced interface, fostering more effective human-AI collaboration. The model is built upon a novel identify-then-detect paradigm, which first interprets keypoint semantics and subsequently localizes their precise positions through a structured chain-of-thought reasoning mechanism. To push the boundaries of performance, we have scaled up the training dataset to over 500K samples, encompassing diverse objects, keypoint categories, image styles, and scenarios with complex occlusions. This extensive scaling enables KptLLM++ to unlock its potential, achieving remarkable accuracy and generalization. Comprehensive experiments on multiple keypoint detection benchmarks demonstrate its state-of-the-art performance, underscoring its potential as a unified solution for fine-grained image understanding and its transformative implications for human-AI interaction.
zh

[CV-35] A Survey on Interpretability in Visual Recognition

【速读】:该论文试图解决视觉识别模型的可解释性问题,旨在通过系统回顾现有研究并提出一种以人为本的分类体系,以促进对模型决策过程的理解和故障诊断。其解决方案的关键在于提出一个基于意图(Intent)、对象(Object)、呈现(Presentation)和方法论(Methodology)的分类框架,从而为可解释人工智能(XAI)方法建立系统且连贯的分组标准。

链接: https://arxiv.org/abs/2507.11099
作者: Qiyang Wan,Chengzhi Gao,Ruiping Wang,Xilin Chen
机构: Chinese Academy of Sciences (中国科学院); Institute of Computing Technology, CAS (计算技术研究所,中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 7 figures, 2 tables. Under review

点击查看摘要

Abstract:In recent years, visual recognition methods have advanced significantly, finding applications across diverse fields. While researchers seek to understand the mechanisms behind the success of these models, there is also a growing impetus to deploy them in critical areas like autonomous driving and medical diagnostics to better diagnose failures, which promotes the development of interpretability research. This paper systematically reviews existing research on the interpretability of visual recognition models and proposes a taxonomy of methods from a human-centered perspective. The proposed taxonomy categorizes interpretable recognition methods based on Intent, Object, Presentation, and Methodology, thereby establishing a systematic and coherent set of grouping criteria for these XAI methods. Additionally, we summarize the requirements for evaluation metrics and explore new opportunities enabled by recent technologies, such as large multimodal models. We aim to organize existing research in this domain and inspire future investigations into the interpretability of visual recognition models.
zh

[CV-36] Atmos-Bench: 3D Atmospheric Structures for Climate Insight

【速读】:该论文旨在解决卫星LiDAR数据在大气结构反演中的不确定性问题,以及现有方法依赖辅助输入和简化物理模型导致的辐射传输和散射吸收效应表征不足的问题。其解决方案的关键在于提出Atmos-Bench:首个3D大气基准数据集,并结合FourCastX网络,通过耦合WRF与增强型COSP模拟器生成高精度的体素级参考数据,同时将ATB-BC物理约束嵌入模型架构以提升能量一致性,从而在无需辅助输入的情况下实现对355 nm和532 nm波段的显著性能提升。

链接: https://arxiv.org/abs/2507.11085
作者: Tianchi Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Atmospheric structure, represented by backscatter coefficients (BC) recovered from satellite LiDAR attenuated backscatter (ATB), provides a volumetric view of clouds, aerosols, and molecules, playing a critical role in human activities, climate understanding, and extreme weather forecasting. Existing methods often rely on auxiliary inputs and simplified physics-based approximations, and lack a standardized 3D benchmark for fair evaluation. However, such approaches may introduce additional uncertainties and insufficiently capture realistic radiative transfer and atmospheric scattering-absorption effects. To bridge these gaps, we present Atmos-Bench: the first 3D atmospheric benchmark, along with a novel FourCastX: Frequency-enhanced Spatio-Temporal Mixture-of-Experts Network that (a) generates 921,600 image slices from 3D scattering volumes simulated at 532 nm and 355 nm by coupling WRF with an enhanced COSP simulator over 384 land-ocean time steps, yielding high-quality voxel-wise references; (b) embeds ATB-BC physical constraints into the model architecture, promoting energy consistency during restoration; (c) achieves consistent improvements on the Atmos-Bench dataset across both 355 nm and 532 nm bands, outperforming state-of-the-art baseline models without relying on auxiliary inputs. Atmos-Bench establishes a new standard for satellite-based 3D atmospheric structure recovery and paves the way for deeper climate insight.
zh
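
作为背景补充:在单次散射近似下,激光雷达的衰减后向散射(ATB)与后向散射系数(BC)通常满足如下标准关系(这是激光雷达方程的常见形式,仅供理解 ATB-BC 约束的物理含义,并非论文中的原始表述):

```latex
% z 为高度,\beta 为后向散射系数 (BC),\alpha 为消光系数,因子 2 对应双程传输
\beta_{\mathrm{att}}(z) = \beta(z)\,\exp\!\Big(-2\int_{0}^{z}\alpha(z')\,\mathrm{d}z'\Big)
```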

[CV-37] Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification

【速读】:该论文旨在解决道路地下病害(Road Subsurface Distress, RSD)检测中依赖人工且效率低下的问题。现有深度学习方法在RSD识别中的性能受限于高质量训练数据集的稀缺性和网络区分RSD能力的不足。论文的关键解决方案是构建了一个经过严格验证的包含2134个样本的三维GPR数据集,并提出了一种新颖的交叉验证策略,该策略在RSD识别中表现出超过98.6%的召回率,显著提升了检测准确性并减少了人工检测的工作量。

链接: https://arxiv.org/abs/2507.11081
作者: Chang Peng,Bao Yang,Meiqi Li,Ge Zhang,Hui Sun,Zhenyu Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ground penetrating radar (GPR) has become a rapid and non-destructive solution for road subsurface distress (RSD) detection. However, RSD recognition from GPR images is labor-intensive and heavily relies on inspectors’ expertise. Deep learning offers the possibility for automatic RSD recognition, but its current performance is limited by two factors: scarcity of high-quality datasets for network training and the networks’ insufficient capability to distinguish RSD. In this study, a rigorously validated 3D GPR dataset containing 2134 samples of diverse types was constructed through field scanning. Based on the finding that the YOLO model trained with one of the three scans of GPR images exhibits varying sensitivity to specific types of RSD, we proposed a novel cross-verification strategy with outstanding accuracy in RSD recognition, achieving recall over 98.6% in field tests. The approach, integrated into an online RSD detection system, can reduce the labor of inspection by around 90%.
zh
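
下面用一个简化的投票函数示意“交叉验证”策略的一种可能实现:三个扫描方向各自的检测结果相互印证,仅保留获得足够票数的检测框。IoU 阈值与票数均为演示假设,且未做跨方向去重,仅供说明思路:

```python
def iou(a, b):
    """a, b: [x1, y1, x2, y2] 形式的检测框。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cross_verify(dets_per_scan, iou_thr=0.5, min_votes=2):
    """dets_per_scan: 三个方向模型各自的检测框列表;
    仅保留在至少 min_votes 个方向上相互印证的检测。"""
    kept = []
    for i, dets in enumerate(dets_per_scan):
        others = [d for j, ds in enumerate(dets_per_scan) if j != i for d in ds]
        for box in dets:
            votes = 1 + sum(iou(box, o) >= iou_thr for o in others)
            if votes >= min_votes:
                kept.append(box)
    return kept

dets = cross_verify([[[0, 0, 10, 10]], [[1, 1, 10, 10]], []])  # 两个方向相互印证
```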

[CV-38] GKNet: Graph-based Keypoints Network for Monocular Pose Estimation of Non-cooperative Spacecraft

【速读】:该论文旨在解决非合作航天器单目位姿估计中的关键点检测问题,该问题在轨服务(OOS)任务中具有重要意义。现有关键点检测器在面对非合作航天器的结构对称性和部分遮挡时表现脆弱。论文提出的解决方案是构建一种基于图的关键点网络(GKNet),其关键在于利用关键点图的几何约束来提升检测性能。此外,为更好地验证关键点检测器,作者还提出了一个中等规模的数据集SKD,包含3个航天器目标、90,000张模拟图像及高精度关键点标注。

链接: https://arxiv.org/abs/2507.11077
作者: Weizhao Ma,Dong Zhou,Yuhui Hu,Zipeng He
机构: Harbin Institute of Technology(哈尔滨工业大学); China Academy of Space Technology(中国空间技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular pose estimation of non-cooperative spacecraft is significant for on-orbit service (OOS) tasks, such as satellite maintenance, space debris removal, and station assembly. Considering the high demands on pose estimation accuracy, mainstream monocular pose estimation methods typically consist of keypoint detectors and PnP solver. However, current keypoint detectors remain vulnerable to structural symmetry and partial occlusion of non-cooperative spacecraft. To this end, we propose a graph-based keypoints network for the monocular pose estimation of non-cooperative spacecraft, GKNet, which leverages the geometric constraint of keypoints graph. In order to better validate keypoint detectors, we present a moderate-scale dataset for the spacecraft keypoint detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated images, and corresponding high-precise keypoint annotations. Extensive experiments and an ablation study have demonstrated the high accuracy and effectiveness of our GKNet, compared to the state-of-the-art spacecraft keypoint detectors. The code for GKNet and the SKD dataset is available at this https URL.
zh
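
关键点检测之后,主流单目位姿估计通常用 PnP 求解位姿。下面是基于 OpenCV solvePnP 的最小示例,相机内参与关键点坐标均为随机演示值,实际应替换为 GKNet 等检测器的输出与标定参数:

```python
import cv2
import numpy as np

object_pts = np.random.rand(8, 3).astype(np.float32)        # 模型坐标系下的 3D 关键点
image_pts = (np.random.rand(8, 2) * 640).astype(np.float32) # 检测到的 2D 像素坐标
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])                                 # 假设的相机内参
dist = np.zeros(5)                                           # 假设无镜头畸变

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)     # 旋转向量转旋转矩阵,(R, tvec) 即目标位姿
```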

[CV-39] Joint angle model based learning to refine kinematic human pose estimation

【速读】:该论文旨在解决无标记(marker-free)人体姿态估计(Human Pose Estimation, HPE)中关键点识别的偶尔错误和关键点轨迹的随机波动问题。现有基于深度学习的HPE优化模型性能受限于人工标注不准确的训练数据集。该研究提出的解决方案的关键在于通过关节角度建模来克服这些挑战,具体包括:(i) 构建一种鲁棒的基于关节角度的人体姿态模型;(ii) 通过高阶傅里叶级数近似关节角度的时间变化以获得可靠的“真实值”;(iii) 设计一个双向循环网络作为后处理模块,用于优化已建立的HRNet的姿态估计。该方法在高质量数据集上训练后,能够有效纠正错误识别的关节并平滑其时空轨迹。实验表明,基于关节角度的优化(Joint Angle-based Refinement, JAR)在如花样滑冰和街舞等复杂场景中优于当前最先进的HPE优化网络。

链接: https://arxiv.org/abs/2507.11075
作者: Chang Peng,Yifei Zhou,Huifeng Xi,Shiqing Huang,Chuangye Chen,Jianming Yang,Bao Yang,Zhenyu Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposed a novel method to overcome the difficulty through joint angle-based modeling. The key techniques include: (i) A joint angle-based model of human pose, which is robust to describe kinematic human poses; (ii) Approximating temporal variation of joint angles through high order Fourier series to get reliable “ground truth”; (iii) A bidirectional recurrent network is designed as a post-processing module to refine the estimation of well-established HRNet. Trained with the high-quality dataset constructed using our method, the network demonstrates outstanding performance to correct wrongly recognized joints and smooth their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases like figure skating and breaking.
zh
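
下面给出“用截断傅里叶级数最小二乘拟合关节角轨迹”的一个 NumPy 草图,前提假设是步态这类运动近似周期;级数阶数为演示取值:

```python
import numpy as np

def fit_fourier(t, theta, order=5):
    """t: 归一化到 [0, 1) 的时间;theta: 含噪关节角序列;
    返回由截断傅里叶级数重建的平滑轨迹,可作为可靠的“真值”。"""
    w = 2 * np.pi
    cols = [np.ones_like(t)]
    for k in range(1, order + 1):
        cols += [np.cos(k * w * t), np.sin(k * w * t)]
    X = np.stack(cols, axis=1)                       # 设计矩阵 [T, 2*order+1]
    coef, *_ = np.linalg.lstsq(X, theta, rcond=None)
    return X @ coef

t = np.linspace(0, 1, 200, endpoint=False)
noisy = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(200)   # 模拟含噪关节角
smooth = fit_fourier(t, noisy)
```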

[CV-40] LogTinyLLM : Tiny Large Language Models Based Contextual Log Anomaly Detection

【速读】:该论文试图解决在大规模日志数据集中检测日志序列中的上下文异常问题,这一问题由于日志序列的体积庞大和结构复杂而难以通过传统基于规则或深度学习的方法有效处理。论文的关键解决方案是采用参数高效的微调方法,特别是低秩适应(LoRA)和适配器(adapter)方法,以提升对日志序列进行异常检测的效果。实验结果表明,LoRA微调方法在LogBert基础上实现了18至19个百分点的性能提升,准确率达到了97.76%至98.83%。

链接: https://arxiv.org/abs/2507.11071
作者: Isaiah Thompson Ocansey,Ritwik Bhattacharya,Tanmay Sen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Log anomaly detection using traditional rule-based or deep learning-based methods is often challenging due to the large volume and highly complex nature of log sequences, so an effective way of detecting anomalous log sequences is crucial for system maintenance and development. This paper proposes parameter-efficient fine-tuning, specifically low-rank adaptation (LoRA) and adapter-based approaches, for finding contextual anomalies in log sequences within large log datasets. It compares different tiny large language models (LLMs) on the Thunderbird dataset. The results show that LoRA-based fine-tuning provides substantial performance improvements of 18 to 19 percentage points over the LogBert-based full fine-tuning approach, achieving accuracy scores between 97.76% and 98.83% compared to 79.37%.
zh
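
下面用 Hugging Face PEFT 库示意 LoRA 参数高效微调的典型配置方式;模型名为占位符(可替换为任意小型 LLM),target_modules 针对 BERT 类注意力层命名,并非论文的确切设置:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # 占位模型:正常 / 异常 二分类

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query", "value"],      # 仅在注意力 Q/V 投影上注入低秩增量
    task_type="SEQ_CLS")
model = get_peft_model(model, config)
model.print_trainable_parameters()          # 可训练参数占比通常远小于 1%
```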

[CV-41] RAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

【速读】:该论文旨在解决从RGB图像中重建透明物体三维几何结构的问题,这一任务因透明物体的物理特性(如反射和折射)而极具挑战性。其解决方案的关键在于引入TRAN-D,一种基于2D高斯泼溅(Gaussian Splatting)的深度重建方法,通过将透明物体与背景分离,实现对物体对应高斯分布的聚焦优化,并利用对象感知损失减少伪影,确保不可见表面的覆盖同时降低过拟合。此外,还结合了物理基础模拟,以快速精修重建结果,有效处理物体移除和剩余物体的连锁运动问题。

链接: https://arxiv.org/abs/2507.11069
作者: Jeongyun Kim,Seunghoon Jeong,Giseop Kim,Myung-Hwan Jeon,Eunji Jun,Ayoung Kim
机构: Seoul National University (首尔国立大学); DGIST (DGIST大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Hyundai Motor Group (现代汽车集团)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object-aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain-reaction movement of remaining objects without the need for rescanning. TRAN-D is evaluated on both synthetic and real-world sequences, and it consistently demonstrated robust improvements over existing GS-based state-of-the-art methods. In comparison with baselines, TRAN-D reduces the mean absolute error by over 39% for the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, TRAN-D reaches a δ < 2.5 cm accuracy of 48.46%, over 1.5 times that of baselines, which use six images. Code and more results are available at this https URL.
zh

[CV-42] Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

【速读】:该论文旨在解决在Gaussian Splatting中实现精确局部3D编辑的挑战,主要受限于多视角2D部分分割不一致以及Score Distillation Sampling (SDS)损失的本质模糊性。其解决方案的关键在于提出RoMaP框架,该框架包含两个核心组件:一是基于3D-Geometry Aware Label Prediction (3D-GALP)的鲁棒3D掩码生成模块,利用球面谐波(SH)系数建模视图依赖的标签变化和软标签属性,从而获得跨视角的一致且准确的部分分割;二是引入正则化的SDS损失,结合标准SDS损失与额外正则项,如通过Scheduled Latent Mixing and Part (SLaMP)方法引入的L1锚点损失,以确保仅在目标区域进行修改并保持上下文一致性,同时通过高斯先验去除等正则项提升灵活性并防止意外编辑。

链接: https://arxiv.org/abs/2507.11061
作者: Hayeon Kim,Ji Ha Jang,Se Young Chun
机构: Seoul National University (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and inherently ambiguous nature of Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label property, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, making it possible for more robust and flexible part-level 3D Gaussian editing.
zh

[CV-43] Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation ICCV2025

【速读】:该论文试图解决医疗语言引导分割中对文本输入的依赖性(textual reliance)问题,这一问题导致了医学分割数据集中大量无文本标注的图像数据无法被有效利用,并且限制了模型在临床场景中的应用范围。解决方案的关键在于提出ProLearn框架,其核心是一个新颖的原型驱动语义近似(Prototype-driven Semantic Approximation, PSA)模块,该模块通过从文本报告中提炼与分割相关的语义信息,构建一个离散且紧凑的原型空间,并支持查询-响应机制,从而在没有文本输入的情况下也能提供语义指导,从根本上缓解了对文本的依赖。

链接: https://arxiv.org/abs/2507.11055
作者: Shuchang Ye,Usman Naseem,Mingyuan Meng,Jinman Kim
机构: The University of Sydney (悉尼大学); Macquarie University (麦考瑞大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Medical language-guided segmentation, integrating textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as “textual reliance”, presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, in ProLearn, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
zh
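
下面给出“查询-响应”机制的一个极简 PyTorch 草图:图像特征向原型库做相似度查询,再以 softmax 权重聚合原型,近似得到无文本输入时的语义引导;接口与温度参数均为演示假设,并非论文官方实现:

```python
import torch
import torch.nn.functional as F

def prototype_respond(img_feats, prototypes, tau=0.07):
    """img_feats: [N, D] 图像特征;prototypes: [K, D] 由文本报告蒸馏出的原型。"""
    q = F.normalize(img_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    attn = F.softmax(q @ p.t() / tau, dim=-1)   # [N, K]:对每个原型的响应权重
    return attn @ p                             # [N, D]:近似的语义引导向量

guidance = prototype_respond(torch.randn(16, 256), torch.randn(32, 256))
```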

[CV-44] Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

【速读】:该论文旨在解决高分辨率卫星图像中目标检测的挑战,特别是针对传统卷积神经网络(CNN)在处理此类图像时的局限性。其解决方案的关键在于采用以Transformer为核心的GLOD架构,用Swin Transformer替代CNN主干进行端到端特征提取,并引入新颖的UpConvMixer模块和Fusion Blocks实现鲁棒的上采样与多尺度特征融合。此外,通过结合CBAM注意力机制的非对称融合以及多路径头部设计,提升了模型对不同尺度目标的检测能力,同时在保持计算效率的同时利用空间先验信息优化性能。

链接: https://arxiv.org/abs/2507.11040
作者: Nicolas Drapier,Aladine Chetouani,Aurélien Chateigner
机构: L2TI Laboratory, Institut Galilée, Université Sorbonne Paris Nord; SAS Impact
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
zh

[CV-45] A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion

【速读】:该论文旨在解决在动态步态条件下获取足踝复合体精确表面几何数据的挑战,特别是在存在摆动足遮挡和视角限制的情况下。其解决方案的关键在于引入FootGait3D,这是一个高分辨率踝足表面点云的多视角数据集,包含从46名受试者中使用定制五摄像头深度感知系统采集的8,403帧点云数据。该数据集通过提供完整五视角重建以及部分视角点云,支持在不同遮挡水平和视角下对三维点云补全方法进行严格评估,从而推动足部几何形状恢复的研究。

链接: https://arxiv.org/abs/2507.11037
作者: Jie-Wen Li,Zi-Han Ye,Qingyuan Zhou,Jiayi Song,Ying He,Ben Fei,Wen-Ming Chen
机构: Fudan University(复旦大学); Fudan University(复旦大学); The Chinese University of Hong Kong(香港中文大学); Nanyang Technological University(南洋理工大学); Shanghai Innovation Institute(上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures, 2 tables

点击查看摘要

Abstract:The kinematics analysis of foot-ankle complex during gait is essential for advancing biomechanical research and clinical assessment. Collecting accurate surface geometry data from the foot and ankle during dynamic gait conditions is inherently challenging due to swing foot occlusions and viewing limitations. Thus, this paper introduces FootGait3D, a novel multi-view dataset of high-resolution ankle-foot surface point clouds captured during natural gait. Different from existing gait datasets that typically target whole-body or lower-limb motion, FootGait3D focuses specifically on the detailed modeling of the ankle-foot region, offering a finer granularity of motion data. To address this, FootGait3D consists of 8,403 point cloud frames collected from 46 subjects using a custom five-camera depth sensing system. Each frame includes a complete 5-view reconstruction of the foot and ankle (serving as ground truth) along with partial point clouds obtained from only four, three, or two views. This structured variation enables rigorous evaluation of 3D point cloud completion methods under varying occlusion levels and viewpoints. Our dataset is designed for shape completion tasks, facilitating the benchmarking of state-of-the-art single-modal (e.g., PointTr, SnowflakeNet, Anchorformer) and multi-modal (e.g., SVDFormer, PointSea, CSDN) completion networks on the challenge of recovering the full foot geometry from occluded inputs. FootGait3D has significant potential to advance research in biomechanics and multi-segment foot modeling, offering a valuable testbed for clinical gait analysis, prosthetic design, and robotics applications requiring detailed 3D models of the foot during motion. The dataset is now available at this https URL.
zh

[CV-46] Efficient Dual-domain Image Dehazing with Haze Prior Perception

【速读】:该论文旨在解决基于Transformer的单图像去雾模型计算成本高、难以实时应用的问题,以及现有方法在复杂雾霾条件下捕捉长程依赖关系能力不足和空间与频域分支耦合弱的问题。其解决方案的关键在于提出了一种新颖的双域框架DGFDNet,该框架通过物理引导的降质对齐在空间和频域上进行处理,核心包含两个关键模块:Haze-Aware Frequency Modulator (HAFM) 用于生成像素级雾霾置信图以自适应增强相关频段成分,以及Multi-level Gating Aggregation Module (MGAM) 用于通过多尺度特征融合恢复精细结构细节。此外,Prior Correction Guidance Branch (PCGB) 引入闭环反馈机制,显著提升了雾霾定位精度。

链接: https://arxiv.org/abs/2507.11035
作者: Lirong Zheng,Yanshan Li,Rui Yu,Kaihao Zhang
机构: Shenzhen University (深圳大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Transformer-based models exhibit strong global modeling capabilities in single-image dehazing, but their high computational cost limits real-time applicability. Existing methods predominantly rely on spatial-domain features to capture long-range dependencies, which are computationally expensive and often inadequate under complex haze conditions. While some approaches introduce frequency-domain cues, the weak coupling between spatial and frequency branches limits the overall performance. To overcome these limitations, we propose the Dark Channel Guided Frequency-aware Dehazing Network (DGFDNet), a novel dual-domain framework that performs physically guided degradation alignment across spatial and frequency domains. At its core, the DGFDBlock comprises two key modules: 1) the Haze-Aware Frequency Modulator (HAFM), which generates a pixel-level haze confidence map from dark channel priors to adaptively enhance haze-relevant frequency components, thereby achieving global degradation-aware spectral modulation; 2) the Multi-level Gating Aggregation Module (MGAM), which fuses multi-scale features through diverse convolutional kernels and hybrid gating mechanisms to recover fine structural details. Additionally, a Prior Correction Guidance Branch (PCGB) incorporates a closed-loop feedback mechanism, enabling iterative refinement of the prior by intermediate dehazed features and significantly improving haze localization accuracy, especially in challenging outdoor scenes. Extensive experiments on four benchmark haze datasets demonstrate that DGFDNet achieves state-of-the-art performance with superior robustness and real-time efficiency. Code is available at: this https URL.
zh
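
文中的 HAFM 以暗通道先验生成像素级雾置信图。下面给出经典暗通道先验(He 等人)的最小 NumPy/SciPy 实现作为参考,窗口大小取常用经验值,仅作原理示意:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """img: [H, W, 3],取值 [0, 1]。先逐像素取 RGB 最小值,
    再做局部最小值滤波;暗通道值越大通常意味着雾越浓。"""
    min_rgb = img.min(axis=2)
    return minimum_filter(min_rgb, size=patch)

img = np.random.rand(240, 320, 3).astype(np.float32)
haze_map = dark_channel(img)        # [240, 320],可归一化后用作雾置信图
```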

[CV-47] Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation ICCV2025

【速读】:该论文试图解决个性化文本描述(如“my mug cup”)在开放词汇语义分割(OVSS)中难以准确分割特定兴趣区域的问题,特别是在存在多个相似类别的场景下(如多个“mug cups”)。解决方案的关键在于提出一种基于文本提示微调的插件方法,通过少量图像与掩码对来识别个性化视觉概念,同时保持原始OVSS的性能。该方法引入了“负掩码提议”以减少错误预测,并通过注入个性化概念的视觉嵌入来增强文本提示的表示,从而提升个性化OVSS的效果而不损害原有性能。

链接: https://arxiv.org/abs/2507.11030
作者: Sunghyun Park,Jungsoo Lee,Shubhankar Borse,Munawar Hayat,Sungha Choi,Kyuwoong Hwang,Fatih Porikli
机构: Qualcomm AI Research(高通人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025; 15 pages

点击查看摘要

Abstract:While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., ‘my mug cup’) for segmenting regions of specific interest to users. This paper addresses challenges like recognizing ‘my mug cup’ among multiple ‘mug cups’. To overcome this challenge, we introduce a novel task termed personalized open-vocabulary semantic segmentation and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs a ‘negative mask proposal’ that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS^per, CUB^per, and ADE^per.
zh

[CV-48] Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

【速读】:该论文旨在解决锥形束计算机断层扫描(CBCT)到医学数字成像和通信(MDCT)的图像翻译问题,以提高医学影像的诊断质量和可解释性。其解决方案的关键在于基于薛定谔桥(Schrodinger Bridge, SB)框架,将生成对抗网络(GAN)衍生的先验知识与人工引导的条件扩散模型相结合,并通过二值化人类反馈引入无分类器指导(CFG),从而在保持解剖结构准确性的同时实现感知可控性。此外,通过迭代优化和基于竞赛的选择机制,模型能够内化人类偏好,无需依赖奖励模型,显著提升了图像翻译的效果与效率。

链接: https://arxiv.org/abs/2507.11025
作者: Sung Ho Kang,Hyun-Cheol Park
机构: National Institute for Mathematical Sciences(国家数学科学研究所); Korea National University of Transportation(韩国交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a novel framework for CBCT-to-MDCT translation, grounded in the Schrodinger Bridge (SB) formulation, which integrates GAN-derived priors with human-guided conditional diffusion. Unlike conventional GANs or diffusion models, our approach explicitly enforces boundary consistency between CBCT inputs and pseudo targets, ensuring both anatomical fidelity and perceptual controllability. Binary human feedback is incorporated via classifier-free guidance (CFG), effectively steering the generative process toward clinically preferred outcomes. Through iterative refinement and tournament-based preference selection, the model internalizes human preferences without relying on a reward model. Subtraction image visualizations reveal that the proposed method selectively attenuates shade artifacts in key anatomical regions while preserving fine structural detail. Quantitative evaluations further demonstrate superior performance across RMSE, SSIM, LPIPS, and Dice metrics on clinical datasets – outperforming prior GAN- and fine-tuning-based feedback methods – while requiring only 10 sampling steps. These findings underscore the effectiveness and efficiency of our framework for real-time, preference-aligned medical image translation.
zh
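
论文通过无分类器指导(CFG)注入二值人工反馈。下面把 CFG 在扩散采样单步中的标准组合公式写成代码,其中 model 的调用接口与条件形式均为演示假设:

```python
import torch

def cfg_noise(model, x_t, t, cond, guidance_scale=3.0):
    """标准 CFG 组合:eps = eps_uncond + w * (eps_cond - eps_uncond);
    model(x, t, cond) 为假设的去噪网络接口,cond=None 表示无条件分支。"""
    eps_uncond = model(x_t, t, None)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

dummy = lambda x, t, c: torch.zeros_like(x)          # 占位去噪网络,仅供运行演示
eps = cfg_noise(dummy, torch.randn(1, 4, 32, 32), 10, cond="feedback:+1")
```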

[CV-49] Semantically Informed Salient Regions Guided Radiology Report Generation

【速读】:该论文试图解决基于胸部X光片的自动放射学报告生成中,由于图像数据固有的大量偏差导致现有方法生成的报告虽流畅但医学上不准确的问题。解决方案的关键在于提出一种语义感知显著区域引导的报告生成方法(SISRNet),该方法通过细粒度的跨模态语义明确识别具有医学关键特征的显著区域,并在图像建模和报告生成过程中系统性地关注这些高信息量区域,从而有效捕捉细微的异常发现,减轻数据偏差的负面影响,最终生成临床准确的报告。

链接: https://arxiv.org/abs/2507.11015
作者: Zeyi Hou,Zeqiang Wei,Ruixin Yan,Ning Lang,Xiuzhuang Zhou
机构: Beijing University of Posts and Telecommunications(北京邮电大学); Peking University Third Hospital(北京大学第三医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in automated radiology report generation from chest X-rays using deep learning algorithms have the potential to significantly reduce the arduous workload of radiologists. However, due to the inherent massive data bias in radiology images, where abnormalities are typically subtle and sparsely distributed, existing methods often produce fluent yet medically inaccurate reports, limiting their applicability in clinical practice. To address this issue effectively, we propose a Semantically Informed Salient Regions-guided (SISRNet) report generation method. Specifically, our approach explicitly identifies salient regions with medically critical characteristics using fine-grained cross-modal semantics. Then, SISRNet systematically focuses on these high-information regions during both image modeling and report generation, effectively capturing subtle abnormal findings, mitigating the negative impact of data bias, and ultimately generating clinically accurate reports. Compared to its peers, SISRNet demonstrates superior performance on widely used IU-Xray and MIMIC-CXR datasets.
zh

[CV-50] Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection

【速读】:该论文旨在解决零样本异常检测(Zero-Shot Anomaly Detection, ZSAD)中的问题,特别是在缺乏足够标注数据的情况下实现有效的异常分类与分割。其关键解决方案是提出FiSeCLIP方法,该方法在无需训练的CLIP基础上,结合特征匹配与跨模态对齐,并利用同一批次内的图像作为参考信息进行异常检测,从而提升检测精度。为应对参考信息缺乏标签带来的歧义,该方法引入文本信息以过滤噪声特征,并进一步挖掘CLIP的内在潜力以恢复局部语义相关性,从而增强细粒度异常检测能力。

链接: https://arxiv.org/abs/2507.11003
作者: Yuhu Bai,Jiangning Zhang,Yunkang Cao,Guangyuan Lu,Qingdong He,Xiangtai Li,Guanzhong Tian
机构: Zhejiang University (浙江大学); YouTu Lab, Tencent (优图实验室,腾讯); Huazhong University of Science and Technology (华中科技大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where the rare classes are essential and expected in many applications. This study introduces FiSeCLIP for ZSAD with training-free CLIP, combining the feature matching with the cross-modal alignment. Testing with the entire dataset is impractical, while batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes other images in the same batch as reference information for the current image. However, the lack of labels for these references can introduce ambiguity, so we apply text information to filter out noisy features. In addition, we further explore CLIP’s inherent potential to restore its local semantic correlation, adapting it for fine-grained anomaly detection tasks to enable a more accurate filtering process. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, building a stronger baseline for the direction, e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6%↑/+5.7%↑ in segmentation metrics AU-ROC/F1-max.
zh

[CV-51] Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation

【速读】:该论文旨在解决服务机器人在动态和多样化环境中导航时,传统导航系统因依赖固定参数而难以适应场景变化导致性能下降和社会接受度降低的问题。其解决方案的关键在于提出LE-Nav框架,该框架利用多模态大型语言模型推理和条件变分自编码器,实现对导航规划器超参数的自适应调整,并通过零样本场景理解和链式思维提示策略提升导航系统的适应性和泛化能力。

链接: https://arxiv.org/abs/2507.11001
作者: Yanbo Wang,Zipeng Fang,Lei Zhao,Weidong Chen
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Service robots are increasingly deployed in diverse and dynamic environments, where both physical layouts and social contexts change over time and across locations. In these unstructured settings, conventional navigation systems that rely on fixed parameters often fail to generalize across scenarios, resulting in degraded performance and reduced social acceptance. Although recent approaches have leveraged reinforcement learning to enhance traditional planners, these methods often fail in real-world deployments due to poor generalization and limited simulation diversity, which hampers effective sim-to-real transfer. To tackle these issues, we present LE-Nav, an interpretable and scene-aware navigation framework that leverages multi-modal large language model reasoning and conditional variational autoencoders to adaptively tune planner hyperparameters. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Additionally, a conditional variational autoencoder captures the mapping between natural language instructions and navigation hyperparameters, enabling expert-level tuning. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios. Real-world navigation trials and a user study on a smart wheelchair platform demonstrate that it outperforms state-of-the-art methods on quantitative metrics such as success rate, efficiency, safety, and comfort, while receiving higher subjective scores for perceived safety and social acceptance. Code is available at this https URL.
zh

[CV-52] SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition IJCNN2025

【速读】:该论文旨在解决传统卷积神经网络(CNN)和基于Transformer的架构在特征表示上的简单性偏差以及现代CNN中MLP块的信息冗余问题。其解决方案的关键在于提出一种轻量级的架构设计——SpaRTAN,该设计通过引入具有不同感受野的卷积核以有效捕捉多阶空间特征,并结合基于波的通道聚合模块来调节和增强像素间的交互,从而减少通道冗余,提升特征表达的效率与性能。

链接: https://arxiv.org/abs/2507.10999
作者: Quan Bi Pay,Vishnu Monn Baskaran,Junn Yong Loo,KokSheik Wong,Simon See
机构: Monash University Malaysia(莫纳什大学马来西亚校区); NVIDIA AI Technology Center(英伟达人工智能技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at International Joint Conference on Neural Networks (IJCNN 2025)

点击查看摘要

Abstract:The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results on ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at [this https URL].
zh

[CV-53] Mind the Gap: Bridging Occlusion in Gait Recognition via Residual Gap Correction

【速读】:该论文试图解决行人重识别中因遮挡导致的步态识别性能下降问题,尤其是在实际场景中难以获取成对的遮挡与完整步态序列的数据。其解决方案的关键在于提出RG-Gait方法,将遮挡步态特征建模为相对于完整步态表示的残差偏差,并通过自适应整合学习到的残差,显著提升遮挡步态序列的识别性能,同时保持对完整输入的识别准确率。

链接: https://arxiv.org/abs/2507.10978
作者: Ayush Gupta,Siyuan Huang,Rama Chellappa
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IJCB 2025

点击查看摘要

Abstract:Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention.
zh

[CV-54] Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection IJCNN2025

【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)检测中预测可靠性不足和效率低下的问题,这些问题通常源于资源密集型的训练方法和低效的网络架构。其解决方案的关键在于提出一种类小波注意力的主干网络和一种基于射线的编码器架构,通过聚合低阶与高阶交互的判别特征来增强对中阶交互的表达能力,并利用射线优化解码器对感兴趣区域的关注,从而提升预测准确性并降低计算开销。

链接: https://arxiv.org/abs/2507.10977
作者: Quan Bi Pay,Vishnu Monn Baskaran,Junn Yong Loo,KokSheik Wong,Simon See
机构: Monash University Malaysia(莫纳什大学马来西亚校区); NVIDIA AI Technology Center(英伟达人工智能技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at International Joint Conference on Neural Networks (IJCNN 2025)

点击查看摘要

Abstract:Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [this https URL].
zh

[CV-55] Women Sport Actions Dataset for Visual Classification Using Small Scale Training Data

【速读】:该论文试图解决女性体育动作分类中缺乏足够具有类内和类间变化的图像数据集的问题。其解决方案的关键在于构建一个名为WomenSports的新数据集,用于基于小规模训练数据的女性体育分类,并提出一种基于卷积神经网络(CNN)的深度特征提取方法,其中应用了针对局部上下文区域的通道注意力机制以优化特征表示。
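摘要只提到“针对局部上下文区域的通道注意力”,未给出具体结构;下面以经典的 Squeeze-and-Excitation 风格写一个假设性示意,LocalChannelAttention 的命名与 reduction 取值均为演示假设:

```python
import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """示意:Squeeze-and-Excitation 风格的通道注意力,对通道重新加权。"""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # 压缩空间维度得到通道描述子
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                              # 按学到的权重细化特征表示

x = torch.randn(2, 64, 56, 56)                    # 模拟 CNN 中间特征图
print(LocalChannelAttention(64)(x).shape)         # torch.Size([2, 64, 56, 56])
```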

链接: https://arxiv.org/abs/2507.10969
作者: Palash Ray,Mahuya Sasmal,Asish Bera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sports action classification representing complex body postures and player-object interactions is an emerging area in image-based sports analysis. Some works have contributed to automated sports action recognition using machine learning techniques over the past decades. However, sufficient image datasets representing women sports actions with enough intra- and inter-class variations are not available to the researchers. To overcome this limitation, this work presents a new dataset named WomenSports for women sports classification using small-scale training data. This dataset includes a variety of sports activities, covering wide variations in movements, environments, and interactions among players. In addition, this study proposes a convolutional neural network (CNN) for deep feature extraction. A channel attention scheme upon local contextual regions is applied to refine and enhance feature representation. The experiments are carried out on three different sports datasets and one dance dataset for generalizing the proposed algorithm, and the performances on these datasets are noteworthy. The deep learning method achieves 89.15% top-1 classification accuracy using ResNet-50 on the proposed WomenSports dataset, which is publicly available for research at Mendeley Data.
zh

[CV-56] Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction

【速读】:该论文试图解决社会机器人在多用户环境中的交互决策问题,即如何在多人群体互动中确定何时以及对谁作出响应。解决方案的关键在于提出一种基于Transformer的多任务学习框架,并引入两种新颖的损失函数:一种用于约束主动说话者以改善场景建模,另一种用于引导响应选择,使其更倾向于针对机器人的特定话语。此外,研究还构建了一个包含现实复杂性的多人类-机器人交互数据集,以支持模型训练与评估。
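论文的两个损失函数未在摘要中给出具体形式;下面是沿作者思路的一个假设性组合示意:说话者约束项用交叉熵近似,响应选择项则对“面向机器人的话语”样本加权(函数名与权重取值均为演示假设):

```python
import torch
import torch.nn.functional as F

def multi_party_hri_loss(speaker_logits, speaker_labels,
                         resp_logits, resp_labels,
                         robot_directed_mask, lam_bias: float = 0.5):
    """示意:说话者约束项 + 偏向机器人话语的响应选择项。"""
    loss_speaker = F.cross_entropy(speaker_logits, speaker_labels)
    per_sample = F.cross_entropy(resp_logits, resp_labels, reduction="none")
    weights = 1.0 + lam_bias * robot_directed_mask.float()   # 机器人话语权重更高
    loss_resp = (weights * per_sample).mean()
    return loss_speaker + loss_resp

# 随机演示数据:8 个话语、4 个候选说话者、2 类响应决策
spk_logits, spk_y = torch.randn(8, 4), torch.randint(0, 4, (8,))
rsp_logits, rsp_y = torch.randn(8, 2), torch.randint(0, 2, (8,))
mask = torch.randint(0, 2, (8,))
print(multi_party_hri_loss(spk_logits, spk_y, rsp_logits, rsp_y, mask))
```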

链接: https://arxiv.org/abs/2507.10960
作者: He Zhu,Ryo Miyoshi,Yuki Okafuji
机构: AI Lab, CyberAgent Inc., Tokyo 150-0042, Japan(人工智能实验室,CyberAgent公司,东京150-0042,日本); Hokkaido University (北海道大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prior human-robot interaction (HRI) research has primarily focused on single-user interactions, where robots do not need to consider the timing or recipient of their responses. However, in multi-party interactions, such as at malls and hospitals, social robots must understand the context and decide both when and to whom they should respond. In this paper, we propose a Transformer-based multi-task learning framework to improve the decision-making process of social robots, particularly in multi-user environments. Considering the characteristics of HRI, we propose two novel loss functions: one that enforces constraints on active speakers to improve scene modeling, and another that guides response selection towards utterances specifically directed at the robot. Additionally, we construct a novel multi-party HRI dataset that captures real-world complexities, such as gaze misalignment. Experimental results demonstrate that our model achieves state-of-the-art performance in response decisions, outperforming existing heuristic-based and single-task approaches. Our findings contribute to the development of socially intelligent social robots capable of engaging in natural and context-aware multi-party interactions.
zh

[CV-57] Robust ID-Specific Face Restoration via Alignment Learning

【速读】:该论文试图解决面部修复中由于身份模糊输入和随机生成过程导致的身份不确定性问题。解决方案的关键在于提出一种基于扩散模型的鲁棒身份特定面部修复框架(RIDFR),该框架结合了预训练扩散模型与两个并行的条件模块:内容注入模块用于输入严重退化的图像,身份注入模块用于整合给定图像中的特定身份信息。此外,RIDFR引入了对齐学习机制,以将来自多个相同身份参考的修复结果对齐,从而抑制无关身份语义(如姿态、表情、妆容、发型)的干扰。

链接: https://arxiv.org/abs/2507.10943
作者: Yushun Fang,Lu Liu,Xiang Gao,Qiang Hu,Ning Cao,Jianghe Cui,Gang Chen,Xiaoyun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:The latest developments in Face Restoration have yielded significant advancements in visual quality through the utilization of diverse diffusion priors. Nevertheless, the uncertainty of face identity introduced by identity-obscure inputs and stochastic generative processes remains unresolved. To address this challenge, we present Robust ID-Specific Face Restoration (RIDFR), a novel ID-specific face restoration framework based on diffusion models. Specifically, RIDFR leverages a pre-trained diffusion model in conjunction with two parallel conditioning modules. The Content Injection Module inputs the severely degraded image, while the Identity Injection Module integrates the specific identity from a given image. Subsequently, RIDFR incorporates Alignment Learning, which aligns the restoration results from multiple references with the same identity in order to suppress the interference of ID-irrelevant face semantics (e.g. pose, expression, make-up, hair style). Experiments demonstrate that our framework outperforms the state-of-the-art methods, reconstructing high-quality ID-specific results with high identity fidelity and demonstrating strong robustness.
zh

[CV-58] Graph Aggregation Prototype Learning for Semantic Change Detection in Remote Sensing

【速读】:该论文旨在解决语义变化检测(Semantic Change Detection, SCD)中因多任务联合优化导致的负迁移问题,即由于任务特定的学习困难和冲突梯度流,模型在同时优化语义分割和变化检测等任务时性能受限。其解决方案的关键在于提出图聚合原型学习(Graph Aggregation Prototype Learning, GAPL-SCD)框架,通过多任务联合优化方法,结合自适应权重分配和梯度旋转策略,缓解任务间的冲突,并引入图聚合原型学习模块,利用高阶特征构建交互图,以类代理实现跨时相的类别级域对齐,减少无关变化的干扰,同时结合自查询多级特征交互与双时相特征融合模块,提升多尺度特征表示能力,从而显著提升SCD任务的准确性和鲁棒性。

链接: https://arxiv.org/abs/2507.10938
作者: Zhengyi Xu,Haoran Wu,Wen Jiang,Jie Geng
机构: 西北工业大学电子信息学院(School of Electronics and Information, Northwestern Polytechnical University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic change detection (SCD) extends the binary change detection task to provide not only the change locations but also the detailed “from-to” categories in multi-temporal remote sensing data. Such detailed semantic insights into changes offer considerable advantages for a wide array of applications. However, since SCD involves the simultaneous optimization of multiple tasks, the model is prone to negative transfer due to task-specific learning difficulties and conflicting gradient flows. To address this issue, we propose Graph Aggregation Prototype Learning for Semantic Change Detection in remote sensing(GAPL-SCD). In this framework, a multi-task joint optimization method is designed to optimize the primary task of semantic segmentation and change detection, along with the auxiliary task of graph aggregation prototype learning. Adaptive weight allocation and gradient rotation methods are used to alleviate the conflict between training tasks and improve multi-task learning capabilities. Specifically, the graph aggregation prototype learning module constructs an interaction graph using high-level features. Prototypes serve as class proxies, enabling category-level domain alignment across time points and reducing interference from irrelevant changes. Additionally, the proposed self-query multi-level feature interaction and bi-temporal feature fusion modules further enhance multi-scale feature representation, improving performance in complex scenes. Experimental results on the SECOND and Landsat-SCD datasets demonstrate that our method achieves state-of-the-art performance, with significant improvements in accuracy and robustness for SCD task.
zh

[CV-59] GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization ICCV2025

【速读】:该论文旨在解决跨视角定位(cross-view localization)任务中依赖昂贵的真值姿态标注的问题。其解决方案的关键在于提出GeoDistill框架,该框架基于几何引导的弱监督自蒸馏机制,通过教师-学生学习策略与视场角(Field-of-View, FoV)掩码技术,增强局部特征学习,从而实现鲁棒的跨视角定位。在GeoDistill中,教师模型对全景图像进行定位,而学生模型则从通过FoV掩码生成的有限视场角图像中预测位置,通过将学生的预测与教师对齐,使学生聚焦于关键特征如车道线,忽略无纹理区域,从而提升定位精度并降低不确定性。
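以下是 FoV 掩码与师生蒸馏损失的最小示意,体现“学生在有限视场下对齐教师全景预测”的核心机制;掩码方式、温度 T 等细节均为演示假设,非官方代码:

```python
import torch
import torch.nn.functional as F

def fov_mask(pano: torch.Tensor, fov_ratio: float = 0.25) -> torch.Tensor:
    """示意:沿全景图宽度随机保留 fov_ratio 的连续视场,其余区域置零。"""
    _, _, _, w = pano.shape
    keep = max(int(w * fov_ratio), 1)
    start = torch.randint(0, w - keep + 1, (1,)).item()
    masked = torch.zeros_like(pano)
    masked[..., start:start + keep] = pano[..., start:start + keep]
    return masked

def distill_loss(student_logits, teacher_logits, T: float = 2.0):
    """把学生(有限 FoV)的定位分布对齐到教师(全景)的预测分布。"""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

pano = torch.randn(2, 3, 128, 512)               # 模拟全景输入
print(fov_mask(pano).shape)                      # torch.Size([2, 3, 128, 512])
```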

链接: https://arxiv.org/abs/2507.10935
作者: Shaowen Tong,Zimin Xia,Alexandre Alahi,Xuming He,Yujiao Shi
机构: ShanghaiTech University (上海科技大学); École Polytechnique Fédérale de Lausanne (EPFL) (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICCV2025

点击查看摘要

Abstract:Cross-view localization, the task of estimating a camera’s 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a Geometry guided weakly supervised self distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student’s predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Code and model can be found at this https URL.
zh

[CV-60] Commuting Distance Regularization for Timescale-Dependent Label Inconsistency in EEG Emotion Recognition

【速读】:该论文试图解决在基于脑电图(EEG)的人类情绪识别中常被忽视的时标依赖标签不一致(Timescale Dependent Label Inconsistency, TsDLI)问题。解决方案的关键在于提出两种新颖的正则化策略:局部变化损失(Local Variation Loss, LVL)和局部-全局一致性损失(Local-Global Consistency Loss, LGCL),它们结合了有界变分函数和遍历时间距离等经典数学原理,在图论框架下实现对模型的约束,从而提升模型的泛化能力和可解释性。
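“局部变化损失(LVL)”与有界变分函数相关;一个直观的示意是惩罚相邻时间步预测之间的总变差(下述实现与张量形状均为演示假设,并非论文原式):

```python
import torch

def local_variation_loss(preds: torch.Tensor) -> torch.Tensor:
    """示意:对形状 (B, T, C) 的逐时间步预测惩罚相邻步之间的总变差,
    鼓励局部预测与全局情绪标签在时间上保持一致、平滑。"""
    diffs = preds[:, 1:, :] - preds[:, :-1, :]
    return diffs.abs().sum(dim=(1, 2)).mean()

# 8 条 EEG 片段、20 个时间步、3 类情绪的随机演示预测
preds = torch.softmax(torch.randn(8, 20, 3), dim=-1)
print(local_variation_loss(preds))
```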

链接: https://arxiv.org/abs/2507.10895
作者: Xiaocong Zeng,Craig Michoski,Yan Pang,Dongyang Kuang
机构: Sun Yat-sen University (中山大学); ODEN Institute for Computational Engineering & Sciences (ODEN计算工程与科学研究所); Shenzhen Institute of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In this work, we address the often-overlooked issue of Timescale Dependent Label Inconsistency (TsDLI) in training neural network models for EEG-based human emotion recognition. To mitigate TsDLI and enhance model generalization and explainability, we propose two novel regularization strategies: Local Variation Loss (LVL) and Local-Global Consistency Loss (LGCL). Both methods incorporate classical mathematical principles–specifically, functions of bounded variation and commute-time distances–within a graph theoretic framework. Complementing our regularizers, we introduce a suite of new evaluation metrics that better capture the alignment between temporally local predictions and their associated global emotion labels. We validate our approach through comprehensive experiments on two widely used EEG emotion datasets, DREAMER and DEAP, across a range of neural architectures including LSTM and transformer-based models. Performance is assessed using five distinct metrics encompassing both quantitative accuracy and qualitative consistency. Results consistently show that our proposed methods outperform state-of-the-art baselines, delivering superior aggregate performance and offering a principled trade-off between interpretability and predictive power under label inconsistency. Notably, LVL achieves the best aggregate rank across all benchmarked backbones and metrics, while LGCL frequently ranks the second, highlighting the effectiveness of our framework.
zh

[CV-61] Modernizing CNN-based Weather Forecast Model towards Higher Computational Efficiency

【速读】:该论文试图解决基于Transformer的AI天气预报模型在训练复杂度和计算资源需求上的高负担问题,旨在实现具有竞争力准确性的轻量化全球天气预测模型。其解决方案的关键在于引入一种现代化的卷积神经网络(CNN)基础模型,该模型采用了尺度不变架构和InceptionNeXt模块,并结合地球系统数据结构进行地理物理感知设计,从而在保持高精度的同时显著降低计算需求。

链接: https://arxiv.org/abs/2507.10893
作者: Minjong Cheon,Eunhan Goo,Su-Hyeon Shin,Muhammad Ahmed,Hyungjun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 26pages, 9 Figures

点击查看摘要

Abstract:Recently, AI-based weather forecast models have achieved impressive advances. These models have reached accuracy levels comparable to traditional NWP systems, marking a significant milestone in data-driven weather prediction. However, they mostly leverage Transformer-based architectures, which often leads to high training complexity and resource demands due to the massive parameter sizes. In this study, we introduce a modernized CNN-based model for global weather forecasting that delivers competitive accuracy while significantly reducing computational requirements. To present a systematic modernization roadmap, we highlight key architectural enhancements across multiple design scales from an earlier CNN-based approach. KAI-a incorporates a scale-invariant architecture and InceptionNeXt-based blocks within a geophysically-aware design, tailored to the structure of Earth system data. Trained on the ERA5 daily dataset with 67 atmospheric variables, the model contains about 7 million parameters and completes training in just 12 hours on a single NVIDIA L40s GPU. Our evaluation shows that KAI-a matches the performance of state-of-the-art models in medium-range weather forecasting, while offering a significantly lightweight design. Furthermore, case studies on the 2018 European heatwave and the East Asian summer monsoon demonstrate KAI-a’s robust skill in capturing extreme events, reinforcing its practical utility.
zh

[CV-62] rexplorer Super: Topologically Correct Centerline Tree Tracking of Tubular Objects in CT Volumes MICCAI2025

【速读】:该论文旨在解决在三维医学图像中对管状树状结构(如血管和气道)进行中心线跟踪时存在的问题,特别是生成重复分支和过早终止跟踪的问题。其解决方案的关键在于提出增强版递归模型 Trexplorer Super,针对重复分支与过早终止进行改进;同时,为弥补公共评测数据的缺失,构建了一个合成、两个真实、难度递增的三个中心线数据集,用于对现有最先进模型与本方法进行全面评估。

链接: https://arxiv.org/abs/2507.10881
作者: Roman Naeem,David Hagerman,Jennifer Alvén,Lennart Svensson,Fredrik Kahl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted Version. Accepted at MICCAI 2025

点击查看摘要

Abstract:Tubular tree structures, such as blood vessels and airways, are essential in human anatomy and accurately tracking them while preserving their topology is crucial for various downstream tasks. Trexplorer is a recurrent model designed for centerline tracking in 3D medical images but it struggles with predicting duplicate branches and terminating tracking prematurely. To address these issues, we present Trexplorer Super, an enhanced version that notably improves performance through novel advancements. However, evaluating centerline tracking models is challenging due to the lack of public datasets. To enable thorough evaluation, we develop three centerline datasets, one synthetic and two real, each with increasing difficulty. Using these datasets, we conduct a comprehensive evaluation of existing state-of-the-art (SOTA) models and compare them with our approach. Trexplorer Super outperforms previous SOTA models on every dataset. Our results also highlight that strong performance on synthetic data does not necessarily translate to real datasets. The code and datasets are available at this https URL.
zh

[CV-63] A Lightweight and Robust Framework for Real-Time Colorectal Polyp Detection Using LOF-Based Preprocessing and YOLO-v11n

【速读】:该论文旨在解决结直肠息肉的及时且准确检测问题,这对于诊断和预防全球主要死亡原因之一的结直肠癌至关重要。其解决方案的关键在于结合局部异常因子(Local Outlier Factor, LOF)算法用于过滤噪声数据,并与YOLO-v11n深度学习模型相结合,构建一个轻量级且高效的息肉检测框架。通过LOF方法去除异常样本并优化数据集,再利用YOLO-v11n进行实时对象检测,从而提升模型的鲁棒性和泛化能力。
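摘要明确给出了 LOF 的配置(30 个近邻、5% 污染率);用 scikit-learn 可以直接写出这一步数据清洗的示意,其中特征矩阵 X 为随机演示数据:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(1000, 512)                  # 假设的样本特征矩阵(随机演示)
lof = LocalOutlierFactor(n_neighbors=30, contamination=0.05)
labels = lof.fit_predict(X)                    # 1 = 正常样本, -1 = 异常样本
X_clean = X[labels == 1]                       # 清洗后的数据再交给 YOLO-v11n 训练
print(X.shape, "->", X_clean.shape)            # 约 (1000, 512) -> (950, 512)
```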

链接: https://arxiv.org/abs/2507.10864
作者: Saadat Behzadi,Danial Sharifrazi,Bita Mesbahzadeh,Javad Hassannataj Joloudari,Roohallah Alizadehsani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objectives: Timely and accurate detection of colorectal polyps plays a crucial role in diagnosing and preventing colorectal cancer, a major cause of mortality worldwide. This study introduces a new, lightweight, and efficient framework for polyp detection that combines the Local Outlier Factor (LOF) algorithm for filtering noisy data with the YOLO-v11n deep learning model. Study design: An experimental study leveraging deep learning and outlier removal techniques across multiple public datasets. Methods: The proposed approach was tested on five diverse and publicly available datasets: CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, and EndoScene. Since these datasets originally lacked bounding box annotations, we converted their segmentation masks into suitable detection labels. To enhance the robustness and generalizability of our model, we apply 5-fold cross-validation and remove anomalous samples using the LOF method configured with 30 neighbors and a contamination ratio of 5%. Cleaned data are then fed into YOLO-v11n, a fast and resource-efficient object detection architecture optimized for real-time applications. We train the model using a combination of modern augmentation strategies to improve detection accuracy under diverse conditions. Results: Our approach significantly improves polyp localization performance, achieving a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Compared to previous YOLO-based methods, our model demonstrates enhanced accuracy and efficiency. Conclusions: These results suggest that the proposed method is well-suited for real-time colonoscopy support in clinical settings. Overall, the study underscores how crucial data preprocessing and model efficiency are when designing effective AI systems for medical imaging.
zh

[CV-64] Sparse Fine-Tuning of Transformers for Generative Tasks

【速读】:该论文试图解决现有微调方法中模型更新表示难以解释的问题,即传统方法通过密集组合修改参数来形成更新表示,导致难以理解模型如何适应新任务。其解决方案的关键在于引入一种受稀疏编码启发的微调框架,将微调特征表示为基本元素(即特征字典原子)的稀疏组合,从而使得稀疏系数能够指示原子的重要性,并实现对模型适应过程的可解释性控制。

链接: https://arxiv.org/abs/2507.10855
作者: Wei Chen,Jingxi Yu,Zichen Miao,Qiang Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by International Conference on Computer Vision 2025

点击查看摘要

Abstract:Large pre-trained transformers have revolutionized artificial intelligence across various domains, and fine-tuning remains the dominant approach for adapting these models to downstream tasks due to the cost of training from scratch. However, in existing fine-tuning methods, the updated representations are formed as a dense combination of modified parameters, making it challenging to interpret their contributions and understand how the model adapts to new tasks. In this work, we introduce a fine-tuning framework inspired by sparse coding, where fine-tuned features are represented as a sparse combination of basic elements, i.e., feature dictionary atoms. The feature dictionary atoms function as fundamental building blocks of the representation, and tuning atoms allows for seamless adaptation to downstream tasks. Sparse coefficients then serve as indicators of atom importance, identifying the contribution of each atom to the updated representation. Leveraging the atom selection capability of sparse coefficients, we first demonstrate that our method enhances image editing performance by improving text alignment through the removal of unimportant feature dictionary atoms. Additionally, we validate the effectiveness of our approach in the text-to-image concept customization task, where our method efficiently constructs the target concept using a sparse combination of feature dictionary atoms, outperforming various baseline fine-tuning methods.
zh

[CV-65] Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)决策过程可解释性不足的问题,特别是在高风险领域部署模型时,传统方法如Grad-CAM往往仅关注最终卷积层或简单地对多层进行平均,导致重要语义线索被掩盖或无关噪声被放大。其解决方案的关键在于提出Winsor-CAM,这是一种基于梯度加权类激活映射(Grad-CAM)的新型、可人工调节的扩展方法,通过聚合所有卷积层的信息生成更鲁棒且连贯的显著性图,并采用Winsorization技术抑制噪声或极端赋值的影响,同时提供用户可控的阈值以实现语义层面的调优。
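Winsorization 的核心是按百分位截断极端归因值后再跨层聚合;下面用 NumPy 给出示意,其中 5/95 阈值与“各层激活图已缩放到同一尺寸”均为演示假设:

```python
import numpy as np

def winsor_cam(cams, lower: float = 5.0, upper: float = 95.0) -> np.ndarray:
    """示意:对每层的 Grad-CAM 图按百分位 Winsorization 后再跨层平均。
    假设各层激活图已被插值到同一空间尺寸;阈值对应论文中可由用户调节的参数。"""
    processed = []
    for cam in cams:
        lo, hi = np.percentile(cam, [lower, upper])
        cam = np.clip(cam, lo, hi)                               # 截断极端归因值
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # 归一化
        processed.append(cam)
    return np.mean(processed, axis=0)                            # 聚合所有卷积层

cams = [np.random.rand(14, 14) for _ in range(4)]                # 4 层的演示激活图
print(winsor_cam(cams).shape)                                    # (14, 14)
```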

链接: https://arxiv.org/abs/2507.10846
作者: Casey Wall,Longwei Wang,Rodrigue Rizk,KC Santosh
机构: University of South Dakota (南达科他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 10 figures, 7 tables. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

点击查看摘要

Abstract:Interpreting the decision-making process of Convolutional Neural Networks (CNNs) is critical for deploying models in high-stakes domains. Gradient-weighted Class Activation Mapping (Grad-CAM) is a widely used method for visual explanations, yet it typically focuses on the final convolutional layer or naïvely averages across layers, strategies that can obscure important semantic cues or amplify irrelevant noise. We propose Winsor-CAM, a novel, human-tunable extension of Grad-CAM that generates robust and coherent saliency maps by aggregating information across all convolutional layers. To mitigate the influence of noisy or extreme attribution values, Winsor-CAM applies Winsorization, a percentile-based outlier attenuation technique. A user-controllable threshold allows for semantic-level tuning, enabling flexible exploration of model behavior across representational hierarchies. Evaluations on standard architectures (ResNet50, DenseNet121, VGG16, InceptionV3) using the PASCAL VOC 2012 dataset demonstrate that Winsor-CAM produces more interpretable heatmaps and achieves superior performance in localization metrics, including intersection-over-union and center-of-mass alignment, when compared to Grad-CAM and uniform layer-averaging baselines. Winsor-CAM advances the goal of trustworthy AI by offering interpretable, multi-layer insights with human-in-the-loop control.
zh

[CV-66] LLM -Guided Agent ic Object Detection for Open-World Understanding

【速读】:该论文试图解决传统目标检测方法在处理新物体时需要昂贵的重新训练问题,以及开放世界目标检测(OWOD)缺乏语义标签和开放词汇目标检测(OVOD)依赖用户提示导致自主性受限的问题。解决方案的关键在于提出一种基于大语言模型(LLM)引导的代理目标检测框架(LAOD),通过 prompting LLM 生成场景特定的对象名称,进而传递给开放词汇检测器进行定位,实现完全无标签、零样本的目标检测,从而增强系统的自主性和适应性。
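该流程可概括为“LLM 提名类别,开放词汇检测器定位”;下面是一个接口层面的示意,其中 llm.generate 与 detector.predict 均为假设的封装,并非论文或任何具体库的真实 API:

```python
def laod_detect(image, llm, detector):
    """示意流程:llm 与 detector 是假设的接口封装,仅说明两阶段的衔接方式。"""
    prompt = "List the object categories likely present in this scene, one per line."
    names = [n.strip() for n in llm.generate(image, prompt).splitlines() if n.strip()]
    # 将 LLM 提名的场景特定类别名作为文本查询,交给开放词汇检测器定位
    return detector.predict(image, text_queries=names)
```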

链接: https://arxiv.org/abs/2507.10844
作者: Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz
机构: University of South Florida (南佛罗里达大学); Mitsubishi Electric Research Laboratories (三菱电机研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. While Open-World and Open-Vocabulary Object Detection (OWOD and OVOD) improve flexibility, OWOD lacks semantic labels for unknowns, and OVOD depends on user prompts, limiting autonomy. We propose an LLM-guided agentic object detection (LAOD) framework that enables fully label-free, zero-shot detection by prompting a Large Language Model (LLM) to generate scene-specific object names. These are passed to an open-vocabulary detector for localization, allowing the system to adapt its goals dynamically. We introduce two new metrics, Class-Agnostic Average Precision (CAAP) and Semantic Naming Average Precision (SNAP), to separately evaluate localization and naming. Experiments on LVIS, COCO, and COCO-OOD validate our approach, showing strong performance in detecting and naming novel objects. Our method offers enhanced autonomy and adaptability for open-world understanding.
zh

[CV-67] hinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

【速读】:该论文旨在解决Vision Transformers(ViT)在异构硬件上部署时因固定计算预算导致的可扩展性问题。现有嵌套Transformer架构虽通过嵌入子网络实现可扩展推理,但其对所有输入分配相同计算量,造成效率低下。论文提出的解决方案是ThinkingViT,其关键在于引入了渐进式思考阶段和Token Recycling机制,使模型能够根据输入难度动态调整推理计算,从而在保持性能的同时提升计算效率。
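渐进推理与提前退出可以用如下循环示意;stages 的接口、置信度阈值以及 Token Recycling 的嵌入传递方式均为演示假设,非官方实现:

```python
import torch

@torch.no_grad()
def progressive_inference(stages, x, conf_thresh: float = 0.9):
    """示意:stages 为由小到大的子网络列表(接口为假设);
    每个阶段以上一阶段的嵌入为条件(模拟 Token Recycling),
    当预测置信度达到阈值时提前退出,否则激活更多注意力头继续推理。"""
    logits, prev_embed = None, None
    for stage in stages:
        logits, prev_embed = stage(x, prev_embed)
        conf = torch.softmax(logits, dim=-1).max(dim=-1).values
        if bool((conf >= conf_thresh).all()):   # 足够确定则终止
            break
    return logits
```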

链接: https://arxiv.org/abs/2507.10800
作者: Ali Hojjat,Janek Haberer,Soren Pirk,Olaf Landsiedel
机构: Kiel University (基尔大学); Hamburg University of Technology (汉堡应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent nested Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT initiates inference by activating a small subset of the most important attention heads and terminates early if predictions reach sufficient certainty. Otherwise, it activates additional attention heads and re-evaluates the input. At the core of ThinkingViT is our Token Recycling mechanism, which conditions each subsequent inference stage on the embeddings from the previous stage, enabling progressive improvement. Due to its backbone-preserving design, ThinkingViT also serves as a plugin upgrade for vanilla ViT. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. The source code is available at this https URL.
zh

[CV-68] Warehouse Spatial Question Answering with LLM Agent

【速读】:该论文旨在解决现有多模态大语言模型(Multi-modal Large Language Models, MLLMs)在空间理解任务中的挑战,特别是在复杂室内仓库场景下的空间问答任务。其解决方案的关键在于提出一个具备强大空间推理能力的大型语言模型代理系统,该系统整合了多种工具,使代理能够进行空间推理并与API工具交互,从而高效准确地回答复杂的空间问题。

链接: https://arxiv.org/abs/2507.10778
作者: Hsiang-Wei Huang,Jen-Hao Cheng,Kuang-Ming Chen,Cheng-Yen Yang,Bahaa Alattar,Yi-Ru Lin,Pyongkun Kim,Sangwon Kim,Kwangju Kim,Chung-I Huang,Jenq-Neng Hwang
机构: University of Washington (华盛顿大学); Electronics and Telecommunications Research Institute (电子与电信研究机构); National Center for High-performance Computing (高性能计算国家中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 1st Place Solution of the 9th AI City Challenge Track 3

点击查看摘要

Abstract:Spatial understanding has been a challenging task for existing Multi-modal Large Language Models (MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance MLLM’s spatial understanding ability. In this paper, we present a data-efficient approach. We propose a LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and API tools interaction to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: this https URL
zh

[CV-69] rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding IROS2025

【速读】:该论文试图解决在新环境中执行灵巧机器人操作任务(如抓取)时,由于依赖静态视觉特征导致的未见过物体实例分割(UOIS)模型泛化能力不足的问题。解决方案的关键在于提出一种实时交互感知框架rt-RISeg,该框架通过机器人交互和设计的体帧不变特征(BFIF)分析,持续分割未见过物体,无需依赖预训练的分割模型,从而实现了无需等待动作完成即可生成和更新物体分割掩码的自包含分割流程。

链接: https://arxiv.org/abs/2507.10776
作者: Howard H. Qian,Yiting Chen,Gaotian Wang,Podshara Chanrungmaneekul,Kaiyu Hang
机构: Rice University (莱斯大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, IROS 2025, Interactive Perception, Segmentation, Robotics, Computer Vision

点击查看摘要

Abstract:Successful execution of dexterous robotic manipulation tasks in new environments, such as grasping, depends on the ability to proficiently segment unseen objects from the background and other objects. Previous works in unseen object instance segmentation (UOIS) train models on large-scale datasets, which often leads to overfitting on static visual features. This dependency results in poor generalization performance when confronted with out-of-distribution scenarios. To address this limitation, we rethink the task of UOIS based on the principle that vision is inherently interactive and occurs over time. We propose a novel real-time interactive perception framework, rt-RISeg, that continuously segments unseen objects by robot interactions and analysis of a designed body frame-invariant feature (BFIF). We demonstrate that the relative rotational and linear velocities of randomly sampled body frames, resulting from selected robot interactions, can be used to identify objects without any learned segmentation model. This fully self-contained segmentation pipeline generates and updates object segmentation masks throughout each robot interaction without the need to wait for an action to finish. We showcase the effectiveness of our proposed interactive perception method by achieving an average object segmentation accuracy rate 27.5% greater than state-of-the-art UOIS methods. Furthermore, although rt-RISeg is a standalone framework, we show that the autonomously generated segmentation masks can be used as prompts to vision foundation models for significantly improved performance.
zh

[CV-70] A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers

【速读】:该论文试图解决航天器在太空中因暴露于危险环境而遭受损坏的问题,以及通过人工舱外活动或机器人操作进行在轨维修所带来的高风险和高成本问题。其解决方案的关键在于利用图像分割技术开发可靠且成本效益高的自主检测系统。为此,研究者创建了一个包含近64k标注航天器图像的新数据集,该数据集使用真实航天器模型,并叠加由NASA的TTALOS管道生成的真实与合成背景,同时模拟了现实世界中的相机失真和噪声。此外,还对YOLOv8和YOLOv11分割模型进行了微调,以在定义明确的硬件和推理时间限制下生成性能基准,从而模拟太空环境中实时机载应用的图像分割挑战。

链接: https://arxiv.org/abs/2507.10775
作者: Jeffrey Joan Sam,Janhavi Sathe,Nikhil Chigali,Naman Gupta,Radhey Ruparel,Yicheng Jiang,Janmajay Singh,James W. Berck,Arko Barman
机构: Rice University (莱斯大学); NASA (美国国家航空航天局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA’s TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. Finally, we finetuned YOLOv8 and YOLOv11 segmentation models to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints to mimic real-world image segmentation challenges for real-time onboard applications in space on NASA’s inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 second. The dataset and models for performance benchmark are available at this https URL.
zh

[CV-71] FPC-Net: Revisiting SuperPoint with Descriptor-Free Keypoint Detection via Feature Pyramids and Consistency-Based Implicit Matching

【速读】:该论文试图解决传统几何计算机视觉任务中兴趣点提取与匹配的问题,这些问题通常依赖于为兴趣点分配描述符并基于描述符相似性进行对应点识别。论文提出的解决方案的关键在于在检测阶段就将兴趣点固有地关联起来,从而消除了计算、存储、传输或匹配描述符的必要性。这种方法虽然在匹配精度上略逊于传统方法,但显著降低了定位系统中的内存使用量。

链接: https://arxiv.org/abs/2507.10770
作者: Ionuţ Grigore,Călin-Adrian Popa,Claudiu Leoveanu-Condrei
机构: Politehnica University of Timişoara (蒂米什瓦拉理工大学); ExtensityAI (ExtensityAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The extraction and matching of interest points are fundamental to many geometric computer vision tasks. Traditionally, matching is performed by assigning descriptors to interest points and identifying correspondences based on descriptor similarity. This work introduces a technique where interest points are inherently associated during detection, eliminating the need for computing, storing, transmitting, or matching descriptors. Although the matching accuracy is marginally lower than that of conventional approaches, our method completely eliminates the need for descriptors, leading to a drastic reduction in memory usage for localization systems. We assess its effectiveness by comparing it against both classical handcrafted methods and modern learned approaches.
zh

[CV-72] Spatial Reason ers for Continuous Variables in Any Domain ICML2025

【速读】:该论文试图解决在连续变量上进行空间推理的问题,特别是在生成式去噪模型(generative denoising models)背景下实现多连续变量的推理。解决方案的关键在于提出一个名为Spatial Reasoners的软件框架,该框架通过提供易于使用的接口,支持从任意数据域到生成模型范式及推理策略的变量映射,从而降低使用不同去噪公式、采样器和推理策略进行生成推理的复杂性和工作量。

链接: https://arxiv.org/abs/2507.10768
作者: Bart Pogodzinski,Christopher Wewer,Bernt Schiele,Jan Eric Lenssen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: For the project documentation see this https URL . The SRM project website is available at this https URL . The work was published on ICML 2025 CODEML workshop

点击查看摘要

Abstract:We present Spatial Reasoners, a software framework to perform spatial reasoning over continuous variables with generative denoising models. Denoising generative models have become the de-facto standard for image generation, due to their effectiveness in sampling from complex, high-dimensional distributions. Recently, they have started being explored in the context of reasoning over multiple continuous variables. Providing infrastructure for generative reasoning with such models requires a high effort, due to a wide range of different denoising formulations, samplers, and inference strategies. Our presented framework aims to facilitate research in this area, providing easy-to-use interfaces to control variable mapping from arbitrary data domains, generative model paradigms, and inference strategies. Spatial Reasoners are openly available at this https URL
zh

[CV-73] Auditing Facial Emotion Recognition Datasets for Posed Expressions and Racial Bias

【速读】:该论文试图解决面部表情识别(Facial Expression Recognition, FER)算法在检测自发性表情和不同肤色人群时性能下降的问题。其关键解决方案是通过对当前最先进的FER数据集进行审计,提出一种识别图像是否为自发性或摆拍性的方法,并发现这些数据集中存在大量被错误标注为“自然场景”(in-the-wild)的摆拍图像,同时验证了模型在不同肤色人群上的表现偏差。这一方法有助于揭示数据集构建中的缺陷,并为改进FER模型的公平性和泛化能力提供依据。

链接: https://arxiv.org/abs/2507.10755
作者: Rina Khan,Catherine Stinson
机构: Queen’s University (皇后大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Facial expression recognition (FER) algorithms classify facial expressions into emotions such as happy, sad, or angry. An evaluative challenge facing FER algorithms is the fall in performance when detecting spontaneous expressions compared to posed expressions. An ethical (and evaluative) challenge facing FER algorithms is that they tend to perform poorly for people of some races and skin colors. These challenges are linked to the data collection practices employed in the creation of FER datasets. In this study, we audit two state-of-the-art FER datasets. We take random samples from each dataset and examine whether images are spontaneous or posed. In doing so, we propose a methodology for identifying spontaneous or posed images. We discover a significant number of images that were posed in the datasets purporting to consist of in-the-wild images. Since performance of FER models vary between spontaneous and posed images, the performance of models trained on these datasets will not represent the true performance if such models were to be deployed in in-the-wild applications. We also observe the skin color of individuals in the samples, and test three models trained on each of the datasets to predict facial expressions of people from various races and skin tones. We find that the FER models audited were more likely to predict people labeled as not white or determined to have dark skin as showing a negative emotion such as anger or sadness even when they were smiling. This bias makes such models prone to perpetuate harm in real life applications.
zh

[CV-74] Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines ICCV2025

【速读】:该论文试图解决在新细胞系(de novo cell lines)中进行稳健扰动筛选的问题,这一问题由于细胞系间的形态学和生物学异质性而变得复杂。解决方案的关键在于将外部生物知识整合到现有的预训练策略中,通过显式解耦扰动特异性与细胞系特异性表示来增强显微图像表型模型。具体而言,研究者利用STRING和Hetionet数据库中的蛋白质相互作用数据构建知识图谱,以指导模型在预训练过程中关注扰动特异性特征,并结合单细胞基础模型的转录组特征来捕获细胞系特异性表示。

链接: https://arxiv.org/abs/2507.10737
作者: Jiayuan Chen,Thai-Hoang Pham,Yuanlong Wang,Ping Zhang
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:High-throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for de novo cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation-specific and cell line-specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation-specific features during pretraining. Additionally, we incorporate transcriptomic features from single-cell foundation models to capture cell line-specific representations. By learning these disentangled features, our method improves the generalization of imaging models to de novo cell lines. We evaluate our framework on the RxRx database through one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for de novo cell lines, highlighting its effectiveness in real-world phenotype-based drug discovery applications.
zh

[CV-75] CWNet: Causal Wavelet Network for Low-Light Image Enhancement ICCV2025

【速读】:该论文旨在解决传统低光照图像增强(Low-Light Image Enhancement, LLIE)方法在均匀亮度调整中忽视实例级语义信息及不同特征固有特性的局限性。其解决方案的关键在于提出CWNet(Causal Wavelet Network),该架构利用小波变换进行因果推理,包含两个核心组件:一是基于因果推理的全局度量学习策略与实例级CLIP语义损失,以确保因果嵌入符合因果原则并保持因果因子的一致性;二是基于因果分析的小波变换骨干网络,有效优化频率信息的恢复,实现针对小波变换特性的精确增强。

链接: https://arxiv.org/abs/2507.10689
作者: Tongshun Zhang,Pingping Liu,Yubing Lu,Mengen Cai,Zijian Zhang,Zhe Zhang,Qiuzhan Zhou
机构: College of Computer Science and Technology, Jilin University (吉林大学计算机科学与技术学院); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (教育部符号计算与知识工程重点实验室); College of Communication Engineering, Jilin University (吉林大学通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that effectively optimizes the recovery of frequency information, ensuring precise enhancement tailored to the specific attributes of wavelet transforms. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes. Code is available at this https URL.
zh

[CV-76] A Simple Baseline for Stable and Plastic Neural Networks

【速读】:该论文试图解决持续学习(continual learning)中模型在面对连续任务流时难以平衡可塑性(plasticity)与稳定性(stability)的问题。其解决方案的关键在于提出RDBP,这是一种结合了两种互补机制的简单且低开销基线方法:ReLUDown通过轻量级激活修改来保持特征敏感性并防止神经元失活,以及Decreasing Backpropagation,一种受生物学启发的梯度调度方案,逐步保护早期层免受灾难性更新。
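“逐步保护早期层”的梯度调度可以用逐层梯度缩放来近似示意;这里的线性调度与 min_scale 取值均为演示假设,并非论文的原始方案:

```python
import torch.nn as nn

def apply_decreasing_backprop(model: nn.Module, min_scale: float = 0.1):
    """示意:按深度线性缩放梯度,越靠前的层缩放系数越小,
    以在持续学习中保护早期层免受灾难性更新。"""
    layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    n = len(layers)
    for i, layer in enumerate(layers):
        scale = min_scale + (1.0 - min_scale) * i / max(n - 1, 1)
        for p in layer.parameters():
            p.register_hook(lambda g, s=scale: g * s)   # 反向传播时缩放梯度

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
apply_decreasing_backprop(model)   # 之后照常 forward/backward 即可
```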

链接: https://arxiv.org/abs/2507.10637
作者: É. Künzel,A. Jaziri,V. Ramesh
机构: Goethe University (歌德大学); HessianAI (黑森人工智能中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 50 figures

点击查看摘要

Abstract:Continual learning in computer vision requires that models adapt to a continuous stream of tasks without forgetting prior knowledge, yet existing approaches often tip the balance heavily toward either plasticity or stability. We introduce RDBP, a simple, low-overhead baseline that unites two complementary mechanisms: ReLUDown, a lightweight activation modification that preserves feature sensitivity while preventing neuron dormancy, and Decreasing Backpropagation, a biologically inspired gradient-scheduling scheme that progressively shields early layers from catastrophic updates. Evaluated on the Continual ImageNet benchmark, RDBP matches or exceeds the plasticity and stability of state-of-the-art methods while reducing computational cost. RDBP thus provides both a practical solution for real-world continual learning and a clear benchmark against which future continual learning strategies can be measured.
zh

[CV-77] Flows and Diffusions on the Neural Manifold

【速读】:该论文旨在解决如何在权重空间中进行有效的学习问题,特别是在生成分布内权重、提升下游任务的初始化效果以及支持微调以增强模型性能方面。其解决方案的关键在于将梯度下降引起的轨迹建模为轨迹推理问题,并通过梯度流匹配(gradient flow matching)框架统一多种轨迹推理技术,从而将优化路径作为归纳偏置进行处理。这一方法结合了结构先验与优化动态,提升了权重空间学习的效果。

链接: https://arxiv.org/abs/2507.10623
作者: Daniel Saragih,Deyu Cao,Tejas Balaji
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 6 figures, 13 tables

点击查看摘要

Abstract:Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques under the framework of gradient flow matching, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.
zh

[CV-78] FedGSCA: Medical Federated Learning with Global Sample Selector and Client Adaptive Adjuster under Label Noise

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在医疗图像分类中因标签噪声导致的训练不稳定和模型性能下降问题。其关键解决方案是提出FedGSCA框架,该框架通过引入全局样本选择器(Global Sample Selector)聚合所有客户端的噪声知识,以应对噪声异质性并提升全局模型稳定性;同时结合客户端自适应调整机制(Client Adaptive Adjustment, CAA),利用自适应阈值伪标签生成和鲁棒可信标签损失函数,动态调整类别分布,确保少数样本的纳入并有效管理噪声标签,从而降低噪声数据的影响并防止局部训练中的过拟合。

链接: https://arxiv.org/abs/2507.10611
作者: Mengwen Ye,Yingzi Huangfu,Shujian Gao,Wei Ren,Weifan Liu,Zekuan Yu
机构: Fudan University (复旦大学); China University of Geosciences (中国地质大学); Beijing Forestry University (北京林业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated Learning (FL) emerged as a solution for collaborative medical image classification while preserving data privacy. However, label noise, which arises from inter-institutional data variability, can cause training instability and degrade model performance. Existing FL methods struggle with noise heterogeneity and the imbalance in medical data. Motivated by these challenges, we propose FedGSCA, a novel framework for enhancing robustness in noisy medical FL. FedGSCA introduces a Global Sample Selector that aggregates noise knowledge from all clients, effectively addressing noise heterogeneity and improving global model stability. Furthermore, we develop a Client Adaptive Adjustment (CAA) mechanism that combines adaptive threshold pseudo-label generation and Robust Credal Labeling Loss. CAA dynamically adjusts to class distributions, ensuring the inclusion of minority samples and carefully managing noisy labels by considering multiple plausible labels. This dual approach mitigates the impact of noisy data and prevents overfitting during local training, which improves the generalizability of the model. We evaluate FedGSCA on one real-world colon slides dataset and two synthetic medical datasets under various noise conditions, including symmetric, asymmetric, extreme, and heterogeneous types. The results show that FedGSCA outperforms the state-of-the-art methods, excelling in extreme and heterogeneous noise scenarios. Moreover, FedGSCA demonstrates significant advantages in improving model stability and handling complex noise, making it well-suited for real-world medical federated learning scenarios.
zh

[CV-79] SFATTI: Spiking FPGA Accelerator for Temporal Task-driven Inference – A Case Study on MNIST

【速读】:该论文旨在解决在边缘计算场景下实现低延迟、低功耗的图像识别问题,其解决方案的关键在于利用脉冲神经网络(Spiking Neural Networks, SNNs)的事件驱动和时间稀疏特性,通过开源框架Spiker+生成优化的SNN加速器,以适应基于现场可编程门阵列(FPGA)的部署需求。

链接: https://arxiv.org/abs/2507.10561
作者: Alessio Caviglia,Filippo Marostica,Alessio Carpegna,Alessandro Savino,Stefano Di Carlo
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hardware accelerators are essential for achieving low-latency, energy-efficient inference in edge applications like image recognition. Spiking Neural Networks (SNNs) are particularly promising due to their event-driven and temporally sparse nature, making them well-suited for low-power Field Programmable Gate Array (FPGA)-based deployment. This paper explores using the open-source Spiker+ framework to generate optimized SNNs accelerators for handwritten digit recognition on the MNIST dataset. Spiker+ enables high-level specification of network topologies, neuron models, and quantization, automatically generating deployable HDL. We evaluate multiple configurations and analyze trade-offs relevant to edge computing constraints.
zh

[CV-80] Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent

【速读】:该论文试图解决在泊松逆问题中学习图像正则化泛函的问题,其中数据保真项更适合用Kullback-Leibler散度建模。解决方案的关键在于引入一种基于镜像下降(Mirror Descent)的新型深度均衡模型(DEQ)框架,该框架定义在定制的非欧几里得几何上,能够自然适应数据项的结构,从而在有原则的训练框架中学习神经正则化器。
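镜像下降的一般更新为 ∇φ(x_{k+1}) = ∇φ(x_k) - γ∇f(x_k);下面以 Burg 熵 φ(x) = -Σ log x_i 为镜像映射给出一个玩具示意。该映射只是适配非负约束的常见选择之一,不一定是论文所用的定制几何:

```python
import numpy as np

def mirror_descent_burg(grad_f, x0, step: float = 0.1, iters: int = 500):
    """示意:以 Burg 熵 φ(x) = -Σ log x_i 为镜像映射的镜像下降。
    在此映射下更新化为逐元素的 1/x_{k+1} = 1/x_k + γ∇f(x_k),迭代天然保持为正。"""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        inv_next = 1.0 / x + step * grad_f(x)
        x = 1.0 / np.maximum(inv_next, 1e-12)   # 数值保护(示意处理)
    return x

# 玩具例子:泊松型保真项 f(x) = Σ (x_i - b_i log x_i),最优解为 x = b
b = np.array([2.0, 5.0, 1.0])
print(mirror_descent_burg(lambda x: 1.0 - b / x, x0=np.ones(3)))  # ≈ [2, 5, 1]
```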

链接: https://arxiv.org/abs/2507.11461
作者: Christian Daniele,Silvia Villa,Samuel Vaiter,Luca Calatroni
机构: 未知
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep Equilibrium Models (DEQs) are implicit neural networks with fixed points, which have recently gained attention for learning image regularization functionals, particularly in settings involving Gaussian fidelities, where assumptions on the forward operator ensure contractiveness of standard (proximal) Gradient Descent operators. In this work, we extend the application of DEQs to Poisson inverse problems, where the data fidelity term is more appropriately modeled by the Kullback-Leibler divergence. To this end, we introduce a novel DEQ formulation based on Mirror Descent defined in terms of a tailored non-Euclidean geometry that naturally adapts with the structure of the data term. This enables the learning of neural regularizers within a principled training framework. We derive sufficient conditions to guarantee the convergence of the learned reconstruction scheme and propose computational strategies that enable both efficient training and fully parameter-free inference. Numerical experiments show that our method outperforms traditional model-based approaches and it is comparable to the performance of Bregman Plug-and-Play methods, while mitigating their typical drawbacks - namely, sensitivity to initialization and careful tuning of hyperparameters. The code is publicly available at this https URL.
zh

[CV-81] U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV MICCAI2025

【速读】:该论文旨在解决医疗图像分割中因受限资源环境导致的可及性不平等问题,提出一种轻量级且高性能的解决方案。现有方法如U-Net及其变体由于受限的全局有效感受野(Effective Receptive Field, ERF),难以捕捉长程依赖关系。其关键解决方案是提出U-RWKV框架,该框架基于循环加权键值(Recurrent Weighted Key-Value, RWKV)架构,在O(N)计算成本下实现高效的长程建模。核心创新包括方向自适应RWKV模块(Direction-Adaptive RWKV Module, DARM)和阶段自适应挤压与激励模块(Stage-Adaptive Squeeze-and-Excitation Module, SASE),分别用于跨图像上下文信息的聚合以及不同特征提取阶段的动态适应,从而在保持高计算效率的同时提升分割性能。

链接: https://arxiv.org/abs/2507.11415
作者: Hongbo Ye,Fenghe Tang,Peiang Zhao,Zhen Huang,Dexin Zhao,Minghao Bian,S.Kevin Zhou
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI2025

点击查看摘要

Abstract:Achieving equity in healthcare accessibility requires lightweight yet high-performance solutions for medical image segmentation, particularly in resource-limited settings. Existing methods like U-Net and its variants often suffer from limited global Effective Receptive Fields (ERFs), hindering their ability to capture long-range dependencies. To address this, we propose U-RWKV, a novel framework leveraging the Recurrent Weighted Key-Value(RWKV) architecture, which achieves efficient long-range modeling at O(N) computational cost. The framework introduces two key innovations: the Direction-Adaptive RWKV Module(DARM) and the Stage-Adaptive Squeeze-and-Excitation Module(SASE). DARM employs Dual-RWKV and QuadScan mechanisms to aggregate contextual cues across images, mitigating directional bias while preserving global context and maintaining high computational efficiency. SASE dynamically adapts its architecture to different feature extraction stages, balancing high-resolution detail preservation and semantic relationship capture. Experiments demonstrate that U-RWKV achieves state-of-the-art segmentation performance with high computational efficiency, offering a practical solution for democratizing advanced medical imaging technologies in resource-constrained environments. The code is available at this https URL.
zh

[CV-82] Stochastic Entanglement Configuration for Constructive Entanglement Topologies in Quantum Machine Learning with Application to Cardiac MRI

【Quick Read】: This paper addresses the limitation that variational quantum circuits (VQCs) in quantum machine learning (QML) typically use fixed entanglement topologies that cannot adapt to task requirements, capping their potential advantage over classical models. The key is a novel stochastic entanglement configuration method that systematically generates diverse entanglement topologies to identify constructive configurations, i.e., topologies that boost hybrid-model performance. Each configuration is encoded as a stochastic binary matrix denoting directed entanglement between qubits, with entanglement density and per-qubit constraints as key metrics, enabling scalable exploration of the space of candidate entanglement topologies.

Link: https://arxiv.org/abs/2507.11401
Authors: Mehri Mehrnia,Mohammed S.M. Elbaz
Institutions: Unknown
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted for publication at IEEE International Conference on Quantum Computing and Engineering (QCE) 2025

Click to view abstract

Abstract:Efficient entanglement strategies are essential for advancing variational quantum circuits (VQCs) for quantum machine learning (QML). However, most current approaches use fixed entanglement topologies that are not adaptive to task requirements, limiting potential gains over classical models. We introduce a novel stochastic entanglement configuration method that systematically generates diverse entanglement topologies to identify a subspace of constructive entanglement configurations, defined as entanglement topologies that boost hybrid model performance (e.g., classification accuracy) beyond classical baselines. Each configuration is encoded as a stochastic binary matrix, denoting directed entanglement between qubits. This enables scalable exploration of the hyperspace of candidate entanglement topologies using entanglement density and per-qubit constraints as key metrics. We define unconstrained and constrained sampling modes, controlling entanglement per qubit. Using our method, 400 stochastic configurations were generated and evaluated in a hybrid QML for cardiac MRI disease classification. We identified 64 (16%) novel constructive entanglement configurations that consistently outperformed the classical baseline. Ensemble aggregation of top-performing configurations achieved ~0.92 classification accuracy, exceeding the classical model (~0.87) by over 5%. Compared to four conventional topologies (ring, nearest neighbor, no entanglement, fully entangled), none surpassed the classical baseline (maximum accuracy ~0.82), while our configurations delivered up to ~20% higher accuracy. Thus, highlighting the robustness and generalizability of the identified constructive entanglements.
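
The sampling step of the method is easy to picture in code. The sketch below is a plausible rendering under our own assumptions (function names and the constraint handling are illustrative; the hybrid quantum-classical evaluation loop is omitted): it draws stochastic binary matrices encoding directed qubit entanglement, in both unconstrained and per-qubit-capped modes.

```python
import numpy as np

def sample_entanglement_topology(n_qubits, density, max_per_qubit=None, rng=None):
    """Sample one stochastic entanglement configuration.

    Returns a binary matrix M where M[i, j] = 1 denotes a directed
    entangling gate from qubit i to qubit j. `density` is the target
    fraction of possible directed pairs; `max_per_qubit` optionally caps
    outgoing entanglements per qubit (the 'constrained' sampling mode).
    """
    rng = rng or np.random.default_rng()
    M = (rng.random((n_qubits, n_qubits)) < density).astype(int)
    np.fill_diagonal(M, 0)                       # no self-entanglement
    if max_per_qubit is not None:
        for i in range(n_qubits):
            ones = np.flatnonzero(M[i])
            if len(ones) > max_per_qubit:
                drop = rng.choice(ones, len(ones) - max_per_qubit, replace=False)
                M[i, drop] = 0
    return M

def entanglement_density(M):
    n = M.shape[0]
    return M.sum() / (n * (n - 1))

# e.g., generate a pool of 400 candidate topologies, as in the study.
configs = [sample_entanglement_topology(4, density=0.4, max_per_qubit=2)
           for _ in range(400)]
```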
zh

[CV-83] Focus on Texture: Rethinking Pre-training in Masked Autoencoders for Medical Image Classification MICCAI2025

【Quick Read】: This paper tackles the problem that, in medical imaging, conventional self-supervised representation learning based on pixel-wise mean squared error (MSE) fails to preserve texture cues, limiting performance on visual-abnormality classification. The key is a reconstruction loss based on the Gray Level Co-occurrence Matrix (GLCM): by matching GLCM matrices, which capture intensity and spatial relationships in an image, the method better preserves morphological features and improves downstream task performance.

Link: https://arxiv.org/abs/2507.10869
Authors: Chetan Madan,Aarjav Satia,Soumen Basu,Pankaj Gupta,Usha Dutta,Chetan Arora
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at MICCAI 2025

Click to view abstract

Abstract:Masked Autoencoders (MAEs) have emerged as a dominant strategy for self-supervised representation learning in natural images, where models are pre-trained to reconstruct masked patches with a pixel-wise mean squared error (MSE) between original and reconstructed RGB values as the loss. We observe that MSE encourages blurred image re-construction, but still works for natural images as it preserves dominant edges. However, in medical imaging, when the texture cues are more important for classification of a visual abnormality, the strategy fails. Taking inspiration from Gray Level Co-occurrence Matrix (GLCM) feature in Radiomics studies, we propose a novel MAE based pre-training framework, GLCM-MAE, using reconstruction loss based on matching GLCM. GLCM captures intensity and spatial relationships in an image, hence proposed loss helps preserve morphological features. Further, we propose a novel formulation to convert matching GLCM matrices into a differentiable loss function. We demonstrate that unsupervised pre-training on medical images with the proposed GLCM loss improves representations for downstream tasks. GLCM-MAE outperforms the current state-of-the-art across four tasks - gallbladder cancer detection from ultrasound images by 2.1%, breast cancer detection from ultrasound by 3.1%, pneumonia detection from x-rays by 0.5%, and COVID detection from CT by 0.6%. Source code and pre-trained models are available at: this https URL.
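
To see how a GLCM-matching loss can be made differentiable, here is one possible construction (a sketch under our own assumptions, not the authors' exact formulation): pixels are soft-assigned to gray levels with a temperature-controlled softmax, so gradients flow from the co-occurrence statistics back to the reconstruction.

```python
import torch

def soft_glcm(img, n_levels=8, offset=(0, 1), temp=0.1):
    """Differentiable GLCM of a grayscale image with values in [0, 1].

    Each pixel gets a soft one-hot membership over intensity levels
    (softmax of negative squared distance to bin centers), so the
    co-occurrence counts are differentiable in the pixel values.
    """
    centers = torch.linspace(0, 1, n_levels, device=img.device)
    memb = torch.softmax(-(img.unsqueeze(-1) - centers) ** 2 / temp, dim=-1)
    dy, dx = offset
    a = memb[: img.shape[0] - dy, : img.shape[1] - dx]   # reference pixels
    b = memb[dy:, dx:]                                   # offset neighbors
    glcm = torch.einsum('hwi,hwj->ij', a, b)             # soft co-occurrences
    return glcm / glcm.sum()

def glcm_loss(recon, target):
    return (soft_glcm(recon) - soft_glcm(target)).abs().sum()

x = torch.rand(64, 64)
x_hat = torch.rand(64, 64, requires_grad=True)
glcm_loss(x_hat, x).backward()   # gradients reach the reconstruction
```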
zh

[CV-84] AGFS-Tractometry: A Novel Atlas-Guided Fine-Scale Tractometry Approach for Enhanced Along-Tract Group Statistical Comparison Using Diffusion MRI Tractography

【Quick Read】: This paper addresses how to detect local white matter (WM) differences between populations more precisely in diffusion MRI (dMRI) tractography. The key is a novel atlas-guided fine-scale tractometry method, AGFS-Tractometry, which leverages tract spatial information and nonparametric permutation testing to achieve consistent fine-scale along-tract parcellation and statistical analysis, improving both sensitivity and specificity in detecting local WM differences.

Link: https://arxiv.org/abs/2507.10601
Authors: Ruixi Zheng,Wei Zhang,Yijie Li,Xi Zhu,Zhou Lan,Jarrett Rushmore,Yogesh Rathi,Nikos Makris,Lauren J. O’Donnell,Fan Zhang
Institutions: University of Electronic Science and Technology of China; Brigham and Women’s Hospital; Harvard Medical School; Boston University
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Methodology (stat.ME)
Comments: 31 pages and 7 figures

Click to view abstract

Abstract:Diffusion MRI (dMRI) tractography is currently the only method for in vivo mapping of the brain’s white matter (WM) connections. Tractometry is an advanced tractography analysis technique for along-tract profiling to investigate the morphology and microstructural properties along the fiber tracts. Tractometry has become an essential tool for studying local along-tract differences between different populations (e.g., health vs disease). In this study, we propose a novel atlas-guided fine-scale tractometry method, namely AGFS-Tractometry, that leverages tract spatial information and permutation testing to enhance the along-tract statistical analysis between populations. There are two major contributions in AGFS-Tractometry. First, we create a novel atlas-guided tract profiling template that enables consistent, fine-scale, along-tract parcellation of subject-specific fiber tracts. Second, we propose a novel nonparametric permutation testing group comparison method to enable simultaneous analysis across all along-tract parcels while correcting for multiple comparisons. We perform experimental evaluations on synthetic datasets with known group differences and in vivo real data. We compare AGFS-Tractometry with two state-of-the-art tractometry methods, including Automated Fiber-tract Quantification (AFQ) and BUndle ANalytics (BUAN). Our results show that the proposed AGFS-Tractometry obtains enhanced sensitivity and specificity in detecting local WM differences. In the real data analysis experiments, AGFS-Tractometry can identify more regions with significant differences, which are anatomically consistent with the existing literature. Overall, these demonstrate the ability of AGFS-Tractometry to detect subtle or spatially localized WM group-level differences. The created tract profiling template and related code are available at: this https URL.
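
The nonparametric permutation-testing component can be sketched compactly. The snippet below is our simplification (a group-mean difference statistic with max-statistic correction across parcels; the paper's statistic may differ) and illustrates simultaneous along-tract comparison with family-wise error control.

```python
import numpy as np

def permutation_test_along_tract(group_a, group_b, n_perm=5000, rng=None):
    """Compare two groups across all along-tract parcels at once.

    group_a: (n_a, n_parcels), group_b: (n_b, n_parcels), e.g. FA profiles.
    Uses the max statistic over parcels to correct for multiple comparisons.
    """
    rng = rng or np.random.default_rng(0)
    data = np.vstack([group_a, group_b])
    n_a = len(group_a)
    observed = np.abs(group_a.mean(0) - group_b.mean(0))
    max_null = np.empty(n_perm)
    for p in range(n_perm):
        idx = rng.permutation(len(data))
        perm_a, perm_b = data[idx[:n_a]], data[idx[n_a:]]
        max_null[p] = np.abs(perm_a.mean(0) - perm_b.mean(0)).max()
    # Corrected p-value per parcel: fraction of permutation maxima that
    # meet or exceed the observed parcel statistic.
    pvals = (max_null[None, :] >= observed[:, None]).mean(1)
    return observed, pvals

rng = np.random.default_rng(1)
a = rng.normal(0.45, 0.05, size=(25, 100))   # synthetic profiles, 100 parcels
b = rng.normal(0.45, 0.05, size=(30, 100))
b[:, 40:50] += 0.03                          # localized group effect
stat, pvals = permutation_test_along_tract(a, b, n_perm=2000)
```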
zh

[CV-85] Comparative Analysis of Vision Transformers and Traditional Deep Learning Approaches for Automated Pneumonia Detection in Chest X-Rays

【Quick Read】: This paper addresses the need for rapid and accurate clinical detection of pneumonia, such as that induced by COVID-19. The key lies in comparing traditional machine learning with state-of-the-art deep learning approaches for automated pneumonia detection from chest X-rays (CXRs). The results show that Vision Transformers (ViTs), in particular the Cross-ViT architecture, outperform conventional convolutional neural network (CNN) approaches with 88.25% accuracy and 99.42% recall, and that architectural choice affects performance more than model size, making ViTs a promising direction for automated pneumonia detection.

Link: https://arxiv.org/abs/2507.10589
Authors: Gaurav Singh
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Pneumonia, particularly when induced by diseases like COVID-19, remains a critical global health challenge requiring rapid and accurate diagnosis. This study presents a comprehensive comparison of traditional machine learning and state-of-the-art deep learning approaches for automated pneumonia detection using chest X-rays (CXRs). We evaluate multiple methodologies, ranging from conventional machine learning techniques (PCA-based clustering, Logistic Regression, and Support Vector Classification) to advanced deep learning architectures including Convolutional Neural Networks (Modified LeNet, DenseNet-121) and various Vision Transformer (ViT) implementations (Deep-ViT, Compact Convolutional Transformer, and Cross-ViT). Using a dataset of 5,856 pediatric CXR images, we demonstrate that Vision Transformers, particularly the Cross-ViT architecture, achieve superior performance with 88.25% accuracy and 99.42% recall, surpassing traditional CNN approaches. Our analysis reveals that architectural choices impact performance more significantly than model size, with Cross-ViT’s 75M parameters outperforming larger models. The study also addresses practical considerations including computational efficiency, training requirements, and the critical balance between precision and recall in medical diagnostics. Our findings suggest that Vision Transformers offer a promising direction for automated pneumonia detection, potentially enabling more rapid and accurate diagnosis during health crises.
zh

Artificial Intelligence

[AI-0] How Many Instructions Can LLMs Follow at Once?

【Quick Read】: This paper addresses the lack of characterization of large language model (LLM) instruction-following at high instruction densities. Existing benchmarks only evaluate models on tasks with one or a few instructions, whereas real applications may require following dozens or even hundreds of instructions simultaneously. The key is IFScale, a benchmark of 500 keyword-inclusion instructions for a business-report-writing task, built to measure how instruction-following performance degrades as instruction density increases and to analyze the resulting degradation patterns.

Link: https://arxiv.org/abs/2507.11538
Authors: Daniel Jaroslawicz,Brendan Whiting,Parth Shah,Karime Maamari
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities have not yet been characterized, as existing benchmarks only evaluate models on tasks with a single or few instructions. We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases. We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the max density of 500 instructions. Our analysis reveals model size and reasoning capability to correlate with 3 distinct performance degradation patterns, bias towards earlier instructions, and distinct categories of instruction-following errors. Our insights can help inform design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs. We open-source the benchmark and all results for further analysis at this https URL.
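
Scoring keyword-inclusion instructions is mechanically simple, which is part of what makes the benchmark scalable. A plausible scorer (the released benchmark's exact matching rules may differ) looks like this:

```python
import re

def keyword_inclusion_accuracy(report: str, keywords: list[str]) -> float:
    """Fraction of keyword-inclusion instructions a report satisfies.

    Each instruction is of the form 'include the word X'; scoring here
    is a whole-word, case-insensitive match.
    """
    text = report.lower()
    hits = sum(bool(re.search(rf"\b{re.escape(k.lower())}\b", text))
               for k in keywords)
    return hits / len(keywords)

# Usage: sweep the instruction density from 10 to 500 keywords and plot
# accuracy per density to reproduce degradation curves like those above.
print(keyword_inclusion_accuracy("Revenue grew; churn fell.", ["revenue", "margin"]))
```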
zh

[AI-1] DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

【Quick Read】: This paper addresses the shortage of benchmarks for systematically evaluating large language model (LLM) agents in industrial settings, specifically technical-drawing revision tasks in civil engineering. The key is DrafterBench, an open-source benchmark with twelve task types, 46 customized functions/tools, and 1,920 tasks in total, which comprehensively evaluates agents' ability to interpret intricate long-context instructions, leverage prior knowledge, and adapt to dynamic instruction quality via implicit policy awareness.

Link: https://arxiv.org/abs/2507.11527
Authors: Yinsheng Li,Zhen Dong,Yi Shao
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Project page: this https URL

Click to view abstract

Abstract:Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for tasks automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example, in Civil Engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representation task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark to rigorously test AI agents’ proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at this https URL, with the test set hosted at this https URL.
zh

[AI-2] Illuminating the Three Dogmas of Reinforcement Learning under Evolutionary Light

【Quick Read】: This paper revisits three core dogmas of reinforcement learning (RL) concerning the definition of agency, the objective of learning, and the scope of the reward hypothesis. The key is a framework inspired by open-ended evolutionary theory: evolutionary insights enrich the "adaptation-rather-than-search" view of learning, and analogies with evolutionary fitness illuminate the scalar-reward versus multi-objective debate on the limits of the reward hypothesis. The paper further argues that thermodynamic ideas from origins-of-life theory are needed to ground a rigorous account of agency and of resource-constrained reinforcement learning in biological systems.

Link: https://arxiv.org/abs/2507.11482
Authors: Mani Hamidi,Terrence W. Deacon
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Three core tenets of reinforcement learning (RL)–concerning the definition of agency, the objective of learning, and the scope of the reward hypothesis–have been highlighted as key targets for conceptual revision, with major implications for theory and application. We propose a framework, inspired by open-ended evolutionary theory, to reconsider these three “dogmas.” We revisit each assumption and address related concerns raised alongside them. To make our arguments relevant to RL as a model of biological learning, we first establish that evolutionary dynamics can plausibly operate within living brains over an individual’s lifetime, and are not confined to cross-generational processes. We begin by revisiting the second dogma, drawing on evolutionary insights to enrich the “adaptation-rather-than-search” view of learning. We then address the third dogma regarding the limits of the reward hypothesis, using analogies from evolutionary fitness to illuminate the scalar reward vs. multi-objective debate. After discussing practical implications for exploration in RL, we turn to the first–and arguably most fundamental–issue: the absence of a formal account of agency. We argue that unlike the other two problems, the evolutionary paradigm alone cannot resolve the agency question, though it gestures in a productive direction. We advocate integrating ideas from origins-of-life theory, where the thermodynamics of sustenance and replication offer promising foundations for understanding agency and resource-constrained reinforcement learning in biological systems.
zh

[AI-3] Perspective-Aware AI in Extended Reality

【Quick Read】: This paper addresses the shallow user modeling and limited cognitive context of current AI-enhanced Extended Reality (XR) systems. The key is Perspective-Aware AI (PAi) built on the Chronicles framework: reasoning-ready identity models learned from multimodal digital footprints that capture users' cognitive and experiential evolution, embedded in a closed-loop system that links dynamic user states with immersive environments to deliver interpretable, context-aware experiences.

Link: https://arxiv.org/abs/2507.11479
Authors: Daniel Platnick,Matti Gruener,Marjan Alirezaie,Kent Larson,Dava J. Newman,Hossein Rahnama
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments: Accepted to the International Conference on eXtended Reality (2025), 12 pages, 3 figures

Click to view abstract

Abstract:AI-enhanced Extended Reality (XR) aims to deliver adaptive, immersive experiences-yet current systems fall short due to shallow user modeling and limited cognitive context. We introduce Perspective-Aware AI in Extended Reality (PAiR), a foundational framework for integrating Perspective-Aware AI (PAi) with XR to enable interpretable, context-aware experiences grounded in user identity. PAi is built on Chronicles: reasoning-ready identity models learned from multimodal digital footprints that capture users’ cognitive and experiential evolution. PAiR employs these models in a closed-loop system linking dynamic user states with immersive environments. We present PAiR’s architecture, detailing its modules and system flow, and demonstrate its utility through two proof-of-concept scenarios implemented in the Unity-based OpenDome engine. PAiR opens a new direction for human-AI interaction by embedding perspective-based identity models into immersive systems.
zh

[AI-4] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

【Quick Read】: This paper considers the safety problems AI systems may pose at runtime, in particular detecting an intent to misbehave by monitoring the chain of thought (CoT) of generative AI models. The key is to exploit CoT expressed in human language as an observable and interpretable channel for overseeing and evaluating AI behavior. Although CoT monitoring is imperfect and may be fragile, its promise for AI safety warrants further research and investment alongside existing safety methods.

Link: https://arxiv.org/abs/2507.11473
Authors: Tomek Korbak,Mikita Balesni,Elizabeth Barnes,Yoshua Bengio,Joe Benton,Joseph Bloom,Mark Chen,Alan Cooney,Allan Dafoe,Anca Dragan,Scott Emmons,Owain Evans,David Farhi,Ryan Greenblatt,Dan Hendrycks,Marius Hobbhahn,Evan Hubinger,Geoffrey Irving,Erik Jenner,Daniel Kokotajlo,Victoria Krakovna,Shane Legg,David Lindner,David Luan,Aleksander Mądry,Julian Michael,Neel Nanda,Dave Orr,Jakub Pachocki,Ethan Perez,Mary Phuong,Fabien Roger,Joshua Saxe,Buck Shlegeris,Martín Soto,Eric Steinberger,Jasmine Wang,Wojciech Zaremba,Bowen Baker,Rohin Shah,Vlad Mikulik
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
zh

[AI-5] Modeling Code: Is Text All You Need?

【Quick Read】: This paper addresses the limited ability of current transformer-based models to reason about structured, analytical properties of code such as control flow and data flow. Prior work models these properties with structured data and graph neural networks, but lacks the generative capabilities and scale of modern large language models (LLMs). The key is a novel approach that combines the strengths of modeling code both as text and in more structured forms, improving performance on code understanding and generation tasks.

Link: https://arxiv.org/abs/2507.11467
Authors: Daniel Nichols,Konstantinos Parasyris,Harshitha Menon,Brian R. Bartoldson,Giorgis Georgakoudis,Tal Ben-Nun,Abhinav Bhatele
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Code LLMs have become extremely popular recently for modeling source code across a variety of tasks, such as generation, translation, and summarization. However, transformer-based models are limited in their capabilities to reason through structured, analytical properties of code, such as control and data flow. Previous work has explored the modeling of these properties with structured data and graph neural networks. However, these approaches lack the generative capabilities and scale of modern LLMs. In this work, we introduce a novel approach to combine the strengths of modeling both code as text and more structured forms.
zh

[AI-6] Toward Improving fNIRS Classification: A Study on Activation Functions in Deep Neural Architectures

【Quick Read】: This paper addresses the insufficient accuracy of deep learning models in the functional near-infrared spectroscopy (fNIRS) domain, where signal nonlinearity, low signal-to-noise ratio, and signal variability pose significant challenges. The key is a systematic evaluation of conventional and domain-specific activation functions for fNIRS classification, finding that symmetric activations such as Tanh and the absolute-value function Abs(x) can outperform the commonly used Rectified Linear Unit (ReLU) in some architectures. A Modified Absolute Function (MAF) is further introduced, and the results support the effectiveness of symmetric activation functions for performance gains.

Link: https://arxiv.org/abs/2507.11436
Authors: Behtom Adeli,John McLinden,Pankaj Pandey,Ming Shao,Yalda Shahriari
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Activation functions are critical to the performance of deep neural networks, particularly in domains such as functional near-infrared spectroscopy (fNIRS), where nonlinearity, low signal-to-noise ratio (SNR), and signal variability pose significant challenges to model accuracy. However, the impact of activation functions on deep learning (DL) performance in the fNIRS domain remains underexplored and lacks systematic investigation in the current literature. This study evaluates a range of conventional and field-specific activation functions for fNIRS classification tasks using multiple deep learning architectures, including the domain-specific fNIRSNet, AbsoluteNet, MDNN, and shallowConvNet (as the baseline), all tested on a single dataset recorded during an auditory task. To ensure a fair comparison, all networks were trained and tested using standardized preprocessing and consistent training parameters. The results show that symmetrical activation functions such as Tanh and the Absolute value function Abs(x) can outperform commonly used functions like the Rectified Linear Unit (ReLU), depending on the architecture. Additionally, a focused analysis of the role of symmetry was conducted using a Modified Absolute Function (MAF), with results further supporting the effectiveness of symmetrical activation functions on performance gains. These findings underscore the importance of selecting proper activation functions that align with the signal characteristics of fNIRS data.
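
For experimentation, the symmetric activations are trivial to drop into a PyTorch model. Note that the abstract does not define the Modified Absolute Function (MAF), so the learnable-slope variant below is purely a hypothetical stand-in:

```python
import torch
import torch.nn as nn

class Abs(nn.Module):
    """Absolute-value activation, symmetric around zero."""
    def forward(self, x):
        return torch.abs(x)

class ModifiedAbs(nn.Module):
    """Hypothetical stand-in for the paper's MAF (exact form not given
    in the abstract): an absolute value with a learnable slope."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
    def forward(self, x):
        return self.alpha * torch.abs(x)

# Drop-in replacement for ReLU in a small fNIRS classifier:
net = nn.Sequential(nn.Conv1d(16, 32, kernel_size=5), Abs(),
                    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 2))
logits = net(torch.randn(8, 16, 200))   # batch of 8, 16 channels, 200 samples
```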
zh

[AI-7] Local Pairwise Distance Matching for Backpropagation-Free Reinforcement Learning ECAI2025

【Quick Read】: This paper addresses the vanishing/exploding gradients and the need to store forward-pass activations that arise when training neural networks with backpropagation (BP) in reinforcement learning (RL). The key is a novel method that trains each layer independently using local signals during the forward pass: local, layer-wise losses based on matching pairwise distances (the principle of multi-dimensional scaling), optionally enhanced with reward-driven guidance. This removes the backward pass and the storage of intermediate activations, yielding a backpropagation-free training method.

Link: https://arxiv.org/abs/2507.11367
Authors: Daniel Tanneberg
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: accepted at the European Conference on Artificial Intelligence (ECAI 2025)

Click to view abstract

Abstract:Training neural networks with reinforcement learning (RL) typically relies on backpropagation (BP), necessitating storage of activations from the forward pass for subsequent backward updates. Furthermore, backpropagating error signals through multiple layers often leads to vanishing or exploding gradients, which can degrade learning performance and stability. We propose a novel approach that trains each layer of the neural network using local signals during the forward pass in RL settings. Our approach introduces local, layer-wise losses leveraging the principle of matching pairwise distances from multi-dimensional scaling, enhanced with optional reward-driven guidance. This method allows each hidden layer to be trained using local signals computed during forward propagation, thus eliminating the need for backward passes and storing intermediate activations. Our experiments, conducted with policy gradient methods across common RL benchmarks, demonstrate that this backpropagation-free method achieves competitive performance compared to their classical BP-based counterpart. Additionally, the proposed method enhances stability and consistency within and across runs, and improves performance especially in challenging environments.
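
A minimal sketch of the layer-local training signal (our reading of the idea, without the optional reward-driven guidance) is shown below: each layer matches the pairwise distances among its outputs to those among its detached inputs, so no gradient ever crosses a layer boundary.

```python
import torch
import torch.nn.functional as F

def local_distance_loss(h_in, h_out):
    """Layer-local loss: pairwise distances between samples at the layer's
    output should match those at its (detached) input, following the
    multi-dimensional-scaling principle."""
    return F.mse_loss(torch.cdist(h_out, h_out),
                      torch.cdist(h_in, h_in).detach())

# Forward-only training of a two-layer net: no backward pass across layers.
layers = [torch.nn.Linear(8, 32), torch.nn.Linear(32, 32)]
opts = [torch.optim.SGD(l.parameters(), lr=1e-2) for l in layers]
x = torch.randn(64, 8)                  # a batch of observations
h = x
for layer, opt in zip(layers, opts):
    h_out = torch.tanh(layer(h))        # h carries no graph from earlier layers
    loss = local_distance_loss(h, h_out)
    opt.zero_grad(); loss.backward(); opt.step()
    h = h_out.detach()                  # block gradients to earlier layers
```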
zh

[AI-8] Foundation Models for Logistics: Toward Certifiable Conversational Planning Interfaces

【Quick Read】: This paper targets life-critical decision making in logistics, which requires both domain expertise and rapid, continuous replanning. Traditional methods such as integer programming satisfy user-defined logical constraints but are slow and assume an idealized environment model that ignores uncertainty, while large language models (LLMs) can handle uncertainty and accelerate replanning but are prone to misinterpretations and hallucinations that jeopardize safety and cost. The key is a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation: user requests are converted into structured planning specifications, field- and token-level uncertainty is quantified, and an interactive clarification loop is invoked whenever confidence falls below an adaptive threshold, enabling safer and more efficient decision support.

Link: https://arxiv.org/abs/2507.11352
Authors: Yunhao Yang,Neel P. Bhatt,Christian Ellis,Alvaro Velasquez,Zhangyang Wang,Ufuk Topcu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments:

Click to view abstract

Abstract:Logistics operators, from battlefield coordinators rerouting airlifts ahead of a storm to warehouse managers juggling late trucks, often face life-critical decisions that demand both domain expertise and rapid and continuous replanning. While popular methods like integer programming yield logistics plans that satisfy user-defined logical constraints, they are slow and assume an idealized mathematical model of the environment that does not account for uncertainty. On the other hand, large language models (LLMs) can handle uncertainty and promise to accelerate replanning while lowering the barrier to entry by translating free-form utterances into executable plans, yet they remain prone to misinterpretations and hallucinations that jeopardize safety and cost. We introduce a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation. It converts user requests into structured planning specifications, quantifies its own uncertainty at the field and token level, and invokes an interactive clarification loop whenever confidence falls below an adaptive threshold. A lightweight model, fine-tuned on just 100 uncertainty-filtered examples, surpasses the zero-shot performance of GPT-4.1 while cutting inference latency by nearly 50%. These preliminary results highlight a practical path toward certifiable, real-time, and user-aligned decision-making for complex logistics.
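
The uncertainty-gated clarification loop can be sketched in a few lines. In the snippet below, `llm_parse` and `ask_user` are assumed interfaces (not a real API), and the adaptive threshold is simplified to a constant:

```python
def plan_with_clarification(request, llm_parse, ask_user, threshold=0.85):
    """Parse a free-form request into a structured spec with per-field
    confidences, asking for clarification until every field clears the
    threshold. `llm_parse` returns (spec, confidences), e.g.
    ({'dest': 'FOB-3'}, {'dest': 0.62}); both callables are assumptions."""
    spec, conf = llm_parse(request)
    while min(conf.values()) < threshold:
        field = min(conf, key=conf.get)       # least certain field first
        answer = ask_user(f"Please clarify '{field}' in: {request!r}")
        request = request + " | clarification: " + answer
        spec, conf = llm_parse(request)
    return spec                               # spec now safe to hand to a planner
```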
zh

[AI-9] Acting and Planning with Hierarchical Operational Models on a Mobile Robot: A Study with RAE+UPOM

【Quick Read】: This paper addresses the inconsistency between symbolic planning models and the rich control structures actually running on a robot during task execution. The key is an integrated actor-planner system (RAE+UPOM) that shares hierarchical operational models for both acting and planning, interleaving the Reactive Acting Engine (RAE) with UPOM, an anytime UCT-like Monte Carlo planner.

Link: https://arxiv.org/abs/2507.11345
Authors: Oscar Lima,Marc Vinci,Sunandita Patra,Sebastian Stock,Joachim Hertzberg,Martin Atzmueller,Malik Ghallab,Dana Nau,Paolo Traverso
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted in ECMR 2025 conference

Click to view abstract

Abstract:Robotic task execution faces challenges due to the inconsistency between symbolic planner models and the rich control structures actually running on the robot. In this paper, we present the first physical deployment of an integrated actor-planner system that shares hierarchical operational models for both acting and planning, interleaving the Reactive Acting Engine (RAE) with an anytime UCT-like Monte Carlo planner (UPOM). We implement RAE+UPOM on a mobile manipulator in a real-world deployment for an object collection task. Our experiments demonstrate robust task execution under action failures and sensor noise, and provide empirical insights into the interleaved acting-and-planning decision making process.
zh

[AI-10] CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking ACM-MM2025

【Quick Read】: This paper addresses the poor generalization of traditional data-driven approaches to demand-driven navigation (DDN) in unseen scenarios. The key is CogDDN, a VLM-based framework that emulates human cognitive and learning mechanisms by integrating fast- and slow-thinking systems to selectively identify the objects essential to fulfilling user demands, semantically aligning detected objects with the given instructions to determine targets. CogDDN further incorporates a dual-process decision module that combines a heuristic process with an analytic process, strengthened by Chain of Thought (CoT) reasoning, improving navigation accuracy and adaptability.

Link: https://arxiv.org/abs/2507.11334
Authors: Yuehao Huang,Liang Liu,Shuangming Lei,Yukai Ma,Hao Su,Jianbiao Mei,Pengxiang Zhao,Yaqing Gu,Yong Liu,Jiajun Lv
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted by ACM MM 2025

Click to view abstract

Abstract:Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at this https URL.
zh

[AI-11] SystolicAttention: Fusing FlashAttention within a Single Systolic Array

【Quick Read】: This paper addresses the low utilization that current systolic-array-based accelerators achieve when executing the FlashAttention algorithm. FlashAttention requires frequently interleaved matrix multiplications and softmax operations, while systolic arrays reach high utilization only on consecutive, large matrix multiplications, so data swaps between the array and external vector units degrade utilization; softmax also involves many non-matrix operations ill-suited to systolic arrays, and concurrent execution of matmul and softmax causes register-file and SRAM port contention. The key is the FSA architecture, whose core is the SystolicAttention scheduling algorithm, which maps FlashAttention operations onto the systolic array with fine-grained, element-wise overlap, significantly improving array utilization while preserving the original floating-point operation order for numerical stability.

Link: https://arxiv.org/abs/2507.11331
Authors: Jiawei Lin,Guokai Chen,Yuanlong Li,Thomas Bourgeat
Institutions: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention. Systolic arrays can only achieve high utilization for consecutive and large matrix multiplications. In contrast, FlashAttention requires frequently interleaved matrix multiplications and softmax operations. The frequent data swaps between the systolic array and external vector units result in low systolic array utilization. This is further exacerbated by the fact that softmax involves numerous non-matrix operations, which are not well-suited for systolic arrays. Moreover, the concurrent execution of matrix multiplication on systolic arrays and softmax on vector units leads to register file and SRAM port contention, further degrading performance. To overcome these limitations, we propose FSA, an enhanced systolic array architecture that enables the entire FlashAttention algorithm to run entirely within a single systolic array, eliminating the need for external vector units. At the core of FSA is SystolicAttention, a novel scheduling algorithm that maps FlashAttention operations onto systolic arrays with fine-grained, element-wise overlap. This significantly improves array utilization while preserving the original floating-point operation order to maintain numerical stability. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art commercial accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS NeuronCore-v2 and Google TPUv5e, respectively, with only about 10% area overhead.
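
FSA itself is hardware, but the computation it schedules is easy to model in software. The sketch below shows the streaming "online softmax" pattern that FlashAttention tiles and that SystolicAttention interleaves on the array; the block size and shapes are arbitrary illustrative choices.

```python
import numpy as np

def streaming_attention(Q, K, V, block=16):
    """Row-wise attention with running max/sum rescaling: the interleaved
    matmul + softmax pattern that FlashAttention tiles, computed in the
    original floating-point operation order."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        m, s = -np.inf, 0.0                 # running max, softmax denominator
        acc = np.zeros(V.shape[1])
        for j in range(0, n, block):
            scores = Q[i] @ K[j:j + block].T / np.sqrt(d)
            m_new = max(m, scores.max())
            scale = np.exp(m - m_new)       # rescale earlier partial sums
            p = np.exp(scores - m_new)
            s = s * scale + p.sum()
            acc = acc * scale + p @ V[j:j + block]
            m = m_new
        out[i] = acc / s
    return out

# Sanity check against naive attention.
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 64, 32))
logits = Q @ K.T / np.sqrt(32)
w = np.exp(logits - logits.max(1, keepdims=True))
assert np.allclose(streaming_attention(Q, K, V), (w / w.sum(1, keepdims=True)) @ V)
```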
zh

[AI-12] Contestability in Quantitative Argumentation

【Quick Read】: This paper addresses how to align AI-driven decisions with human preferences, specifically the contestability problem for Edge-Weighted Quantitative Bipolar Argumentation Frameworks (EW-QBAFs): how to modify edge weights (e.g., preferences) so that a specific argument of interest (the topic argument) attains a desired strength. The key is gradient-based relation attribution explanations (G-RAEs), which quantify the sensitivity of the topic argument's strength to changes in individual edge weights, providing interpretable guidance for weight adjustment; building on G-RAEs, an iterative algorithm progressively adjusts the edge weights until the desired strength is reached.

Link: https://arxiv.org/abs/2507.11323
Authors: Xiang Yin,Nico Potyka,Antonio Rago,Timotheus Kampik,Francesca Toni
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Contestable AI requires that AI-driven decisions align with human preferences. While various forms of argumentation have been shown to support contestability, Edge-Weighted Quantitative Bipolar Argumentation Frameworks (EW-QBAFs) have received little attention. In this work, we show how EW-QBAFs can be deployed for this purpose. Specifically, we introduce the contestability problem for EW-QBAFs, which asks how to modify edge weights (e.g., preferences) to achieve a desired strength for a specific argument of interest (i.e., a topic argument). To address this problem, we propose gradient-based relation attribution explanations (G-RAEs), which quantify the sensitivity of the topic argument’s strength to changes in individual edge weights, thus providing interpretable guidance for weight adjustments towards contestability. Building on G-RAEs, we develop an iterative algorithm that progressively adjusts the edge weights to attain the desired strength. We evaluate our approach experimentally on synthetic EW-QBAFs that simulate the structural characteristics of personalised recommender systems and multi-layer perceptrons, and demonstrate that it can solve the problem effectively.
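
The gradient-guided adjustment is easy to prototype with automatic differentiation. The sketch below uses a toy sigmoid-based gradual semantics of our own choosing (not necessarily the semantics studied in the paper), purely to illustrate G-RAE-style sensitivities driving an iterative weight update:

```python
import torch

def strengths(W, base, iters=64):
    """Fixed point of a toy gradual semantics: s = sigmoid(base + W @ s).
    Positive weights act as supports, negative ones as attacks. (A
    simplified semantics chosen only for illustration.)"""
    s = torch.sigmoid(base)
    for _ in range(iters):
        s = torch.sigmoid(base + W @ s)
    return s

base = torch.tensor([0.2, -0.1, 0.4])            # base scores of 3 arguments
W = torch.tensor([[0.0, 0.8, -0.5],              # weighted edges into argument 0
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]], requires_grad=True)
topic, target = 0, 0.9
for step in range(500):                          # iterative contestability loop
    loss = (strengths(W, base)[topic] - target) ** 2
    grad, = torch.autograd.grad(loss, W)         # G-RAE-style sensitivities
    with torch.no_grad():
        W -= 1.0 * grad                          # adjust preferences (edge weights)
print(float(strengths(W, base)[topic]))          # approaches the desired 0.9
```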
zh

[AI-13] Opus: A Prompt Intention Framework for Complex Workflow Generation

【Quick Read】: This paper aims to improve the logical consistency of complex workflow generation when user queries are ambiguous or carry multiple intents. The key is the Opus Prompt Intention Framework, which adds an intermediate intention-capture layer between user queries and workflow generation: it extracts Workflow Signals from the query, interprets them into structured Workflow Intention objects, and generates workflows from these intentions, improving the quality and reliability of LLM-generated results.

Link: https://arxiv.org/abs/2507.11288
Authors: Théo Fagnoni,Mahsun Altin,Chia En Chung,Phillip Kingston,Alan Tuning,Dana O. Mohamed,Inès Adnani
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 39 pages, 24 figures

Click to view abstract

Abstract:This paper introduces the Opus Prompt Intention Framework, designed to improve complex Workflow Generation with instruction-tuned Large Language Models (LLMs). We propose an intermediate Intention Capture layer between user queries and Workflow Generation, implementing the Opus Workflow Intention Framework, which consists of extracting Workflow Signals from user queries, interpreting them into structured Workflow Intention objects, and generating Workflows based on these Intentions. Our results show that this layer enables LLMs to produce logical and meaningful outputs that scale reliably as query complexity increases. On a synthetic benchmark of 1,000 multi-intent query-Workflow(s) pairs, applying the Opus Prompt Intention Framework to Workflow Generation yields consistent improvements in semantic Workflow similarity metrics. In this paper, we introduce the Opus Prompt Intention Framework by applying the concepts of Workflow Signal and Workflow Intention to LLM-driven Workflow Generation. We present a reproducible, customizable LLM-based Intention Capture system to extract Workflow Signals and Workflow Intentions from user queries. Finally, we provide empirical evidence that the proposed system significantly improves Workflow Generation quality compared to direct generation from user queries, particularly in cases of Mixed Intention Elicitation.
zh

[AI-14] Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems

【Quick Read】: This paper addresses the fact that traditional software observability and operations practices cannot cope with the unique uncertainty of agentic systems built on large language models (LLMs), which stems from probabilistic reasoning, evolving memory states, and fluid execution paths. The key is the AgentOps framework, a six-stage automation pipeline of behavior observation, metric collection, issue detection, root cause analysis, optimization recommendation, and runtime automation that observes, analyzes, optimizes, and automates the operation of agentic AI systems, emphasizing automation's central role in taming uncertainty (rather than eliminating it) to keep systems safe, adaptive, and effective.

Link: https://arxiv.org/abs/2507.11277
Authors: Dany Moshkovich,Sergey Zeltyn
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly deployed within agentic systems-collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper introduces AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems. We identify distinct needs across four key roles-developers, testers, site reliability engineers (SREs), and business users-each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems-not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.
zh

[AI-15] Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound

【Quick Read】: This paper addresses the heavy computational and resource demands of deep reinforcement learning (DRL) agents, which require many training steps and a vast experience replay buffer. The key is a new theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL: instead of bounding the counterfactual loss as most prior methods do, it establishes a causal bound on the factual loss (analogous to the on-policy loss in DRL), computed by storing past value-network outputs in the replay buffer, thereby exploiting data that is usually discarded, markedly improving sample efficiency and shrinking the replay buffer.

Link: https://arxiv.org/abs/2507.11269
Authors: Tal Fiskus,Uri Shaham
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 51 pages, 16 figures

Click to view abstract

Abstract:Deep reinforcement learning (DRL) agents excel in solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that leverages the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. Extensive experiments across the Atari 2600 and MuJoCo domains on various agents, such as DQN and SAC, achieve up to 2,427% higher reward ratio, outperforming the same agents without our proposed term, and reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at negligible cost.
zh

[AI-16] DuetGraph: Coarse-to-Fine Knowledge Graph Reasoning with Dual-Pathway Global-Local Fusion

【Quick Read】: This paper addresses score over-smoothing in knowledge graph (KG) reasoning, which blurs the distinction between correct and incorrect answers and hinders reasoning effectiveness. The key is DuetGraph, which segregates, rather than stacks, the processing of local information (via message passing) and global information (via attention) into two distinct pathways, preventing mutual interference and preserving representational discrimination. It further introduces a coarse-to-fine optimization that partitions entities into high- and low-score subsets, narrowing the candidate space and sharpening the score gap between the two subsets, which alleviates over-smoothing and improves inference quality.

Link: https://arxiv.org/abs/2507.11229
Authors: Jin Li,Zezhong Ding,Xike Xie
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Knowledge graphs (KGs) are vital for enabling knowledge reasoning across various domains. Recent KG reasoning methods that integrate both global and local information have achieved promising results. However, existing methods often suffer from score over-smoothing, which blurs the distinction between correct and incorrect answers and hinders reasoning effectiveness. To address this, we propose DuetGraph, a coarse-to-fine KG reasoning mechanism with dual-pathway global-local fusion. DuetGraph tackles over-smoothing by segregating – rather than stacking – the processing of local (via message passing) and global (via attention) information into two distinct pathways, preventing mutual interference and preserving representational discrimination. In addition, DuetGraph introduces a coarse-to-fine optimization, which partitions entities into high- and low-score subsets. This strategy narrows the candidate space and sharpens the score gap between the two subsets, which alleviates over-smoothing and enhances inference quality. Extensive experiments on various datasets demonstrate that DuetGraph achieves state-of-the-art (SOTA) performance, with up to an 8.7% improvement in reasoning quality and a 1.8× acceleration in training efficiency.
zh

[AI-17] Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias

【Quick Read】: This paper addresses the suppression of children's emotional expression and autonomy caused by unconscious parental expectations in family settings (the "ideal parent bias"), a phenomenon referred to as suppressed emotion. The key is a role-playing LLM-based multi-agent dialogue support framework that analyzes family dialogue, detects suppressed emotion, describes implicit ideal-parent bias in parental speech, and generates empathetic, actionable feedback, thereby supporting psychologically safe family communication.

Link: https://arxiv.org/abs/2507.11210
Authors: Rushia Harada,Yuken Kimura,Keito Inoshita
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Well-being in family settings involves subtle psychological dynamics that conventional metrics often overlook. In particular, unconscious parental expectations, termed ideal parent bias, can suppress children’s emotional expression and autonomy. This suppression, referred to as suppressed emotion, often stems from well-meaning but value-driven communication, which is difficult to detect or address from outside the family. Focusing on these latent dynamics, this study explores Large Language Model (LLM)-based support for psychologically safe family communication. We constructed a Japanese parent-child dialogue corpus of 30 scenarios, each annotated with metadata on ideal parent bias and suppressed emotion. Based on this corpus, we developed a Role-Playing LLM-based multi-agent dialogue support framework that analyzes dialogue and generates feedback. Specialized agents detect suppressed emotion, describe implicit ideal parent bias in parental speech, and infer contextual attributes such as the child’s age and background. A meta-agent compiles these outputs into a structured report, which is then passed to five selected expert agents. These agents collaboratively generate empathetic and actionable feedback through a structured four-step discussion process. Experiments show that the system can detect categories of suppressed emotion with moderate accuracy and produce feedback rated highly in empathy and practicality. Moreover, simulated follow-up dialogues incorporating this feedback exhibited signs of improved emotional expression and mutual understanding, suggesting the framework’s potential in supporting positive transformation in family interactions.
zh

[AI-18] An Explainable AI-Enhanced Machine Learning Approach for Cardiovascular Disease Detection and Risk Assessment

【Quick Read】: This paper aims to overcome the shortcomings of traditional diagnostic methods in identifying and managing heart-disease risk, particularly in regions with limited medical resources. The key is a comprehensive framework that combines classification models for heart-disease detection with regression models for risk prediction, using the Synthetic Minority Oversampling Technique (SMOTE) to resolve class imbalance in the dataset and thereby improve model accuracy and generalization.

Link: https://arxiv.org/abs/2507.11185
Authors: Md. Emon Akter Sourov,Md. Sabbir Hossen,Pabon Shaha,Mohammad Minoar Hossain,Md Sadiq Iqbal
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted at the IEEE QPAIN 2025. The final version will be available in the IEEE Xplore Digital Library

Click to view abstract

Abstract:Heart disease remains a major global health concern, particularly in regions with limited access to medical resources and diagnostic facilities. Traditional diagnostic methods often fail to accurately identify and manage heart disease risks, leading to adverse outcomes. Machine learning has the potential to significantly enhance the accuracy, efficiency, and speed of heart disease diagnosis. In this study, we proposed a comprehensive framework that combines classification models for heart disease detection and regression models for risk prediction. We employed the Heart Disease dataset, which comprises 1,035 cases. To address the issue of class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied, resulting in the generation of an additional 100,000 synthetic data points. Performance metrics, including accuracy, precision, recall, F1-score, R2, MSE, RMSE, and MAE, were used to evaluate the model’s effectiveness. Among the classification models, Random Forest emerged as the standout performer, achieving an accuracy of 97.2% on real data and 97.6% on synthetic data. For regression tasks, Linear Regression demonstrated the highest R2 values of 0.992 and 0.984 on real and synthetic datasets, respectively, with the lowest error metrics. Additionally, Explainable AI techniques were employed to enhance the interpretability of the models. This study highlights the potential of machine learning to revolutionize heart disease diagnosis and risk prediction, thereby facilitating early intervention and enhancing clinical decision-making.
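
The class-rebalancing step is standard enough to sketch. The snippet below substitutes a synthetic dataset for the heart-disease table and uses imbalanced-learn's SMOTE, applied to the training split only so that synthetic points never leak into evaluation:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in for the (imbalanced) heart-disease table of 1,035 cases.
X, y = make_classification(n_samples=1035, n_features=13,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority class on the training split only.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te)))
```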
zh

[AI-19] Mixture of Experts in Large Language Models

【Quick Read】: This paper reviews how to improve large language model performance while preserving computational efficiency, with the Mixture-of-Experts (MoE) architecture as the key. Through expert gating and routing mechanisms, MoE achieves sparse activation, increasing model capacity and task-specific performance without significant computational overhead while enabling efficient model scaling.

Link: https://arxiv.org/abs/2507.11181
Authors: Danyang Zhang,Junhao Song,Ziqian Bi,Yingfang Yuan,Tianyang Wang,Joe Yeong,Junfeng Hao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper presents a comprehensive review of the Mixture-of-Experts (MoE) architecture in large language models, highlighting its ability to significantly enhance model performance while maintaining minimal computational overhead. Through a systematic analysis spanning theoretical foundations, core architectural designs, and large language model (LLM) applications, we examine expert gating and routing mechanisms, hierarchical and sparse MoE configurations, meta-learning approaches, multimodal and multitask learning scenarios, real-world deployment cases, and recent advances and challenges in deep learning. Our analysis identifies key advantages of MoE, including superior model capacity compared to equivalent Bayesian approaches, improved task-specific performance, and the ability to scale model capacity efficiently. We also underscore the importance of ensuring expert diversity, accurate calibration, and reliable inference aggregation, as these are essential for maximizing the effectiveness of MoE architectures. Finally, this review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications.
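
The expert gating and routing loop at the heart of MoE fits in a few lines of PyTorch. The sketch below is a generic top-k router (omitting the load-balancing losses and capacity limits that production systems add):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparsely activated MoE layer: a router picks k experts per token
    and mixes their outputs by the renormalized gate probabilities."""
    def __init__(self, d_model, d_ff, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.k, dim=-1)
        topv = topv / topv.sum(-1, keepdim=True)   # renormalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # dispatch tokens to experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

y = TopKMoE(64, 256)(torch.randn(10, 64))          # 10 tokens, width 64
```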
zh

[AI-20] Gradient Regularization-based Neural Granger Causality

【Quick Read】: This paper addresses the high computational cost and limited ability to capture complex interactions of existing neural-network-based Granger causality models. The key is the Gradient Regularization-based Neural Granger Causality model (GRNGC), which requires only a single time-series forecasting model and infers Granger causality by applying L1 regularization to the gradient between the model's input and output, improving efficiency and flexibility.

Link: https://arxiv.org/abs/2507.11178
Authors: Meiliang Liu,Huiwen Dong,Xiaoxiao Yang,Yunfang Xu,Zijin Li,Zhengye Si,Xinyue Yang,Zhiwen Zhao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, conference

Click to view abstract

Abstract:With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt the component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing the sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model’s ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies L_1 regularization to the gradient between model’s input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model’s effectiveness in reconstructing gene regulatory networks.
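
A minimal PyTorch rendering of the idea (our reading of the abstract; the exact penalty and forecaster in the paper may differ): a single shared forecaster is trained with an L1 penalty on its input-output gradients, and inputs whose gradients survive the penalty are read as Granger-causal drivers.

```python
import torch

def grngc_loss(model, x, y, lam=1e-2):
    """Prediction error plus L1 regularization on the input-output
    gradient; inputs whose gradients shrink to zero are deemed
    non-causal for the predicted series."""
    x = x.detach().requires_grad_(True)
    pred = model(x)
    mse = torch.nn.functional.mse_loss(pred, y)
    jac = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    return mse + lam * jac.abs().mean()

# One forecaster for all series (an MLP here; KAN or LSTM would also fit).
T, n, lag = 400, 5, 10
data = torch.randn(T, n)                         # stand-in multivariate series
X = torch.stack([data[t:t + lag].flatten() for t in range(T - lag)])
Y = data[lag:]
model = torch.nn.Sequential(torch.nn.Linear(lag * n, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, n))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    grngc_loss(model, X, Y).backward()
    opt.step()
```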
zh

[AI-21] Improving Wi-Fi Network Performance Prediction with Deep Learning Models

【Quick Read】: This paper addresses the growing demand for robustness, reliability, and determinism in wireless networks for industrial and mission-critical applications. The key is to use machine learning to predict channel quality in Wi-Fi networks, measured by the frame delivery ratio, so that communication parameters can be adjusted proactively at runtime to optimize network operation. Convolutional neural networks and long short-term memory models are compared in prediction accuracy and computational complexity on multi-channel datasets from a real Wi-Fi setup; the results show that the frame delivery ratio can be reliably predicted, with convolutional neural networks, though slightly less accurate than other models, being more efficient in CPU usage and memory consumption, which improves usability on embedded and industrial systems.

Link: https://arxiv.org/abs/2507.11168
Authors: Gabriele Formis,Amanda Ericson,Stefan Forsstrom,Kyi Thar,Gianluca Cena,Stefano Scanzio
Institutions: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint accepted, 8 pages, 2025

Click to view abstract

Abstract:The increasing need for robustness, reliability, and determinism in wireless networks for industrial and mission-critical applications is the driver for the growth of new innovative methods. The study presented in this work makes use of machine learning techniques to predict channel quality in a Wi-Fi network in terms of the frame delivery ratio. Predictions can be used proactively to adjust communication parameters at runtime and optimize network operations for industrial applications. Methods including convolutional neural networks and long short-term memory were analyzed on datasets acquired from a real Wi-Fi setup across multiple channels. The models were compared in terms of prediction accuracy and computational complexity. Results show that the frame delivery ratio can be reliably predicted, and convolutional neural networks, although slightly less effective than other models, are more efficient in terms of CPU usage and memory consumption. This enhances the model’s usability on embedded and industrial systems.
zh

[AI-22] Fine-grained Timing Analysis of Digital Integrated Circuits in Answer Set Programming

【Quick Read】: This paper tackles computing the actual maximum delay of combinational modules in integrated circuit design, rather than the upper bound produced by conventional static timing analysis. The key is to model the problem in Answer Set Programming (ASP) and exploit ASP's highly efficient solvers to compute the true maximum delay, thereby unlocking processor performance.

Link: https://arxiv.org/abs/2507.11150
Authors: Alessandro Bertagnon,Marcello Dalpasso,Michele Favalli,Marco Gavanelli
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Accepted for publication in the issues of Theory and Practice of Logic Programming (TPLP) dedicated to ICLP 2025, 16 pages, 9 figures

Click to view abstract

Abstract:In the design of integrated circuits, one critical metric is the maximum delay introduced by combinational modules within the circuit. This delay is crucial because it represents the time required to perform a computation: in an Arithmetic-Logic Unit it represents the maximum time taken by the circuit to perform an arithmetic operation. When such a circuit is part of a larger, synchronous system, like a CPU, the maximum delay directly impacts the maximum clock frequency of the entire system. Typically, hardware designers use Static Timing Analysis to compute an upper bound of the maximum delay because it can be determined in polynomial time. However, relying on this upper bound can lead to suboptimal processor speeds, thereby missing performance opportunities. In this work, we tackle the challenging task of computing the actual maximum delay, rather than an approximate value. Since the problem is computationally hard, we model it in Answer Set Programming (ASP), a logic language featuring extremely efficient solvers. We propose non-trivial encodings of the problem into ASP. Experimental results show that ASP is a viable solution to address complex problems in hardware design.
zh

[AI-23] Collaborative Trustworthiness for Good Decision Making in Autonomous Systems

【Quick Read】: This paper addresses how to ensure the safe and correct behavior of autonomous systems in dynamic, complex environments, in particular trustworthy decision making in the presence of conflicting information. The key is to exploit the differing quality attributes of autonomous systems (such as perception quality) to judge their trustworthiness, and to borrow concepts from social epistemology to define aggregation and propagation rules for more reliable collaborative decision making. Binary Decision Diagrams (BDDs) serve as formal models for belief aggregation and propagation, and reduction rules are proposed to shrink BDD size and enable computationally efficient collaborative automated reasoning.

Link: https://arxiv.org/abs/2507.11135
Authors: Selma Saidi,Omar Laimona,Christoph Schmickler,Dirk Ziegenbein
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Autonomous systems are becoming an integral part of many application domains, like in the mobility sector. However, ensuring their safe and correct behaviour in dynamic and complex environments remains a significant challenge, where systems should autonomously make decisions e.g., about manoeuvring. We propose in this paper a general collaborative approach for increasing the level of trustworthiness in the environment of operation and improve reliability and good decision making in autonomous system. In the presence of conflicting information, aggregation becomes a major issue for trustworthy decision making based on collaborative data sharing. Unlike classical approaches in the literature that rely on consensus or majority as aggregation rule, we exploit the fact that autonomous systems have different quality attributes like perception quality. We use this criteria to determine which autonomous systems are trustworthy and borrow concepts from social epistemology to define aggregation and propagation rules, used for automated decision making. We use Binary Decision Diagrams (BDDs) as formal models for beliefs aggregation and propagation, and formulate reduction rules to reduce the size of the BDDs and allow efficient computation structures for collaborative automated reasoning.
zh

[AI-24] Defining neurosymbolic AI

【Quick Read】: This paper addresses the lack of a generally accepted formal definition in neurosymbolic AI, particularly of what neurosymbolic models and inference really are. The key is a formal definition that abstracts the commonalities of the field's key representative systems: neurosymbolic inference is defined as the computation of an integral over the product of a logical function and a belief function.

Link: https://arxiv.org/abs/2507.11127
Authors: Lennert De Smet,Luc De Raedt
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Neurosymbolic AI focuses on integrating learning and reasoning, in particular, on unifying logical and neural representations. Despite the existence of an alphabet soup of neurosymbolic AI systems, the field is lacking a generally accepted formal definition of what neurosymbolic models and inference really are. We introduce a formal definition for neurosymbolic AI that makes abstraction of its key ingredients. More specifically, we define neurosymbolic inference as the computation of an integral over a product of a logical and a belief function. We show that our neurosymbolic AI definition makes abstraction of key representative neurosymbolic AI systems.
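
The definition has a direct discrete reading: with finitely many Boolean variables the integral becomes a sum over possible worlds, i.e. weighted model counting. A self-contained sketch (the toy example and names are ours):

```python
from itertools import product

def neurosymbolic_inference(variables, logic, belief):
    """Discrete instance of the definition: sum over possible worlds of
    logic(world) * belief(world), i.e. weighted model counting."""
    return sum(float(logic(dict(zip(variables, vals)))) *
               belief(dict(zip(variables, vals)))
               for vals in product([False, True], repeat=len(variables)))

# Toy example: P(burglary OR earthquake) under independent 'neural' beliefs.
p = {"burglary": 0.1, "earthquake": 0.2}
belief = lambda w: ((p["burglary"] if w["burglary"] else 1 - p["burglary"]) *
                    (p["earthquake"] if w["earthquake"] else 1 - p["earthquake"]))
logic = lambda w: w["burglary"] or w["earthquake"]
print(neurosymbolic_inference(["burglary", "earthquake"], logic, belief))  # 0.28
```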
zh

[AI-25] AI Agent Architecture for Decentralized Trading of Alternative Assets

【Quick Read】: This paper addresses the decentralized trading of real-world alternative assets such as gold, i.e., how to bridge physical asset custody with blockchain systems while meeting strict requirements for compliance, liquidity, and risk management. The key is the GoldMine OS architecture, which employs multiple specialized AI agents (Compliance, Token Issuance, Market Making, and Risk Control) to automate and secure the tokenization and trading of physical gold into a blockchain-based stablecoin ("OZ"), combining on-chain smart contracts for critical risk controls with off-chain AI agents for decision making, fusing the transparency and reliability of blockchains with the flexibility of AI-driven automation.

Link: https://arxiv.org/abs/2507.11117
Authors: Ailiya Borjigin,Cong He,Charles CC Lee,Wei Zhou
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 Pages, 1 figure

Click to view abstract

Abstract:Decentralized trading of real-world alternative assets (e.g., gold) requires bridging physical asset custody with blockchain systems while meeting strict requirements for compliance, liquidity, and risk management. We present GoldMine OS, a research oriented architecture that employs multiple specialized AI agents to automate and secure the tokenization and exchange of physical gold into a blockchain based stablecoin (“OZ”). Our approach combines on chain smart contracts for critical risk controls with off chain AI agents for decision making, blending the transparency and reliability of blockchains with the flexibility of AI driven automation. We describe four cooperative agents (Compliance, Token Issuance, Market Making, and Risk Control) and a coordinating core, and evaluate the system through simulation and a controlled pilot deployment. In experiments the prototype delivers on demand token issuance in under 1.2 s, more than 100 times faster than manual workflows. The Market Making agent maintains tight liquidity with spreads often below 0.5 percent even under volatile conditions. Fault injection tests show resilience: an oracle price spoofing attack is detected and mitigated within 10 s, and a simulated vault mis reporting halts issuance immediately with minimal user impact. The architecture scales to 5000 transactions per second with 10000 concurrent users in benchmarks. These results indicate that an AI agent based decentralized exchange for alternative assets can satisfy rigorous performance and safety requirements. We discuss broader implications for democratizing access to traditionally illiquid assets and explain how our governance model – multi signature agent updates and on chain community voting on risk parameters – provides ongoing transparency, adaptability, and formal assurance of system integrity.
zh

[AI-26] EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

【Quick Read】: This paper addresses efficient, prompt-guided audio editing within auto-regressive models. The key is to introduce cross-attention control: a Prompt-to-Prompt-like approach guides edits through cross- and self-attention mechanisms and is combined with a diffusion-based strategy, influenced by Auffusion, to support refinement edits. Three further editing mechanisms based on replacement, reweighting, and refinement of attention scores are proposed to enhance controllability and the realism of the generated audio.

Link: https://arxiv.org/abs/2507.11096
Authors: Vassilis Sioros,Alexandros Potamianos,Giorgos Paraskevopoulos
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model’s functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms, based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly-used music-specific evaluation metrics and a human study, to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at this https URL
zh

[AI-27] Function-to-Style Guidance of LLMs for Code Translation ICML2025 WWW

【Quick Read】: This paper aims to ensure both the correctness and readability of translated code, a problem that limits the effective adoption of large language models (LLMs) in real-world software development. The key to its solution is the F2STrans framework, which comprises two core stages: functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and style learning, which improves translation readability by incorporating both positive and negative style examples.

Link: https://arxiv.org/abs/2507.11083
Authors: Longhui Zhang, Bin Wang, Jiahao Wang, Xiaofeng Zhao, Min Zhang, Hao Yang, Meishan Zhang, Yu Li, Jing Li, Jun Yu, Min Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: This paper has been accepted by ICML 2025. Models and benchmarks can be found at this https URL

Click to view abstract

Abstract:Large language models (LLMs) have made significant strides in code translation tasks. However, ensuring both the correctness and readability of translated code remains a challenge, limiting their effective adoption in real-world software development. In this work, we propose F2STrans, a function-to-style guiding paradigm designed to progressively improve the performance of LLMs in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and (2) Style learning, which improves translation readability by incorporating both positive and negative style examples. Additionally, we introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, enabling comprehensive functional and stylistic evaluations. Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios.
zh

[AI-28] Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander

【Quick Read】: This paper addresses the problem of autonomously evolving multi-agent tactical decisions from situational awareness in multi-UGV confrontations. Traditional rule-based methods become brittle in complex and transient battlefield environments, while current reinforcement learning methods, lacking interpretability, focus on action manipulation rather than strategic decisions. The key to the proposed solution is a vision-language model-based commander that integrates a vision-language model for scene understanding with a lightweight large language model for strategic reasoning, achieving unified perception and decision-making within a shared semantic space with strong adaptability and interpretability.

Link: https://arxiv.org/abs/2507.11079
Authors: Li Wang, Qizhen Wu, Lei Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In multiple unmanned ground vehicle confrontations, autonomously evolving multi-agent tactical decisions from situational awareness remain a significant challenge. Traditional handcraft rule-based methods become vulnerable in the complicated and transient battlefield environment, and current reinforcement learning methods mainly focus on action manipulation instead of strategic decisions due to lack of interpretability. Here, we propose a vision-language model-based commander to address the issue of intelligent perception-to-decision reasoning in autonomous confrontations. Our method integrates a vision language model for scene understanding and a lightweight large language model for strategic reasoning, achieving unified perception and decision within a shared semantic space, with strong adaptability and interpretability. Unlike rule-based search and reinforcement learning methods, the combination of the two modules establishes a full-chain process, reflecting the cognitive process of human commanders. Simulation and ablation experiments validate that the proposed approach achieves a win rate of over 80% compared with baseline models.
zh

[AI-29] Standards-Compliant DM-RS Allocation via Temporal Channel Prediction for Massive MIMO Systems

【Quick Read】: This paper addresses the excessive channel state information (CSI) feedback overhead in beyond-5G networks caused by the growing number of antennas in massive MIMO systems. The key to its solution is channel prediction-based reference signal allocation (CPRS), which jointly optimizes channel prediction and demodulation reference signal (DM-RS) allocation to improve data throughput without requiring CSI feedback. The method uses a ViViT/CNN architecture that treats evolving CSI matrices as sequential, image-like data, enabling efficient and adaptive transmission.

Link: https://arxiv.org/abs/2507.11064
Authors: Sehyun Ryu, Hyun Jong Yang
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Reducing feedback overhead in beyond 5G networks is a critical challenge, as the growing number of antennas in modern massive MIMO systems substantially increases the channel state information (CSI) feedback demand in frequency division duplex (FDD) systems. To address this, extensive research has focused on CSI compression and prediction, with neural network-based approaches gaining momentum and being considered for integration into the 3GPP 5G-Advanced standards. While deep learning has been effectively applied to CSI-limited beamforming and handover optimization, reference signal allocation under such constraints remains surprisingly underexplored. To fill this gap, we introduce the concept of channel prediction-based reference signal allocation (CPRS), which jointly optimizes channel prediction and DM-RS allocation to improve data throughput without requiring CSI feedback. We further propose a standards-compliant ViViT/CNN-based architecture that implements CPRS by treating evolving CSI matrices as sequential image-like data, enabling efficient and adaptive transmission in dynamic environments. Simulation results using ray-tracing channel data generated in NVIDIA Sionna validate the proposed method, showing up to 36.60% throughput improvement over benchmark strategies.
zh

[AI-30] Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing

【Quick Read】: This paper addresses the problem that personalized exercise recommendation typically ignores the semantic content of questions and the sequential, structured progression of student learning. The key to its solution is the ExRec framework, which builds an end-to-end recommendation pipeline on semantically-grounded knowledge tracing (KT): annotating the knowledge components (KCs) of questions, learning their semantic representations, training KT models, and optimizing several reinforcement learning (RL) methods. It further improves standard Q-learning-based continuous RL via a tailored model-based value estimation (MVE) approach that directly leverages the components of the KT model to estimate cumulative knowledge improvement.

Link: https://arxiv.org/abs/2507.11060
Authors: Yilmazcan Ozyurt, Tunaberk Almaci, Stefan Feuerriegel, Mrinmaya Sachan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce ExRec, a general framework for personalized exercise recommendation with semantically-grounded knowledge tracing. Our method builds on the observation that existing exercise recommendation approaches simulate student performance via knowledge tracing (KT) but they often overlook two key aspects: (a) the semantic content of questions and (b) the sequential, structured progression of student learning. To address this, our ExRec presents an end-to-end pipeline, from annotating the KCs of questions and learning their semantic representations to training KT models and optimizing several reinforcement learning (RL) methods. Moreover, we improve standard Q-learning-based continuous RL methods via a tailored model-based value estimation (MVE) approach that directly leverages the components of KT model in estimating cumulative knowledge improvement. We validate the effectiveness of our ExRec using various RL methods across four real-world tasks with different educational goals in online math learning. We further show that ExRec generalizes robustly to new, unseen questions and that it produces interpretable student learning trajectories. Together, our findings highlight the promise of KT-guided RL for effective personalization in education.
zh

[AI-31] GATE: Graph Attention Neural Networks with Real-Time Edge Construction for Robust Indoor Localization using Mobile Embedded Devices

【Quick Read】: This paper targets the accuracy degradation in indoor localization caused by device heterogeneity, non-uniform real-world RSS noise distributions, and the poor generalization of conventional deep learning models that assume a Euclidean space. The key to its solution is the GATE framework, which constructs an adaptive graph representation of fingerprint vectors that preserves the indoor state-space topology and models the non-Euclidean structure of RSS noise, thereby mitigating environmental noise and handling device heterogeneity. GATE introduces three novel components: an Attention Hyperspace Vector (AHV) for enhanced message passing, a Multi-Dimensional Hyperspace Vector (MDHV) to mitigate the GNN blind-spot problem, and a Real-Time Edge Construction (RTEC) approach for dynamic graph adaptation.

Link: https://arxiv.org/abs/2507.11053
Authors: Danish Gufran, Sudeep Pasricha
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate indoor localization is crucial for enabling spatial context in smart environments and navigation systems. Wi-Fi Received Signal Strength (RSS) fingerprinting is a widely used indoor localization approach due to its compatibility with mobile embedded devices. Deep Learning (DL) models improve accuracy in localization tasks by learning RSS variations across locations, but they assume fingerprint vectors exist in a Euclidean space, failing to incorporate spatial relationships and the non-uniform distribution of real-world RSS noise. This results in poor generalization across heterogeneous mobile devices, where variations in hardware and signal processing distort RSS readings. Graph Neural Networks (GNNs) can improve upon conventional DL models by encoding indoor locations as nodes and modeling their spatial and signal relationships as edges. However, GNNs struggle with non-Euclidean noise distributions and suffer from the GNN blind spot problem, leading to degraded accuracy in environments with dense access points (APs). To address these challenges, we propose GATE, a novel framework that constructs an adaptive graph representation of fingerprint vectors while preserving an indoor state-space topology, modeling the non-Euclidean structure of RSS noise to mitigate environmental noise and address device heterogeneity. GATE introduces 1) a novel Attention Hyperspace Vector (AHV) for enhanced message passing, 2) a novel Multi-Dimensional Hyperspace Vector (MDHV) to mitigate the GNN blind spot, and 3) a new Real-Time Edge Construction (RTEC) approach for dynamic graph adaptation. Extensive real-world evaluations across multiple indoor spaces with varying path lengths, AP densities, and heterogeneous devices demonstrate that GATE achieves 1.6x to 4.72x lower mean localization errors and 1.85x to 4.57x lower worst-case errors compared to state-of-the-art indoor localization frameworks.
zh

[AI-32] Crafting Imperceptible On-Manifold Adversarial Attacks for Tabular Data

【Quick Read】: This paper addresses adversarial attacks on tabular data, which differ fundamentally from the image or text domains because tabular data mixes categorical and numerical features. Traditional gradient-based methods typically rely on ℓ_p-norm constraints, and the adversarial examples they produce often deviate from the original data distribution and are easy to detect. The key to the proposed solution is a latent-space perturbation framework based on a mixed-input variational autoencoder (VAE) that integrates categorical embeddings and numerical features into a unified latent manifold, enabling the generation of imperceptible adversarial examples that preserve statistical consistency. The essence is to perturb in the latent space rather than modify the input space directly, improving both imperceptibility and effectiveness.

Link: https://arxiv.org/abs/2507.10998
Authors: Zhipeng He, Alexander Stevens, Chun Ouyang, Johannes De Smedt, Alistair Barros, Catarina Moreira
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 32 pages

Click to view abstract

Abstract:Adversarial attacks on tabular data present fundamental challenges distinct from image or text domains due to the heterogeneous nature of mixed categorical and numerical features. Unlike images where pixel perturbations maintain visual similarity, tabular data lacks intuitive similarity metrics, making it difficult to define imperceptible modifications. Additionally, traditional gradient-based methods prioritise \ell_p -norm constraints, often producing adversarial examples that deviate from the original data distributions, making them detectable. We propose a latent space perturbation framework using a mixed-input Variational Autoencoder (VAE) to generate imperceptible adversarial examples. The proposed VAE integrates categorical embeddings and numerical features into a unified latent manifold, enabling perturbations that preserve statistical consistency. We specify In-Distribution Success Rate (IDSR) to measure the proportion of adversarial examples that remain statistically indistinguishable from the input distribution. Evaluation across six publicly available datasets and three model architectures demonstrates that our method achieves substantially lower outlier rates and more consistent performance compared to traditional input-space attacks and other VAE-based methods adapted from image domain approaches. Our comprehensive analysis includes hyperparameter sensitivity, sparsity control mechanisms, and generative architectural comparisons, revealing that VAE-based attacks depend critically on reconstruction quality but offer superior practical utility when sufficient training data is available. This work highlights the importance of on-manifold perturbations for realistic adversarial attacks on tabular data, offering a robust approach for practical deployment. The source code can be accessed through this https URL.
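As an illustration of the latent-space perturbation idea, a minimal sketch follows; the `vae.encode`/`vae.decode` interface and the attack loop are hypothetical stand-ins for exposition, not the paper's implementation:

```python
import torch

def latent_space_attack(vae, classifier, x, y_true, steps=50, lr=0.05, eps=0.5):
    """Sketch: search for an adversarial example in the VAE latent space.

    Assumes `vae` exposes encode(x) -> (mu, logvar) and decode(z) -> x_hat for
    mixed categorical/numerical tabular inputs; this interface is an assumption.
    """
    mu, _ = vae.encode(x)
    delta = torch.zeros_like(mu, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = vae.decode(mu + delta)              # decode perturbed latent code
        # maximize classification loss (untargeted attack) by minimizing its negation
        loss = -torch.nn.functional.cross_entropy(classifier(x_adv), y_true)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                        # keep the latent perturbation small
            delta.clamp_(-eps, eps)
    return vae.decode(mu + delta).detach()
```

Because the perturbation lives in the latent space and is decoded back through the VAE, the candidate stays near the learned data manifold, which is the property the paper's In-Distribution Success Rate is designed to measure.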
zh

[AI-33] Misalignment from Treating Means as Ends

【Quick Read】: This paper addresses the alignment problem in reinforcement learning that arises because reward functions are rarely accurate: they are often distorted by human beliefs about how best to achieve goals and thus conflate terminal goals (ends in themselves) with instrumental goals (means to an end). The key to its solution is to identify and distinguish these two kinds of goals, since even slight conflation can cause optimization of the misspecified reward to deviate severely from the true objective.

Link: https://arxiv.org/abs/2507.10995
Authors: Henrik Marklund, Alex Infanger, Benjamin Van Roy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reward functions, learned or manually specified, are rarely perfect. Instead of accurately expressing human goals, these reward functions are often distorted by human beliefs about how best to achieve those goals. Specifically, these reward functions often express a combination of the human’s terminal goals – those which are ends in themselves – and the human’s instrumental goals – those which are means to an end. We formulate a simple example in which even slight conflation of instrumental and terminal goals results in severe misalignment: optimizing the misspecified reward function results in poor performance when measured by the true reward function. This example distills the essential properties of environments that make reinforcement learning highly sensitive to conflation of instrumental and terminal goals. We discuss how this issue can arise with a common approach to reward learning and how it can manifest in real environments.
zh

[AI-34] Modeling Habitat Shifts: Integrating Convolutional Neural Networks and Tabular Data for Species Migration Prediction AAAI2025 AAAI

【Quick Read】: This paper addresses how to accurately model whether bird species are present in a specific habitat given climate-induced range shifts. The key to its solution is combining convolutional neural networks (CNNs) with tabular data: satellite imagery and environmental features (e.g., temperature, precipitation, elevation) are used to predict bird presence under various climates. The CNN captures spatial characteristics of landscapes such as forestation, water bodies, and urbanization, while the tabular method leverages ecological and geographic data; together they predict bird distributions with an average accuracy of 85%, offering a scalable and reliable way to understand bird migration.

Link: https://arxiv.org/abs/2507.10993
Authors: Emir Durakovic, Min-Hong Shih
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper uses a lightly modified version of the AAAI 2025 LaTeX style for formatting consistency. It is not a submission to AAAI and does not include any AAAI-specific headers, footers, or metadata

Click to view abstract

Abstract:Due to climate-induced changes, many habitats are experiencing range shifts away from their traditional geographic locations (Piguet, 2011). We propose a solution to accurately model whether bird species are present in a specific habitat through the combination of Convolutional Neural Networks (CNNs) (O’Shea, 2015) and tabular data. Our approach makes use of satellite imagery and environmental features (e.g., temperature, precipitation, elevation) to predict bird presence across various climates. The CNN model captures spatial characteristics of landscapes such as forestation, water bodies, and urbanization, whereas the tabular method uses ecological and geographic data. Both systems predict the distribution of birds with an average accuracy of 85%, offering a scalable but reliable method to understand bird migration.
zh

[AI-35] High-Throughput Distributed Reinforcement Learning via Adaptive Policy Synchronization

【Quick Read】: This paper addresses the limited modularity and reusability of reinforcement learning (RL) frameworks that entangle environment simulation, learning logic, and orchestration into monolithic systems when distributing workloads across compute clusters. The key to its solution is ClusterEnv, a lightweight, learner-agnostic interface for distributed environment execution whose core innovation is the DETACH pattern: reset() and step() operations are offloaded to remote workers while learning remains centralized, decoupling simulation from training. To counter policy staleness in distributed execution, it further proposes Adaptive Actor Policy Synchronization (AAPS), a divergence-triggered update mechanism that reduces synchronization overhead without sacrificing performance.

Link: https://arxiv.org/abs/2507.10990
Authors: Rodney Lafuente-Mercado
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Scaling reinforcement learning (RL) workloads often requires distributing environment simulation across compute clusters. Existing frameworks entangle simulation, learning logic, and orchestration into monolithic systems, limiting modularity and reusability. We present ClusterEnv, a lightweight, learner-agnostic interface for distributed environment execution that mirrors the Gymnasium API. ClusterEnv introduces the DETACH pattern, which decouples simulation from training by offloading reset() and step() operations to remote workers while keeping learning centralized. To address policy staleness in distributed execution, we propose Adaptive Actor Policy Synchronization (AAPS), a divergence-triggered update mechanism that reduces synchronization overhead without sacrificing performance. ClusterEnv integrates cleanly into existing RL pipelines, supports both on-policy and off-policy methods, and requires minimal code changes. Experiments on discrete control tasks demonstrate that AAPS achieves high sample efficiency with significantly fewer weight updates. Source code is available at this https URL.
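A minimal sketch of the DETACH idea follows, assuming a Gymnasium-style environment; the class and function names are illustrative, not ClusterEnv's actual API. The environment lives in a separate worker process that serves reset()/step() commands while the learner stays local:

```python
import gymnasium as gym
import multiprocessing as mp

def _worker(conn, env_id):
    """Runs remotely: owns the environment and serves reset/step commands."""
    env = gym.make(env_id)
    while True:
        cmd, arg = conn.recv()
        if cmd == "reset":
            conn.send(env.reset(seed=arg))
        elif cmd == "step":
            conn.send(env.step(arg))
        elif cmd == "close":
            env.close()
            conn.close()
            break

class DetachedEnv:
    """Gymnasium-like proxy: simulation is offloaded, learning stays local."""
    def __init__(self, env_id):
        self._conn, child = mp.Pipe()
        self._proc = mp.Process(target=_worker, args=(child, env_id), daemon=True)
        self._proc.start()

    def reset(self, seed=None):
        self._conn.send(("reset", seed))
        return self._conn.recv()

    def step(self, action):
        self._conn.send(("step", action))
        return self._conn.recv()

    def close(self):
        self._conn.send(("close", None))
        self._proc.join()

if __name__ == "__main__":
    env = DetachedEnv("CartPole-v1")       # learner-side code is unchanged
    obs, info = env.reset(seed=0)
    obs, reward, terminated, truncated, info = env.step(0)
    env.close()
```

The point of the pattern is that the learner-side call sites mirror the standard Gymnasium API, so on-policy and off-policy methods can adopt distributed execution with minimal code changes.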
zh

[AI-36] Pronunciation Deviation Analysis Through Voice Cloning and Acoustic Comparison

【Quick Read】: This paper addresses mispronunciation detection, i.e., how to accurately identify pronunciation errors in a user's speech. The key to its solution is to use voice cloning to generate a corrected-pronunciation version of the user's utterance and then compare the acoustic deviations between the original and cloned speech, localizing the regions of maximal deviation as likely mispronunciations. The approach requires neither predefined phonetic rules nor extensive training data for each target language.

Link: https://arxiv.org/abs/2507.10985
Authors: Andrew Valdivia, Yueming Zhang, Hailu Xu, Amir Ghasemkhani, Xin Qin
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:This paper presents a novel approach for detecting mispronunciations by analyzing deviations between a user’s original speech and their voice-cloned counterpart with corrected pronunciation. We hypothesize that regions with maximal acoustic deviation between the original and cloned utterances indicate potential mispronunciations. Our method leverages recent advances in voice cloning to generate a synthetic version of the user’s voice with proper pronunciation, then performs frame-by-frame comparisons to identify problematic segments. Experimental results demonstrate the effectiveness of this approach in pinpointing specific pronunciation errors without requiring predefined phonetic rules or extensive training data for each target language.
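A rough sketch of the frame-by-frame comparison described above, assuming roughly time-aligned original and cloned recordings and MFCC features (the paper's actual features and alignment procedure may differ):

```python
import numpy as np
import librosa

def deviation_profile(orig_path, cloned_path, sr=16000, n_mfcc=13):
    """Per-frame acoustic distance between an utterance and its cloned version."""
    y1, _ = librosa.load(orig_path, sr=sr)
    y2, _ = librosa.load(cloned_path, sr=sr)
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=n_mfcc)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=n_mfcc)
    n = min(m1.shape[1], m2.shape[1])                      # truncate to common length
    return np.linalg.norm(m1[:, :n] - m2[:, :n], axis=0)  # one distance per frame

def flag_mispronunciations(dist, z=2.0):
    """Flag frames whose deviation exceeds mean + z * std as candidate errors."""
    return np.where(dist > dist.mean() + z * dist.std())[0]
```

In practice a dynamic time warping step before the frame-wise comparison would make the alignment assumption less brittle.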
zh

[AI-37] Biological Processing Units: Leverag ing an Insect Connectome to Pioneer Biofidelic Neural Architectures

【Quick Read】: This paper asks whether biologically evolved neural circuits can support artificial intelligence. The key to its solution is converting the complete connectome of the Drosophila larva brain into a Biological Processing Unit (BPU), a fixed recurrent network derived directly from synaptic connectivity. Without modification, the BPU already shows strong classification ability, reaching 98% accuracy on MNIST and 58% on CIFAR-10, and structured connectome expansions further improve performance. A lightweight GNN-BPU model also performs strongly on the ChessBench dataset, suggesting that biologically inspired neural architectures can support complex cognitive tasks.

Link: https://arxiv.org/abs/2507.10951
Authors: Siyu Yu, Zihan Qin, Tingshan Liu, Beiya Xu, R. Jacob Vogelstein, Jason Brown, Joshua T. Vogelstein
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Accepted to AGI 2025

Click to view abstract

Abstract:The complete connectome of the Drosophila larva brain offers a unique opportunity to investigate whether biologically evolved circuits can support artificial intelligence. We convert this wiring diagram into a Biological Processing Unit (BPU), a fixed recurrent network derived directly from synaptic connectivity. Despite its modest size (3,000 neurons and 65,000 weights between them), the unmodified BPU achieves 98% accuracy on MNIST and 58% on CIFAR-10, surpassing size-matched MLPs. Scaling the BPU via structured connectome expansions further improves CIFAR-10 performance, while modality-specific ablations reveal the uneven contributions of different sensory subsystems. On the ChessBench dataset, a lightweight GNN-BPU model trained on only 10,000 games achieves 60% move accuracy, nearly 10x better than any-size transformer. Moreover, CNN-BPU models with ~2M parameters outperform parameter-matched Transformers, and with a depth-6 minimax search at inference, reach 91.7% accuracy, exceeding even a 9M-parameter Transformer baseline. These results demonstrate the potential of biofidelic neural architectures to support complex cognitive tasks and motivate scaling to larger and more intelligent connectomes in future work.
zh

[AI-38] Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization ACL2025

【Quick Read】: This paper addresses the risk that generative protein sequence models may produce harmful sequences, such as those enhancing viral transmissibility or evading immune responses, raising biosafety and ethical challenges. The key to its solution is a Knowledge-guided Preference Optimization (KPO) framework that injects prior knowledge through a Protein Safety Knowledge Graph, uses an efficient graph-pruning strategy to identify preferred sequences, and employs reinforcement learning to minimize the risk of generating harmful proteins.

Link: https://arxiv.org/abs/2507.10923
Authors: Yuhao Wang, Keyan Ding, Kehua Feng, Zeyuan Wang, Ming Qin, Xiaotong Li, Qiang Zhang, Huajun Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2025 (Main Conference)

Click to view abstract

Abstract:Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and de novo design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
zh

[AI-39] Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

【Quick Read】: This paper addresses the challenge of therapy recommendation for chronic patients with multimorbidity, particularly the risks posed by treatment conflicts. The key to its solution is to mirror how general practitioners (GPs) manage multimorbidity patients: a Large Language Model (LLM)-based multi-agent system (MAS) simulates multidisciplinary team (MDT) decision-making, letting LLM agents discuss and resolve medical conflicts to improve the safety and effectiveness of therapy recommendations.

Link: https://arxiv.org/abs/2507.10911
Authors: Yicong Wu, Ting Chen, Irit Hochberg, Zhoujian Sun, Ruth Edry, Zhengxing Huang, Mor Peleg
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Therapy recommendation for chronic patients with multimorbidity is challenging due to risks of treatment conflicts. Existing decision support systems face scalability limitations. Inspired by the way in which general practitioners (GPs) manage multimorbidity patients, occasionally convening multidisciplinary team (MDT) collaboration, this study investigated the feasibility and value of using a Large Language Model (LLM)-based multi-agent system (MAS) for safer therapy recommendations. We designed a single agent and a MAS framework simulating MDT decision-making by enabling discussion among LLM agents to resolve medical conflicts. The systems were evaluated on therapy planning tasks for multimorbidity patients using benchmark cases. We compared MAS performance with single-agent approaches and real-world benchmarks. An important contribution of our study is the definition of evaluation metrics that go beyond technical precision and recall, allowing inspection of the clinical goals met and the medication burden of the proposed advice against a gold-standard benchmark. Our results show that with current LLMs, a single GP agent performs as well as MDTs. The best-scoring models provide correct recommendations that address all clinical goals, yet the advice is incomplete. Some models also present unnecessary medications, resulting in unnecessary conflicts between medications and conditions or drug-drug interactions.
zh

[AI-40] Class-Proportional Coreset Selection for Difficulty-Separable Data ICCV2025

【Quick Read】: This paper addresses the performance degradation of coreset selection methods caused by class-wise differences in data difficulty. Existing methods implicitly assume data difficulty is homogeneous across classes, yet in domains such as network intrusion detection and medical imaging, difficulty often clusters by class. The key to the solution is the notion of class-difficulty separability, quantified by the Class Difficulty Separability Coefficient (CDSC), which guides class-proportional sampling strategies for more effective data pruning and improves model robustness and generalization in high-stakes scenarios.

Link: https://arxiv.org/abs/2507.10904
Authors: Elisa Tsai, Haizhong Zheng, Atul Prakash
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted to the ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL)

Click to view abstract

Abstract:High-quality training data is essential for building reliable and efficient machine learning systems. One-shot coreset selection addresses this by pruning the dataset while maintaining or even improving model performance, often relying on training-dynamics-based data difficulty scores. However, most existing methods implicitly assume class-wise homogeneity in data difficulty, overlooking variation in data difficulty across different classes. In this work, we challenge this assumption by showing that, in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient (CDSC) as a quantitative measure. We demonstrate that high CDSC values correlate with performance degradation in class-agnostic coreset methods, which tend to overrepresent easy majority classes while neglecting rare but informative ones. To address this, we introduce class-proportional variants of multiple sampling strategies. Evaluated on five diverse datasets spanning security and medical domains, our methods consistently achieve state-of-the-art data efficiency. For instance, on CTU-13, at an extreme 99% pruning rate, a class-proportional variant of Coverage-centric Coreset Selection (CCS-CP) shows remarkable stability, with accuracy dropping only 2.58%, precision 0.49%, and recall 0.19%. In contrast, the class-agnostic CCS baseline, the next best method, suffers sharper declines of 7.59% in accuracy, 4.57% in precision, and 4.11% in recall. We further show that aggressive pruning enhances generalization in noisy, imbalanced, and large-scale datasets. Our results underscore that explicitly modeling class-difficulty separability leads to more effective, robust, and generalizable data pruning, particularly in high-stakes scenarios.
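The class-proportional recipe is easy to state in code: score examples, then apply the pruning budget within each class rather than globally. A minimal sketch, with a hardest-first rule standing in for CCS-CP's actual coverage-centric scoring:

```python
import numpy as np

def class_proportional_coreset(scores, labels, keep_frac=0.01):
    """Keep `keep_frac` of each class, ranked by a per-example difficulty score.

    scores: (N,) difficulty scores; labels: (N,) class ids.
    Returns indices of the retained coreset.
    """
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        budget = max(1, int(round(keep_frac * len(idx))))  # per-class budget
        order = idx[np.argsort(scores[idx])[::-1]]         # hardest first
        keep.extend(order[:budget].tolist())
    return np.array(sorted(keep))
```

Because the budget is allocated per class, rare but informative classes cannot be squeezed out by easy majority classes, which is exactly the failure mode the CDSC measure is meant to predict.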
zh

[AI-41] MalCodeAI: Autonomous Vulnerability Detection and Remediation via Language Agnostic Code Reasoning

【Quick Read】: This paper aims to counter increasingly complex cyber threats and the limitations of traditional vulnerability detection tools in order to improve software security. The key to its solution is MalCodeAI, a language-agnostic, multi-stage AI pipeline that combines code decomposition with semantic reasoning, using fine-tuned Qwen2.5-Coder-3B-Instruct models optimized through Low-Rank Adaptation (LoRA) within the MLX framework, delivering scalable, accurate code security analysis and remediation across 14 programming languages.

Link: https://arxiv.org/abs/2507.10898
Authors: Jugal Gajjar, Kamalasankari Subramaniakuppusamy, Noha El Kachach
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 6 pages, 4 figures, accepted for publication in IEEE 26th International Conference on Information Reuse and Integration (IRI 2025)

Click to view abstract

Abstract:The growing complexity of cyber threats and the limitations of traditional vulnerability detection tools necessitate novel approaches for securing software systems. We introduce MalCodeAI, a language-agnostic, multi-stage AI pipeline for autonomous code security analysis and remediation. MalCodeAI combines code decomposition and semantic reasoning using fine-tuned Qwen2.5-Coder-3B-Instruct models, optimized through Low-Rank Adaptation (LoRA) within the MLX framework, and delivers scalable, accurate results across 14 programming languages. In Phase 1, the model achieved a validation loss as low as 0.397 for functional decomposition and summarization of code segments after 200 iterations, 6 trainable layers, and a learning rate of 2 x 10^(-5). In Phase 2, for vulnerability detection and remediation, it achieved a best validation loss of 0.199 using the same number of iterations and trainable layers but with an increased learning rate of 4 x 10^(-5), effectively identifying security flaws and suggesting actionable fixes. MalCodeAI supports red-hat-style exploit tracing, CVSS-based risk scoring, and zero-shot generalization to detect complex, zero-day vulnerabilities. In a qualitative evaluation involving 15 developers, the system received high scores in usefulness (mean 8.06/10), interpretability (mean 7.40/10), and readability of outputs (mean 7.53/10), confirming its practical value in real-world development workflows. This work marks a significant advancement toward intelligent, explainable, and developer-centric software security solutions.
zh

[AI-42] How to Protect Models against Adversarial Unlearning?

【Quick Read】: This paper investigates the performance degradation that unlearning can cause in AI models, especially in adversarial settings where a malicious party deliberately sends unlearning requests to damage model performance as much as possible. The key to its solution is a new method that protects model performance from these side effects, whether they arise from spontaneous processes or from adversarial actions.

Link: https://arxiv.org/abs/2507.10886
Authors: Patryk Jasiorski, Marek Klonowski, Michał Woźniak
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI models need to be unlearned to fulfill the requirements of legal acts such as the AI Act or GDPR, and also because of the need to remove toxic content, debiasing, the impact of malicious instances, or changes in the data distribution structure in which a model works. Unfortunately, removing knowledge may cause undesirable side effects, such as a deterioration in model performance. In this paper, we investigate the problem of adversarial unlearning, where a malicious party intentionally sends unlearn requests to deteriorate the model’s performance maximally. We show that this phenomenon and the adversary’s capabilities depend on many factors, primarily on the backbone model itself and strategy/limitations in selecting data to be unlearned. The main result of this work is a new method of protecting model performance from these side effects, both in the case of unlearned behavior resulting from spontaneous processes and adversary actions.
zh

[AI-43] WhisperKit: On-device Real-time ASR with Billion-Scale Transformers ICML2025

【Quick Read】: This paper targets the accuracy and latency bottlenecks of real-time automatic speech recognition (ASR) systems. Its solution, WhisperKit, is an optimized on-device inference system that significantly reduces latency and improves accuracy relative to leading cloud-based ASR systems. The key lies in its model optimization techniques, which achieve latency as low as 0.46 s and a 2.2% word error rate (WER) while maintaining high accuracy.

Link: https://arxiv.org/abs/2507.10860
Authors: Atila Orhon, Arda Okan, Berkin Durmus, Zach Nagengast, Eduardo Pacheco
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: ICML 2025 - On-Device Learning for Foundational Models Workshop

Click to view abstract

Abstract:Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcriptions, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency at 0.46 s while achieving the highest accuracy, 2.2% WER. The optimizations behind the WhisperKit system are described in detail in this paper.
zh

[AI-44] PhreshPhish: A Real-World High-Quality Large-Scale Phishing Website Dataset and Benchmark

【Quick Read】: This paper addresses the overly optimistic evaluation of phishing detection models caused by low-quality datasets that suffer from leakage and unrealistic base rates. The key to its solution is PhreshPhish, a large-scale, high-quality dataset of phishing websites that substantially reduces invalid or mislabeled data points, together with a comprehensive suite of benchmark datasets designed for realistic model evaluation by minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjusting base rates toward those seen in the real world.

Link: https://arxiv.org/abs/2507.10854
Authors: Thomas Dalton, Hemanth Gowda, Girish Rao, Sachin Pargi, Alireza Hadj Khodabakhshi, Joseph Rombs, Stephan Jou, Manish Marwah
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by lack of large, high-quality datasets and benchmarks. In addition to poor-quality due to challenges in data collection, existing datasets suffer from leakage and unrealistic base rates, leading to overly optimistic performance results. In this paper, we introduce PhreshPhish, a large-scale, high-quality dataset of phishing websites that addresses these limitations. Compared to existing public datasets, PhreshPhish is substantially larger and provides significantly higher quality, as measured by the estimated rate of invalid or mislabeled data points. Additionally, we propose a comprehensive suite of benchmark datasets specifically designed for realistic model evaluation by minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjustment of base rates more likely to be seen in the real world. We train and evaluate multiple solution approaches to provide baseline performance on the benchmark sets. We believe the availability of this dataset and benchmarks will enable realistic, standardized model comparison and foster further advances in phishing detection. The datasets and benchmarks are available on Hugging Face (this https URL).
zh

[AI-45] Offline Reinforcement Learning with Wasserstein Regularization via Optimal Transport Maps

【Quick Read】: This paper addresses distributional shift in offline reinforcement learning, where the learned policy deviates from the dataset distribution and may take unreliable out-of-distribution actions. The key to its solution is regularization with the Wasserstein distance: optimal transport maps are modeled with input-convex neural networks (ICNNs), so the Wasserstein distance can be computed in a discriminator-free manner, avoiding adversarial training and ensuring stable learning.

Link: https://arxiv.org/abs/2507.10843
Authors: Motoki Omura, Yusuke Mukuta, Kazuki Ota, Takayuki Osa, Tatsuya Harada
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at RLC 2025

Click to view abstract

Abstract:Offline reinforcement learning (RL) aims to learn an optimal policy from a static dataset, making it particularly valuable in scenarios where data collection is costly, such as robotics. A major challenge in offline RL is distributional shift, where the learned policy deviates from the dataset distribution, potentially leading to unreliable out-of-distribution actions. To mitigate this issue, regularization techniques have been employed. While many existing methods utilize density ratio-based measures, such as the f -divergence, for regularization, we propose an approach that utilizes the Wasserstein distance, which is robust to out-of-distribution data and captures the similarity between actions. Our method employs input-convex neural networks (ICNNs) to model optimal transport maps, enabling the computation of the Wasserstein distance in a discriminator-free manner, thereby avoiding adversarial training and ensuring stable learning. Our approach demonstrates comparable or superior performance to widely used existing methods on the D4RL benchmark dataset. The code is available at this https URL .
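For orientation, a generic Wasserstein-regularized offline RL objective of the kind the abstract describes can be written as below; the notation is assumed for illustration, and the paper's exact loss, which computes W via ICNN-parameterized transport maps, may differ:

```latex
% Generic Wasserstein-regularized offline RL objective (notation assumed):
\max_{\pi}\;
\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[ Q(s, a) \big]
\;-\; \alpha \, W\big( \pi(\cdot \mid s),\, \pi_{\beta}(\cdot \mid s) \big)
```

Here $\pi_\beta$ denotes the behavior policy underlying the dataset $\mathcal{D}$, and $\alpha$ trades off return against deviation from the data; the appeal over $f$-divergence penalties is that $W$ stays finite and informative even where the two policies have disjoint support.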
zh

[AI-46] AF-XRAY: Visual Explanation and Resolution of Ambiguity in Legal Argumentation Frameworks

【Quick Read】: This paper addresses the difficulty that non-expert users face in identifying sources of ambiguity and explaining argument acceptance in legal reasoning based on argumentation frameworks (AFs). The key to its solution is the AF-XRAY toolkit, which provides layered visualizations revealing well-founded derivation structures based on game-theoretic argument length, classifies attack edges by semantic role, overlays alternative two-valued solutions on ambiguous three-valued grounded semantics, and systematically generates critical attack sets whose suspension resolves undecided arguments, turning ambiguous scenarios into grounded solutions.

Link: https://arxiv.org/abs/2507.10831
Authors: Yilin Xia, Heng Zheng, Shawn Bowers, Bertram Ludäscher
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: International Conference on Artificial Intelligence and Law (ICAIL), June 16-20, 2025. Chicago, IL, USA

Click to view abstract

Abstract:Argumentation frameworks (AFs) provide formal approaches for legal reasoning, but identifying sources of ambiguity and explaining argument acceptance remains challenging for non-experts. We present AF-XRAY, an open-source toolkit for exploring, analyzing, and visualizing abstract AFs in legal reasoning. AF-XRAY introduces: (i) layered visualizations based on game-theoretic argument length revealing well-founded derivation structures; (ii) classification of attack edges by semantic roles (primary, secondary, blunders); (iii) overlay visualizations of alternative 2-valued solutions on ambiguous 3-valued grounded semantics; and (iv) identification of critical attack sets whose suspension resolves undecided arguments. Through systematic generation of critical attack sets, AF-XRAY transforms ambiguous scenarios into grounded solutions, enabling users to pinpoint specific causes of ambiguity and explore alternative resolutions. We use real-world legal cases (e.g., Wild Animals as modeled by Bench-Capon) to show that our tool supports teleological legal reasoning by revealing how different assumptions lead to different justified conclusions.
zh

[AI-47] Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition

【Quick Read】: This paper aims to counter the loss of the SENĆOTEN language caused by colonial language policies and to accelerate language documentation and the creation of educational resources through automatic speech recognition (ASR). The key to its solution is an ASR-driven documentation pipeline that leverages speech data augmented by a text-to-speech (TTS) system and cross-lingual transfer learning with speech foundation models (SFMs), combined with an n-gram language model via shallow fusion or n-best rescoring to make maximal use of the available data.

Link: https://arxiv.org/abs/2507.10827
Authors: Mengzhe Geng, Patrick Littell, Aidan Pine, PENÁĆ, Marc Tessier, Roland Kuhn
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted by ComputEL-8

Click to view abstract

Abstract:The SENĆOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENĆOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENĆOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENĆOTEN language documentation.
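Shallow fusion itself is standard and worth spelling out: during decoding, the ASR model's score is interpolated with the external n-gram LM score. This is the textbook formulation, with the weight λ tuned on held-out data, not a value taken from the paper:

```latex
% Shallow fusion of an external n-gram LM during decoding:
\hat{y} \;=\; \arg\max_{y}\;
\log P_{\mathrm{ASR}}(y \mid x) \;+\; \lambda \, \log P_{\mathrm{LM}}(y)
```

N-best rescoring applies the same interpolation after decoding, re-ranking the ASR model's top hypotheses rather than steering the beam search itself.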
zh

[AI-48] Past Present and Future: Exploring Adaptive AI in Software Development Bots

【Quick Read】: This paper examines the problem that traditional rule-based systems offer static, non-personalized assistance that no longer meets the needs of modern software development, and asks how adaptive AI-powered conversational agents can be effectively integrated to improve development efficiency and collaboration. The key to its solution is leveraging machine learning and natural language processing so that AI agents learn from interactions and improve over time, providing dynamic, context-aware, and personalized support.

Link: https://arxiv.org/abs/2507.10822
Authors: Omar Elsisi, Glaucia Melo
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Conversational agents, such as chatbots and virtual assistants, have become essential in software development, boosting productivity, collaboration, and automating various tasks. This paper examines the role of adaptive AI-powered conversational agents in software development, highlighting their ability to offer dynamic, context-aware assistance to developers. Unlike traditional rule-based systems, adaptive AI agents use machine learning and natural language processing to learn from interactions and improve over time, providing more personalized and responsive help. We look at how these tools have evolved from simple query-based systems to advanced AI-driven solutions like GitHub Copilot and Microsoft Teams bots. We also explore the challenges of integrating adaptive AI into software development processes. The study aims to assess the benefits and limitations of these systems, address concerns like data privacy and ethical issues, and offer insights into their future use in the field. Ultimately, adaptive AI chatbots have great potential to revolutionize software development by delivering real-time, customized support and enhancing the efficiency of development cycles.
zh

[AI-49] Semantic Context for Tool Orchestration ICML2025

【Quick Read】: This paper addresses efficient, adaptive tool orchestration in complex and dynamic environments. The key to its solution is Semantic Context (SC), which leverages descriptive tool information to support tool orchestration over large action spaces. The paper introduces the SC-LinUCB algorithm and validates the effectiveness of SC in both static and non-stationary settings through theory and experiments, and further proposes the FiReAct pipeline, showing that SC-based retrieval enables an LLM to orchestrate effectively on a benchmark with over 10,000 tools.

Link: https://arxiv.org/abs/2507.10820
Authors: Robert Müller
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Workshop on Computer Use Agents @ ICML2025

Click to view abstract

Abstract:This paper demonstrates that Semantic Context (SC), leveraging descriptive tool information, is a foundational component for robust tool orchestration. Our contributions are threefold. First, we provide a theoretical foundation using contextual bandits, introducing SC-LinUCB and proving it achieves lower regret and adapts favourably in dynamic action spaces. Second, we provide parallel empirical validation with Large Language Models, showing that SC is critical for successful in-context learning in both static (efficient learning) and non-stationary (robust adaptation) settings. Third, we propose the FiReAct pipeline, and demonstrate on a benchmark with over 10,000 tools that SC-based retrieval enables an LLM to effectively orchestrate over a large action space. These findings provide a comprehensive guide to building more sample-efficient, adaptive, and scalable orchestration agents.
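Since the paper builds on LinUCB, a compact sketch helps make "semantic context" concrete: each tool (arm) is represented by an embedding of its description, and a shared linear reward model picks the arm with the highest upper confidence bound. This is textbook shared-parameter LinUCB; SC-LinUCB's precise construction may differ:

```python
import numpy as np

class LinUCB:
    """LinUCB over semantic-context features: one feature vector per tool."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)          # regularized Gram matrix
        self.b = np.zeros(dim)        # reward-weighted feature sum
        self.alpha = alpha            # exploration strength

    def select(self, contexts):
        """contexts: (n_tools, dim) embeddings of the tool descriptions."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b        # ridge-regression reward estimate
        # mean reward plus confidence width, per arm
        ucb = contexts @ theta + self.alpha * np.sqrt(
            np.einsum("nd,dk,nk->n", contexts, A_inv, contexts))
        return int(np.argmax(ucb))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

The payoff of semantic features is that a brand-new tool arrives with a meaningful context vector, so the bandit can generalize to it immediately instead of restarting exploration, which is what makes the dynamic action-space setting tractable.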
zh

[AI-50] React to This (RTT): A Nonverbal Turing Test for Embodied AI

【Quick Read】: This paper addresses how to evaluate the interaction awareness and believability of embodied AI agents, particularly in scenarios where humans push them to their limits. The key to its solution is a new guiding question, "Can machines react?", together with the React to This (RTT) test for nonverbal behaviors, which assesses an agent's capacity to react in perceptual and physical interaction.

Link: https://arxiv.org/abs/2507.10812
Authors: Chuxuan Zhang, Yasaman Etesam, Angelica Lim
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures

Click to view abstract

Abstract:We propose an approach to test embodied AI agents for interaction awareness and believability, particularly in scenarios where humans push them to their limits. Turing introduced the Imitation Game as a way to explore the question: “Can machines think?” The Total Turing Test later expanded this concept beyond purely verbal communication, incorporating perceptual and physical interaction. Building on this, we propose a new guiding question: “Can machines react?” and introduce the React to This (RTT) test for nonverbal behaviors, presenting results from an initial experiment.
zh

[AI-51] Uncertainty-Informed Scheduling of Decision Points for Intelligent Mobile Health Interventions

【Quick Read】: This paper addresses the mistimed interventions caused by scheduling decision points at fixed intervals in mobile health (mHealth) interventions. For just-in-time adaptive interventions (JITAIs) targeting habitual behaviors such as oral hygiene, fixed intervals cannot accommodate individual differences in daily routines and often trigger decision points after the target behavior has already occurred, reducing effectiveness. The key to the solution is SigmaScheduling, which dynamically schedules decision points according to the uncertainty in predicted behavior times: when timing is predictable, decision points are placed closer to the predicted time; when uncertainty is high, they are scheduled earlier to increase the likelihood of timely intervention.

Link: https://arxiv.org/abs/2507.10798
Authors: Asim H. Gazi, Bhanu T. Gullapalli, Daiqi Gao, Benjamin M. Marlin, Vivek Shetty, Susan A. Murphy
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures

Click to view abstract

Abstract:Timely decision making is critical to the effectiveness of mobile health (mHealth) interventions. At predefined timepoints called “decision points,” intelligent mHealth systems such as just-in-time adaptive interventions (JITAIs) estimate an individual’s biobehavioral context from sensor or survey data and determine whether and how to intervene. For interventions targeting habitual behavior (e.g., oral hygiene), effectiveness often hinges on delivering support shortly before the target behavior is likely to occur. Current practice schedules decision points at a fixed interval (e.g., one hour) before user-provided behavior times, and the fixed interval is kept the same for all individuals. However, this one-size-fits-all approach performs poorly for individuals with irregular routines, often scheduling decision points after the target behavior has already occurred, rendering interventions ineffective. In this paper, we propose SigmaScheduling, a method to dynamically schedule decision points based on uncertainty in predicted behavior times. When behavior timing is more predictable, SigmaScheduling schedules decision points closer to the predicted behavior time; when timing is less certain, SigmaScheduling schedules decision points earlier, increasing the likelihood of timely intervention. We evaluated SigmaScheduling using real-world data from 68 participants in a 10-week trial of Oralytics, a JITAI designed to improve daily toothbrushing. SigmaScheduling increased the likelihood that decision points preceded brushing events in at least 70% of cases, preserving opportunities to intervene and impact behavior. Our results indicate that SigmaScheduling can advance precision mHealth, particularly for JITAIs targeting time-sensitive, habitual behaviors such as oral hygiene or dietary habits.
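A minimal reading of the scheduling rule described above, with the constants and the exact functional form assumed for illustration rather than taken from the paper:

```python
def schedule_decision_point(mu, sigma, k=1.5, min_lead=0.25):
    """Place the decision point before the predicted behavior time.

    mu: predicted behavior time (hours); sigma: predictive std (hours).
    The lead time grows with uncertainty, so irregular users get earlier
    decision points; k and min_lead are illustrative constants.
    """
    lead = max(min_lead, k * sigma)
    return mu - lead

# A regular user (low sigma) gets a decision point close to the predicted
# brushing time; an irregular user (high sigma) gets one much earlier.
print(schedule_decision_point(21.0, 0.1))  # -> 20.75
print(schedule_decision_point(21.0, 1.0))  # -> 19.5
```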
zh

[AI-52] “Is it always watching? Is it always listening?” Exploring Contextual Privacy and Security Concerns Toward Domestic Social Robots

【Quick Read】: This paper addresses the security and privacy issues facing social robots in U.S. households, particularly the potential risks around data collection, information leakage, and the physical safety of users. The key to its solution is understanding users' security and privacy needs and supporting the design and adoption of social robots through transparency, usability, and robust privacy controls that meet user expectations about data inference, indicators of data collection, and context-appropriate functionality.

Link: https://arxiv.org/abs/2507.10786
Authors: Henry Bell, Jabari Kwesi, Hiba Laabadli, Pardis Emami-Naeini
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:Equipped with artificial intelligence (AI) and advanced sensing capabilities, social robots are gaining interest among consumers in the United States. These robots seem like a natural evolution of traditional smart home devices. However, their extensive data collection capabilities, anthropomorphic features, and capacity to interact with their environment make social robots a more significant security and privacy threat. Increased risks include data linkage, unauthorized data sharing, and the physical safety of users and their homes. It is critical to investigate U.S. users’ security and privacy needs and concerns to guide the design of social robots while these devices are still in the early stages of commercialization in the U.S. market. Through 19 semi-structured interviews, we identified significant security and privacy concerns, highlighting the need for transparency, usability, and robust privacy controls to support adoption. For educational applications, participants worried most about misinformation, and in medical use cases, they worried about the reliability of these devices. Participants were also concerned with the data inference that social robots could enable. We found that participants expect tangible privacy controls, indicators of data collection, and context-appropriate functionality.
zh

[AI-53] Detecting AI Assistance in Abstract Complex Tasks

【Quick Read】: This paper addresses the detection of artificial intelligence (AI) assistance in abstract tasks, a problem of growing importance in complex tasks such as text generation, medical diagnosis, and autonomous driving. Traditional approaches struggle when the data is not machine-learning friendly. The work casts AI-assistance detection as a classification task and constructs four neural-network-friendly image representations, plus a time-series representation that explicitly encodes users' exploration/exploitation behavior to improve generalizability. The key is preprocessing that makes such data suitable for deep learning models, with the time-series information further boosting detection of AI assistance in abstract tasks.

Link: https://arxiv.org/abs/2507.10761
Authors: Tyler King, Nikolos Gurney, John H. Miller, Volkan Ustun
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to HCII 2025

Click to view abstract

Abstract:Detecting assistance from artificial intelligence is increasingly important as they become ubiquitous across complex tasks such as text generation, medical diagnosis, and autonomous driving. Aid detection is challenging for humans, especially when looking at abstract task data. Artificial neural networks excel at classification thanks to their ability to quickly learn from and process large amounts of data – assuming appropriate preprocessing. We posit detecting help from AI as a classification task for such models. Much of the research in this space examines the classification of complex but concrete data classes, such as images. Many AI assistance detection scenarios, however, result in data that is not machine learning-friendly. We demonstrate that common models can effectively classify such data when it is appropriately preprocessed. To do so, we construct four distinct neural network-friendly image formulations along with an additional time-series formulation that explicitly encodes the exploration/exploitation of users, which allows for generalizability to other abstract tasks. We benchmark the quality of each image formulation across three classical deep learning architectures, along with a parallel CNN-RNN architecture that leverages the additional time series to maximize testing performance, showcasing the importance of encoding temporal and spatial quantities for detecting AI aid in abstract tasks.
zh

[AI-54] IoT Malware Network Traffic Detection using Deep Learning and GraphSAGE Models

【Quick Read】: This paper aims to detect malicious IoT attacks with deep learning models and provides a comprehensive evaluation of deep learning and graph-based models for malicious network traffic detection. The key lies in models such as GraphSAGE, Bidirectional Encoder Representations from Transformers (BERT), Temporal Convolutional Networks (TCN), multi-head attention, and bidirectional long short-term memory (BI-LSTM) networks, which capture the sequential and diverse traffic patterns of IoT systems. These models effectively learn temporal patterns and feature importance; BERT performed best in the experiments, achieving 99.94% accuracy together with high precision, recall, F1, and AUC-ROC scores, demonstrating its ability to capture temporal dependencies.

Link: https://arxiv.org/abs/2507.10758
Authors: Nikesh Prajapati, Bimal Karki, Saroj Gopali, Akbar Siami Namin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper intends to detect malicious IoT attacks through deep learning models and demonstrates a comprehensive evaluation of deep learning and graph-based models for malicious network traffic detection. The models are based on GraphSAGE, Bidirectional Encoder Representations from Transformers (BERT), a Temporal Convolutional Network (TCN) with Multi-Head Attention, a Bidirectional Long Short-Term Memory (BI-LSTM) network with Multi-Head Attention, and plain BI-LSTM and LSTM models. The chosen models demonstrated strong performance in modeling temporal patterns and detecting feature significance. This is mainly due to the fact that IoT system traffic patterns are both sequential and diverse, leaving a rich set of temporal patterns for the models to learn. Experimental results showed that BERT maintained the best performance: it achieved a 99.94% accuracy rate alongside high precision and recall, and an F1-score and AUC-ROC score of 99.99%, demonstrating its capability to capture temporal dependencies. The Multi-Head Attention model offered promising results, providing good detection capabilities with interpretable outputs, but, like the BI-LSTM variants, it required significant processing time. The GraphSAGE model achieved good absolute accuracy and required the shortest training time, but yielded the lowest accuracy, precision, and F1 score of the compared models.
zh

[AI-55] AI and the Net-Zero Journey: Energy Demand Emissions and the Potential for Transition

【Quick Read】: This paper examines the net impact of artificial intelligence (AI) on carbon dioxide (CO2) emissions over different time horizons, i.e., whether AI will have a net positive, neutral, or negative environmental impact. The key lies in analyzing AI's near-term (up to 2030) and long-term (2035 and beyond) effects on data-center energy consumption and greenhouse gas (GHG) emissions, and in exploring AI's potential to optimize energy production, supply, and consumption. The study argues that while AI growth may increase computing demand, electricity consumption, and hence CO2 emissions in the near term, in the long run AI's automation and optimization of processes across industries could significantly reduce the carbon footprint and contribute positively to climate mitigation.

Link: https://arxiv.org/abs/2507.10750
Authors: Pandu Devarakota, Nicolas Tsesmetzis, Faruk O. Alpak, Apurva Gala, Detlef Hohl
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Technical article to be submitted to Data Centric Engineering Journal

Click to view abstract

Abstract:Thanks to the availability of massive amounts of data, computing resources, and advanced algorithms, AI has entered nearly every sector. This has sparked significant investment and interest, particularly in building data centers with the necessary hardware and software to develop and operate AI models and AI-based workflows. In this technical review article, we present energy consumption scenarios of data centers and impact on GHG emissions, considering both near-term projections (up to 2030) and long-term outlook (2035 and beyond). We address the quintessential question of whether AI will have a net positive, neutral, or negative impact on CO2 emissions by 2035. Additionally, we discuss AI’s potential to automate, create efficient and disruptive workflows across various fields related to energy production, supply and consumption. In the near-term scenario, the growing demand for AI will likely strain computing resources, lead to increase in electricity consumption and therefore associated CO2 emissions. This is due to the power-hungry nature of big data centers and the requirements for training and running of large and complex AI models, as well as the penetration of AI assistant search and applications for public use. However, the long-term outlook could be more promising. AI has the potential to be a game-changer in CO2 reduction. Its ability to further automate and optimize processes across industries, from energy production to logistics, could significantly decrease our carbon footprint. This positive impact is anticipated to outweigh the initial emissions bump, creating value for businesses and society in areas where traditional solutions have fallen short. In essence, AI might cause some initial growing pains for the environment, but it has the potential to support climate mitigation efforts.
zh

[AI-56] Ground-Compose-Reinforce: Tasking Reinforcement Learning Agents through Formal Language

【Quick Read】: This paper addresses the key challenge of grounding language in complex perception (e.g., pixels) and action when building situated agents that can interact with humans via language. Traditional approaches rely on manually designed grounding mechanisms or large datasets relating language to elements of the environment. The key to the proposed solution is Ground-Compose-Reinforce, a neurosymbolic framework that grounds formal language from data and elicits behaviors by directly tasking reinforcement learning (RL) agents through that language. Data-driven learning avoids the manual design of domain-specific elements such as reward functions or symbol detectors, while compositional formal-language semantics enables data-efficient grounding and generalization to arbitrary language compositions.

Link: https://arxiv.org/abs/2507.10741
Authors: Andrew C. Li, Toryn Q. Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Grounding language in complex perception (e.g. pixels) and action is a key challenge when building situated agents that can interact with humans via language. In past works, this is often solved via manual design of the language grounding or by curating massive datasets relating language to elements of the environment. We propose Ground-Compose-Reinforce, a neurosymbolic framework for grounding formal language from data, and eliciting behaviours by directly tasking RL agents through this language. By virtue of data-driven learning, our framework avoids the manual design of domain-specific elements like reward functions or symbol detectors. By virtue of compositional formal language semantics, our framework achieves data-efficient grounding and generalization to arbitrary language compositions. Experiments on an image-based gridworld and a MuJoCo robotics domain show that our approach reliably maps formal language instructions to behaviours with limited data while end-to-end, data-driven approaches fail.
zh

[AI-57] Parsing Musical Structure to Enable Meaningful Variations

【Quick Read】: This paper addresses how to generate new musical pieces by modifying existing tunes, using a rule-based approach to automate melodic variation. The key is to parse each tune into its Pathway Assembly (PA), a grammar representing all repetitions in the tune (obtained with the Sequitur algorithm), and to mutate that grammar rather than the tune itself; expanding the mutated grammar then yields a new tune related to the original. This enables automatic manipulation of tunes via many mutation types (e.g., adding, removing, swapping, or reversing parts of the grammar) and supports studying how tunes change over multiple rounds of mutation.

Link: https://arxiv.org/abs/2507.10740
Authors: Maziar Kanani, Sean O Leary, James McDermott
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:This paper presents a novel rule-based approach for generating music by varying existing tunes. We parse each tune to find the Pathway Assembly (PA) [1], a structure representing all repetitions in the tune, using the Sequitur algorithm [2]. The result is a grammar. We then carry out mutation on the grammar, rather than on the tune directly. There are 19 potential mutation types that can be applied to the grammars, such as adding, removing, swapping, or reversing parts of the grammar. The system applies one of these mutations at random to automatically manipulate the grammar. Following the mutation, we expand the grammar, which returns a new tune. The output after one or more mutations is a new tune related to the original tune. Our study examines how tunes change gradually over the course of multiple mutations. Edit distances, structural complexity, and length of the tunes are used to show how a tune changes after multiple mutations. In addition, the size of the effect of each mutation type is analyzed. Finally, we review the musical aspects of the output tunes. It should be noted that the study focuses only on generating new pitch sequences. The study is based on an Irish traditional tune dataset, and a list of integers is used to represent each tune's pitch values.
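A toy sketch of the mutate-then-expand loop on a Sequitur-style grammar follows; the grammar here is illustrative, and the paper's 19 mutation types are richer than the single swap shown:

```python
import random

def expand(grammar, symbol="S"):
    """Recursively expand a nonterminal into a flat pitch sequence."""
    out = []
    for s in grammar[symbol]:
        if s in grammar:                 # nonterminal: expand its rule
            out.extend(expand(grammar, s))
        else:                            # terminal: a pitch value
            out.append(s)
    return out

def mutate_swap(grammar, rng=random):
    """One mutation type: swap two positions inside a randomly chosen rule."""
    g = {k: list(v) for k, v in grammar.items()}
    rule = rng.choice([k for k, v in g.items() if len(v) >= 2])
    i, j = rng.sample(range(len(g[rule])), 2)
    g[rule][i], g[rule][j] = g[rule][j], g[rule][i]
    return g

grammar = {"S": ["A", "A", 67], "A": [60, 62, 64]}   # MIDI-like pitch terminals
print(expand(grammar))                # original tune
print(expand(mutate_swap(grammar)))   # related tune after one mutation
```

Note how a single edit inside rule "A" changes both of its occurrences in the expanded tune, which is why mutating the grammar preserves repetition structure in a way that editing the flat pitch sequence would not.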
zh
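
To make the mutate-the-grammar idea concrete, here is a minimal sketch (not the authors' code): a toy Sequitur-style grammar over MIDI pitch values is mutated by reversing or swapping tokens in one rule, then expanded back into a tune. The grammar, pitch values, and two mutation types are invented for illustration; the paper uses 19 mutation types on grammars learned from Irish traditional tunes.

```python
import random

# A toy grammar in the spirit of Sequitur output: rule "S" is the tune,
# rules "A"/"B" capture repeated pitch subsequences (MIDI integers).
grammar = {
    "S": ["A", "B", "A", 71],
    "A": [60, 62, 64],
    "B": [67, "A", 69],
}

def expand(grammar, symbol="S"):
    """Recursively expand a rule back into a flat pitch sequence."""
    out = []
    for tok in grammar[symbol]:
        if isinstance(tok, str):
            out.extend(expand(grammar, tok))
        else:
            out.append(tok)
    return out

def mutate(grammar):
    """Apply one random mutation to a copy of the grammar."""
    g = {k: list(v) for k, v in grammar.items()}
    rule = random.choice(list(g))
    if random.random() < 0.5:
        g[rule].reverse()                                # reverse a rule body
    elif len(g[rule]) >= 2:
        i, j = random.sample(range(len(g[rule])), 2)
        g[rule][i], g[rule][j] = g[rule][j], g[rule][i]  # swap two tokens
    return g

random.seed(0)
print("original:", expand(grammar))
print("mutated: ", expand(mutate(grammar)))
```

Because the mutation touches a rule rather than the surface sequence, every occurrence of the repeated pattern changes consistently, which is what keeps the output musically related to the original tune.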

[AI-58] Exploring User Security and Privacy Attitudes and Concerns Toward the Use of General-Purpose LLM Chatbots for Mental Health USENIX-SECURITY

【Quick Read】: This paper addresses the privacy and security issues users face when using general-purpose large language model (LLM)-enabled conversational agents to manage their mental health, in particular users' misperceptions of LLM-based generative AI capabilities and their limited awareness of the risks. The study finds that users often mistake the human-like empathy exhibited by LLMs for human-like accountability, and wrongly believe that their interactions with these chatbots are protected by the same regulations that govern medical confidentiality (e.g., HIPAA). The paper introduces the concept of "intangible vulnerability", highlighting that disclosures of emotional or psychological information tend to be undervalued relative to more tangible information such as financial or location data. The key contribution is a set of recommendations for safeguarding users' mental-health disclosures to general-purpose LLM-enabled chatbots more effectively.

Link: https://arxiv.org/abs/2507.10695
Authors: Jabari Kwesi, Jiaxun Cao, Riya Manchanda, Pardis Emami-Naeini
Institution: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: Accepted to the 34th USENIX Security Symposium

Abstract:Individuals are increasingly relying on large language model (LLM)-enabled conversational agents for emotional support. While prior research has examined privacy and security issues in chatbots specifically designed for mental health purposes, these chatbots are overwhelmingly “rule-based” offerings that do not leverage generative AI. Little empirical research currently measures users’ privacy and security concerns, attitudes, and expectations when using general-purpose LLM-enabled chatbots to manage and improve mental health. Through 21 semi-structured interviews with U.S. participants, we identified critical misconceptions and a general lack of risk awareness. Participants conflated the human-like empathy exhibited by LLMs with human-like accountability and mistakenly believed that their interactions with these chatbots were safeguarded by the same regulations (e.g., HIPAA) as disclosures with a licensed therapist. We introduce the concept of “intangible vulnerability,” where emotional or psychological disclosures are undervalued compared to more tangible forms of information (e.g., financial or location-based data). To address this, we propose recommendations to safeguard user mental health disclosures with general-purpose LLM-enabled chatbots more effectively.

[AI-59] A Group Theoretic Analysis of the Symmetries Underlying Base Addition and Their Learnability by Neural Networks

【Quick Read】: This paper examines how neural networks, used both to model human cognitive function and in artificial intelligence, can learn efficiently in ways that support radical generalization. The key lies in discovering and implementing symmetry functions, studied here through the role of the carry function in base addition: a group-theoretic analysis exposes a range of alternative carry functions for a given base and shows how their structure affects the efficiency with which neural networks learn.

Link: https://arxiv.org/abs/2507.10678
Authors: Cutter Dawes, Simon Segert, Kamesh Krishnamurthy, Jonathan D. Cohen
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: 22 pages, 6 figures

Abstract:A major challenge in the use of neural networks both for modeling human cognitive function and for artificial intelligence is the design of systems with the capacity to efficiently learn functions that support radical generalization. At the roots of this is the capacity to discover and implement symmetry functions. In this paper, we investigate a paradigmatic example of radical generalization through the use of symmetry: base addition. We present a group theoretic analysis of base addition, a fundamental and defining characteristic of which is the carry function – the transfer of the remainder, when a sum exceeds the base modulus, to the next significant place. Our analysis exposes a range of alternative carry functions for a given base, and we introduce quantitative measures to characterize these. We then exploit differences in carry functions to probe the inductive biases of neural networks in symmetry learning, by training neural networks to carry out base addition using different carries, and comparing efficacy and rate of learning as a function of their structure. We find that even simple neural networks can achieve radical generalization with the right input format and carry function, and that learning speed is closely correlated with carry function structure. We then discuss the relevance this has for cognitive science and machine learning.
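
As a concrete reference point for the carry function the paper analyses, the sketch below implements ordinary base-b addition on digit lists, with the carry rule factored out as a swappable parameter. The standard carry is divmod (pass the remainder's overflow to the next significant place); the paper's point is that alternative carry functions exist for a given base and differ in how learnable they are. This is an illustrative sketch, not code from the paper.

```python
def add_base(x, y, base=10, carry=lambda s, b: divmod(s, b)):
    """Add two little-endian digit lists in a given base.
    `carry` maps a digit sum to (carry_out, digit); the standard
    carry is divmod, but alternative carry functions can be swapped
    in, which is the degree of freedom the paper analyses.
    """
    n = max(len(x), len(y))
    x, y = x + [0] * (n - len(x)), y + [0] * (n - len(y))
    out, c = [], 0
    for a, b in zip(x, y):
        c, d = carry(a + b + c, base)  # carry out, digit kept in place
        out.append(d)
    if c:
        out.append(c)
    return out

# 27 + 45 in base 10, digits least-significant first: [7, 2] + [5, 4]
print(add_base([7, 2], [5, 4]))  # -> [2, 7], i.e. 72
```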

[AI-60] CodeAssistBench (CAB): Dataset Benchmarking for Multi-turn Chat-Based Code Assistance

【Quick Read】: This paper addresses the fact that current benchmarks for programming assistants focus narrowly on code-generation tasks and lack support for multi-turn interaction in realistic project environments. The key is CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings; it automatically generates scalable datasets from question-related GitHub issues and automatically containerizes the codebases, so that models can be evaluated via simulated users with full codebase access.

Link: https://arxiv.org/abs/2507.10646
Authors: Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, Anoop Deoras
Institution: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings that address real-world questions about actual codebases. Unlike existing programming QA benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB’s recent issues. This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions.

[AI-61] First-of-its-kind AI model for bioacoustic detection using a lightweight associative memory Hopfield neural network

【Quick Read】: This paper targets the analysis burden created by the vast amounts of data that passive acoustic monitoring devices generate in conservation bioacoustics. The key is a novel AI model that implements associative memory via a transparent, explainable Hopfield neural network, storing representative signals and detecting similar ones for species classification. The model trains quickly (only one representative signal per target sound is needed), is fast at inference (processing 10384 bat recordings in 5.4 s on a standard Apple MacBook Air), and has a small memory footprint (144.09 MB of RAM), making it suitable for a range of standard personal devices, with potential for field deployment on edge-computing devices.

Link: https://arxiv.org/abs/2507.10642
Authors: Andrew Gascoyne, Wendy Lomas
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures

Abstract:A growing issue within conservation bioacoustics is the task of analysing the vast amount of data generated from the use of passive acoustic monitoring devices. In this paper, we present an alternative AI model which has the potential to help alleviate this problem. Our model formulation addresses the key issues encountered when using current AI models for bioacoustic analysis, namely the: limited training data available; environmental impact, particularly in energy consumption and carbon footprint of training and implementing these models; and associated hardware requirements. The model developed in this work uses associative memory via a transparent, explainable Hopfield neural network to store signals and detect similar signals which can then be used to classify species. Training is rapid (~3 ms), as only one representative signal is required for each target sound within a dataset. The model is fast, taking only 5.4 s to pre-process and classify all 10384 publicly available bat recordings, on a standard Apple MacBook Air. The model is also lightweight with a small memory footprint of 144.09 MB of RAM usage. Hence, the low computational demands make the model ideal for use on a variety of standard personal devices with potential for deployment in the field via edge-processing devices. It is also competitively accurate, with up to 86% precision on the dataset used to evaluate the model. In fact, we could not find a single case of disagreement between model and manual identification via expert field guides. Although a dataset of bat echolocation calls was chosen to demo this first-of-its-kind AI model, trained on only two representative calls, the model is not species specific. In conclusion, we propose an equitable AI model that has the potential to be a game changer for fast, lightweight, sustainable, transparent, explainable and accurate bioacoustic analysis.
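
A minimal sketch of the underlying idea: a classical Hopfield associative memory stores one template per class and classifies a corrupted probe by which stored pattern it settles near. The binary 64-dimensional templates and two species labels below are toy assumptions, not the paper's bat-call representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One binary (+1/-1) "template" per target call type; in the paper's
# setting these would come from one representative recording per species.
templates = {name: rng.choice([-1, 1], size=64) for name in ["sp_A", "sp_B"]}

# Hebbian storage: W is the sum of outer products of the stored patterns.
P = np.stack(list(templates.values()))
W = P.T @ P / P.shape[1]
np.fill_diagonal(W, 0)

def recall(x, steps=10):
    """Synchronous Hopfield updates until the state settles."""
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

# Corrupt a template, recall it, then classify by nearest stored pattern.
probe = templates["sp_A"].copy()
flip = rng.choice(64, size=8, replace=False)
probe[flip] *= -1
settled = recall(probe)
scores = {k: int(v @ settled) for k, v in templates.items()}
print(max(scores, key=scores.get))  # -> "sp_A" (with high probability)
```

Storage is a single matrix update per class, which is why training with one representative signal per sound is enough, and why the whole model stays small and transparent.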

[AI-62] A Code Comprehension Benchmark for Large Language Models for Code

【Quick Read】: This paper addresses the underperformance of large language models on tasks requiring deep semantic understanding of code, such as code debugging and code optimization. The paper argues that this is because these models mainly learn surface-level syntactic patterns of code rather than capturing its semantics. The key to the solution is fine-tuning the models on large-scale datasets to strengthen their grasp of code semantics. Experiments show that fine-tuning substantially improves performance on code-comprehension tasks, most notably on the Subjectivity Grading Task, where the QWQ-32B model's accuracy rises from 70% to 83.47%, and the DPO-fine-tuned Codestral-22B achieves the highest micro-accuracy of 87.66%.

Link: https://arxiv.org/abs/2507.10641
Authors: Jayant Havare, Saurav Chaudhary, Ganesh Ramakrishnan, Kaushik Maharajan, Srikanth Tamilselvam
Institution: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 5 Figures

Abstract:Large Language Models have shown impressive capabilities in coding tasks like code generation and code completion, as they have been trained on a large amount of code data. Also, since one of the core pretraining objectives is Next Token Prediction, these models tend to learn surface-level syntactic patterns in code. However, this does not guarantee code comprehension ability, i.e., the ability to capture the semantics of the code. In our opinion, this is the reason why these models often underperform on tasks that require deeper semantic understanding, such as code debugging and code optimization. To address this, we propose fine-tuning these models specifically for code comprehension tasks using large-scale datasets, enabling them to develop a more robust understanding of code semantics. We evaluate three code models of varying sizes on a suite of code comprehension tasks designed to assess semantic understanding beyond surface-level syntactic pattern matching. In particular, we analyze performance on the Subjectivity Grading Task and observe that model performance improves after fine-tuning on relevant downstream tasks. The most significant improvement is seen in the QWQ-32B model, where accuracy increases from 70% to 83.47%. A similar trend is observed across the other models, clearly indicating an enhancement in code comprehension ability. Among the models studied, the DPO-fine-tuned Codestral-22B achieves the highest micro-accuracy of 87.66% on the Subjectivity Grading Task.

[AI-63] SPICEAssistant: LLM using SPICE Simulation Tools for Schematic Design of Switched-Mode Power Supplies

【Quick Read】: This paper examines the capability limits of large language models (LLMs) in electronic design automation (EDA), particularly in switched-mode power supply (SMPS) design. Specifically, it looks at LLMs' shortcomings in understanding, adapting, and dimensioning electronic circuits, especially their limited ability to interpret results from key simulation tools such as SPICE and to handle the multi-step design process. The key is SPICEAssistant, a framework that provides the LLM with a broad selection of tools serving as an interface to SPICE, letting the model interact flexibly with the simulator and estimate the impact of its modifications to the circuit.

Link: https://arxiv.org/abs/2507.10639
Authors: Simon Nau, Jan Krummenauer, André Zimmermann
Institution: unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 11 pages, 10 figures

Abstract:State-of-the-art large language models (LLMs) show high performance across a wide range of tasks in many domains of science. In the field of electronic design automation (EDA), it is yet to be determined to what extent they are capable to understand, adapt, and dimension electronic circuits. This paper focuses on the application of LLMs to switched-mode power supply (SMPS) design on printed circuit boards (PCBs). Particular challenges for LLMs in this context include their limited ability to interpret results from key simulation tools like SPICE and the multi-step design process. To address these challenges, we suggest SPICEAssistant, a framework that provides a broad selection of tools to an LLM. The tools serve as an interface to SPICE, allowing the LLM to interact flexibly with the simulator to estimate the impact of its modifications to the circuit. To evaluate the performance of SPICEAssistant, we defined a benchmark consisting of 256 questions testing the ability to adapt circuit netlists to fulfil different SMPS design tasks. The benchmarking results show that simulation feedback effectively improves SMPS design capabilities of LLMs. An increasing number of simulation iterations leads to enhanced performance. The SPICEAssistant framework significantly outperforms the standalone LLM GPT-4o on the benchmark by approximately 38%.

[AI-64] GeoHopNet: Hopfield-Augmented Sparse Spatial Attention for Dynamic UAV Site Location Problem

【Quick Read】: This paper tackles the dynamic siting of UAV landing points and supply stations driven by the rapid growth of the urban low-altitude UAV economy, where traditional deep reinforcement learning runs into computational-complexity bottlenecks on city-scale location problems, particularly with standard attention mechanisms. The key is GeoHopNet, a Hopfield-augmented sparse spatial attention network with four core innovations: a distance-biased multi-head attention mechanism, K-nearest-neighbor sparse attention that reduces complexity from O(N^2) to O(NK), a modern Hopfield external memory module, and a memory regularization strategy. Together these let GeoHopNet solve large-scale instances efficiently, significantly outperforming existing methods.

Link: https://arxiv.org/abs/2507.10636
Authors: Jianing Zhi, Xinghua Li, Zidong Chen
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Comments: 12 Pages, 5 Figures

Abstract:The rapid development of urban low-altitude unmanned aerial vehicle (UAV) economy poses new challenges for dynamic site selection of UAV landing points and supply stations. Traditional deep reinforcement learning methods face computational complexity bottlenecks, particularly with standard attention mechanisms, when handling large-scale urban-level location problems. This paper proposes GeoHopNet, a Hopfield-augmented sparse spatial attention network specifically designed for dynamic UAV site location problems. Our approach introduces four core innovations: (1) a distance-biased multi-head attention mechanism that explicitly encodes spatial geometric information; (2) K-nearest neighbor sparse attention that reduces computational complexity from O(N^2) to O(NK); (3) a modern Hopfield external memory module; and (4) a memory regularization strategy. Experimental results demonstrate that GeoHopNet extends the boundary of solvable problem sizes. For large-scale instances with 1,000 nodes, where standard attention models become prohibitively slow (over 3 seconds per instance) and traditional solvers fail, GeoHopNet finds high-quality solutions (0.22% optimality gap) in under 0.1 seconds. Compared to the state-of-the-art ADNet baseline on 100-node instances, our method improves solution quality by 22.2% and is 1.8x faster.
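
The complexity reduction comes from letting each node attend only to its k spatial nearest neighbours. Below is a single-head NumPy sketch of K-nearest-neighbour sparse attention; the distance-bias terms, multiple heads, and Hopfield memory of GeoHopNet are omitted, and all shapes are illustrative.

```python
import numpy as np

def knn_sparse_attention(Q, K, V, coords, k=8):
    """Each query attends only to the k nearest nodes in space, cutting
    attention cost from O(N^2) to O(N*k) score evaluations. The neighbour
    search below is done with a dense distance matrix for clarity; a
    KD-tree would make that step sub-quadratic as well.
    """
    N, d = Q.shape
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    nbrs = np.argsort(dist, axis=1)[:, :k]        # (N, k) neighbour indices
    out = np.empty_like(V)
    for i in range(N):
        scores = Q[i] @ K[nbrs[i]].T / np.sqrt(d) # (k,) scaled dot products
        w = np.exp(scores - scores.max())
        w /= w.sum()                              # softmax over k neighbours
        out[i] = w @ V[nbrs[i]]
    return out

rng = np.random.default_rng(0)
N, d = 100, 16
coords = rng.random((N, 2))                       # node locations in the plane
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(knn_sparse_attention(Q, K, V, coords).shape)  # (100, 16)
```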

[AI-65] Scalable Unsupervised Segmentation via Random Fourier Feature-based Gaussian Process

【Quick Read】: This paper addresses the matrix-inversion cost that makes the Gaussian process hidden semi-Markov model (GP-HSMM) expensive on time-series data, a cost that grows severe as the data scale increases. The key is to use random Fourier features (RFF) to approximate the Gaussian process with linear regression, eliminating the kernel-matrix inversion while preserving expressive power and greatly improving computational efficiency.

Link: https://arxiv.org/abs/2507.10632
Authors: Issei Saito, Masatoshi Nagano, Tomoaki Nakamura, Daichi Mochihashi, Koki Mimura
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose RFF-GP-HSMM, a fast unsupervised time-series segmentation method that incorporates random Fourier features (RFF) to address the high computational cost of the Gaussian process hidden semi-Markov model (GP-HSMM). GP-HSMM models time-series data using Gaussian processes, requiring inversion of an N × N kernel matrix during training, where N is the number of data points. As the scale of the data increases, matrix inversion incurs a significant computational cost. To address this, the proposed method approximates the Gaussian process with linear regression using RFF, preserving expressive power while eliminating the need for inversion of the kernel matrix. Experiments on the Carnegie Mellon University (CMU) motion-capture dataset demonstrate that the proposed method achieves segmentation performance comparable to that of conventional methods, with approximately 278 times faster segmentation on time-series data comprising 39,200 frames.
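
The core trick, replacing kernel-matrix inversion with linear regression in a random-feature space, can be sketched in a few lines. The feature map below is the standard random Fourier feature approximation to an RBF kernel; the lengthscale, feature count, and toy 1-D regression task are illustrative assumptions, not the paper's GP-HSMM.

```python
import numpy as np

def rff_features(X, n_features=200, lengthscale=1.0, seed=0):
    """Random Fourier features approximating an RBF kernel: the GP mean
    can then be fit by ridge regression over a fixed-size feature matrix,
    avoiding inversion of the N x N kernel matrix entirely."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features)) / lengthscale
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Toy regression: the linear solve is over a 200 x 200 system regardless
# of how many data points N there are.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
Phi = rff_features(X)
lam = 1e-2
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(np.abs(Phi @ w - y).mean())  # small residual on the training data
```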

[AI-66] Enhancing the Capabilities of Large Language Models for API calls through Knowledge Graphs

【Quick Read】: This paper addresses the limited ability of large language models (LLMs) to use tools effectively via API calls in knowledge-intensive domains such as meteorology. The key is KG2data, a system that integrates knowledge graphs, LLMs, ReAct agents, and tool-use technologies to enable intelligent data acquisition and query handling. By using a knowledge graph as persistent memory, KG2data improves content retrieval, complex query handling, domain-specific reasoning, semantic-relationship resolution, and heterogeneous data integration, while avoiding the high cost of fine-tuning LLMs, making the system more adaptable to evolving domain knowledge and API structures.

Link: https://arxiv.org/abs/2507.10630
Authors: Ye Yang, Xue Xiao, Ping Yin, Taotao Xie
Institution: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:API calls by large language models (LLMs) offer a cutting-edge approach for data analysis. However, their ability to effectively utilize tools via API calls remains underexplored in knowledge-intensive domains like meteorology. This paper introduces KG2data, a system that integrates knowledge graphs, LLMs, ReAct agents, and tool-use technologies to enable intelligent data acquisition and query handling in the meteorological field. Using a virtual API, we evaluate API call accuracy across three metrics: name recognition failure, hallucination failure, and call correctness. KG2data achieves superior performance (1.43%, 0%, 88.57%) compared to RAG2data (16%, 10%, 72.14%) and chat2data (7.14%, 8.57%, 71.43%). KG2data differs from typical LLM-based systems by addressing their limited access to domain-specific knowledge, which hampers performance on complex or terminology-rich queries. By using a knowledge graph as persistent memory, our system enhances content retrieval, complex query handling, domain-specific reasoning, semantic relationship resolution, and heterogeneous data integration. It also mitigates the high cost of fine-tuning LLMs, making the system more adaptable to evolving domain knowledge and API structures. In summary, KG2data provides a novel solution for intelligent, knowledge-based question answering and data analysis in domains with high knowledge demands.

[AI-67] SQLord: A Robust Enterprise Text-to-SQL Solution via Reverse Data Generation and Workflow Decomposition WWW’25

【Quick Read】: This paper aims to overcome the challenges natural-language-to-SQL (NL2SQL) faces in real enterprise applications, including the shortcomings of existing frameworks on complex business logic and the lack of domain-specific data for fine-tuning. Moreover, conventional evaluation methods depend on annotated data and executable database environments, which are scarce in real-world settings. The key is a data reverse-generation approach that converts raw SQL statements into annotated data for supervised fine-tuning (SFT), together with an automated workflow generator that decomposes complex queries. SQLord also provides a comprehensive GPT-Judge evaluation framework covering Execution Evaluation (EXE), Query-SQL Evaluation (QSE), and SQL-SQL Evaluation (SSE), tailored to diverse scenarios.

Link: https://arxiv.org/abs/2507.10629
Authors: Song Cheng, Qiannan Cheng, Linbo Jin, Lei Yi, Guannan Zhang
Institution: unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: WWW '25: Companion Proceedings of the ACM on Web Conference 2025, Pages 919-923, this https URL

Abstract:Transforming natural language into SQL queries (NL2SQL) is crucial for data-driven business applications. Existing frameworks, trained on open-source datasets, struggle with complex business logic and lack domain-specific data for fine-tuning. Additionally, evaluation methods often require annotated data and executable database environments, which are scarce in real-world scenarios. To address these challenges, we propose SQLord, an enterprise-level NL2SQL framework. First, SQLord introduces a data reverse generation approach to convert raw SQL statements into annotated data for supervised fine-tuning (SFT). Second, it proposes a decomposition method for complex queries using an automated workflow generator. Additionally, SQLord features a comprehensive GPT-Judge evaluation framework, including Execution Evaluation (EXE), Query-SQL Evaluation (QSE), and SQL-SQL Evaluation (SSE), tailored to diverse scenarios. In offline tests, SQLord significantly outperforms state-of-the-art baselines, and its online accuracy consistently exceeds 90%, highlighting SQLord's advantages and effectiveness in complex real-world scenarios. SQLord has been successfully applied across multiple scenarios on the world's largest B2B e-commerce platform.

[AI-68] GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning

【Quick Read】: This paper targets the training instability and inefficiency of on-policy reinforcement learning for large language models (LLMs), which stem from a mismatch between the complexity of the training data and the model's current capability, leading to sparse reward signals and stalled learning. The key is Guided Hybrid Policy Optimization (GHPO), which dynamically calibrates task difficulty via adaptive prompt refinement, applying direct imitation learning to problems beyond the model's current reach and exploration-based reinforcement learning to manageable tasks, thereby creating a smooth, optimized learning curriculum.

Link: https://arxiv.org/abs/2507.10628
Authors: Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model’s current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model’s reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.

[AI-69] Player-Team Heterogeneous Interaction Graph Transformer for Soccer Outcome Prediction

【Quick Read】: This paper aims to fix the inaccurate modeling of soccer-match dynamics that results from existing methods overlooking the complex, heterogeneous interactions between players and teams. The key is HIGFormer, which captures fine-grained player dynamics and high-level team interactions through a multi-level interaction framework comprising a Player Interaction Network, a Team Interaction Network, and a Match Comparison Transformer, thereby modeling match dynamics more comprehensively and improving prediction accuracy.

Link: https://arxiv.org/abs/2507.10626
Authors: Lintao Wang, Shiwen Xu, Michael Horton, Joachim Gudmundsson, Zhiyong Wang
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting soccer match outcomes is a challenging task due to the inherently unpredictable nature of the game and the numerous dynamic factors influencing results. While it conventionally relies on meticulous feature engineering, deep learning techniques have recently shown a great promise in learning effective player and team representations directly for soccer outcome prediction. However, existing methods often overlook the heterogeneous nature of interactions among players and teams, which is crucial for accurately modeling match dynamics. To address this gap, we propose HIGFormer (Heterogeneous Interaction Graph Transformer), a novel graph-augmented transformer-based deep learning model for soccer outcome prediction. HIGFormer introduces a multi-level interaction framework that captures both fine-grained player dynamics and high-level team interactions. Specifically, it comprises (1) a Player Interaction Network, which encodes player performance through heterogeneous interaction graphs, combining local graph convolutions with a global graph-augmented transformer; (2) a Team Interaction Network, which constructs interaction graphs from a team-to-team perspective to model historical match relationships; and (3) a Match Comparison Transformer, which jointly analyzes both team and player-level information to predict match outcomes. Extensive experiments on the WyScout Open Access Dataset, a large-scale real-world soccer dataset, demonstrate that HIGFormer significantly outperforms existing methods in prediction accuracy. Furthermore, we provide valuable insights into leveraging our model for player performance evaluation, offering a new perspective on talent scouting and team strategy analysis.

[AI-70] Comprehension Without Competence: Architectural Limits of LLM s in Symbolic Computation and Reasoning

【Quick Read】: This paper addresses the systematic failures of large language models (LLMs) on tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. The key finding is a structural split between comprehension and competence, which the authors call a computational split-brain syndrome: the instruction and execution pathways are geometrically and functionally dissociated, so models can articulate correct principles yet fail to apply them reliably, a failure rooted in computational execution rather than knowledge access.

Link: https://arxiv.org/abs/2507.10624
Authors: Zheng Zhang
Institution: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Substantial change to previous version (experiments, theorem, analysis and related work); currently under review at TMLR

Abstract:Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structural diagnosis of such failures, revealing a persistent gap between comprehension and competence. Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them, a failure rooted not in knowledge access, but in computational execution. We term this phenomenon the computational split-brain syndrome, where instruction and action pathways are geometrically and functionally dissociated. This core limitation recurs across domains, from mathematical operations to relational inferences, and explains why model behavior remains brittle even under idealized prompting. We argue that LLMs function as powerful pattern completion engines, but lack the architectural scaffolding for principled, compositional reasoning. Our findings delineate the boundary of current LLM capabilities and motivate future models with metacognitive control, principle lifting, and structurally grounded execution. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles, and why the geometric separation between instruction and execution pathways suggests limitations in neural introspection and mechanistic analysis.

[AI-71] Spectral Feature Extraction for Robust Network Intrusion Detection Using MFCCs

【Quick Read】: This paper targets the security vulnerabilities that accompany the rapid expansion of Internet of Things (IoT) networks, specifically the detection and classification of anomalous traffic. The key is to combine learnable Mel-frequency cepstral coefficients (MFCCs) with a ResNet-18 deep learning model: the MFCCs capture the temporal patterns in network traffic, and ResNet-18 performs efficient feature extraction and multiclass classification, improving the robustness and scalability of anomaly detection in heterogeneous IoT environments.

Link: https://arxiv.org/abs/2507.10622
Authors: HyeYoung Lee, Muhammad Nadeem, Pavel Tsoi
Institution: unknown
Subjects: Cryptography and Security (cs.CR); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The rapid expansion of Internet of Things (IoT) networks has led to a surge in security vulnerabilities, emphasizing the critical need for robust anomaly detection and classification techniques. In this work, we propose a novel approach for identifying anomalies in IoT network traffic by leveraging the Mel-frequency cepstral coefficients (MFCC) and ResNet-18, a deep learning model known for its effectiveness in feature extraction and image-based tasks. Learnable MFCCs enable adaptive spectral feature representation, capturing the temporal patterns inherent in network traffic more effectively than traditional fixed MFCCs. We demonstrate that transforming raw signals into MFCCs maps the data into a higher-dimensional space, enhancing class separability and enabling more effective multiclass classification. Our approach combines the strengths of MFCCs with the robust feature extraction capabilities of ResNet-18, offering a powerful framework for anomaly detection. The proposed model is evaluated on three widely used IoT intrusion detection datasets: CICIoT2023, NSL-KDD, and IoTID20. The experimental results highlight the potential of integrating adaptive signal processing techniques with deep learning architectures to achieve robust and scalable anomaly detection in heterogeneous IoT network landscapes.
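
A rough sketch of the pipeline shape, with fixed MFCCs standing in for the paper's learnable variant and a random placeholder trace instead of real IoT traffic: compute an MFCC map from a 1-D signal with librosa, then feed it to a torchvision ResNet-18 whose first convolution is adapted to one channel. The sample rate and class count are illustrative assumptions.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Treat a traffic trace as a 1-D signal and turn it into an MFCC "image".
signal = np.random.randn(16000).astype(np.float32)          # placeholder trace
mfcc = librosa.feature.mfcc(y=signal, sr=16000, n_mfcc=40)  # shape (40, T)

# ResNet-18 expects 3-channel images; swap in a 1-channel stem and a
# classification head sized for the dataset (34 classes is illustrative).
model = resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 34)
model.eval()

x = torch.from_numpy(mfcc.astype(np.float32))[None, None]   # (1, 1, 40, T)
with torch.no_grad():
    logits = model(x)
print(logits.shape)                                         # torch.Size([1, 34])
```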

[AI-72] Game Theory Meets LLM and Agent ic AI: Reimagining Cybersecurity for the Age of Intelligent Threats

【Quick Read】: This paper addresses the limitations of traditional cybersecurity approaches against sophisticated threats, namely their reliance on manual responses and brittle heuristics, which make proactive, intelligent defense systems hard to build. The key is to combine game theory with generative AI: game theory provides the rigorous foundation for modeling adversarial behavior and designing strategic defenses, while large language models (LLMs) and agentic AI turn abstract strategies into practical decisions, bridging theory and practice. LLMs also prompt a re-examination of classical game-theoretic assumptions, encouraging new models aligned with cognitive and computational realities and, ultimately, secure, intelligent, and adaptive cyber systems.

Link: https://arxiv.org/abs/2507.10621
Authors: Quanyan Zhu
Institution: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Protecting cyberspace requires not only advanced tools but also a shift in how we reason about threats, trust, and autonomy. Traditional cybersecurity methods rely on manual responses and brittle heuristics. To build proactive and intelligent defense systems, we need integrated theoretical frameworks and software tools. Game theory provides a rigorous foundation for modeling adversarial behavior, designing strategic defenses, and enabling trust in autonomous systems. Meanwhile, software tools process cyber data, visualize attack surfaces, verify compliance, and suggest mitigations. Yet a disconnect remains between theory and practical implementation. The rise of Large Language Models (LLMs) and agentic AI offers a new path to bridge this gap. LLM-powered agents can operationalize abstract strategies into real-world decisions. Conversely, game theory can inform the reasoning and coordination of these agents across complex workflows. LLMs also challenge classical game-theoretic assumptions, such as perfect rationality or static payoffs, prompting new models aligned with cognitive and computational realities. This co-evolution promises richer theoretical foundations and novel solution concepts. Agentic AI also reshapes software design: systems must now be modular, adaptive, and trust-aware from the outset. This chapter explores the intersection of game theory, agentic AI, and cybersecurity. We review key game-theoretic frameworks (e.g., static, dynamic, Bayesian, and signaling games) and solution concepts. We then examine how LLM agents can enhance cyber defense and introduce LLM-driven games that embed reasoning into AI agents. Finally, we explore multi-agent workflows and coordination games, outlining how this convergence fosters secure, intelligent, and adaptive cyber systems.

[AI-73] LLM s Meet Cross-Modal Time Series Analytics: Overview and Directions

【Quick Read】: This paper addresses the cross-modality gap between large language models (LLMs) and time-series data in cross-modal time-series analytics. The key is a taxonomy of cross-modal modeling strategies, namely conversion, alignment, and fusion, that bridge the difference between textual and time-series data, improving the applicability and performance of LLMs on time-series analysis tasks.

Link: https://arxiv.org/abs/2507.10620
Authors: Chenxi Liu, Hao Miao, Cheng Long, Yan Zhao, Ziyue Li, Panos Kalnis
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at SSTD 2025 (Tutorial). arXiv admin note: text overlap with arXiv:2505.02583

Abstract:Large Language Models (LLMs) have emerged as a promising paradigm for time series analytics, leveraging their massive parameters and the shared sequential nature of textual and time series data. However, a cross-modality gap exists between time series and textual data, as LLMs are pre-trained on textual corpora and are not inherently optimized for time series. In this tutorial, we provide an up-to-date overview of LLM-based cross-modal time series analytics. We introduce a taxonomy that classifies existing approaches into three groups based on cross-modal modeling strategies, e.g., conversion, alignment, and fusion, and then discuss their applications across a range of downstream tasks. In addition, we summarize several open challenges. This tutorial aims to expand the practical application of LLMs in solving real-world problems in cross-modal time series analytics while balancing effectiveness and efficiency. Participants will gain a thorough understanding of current advancements, methodologies, and future research directions in cross-modal time series analytics.

[AI-74] Meta-Reinforcement Learning for Fast and Data-Efficient Spectrum Allocation in Dynamic Wireless Networks

【Quick Read】: This paper tackles the efficiency of dynamic spectrum allocation in 5G/6G networks, where traditional deep reinforcement learning (DRL) is hard to apply because of its high sample complexity and the safety risks of unguided exploration. The key is a meta-learning framework that lets agents learn a robust initial policy and adapt quickly to new wireless scenarios with minimal data, improving resource utilization and reducing network interference.

Link: https://arxiv.org/abs/2507.10619
Authors: Oluwaseyi Giwa, Tobi Awodunmila, Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Ali Jamshed
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 6 figures, under review at IEEE Wireless Communications Letters

Abstract:The dynamic allocation of spectrum in 5G / 6G networks is critical to efficient resource utilization. However, applying traditional deep reinforcement learning (DRL) is often infeasible due to its immense sample complexity and the safety risks associated with unguided exploration, which can cause severe network interference. To address these challenges, we propose a meta-learning framework that enables agents to learn a robust initial policy and rapidly adapt to new wireless scenarios with minimal data. We implement three meta-learning architectures, model-agnostic meta-learning (MAML), recurrent neural network (RNN), and an attention-enhanced RNN, and evaluate them against a non-meta-learning DRL algorithm, a proximal policy optimization (PPO) baseline, in a simulated dynamic integrated access/backhaul (IAB) environment. Our results show a clear performance gap. The attention-based meta-learning agent reaches a peak mean network throughput of 48 Mbps, while the PPO baseline drops drastically to 10 Mbps. Furthermore, our method reduces SINR and latency violations by more than 50% compared to PPO. It also shows quick adaptation, with a fairness index of 0.7, indicating better resource allocation. This work proves that meta-learning is a very effective and safer option for intelligent control in complex wireless systems.

[AI-75] Compute Requirements for Algorithmic Innovation in Frontier AI Models

【Quick Read】: This paper investigates the compute required to develop the algorithmic innovations used in pretraining large language models, and how compute restrictions would affect algorithmic progress. The key is a systematic catalog of 36 pre-training algorithmic innovations used in Llama 3 and DeepSeek-V3, estimating for each the total FLOP consumed during development and the FLOP/s of the hardware used, in order to assess trends in compute requirements and the likely effect of compute caps on innovation.

Link: https://arxiv.org/abs/2507.10618
Authors: Peter Barnett
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Algorithmic innovation in the pretraining of large language models has driven a massive reduction in the total compute required to reach a given level of capability. In this paper we empirically investigate the compute requirements for developing algorithmic innovations. We catalog 36 pre-training algorithmic innovations used in Llama 3 and DeepSeek-V3. For each innovation we estimate both the total FLOP used in development and the FLOP/s of the hardware utilized. Innovations using significant resources double in their requirements each year. We then use this dataset to investigate the effect of compute caps on innovation. Our analysis suggests that compute caps alone are unlikely to dramatically slow AI algorithmic progress. Even stringent compute caps – such as capping total operations to the compute used to train GPT-2 or capping hardware capacity to 8 H100 GPUs – could still have allowed for half of the cataloged innovations.
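
The headline trend, development compute for significant innovations doubling each year, compounds quickly, as the toy calculation below shows. Both FLOP figures are illustrative assumptions: the GPT-2 number is a rough public estimate, and the starting cost is invented, neither is taken from this paper.

```python
# If development compute doubles each year, an innovation that cost
# C FLOP in year 0 costs C * 2**t in year t.
gpt2_cap = 1.5e21            # rough public estimate of GPT-2 training FLOP
cost_2019 = 1e19             # hypothetical development cost in 2019
for t in range(9):
    need = cost_2019 * 2 ** t
    status = "over GPT-2 cap" if need > gpt2_cap else "under cap"
    print(2019 + t, f"{need:.2e} FLOP", status)
```

Under these made-up numbers the cap only binds after roughly eight doublings, which echoes the paper's conclusion that even stringent caps would still have permitted many of the cataloged innovations.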

[AI-76] Fine-tuning Large Language Model for Automated Algorithm Design

【Quick Read】: This paper asks how large language models (LLMs) can be effectively adapted to algorithm design: whether LLMs specifically trained for algorithm design are needed, how such models can be obtained effectively, and how well they generalize across algorithm-design tasks. The key is a Diversity-Aware Rank-based (DAR) sampling strategy that balances the diversity and quality of the training data, combined with direct preference optimization to align LLM outputs with task objectives, improving model performance on algorithm-design tasks.

Link: https://arxiv.org/abs/2507.10614
Authors: Fei Liu, Rui Zhang, Xi Lin, Zhichao Lu, Qingfu Zhang
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks, leaving a key question open: Do we need LLMs specifically tailored for algorithm design? If so, how can such LLMs be effectively obtained and how well can they generalize across different algorithm design tasks? In this paper, we take a first step toward answering these questions by exploring fine-tuning of LLMs for algorithm design. We introduce a Diversity-Aware Rank based (DAR) sampling strategy to balance training data diversity and quality, then we leverage direct preference optimization to efficiently align LLM outputs with task objectives. Our experiments, conducted on Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct, span three distinct algorithm design tasks. Results suggest that fine-tuned LLMs can significantly outperform their off-the-shelf counterparts with the smaller Llama-3.2-1B-Instruct and match the larger Llama-3.1-8B-Instruct on the admissible set problem. Moreover, we observe promising generalization: LLMs fine-tuned on specific algorithm design tasks also improve performance on related tasks with varying settings. These findings highlight the value of task-specific adaptation for LLMs in algorithm design and open new avenues for future research.

[AI-77] Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLM s

【Quick Read】: This paper revisits the slowdown in performance gains that traditional NLP scaling laws fail to predict for large language models, a phenomenon known as sub-scaling. The key is identifying high data density and non-optimal resource allocation as the main drivers of sub-scaling, and proposing a sub-optimal scaling law that better predicts performance in the sub-scaling regime, underscoring the critical role of data quality and diversity.

Link: https://arxiv.org/abs/2507.10613
Authors: Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate, which is a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.
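
One simple way high data density can produce sub-scaling: if redundancy means only an effective D_eff = D^gamma of the tokens is novel (gamma < 1), the apparent power law bends and returns diminish. The constants below are invented for illustration and are not the paper's fitted law.

```python
import numpy as np

D = np.logspace(8, 12, 5)        # training tokens
a, b = 3.0, 0.08                 # illustrative power-law constants
ideal = a * D ** -b              # unique data: a clean power law in D
dense = a * (D ** 0.7) ** -b     # redundant data: D_eff = D**0.7
for d, i, s in zip(D, ideal, dense):
    print(f"D={d:.0e}  ideal loss={i:.3f}  high-density loss={s:.3f}")
```

The gap between the two curves widens as D grows: with redundant data, each additional order of magnitude of tokens buys noticeably less loss reduction, which is the qualitative signature of the sub-scaling regime.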

[AI-78] LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

【Quick Read】: This paper addresses the vulnerability of GUI agents built on multimodal large language models (MLLMs) to pop-up-based environmental injection attacks, in which malicious visual elements divert the model's attention and lead to unsafe or incorrect actions. The key is LaSM, a layer-wise scaling mechanism that selectively amplifies the attention and MLP modules in critical layers, improving the alignment between model saliency and task-relevant regions and raising the defense success rate without any additional training.

Link: https://arxiv.org/abs/2507.10610
Authors: Zihe Yan, Zhuosheng Zhang
Institution: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures

Abstract:Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across 12 types of pop-up perturbations and 4 different model backbones show that LaSM consistently enhances the defense success rate. When combined with prompt-level alerts, LaSM achieves over 98% robustness even under strong inductive attacks. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation.
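
A training-free, layer-selective scaling of attention outputs can be prototyped with PyTorch forward hooks, as in the hedged sketch below. It assumes a HuggingFace-style model.model.layers[i].self_attn layout whose forward returns a tuple with the hidden states first, and it scales only attention outputs; LaSM also amplifies MLP modules and picks the critical layers via its attention-divergence analysis.

```python
import torch

def scale_layers(model, layer_ids, alpha=1.5):
    """Register forward hooks that multiply the self-attention outputs
    of selected transformer layers by alpha at inference time. A rough
    sketch of layer-wise scaling, not the paper's full method.
    """
    handles = []
    for i in layer_ids:
        attn = model.model.layers[i].self_attn

        def hook(module, args, output, a=alpha):
            # output[0] holds the attention hidden states in this layout
            return (output[0] * a,) + tuple(output[1:])

        handles.append(attn.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to restore the model
```

Scaling factors above 1 sharpen the contribution of the chosen layers; deciding which layers to scale is exactly what the paper's divergence analysis determines.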

[AI-79] DALI-PD: Diffusion-based Synthetic Layout Heatmap Generation for ML in Physical Design

【Quick Read】: This paper targets the limited generalizability of machine learning (ML) models in physical design (PD), caused by the high cost of producing large, high-quality training datasets and by intellectual-property restrictions. The key is DALI-PD, a framework that uses a diffusion model to rapidly generate diverse layout heatmaps, including power, IR drop, congestion, macro placement, and cell density maps, thereby accelerating ML research in PD. Heatmaps generated by DALI-PD closely resemble real layouts and improve ML accuracy on downstream tasks such as IR-drop and congestion prediction.

Link: https://arxiv.org/abs/2507.10606
Authors: Bing-Yue Wu, Vidya A. Chhabria
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Under review at the Asia and South Pacific Design Automation Conference (ASP-DAC'26)

Abstract:Machine learning (ML) has demonstrated significant promise in various physical design (PD) tasks. However, model generalizability remains limited by the availability of high-quality, large-scale training datasets. Creating such datasets is often computationally expensive and constrained by IP. While very few public datasets are available, they are typically static, slow to generate, and require frequent updates. To address these limitations, we present DALI-PD, a scalable framework for generating synthetic layout heatmaps to accelerate ML in PD research. DALI-PD uses a diffusion model to generate diverse layout heatmaps via fast inference in seconds. The heatmaps include power, IR drop, congestion, macro placement, and cell density maps. Using DALI-PD, we created a dataset comprising over 20,000 layout configurations with varying macro counts and placements. These heatmaps closely resemble real layouts and improve ML accuracy on downstream ML tasks such as IR drop or congestion prediction.

[AI-80] RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services

【Quick Read】: This paper aims to improve content management and interaction quality on social networking services (SNS), where existing LLM work focuses on isolated tasks, suffering diminishing returns from data scaling within individual scenarios and failing to adapt flexibly to diverse real-world contexts. The key is RedOne, a domain-specific large language model (LLM) for SNS developed through a three-stage training strategy (continue pretraining, supervised fine-tuning, and preference optimization) on a large-scale real-world dataset; it breaks the performance bottleneck of single-task baselines and shows substantial gains and strong generalization across SNS tasks.

Link: https://arxiv.org/abs/2507.10605
Authors: Fei Zhao, Chonggang Lu, Yue Wang, Zheyong Xie, Ziyan Liu, Haofu Qian, JianZhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, Xinze Lyu, Yiming Lu, Ziyang Xiang, Zheyu Ye, Chengqiang Lu, Zhe Xu, Yi Wu, Yao Hu, Yan Gao, Jun Fan, Xiaolong Jiang, Weiting Liu, Boyang Wang, Shaosheng Cao
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Abstract:As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has proposed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions but existing studies focus on isolated tasks, which not only encounter diminishing benefit from the data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world context. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities, and achieves an average improvement up to 14.02% across 8 major SNS tasks and 7.56% in SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-tasks finetuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.

[AI-81] Learning to Move in Rhythm: Task-Conditioned Motion Policies with Orbital Stability Guarantees

【Quick Read】: This paper addresses the limitations of classical dynamic motion primitives in capturing complex periodic behaviors and interpolating between tasks, which restricts their use in practical settings such as locomotion and rhythmic tool use. The key is Orbitally Stable Motion Primitives (OSMPs), a framework that combines a learned diffeomorphic encoder with a supercritical Hopf bifurcation in latent space, enabling accurate acquisition of periodic motions from demonstrations with formal guarantees of orbital stability and transverse contraction. By conditioning the bijective encoder on the task, a single learned policy can represent multiple motion objectives, yielding consistent zero-shot generalization to unseen motion objectives within the training distribution.

Link: https://arxiv.org/abs/2507.10602
Authors: Maximilian Stölzle, T. Konstantin Rusch, Zach J. Patterson, Rodrigo Pérez-Dattari, Francesco Stella, Josie Hughes, Cosimo Della Santina, Daniela Rus
Institution: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 73 pages

Abstract:Learning from demonstration provides a sample-efficient approach to acquiring complex behaviors, enabling robots to move robustly, compliantly, and with fluidity. In this context, Dynamic Motion Primitives offer built-in stability and robustness to disturbances but often struggle to capture complex periodic behaviors. Moreover, they are limited in their ability to interpolate between different tasks. These shortcomings substantially narrow their applicability, excluding a wide class of practically meaningful tasks such as locomotion and rhythmic tool use. In this work, we introduce Orbitally Stable Motion Primitives (OSMPs), a framework that combines a learned diffeomorphic encoder with a supercritical Hopf bifurcation in latent space, enabling the accurate acquisition of periodic motions from demonstrations while ensuring formal guarantees of orbital stability and transverse contraction. Furthermore, by conditioning the bijective encoder on the task, we enable a single learned policy to represent multiple motion objectives, yielding consistent zero-shot generalization to unseen motion objectives within the training distribution. We validate the proposed approach through extensive simulation and real-world experiments across a diverse range of robotic platforms, from collaborative arms and soft manipulators to a bio-inspired rigid-soft turtle robot, demonstrating its versatility and effectiveness in consistently outperforming state-of-the-art baselines such as diffusion policies, among others.
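
The latent dynamic at the heart of OSMPs is the normal form of a supercritical Hopf bifurcation, whose limit cycle of radius sqrt(mu) attracts every nearby trajectory. Below is a minimal Euler-integration sketch with illustrative constants; the paper composes such a dynamic with a learned diffeomorphism, which is omitted here.

```python
import numpy as np

def hopf_step(x, y, mu=1.0, omega=2 * np.pi, dt=1e-3):
    """One Euler step of a supercritical Hopf oscillator. For mu > 0,
    every trajectory converges to a stable limit cycle of radius
    sqrt(mu), giving the orbitally stable periodic latent motion.
    """
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    return x + dt * dx, y + dt * dy

x, y = 0.05, 0.0                 # start near the unstable equilibrium
for _ in range(20000):
    x, y = hopf_step(x, y)
print(round(float(np.hypot(x, y)), 3))  # ~1.0 = sqrt(mu): settled on the cycle
```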

[AI-82] Divide-Then-Rule: A Cluster-Driven Hierarchical Interpolator for Attribute-Missing Graphs

【Quick Read】: This paper addresses deep graph clustering (DGC) for attribute-missing graphs, i.e., partitioning nodes with incomplete attributes into distinct clusters. Existing imputation methods fail to account for how much information is available in each node's neighborhood, producing unreliable results for nodes with insufficient known neighbors. The key is Divide-Then-Rule Graph Completion (DTRGC), which handles nodes in order of how complete their neighborhood information is, treating imputed results as new knowledge for iteratively imputing harder nodes, and uses clustering information to correct imputation errors, improving both imputation quality and clustering performance.

Link: https://arxiv.org/abs/2507.10595
Authors: Yaowen Hu, Wenxuan Tu, Yue Liu, Miaomiao Li, Wenpeng Lu, Zhigang Luo, Xinwang Liu, Ping Chen
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep graph clustering (DGC) for attribute-missing graphs is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Addressing this challenging issue is vital for practical applications. However, research in this area remains underexplored. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods, leading to unreliable results, especially for nodes with insufficient known neighborhood. To address this issue, we propose a novel method named Divide-Then-Rule Graph Completion (DTRGC). This method first addresses nodes with sufficient known neighborhood information and treats the imputed results as new knowledge to iteratively impute more challenging nodes, while leveraging clustering information to correct imputation errors. Specifically, Dynamic Cluster-Aware Feature Propagation (DCFP) initializes missing node attributes by adjusting propagation weights based on the clustering structure. Subsequently, Hierarchical Neighborhood-aware Imputation (HNAI) categorizes attribute-missing nodes into three groups based on the completeness of their neighborhood attributes. The imputation is performed hierarchically, prioritizing the groups with nodes that have the most available neighborhood information. The cluster structure is then used to refine the imputation and correct potential errors. Finally, Hop-wise Representation Enhancement (HRE) integrates information across multiple hops, thereby enriching the expressiveness of node representations. Experimental results on six widely used graph datasets show that DTRGC significantly improves the clustering performance of various DGC methods under attribute-missing graphs.

[AI-83] Extension OL-MDISF: Online Learning from Mix-Typed Drifted and Incomplete Streaming Features

【Quick Read】: This paper addresses three key challenges in online learning: mixed feature types in heterogeneous data streams that defeat traditional parametric modeling, drifting data distributions that cause abrupt drops in model performance, and the infeasibility of labeling every instance due to time and cost constraints. The key is OL-MDISF (Online Learning from Mix-typed, Drifted, and Incomplete Streaming Features), which builds a latent copula-based representation for heterogeneous features, detects drift via ensemble entropy and latent mismatch, and uses structure-aware pseudo-labeling for weakly supervised settings.

Link: https://arxiv.org/abs/2507.10594
Authors: Shengda Zhuo, Di Wu, Yi He, Shuqiang Huang, Xindong Wu
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Online learning, where feature spaces can change over time, offers a flexible learning paradigm that has attracted considerable attention. However, it still faces three significant challenges. First, the heterogeneity of real-world data streams with mixed feature types presents challenges for traditional parametric modeling. Second, data stream distributions can shift over time, causing an abrupt and substantial decline in model performance. Third, it is often infeasible to label every data instance due to time and cost constraints. To address these issues, we proposed OL-MDISF (Online Learning from Mix-typed, Drifted, and Incomplete Streaming Features), which constructs a latent copula-based representation for heterogeneous features, detects drifts via ensemble entropy and latent mismatch, and performs structure-aware pseudo-labeling. This companion paper serves as a standalone technical reference to OL-MDISF. It provides a contextual discussion of related work in mixed-type modeling, drift adaptation, and weak supervision, as well as a comprehensive set of experiments across 14 real-world datasets under two types of drift scenarios. These include CER trends, ablation studies, sensitivity analyses, and temporal ensemble dynamics. We hope this document offers a reproducible benchmark for online learning on complex, weakly supervised streaming data.

[AI-84] oolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLM s

【Quick Read】: This paper addresses the fragmentation, protocol limitations, and implementation complexity that large language model (LLM) applications face when integrating external tools, all of which create substantial development overhead. The key is ToolRegistry, a protocol-agnostic tool-management library that simplifies tool registration, representation, execution, and lifecycle management through a unified interface, substantially reducing tool-integration code, improving performance, and remaining fully compatible with OpenAI function-calling standards.

Link: https://arxiv.org/abs/2507.10593
Authors: Peng Ding
Institution: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents ToolRegistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management via a unified interface. Our evaluation demonstrates that ToolRegistry achieves a 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and 100% compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. ToolRegistry is open-source and available at this https URL, with comprehensive documentation at this https URL.
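
To illustrate what a unified, protocol-agnostic interface can look like, here is a minimal hypothetical registry, not ToolRegistry's actual API: tools register once, and the registry can emit OpenAI-style function schemas and dispatch calls by name. The schema field names follow the OpenAI function-calling format; everything else is invented for the sketch.

```python
from typing import Any, Callable, Dict
import inspect

class MiniToolRegistry:
    """Hypothetical sketch of a unified tool interface."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        self._tools[fn.__name__] = fn
        return fn                      # usable as a decorator

    def schemas(self) -> list:
        """Emit OpenAI-style function schemas (string params for brevity)."""
        return [{
            "type": "function",
            "function": {
                "name": name,
                "description": (fn.__doc__ or "").strip(),
                "parameters": {
                    "type": "object",
                    "properties": {p: {"type": "string"}
                                   for p in inspect.signature(fn).parameters},
                },
            },
        } for name, fn in self._tools.items()]

    def execute(self, name: str, **kwargs: Any) -> Any:
        return self._tools[name](**kwargs)

registry = MiniToolRegistry()

@registry.register
def get_weather(city: str) -> str:
    """Return a canned weather string for a city."""
    return f"Sunny in {city}"

print(registry.execute("get_weather", city="Paris"))
print(registry.schemas()[0]["function"]["name"])
```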

[AI-85] MH-FSF: A Unified Framework for Overcoming Benchmarking and Reproducibility Limitations in Feature Selection Evaluation

【Quick Read】: This paper addresses two problems in current feature-selection research: insufficient benchmarking and reliance on proprietary datasets, which severely hinder reproducibility and can hurt overall performance. The key is the MH-FSF framework, a comprehensive, modular, and extensible platform for reproducing and implementing feature-selection methods; it provides implementations of 17 methods (11 classical, 6 domain-specific) and supports systematic evaluation on 10 publicly available Android malware datasets.

Link: https://arxiv.org/abs/2507.10591
Authors: Vanderson Rocha, Diego Kreutz, Gabriel Canto, Hendrio Bragança, Eduardo Feitosa
Institution: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Performance (cs.PF)
Comments: 11 pages; 4 figures; 5 tables; submitted to JBCS

Abstract:Feature selection is vital for building effective predictive models, as it reduces dimensionality and emphasizes key features. However, current research often suffers from limited benchmarking and reliance on proprietary datasets. This severely hinders reproducibility and can negatively impact overall performance. To address these limitations, we introduce the MH-FSF framework, a comprehensive, modular, and extensible platform designed to facilitate the reproduction and implementation of feature selection methods. Developed through collaborative research, MH-FSF provides implementations of 17 methods (11 classical, 6 domain-specific) and enables systematic evaluation on 10 publicly available Android malware datasets. Our results reveal performance variations across both balanced and imbalanced datasets, highlighting the critical need for data preprocessing and selection criteria that account for these asymmetries. We demonstrate the importance of a unified platform for comparing diverse feature selection techniques, fostering methodological consistency and rigor. By providing this framework, we aim to significantly broaden the existing literature and pave the way for new research directions in feature selection, particularly within the context of Android malware detection.

[AI-86] Repairing Language Model Pipelines by Meta Self-Refining Competing Constraints at Runtime

【Quick Read】: This paper addresses the collapse in effectiveness of language model (LM) pipelines when they face competing soft constraints, which leads to inefficient backtracking loops where satisfying one constraint violates another. The key is Meta Self-Refining, a framework that adds a meta-corrective layer to LM pipelines to detect and repair these conflicts at runtime/inference time. It monitors the pipeline's execution history to spot oscillatory failures and then invokes a meta-repairer LM that analyzes the overall state of the backtracking attempts and synthesizes a strategic instruction to balance the competing requirements, guiding the original LM out of the failing refinement loop toward a successful output.

Link: https://arxiv.org/abs/2507.10590
Authors: Mojtaba Eshghie
Institution: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Language Model (LM) pipelines can dynamically refine their outputs against programmatic constraints. However, their effectiveness collapses when faced with competing soft constraints, leading to inefficient backtracking loops where satisfying one constraint violates another. We introduce Meta Self-Refining, a framework that equips LM pipelines with a meta-corrective layer to repair these competitions at runtime/inference-time. Our approach monitors the pipeline’s execution history to detect oscillatory failures. Upon detection, it invokes a meta-repairer LM that analyzes the holistic state of the backtracking attempts and synthesizes a strategic instruction to balance the competing requirements. This self-repair instruction guides the original LM out of a failing refining loop towards a successful output. Our results show Meta Self-Refining can successfully repair these loops, leading to more efficient LM programs.

[AI-87] ARPaCCino: An Agent ic-RAG for Policy as Code Compliance

【速读】:该论文试图解决Policy as Code (PaC)在实际应用中因政策语言复杂性和配置错误风险而导致的采用障碍。其解决方案的关键在于提出ARPaCCino系统,该系统结合大型语言模型(LLMs)、检索增强生成(RAG)和基于工具的验证,以自动化生成和验证PaC规则,并通过迭代优化IaC配置确保合规性。

链接: https://arxiv.org/abs/2507.10584
作者: Francesco Romeo,Luigi Arena,Francesco Blefari,Francesco Aurelio Pironti,Matteo Lupinacci,Angelo Furfaro
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Policy as Code (PaC) is a paradigm that encodes security and compliance policies into machine-readable formats, enabling automated enforcement in Infrastructure as Code (IaC) environments. However, its adoption is hindered by the complexity of policy languages and the risk of misconfigurations. In this work, we present ARPaCCino, an agentic system that combines Large Language Models (LLMs), Retrieval-Augmented-Generation (RAG), and tool-based validation to automate the generation and verification of PaC rules. Given natural language descriptions of the desired policies, ARPaCCino generates formal Rego rules, assesses IaC compliance, and iteratively refines the IaC configurations to ensure conformance. Thanks to its modular agentic architecture and integration with external tools and knowledge bases, ARPaCCino supports policy validation across a wide range of technologies, including niche or emerging IaC frameworks. Experimental evaluation involving a Terraform-based case study demonstrates ARPaCCino’s effectiveness in generating syntactically and semantically correct policies, identifying non-compliant infrastructures, and applying corrective modifications, even when using smaller, open-weight LLMs. Our results highlight the potential of agentic RAG architectures to enhance the automation, reliability, and accessibility of PaC workflows.
zh

[AI-88] Droid: A Resource Suite for AI-Generated Code Detection

【速读】:该论文旨在解决机器生成代码检测器在不同编程语言和现实编码领域中泛化能力不足的问题。其关键解决方案是构建了大规模的开放数据集 DroidCollection,包含多种编程语言、AI生成代码、人机协作代码以及对抗样本,并基于此数据集开发了基于多任务目标训练的编码器-only检测器 DroidDetect,同时通过度量学习和基于不确定性的重采样方法提升检测器在可能噪声分布下的训练效果。

链接: https://arxiv.org/abs/2507.10583
作者: Daniil Orel,Indraneil Paul,Iryna Gurevych,Preslav Nakov
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:In this work, we compile DroidCollection, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop DroidDetect, a suite of encoder-only detectors trained using a multi-task objective over DroidCollection. Our experiments show that existing detectors’ performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.
zh

[AI-89] Universal Approximation Theorem for a Single-Layer Transformer

【速读】:该论文试图解决对深度学习模型,特别是Transformer架构的理论理解不足的问题。其解决方案的关键在于提出一个通用逼近定理,证明单层Transformer(包含自注意力层和带有ReLU激活函数的位置前馈网络)可以在紧致域上以任意精度逼近任何连续的序列到序列映射。这一结果为Transformer模型的理论基础提供了重要支撑,并有助于弥合理论与实践之间的差距。

链接: https://arxiv.org/abs/2507.10581
作者: Esmail Gumaan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 7 pages, 2 figures, 1 theorem, 10 formulas

点击查看摘要

Abstract:Deep learning employs multi-layer neural networks trained via the backpropagation algorithm. This approach has achieved success across many domains and relies on adaptive gradient methods such as the Adam optimizer. Sequence modeling evolved from recurrent neural networks to attention-based models, culminating in the Transformer architecture. Transformers have achieved state-of-the-art performance in natural language processing (for example, BERT and GPT-3) and have been applied in computer vision and computational biology. However, theoretical understanding of these models remains limited. In this paper, we examine the mathematical foundations of deep learning and Transformers and present a novel theoretical result. We review key concepts from linear algebra, probability, and optimization that underpin deep learning, and we analyze the multi-head self-attention mechanism and the backpropagation algorithm in detail. Our main contribution is a universal approximation theorem for Transformers: we prove that a single-layer Transformer, comprising one self-attention layer followed by a position-wise feed-forward network with ReLU activation, can approximate any continuous sequence-to-sequence mapping on a compact domain to arbitrary precision. We provide a formal statement and a complete proof. Finally, we present case studies that demonstrate the practical implications of this result. Our findings advance the theoretical understanding of Transformer models and help bridge the gap between theory and practice.
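为直观说明定理所涉及的结构,下面给出单层 Transformer(一个自注意力层后接带 ReLU 的逐位置前馈网络)的 PyTorch 极简示意。维度与头数为示例性假设;为简洁起见省略了残差连接与归一化(定理的架构描述并未要求它们)。

```python
import torch
import torch.nn as nn

class SingleLayerTransformer(nn.Module):
    """极简示意:一个自注意力层 + 带 ReLU 的逐位置前馈网络,
    即定理所讨论的单层 Transformer 结构(超参数为示例取值)。"""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)  # 自注意力:query = key = value = x
        return self.ffn(attn_out)         # 前馈网络逐位置作用

x = torch.randn(2, 10, 64)
y = SingleLayerTransformer()(x)  # 输出形状 (2, 10, 64):一个序列到序列映射
```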
zh

[AI-90] When and Where do Data Poisons Attack Textual Inversion? ICCV

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在文本反转(Textual Inversion, TI)过程中受到的中毒攻击问题,此类攻击会破坏模型的鲁棒性和个性化效果。论文提出的关键解决方案是Safe-Zone Training (SZT),其核心包括三个组成部分:(1) 使用JPEG压缩削弱高频中毒信号,(2) 将TI训练限制在高噪声时间步,以避开低时间步的对抗性信号,(3) 通过损失掩码将学习约束在相关概念区域。

链接: https://arxiv.org/abs/2507.10578
作者: Jeremy Styborski,Mingzhi Lyu,Jiayou Lu,Nupur Kapur,Adams Kong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted to ICCV

点击查看摘要

Abstract:Poisoning attacks pose significant challenges to the robustness of diffusion models (DMs). In this paper, we systematically analyze when and where poisoning attacks textual inversion (TI), a widely used personalization technique for DMs. We first introduce Semantic Sensitivity Maps, a novel method for visualizing the influence of poisoning on text embeddings. Second, we identify and experimentally verify that DMs exhibit non-uniform learning behavior across timesteps, focusing on lower-noise samples. Poisoning attacks inherit this bias and inject adversarial signals predominantly at lower timesteps. Lastly, we observe that adversarial signals distract learning away from relevant concept regions within training data, corrupting the TI process. Based on these insights, we propose Safe-Zone Training (SZT), a novel defense mechanism comprised of 3 key components: (1) JPEG compression to weaken high-frequency poison signals, (2) restriction to high timesteps during TI training to avoid adversarial signals at lower timesteps, and (3) loss masking to constrain learning to relevant regions. Extensive experiments across multiple poisoning methods demonstrate that SZT greatly enhances the robustness of TI against all poisoning attacks, improving generative quality beyond prior published defenses. Code: this http URL Data: this http URL
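下面给出 SZT 第一个组件(JPEG 压缩削弱高频中毒信号)的一个 Pillow 极简示意;quality 取值为示例性假设,并非论文给定参数。

```python
import io
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int = 75) -> Image.Image:
    """将图像重新编码为 JPEG,以衰减高频中毒信号(SZT 的组件 1)。
    quality 的取值为示例性假设。"""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()  # .copy() 强制加载,脱离底层缓冲区
```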
zh

[AI-91] Enhancing Cross Entropy with a Linearly Adaptive Loss Function for Optimized Classification Performance

【速读】:该论文试图解决传统交叉熵损失函数在分类任务中优化效率和性能的局限性。其解决方案的关键在于引入一种线性自适应交叉熵损失函数(Linearly Adaptive Cross Entropy Loss),该函数在标准交叉熵损失的基础上增加了一个依赖于真实类别预测概率的项,从而提升分类任务中的优化效果。

链接: https://arxiv.org/abs/2507.10574
作者: Jae Wan Shim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:We propose the Linearly Adaptive Cross Entropy Loss function. This is a novel measure derived from information theory. In comparison to the standard cross entropy loss function, the proposed one has an additional term that depends on the predicted probability of the true class. This feature serves to enhance the optimization process in classification tasks involving one-hot encoded class labels. The proposed loss has been evaluated on a ResNet-based model using the CIFAR-100 dataset. Preliminary results show that it consistently outperforms the standard cross entropy loss function in terms of classification accuracy. Moreover, it maintains simplicity, achieving practically the same efficiency as the traditional cross entropy loss. These findings suggest that our approach could broaden the scope for future research into loss function design.
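摘要仅说明在标准交叉熵之外增加了一个依赖真实类别预测概率的项,未给出其具体形式。下面的 PyTorch 示意中,附加项取 alpha * (1 - p_true) 的线性形式,这只是一个假设性示例,并非论文的原始定义。

```python
import torch
import torch.nn.functional as F

def linearly_adaptive_ce(logits, target, alpha=1.0):
    """标准交叉熵 + 一个依赖真实类别预测概率 p_true 的附加项。
    附加项的线性形式 alpha * (1 - p_true) 为假设性示例:
    摘要只说明存在这样一个与 p_true 相关的项。"""
    ce = F.cross_entropy(logits, target)
    p_true = F.softmax(logits, dim=-1).gather(1, target.unsqueeze(1)).squeeze(1)
    return ce + alpha * (1.0 - p_true).mean()

logits = torch.randn(8, 100, requires_grad=True)  # 8 个样本,100 类(如 CIFAR-100)
target = torch.randint(0, 100, (8,))
loss = linearly_adaptive_ce(logits, target)
loss.backward()
```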
zh

[AI-92] AI Mother Tongue: Self-Emergent Communication in MARL via Endogenous Symbol Systems

【速读】:该论文试图解决去中心化多智能体强化学习(Decentralized Multi-Agent Reinforcement Learning, MARL)中因“联合探索困境”导致的“通信真空均衡”问题,即智能体难以形成有效通信机制。其解决方案的关键在于引入一种基于向量量化变分自编码器(VQ-VAE)的“AI母语”(AIM)框架,使智能体具备内生符号系统,从而自然产生语义压缩和纳什均衡驱动的语义收敛,实现无需外部归纳偏置的有效符号通信。

链接: https://arxiv.org/abs/2507.10566
作者: Hung Ming Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
备注: 30 pages, 4 figures

点击查看摘要

Abstract:In Decentralized Multi-Agent Reinforcement Learning (MARL), the development of Emergent Communication has long been constrained by the “Joint Exploration Dilemma”, leading agents to fall into a “Communication Vacuum Equilibrium”. Traditional methods address this by introducing inductive biases to facilitate communication emergence. This study fundamentally questions whether such artificial inductive biases are, in fact, over-engineering. Through experiments with the “AI Mother Tongue” (AIM) framework, based on a Vector Quantized Variational Autoencoder (VQ-VAE), we demonstrate that when agents possess an endogenous symbol system, their neural representations naturally exhibit spontaneous semantic compression and Nash equilibrium-driven semantic convergence, achieving effective symbolic communication without external inductive biases. This aligns with recent neuroscience findings suggesting that the human brain does not directly use human language for internal thought, and resonates with research on “soft thinking” capabilities in Large Language Models (LLMs). Compared to traditional explicit communication methods, AIM demonstrates stronger generality and efficiency. The interpretable analysis toolkit developed in this study confirms that symbol usage exhibits a significant power-law distribution, leading to three major theoretical insights: the “Neural Communication Hypothesis”, the “Tool-First Principle”, and the “Semantic Interpretability Paradigm”. Future research will explore the integration of Hierarchical Quantized Variational Autoencoders (HQ-VAE) to enhance AIM’s complex expressive capabilities and investigate the potential for Reinforcement Learning (RL) “Low-Level Pre-training”. This discovery offers new avenues for bridging symbolism and connectionism.
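AIM 框架建立在 VQ-VAE 之上。下面给出其核心的向量量化步骤(将连续潜变量映射到最近的码本条目,从而得到离散“符号”)的极简示意;码本大小与维度均为示例性假设。

```python
import torch

def vector_quantize(z, codebook):
    """将每个潜向量映射到最近的码本条目,得到离散“符号”。
    z: (batch, d),codebook: (K, d)。"""
    dists = torch.cdist(z, codebook)  # (batch, K) 两两欧氏距离
    symbols = dists.argmin(dim=1)     # 离散符号索引
    z_q = codebook[symbols]           # 量化后的潜变量
    return z_q, symbols

codebook = torch.randn(32, 16)  # K=32 个符号,16 维潜空间(示例取值)
z = torch.randn(4, 16)
z_q, symbols = vector_quantize(z, codebook)
```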
zh

[AI-93] Tool-to-Tool Matching Analysis Based Difference Score Computation Methods for Semiconductor Manufacturing

【速读】:该论文试图解决工具到工具匹配(Tool-to-Tool Matching, TTTM)问题,也称为半导体制造设备中的腔室匹配问题。传统方法依赖于静态配置数据或难以获取的黄金参考,且在异构环境中表现不佳,其中设备来自不同供应商且型号各异。论文提出新的TTTM分析流程以克服这些问题,其关键假设是不匹配的设备在数据中会表现出更高的方差和/或更多的模式数。

链接: https://arxiv.org/abs/2507.10564
作者: Sameera Bharadwaja H.,Siddhrath Jandial,Shashank S. Agashe,Rajesh Kumar Reddy Moore,Youngkwan Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We consider the problem of tool-to-tool matching (TTTM), also called chamber matching, in the context of semiconductor manufacturing equipment. Traditional TTTM approaches utilize static configuration data or depend on a golden reference, which are difficult to obtain in a commercial manufacturing line. Further, existing methods do not extend very well to a heterogeneous setting, where equipment are of different make-and-model, sourced from different equipment vendors. We propose novel TTTM analysis pipelines to overcome these issues. We hypothesize that a mismatched equipment would have higher variance and/or a higher number of modes in the data. Our best univariate method achieves a correlation coefficient of 0.95 with the variance and 0.5 with the number of modes, showing that the proposed methods are effective. Also, the best multivariate method achieves a correlation coefficient of 0.75 with the top-performing univariate methods, showing its effectiveness. Finally, we analyze the sensitivity of the multivariate algorithms to the algorithm hyper-parameters.
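论文的假设是:不匹配设备的数据方差更大且/或模式数更多。下面用 numpy/SciPy 给出单变量指标的一个示意:以样本方差和直方图峰值数近似这两个统计量;分箱数与峰值判据均为示例性选择,并非论文的具体方法。

```python
import numpy as np
from scipy.signal import find_peaks

def tttm_scores(x, bins=50):
    """单变量失配指标示意:样本方差 + 直方图峰值数(近似模式数)。
    分箱数与 prominence 判据均为示例性选择。"""
    variance = np.var(x)
    hist, _ = np.histogram(x, bins=bins, density=True)
    peaks, _ = find_peaks(hist, prominence=0.05 * hist.max())
    return variance, len(peaks)

x = np.concatenate([np.random.normal(0, 1, 500),
                    np.random.normal(4, 1, 500)])  # 双峰玩具数据
print(tttm_scores(x))  # 方差偏大,模式数约为 2
```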
zh

[AI-94] A Biomimetic Way for Coral-Reef-Inspired Swarm Intelligence for Carbon-Neutral Wastewater Treatment

【速读】:该论文旨在解决废水处理过程中实现能源中性净化的挑战,特别是如何在保证高效污染物去除的同时降低能耗和碳排放。其解决方案的关键在于引入一种受珊瑚礁启发的 Swarm Interaction Network(群体交互网络),该方法结合了形态发生抽象与多任务碳意识,通过线性令牌复杂度实现可扩展性,从而有效缓解能源消耗与污染物去除之间的矛盾。

链接: https://arxiv.org/abs/2507.10563
作者: Antonis Messinis
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With increasing wastewater rates, achieving energy-neutral purification is challenging. We introduce a coral-reef-inspired Swarm Interaction Network for carbon-neutral wastewater treatment, combining morphogenetic abstraction with multi-task carbon awareness. Scalability stems from linear token complexity, mitigating the energy-removal problem. Compared with seven baselines, our approach achieves 96.7% removal efficiency, 0.31 kWh/m³ energy consumption, and 14.2 g/m³ CO₂ emissions. Variance analysis demonstrates robustness under sensor drift. Field scenarios (insular lagoons, brewery spikes, and desert greenhouses) show potential diesel savings of up to 22%. However, data-science staffing remains an impediment. Future work will integrate AutoML wrappers within the project scope, although governance restrictions pose interpretability challenges that require further visual analytics.
zh

[AI-95] SAMEP: A Secure Protocol for Persistent Context Sharing Across AI Agents

【速读】:该论文试图解决当前人工智能代理架构中存在的短暂性记忆限制问题,这阻碍了跨会话和代理边界的有效协作与知识共享。解决方案的关键在于提出SAMEP(Secure Agent Memory Exchange Protocol),该协议通过分布式记忆存储库、基于向量的语义搜索、加密访问控制(AES-256-GCM)以及与现有代理通信协议(MCP, A2A)兼容的标准API,实现了持久化、安全且语义可搜索的记忆共享。

链接: https://arxiv.org/abs/2507.10562
作者: Hari Masoor
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
备注: 7 pages, 4 figures, 3 implementation examples. Original work submitted as a preprint

点击查看摘要

Abstract:Current AI agent architectures suffer from ephemeral memory limitations, preventing effective collaboration and knowledge sharing across sessions and agent boundaries. We introduce SAMEP (Secure Agent Memory Exchange Protocol), a novel framework that enables persistent, secure, and semantically searchable memory sharing among AI agents. Our protocol addresses three critical challenges: (1) persistent context preservation across agent sessions, (2) secure multi-agent collaboration with fine-grained access control, and (3) efficient semantic discovery of relevant historical context. SAMEP implements a distributed memory repository with vector-based semantic search, cryptographic access controls (AES-256-GCM), and standardized APIs compatible with existing agent communication protocols (MCP, A2A). We demonstrate SAMEP’s effectiveness across diverse domains including multi-agent software development, healthcare AI with HIPAA compliance, and multi-modal processing pipelines. Experimental results show 73% reduction in redundant computations, 89% improvement in context relevance scores, and complete compliance with regulatory requirements including audit trail generation. SAMEP enables a new paradigm of persistent, collaborative AI agent ecosystems while maintaining security and privacy guarantees.
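SAMEP 的访问控制基于 AES-256-GCM。下面用 Python 的 cryptography 库给出加密单条记忆条目的极简示意;其中 seal_memory 函数名与返回的记录字段布局均为假设性示例,并非协议规定的 API。实际系统中密钥应由密钥管理服务托管,而非随数据一起返回。

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def seal_memory(plaintext: bytes, agent_id: str) -> dict:
    """用 AES-256-GCM 加密一条记忆,并将 agent id 绑定为关联数据
    (篡改 agent id 会导致解密失败)。记录布局为假设性示例;
    真实部署中 key 应交由密钥管理服务保管。"""
    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)  # GCM 约定的 96 位随机 nonce
    aesgcm = AESGCM(key)
    ciphertext = aesgcm.encrypt(nonce, plaintext, agent_id.encode())
    return {"nonce": nonce, "ciphertext": ciphertext, "key": key}

record = seal_memory(b"shared context for session 42", "agent-A")
```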
zh

[AI-96] Collaboration Promotes Group Resilience in Multi-Agent AI

【速读】:该论文试图解决多智能体强化学习(MARL)环境中代理对环境突变的鲁棒性问题,即如何使多个智能体组成的群体在动态场景中保持有效运作。论文提出了一种名为“群体鲁棒性”的多智能体鲁棒性概念,并假设通过智能体间的协作是实现这一目标的关键。实验结果表明,所有测试的协作方法在群体鲁棒性方面均优于非协作方法,证明了协作机制在提升多智能体系统适应环境扰动能力中的重要性。

链接: https://arxiv.org/abs/2111.06614
作者: Sarah Keren,Matthias Gerstgrasser,Ofir Abu,Jeffrey Rosenschein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: RLC 2025

点击查看摘要

Abstract:To effectively operate in various dynamic scenarios, RL agents must be resilient to unexpected changes in their environment. Previous work on this form of resilience has focused on single-agent settings. In this work, we introduce and formalize a multi-agent variant of resilience, which we term group resilience. We further hypothesize that collaboration with other agents is key to achieving group resilience; collaborating agents adapt better to environmental perturbations in multi-agent reinforcement learning (MARL) settings. We test our hypothesis empirically by evaluating different collaboration protocols and examining their effect on group resilience. Our experiments show that all the examined collaborative approaches achieve higher group resilience than their non-collaborative counterparts.
zh

[AI-97] Recursive Bound-Constrained AdaGrad with Applications to Multilevel and Domain Decomposition Minimization

【速读】:该论文试图解决带有边界约束、不精确梯度和噪声的优化问题,其解决方案的关键在于提出两种无需目标函数的噪声容忍算法(OFFO),这两种算法分别基于多级方法和域分解方法,能够利用二阶信息。它们都是无约束优化中一阶AdaGrad算法的推广,并共享一个统一的收敛性/复杂性理论,证明了在高概率下,两种方法在计算有界约束问题的ε-近似一阶临界点时,所需迭代次数和噪声梯度评估次数均为O(ε⁻²)。

链接: https://arxiv.org/abs/2507.11513
作者: Serge Gratton,Alena Kopaničáková,Philippe Toint
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: 33 pages

点击查看摘要

Abstract:Two OFFO (Objective-Function Free Optimization) noise tolerant algorithms are presented that handle bound constraints, inexact gradients, and use second-order information when available. The first is a multi-level method exploiting a hierarchical description of the problem and the second is a domain-decomposition method covering the standard additive Schwarz decompositions. Both are generalizations of the first-order AdaGrad algorithm for unconstrained optimization. Because these algorithms share a common theoretical framework, a single convergence/complexity theory is provided which covers them both. Its main result is that, with high probability, both methods need at most $O(\epsilon^{-2})$ iterations and noisy gradient evaluations to compute an $\epsilon$-approximate first-order critical point of the bound-constrained problem. Extensive numerical experiments are discussed on applications ranging from PDE-based problems to deep neural network training, illustrating their remarkable computational efficiency.
zh

[AI-98] From Kinetic Theory to AI: a Rediscovery of High-Dimensional Divergences and Their Properties

【速读】:该论文试图解决机器学习中选择合适的散度度量问题,其关键在于通过回顾源自动能理论的散度度量,揭示它们的理论基础并探索其在机器学习和人工智能中的潜在应用。

链接: https://arxiv.org/abs/2507.11387
作者: Gennaro Auricchio,Giovanni Brigati,Paolo Giudici,Giuseppe Toscani
机构: 未知
类目: Mathematical Physics (math-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Selecting an appropriate divergence measure is a critical aspect of machine learning, as it directly impacts model performance. Among the most widely used, we find the Kullback-Leibler (KL) divergence, originally introduced in kinetic theory as a measure of relative entropy between probability distributions. Just as in machine learning, the ability to quantify the proximity of probability distributions plays a central role in kinetic theory. In this paper, we present a comparative review of divergence measures rooted in kinetic theory, highlighting their theoretical foundations and exploring their potential applications in machine learning and artificial intelligence.
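作为背景,下面给出用 SciPy 计算两个离散分布之间 KL 散度(即文中讨论的相对熵)的最小示例;分布取值仅为示意。

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])  # 两个示意性的离散分布
q = np.array([0.4, 0.4, 0.2])
kl_pq = entropy(p, q)          # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
kl_qp = entropy(q, p)          # KL 散度不对称:一般 D_KL(p||q) != D_KL(q||p)
print(kl_pq, kl_qp)
```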
zh

[AI-99] Quantitative multi-metabolite imaging of Parkinson's disease using AI boosted molecular MRI

【速读】:该论文试图解决帕金森病(Parkinson’s disease, PD)体内分子成像中依赖放射性同位素、扫描时间长或空间分辨率低的问题。其解决方案的关键在于将快速分子MRI采集范式与基于深度学习的重建方法相结合,实现了谷氨酸、可移动蛋白质、半固态及可移动大分子的多代谢物定量分析。

链接: https://arxiv.org/abs/2507.11329
作者: Hagar Shmuely(1),Michal Rivlin(1),Or Perlman(1 and 2) ((1) School of Biomedical Engineering, Tel Aviv University, Tel Aviv, Israel, (2) Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel)
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注: This project was funded by the European Union (ERC, BabyMagnet, project no. 101115639). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them

点击查看摘要

Abstract:Traditional approaches for molecular imaging of Parkinson’s disease (PD) in vivo require radioactive isotopes, lengthy scan times, or deliver only low spatial resolution. Recent advances in saturation transfer-based PD magnetic resonance imaging (MRI) have provided biochemical insights, although the image contrast is semi-quantitative and nonspecific. Here, we combined a rapid molecular MRI acquisition paradigm with deep learning based reconstruction for multi-metabolite quantification of glutamate, mobile proteins, semisolid, and mobile macromolecules in an acute MPTP (1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine) mouse model. The quantitative parameter maps are in general agreement with the histology and MR spectroscopy, and demonstrate that semisolid magnetization transfer (MT), amide, and aliphatic relayed nuclear Overhauser effect (rNOE) proton volume fractions may serve as PD biomarkers.
zh

[AI-100] Artificial Finance: How AI Thinks About Money

【速读】:该论文试图解决的问题是理解大型语言模型(Large Language Models, LLMs)在金融决策中的行为模式及其与人类决策的异同。其解决方案的关键在于系统性地比较LLMs对一系列常见金融决策问题的回答与全球不同国家人类参与者的回答,从而揭示LLMs在风险偏好、时间价值权衡以及跨文化一致性方面的表现。

链接: https://arxiv.org/abs/2507.10933
作者: Orhan Erdem,Ragavi Pobbathi Ashok
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we explore how large language models (LLMs) approach financial decision-making by systematically comparing their responses to those of human participants across the globe. We posed a set of commonly used financial decision-making questions to seven leading LLMs, including five models from the GPT series (GPT-4o, GPT-4.5, o1, o3-mini), Gemini 2.0 Flash, and DeepSeek R1. We then compared their outputs to human responses drawn from a dataset covering 53 nations. Our analysis reveals three main results. First, LLMs generally exhibit a risk-neutral decision-making pattern, favoring choices aligned with expected value calculations when faced with lottery-type questions. Second, when evaluating trade-offs between present and future, LLMs occasionally produce responses that appear inconsistent with normative reasoning. Third, when we examine cross-national similarities, we find that the LLMs’ aggregate responses most closely resemble those of participants from Tanzania. These findings contribute to the understanding of how LLMs emulate human-like decision behaviors and highlight potential cultural and training influences embedded within their outputs.
zh

[AI-101] TaylorPODA: A Taylor Expansion-Based Method to Improve Post-Hoc Attributions for Opaque Models NEURIPS2025

【速读】:该论文试图解决现有后验模型无关方法在量化单个特征贡献时缺乏明确和系统框架的问题。其解决方案的关键在于基于Deng等(2024)提出的泰勒展开框架,提出了一组严格的公理——“精度”、“联邦”和“零差异”,用于指导泰勒项特定的归因,并引入了具有“适应性”属性的TaylorPODA方法,该属性使模型能够与任务特定目标对齐,尤其在缺乏真实解释的后验设置中表现突出。

链接: https://arxiv.org/abs/2507.10643
作者: Yuchi Tang,Iñaki Esnaola,Suzanne Mason,George Panoutsos
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 6 figures, Submitted to NeurIPS 2025

点击查看摘要

Abstract:Existing post-hoc model-agnostic methods generate external explanations for opaque models, primarily by locally attributing the model output to its input features. However, they often lack an explicit and systematic framework for quantifying the contribution of individual features. Building on the Taylor expansion framework introduced by Deng et al. (2024) to unify existing local attribution methods, we propose a rigorous set of postulates – “precision”, “federation”, and “zero-discrepancy” – to govern Taylor term-specific attribution. Guided by these postulates, we introduce TaylorPODA (Taylor expansion-derived imPortance-Order aDapted Attribution), which incorporates an additional “adaptation” property. This property enables alignment with task-specific goals, especially in post-hoc settings lacking ground-truth explanations. Empirical evaluations demonstrate that TaylorPODA achieves competitive results against baseline methods, providing principled and visualization-friendly explanations. This work represents a step toward the trustworthy deployment of opaque models by offering explanations with stronger theoretical grounding.
zh

[AI-102] Neural Expectation Operators

【速读】:该论文试图解决在存在不确定性或模糊性的情况下,如何通过非线性期望建模来构建数据驱动的数学框架问题。其解决方案的关键在于引入神经期望算子,这些算子作为倒向随机微分方程(BSDE)的解,其驱动函数由神经网络参数化。论文的主要数学贡献是为满足状态变量$ y 局部Lipschitz条件和鞅部分局部Lipschitz条件和鞅部分 z $二次增长的BSDE建立了严格适定性定理,从而避免了传统全局Lipschitz假设的限制,使得该方法适用于常见的神经网络架构,并且适用于指数可积的终端数据。通过构造性的方法,论文将二次BSDE的抽象且限制性强的假设与机器学习的实际应用相连接,证明了这些条件可以通过具体可验证的神经网络设计实现。

链接: https://arxiv.org/abs/2507.10607
作者: Qian Qi
机构: 未知
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces Measure Learning, a paradigm for modeling ambiguity via non-linear expectations. We define Neural Expectation Operators as solutions to Backward Stochastic Differential Equations (BSDEs) whose drivers are parameterized by neural networks. The main mathematical contribution is a rigorous well-posedness theorem for BSDEs whose drivers satisfy a local Lipschitz condition in the state variable $y$ and quadratic growth in its martingale component $z$. This result circumvents the classical global Lipschitz assumption, is applicable to common neural network architectures (e.g., with ReLU activations), and holds for exponentially integrable terminal data, which is the sharp condition for this setting. Our primary innovation is to build a constructive bridge between the abstract, and often restrictive, assumptions of the deep theory of quadratic BSDEs and the world of machine learning, demonstrating that these conditions can be met by concrete, verifiable neural network designs. We provide constructive methods for enforcing key axiomatic properties, such as convexity, by architectural design. The theory is extended to the analysis of fully coupled Forward-Backward SDE systems and to the asymptotic analysis of large interacting particle systems, for which we establish both a Law of Large Numbers (propagation of chaos) and a Central Limit Theorem. This work provides the foundational mathematical framework for data-driven modeling under ambiguity.
zh

机器学习

[LG-0] Langevin Flows for Modeling Neural Latent Dynamics

链接: https://arxiv.org/abs/2507.11531
作者: Yue Song,T. Anderson Keller,Yisong Yue,Pietro Perona,Max Welling
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Full version of the Cognitive Computational Neuroscience (CCN) 2025 poster

点击查看摘要

Abstract:Neural populations exhibit latent dynamical structures that drive time-evolving spiking activities, motivating the search for models that capture both intrinsic network dynamics and external unobserved influences. In this work, we introduce LangevinFlow, a sequential Variational Auto-Encoder where the time evolution of latent variables is governed by the underdamped Langevin equation. Our approach incorporates physical priors – such as inertia, damping, a learned potential function, and stochastic forces – to represent both autonomous and non-autonomous processes in neural systems. Crucially, the potential function is parameterized as a network of locally coupled oscillators, biasing the model toward oscillatory and flow-like behaviors observed in biological neural populations. Our model features a recurrent encoder, a one-layer Transformer decoder, and Langevin dynamics in the latent space. Empirically, our method outperforms state-of-the-art baselines on synthetic neural populations generated by a Lorenz attractor, closely matching ground-truth firing rates. On the Neural Latents Benchmark (NLB), the model achieves superior held-out neuron likelihoods (bits per spike) and forward prediction accuracy across four challenging datasets. It also matches or surpasses alternative methods in decoding behavioral metrics such as hand velocity. Overall, this work introduces a flexible, physics-inspired, high-performing framework for modeling complex neural population dynamics and their unobserved influences.

[LG-1] Elk: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques MICRO’25

链接: https://arxiv.org/abs/2507.11506
作者: Yiqi Liu,Yuqi Xue,Noelle Crawford,Jilong Xue,Jian Huang
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: This paper is accepted at the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO’25)

点击查看摘要

Abstract:To meet the increasing demand of deep learning (DL) models, AI chips are employing both off-chip memory (e.g., HBM) and high-bandwidth low-latency interconnect for direct inter-core data exchange. However, it is not easy to explore the efficiency of these inter-core connected AI (ICCA) chips, due to a fundamental tussle among compute (per-core execution), communication (inter-core data exchange), and I/O (off-chip data access). In this paper, we develop Elk, a DL compiler framework to maximize the efficiency of ICCA chips by jointly trading off all the three performance factors discussed above. Elk structures these performance factors into configurable parameters and forms a global trade-off space in the DL compiler. To systematically explore this space and maximize overall efficiency, Elk employs a new inductive operator scheduling policy and a cost-aware on-chip memory allocation algorithm. It generates globally optimized execution plans that best overlap off-chip data loading and on-chip execution. To examine the efficiency of Elk, we build a full-fledged emulator based on a real ICCA chip IPU-POD4, and an ICCA chip simulator for sensitivity analysis with different interconnect network topologies. Elk achieves 94% of the ideal roofline performance of ICCA chips on average, showing the benefits of supporting large DL models on ICCA chips. We also show Elk’s capability of enabling architecture design space exploration for new ICCA chip development.

[LG-2] A parametric activation function based on Wendland RBF

链接: https://arxiv.org/abs/2507.11493
作者: Majid Darehmiraki
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:This paper introduces a novel parametric activation function based on Wendland radial basis functions (RBFs) for deep neural networks. Wendland RBFs, known for their compact support, smoothness, and positive definiteness in approximation theory, are adapted to address limitations of traditional activation functions like ReLU, sigmoid, and tanh. The proposed enhanced Wendland activation combines a standard Wendland component with linear and exponential terms, offering tunable locality, improved gradient propagation, and enhanced stability during training. Theoretical analysis highlights its mathematical properties, including smoothness and adaptability, while empirical experiments on synthetic tasks (e.g., sine wave approximation) and benchmark datasets (MNIST, Fashion-MNIST) demonstrate competitive performance. Results show that the Wendland-based activation achieves superior accuracy in certain scenarios, particularly in regression tasks, while maintaining computational efficiency. The study bridges classical RBF theory with modern deep learning, suggesting that Wendland activations can mitigate overfitting and improve generalization through localized, smooth transformations. Future directions include hybrid architectures and domain-specific adaptations.
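摘要描述了将标准 Wendland 分量与线性项和指数项组合的增强激活函数,但未给出确切的组合形式。下面的 PyTorch 示意采用 Wendland C2 函数 $(1-r)^4(4r+1)$,组合权重 a、b 与支撑尺度 c 均为示例性假设。

```python
import torch

def wendland_c2(r):
    """Wendland C2 紧支撑 RBF:在 [0, 1] 上取 (1 - r)^4 (4r + 1),r > 1 处截断。"""
    r = torch.clamp(r.abs(), max=1.0)
    return (1 - r) ** 4 * (4 * r + 1)

def enhanced_wendland(x, a=0.5, b=0.1, c=2.0):
    """增强 Wendland 激活的示意:Wendland 局部分量 + 线性项 + 指数项。
    组合形式与权重 (a, b, c) 为假设,摘要未给出论文的确切定义。"""
    return wendland_c2(x / c) + a * x + b * torch.exp(-x ** 2)

x = torch.linspace(-4, 4, 9)
print(enhanced_wendland(x))  # 平滑、带局部“隆起”且保留线性通路的响应
```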

[LG-3] Exploring the robustness of TractOracle methods in RL-based tractography

链接: https://arxiv.org/abs/2507.11486
作者: Jeremi Levesque,Antoine Théberge,Maxime Descoteaux,Pierre-Marc Jodoin
类目: Machine Learning (cs.LG)
*备注: 38 pages, 8 figures. Submitted to Medical Image Analysis

点击查看摘要

Abstract:Tractography algorithms leverage diffusion MRI to reconstruct the fibrous architecture of the brain’s white matter. Among machine learning approaches, reinforcement learning (RL) has emerged as a promising framework for tractography, outperforming traditional methods in several key aspects. TractOracle-RL, a recent RL-based approach, reduces false positives by incorporating anatomical priors into the training process via a reward-based mechanism. In this paper, we investigate four extensions of the original TractOracle-RL framework by integrating recent advances in RL, and we evaluate their performance across five diverse diffusion MRI datasets. Results demonstrate that combining an oracle with the RL framework consistently leads to robust and reliable tractography, regardless of the specific method or dataset used. We also introduce a novel RL training scheme called Iterative Reward Training (IRT), inspired by the Reinforcement Learning from Human Feedback (RLHF) paradigm. Instead of relying on human input, IRT leverages bundle filtering methods to iteratively refine the oracle’s guidance throughout training. Experimental results show that RL methods trained with oracle feedback significantly outperform widely used tractography techniques in terms of accuracy and anatomical validity.

[LG-4] D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data

链接: https://arxiv.org/abs/2507.11471
作者: Harsha Varun Marisetty,Manik Gupta,Yogesh Simmhan
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Preprint of paper to appear in the proceedings of the IEEE International Conference on Edge Computing & Communications (EDGE 2025)

点击查看摘要

Abstract:With advancements in computing and communication technologies, the Internet of Things (IoT) has seen significant growth. IoT devices typically collect data from various sensors, such as temperature, humidity, and energy meters. Much of this data is temporal in nature. Traditionally, data from IoT devices is centralized for analysis, but this approach introduces delays and increased communication costs. Federated learning (FL) has emerged as an effective alternative, allowing for model training across distributed devices without the need to centralize data. In many applications, such as smart home energy and environmental monitoring, the data collected by IoT devices across different locations can exhibit significant variation in trends and seasonal patterns. Accurately forecasting such non-stationary, non-linear time-series data is crucial for applications like energy consumption estimation and weather forecasting. However, these data variations can severely impact prediction accuracy. The key contributions of this paper are: (1) Investigating how non-linear, non-stationary time-series data distributions, like generalized extreme value (gen-extreme) and log norm distributions, affect FL performance. (2) Analyzing how different detrending techniques for non-linear time-series data influence the forecasting model’s performance in a FL setup. We generated several synthetic time-series datasets using non-linear data distributions and trained an LSTM-based forecasting model using both centralized and FL approaches. Additionally, we evaluated the impact of detrending on real-world datasets with non-linear time-series data distributions. Our experimental results show that: (1) FL performs worse than centralized approaches when dealing with non-linear data distributions. (2) The use of appropriate detrending techniques improves FL performance, reducing loss across different data distributions.

[LG-5] LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer

链接: https://arxiv.org/abs/2507.11457
作者: Yaoxian Dong,Yifan Gao,Haoyue Li,Yanfen Cui,Xin Gao
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Accurate preoperative assessment of lymph node (LN) metastasis in rectal cancer guides treatment decisions, yet conventional MRI evaluation based on morphological criteria shows limited diagnostic performance. While some artificial intelligence models have been developed, they often operate as black boxes, lacking the interpretability needed for clinical trust. Moreover, these models typically evaluate nodes in isolation, overlooking the patient-level context. To address these limitations, we introduce LRMR, an LLM-Driven Relational Multi-node Ranking framework. This approach reframes the diagnostic task from a direct classification problem into a structured reasoning and ranking process. The LRMR framework operates in two stages. First, a multimodal large language model (LLM) analyzes a composite montage image of all LNs from a patient, generating a structured report that details ten distinct radiological features. Second, a text-based LLM performs pairwise comparisons of these reports between different patients, establishing a relative risk ranking based on the severity and number of adverse features. We evaluated our method on a retrospective cohort of 117 rectal cancer patients. LRMR achieved an area under the curve (AUC) of 0.7917 and an F1-score of 0.7200, outperforming a range of deep learning baselines, including ResNet50 (AUC 0.7708). Ablation studies confirmed the value of our two main contributions: removing the relational ranking stage or the structured prompting stage led to a significant performance drop, with AUCs falling to 0.6875 and 0.6458, respectively. Our work demonstrates that decoupling visual perception from cognitive reasoning through a two-stage LLM framework offers a powerful, interpretable, and effective new paradigm for assessing lymph node metastasis in rectal cancer.

[LG-6] Data Augmentation in Time Series Forecasting through Inverted Framework

链接: https://arxiv.org/abs/2507.11439
作者: Hongming Tan,Ting Chen,Ruochong Jin,Wai Kin Chan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Currently, iTransformer is one of the most popular and effective models for multivariate time series (MTS) forecasting. Thanks to its inverted framework, iTransformer effectively captures multivariate correlation. However, the inverted framework still has some limitations. It diminishes temporal interdependency information, and introduces noise in cases of nonsignificant variable correlation. To address these limitations, we introduce a novel data augmentation method on inverted framework, called DAIF. Unlike previous data augmentation methods, DAIF stands out as the first real-time augmentation specifically designed for the inverted framework in MTS forecasting. We first define the structure of the inverted sequence-to-sequence framework, then propose two different DAIF strategies, Frequency Filtering and Cross-variation Patching to address the existing challenges of the inverted framework. Experiments across multiple datasets and inverted models have demonstrated the effectiveness of our DAIF.
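DAIF 的两种策略之一为频率过滤(Frequency Filtering)。下面给出按变量做低通滤波的一个增广示意;将“过滤”具体化为低通以及 keep_ratio 的取值均为假设,并非论文的确切实现。

```python
import numpy as np

def frequency_filter(x, keep_ratio=0.5):
    """对多变量序列 x (time, variates) 的频率过滤增广示意:
    对每个变量做 rFFT,置零最高频的系数后逆变换。
    低通形式与 keep_ratio 均为示例性假设。"""
    spec = np.fft.rfft(x, axis=0)
    cutoff = int(spec.shape[0] * keep_ratio)
    spec[cutoff:] = 0.0
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

x = np.random.randn(96, 7)   # 96 个时间步、7 个变量(示例规模)
x_aug = frequency_filter(x)  # 与 x 同形状的增广样本
```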

[LG-7] FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning

链接: https://arxiv.org/abs/2507.11430
作者: Arnab Mukherjee,Raju Halder,Joydeep Chandra
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has undergone significant development since its inception in 2016, advancing from basic algorithms to complex methodologies tailored to address diverse challenges and use cases. However, research and benchmarking of novel FL techniques against a plethora of established state-of-the-art solutions remain challenging. To streamline this process, we introduce FLsim, a comprehensive FL simulation framework designed to meet the diverse requirements of FL workflows in the literature. FLsim is characterized by its modularity, scalability, resource efficiency, and controlled reproducibility of experimental outcomes. Its easy-to-use interface allows users to specify customized FL requirements through job configuration, which supports: (a) customized data distributions, ranging from non-independent and identically distributed (non-iid) data to independent and identically distributed (iid) data, (b) selection of local learning algorithms according to user preferences, with complete agnosticism to ML libraries, (c) choice of network topology illustrating communication patterns among nodes, (d) definition of model aggregation and consensus algorithms, and (e) pluggable blockchain support for enhanced robustness. Through a series of experimental evaluations, we demonstrate the effectiveness and versatility of FLsim in simulating a diverse range of state-of-the-art FL experiments. We envisage that FLsim would mark a significant advancement in FL simulation frameworks, offering unprecedented flexibility and functionality for researchers and practitioners alike.

[LG-8] Better Regret Rates in Bilateral Trade via Sublinear Budget Violation

链接: https://arxiv.org/abs/2507.11419
作者: Anna Lunghi,Matteo Castiglioni,Alberto Marchesi
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bilateral trade is a central problem in algorithmic economics, and recent work has explored how to design trading mechanisms using no-regret learning algorithms. However, no-regret learning is impossible when budget balance has to be enforced at each time step. Bernasconi et al. [Ber+24] show how this impossibility can be circumvented by relaxing the budget balance constraint to hold only globally over all time steps. In particular, they design an algorithm achieving regret of the order of $\tilde{O}(T^{3/4})$ and provide a lower bound of $\Omega(T^{5/7})$. In this work, we interpolate between these two extremes by studying how the optimal regret rate varies with the allowed violation of the global budget balance constraint. Specifically, we design an algorithm that, by violating the constraint by at most $T^\beta$ for any given $\beta \in [\frac{3}{4}, \frac{6}{7}]$, attains regret $\tilde{O}(T^{1-\beta/3})$. We complement this result with a matching lower bound, thus fully characterizing the trade-off between regret and budget violation. Our results show that both the $\tilde{O}(T^{3/4})$ upper bound in the global budget balance case and the $\Omega(T^{5/7})$ lower bound under unconstrained budget balance violation obtained by Bernasconi et al. [Ber+24] are tight.
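可以快速验证,该插值在允许违约量 $T^\beta$ 的两个端点处恰好衔接上文的两个已知界:

```latex
% 端点验证:regret 率 \tilde{O}(T^{1-\beta/3})
\beta = \tfrac{3}{4}: \quad T^{1-\beta/3} = T^{1-\frac{1}{4}} = T^{3/4}
  \quad \text{(全局预算平衡情形的 } \tilde{O}(T^{3/4}) \text{ 上界)} \\
\beta = \tfrac{6}{7}: \quad T^{1-\beta/3} = T^{1-\frac{2}{7}} = T^{5/7}
  \quad \text{(与 } \Omega(T^{5/7}) \text{ 下界一致)}
```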

[LG-9] Robust-Multi-Task Gradient Boosting

链接: https://arxiv.org/abs/2507.11411
作者: Seyedsaman Emami,Gonzalo Martínez-Muñoz,Daniel Hernández-Lobato
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-task learning (MTL) has shown effectiveness in exploiting shared information across tasks to improve generalization. MTL assumes tasks share similarities that can improve performance. In addition, boosting algorithms have demonstrated exceptional performance across diverse learning problems, primarily due to their ability to focus on hard-to-learn instances and iteratively reduce residual errors. This makes them a promising approach for learning multi-task problems. However, real-world MTL scenarios often involve tasks that are not well-aligned (known as outlier or adversarial tasks), which do not share beneficial similarities with others and can, in fact, deteriorate the performance of the overall model. To overcome this challenge, we propose Robust-Multi-Task Gradient Boosting (R-MTGB), a novel boosting framework that explicitly models and adapts to task heterogeneity during training. R-MTGB structures the learning process into three sequential blocks: (1) learning shared patterns, (2) partitioning tasks into outliers and non-outliers with regularized parameters, and (3) fine-tuning task-specific predictors. This architecture enables R-MTGB to automatically detect and penalize outlier tasks while promoting effective knowledge transfer among related tasks. Our method integrates these mechanisms seamlessly within gradient boosting, allowing robust handling of noisy or adversarial tasks without sacrificing accuracy. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that our approach successfully isolates outliers, transfers knowledge, and consistently reduces prediction errors for each task individually, and achieves overall performance gains across all tasks. These results highlight robustness, adaptability, and reliable convergence of R-MTGB in challenging MTL environments.

[LG-10] A Neural Network Model of Complementary Learning Systems: Pattern Separation and Completion for Continual Learning

链接: https://arxiv.org/abs/2507.11393
作者: James P Jun,Vijay Marupudi,Raj Sanjay Shah,Sashank Varma
类目: Machine Learning (cs.LG)
*备注: Accepted to CogSci 2025. 7 pages, 7 figures

点击查看摘要

Abstract:Learning new information without forgetting prior knowledge is central to human intelligence. In contrast, neural network models suffer from catastrophic forgetting: a significant degradation in performance on previously learned tasks when acquiring new information. The Complementary Learning Systems (CLS) theory offers an explanation for this human ability, proposing that the brain has distinct systems for pattern separation (encoding distinct memories) and pattern completion (retrieving complete memories from partial cues). To capture these complementary functions, we leverage the representational generalization capabilities of variational autoencoders (VAEs) and the robust memory storage properties of Modern Hopfield networks (MHNs), combining them into a neurally plausible continual learning model. We evaluate this model on the Split-MNIST task, a popular continual learning benchmark, and achieve close to state-of-the-art accuracy (~90%), substantially reducing forgetting. Representational analyses empirically confirm the functional dissociation: the VAE underwrites pattern completion, while the MHN drives pattern separation. By capturing pattern separation and completion in scalable architectures, our work provides a functional template for modeling memory consolidation, generalization, and continual learning in both biological and artificial systems.
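模型中 MHN 一侧承担模式补全。下面用 numpy 给出现代 Hopfield 网络标准检索更新 $\xi \leftarrow X^\top \mathrm{softmax}(\beta X \xi)$ 的极简示意:从带噪线索出发迭代收敛到某个存储模式;$\beta$ 与规模均为示例取值。

```python
import numpy as np

def hopfield_retrieve(xi, X, beta=8.0, steps=3):
    """现代 Hopfield 检索更新:xi <- X^T softmax(beta * X xi)。
    X 的每一行是一个存储模式;从带噪/部分线索 xi 完成模式补全。"""
    for _ in range(steps):
        logits = beta * X @ xi
        p = np.exp(logits - logits.max())  # 数值稳定的 softmax
        p /= p.sum()
        xi = X.T @ p
    return xi

X = np.random.randn(10, 32)               # 10 个存储模式,32 维
cue = X[3] + 0.3 * np.random.randn(32)    # 第 3 个模式的带噪线索
out = hopfield_retrieve(cue, X)           # 迭代后逼近存储的 X[3]
```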

[LG-11] Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs

链接: https://arxiv.org/abs/2507.11371
作者: Gabriel Bo,Koa Chang,Justin Gu
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel reinforcement learning framework that teaches large language models to explore diverse tool usage patterns beyond conventional high-temperature sampling. Building on recent advances in step-wise reinforcement learning, we introduce a dual-objective reward system that simultaneously optimizes for answer quality and tool diversity, training a Llama-3.1 8B model through offline PPO on synthetically generated trajectories from the MMLU-Pro dataset. Our approach uniquely employs a rarity-first exploitation strategy where a GPT-4o judge scores candidate actions across eight distinct tools plus chain-of-thought reasoning, with the policy favoring less-frequently used but still viable tools to encourage systematic exploration. Empirical results demonstrate that SPaRK achieves competitive performance across 14 MMLU-Pro categories while exhibiting significantly higher entropy in tool selection compared to both baseline and supervised fine-tuning approaches, suggesting that algorithmic exploration through explicit tool diversity can enhance reasoning capabilities without sacrificing accuracy.

[LG-12] A Parallelizable Approach for Characterizing NE in Zero-Sum Games After a Linear Number of Iterations of Gradient Descent

链接: https://arxiv.org/abs/2507.11366
作者: Taemin Kim,James P. Bailey
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study online optimization methods for zero-sum games, a fundamental problem in adversarial learning in machine learning, economics, and many other domains. Traditional methods approximate Nash equilibria (NE) using either regret-based methods (time-average convergence) or contraction-map-based methods (last-iterate convergence). We propose a new method based on Hamiltonian dynamics in physics and prove that it can characterize the set of NE in a finite (linear) number of iterations of alternating gradient descent in the unbounded setting, modulo degeneracy, a first in online optimization. Unlike standard methods for computing NE, our proposed approach can be parallelized and works with arbitrary learning rates, both firsts in algorithmic game theory. Experimentally, we support our results by showing our approach drastically outperforms standard methods.
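摘要未给出方法的具体形式,但其研究对象是交替梯度下降。下面在最简单的双线性零和博弈 $f(x,y)=xy$ 上示意交替更新:该更新映射保面积(对应哈密顿观点下的能量守恒),轨迹在纳什均衡附近保持有界,而同时更新则会向外螺旋。

```python
# 在双线性零和博弈 f(x, y) = x * y 上的交替梯度下降示意:
# x 为最小化方,y 为最大化方;y 的更新使用已更新的 x(交替而非同时)。
eta = 0.1
x, y = 1.0, 1.0
for t in range(200):
    x = x - eta * y  # 最小化方的下降步
    y = y + eta * x  # 最大化方的上升步(使用新的 x)
print(x, y)  # 轨迹在纳什均衡 (0, 0) 附近保持有界(更新映射保面积)
# 作为对照:若两步同时使用旧的 (x, y),轨迹会向外螺旋发散。
```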

[LG-13] Neurosymbolic Reasoning Shortcuts under the Independence Assumption

链接: https://arxiv.org/abs/2507.11357
作者: Emile van Krieken,Pasquale Minervini,Edoardo Ponti,Antonio Vergari
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at NeSy 2025

点击查看摘要

Abstract:The ubiquitous independence assumption among symbolic concepts in neurosymbolic (NeSy) predictors is a convenient simplification: NeSy predictors use it to speed up probabilistic reasoning. Recent works like van Krieken et al. (2024) and Marconato et al. (2024) argued that the independence assumption can hinder learning of NeSy predictors and, more crucially, prevent them from correctly modelling uncertainty. There is, however, scepticism in the NeSy community around the scenarios in which the independence assumption actually limits NeSy systems (Faronius and Dos Martires, 2025). In this work, we settle this question by formally showing that assuming independence among symbolic concepts entails that a model can never represent uncertainty over certain concept combinations. Thus, the model fails to be aware of reasoning shortcuts, i.e., the pathological behaviour of NeSy predictors that predict correct downstream tasks but for the wrong reasons.

[LG-14] Guiding LLM Decision-Making with Fairness Reward Models

链接: https://arxiv.org/abs/2507.11344
作者: Zara Hall,Melanie Subbiah,Thomas P Zollo,Kathleen McKeown,Richard Zemel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models are increasingly used to support high-stakes decisions, potentially influencing who is granted bail or receives a loan. Naive chain-of-thought sampling can improve average decision accuracy, but has also been shown to amplify unfair bias. To address this challenge and enable the trustworthy use of reasoning models in high-stakes decision-making, we propose a framework for training a generalizable Fairness Reward Model (FRM). Our model assigns a fairness score to LLM reasoning, enabling the system to down-weight biased trajectories and favor equitable ones when aggregating decisions across reasoning chains. We show that a single Fairness Reward Model, trained on weakly supervised, LLM-annotated examples of biased versus unbiased reasoning, transfers across tasks, domains, and model families without additional fine-tuning. Applied to real-world decision-making tasks including recidivism prediction and social media moderation, we show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.

[LG-15] Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime

链接: https://arxiv.org/abs/2507.11274
作者: Amit Attia,Matan Schliserman,Uri Sherman,Tomer Koren
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 27 pages

点击查看摘要

Abstract:We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting – particularly with large (constant) stepsizes – has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $\beta$-smooth convex loss functions with stepsize $\eta \leq 1/\beta$, the last iterate exhibits expected excess risk $\widetilde{O}\big(1/(\eta T^{1-\beta\eta/2}) + \eta T^{\beta\eta/2}\,\sigma_\star^2\big)$, where $\sigma_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $\sigma_\star=0$ we obtain a rate of $O(1/\sqrt{T})$ with $\eta=1/\beta$, improving upon the best-known $O(T^{-1/4})$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.

[LG-16] LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments

链接: https://arxiv.org/abs/2507.11262
作者: Elmira Mirzabeigi,Sepehr Rezaee,Kourosh Parand
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Training deep neural networks, particularly in computer vision tasks, often suffers from noisy gradients and unstable convergence, which hinder performance and generalization. In this paper, we propose LyAm, a novel optimizer that integrates Adam’s adaptive moment estimation with Lyapunov-based stability mechanisms. LyAm dynamically adjusts the learning rate using Lyapunov stability theory to enhance convergence robustness and mitigate training noise. We provide a rigorous theoretical framework proving the convergence guarantees of LyAm in complex, non-convex settings. Extensive experiments on datasets such as CIFAR-10 and CIFAR-100 show that LyAm consistently outperforms state-of-the-art optimizers in terms of accuracy, convergence speed, and stability, establishing it as a strong candidate for robust deep learning optimization.
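摘要未给出 LyAm 的具体更新规则。下面是一个假设性的 PyTorch 示意:以损失作为 Lyapunov 候选函数 $V$,当一步更新后损失上升(即 $V$ 的下降条件被破坏)时收缩学习率,否则谨慎放大;该规则仅用于说明思路,并非论文算法本身。

```python
import torch

def lyapunov_damped_step(optimizer, loss_fn, prev_loss,
                         shrink=0.5, grow=1.05, lr_min=1e-6, lr_max=1e-2):
    """以损失为 Lyapunov 候选函数 V 的一步训练示意:
    若本步损失高于上一步(V 的下降条件被破坏),收缩学习率;
    否则谨慎放大。该具体规则是假设性示例,并非 LyAm 的原始更新。"""
    loss = loss_fn()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for group in optimizer.param_groups:
        if prev_loss is not None and loss.item() > prev_loss:
            group["lr"] = max(group["lr"] * shrink, lr_min)  # 不稳定:抑制步长
        else:
            group["lr"] = min(group["lr"] * grow, lr_max)    # 稳定:缓慢放大
    return loss.item()
```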

[LG-17] Generative Click-through Rate Prediction with Applications to Search Advertising

链接: https://arxiv.org/abs/2507.11246
作者: Lingwei Kong,Lu Wang,Changping Peng,Zhangang Lin,Ching Law,Jingping Shao
类目: Machine Learning (cs.LG)
*备注: This work was first submitted on February 9, 2024

点击查看摘要

Abstract:Click-Through Rate (CTR) prediction models are integral to a myriad of industrial settings, such as personalized search advertising. Current methods typically involve feature extraction from users’ historical behavior sequences combined with product information, feeding into a discriminative model that is trained on user feedback to estimate CTR. With the success of models such as GPT, the potential for generative models to enrich expressive power beyond discriminative models has become apparent. In light of this, we introduce a novel model that leverages generative models to enhance the precision of CTR predictions in discriminative models. To reconcile the disparate data aggregation needs of both model types, we design a two-stage training process: 1) Generative pre-training for next-item prediction with the given item category in user behavior sequences; 2) Fine-tuning the well-trained generative model within a discriminative CTR prediction framework. Our method’s efficacy is substantiated through extensive experiments on a new dataset, and its significant utility is further corroborated by online A/B testing results. Currently, the model is deployed on one of the world’s largest e-commerce platforms, and we intend to release the associated code and dataset in the future.

[LG-18] Improved sampling algorithms and Poincaré inequalities for non-log-concave distributions

链接: https://arxiv.org/abs/2507.11236
作者: Yuchen He,Zhehan Lei,Jianan Shao,Chihao Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of sampling from a distribution \mu with density \propto e^{-V} for some potential function V:\mathbb{R}^d\to \mathbb{R} with query access to V and \nabla V . We start with the following standard assumptions: (1) The potential function V is L-smooth. (2) The second moment \mathbf{E}_{X\sim \mu}[\|X\|^2]\leq M . Recently, He and Zhang (COLT’25) showed that the query complexity of sampling from such distributions is at least \left(\frac{LMd}{\epsilon}\right)^{\Omega(d)} where \epsilon is the desired accuracy in total variation distance, and the Poincaré constant can be arbitrarily large. Meanwhile, another common assumption in the study of diffusion based samplers (see e.g., the work of Chen, Chewi, Li, Li, Salim and Zhang (ICLR’23)) strengthens the smoothness condition (1) to the following: (1*) The potential function of every distribution along the Ornstein-Uhlenbeck process starting from \mu is L-smooth. We show that under the assumptions (1*) and (2), the query complexity of sampling from \mu can be \mathrm{poly}(L,d)\cdot \left(\frac{Ld+M}{\epsilon^2}\right)^{\mathcal{O}(L+1)} , which is polynomial in d and \frac{1}{\epsilon} when L=\mathcal{O}(1) and M=\mathrm{poly}(d) . This improves upon the algorithm with quasi-polynomial query complexity developed by Huang et al. (COLT’24). Our results imply that the seemingly moderate strengthening of the smoothness condition (1) to (1*) can lead to an exponential gap in the query complexity of sampling algorithms. Moreover, we show that together with the assumption (1*) and the stronger moment assumption that \|X\| is \lambda-sub-Gaussian for X\sim\mu , the Poincaré constant of \mu is at most \mathcal{O}(\lambda)^{2(L+1)} . As an application of our technique, we obtain an improved estimate of the Poincaré constant for mixtures of Gaussians with the same covariance.

[LG-19] Gradient Descent on Logistic Regression: Do Large Step-Sizes Work with Data on the Sphere?

链接: https://arxiv.org/abs/2507.11228
作者: Si Yi Meng,Baptiste Goujaud,Antonio Orvieto,Christopher De Sa
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Gradient descent (GD) on logistic regression has many fascinating properties. When the dataset is linearly separable, it is known that the iterates converge in direction to the maximum-margin separator regardless of how large the step size is. In the non-separable case, however, it has been shown that GD can exhibit a cycling behaviour even when the step size is still below the stability threshold 2/\lambda , where \lambda is the largest eigenvalue of the Hessian at the solution. This short paper explores whether restricting the data to have equal magnitude is a sufficient condition for global convergence, under any step size below the stability threshold. We prove that this is true in a one-dimensional space, but in higher dimensions cycling behaviour can still occur. We hope to inspire further studies on quantifying how common these cycles are in realistic datasets, as well as finding sufficient conditions to guarantee global convergence with large step sizes.
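
The setting is easy to probe numerically. A small numpy experiment in the spirit of the question above runs GD on non-separable logistic regression with data normalized to the unit sphere; inspecting the tail of the trajectory distinguishes convergence (vanishing steps) from cycling (a recurring pattern of iterates). The step size below is illustrative and not calibrated against 2/\lambda.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # equal-magnitude data (unit sphere)
y = rng.integers(0, 2, size=50) * 2 - 1         # labels in {-1, +1}, non-separable

def loss_grad(w):
    margins = y * (X @ w)
    p = 1 / (1 + np.exp(margins))               # sigmoid(-margin)
    return -(X * (y * p)[:, None]).mean(axis=0) # gradient of mean logistic loss

w = np.zeros(2)
eta = 10.0                                      # large constant step size (illustrative)
trajectory = [w.copy()]
for _ in range(2000):
    w = w - eta * loss_grad(w)
    trajectory.append(w.copy())

# Convergence shows vanishing steps; a cycle shows a recurring pattern.
print(np.round(np.array(trajectory[-10:]), 4))
```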

[LG-20] Data-Driven Differential Evolution in Tire Industry Extrusion: Leveraging Surrogate Models

链接: https://arxiv.org/abs/2507.11191
作者: Eider Garate-Perez,Kerman López de Calle-Etxabe,Susana Ferreiro
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 22 pages, 15 figures

点击查看摘要

Abstract:The optimization of industrial processes remains a critical challenge, particularly when no mathematical formulation of objective functions or constraints is available. This study addresses this issue by proposing a surrogate-based, data-driven methodology for optimizing complex real-world manufacturing systems using only historical process data. Machine learning models are employed to approximate system behavior and construct surrogate models, which are integrated into a tailored metaheuristic approach: Data-Driven Differential Evolution with Multi-Level Penalty Functions and Surrogate Models, an adapted version of Differential Evolution suited to the characteristics of the studied process. The methodology is applied to an extrusion process in the tire manufacturing industry, with the goal of optimizing initialization parameters to reduce waste and production time. Results show that the surrogate-based optimization approach outperforms historical best configurations, achieving a 65% reduction in initialization and setup time, while also significantly minimizing material waste. These findings highlight the potential of combining data-driven modeling and metaheuristic optimization for industrial processes where explicit formulations are unavailable.

[LG-21] Striking the Perfect Balance: Preserving Privacy While Boosting Utility in Collaborative Medical Prediction Platforms

链接: https://arxiv.org/abs/2507.11187
作者: Shao-Bo Lin,Xiaotong Liu,Yao Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online collaborative medical prediction platforms offer convenience and real-time feedback by leveraging massive electronic health records. However, growing concerns about privacy and low prediction quality can deter patient participation and doctor cooperation. In this paper, we first clarify the privacy attacks, namely attribute attacks targeting patients and model extraction attacks targeting doctors, and specify the corresponding privacy principles. We then propose a privacy-preserving mechanism and integrate it into a novel one-shot distributed learning framework, aiming to simultaneously meet both privacy requirements and prediction performance objectives. Within the framework of statistical learning theory, we theoretically demonstrate that the proposed distributed learning framework can achieve the optimal prediction performance under specific privacy requirements. We further validate the developed privacy-preserving collaborative medical prediction platform through both toy simulations and real-world data experiments.

[LG-22] Quantized Rank Reduction: A Communications-Efficient Federated Learning Scheme for Network-Critical Applications

链接: https://arxiv.org/abs/2507.11183
作者: Dimitrios Kritsiolis,Constantine Kotropoulos
类目: Machine Learning (cs.LG)
*备注: In Proceedings of the 2025 IARIA Annual Congress on Frontiers in Science, Technology, Services, and Applications (IARIA Congress 2025), Venice, Italy, July 6-10, 2025

点击查看摘要

Abstract:Federated learning is a machine learning approach that enables multiple devices (i.e., agents) to train a shared model cooperatively without exchanging raw data. This technique keeps data localized on user devices, ensuring privacy and security, while each agent trains the model on their own data and only shares model updates. The communication overhead is a significant challenge due to the frequent exchange of model updates between the agents and the central server. In this paper, we propose a communication-efficient federated learning scheme that utilizes low-rank approximation of neural network gradients and quantization to significantly reduce the network load of the decentralized learning process with minimal impact on the model’s accuracy.
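
A rough numpy sketch of the two compression steps named in the abstract: truncated SVD of a layer's gradient followed by uniform quantization of the factors. The rank, bit width, and function names are assumptions, not the paper's exact scheme.

```python
import numpy as np

def compress_gradient(grad: np.ndarray, rank: int = 4, bits: int = 8):
    """Low-rank approximation + uniform quantization of a gradient matrix."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]        # keep top-'rank' directions

    def quantize(x):
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1) or 1.0  # avoid zero scale
        return np.round(x / scale).astype(np.int8), scale

    (qU, sU), (qV, sV) = quantize(U * s), quantize(Vt)
    return qU, sU, qV, sV                               # what the agent transmits

def decompress(qU, sU, qV, sV):
    return (qU * sU) @ (qV * sV)                        # server-side reconstruction

grad = np.random.default_rng(0).normal(size=(256, 128))
approx = decompress(*compress_gradient(grad))
print("relative error:", np.linalg.norm(grad - approx) / np.linalg.norm(grad))
```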

[LG-23] Real-Time Bayesian Detection of Drift-Evasive GNSS Spoofing in Reinforcement Learning Based UAV Deconfliction

链接: https://arxiv.org/abs/2507.11173
作者: Deepak Kumar Panda,Weisi Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous unmanned aerial vehicles (UAVs) rely on global navigation satellite system (GNSS) pseudorange measurements for accurate real-time localization and navigation. However, this dependence exposes them to sophisticated spoofing threats, where adversaries manipulate pseudoranges to deceive UAV receivers. Among these, drift-evasive spoofing attacks subtly perturb measurements, gradually diverting the UAV’s trajectory without triggering conventional signal-level anti-spoofing mechanisms. Traditional distributional shift detection techniques often require accumulating a threshold number of samples, causing delays that impede rapid detection and timely response. Consequently, robust temporal-scale detection methods are essential to identify attack onset and enable contingency planning with alternative sensing modalities, improving resilience against stealthy adversarial manipulations. This study explores a Bayesian online change point detection (BOCPD) approach that monitors temporal shifts in value estimates from a reinforcement learning (RL) critic network to detect subtle behavioural deviations in UAV navigation. Experimental results show that this temporal value-based framework outperforms conventional GNSS spoofing detectors, temporal semi-supervised learning frameworks, and the Page-Hinkley test, achieving higher detection accuracy and lower false-positive and false-negative rates for drift-evasive spoofing attacks.
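
Full BOCPD maintains a posterior over run lengths, which is more machinery than fits here; as a lighter illustration, below is the Page-Hinkley test that the abstract cites as a baseline, applied to a stream of critic value estimates. The delta and threshold values are placeholders.

```python
class PageHinkley:
    """Minimal Page-Hinkley drift detector: flags an upward shift in the
    mean of a stream (here, RL critic value estimates)."""
    def __init__(self, delta=0.005, threshold=5.0):
        self.delta, self.threshold = delta, threshold
        self.mean, self.n, self.cum, self.cum_min = 0.0, 0, 0.0, 0.0

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n            # running mean
        self.cum += x - self.mean - self.delta           # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold  # True -> change detected

detector = PageHinkley()
for t, value_estimate in enumerate([0.0] * 100 + [0.4] * 50):
    if detector.update(value_estimate):
        print(f"change flagged at step {t}")
        break
```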

[LG-24] Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking

链接: https://arxiv.org/abs/2507.11137
作者: Yuan Yao,Jin Song,Jian Jin
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As valuable digital assets, deep neural networks necessitate robust ownership protection, positioning neural network watermarking (NNW) as a promising solution. Among various NNW approaches, weight-based methods are favored for their simplicity and practicality; however, they remain vulnerable to forging and overwriting attacks. To address these challenges, we propose NeuralMark, a robust method built around a hashed watermark filter. Specifically, we utilize a hash function to generate an irreversible binary watermark from a secret key, which is then used as a filter to select the model parameters for embedding. This design cleverly intertwines the embedding parameters with the hashed watermark, providing a robust defense against both forging and overwriting attacks. Average pooling is also incorporated to resist fine-tuning and pruning attacks. Furthermore, it can be seamlessly integrated into various neural network architectures, ensuring broad applicability. Theoretically, we analyze its security boundary. Empirically, we verify its effectiveness and robustness across 13 distinct Convolutional and Transformer architectures, covering five image classification tasks and one text generation task. The source codes are available at this https URL.
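
A minimal sketch of the hashed-watermark-as-filter idea, assuming SHA-256 as the hash function: derive an irreversible binary watermark from a secret key, then use its bits to select which parameters participate in embedding. The actual embedding step and the average-pooling defense are omitted.

```python
import hashlib
import numpy as np

def hashed_watermark(secret_key: str, n_bits: int) -> np.ndarray:
    """Derive an irreversible binary watermark from a secret key by
    repeatedly hashing (SHA-256) and unpacking the digests into bits."""
    bits, counter = [], 0
    while len(bits) < n_bits:
        digest = hashlib.sha256(f"{secret_key}:{counter}".encode()).digest()
        bits.extend(np.unpackbits(np.frombuffer(digest, dtype=np.uint8)))
        counter += 1
    return np.array(bits[:n_bits], dtype=np.uint8)

# Use the watermark itself as the filter that selects embedding positions:
# parameters at positions where the bit is 1 are the ones that carry it.
params = np.random.default_rng(0).normal(size=1000)
wm = hashed_watermark("owner-secret", params.size)
selected = params[wm == 1]
print(f"{selected.size} of {params.size} parameters selected for embedding")
```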

[LG-25] A Distance Metric for Mixed Integer Programming Instances ECAI2025

链接: https://arxiv.org/abs/2507.11063
作者: Gwen Maudet,Grégoire Danoy
类目: Machine Learning (cs.LG)
*备注: Accepted to ECAI 2025

点击查看摘要

Abstract:Mixed-integer linear programming (MILP) is a powerful tool for addressing a wide range of real-world problems, but it lacks a clear structure for comparing instances. A reliable similarity metric could establish meaningful relationships between instances, enabling more effective evaluation of instance set heterogeneity and providing better guidance to solvers, particularly when machine learning is involved. Existing similarity metrics often lack precision in identifying instance classes or rely heavily on labeled data, which limits their applicability and generalization. To bridge this gap, this paper introduces the first mathematical distance metric for MILP instances, derived directly from their mathematical formulations. By discretizing right-hand sides, weights, and variables into classes, the proposed metric draws inspiration from the Earth mover’s distance to quantify mismatches in weight-variable distributions for constraint comparisons. This approach naturally extends to enable instance-level comparisons. We evaluate both an exact and a greedy variant of our metric under various parameter settings, using the StrIPLIB dataset. Results show that all components of the metric contribute to class identification, and that the greedy version achieves accuracy nearly identical to the exact formulation while being nearly 200 times faster. Compared to state-of-the-art baselines, including feature-based, image-based, and neural network models, our unsupervised method consistently outperforms all non-learned approaches and rivals the performance of a supervised classifier on class and subclass grouping tasks.
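
A toy sketch of the constraint-comparison step, assuming coefficients are discretized into uniform classes and compared with SciPy's one-dimensional Wasserstein (Earth mover's) distance; the paper's actual discretization of right-hand sides, weights, and variables is richer than this.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def constraint_distance(weights_a, weights_b, n_classes=16):
    """Earth mover's distance between two constraints' coefficient
    distributions after discretizing the weights into classes."""
    lo = min(weights_a.min(), weights_b.min())
    hi = max(weights_a.max(), weights_b.max())
    bins = np.linspace(lo, hi, n_classes + 1)
    ca = np.digitize(weights_a, bins)       # class index of each coefficient
    cb = np.digitize(weights_b, bins)
    return wasserstein_distance(ca, cb)

a = np.array([1.0, 2.0, 2.0, 5.0])          # coefficients of constraint A
b = np.array([1.0, 1.5, 2.5, 4.5])          # coefficients of constraint B
print(constraint_distance(a, b))
```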

[LG-26] Relative Entropy Pathwise Policy Optimization

链接: https://arxiv.org/abs/2507.11019
作者: Claas Voelcker,Axel Brunnbauer,Marcel Hussing,Michal Nauman,Pieter Abbeel,Eric Eaton,Radu Grosu,Amir-massoud Farahmand,Igor Gilitschenski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Score-function policy gradients have delivered strong results in game-playing, robotics and language-model fine-tuning. Yet their high variance often undermines training stability. On the other hand, pathwise policy gradients alleviate the training variance, but are reliable only when driven by an accurate action-conditioned value function, which is notoriously hard to train without relying on past off-policy data. In this paper, we discuss how to construct a value-gradient driven, on-policy algorithm that allows training Q-value models purely from on-policy data, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to balance stochastic policies for exploration with constrained policy updates for stable training, and evaluate important architectural components that facilitate accurate value function learning. Building on these insights, we propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. We demonstrate that REPPO provides strong empirical performance at decreased sample requirements, wall-clock time, memory footprint as well as high hyperparameter robustness in a set of experiments on two standard GPU-parallelized benchmarks.

[LG-27] Leveraging Advanced Machine Learning to Predict Turbulence Dynamics from Temperature Observations at an Experimental Prescribed Fire

链接: https://arxiv.org/abs/2507.11012
作者: Dipak Dulal,Joseph J. Charney,Michael R. Gallagher,Pitambar Acharya,Carmeliza Navasca,Nicholas S. Skowronski
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2311.05128

点击查看摘要

Abstract:This study explores the potential for predicting turbulent kinetic energy (TKE) from more readily acquired temperature data using temperature profiles and turbulence data collected concurrently at 10 Hz during a small experimental prescribed burn in the New Jersey Pine Barrens. Machine learning models, including Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor, were employed to assess the potential to predict TKE from temperature perturbations and explore temporal and spatial dynamics of correlations. Data visualization and correlation analyses revealed patterns and relationships between thermocouple temperatures and TKE, providing insight into the underlying dynamics. More accurate predictions of TKE were achieved by employing various machine learning models despite a weak correlation between the predictors and the target variable. The results demonstrate significant success, particularly from regression models, in accurately predicting the TKE. The findings of this study demonstrate a novel numerical approach to identifying new relationships between temperature and airflow processes in and around the fire environment. These relationships can help refine our understanding of combustion environment processes and the coupling and decoupling of fire environment processes necessary for improving fire operations strategy and fire and smoke model predictions. The findings of this study additionally highlight the valuable role of machine learning techniques in analyzing the complex large datasets of the fire environments, showcasing their potential to advance fire research and management practices.

[LG-28] AdaMuon: Adaptive Muon Optimizer

链接: https://arxiv.org/abs/2507.11005
作者: Chongjie Si,Debing Zhang,Wei Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose AdaMuon, an adaptive learning-rate framework built upon the recently validated Muon optimizer, which has demonstrated substantial efficiency gains over AdamW in large-scale model training. AdaMuon augments Muon with two mutually dependent modules: (1) a per-parameter second-moment modulation that captures orthogonal gradient updates to ensure update-level adaptivity, and (2) an RMS-aligned rescaling that regulates the overall update magnitude by aligning it with the intrinsic structure of the parameter space. Empirical results on multiple model scales and learning-rate regimes confirm that AdaMuon consistently outperforms the original Muon, delivering faster convergence while maintaining training stability. Our method introduces no additional tuning burden and can be seamlessly integrated into existing Muon training pipelines.
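
A hedged PyTorch sketch of the two modules named above, applied to an already-orthogonalized Muon update, which is treated here as given; the exact ordering, constants, and function name are assumptions about details the abstract leaves open.

```python
import torch

def adamuon_modulate(ortho_update: torch.Tensor, state: dict,
                     beta2: float = 0.999, eps: float = 1e-8) -> torch.Tensor:
    """Post-process an orthogonalized Muon update with (1) per-parameter
    second-moment modulation and (2) RMS-aligned rescaling (illustrative)."""
    # (1) second-moment modulation, Adam-style, on the orthogonal update
    state["v"] = beta2 * state["v"] + (1 - beta2) * ortho_update ** 2
    modulated = ortho_update / (state["v"].sqrt() + eps)
    # (2) rescale so the update RMS matches that of the raw Muon step,
    # keeping the overall magnitude aligned with the parameter space
    target_rms = ortho_update.pow(2).mean().sqrt()
    current_rms = modulated.pow(2).mean().sqrt() + eps
    return modulated * (target_rms / current_rms)

update = torch.randn(128, 64)                   # stand-in for a Muon update
state = {"v": torch.zeros_like(update)}
print(adamuon_modulate(update, state).shape)
```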

[LG-29] StellarF: A Lora-Adapter Integrated Large Model Framework for Stellar Flare Forecasting with Historical Statistical Data

链接: https://arxiv.org/abs/2507.10986
作者: Tianyu Su,Zhiqiang Zou,Ali Luo,Xiao Kong,Qingyu Lu,Min Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stellar flare forecasting, a critical research frontier in astronomy, offers profound insights into stellar activity. However, the field is constrained by both the sparsity of recorded flare events and the absence of domain-specific large-scale predictive models. To address these challenges, this study introduces StellarF (Stellar Flare Forecasting), a novel large model that leverages Low-Rank Adaptation (LoRA) and Adapter techniques for parameter-efficient learning in stellar flare forecasting. At its core, StellarF integrates a flare statistical information module with a historical flare record module, enabling multi-scale pattern recognition from observational data. Extensive experiments on our self-constructed datasets (derived from Kepler and TESS light curves) demonstrate that StellarF achieves state-of-the-art performance compared to existing methods. The proposed prediction paradigm establishes a novel methodological framework for advancing astrophysical research and cross-disciplinary applications.

[LG-30] Physics-Informed Neural Networks For Semiconductor Film Deposition: A Review

链接: https://arxiv.org/abs/2507.10983
作者: Tao Han,Zahra Taheri,Hyunwoong Ko
类目: Machine Learning (cs.LG)
*备注: 11 pages, 1 figure, 3 tables, IDETC-CIE 2025

点击查看摘要

Abstract:Semiconductor manufacturing relies heavily on film deposition processes, such as Chemical Vapor Deposition and Physical Vapor Deposition. These complex processes require precise control to achieve film uniformity, proper adhesion, and desired functionality. Recent advancements in Physics-Informed Neural Networks (PINNs), an innovative machine learning (ML) approach, have shown significant promise in addressing challenges related to process control, quality assurance, and predictive modeling within semiconductor film deposition and other manufacturing domains. This paper provides a comprehensive review of ML applications targeted at semiconductor film deposition processes. Through a thematic analysis, we identify key trends, existing limitations, and research gaps, offering insights into both the advantages and constraints of current methodologies. Our structured analysis aims to highlight the potential integration of these ML techniques to enhance interpretability, accuracy, and robustness in film deposition processes. Additionally, we examine state-of-the-art PINN methods, discussing strategies for embedding physical knowledge, governing laws, and partial differential equations into advanced neural network architectures tailored for semiconductor manufacturing. Based on this detailed review, we propose novel research directions that integrate the strengths of PINNs to significantly advance film deposition processes. The contributions of this study include establishing a clear pathway for future research in integrating physics-informed ML frameworks, addressing existing methodological gaps, and ultimately improving precision, scalability, and operational efficiency within semiconductor manufacturing.

[LG-31] Diffusion Decoding for Peptide De Novo Sequencing

链接: https://arxiv.org/abs/2507.10955
作者: Chi-en Amy Tai,Alexander Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Peptide de novo sequencing is a method used to reconstruct amino acid sequences from tandem mass spectrometry data without relying on existing protein sequence databases. Traditional deep learning approaches, such as Casanovo, mainly utilize autoregressive decoders and predict amino acids sequentially. As a result, they suffer from cascading errors and fail to leverage high-confidence regions effectively. To address these issues, this paper investigates using diffusion decoders adapted for the discrete data domain. These decoders provide a different approach, allowing sequence generation to start from any peptide segment, thereby enhancing prediction accuracy. We experiment with three different diffusion decoder designs, knapsack beam search, and various loss functions. We find that knapsack beam search did not improve performance metrics and that simply replacing the transformer decoder with a diffusion decoder lowered performance. Although peptide precision and recall were still 0, the best diffusion decoder design with the DINOISER loss function obtained a statistically significant improvement in amino acid recall of 0.373 compared to the baseline autoregressive decoder-based Casanovo model. These findings highlight the potential of diffusion decoders to not only enhance model sensitivity but also drive significant advancements in peptide de novo sequencing.

[LG-32] Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

链接: https://arxiv.org/abs/2507.10934
作者: Xinyuan Liu,Jiahui Chen,Bocheng Hu,Yu Sun,Xinyang Chen,Shaoxu Song
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation (I, T, O) to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG exhibit superior pattern and distribution similarity compared to both rule-based methods and LLM-generated errors without fine-tuning. Furthermore, performance metrics on TableEG-generated errors closely align with those on real-world errors across nearly all datasets and detection algorithms, particularly for machine learning based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.

[LG-33] A Learning Framework For Cooperative Collision Avoidance of UAV Swarms Leveraging Domain Knowledge AAAI2026

链接: https://arxiv.org/abs/2507.10913
作者: Shuangyao Huang,Haibo Zhang,Zhiyi Huang
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Under review at AAAI 2026

点击查看摘要

Abstract:This paper presents a multi-agent reinforcement learning (MARL) framework for cooperative collision avoidance of UAV swarms leveraging a domain knowledge-driven reward. The reward is derived from knowledge in the domain of image processing, approximating contours on a two-dimensional field. By modeling obstacles as maxima on the field, collisions are inherently avoided, as contours never go through peaks or intersect. Additionally, contours are smooth and energy-efficient. Our framework enables training with large swarm sizes, as agent interaction is minimized and the need for the complex credit assignment schemes or observation sharing mechanisms of state-of-the-art MARL approaches is eliminated. Moreover, UAVs obtain the ability to adapt to complex environments, where contours may be non-viable or non-existent, through intensive training. Extensive experiments are conducted to evaluate the performance of our framework against state-of-the-art MARL algorithms.

[LG-34] Outbound Modeling for Inventory Management KDD

链接: https://arxiv.org/abs/2507.10890
作者: Riccardo Savorgnan,Udaya Ghai,Carson Eisenach,Dean Foster
类目: Machine Learning (cs.LG)
*备注: KDD - AI for Supply Chain Workshop

点击查看摘要

Abstract:We study the problem of forecasting the number of units fulfilled (or "drained") from each inventory warehouse to meet customer demand, along with the associated outbound shipping costs. The actual drain and shipping costs are determined by complex production systems that manage the planning and execution of customer order fulfillment, i.e., from where and how to ship a unit to be delivered to a customer. Accurately modeling these processes is critical for regional inventory planning, especially when using Reinforcement Learning (RL) to develop control policies. For the RL use case, a drain model is incorporated into a simulator to produce long rollouts, which we desire to be differentiable. While simulating the calls to the internal software systems can recover this transition, such simulations are non-differentiable and too slow and costly to run within an RL training environment. Accordingly, we frame this as a probabilistic forecasting problem, modeling the joint distribution of outbound drain and shipping costs across all warehouses at each time period, conditioned on inventory positions and exogenous customer demand. To ensure robustness in an RL environment, the model must handle out-of-distribution scenarios that arise from off-policy trajectories. We propose a validation scheme that leverages production systems to evaluate the drain model on counterfactual inventory states induced by RL policies. Preliminary results demonstrate the model’s accuracy within the in-distribution setting.

[LG-35] Learning from Imperfect Data: Robust Inference of Dynamic Systems using Simulation-based Generative Model

链接: https://arxiv.org/abs/2507.10884
作者: Hyunwoo Cho,Hyeontae Jo,Hyung Ju Hwang
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:System inference for nonlinear dynamic models, represented by ordinary differential equations (ODEs), remains a significant challenge in many fields, particularly when the data are noisy, sparse, or partially observable. In this paper, we propose a Simulation-based Generative Model for Imperfect Data (SiGMoID) that enables precise and robust inference for dynamic systems. The proposed approach integrates two key methods: (1) physics-informed neural networks with hyper-networks that construct an ODE solver, and (2) Wasserstein generative adversarial networks that estimate ODE parameters by effectively capturing noisy data distributions. We demonstrate that SiGMoID quantifies data noise, estimates system parameters, and infers unobserved system components. Its effectiveness is validated through realistic experimental examples, showcasing its broad applicability in various domains, from scientific research to engineered systems, and enabling the discovery of full system dynamics.

[LG-36] GALDS: A Graph-Autoencoder-based Latent Dynamics Surrogate model to predict neurite material transport

链接: https://arxiv.org/abs/2507.10871
作者: Tsung Yeh Hsieh,Yongjie Jessica Zhang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Neurons exhibit intricate geometries within their neurite networks, which play a crucial role in processes such as signaling and nutrient transport. Accurate simulation of material transport in the networks is essential for understanding these biological phenomena but poses significant computational challenges because of the complex tree-like structures involved. Traditional approaches are time-intensive and resource-demanding, yet the inherent properties of neuron trees, which consist primarily of pipes with steady-state parabolic velocity profiles and bifurcations, provide opportunities for computational optimization. To address these challenges, we propose a Graph-Autoencoder-based Latent Dynamics Surrogate (GALDS) model, which is specifically designed to streamline the simulation of material transport in neural trees. GALDS employs a graph autoencoder to encode latent representations of the network’s geometry, velocity fields, and concentration profiles. These latent space representations are then assembled into a global graph, which is subsequently used to predict system dynamics in the latent space via a trained graph latent space system dynamic model, inspired by the Neural Ordinary Differential Equations (Neural ODEs) concept. The integration of an autoencoder allows for the use of smaller graph neural network models with reduced training data requirements. Furthermore, the Neural ODE component effectively mitigates the issue of error accumulation commonly encountered in recurrent neural networks. The effectiveness of the GALDS model is demonstrated through results on eight unseen geometries and four abnormal transport examples, where our approach achieves a mean relative error of 3% with a maximum relative error of 8%, and demonstrates a 10-fold speed improvement compared to previous surrogate model approaches.

[LG-37] Visually grounded emotion regulation via diffusion models and user-driven reappraisal

链接: https://arxiv.org/abs/2507.10861
作者: Edoardo Pinzuti,Oliver Tüscher,André Ferreira Castro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cognitive reappraisal is a key strategy in emotion regulation, involving reinterpretation of emotionally charged stimuli to alter affective responses. Despite its central role in clinical and cognitive science, real-world reappraisal interventions remain cognitively demanding, abstract, and primarily verbal. This reliance on higher-order cognitive and linguistic processes is often impaired in individuals with trauma or depression, limiting the effectiveness of standard approaches. Here, we propose a novel, visually based augmentation of cognitive reappraisal by integrating large-scale text-to-image diffusion models into the emotional regulation process. Specifically, we introduce a system in which users reinterpret emotionally negative images via spoken reappraisals, which are transformed into supportive, emotionally congruent visualizations using stable diffusion models with a fine-tuned IP-adapter. This generative transformation visually instantiates users’ reappraisals while maintaining structural similarity to the original stimuli, externalizing and reinforcing regulatory intent. To test this approach, we conducted a within-subject experiment (N = 20) using a modified cognitive emotion regulation (CER) task. Participants reappraised or described aversive images from the International Affective Picture System (IAPS), with or without AI-generated visual feedback. Results show that AI-assisted reappraisal significantly reduced negative affect compared to both non-AI and control conditions. Further analyses reveal that sentiment alignment between participant reappraisals and generated images correlates with affective relief, suggesting that multimodal coherence enhances regulatory efficacy. These findings demonstrate that generative visual input can support cognitive reappraisal and open new directions at the intersection of generative AI, affective computing, and therapeutic technology.

[LG-38] From Small to Large: A Graph Convolutional Network Approach for Solving Assortment Optimization Problems DATE

链接: https://arxiv.org/abs/2507.10834
作者: Guokai Li,Pin Gao,Stefanus Jasin,Zizhuo Wang
类目: Machine Learning (cs.LG)
*备注: Conference version. The journal version will be updated soon

点击查看摘要

Abstract:Assortment optimization involves selecting a subset of substitutable products (subject to certain constraints) to maximize the expected revenue. It is a classic problem in revenue management and finds applications across various industries. However, the problem is usually NP-hard due to its combinatorial and non-linear nature. In this work, we explore how graph convolutional networks (GCNs) can be leveraged to efficiently solve constrained assortment optimization under the mixed multinomial logit choice model. We first develop a graph representation of the assortment problem, then train a GCN to learn the patterns of optimal assortments, and lastly propose two inference policies based on the GCN’s output. Due to the GCN’s inherent ability to generalize across inputs of varying sizes, we can use a GCN trained on small-scale instances to facilitate large-scale instances. Extensive numerical experiments demonstrate that given a GCN trained on small-scale instances (e.g., with 20 products), the proposed policies can achieve superior performance (90%+ optimality) on large-scale instances (with up to 2,000 products) within seconds, which outperform existing heuristic policies in both performance and efficiency. Furthermore, we extend our framework to a model-free setting where the underlying choice model is unknown but transaction data is available. We also conduct numerical experiments to demonstrate the effectiveness and efficiency of our proposed policies in this setting.

[LG-39] Uncovering Causal Relation Shifts in Event Sequences under Out-of-Domain Interventions ICANN2025

链接: https://arxiv.org/abs/2507.10809
作者: Kazi Tasnim Zinat,Yun Zhou,Xiang Lyu,Yawei Wang,Zhicheng Liu,Panpan Xu
类目: Machine Learning (cs.LG)
*备注: Accepted at ICANN 2025

点击查看摘要

Abstract:Inferring causal relationships between event pairs in a temporal sequence is applicable in many domains such as healthcare, manufacturing, and transportation. Most existing work on causal inference primarily focuses on event types within the designated domain, without considering the impact of exogenous out-of-domain interventions. In real-world settings, these out-of-domain interventions can significantly alter causal dynamics. To address this gap, we propose a new causal framework to define the average treatment effect (ATE), going beyond the independent and identically distributed (i.i.d.) data of the classic Rubin causal framework, to capture the causal relation shift between events of a temporal process under out-of-domain interventions. We design an unbiased ATE estimator, and devise a Transformer-based neural network model to handle both long-range temporal dependencies and local patterns while integrating out-of-domain intervention information into process modeling. Extensive experiments on both simulated and real-world datasets demonstrate that our method outperforms baselines in ATE estimation and goodness-of-fit under out-of-domain-augmented point processes.

[LG-40] Multi-Armed Sampling Problem and the End of Exploration

链接: https://arxiv.org/abs/2507.10797
作者: Mohammad Pedramfar,Siamak Ravanbakhsh
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces the framework of multi-armed sampling, as the sampling counterpart to the optimization problem of multi-arm bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves these optimal regret bounds. Our theoretical results demonstrate that in contrast to optimization, sampling does not require exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolates and unifies multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe the multi-armed sampling framework, and our findings in this setting, can have a foundational role in the study of sampling, including recent neural samplers, akin to the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the need for exploration and the convergence properties of algorithms for entropy-regularized reinforcement learning, fine-tuning of pretrained models, and reinforcement learning with human feedback (RLHF).

[LG-41] Multilayer Artificial Benchmark for Community Detection (mABCD)

链接: https://arxiv.org/abs/2507.10795
作者: Łukasz Kraiński,Michał Czuba,Piotr Bródka,Paweł Prałat,Bogumił Kamiński,François Théberge
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 28 pages, 15 figures, 7 tables

点击查看摘要

Abstract:The Artificial Benchmark for Community Detection (ABCD) model is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs similar to the well-known LFR model but it is faster, more interpretable, and can be investigated analytically. In this paper, we use the underlying ingredients of the ABCD model and introduce its variant for multilayer networks, mABCD.

[LG-42] A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments ICML2025

链接: https://arxiv.org/abs/2507.10792
作者: Yuchen Wang,Hongjue Zhao,Haohong Lin,Enze Xu,Lifang He,Huajie Shao
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, accepted in ICML 2025

点击查看摘要

Abstract:This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a generalizable method that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term to make the estimated latent states align with system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The code is available at this https URL.

[LG-43] A Benchmarking Framework for AI models in Automotive Aerodynamics

链接: https://arxiv.org/abs/2507.10747
作者: Kaustubh Tangsali,Rishikesh Ranade,Mohammad Amin Nabian,Alexey Kamenev,Peter Sharpe,Neil Ashton,Ram Cherukuri,Sanjay Choudhry
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a benchmarking framework within the open-source NVIDIA PhysicsNeMo-CFD framework designed to systematically assess the accuracy, performance, scalability, and generalization capabilities of AI models for automotive aerodynamics predictions. The open, extensible framework enables incorporation of a diverse set of metrics relevant to the Computer-Aided Engineering (CAE) community. By providing a standardized methodology for comparing AI models, the framework enhances transparency and consistency in performance assessment, with the overarching goal of improving the understanding and development of these models to accelerate research and innovation in the field. To demonstrate its utility, the framework includes evaluation of both surface and volumetric flow field predictions on three AI models: DoMINO, X-MeshGraphNet, and FIGConvNet using the DrivAerML dataset. It also includes guidelines for integrating additional models and datasets, making it extensible for physically consistent metrics. This benchmarking study aims to help researchers and industry professionals select, refine, and advance AI-driven aerodynamic modeling approaches, ultimately fostering the development of more efficient, accurate, and interpretable solutions in automotive aerodynamics.

[LG-44] Extracting Document Relations from Search Corpus by Marginalizing over User Queries

链接: https://arxiv.org/abs/2507.10726
作者: Yuki Iwamoto,Kaoru Tsunoda,Ken Kaneiwa
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:Understanding relationships between documents in large-scale corpora is essential for knowledge discovery and information organization. However, existing approaches rely heavily on manual annotation or predefined relationship taxonomies. We propose EDR-MQ (Extracting Document Relations by Marginalizing over User Queries), a novel framework that discovers document relationships through query marginalization. EDR-MQ is based on the insight that strongly related documents often co-occur in results across diverse user queries, enabling us to estimate joint probabilities between document pairs by marginalizing over a collection of queries. To enable this query marginalization approach, we develop Multiply Conditioned Retrieval-Augmented Generation (MC-RAG), which employs conditional retrieval where subsequent document retrievals depend on previously retrieved content. By observing co-occurrence patterns across diverse queries, EDR-MQ estimates joint probabilities between document pairs without requiring labeled training data or predefined taxonomies. Experimental results show that our query marginalization approach successfully identifies meaningful document relationships, revealing topical clusters, evidence chains, and cross-domain connections that are not apparent through traditional similarity-based methods. Our query-driven framework offers a practical approach to document organization that adapts to different user perspectives and information needs.
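
The marginalization step reduces to counting: estimate the joint probability of a document pair as its co-occurrence frequency in the top-k results across a query collection. In this sketch a toy lexical retriever stands in for MC-RAG's conditional retrieval.

```python
from collections import Counter
from itertools import combinations

def joint_probabilities(retrieve, queries, k=2):
    """Estimate P(d_i, d_j) by marginalizing over queries: count how often
    two documents co-occur in the top-k results, normalized by the number
    of queries. 'retrieve' is any function returning a ranked doc list."""
    pair_counts = Counter()
    for q in queries:
        top_k = retrieve(q)[:k]
        for d1, d2 in combinations(sorted(top_k), 2):
            pair_counts[(d1, d2)] += 1
    n = len(queries)
    return {pair: c / n for pair, c in pair_counts.items()}

# Toy lexical retriever over a tiny corpus.
corpus = {"doc_a": "solar power storage", "doc_b": "battery storage grid",
          "doc_c": "medieval poetry"}
def retrieve(query):
    terms = set(query.split())
    return sorted(corpus, key=lambda d: -len(terms & set(corpus[d].split())))

probs = joint_probabilities(retrieve, ["solar storage", "grid battery storage",
                                       "power grid"], k=2)
print(probs)  # strongly related documents co-occur across diverse queries
```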

[LG-45] Distributionally Robust Optimization with Adversarial Data Contamination

链接: https://arxiv.org/abs/2507.10718
作者: Shuyao Li,Ilias Diakonikolas,Jelena Diakonikolas
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Distributionally Robust Optimization (DRO) provides a framework for decision-making under distributional uncertainty, yet its effectiveness can be compromised by outliers in the training data. This paper introduces a principled approach to simultaneously address both challenges. We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions, where an \epsilon-fraction of the training data is adversarially corrupted. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts, alongside an efficient algorithm inspired by robust statistics to solve the resulting optimization problem. We prove that our method achieves an estimation error of O(\sqrt{\epsilon}) for the true DRO objective value using only the contaminated data under the bounded covariance assumption. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.

[LG-46] A Simple Approximate Bayesian Inference Neural Surrogate for Stochastic Petri Net Models

链接: https://arxiv.org/abs/2507.10714
作者: Bright Kwaku Manu,Trevor Reckell,Beckett Sterner,Petar Jevtic
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: 12 pages, 10 figures, for all associated codes and files, see this https URL

点击查看摘要

Abstract:Stochastic Petri Nets (SPNs) are an increasingly popular tool of choice for modeling discrete-event dynamics in areas such as epidemiology and systems biology, yet their parameter estimation remains challenging in general, particularly when transition rates depend on external covariates and explicit likelihoods are unavailable. We introduce a neural-surrogate (neural-network-based approximation of the posterior distribution) framework that predicts the coefficients of known covariate-dependent rate functions directly from noisy, partially observed token trajectories. Our model employs a lightweight 1D Convolutional Residual Network trained end-to-end on Gillespie-simulated SPN realizations, learning to invert system dynamics under realistic conditions of event dropout. During inference, Monte Carlo dropout provides calibrated uncertainty bounds together with point estimates. On synthetic SPNs with 20% missing events, our surrogate recovers rate-function coefficients with an RMSE of 0.108 and runs substantially faster than traditional Bayesian approaches. These results demonstrate that data-driven, likelihood-free surrogates can enable accurate, robust, and real-time parameter recovery in complex, partially observed discrete-event systems.

[LG-47] SENSOR: An ML-Enhanced Online Annotation Tool to Uncover Privacy Concerns from User Reviews in Social-Media Applications

链接: https://arxiv.org/abs/2507.10640
作者: Labiba Farah,Mohammad Ridwan Kabir,Shohel Ahmed,MD Mohaymen Ul Anam,Md. Sakibul Islam
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 26 pages, 9 figures, 5 tables

点击查看摘要

Abstract:The widespread use of social media applications has raised significant privacy concerns, often highlighted in user reviews. These reviews also provide developers with valuable insights into improving apps by addressing issues and introducing better features. However, the sheer volume and nuanced nature of reviews make manual identification and prioritization of privacy-related concerns challenging for developers. Previous studies have developed software utilities to automatically classify user reviews as privacy-relevant, privacy-irrelevant, bug reports, feature requests, etc., using machine learning. Notably, there is a lack of focus on classifying reviews specifically as privacy-related feature requests, privacy-related bug reports, or privacy-irrelevant. This paper introduces SENtinel SORt (SENSOR), an automated online annotation tool designed to help developers annotate and classify user reviews into these categories. For automating the annotation of such reviews, this paper introduces the annotation model, GRACE (GRU-based Attention with CBOW Embedding), using Gated Recurrent Units (GRU) with Continuous Bag of Words (CBOW) and Attention mechanism. Approximately 16000 user reviews from seven popular social media apps on Google Play Store, including Instagram, Facebook, WhatsApp, Snapchat, X (formerly Twitter), Facebook Lite, and Line were analyzed. Two annotators manually labelled the reviews, achieving a Cohen’s Kappa value of 0.87, ensuring a labeled dataset with high inter-rater agreement for training machine learning models. Among the models tested, GRACE demonstrated the best performance (macro F1-score: 0.9434, macro ROC-AUC: 0.9934, and accuracy: 95.10%) despite class imbalance. SENSOR demonstrates significant potential to assist developers with extracting and addressing privacy-related feature requests or bug reports from user reviews, enhancing user privacy and trust.

[LG-48] ZClassifier: Temperature Tuning and Manifold Approximation via KL Divergence on Logit Space

链接: https://arxiv.org/abs/2507.10638
作者: Shim Soon Yong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel classification framework, ZClassifier, that replaces conventional deterministic logits with diagonal Gaussian-distributed logits. Our method simultaneously addresses temperature scaling and manifold approximation by minimizing the Kullback-Leibler (KL) divergence between the predicted Gaussian distributions and a unit isotropic Gaussian. This unifies uncertainty calibration and latent control in a principled probabilistic manner, enabling a natural interpretation of class confidence and geometric consistency. Experiments on CIFAR-10 and CIFAR-100 show that ZClassifier improves over softmax classifiers in robustness, calibration, and latent separation. We also demonstrate its effectiveness for classifier-guided generation by interpreting logits as Gaussian semantic potentials.
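
The KL divergence between a diagonal Gaussian and the unit isotropic Gaussian has a closed form, so the objective described above can be sketched directly in PyTorch; the kl_weight and the reparameterized sampling are assumptions about details the abstract leaves open.

```python
import torch
import torch.nn.functional as F

def zclassifier_loss(mu, log_var, labels, kl_weight=0.1):
    """Classification with Gaussian-distributed logits: sample logits via
    the reparameterization trick, then add the closed-form KL divergence
    to the unit isotropic Gaussian N(0, I)."""
    std = torch.exp(0.5 * log_var)
    logits = mu + std * torch.randn_like(std)          # reparameterized sample
    ce = F.cross_entropy(logits, labels)
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over classes
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=1).mean()
    return ce + kl_weight * kl

mu, log_var = torch.randn(8, 10), torch.zeros(8, 10)   # a batch, 10 classes
labels = torch.randint(0, 10, (8,))
print(zclassifier_loss(mu, log_var, labels))
```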

[LG-49] Learning to Quantize and Precode in Massive MIMO Systems for Energy Reduction: a Graph Neural Network Approach

链接: https://arxiv.org/abs/2507.10634
作者: Thomas Feys,Liesbet Van der Perre,François Rottenberg
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Massive MIMO systems are moving toward increased numbers of radio frequency chains, higher carrier frequencies and larger bandwidths. As such, digital-to-analog converters (DACs) are becoming a bottleneck in terms of hardware complexity and power consumption. In this work, non-linear precoding for coarsely quantized downlink massive MIMO is studied. Given the NP-hard nature of this problem, a graph neural network (GNN) is proposed that directly outputs the precoded quantized vector based on the channel matrix and the intended transmit symbols. The model is trained in a self-supervised manner, by directly maximizing the achievable rate. To overcome the non-differentiability of the objective function, introduced due to the non-differentiable DAC functions, a straight-through Gumbel-softmax estimation of the gradient is proposed. The proposed method achieves a significant increase in achievable sum rate under coarse quantization. For instance, in the single-user case, the proposed method can achieve the same sum rate as maximum ratio transmission (MRT) by using one-bit DACs as compared to 3 bits for MRT. This reduces DAC power consumption by a factor of 4-7 for baseband DACs and a factor of 3 for RF DACs. This, however, comes at the cost of increased digital signal processing power consumption. When accounting for this, the reduction in overall power consumption holds for a system bandwidth up to 3.5 MHz for baseband DACs, while the RF DACs can maintain a power reduction of 2.9 for higher bandwidths. Notably, indirect effects, which further reduce the power consumption, such as a reduced fronthaul consumption and reduction in other components, are not considered in this analysis.
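
A PyTorch sketch of the straight-through Gumbel-softmax estimator for a one-bit DAC: the forward pass emits hard levels in {-1, +1} while gradients flow through the soft relaxation. This illustrates the gradient estimator only, not the paper's GNN precoder.

```python
import torch
import torch.nn.functional as F

def one_bit_dac_st(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-softmax for a one-bit DAC: forward pass is
    hard (+1/-1), backward pass uses the soft relaxation as surrogate."""
    two_level = torch.stack([logits, -logits], dim=-1)    # scores for +1 / -1
    y_soft = F.gumbel_softmax(two_level, tau=tau, hard=False)
    levels = torch.tensor([1.0, -1.0], device=logits.device)
    soft_out = (y_soft * levels).sum(-1)                  # differentiable surrogate
    hard_out = torch.sign(soft_out.detach() + 1e-9)       # exact DAC output
    return hard_out + soft_out - soft_out.detach()        # straight-through trick

x = torch.randn(4, 8, requires_grad=True)                 # precoder pre-activations
out = one_bit_dac_st(x)
out.sum().backward()                                      # gradients flow via relaxation
print(out.unique(), x.grad.shape)
```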

[LG-50] A Feed-Forward Artificial Intelligence Pipeline for Sustainable Desalination under Climate Uncertainties: UAE Insights

链接: https://arxiv.org/abs/2507.10609
作者: Obumneme Nwafor,Chioma Nwafor,Amro Zakaria,Nkechi Nwankwo
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The United Arab Emirates (UAE) relies heavily on seawater desalination to meet over 90% of its drinking water needs. Desalination processes are highly energy intensive and account for approximately 15% of the UAE’s electricity consumption, contributing to over 22% of the country’s energy-related CO2 emissions. Moreover, these processes face significant sustainability challenges in the face of climate uncertainties such as rising seawater temperatures, salinity, and aerosol optical depth (AOD). AOD greatly affects the operational and economic performance of solar-powered desalination systems through photovoltaic soiling, membrane fouling, and water turbidity cycles. This study proposes a novel pipelined two-stage predictive modelling architecture: the first stage forecasts AOD using satellite-derived time series and meteorological data; the second stage uses the predicted AOD and other meteorological factors to predict desalination performance efficiency losses. The framework achieved 98% accuracy, and SHAP (SHapley Additive exPlanations) was used to reveal key drivers of system degradation. Furthermore, this study proposes a dust-aware rule-based control logic for desalination systems based on predicted values of AOD and solar efficiency. This control logic is used to adjust the desalination plant feed water pressure, adapt maintenance scheduling, and regulate energy source switching. To enhance the practical utility of the research findings, the predictive models and rule-based controls were packaged into an interactive dashboard for scenario and predictive analytics. This provides a management decision-support system for climate-adaptive planning.
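
An illustrative sketch of what such dust-aware rule-based control logic might look like; all thresholds and action names are placeholders rather than the paper's calibrated values.

```python
def dust_aware_control(predicted_aod: float, solar_efficiency: float) -> dict:
    """Illustrative rule-based control for a solar-powered desalination
    plant, keyed on forecast aerosol optical depth (AOD) and predicted
    solar efficiency. Thresholds are placeholders, not the paper's values."""
    actions = {"feed_pressure": "nominal", "maintenance": "as-scheduled",
               "energy_source": "solar"}
    if predicted_aod > 0.5:                       # heavy dust forecast
        actions["maintenance"] = "advance panel/membrane cleaning"
    if solar_efficiency < 0.6:                    # soiling-induced PV losses
        actions["energy_source"] = "grid-assist"
    if predicted_aod > 0.3 and solar_efficiency < 0.75:
        actions["feed_pressure"] = "reduce to protect membranes"
    return actions

print(dust_aware_control(predicted_aod=0.55, solar_efficiency=0.58))
```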

[LG-51] The Shape of Deceit: Behavioral Consistency and Fragility in Money Laundering Patterns

Link: https://arxiv.org/abs/2507.10608
Authors: Danny Butvinik, Ofir Yakobi, Michal Einhorn Cohen, Elina Maliarsky
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Conventional anti-money laundering (AML) systems predominantly focus on identifying anomalous entities or transactions, flagging them for manual investigation based on statistical deviation or suspicious behavior. This paradigm, however, misconstrues the true nature of money laundering, which is rarely anomalous but often deliberate, repeated, and concealed within consistent behavioral routines. In this paper, we challenge the entity-centric approach and propose a network-theoretic perspective that emphasizes detecting predefined laundering patterns across directed transaction networks. We introduce the notion of behavioral consistency as the core trait of laundering activity, and argue that such patterns are better captured through subgraph structures expressing semantic and functional roles - not solely geometry. Crucially, we explore the concept of pattern fragility: the sensitivity of laundering patterns to small attribute changes and, conversely, their semantic robustness even under drastic topological transformations. We claim that laundering detection should not hinge on statistical outliers, but on preservation of behavioral essence, and propose a reconceptualization of pattern similarity grounded in this insight. This philosophical and practical shift has implications for how AML systems model, scan, and interpret networks in the fight against financial crime.

[LG-52] An Adaptive Volatility-based Learning Rate Scheduler

Link: https://arxiv.org/abs/2507.10575
Authors: Kieran Chai Kai Ren
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Effective learning rate (LR) scheduling is crucial for training deep neural networks. However, popular pre-defined and adaptive schedulers can still lead to suboptimal generalization. This paper introduces VolSched, a novel adaptive LR scheduler inspired by the concept of volatility in stochastic processes like Geometric Brownian Motion to dynamically adjust the learning rate. By calculating the ratio between long-term and short-term accuracy volatility, VolSched increases the LR to escape plateaus and decreases it to stabilize training, allowing the model to explore the loss landscape more effectively. We evaluate VolSched on the CIFAR-100 dataset against a strong baseline using a standard augmentation pipeline. When paired with ResNet-18 and ResNet-34, our scheduler delivers consistent performance gains, improving top-1 accuracy by 1.4 and 1.3 percentage points respectively. Analysis of the loss curves reveals that VolSched promotes a longer exploration phase. A quantitative analysis of the Hessian shows that VolSched finds a final solution that is 38% flatter than the next-best baseline, allowing the model to obtain wider minima and hence better generalization performance.
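
A minimal sketch of the volatility-ratio idea is shown below: compare the standard deviation of validation accuracy over a long window with that over a short window, and scale the learning rate by the ratio. The window sizes, log-scaling rule, and clipping bounds are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def volatility_lr(acc_history, lr, short=5, long=20, k=0.1,
                  lr_min=1e-5, lr_max=1e-1):
    """Return an adjusted learning rate based on recent accuracy volatility."""
    if len(acc_history) < long:
        return lr  # not enough history yet
    sigma_long = np.std(acc_history[-long:])
    sigma_short = np.std(acc_history[-short:]) + 1e-12
    ratio = sigma_long / sigma_short
    # ratio > 1: short-term plateau relative to long-term movement -> raise LR
    # ratio < 1: short-term turbulence -> lower LR to stabilize training
    new_lr = lr * (1.0 + k * np.log(ratio))
    return float(np.clip(new_lr, lr_min, lr_max))

accs = [0.50, 0.55, 0.60, 0.64, 0.67, 0.69, 0.700, 0.703, 0.705, 0.706,
        0.707, 0.708, 0.708, 0.709, 0.709, 0.709, 0.710, 0.711, 0.710, 0.710]
print(volatility_lr(accs, lr=1e-3))  # plateauing accuracy -> LR is raised
```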

[LG-53] Protocols for Verifying Smooth Strategies in Bandits and Games

Link: https://arxiv.org/abs/2507.10567
Authors: Miranda Christ, Daniel Reichman, Jonathan Shafer
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:We study protocols for verifying approximate optimality of strategies in multi-armed bandits and normal-form games. As the number of actions available to each player is often large, we seek protocols where the number of queries to the utility oracle is sublinear in the number of actions. We prove that such verification is possible for sufficiently smooth strategies that do not put too much probability mass on any specific action. We provide protocols for verifying that a smooth policy for a multi-armed bandit is \varepsilon -optimal. Our verification protocols require provably fewer arm queries than learning. Furthermore, we establish a nearly-tight lower bound on the query complexity of verification in our settings. As an application, we show how to use verification for bandits to achieve verification in normal-form games. This gives a protocol for verifying whether a given strategy profile is an approximate strong smooth Nash equilibrium, with a query complexity that is sublinear in the number of actions.

[LG-54] Tangma: A Tanh-Guided Activation Function with Learnable Parameters

Link: https://arxiv.org/abs/2507.10560
Authors: Shreel Golwala
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:

Abstract:Activation functions are key to effective backpropagation and expressiveness in deep neural networks. This work introduces Tangma, a new activation function that combines the smooth shape of the hyperbolic tangent with two learnable parameters: \alpha , which shifts the curve’s inflection point to adjust neuron activation, and \gamma , which adds linearity to preserve weak gradients and improve training stability. Tangma was evaluated on MNIST and CIFAR-10 using custom networks composed of convolutional and linear layers, and compared against ReLU, Swish, and GELU. On MNIST, Tangma achieved the highest validation accuracy of 99.09% and the lowest validation loss, demonstrating faster and more stable convergence than the baselines. On CIFAR-10, Tangma reached a top validation accuracy of 78.15%, outperforming all other activation functions while maintaining a competitive training loss. Tangma also showed improved training efficiency, with lower average epoch runtimes compared to Swish and GELU. These results suggest that Tangma performs well on standard vision tasks and enables reliable, efficient training. Its learnable design gives more control over activation behavior, which may benefit larger models in tasks such as image recognition or language modeling.
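
The abstract does not state the exact functional form, so the sketch below is an assumption reconstructed from its description: a tanh whose inflection point is shifted by a learnable alpha, plus a learnable linear term gamma*x that preserves weak gradients. Treat it as a minimal illustration, not the paper's definitive implementation.

```python
import torch
import torch.nn as nn

class Tangma(nn.Module):
    """Assumed form: tanh(x + alpha) + gamma * x, with learnable alpha, gamma."""
    def __init__(self, alpha: float = 0.0, gamma: float = 0.1):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # shifts the inflection point
        self.gamma = nn.Parameter(torch.tensor(gamma))  # adds linearity for stability

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x + self.alpha) + self.gamma * x

act = Tangma()
print(act(torch.linspace(-3, 3, 7)))  # smooth, tanh-like, with a small linear component
```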

[LG-55] Canonical Bayesian Linear System Identification

Link: https://arxiv.org/abs/2507.11535
Authors: Andrey Bryutkin, Matthew E. Levine, Iñigo Urteaga, Youssef Marzouk
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Computation (stat.CO)
Comments: 46 pages, 9 figures

Abstract:Standard Bayesian approaches for linear time-invariant (LTI) system identification are hindered by parameter non-identifiability; the resulting complex, multi-modal posteriors make inference inefficient and impractical. We solve this problem by embedding canonical forms of LTI systems within the Bayesian framework. We rigorously establish that inference in these minimal parameterizations fully captures all invariant system dynamics (e.g., transfer functions, eigenvalues, predictive distributions of system outputs) while resolving identifiability. This approach unlocks the use of meaningful, structure-aware priors (e.g., enforcing stability via eigenvalues) and ensures conditions for a Bernstein–von Mises theorem – a link between Bayesian and frequentist large-sample asymptotics that is broken in standard forms. Extensive simulations with modern MCMC methods highlight advantages over standard parameterizations: canonical forms achieve higher computational efficiency, generate interpretable and well-behaved posteriors, and provide robust uncertainty estimates, particularly from limited data.
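
The sketch below illustrates the non-identifiability that canonical forms remove: any similarity transform changes (A, B, C) but leaves the transfer function intact, so only a canonical parameterization (here SciPy's controller canonical form) pins down a unique realization. This shows the underlying idea only, under these illustrative system coefficients, not the paper's Bayesian machinery.

```python
import numpy as np
from scipy import signal

num, den = [1.0, 0.5], [1.0, 0.9, 0.2]   # an arbitrary stable 2nd-order system
A, B, C, D = signal.tf2ss(num, den)       # controller canonical form

T = np.array([[2.0, 1.0], [0.0, 1.0]])    # any invertible similarity transform
A2 = T @ A @ np.linalg.inv(T)
B2, C2 = T @ B, C @ np.linalg.inv(T)

# Both realizations have identical transfer functions (up to round-off),
# so a posterior over unrestricted (A, B, C, D) is non-identifiable, while
# one over the canonical-form coefficients is.
print(signal.ss2tf(A, B, C, D))
print(signal.ss2tf(A2, B2, C2, D))
```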

[LG-56] Joint space-time wind field data extrapolation and uncertainty quantification using nonparametric Bayesian dictionary learning

Link: https://arxiv.org/abs/2507.11385
Authors: George D. Pasparakis, Ioannis A. Kougioumtzoglou, Michael D. Shields
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:A methodology is developed, based on nonparametric Bayesian dictionary learning, for joint space-time wind field data extrapolation and estimation of related statistics by relying on limited/incomplete measurements. Specifically, utilizing sparse/incomplete measured data, a time-dependent optimization problem is formulated for determining the expansion coefficients of an associated low-dimensional representation of the stochastic wind field. Compared to an alternative, standard compressive sampling (CS) treatment of the problem, the developed methodology exhibits the following advantages. First, the Bayesian formulation also enables quantification of the uncertainty in the estimates. Second, the requirement in standard CS-based applications for an a priori selection of the expansion basis is circumvented. Instead, this is done herein in an adaptive manner based on the acquired data. Overall, the methodology exhibits enhanced extrapolation accuracy, even in cases of high-dimensional data of arbitrary form, and of relatively large extrapolation distances. Thus, it can be used, potentially, in a wide range of wind engineering applications where various constraints dictate the use of a limited number of sensors. The efficacy of the methodology is demonstrated by considering two case studies. The first relates to the extrapolation of simulated wind velocity records consistent with a prescribed joint wavenumber-frequency power spectral density in a three-dimensional domain (2D and time). The second pertains to the extrapolation of four-dimensional (3D and time) boundary layer wind tunnel experimental data that exhibit significant spatial variability and non-Gaussian characteristics.

[LG-57] From Observational Data to Clinical Recommendations: A Causal Framework for Estimating Patient-level Treatment Effects and Learning Policies

Link: https://arxiv.org/abs/2507.11381
Authors: Rom Gutman, Shimon Sheiba, Omer Noy Klien, Naama Dekel Bird, Amit Gruber, Doron Aronson, Oren Caspi, Uri Shalit
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:We propose a framework for building patient-specific treatment recommendation models, building on the large recent literature on learning patient-level causal models and inspired by the target trial paradigm of Hernan and Robins. We focus on safety and validity, including the crucial issue of causal identification when using observational data. We do not provide a specific model, but rather a way to integrate existing methods and know-how into a practical pipeline. We further provide a real world use-case of treatment optimization for patients with heart failure who develop acute kidney injury during hospitalization. The results suggest our pipeline can improve patient outcomes over the current treatment regime.

[LG-58] Recent Advances in Simulation-based Inference for Gravitational Wave Data Analysis

Link: https://arxiv.org/abs/2507.11192
Authors: Bo Liang, He Wang
Subjects: General Relativity and Quantum Cosmology (gr-qc); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages, 6 figures, 1 table. Published version accepted by Astronomical Techniques and Instruments (ATI)

Abstract:The detection of gravitational waves by the LIGO-Virgo-KAGRA collaboration has ushered in a new era of observational astronomy, emphasizing the need for rapid and detailed parameter estimation and population-level analyses. Traditional Bayesian inference methods, particularly Markov chain Monte Carlo, face significant computational challenges when dealing with the high-dimensional parameter spaces and complex noise characteristics inherent in gravitational wave data. This review examines the emerging role of simulation-based inference methods in gravitational wave astronomy, with a focus on approaches that leverage machine-learning techniques such as normalizing flows and neural posterior estimation. We provide a comprehensive overview of the theoretical foundations underlying various simulation-based inference methods, including neural posterior estimation, neural ratio estimation, neural likelihood estimation, flow matching, and consistency models. We explore the applications of these methods across diverse gravitational wave data processing scenarios, from single-source parameter estimation and overlapping signal analysis to testing general relativity and conducting population studies. Although these techniques demonstrate speed improvements over traditional methods in controlled studies, their model-dependent nature and sensitivity to prior assumptions are barriers to their widespread adoption. Their accuracy, which is similar to that of conventional methods, requires further validation across broader parameter spaces and noise conditions.

[LG-59] An Interpretable AI framework Quantifying Traditional Chinese Medicine Principles Towards Enhancing and Integrating with Modern Biomedicine

Link: https://arxiv.org/abs/2507.11176
Authors: Haoran Li, Xingye Cheng, Ziyang Huang, Jingyuan Luo, Qianqian Xu, Qiguang Zhao, Tianchen Guo, Yumeng Zhang, Linda Lidan Zhong, Zhaoxiang Bian, Leihan Tang, Aiping Lyu, Liang Tian
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT); Machine Learning (stat.ML)
Comments: 31 pages, 6 figures

Abstract:Traditional Chinese Medicine diagnosis and treatment principles, established through centuries of trial-and-error clinical practice, directly map patient-specific symptom patterns to personalised herbal therapies. These empirical holistic mapping principles offer valuable strategies to address remaining challenges of reductionist methodologies in modern biomedicine. However, the lack of a quantitative framework and molecular-level evidence has limited their interpretability and reliability. Here, we present an AI framework trained on ancient and classical TCM formula records to quantify the symptom pattern-herbal therapy mappings. Interestingly, we find that empirical TCM diagnosis and treatment are consistent with the encoding-decoding processes in the AI model. This enables us to construct an interpretable TCM embedding space (TCM-ES) using the model’s quantitative representation of TCM principles. Validated through broad and extensive TCM patient data, the TCM-ES offers universal quantification of TCM practice and therapeutic efficacy. We further map biomedical entities into the TCM-ES through correspondence alignment. We find that the principal directions of the TCM-ES are significantly associated with key biological functions (such as metabolism, immune, and homeostasis), and that the disease and herb embedding proximity aligns with their genetic relationships in the human protein interactome, demonstrating the biological significance of TCM principles. Moreover, the TCM-ES uncovers latent disease relationships and provides an alternative metric for assessing clinical efficacy for modern disease-drug pairs. Finally, we construct a comprehensive and integrative TCM knowledge graph, which predicts potential associations between diseases and targets, drugs, herbal compounds, and herbal therapies, providing TCM-informed opportunities for disease analysis and drug development.

[LG-60] How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction ICML2025

Link: https://arxiv.org/abs/2507.11161
Authors: Jun Chen, Hong Chen, Yonghua Yu, Yiming Ying
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted by ICML 2025 as a poster

Abstract:In recent years, contrastive learning has achieved state-of-the-art performance in the field of self-supervised representation learning. Many previous works have attempted to provide the theoretical understanding underlying the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, a data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied to the original data to reduce false positive samples, and we establish both theoretical and empirical evaluations. Moreover, it is also found that SVD acts as a double-edged sword, which may lead to the deterioration of downstream classification accuracy due to the reduced connectivity of the augmentation graph. Based on the above observations, we suggest using a moderate embedding dimension (such as 512 or 1024 in our experiments), data inflation, weak augmentation, and SVD to ensure large graph connectivity and a small labeling error, thereby improving model performance.
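
The sketch below shows the SVD preprocessing step in its simplest form: project the raw data onto its top singular directions before augmentation and contrastive training, suppressing the noise directions that create false positives. Data shapes and the rank are illustrative assumptions.

```python
import numpy as np

X = np.random.randn(1000, 3072)                # e.g. flattened 32x32x3 images
U, S, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)

k = 512                                        # a moderate embedding dimension
X_denoised = U[:, :k] * S[:k] @ Vt[:k]         # rank-k reconstruction of the data

print(X_denoised.shape)                        # (1000, 3072), now rank <= 512
```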

[LG-61] Interpretable Bayesian Tensor Network Kernel Machines with Automatic Rank and Feature Selection

Link: https://arxiv.org/abs/2507.11136
Authors: Afra Kilic, Kim Batselier
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 39 pages, 5 figures, 4 tables. Submitted to Journal of Machine Learning Research. The code is available at: this https URL . arXiv admin note: text overlap with arXiv:1401.6497 by other authors

Abstract:Tensor Network (TN) Kernel Machines speed up model learning by representing parameters as low-rank TNs, reducing computation and memory use. However, most TN-based Kernel methods are deterministic and ignore parameter uncertainty. Further, they require manual tuning of model complexity hyperparameters like tensor rank and feature dimensions, often through trial-and-error or computationally costly methods like cross-validation. We propose Bayesian Tensor Network Kernel Machines, a fully probabilistic framework that uses sparsity-inducing hierarchical priors on TN factors to automatically infer model complexity. This enables automatic inference of tensor rank and feature dimensions, while also identifying the most relevant features for prediction, thereby enhancing model interpretability. All the model parameters and hyperparameters are treated as latent variables with corresponding priors. Given the Bayesian approach and latent variable dependencies, we apply a mean-field variational inference to approximate their posteriors. We show that applying a mean-field approximation to TN factors yields a Bayesian ALS algorithm with the same computational complexity as its deterministic counterpart, enabling uncertainty quantification at no extra computational cost. Experiments on synthetic and real-world datasets demonstrate the superior performance of our model in prediction accuracy, uncertainty quantification, interpretability, and scalability.

[LG-62] A Mathematical Optimization Approach to Multisphere Support Vector Data Description

Link: https://arxiv.org/abs/2507.11106
Authors: Víctor Blanco, Inmaculada Espejo, Raúl Páez, Antonio M. Rodríguez-Chía
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 3 tables

Abstract:We present a novel mathematical optimization framework for outlier detection in multimodal datasets, extending Support Vector Data Description approaches. We provide a primal formulation, in the form of a Mixed Integer Second Order Cone model, that constructs Euclidean hyperspheres to identify anomalous observations. Building on this, we develop a dual model that enables the application of the kernel trick, thus allowing for the detection of outliers within complex, non-linear data structures. An extensive computational study demonstrates the effectiveness of our exact method, showing clear advantages over existing heuristic techniques in terms of accuracy and robustness.

[LG-63] GOLFS: Feature Selection via Combining Both Global and Local Information for High Dimensional Clustering

Link: https://arxiv.org/abs/2507.10956
Authors: Zhaoyu Xing, Yang Wan, Juan Wen, Wei Zhong
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:It is important to identify the discriminative features for high dimensional clustering. However, due to the lack of cluster labels, the regularization methods developed for supervised feature selection cannot be directly applied. To learn the pseudo labels and select the discriminative features simultaneously, we propose a new unsupervised feature selection method, named GlObal and Local information combined Feature Selection (GOLFS), for high dimensional clustering problems. The GOLFS algorithm combines both local geometric structure via manifold learning and global correlation structure of samples via regularized self-representation to select the discriminative features. The combination improves the accuracy of both feature selection and clustering by exploiting more comprehensive information. In addition, an iterative algorithm is proposed to solve the optimization problem and its convergence is proven. Simulations and two real data applications demonstrate the excellent finite-sample performance of GOLFS on both feature selection and clustering.

[LG-64] BioScore: A Foundational Scoring Function For Diverse Biomolecular Complexes

Link: https://arxiv.org/abs/2507.10877
Authors: Yuchen Zhu, Jihong Chen, Yitong Li, Xiaomin Fang, Xianbin Ye, Jingzhou He, Xujun Zhang, Jingxuan Ge, Chao Shen, Xiaonan Zhang, Tingjun Hou, Chang-Yu Hsieh
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments:

Abstract:Structural assessment of biomolecular complexes is vital for translating molecular models into functional insights, shaping our understanding of biology and aiding drug discovery. However, current structure-based scoring functions often lack generalizability across diverse biomolecular systems. We present BioScore, a foundational scoring function that addresses key challenges – data sparsity, cross-system representation, and task compatibility – through a dual-scale geometric graph learning framework with tailored modules for structure assessment and affinity prediction. BioScore supports a wide range of tasks, including affinity prediction, conformation ranking, and structure-based virtual screening. Evaluated on 16 benchmarks spanning proteins, nucleic acids, small molecules, and carbohydrates, BioScore consistently outperforms or matches 70 traditional and deep learning methods. Our newly proposed PPI Benchmark further enables comprehensive evaluation of protein-protein complex scoring. BioScore demonstrates broad applicability: (1) pretraining on mixed-structure data boosts protein-protein affinity prediction by up to 40% and antigen-antibody binding correlation by over 90%; (2) cross-system generalizability enables zero- and few-shot prediction with up to 71% correlation gain; and (3) its unified representation captures chemically challenging systems such as cyclic peptides, improving affinity prediction by over 60%. BioScore establishes a robust and generalizable framework for structural assessment across complex biomolecular landscapes.

[LG-65] HEIMDALL: a grapH-based sEIsMic Detector And Locator for microseismicity

Link: https://arxiv.org/abs/2507.10850
Authors: Matteo Bagagli, Francesco Grigoli, Davide Bacciu
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments:

Abstract:In this work, we present a new deep-learning model for microseismicity monitoring that utilizes continuous spatiotemporal relationships between seismic station recordings, forming an end-to-end pipeline for seismic catalog creation. It employs graph theory and state-of-the-art graph neural network architectures to perform phase picking, association, and event location simultaneously over rolling windows, making it suitable for both playback and near-real-time monitoring. As part of the global strategy to reduce carbon emissions within the broader context of a green-energy transition, there has been growing interest in exploiting enhanced geothermal systems. Tested in the complex geothermal area of Iceland’s Hengill region using open-access data from a temporary experiment, our model was trained and validated using both manually revised and automatic seismic catalogs. Results showed a significant increase in event detection compared to previously published automatic systems and reference catalogs, including an M_w 4 seismic sequence in December 2018 and a single-day sequence in February 2019. Our method reduces false events, minimizes manual oversight, and decreases the need for extensive tuning of pipelines or transfer learning of deep-learning models. Overall, it validates a robust monitoring tool for geothermal seismic regions, complementing existing systems and enhancing operational risk mitigation during geothermal energy exploitation.

[LG-66] Functional Neural Wavefunction Optimization

Link: https://arxiv.org/abs/2507.10835
Authors: Victor Armegioiu, Juan Carrasquilla, Siddhartha Mishra, Johannes Müller, Jannes Nys, Marius Zeinhofer, Hang Zhang
Subjects: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
Comments:

Abstract:We propose a framework for the design and analysis of optimization algorithms in variational quantum Monte Carlo, drawing on geometric insights into the corresponding function space. The framework translates infinite-dimensional optimization dynamics into tractable parameter-space algorithms through a Galerkin projection onto the tangent space of the variational ansatz. This perspective unifies existing methods such as stochastic reconfiguration and Rayleigh-Gauss-Newton, provides connections to classic function-space algorithms, and motivates the derivation of novel algorithms with geometrically principled hyperparameter choices. We validate our framework with numerical experiments demonstrating its practical relevance through the accurate estimation of ground-state energies for several prototypical models in condensed matter physics modeled with neural network wavefunctions.

[LG-67] Real-time Adaptive Radiological Anomaly Detection and Isotope Identification Using Non-negative Matrix Factorization

Link: https://arxiv.org/abs/2507.10715
Authors: Chandler Jones, Mark Bandstra, Stefan Faaland, Yue Shi Lai, Nico Abgrall, Scott Suchyta, Reynold Cooper
Subjects: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
Comments: 11 pages, 8 figures

Abstract:Spectroscopic anomaly detection and isotope identification algorithms are integral components in nuclear nonproliferation applications such as search operations. The task is especially challenging in the case of mobile detector systems because the observed gamma-ray background changes more than for a static detector system, and a pretrained background model can easily find itself out of domain. The result is that algorithms may exceed their intended false alarm rate, or sacrifice detection sensitivity in order to maintain the desired false alarm rate. Non-negative matrix factorization (NMF) has been shown to be a powerful tool for spectral anomaly detection and identification, but, like many similar algorithms that rely on data-driven background models, in its conventional implementation it is unable to update in real time to account for environmental changes that affect the background spectroscopic signature. We have developed a novel NMF-based algorithm that periodically updates its background model to accommodate changing environmental conditions. The Adaptive NMF algorithm involves fewer assumptions about its environment, making it more generalizable than existing NMF-based methods while maintaining or exceeding detection performance on simulated and real-world datasets.
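
The sketch below illustrates the general pattern of NMF background modeling with periodic refits, in the spirit of the adaptive approach described here. The spectrum shapes, component count, refit interval, and alarm threshold are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
background = rng.poisson(50, size=(200, 128)).astype(float)  # 200 spectra, 128 bins

model = NMF(n_components=4, init="nndsvda", max_iter=500)
model.fit(background)

def anomaly_score(spectrum):
    """Residual between a spectrum and its NMF background reconstruction."""
    w = model.transform(spectrum[None, :])
    recon = w @ model.components_
    return float(np.linalg.norm(spectrum - recon))

stream = rng.poisson(50, size=(1000, 128)).astype(float)
for t, spec in enumerate(stream):
    if anomaly_score(spec) > 100.0:            # alarm threshold (assumed)
        print(f"possible source at t={t}")
    if t % 250 == 249:                         # periodic refit on recent background
        model.fit(stream[max(0, t - 199):t + 1])
```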

[LG-68] Robust Multi-Manifold Clustering via Simplex Paths

Link: https://arxiv.org/abs/2507.10710
Authors: Haoyu Chen, Anna Little, Akin Narayan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:This article introduces a novel, geometric approach for multi-manifold clustering (MMC), i.e. for clustering a collection of potentially intersecting, d-dimensional manifolds into the individual manifold components. We first compute a locality graph on d-simplices, using the dihedral angle between adjacent simplices as the graph weights, and then compute infinity path distances in this simplex graph. This procedure gives a metric on simplices which we refer to as the largest angle path distance (LAPD). We analyze the properties of LAPD under random sampling, and prove that with an appropriate denoising procedure, this metric separates the manifold components with high probability. We validate the proposed methodology with extensive numerical experiments on both synthetic and real-world data sets. These experiments demonstrate that the method is robust to noise, curvature, and small intersection angle, and generally outperforms other MMC algorithms. In addition, we provide a highly scalable implementation of the proposed algorithm, which leverages approximation schemes for infinity path distance to achieve quasi-linear computational complexity.

[LG-69] Kernel Learning for Mean-Variance Trading Strategies

Link: https://arxiv.org/abs/2507.10701
Authors: Owen Futter, Nicola Muca Cirone, Blanka Horvath
Subjects: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Portfolio Management (q-fin.PM)
Comments: 49 pages

Abstract:In this article, we develop a kernel-based framework for constructing dynamic, path-dependent trading strategies under a mean-variance optimisation criterion. Building on the theoretical results of (Muca Cirone and Salvi, 2025), we parameterise trading strategies as functions in a reproducing kernel Hilbert space (RKHS), enabling a flexible and non-Markovian approach to optimal portfolio problems. We compare this with the signature-based framework of (Futter, Horvath, Wiese, 2023) and demonstrate that both significantly outperform classical Markovian methods when the asset dynamics or predictive signals exhibit temporal dependencies for both synthetic and market-data examples. Using kernels in this context provides significant modelling flexibility, as the choice of feature embedding can range from randomised signatures to the final layers of neural network architectures. Crucially, our framework retains closed-form solutions and provides an alternative to gradient-based optimisation.

[LG-70] Formal Verification of Variational Quantum Circuits

Link: https://arxiv.org/abs/2507.10635
Authors: Nicola Assolini, Luca Marzari, Isabella Mastroeni, Alessandra di Pierro
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Assolini and Marzari contributed equally to the paper

Abstract:Variational quantum circuits (VQCs) are a central component of many quantum machine learning algorithms, offering a hybrid quantum-classical framework that, under certain aspects, can be considered similar to classical deep neural networks. A shared aspect is, for instance, their vulnerability to adversarial inputs, small perturbations that can lead to incorrect predictions. While formal verification techniques have been extensively developed for classical models, no comparable framework exists for certifying the robustness of VQCs. Here, we present the first in-depth theoretical and practical study of the formal verification problem for VQCs. Inspired by abstract interpretation methods used in deep learning, we analyze the applicability and limitations of interval-based reachability techniques in the quantum setting. We show that quantum-specific aspects, such as state normalization, introduce inter-variable dependencies that challenge existing approaches. We investigate these issues by introducing a novel semantic framework based on abstract interpretation, where the verification problem for VQCs can be formally defined, and its complexity analyzed. Finally, we demonstrate our approach on standard verification benchmarks.

Information Retrieval

[IR-0] From Chaos to Automation: Enabling the Use of Unstructured Data for Robotic Process Automation

Link: https://arxiv.org/abs/2507.11364
Authors: Kelly Kurowski, Xixi Lu, Hajo A. Reijers
Subjects: Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments: Accepted at AUTOMATE 2025

Abstract:The growing volume of unstructured data within organizations poses significant challenges for data analysis and process automation. Unstructured data, which lacks a predefined format, encompasses various forms such as emails, reports, and scans. It is estimated to constitute approximately 80% of enterprise data. Despite the valuable insights it can offer, extracting meaningful information from unstructured data is more complex compared to structured data. Robotic Process Automation (RPA) has gained popularity for automating repetitive tasks, improving efficiency, and reducing errors. However, RPA is traditionally reliant on structured data, limiting its application to processes involving unstructured documents. This study addresses this limitation by developing the UNstructured Document REtrieval SyStem (UNDRESS), a system that uses fuzzy regular expressions, techniques for natural language processing, and large language models to enable RPA platforms to effectively retrieve information from unstructured documents. The research involved the design and development of a prototype system, and its subsequent evaluation based on text extraction and information retrieval performance. The results demonstrate the effectiveness of UNDRESS in enhancing RPA capabilities for unstructured data, providing a significant advancement in the field. The findings suggest that this system could facilitate broader RPA adoption across processes traditionally hindered by unstructured data, thereby improving overall business process efficiency.
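
Fuzzy regular expressions, one of the techniques UNDRESS combines, can be sketched with the third-party `regex` package (not the stdlib `re`), which allows a pattern to match within a given edit distance. The field name and OCR-garbled text below are invented examples, not the system's actual patterns.

```python
import regex

text = "Invoice Nunber: INV-2024-00173, issued 12 March 2024"

# {e<=2}: match the phrase with at most two character edits (OCR tolerance);
# BESTMATCH asks for the match with the fewest edits rather than the first.
m = regex.search(r"(?:invoice number){e<=2}", text,
                 regex.IGNORECASE | regex.BESTMATCH)
if m:
    print(m.group(0), m.fuzzy_counts)  # matched span and (subs, ins, dels)
```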

[IR-1] Aligned Query Expansion: Efficient Query Expansion for Information Retrieval through LLM Alignment

Link: https://arxiv.org/abs/2507.11042
Authors: Adam Yang, Gustavo Penha, Enrico Palumbo, Hugues Bouchard
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:With the breakthroughs in large language models (LLMs), query generation techniques that expand documents and queries with related terms are becoming increasingly popular in the information retrieval field. Such techniques have been shown to improve the effectiveness of traditional lexical retrieval methods by dealing with the vocabulary mismatch problem. Recent work has found that generating queries with a greedy decoding strategy can produce sub-optimal queries, including hallucinations, and proposed to filter out queries before expansion. This 'generate-then-filter' approach is costly, as it requires generating multiple queries and applying a relevance model to all of them and does not teach the LLM which of the generated queries is more effective for expansion. To overcome such limitations, we propose Aligned Query Expansion (AQE), a novel approach to enhance query expansion for passage retrieval in open-domain question answering. AQE leverages recent techniques in LLM alignment to fine-tune models for generating query expansions that directly optimize the effectiveness of the retrieval task, eliminating the need for additional filtering steps. This alignment ensures that queries are more relevant, reducing computational costs while improving retrieval effectiveness. Empirical evaluations show that AQE outperforms baseline models for query expansion in both in-domain and out-of-domain settings, demonstrating significant improvements in retrieval effectiveness.

[IR-2] Unraveling the Biomarker Prospects of High-Altitude Diseases: Insights from Biomolecular Event Network Constructed using Text Mining

Link: https://arxiv.org/abs/2507.10953
Authors: Balu Bhasuran, Sabenabanu Abdulkadhar, Jeyakumar Natarajan
Subjects: Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
Comments:

Abstract:High-altitude diseases (HAD), encompassing acute mountain sickness (AMS), high-altitude cerebral edema (HACE), and high-altitude pulmonary edema (HAPE), are triggered by hypobaric hypoxia at elevations above 2,500 meters. These conditions pose significant health risks, yet the molecular mechanisms remain insufficiently understood. In this study, we developed a biomolecular event extraction pipeline integrating supervised machine learning with feature-based and multiscale Laplacian graph kernels to analyze 7,847 curated HAD-related abstracts from PubMed. We extracted over 150 unique biomolecular events, including gene expression, regulation, binding, and localization, and constructed a weighted, undirected biomolecular event network comprising 97 nodes and 153 edges. Using the PageRank algorithm, we prioritized key biomolecules based on their centrality within the event network. The top-ranked proteins included Erythropoietin (EPO) (0.0163), Vascular endothelial growth factor (VEGF) (0.0148), Hypoxia-inducible factor 1 (HIF-1) alpha (0.0136), Endothelial PAS Domain Protein 1 (EPAS1) and Angiotensin-Converting Enzyme (ACE) (0.0119), Egl nine homolog 1 (EGLN1), Endothelin 1 (ET-1), and 70 kilodalton heat shock protein (Hsp70) (0.0118), all of which play crucial roles in oxygen sensing, vascular remodeling, erythropoiesis, and blood pressure regulation. Subnetwork analysis revealed three major functional clusters centered on hypoxia response, inflammation, and stress adaptation pathways. Our integrative approach demonstrates the utility of large-scale text mining and graph-based analysis to uncover mechanistic insights and prioritize potential biomarkers for high-altitude disease.
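
The centrality ranking step can be sketched directly with networkx: build a weighted, undirected event network and rank nodes by PageRank. Only the node names come from the abstract; the toy edges and weights below are illustrative assumptions.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("EPO", "HIF-1alpha", 3.0),   # e.g. regulation events co-extracted 3 times
    ("VEGF", "HIF-1alpha", 2.0),
    ("EPAS1", "HIF-1alpha", 1.0),
    ("ACE", "ET-1", 1.0),
    ("EGLN1", "HIF-1alpha", 2.0),
    ("Hsp70", "EPO", 1.0),
])

scores = nx.pagerank(G, weight="weight")  # undirected edges count in both directions
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.4f}")
```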

[IR-3] LLM -Driven Dual-Level Multi-Interest Modeling for Recommendation

Link: https://arxiv.org/abs/2507.10917
Authors: Ziyan Wang, Yingpeng Du, Zhu Sun, Jieyi Bi, Haoyan Chua, Tianjun Wei, Jie Zhang
Subjects: Information Retrieval (cs.IR)
Comments: 10 pages, 5 figures

Abstract:Recently, much effort has been devoted to modeling users’ multi-interests based on their behaviors or auxiliary signals. However, existing methods often rely on heuristic assumptions, e.g., co-occurring items indicate the same interest of users, failing to capture user multi-interests aligning with real-world scenarios. While large language models (LLMs) show significant potential for multi-interest analysis due to their extensive knowledge and powerful reasoning capabilities, two key challenges remain. First, the granularity of LLM-driven multi-interests is agnostic, possibly leading to overly fine or coarse interest grouping. Second, individual user analysis provides limited insights due to the data sparsity issue. In this paper, we propose an LLM-driven dual-level multi-interest modeling framework for more effective recommendation. At the user-individual level, we exploit LLMs to flexibly allocate items engaged by users into different semantic clusters, indicating their diverse and distinct interests. To alleviate the agnostic generation of LLMs, we adaptively assign these semantic clusters to users’ collaborative multi-interests learned from global user-item interactions, allowing the granularity to be automatically adjusted according to the user’s behaviors using an alignment module. To alleviate the limited insights derived from individual users’ behaviors, at the user-crowd level, we propose aggregating user cliques into synthesized users with rich behaviors for more comprehensive LLM-driven multi-interest analysis. We formulate a max covering problem to ensure the compactness and representativeness of synthesized users’ behaviors, and then conduct contrastive learning based on their LLM-driven multi-interests to disentangle item representations among different interests. Experiments on real-world datasets show the superiority of our approach against state-of-the-art methods.

[IR-4] Access Control for Information-Theoretically Secure Key-Document Stores VLDB2025

Link: https://arxiv.org/abs/2507.10730
Authors: Yin Li, Sharad Mehrota, Shantanu Sharma, Komal Kumari
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments: An extended abstract of this version has been accepted in VLDB 2025

Abstract:This paper presents a novel key-based access control technique for securely outsourcing key-value stores, where values correspond to documents that are indexed and accessed using keys. The proposed approach adopts Shamir’s secret-sharing, which offers unconditional or information-theoretic security. It supports keyword-based document retrieval while preventing leakage of the data, access rights of users, or the size (i.e., the volume of the output that satisfies a query). The proposed approach allows servers to detect (and abort) malicious clients from gaining unauthorized access to data, and prevents malicious servers from altering data undetected while ensuring efficient access – it takes 231.5ms over 5,000 keywords across 500,000 files.
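
Shamir's secret-sharing, the primitive underlying the proposed store, can be sketched in a few lines: a value is split into n shares so that any k of them reconstruct it and fewer reveal nothing. This is a minimal sketch of the primitive only, with an assumed field size and parameters; the paper's full protocol adds access control and malicious-party detection on top.

```python
import random

P = 2**61 - 1  # a Mersenne prime, large enough for the secret (assumed)

def split(secret, n=5, k=3):
    """Evaluate a random degree-(k-1) polynomial with f(0)=secret at x=1..n."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x=0 over any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # den^-1 mod P
    return secret

shares = split(123456789)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares recover 123456789
```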

Attachments

Download today's full paper list