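Arxiv Daily Papers | 2025-07-14

This post lists the latest papers retrieved from Arxiv.org on 2025-07-14. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 every day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.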

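[Quick Read]: This paper addresses how to build a neural framework that simulates operating system graphical user interfaces (GUIs) in response to user inputs such as mouse movements, clicks, and keyboard events. The key to the solution is combining a recurrent neural network (RNN) that tracks computer state with a diffusion-based neural renderer that generates screen images, yielding realistic rendering of GUI sequences and accurate prediction of state transitions.

Link: https://arxiv.org/abs/2507.08800
Authors: Luke Rivard,Sun Sun,Hongyu Guo,Wenhu Chen,Yuntian Deng
Affiliations: University of Waterloo; National Research Council Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: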

Abstract:We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.

[NLP-1] KV Cache Steering for Inducing Reasoning in Small Language Models

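[Quick Read]: This paper addresses how to implicitly steer small language models toward chain-of-thought reasoning without fine-tuning or prompt modification. The key to the solution is a lightweight method, cache steering, which applies a one-shot intervention directly to the key-value cache and uses GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning.

Link: https://arxiv.org/abs/2507.08799
Authors: Max Belitsky,Dawid J. Kopiczko,Michael Dorkenwald,M. Jehanzeb Mirza,Cees G. M. Snoek,Yuki M. Asano
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: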

Abstract:We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
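To make the one-shot idea concrete, here is a minimal sketch. It assumes the steering vector is the mean difference between cached vectors collected with and without reasoning traces, and that it is added once to a cached vector with a scaling coefficient. The mean-difference construction, the `alpha` parameter, and every function name are illustrative assumptions; the abstract does not say how the vectors are built or applied.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_steering_vector(with_reasoning: List[Vector],
                          without_reasoning: List[Vector]) -> Vector:
    """Assumed construction: mean(with reasoning) minus mean(without)."""
    pos = mean_vector(with_reasoning)
    neg = mean_vector(without_reasoning)
    return [p - q for p, q in zip(pos, neg)]


def steer_cache_entry(cached: Vector, steering: Vector,
                      alpha: float = 1.0) -> Vector:
    """Apply the intervention once: cached + alpha * steering."""
    return [c + alpha * s for c, s in zip(cached, steering)]


# Toy 2-dimensional "cached" vectors standing in for real key/value tensors.
with_reasoning = [[1.0, 2.0], [3.0, 4.0]]
without_reasoning = [[0.0, 1.0], [1.0, 1.0]]

steering = build_steering_vector(with_reasoning, without_reasoning)
steered = steer_cache_entry([0.5, 0.5], steering, alpha=0.5)
```

Because the cache entry is modified once before generation, no further intervention is needed at later decoding steps, which is the contrast the abstract draws with continuous activation steering.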

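[NLP-2] One Token to Fool LLM-as-a-Judge

[Quick Read]: This paper addresses the vulnerability of generative reward models to superficial manipulations, such as non-word symbols or reasoning openers, which can lead to false positive rewards. The key to the solution is a simple yet effective data augmentation strategy, used to train a generative reward model with substantially improved robustness.

Link: https://arxiv.org/abs/2507.08794
Authors: Yulai Zhao,Haolin Liu,Dian Yu,S.Y. Kung,Haitao Mi,Dong Yu
Affiliations: Tencent AI Lab
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: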

Abstract:Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at this https URL and this https URL.
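The vulnerability described above can be probed with a small harness such as the following sketch. It measures how often a judge returns a positive reward for content-free responses. The probe list reuses the examples quoted in the abstract; the two toy judges and all function names are hypothetical stand-ins, not the paper's models or evaluation code.

```python
from typing import Callable, List, Sequence

# Content-free responses quoted in the abstract as examples.
MASTER_KEYS = [":", ".", "Thought process:",
               "Let's solve this problem step by step."]

Judge = Callable[[str, str, str], bool]  # (question, reference, answer) -> reward


def false_positive_rate(judge: Judge, questions: Sequence[str],
                        references: Sequence[str],
                        probes: Sequence[str] = MASTER_KEYS) -> float:
    """Fraction of (question, probe) pairs the judge wrongly accepts."""
    trials = 0
    false_positives = 0
    for question, reference in zip(questions, references):
        for probe in probes:
            trials += 1
            if judge(question, reference, probe):
                false_positives += 1
    return false_positives / trials if trials else 0.0


def toy_lenient_judge(question: str, reference: str, answer: str) -> bool:
    """Hypothetical judge that is fooled by reasoning openers."""
    lowered = answer.lower()
    return reference in answer or lowered.startswith(
        ("thought process", "let's solve"))


def toy_strict_judge(question: str, reference: str, answer: str) -> bool:
    """Hypothetical judge that accepts only an exact match."""
    return answer.strip() == reference


questions: List[str] = ["What is 2 + 2?"]
references: List[str] = ["4"]
lenient_rate = false_positive_rate(toy_lenient_judge, questions, references)
strict_rate = false_positive_rate(toy_strict_judge, questions, references)
```

In practice the `judge` callable would wrap a call to an actual LLM-based reward model; the harness itself does not depend on how the judge is implemented.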

[NLP-3] BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

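[Quick Read]: This paper aims to address shortcomings of conventional mixture-of-experts (MoE) architectures in computational burden and acceleration-friendliness, in particular their non-differentiable and inflexible routing and their low chunk-level sparsity (CLS), which limit performance in resource-constrained settings and compatibility with mainstream acceleration techniques such as speculative decoding. The key to the solution is a new MoE architecture, BlockFFN, with a differentiable and flexible router that integrates ReLU activation and RMSNorm, together with CLS-aware training objectives that make the model more acceleration-friendly, and efficient acceleration kernels that combine activation sparsity with speculative decoding.

Link: https://arxiv.org/abs/2507.08771
Authors: Chenyang Song,Weilin Zhao,Xu Han,Chaojun Xiao,Yingfa Chen,Yuxuan Li,Zhiyuan Liu,Maosong Sun
Affiliations: Tsinghua University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 21 pages, 7 figures, 15 tables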

Abstract: To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All codes and checkpoints are available publicly (this https URL).
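The difference between token-level and chunk-level sparsity can be illustrated with a short sketch. It assumes sparsity is measured as the fraction of inactive experts, and that chunks are non-overlapping windows of consecutive tokens; the windowing scheme and the function names are assumptions, since the abstract only states that CLS concerns the union of activations over consecutive tokens.

```python
from typing import List

ActivationMask = List[List[bool]]  # mask[token][expert] is True if activated


def token_level_sparsity(mask: ActivationMask) -> float:
    """Average fraction of experts left inactive by each single token."""
    per_token = [1 - sum(row) / len(row) for row in mask]
    return sum(per_token) / len(per_token)


def chunk_level_sparsity(mask: ActivationMask, chunk_size: int) -> float:
    """Average fraction of experts inactive across a whole chunk.

    An expert counts as active for a chunk if any token in it activates it.
    Chunks are non-overlapping here (an assumption for this sketch).
    """
    starts = range(0, len(mask) - chunk_size + 1, chunk_size)
    per_chunk = []
    for start in starts:
        chunk = mask[start:start + chunk_size]
        union = [any(column) for column in zip(*chunk)]
        per_chunk.append(1 - sum(union) / len(union))
    return sum(per_chunk) / len(per_chunk)


# Four tokens, four experts; each token activates exactly one expert.
mask = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
    [False, False, True, False],
]

tls = token_level_sparsity(mask)
cls_2 = chunk_level_sparsity(mask, chunk_size=2)
```

In the toy mask, tokens 0 and 1 pick different experts, so their chunk is less sparse than either token alone, while tokens 2 and 3 agree and keep the chunk as sparse as a single token. This is why CLS can be much lower than TLS when consecutive tokens route differently.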

[NLP-4] On Barriers to Archival Audio Processing DATE

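[Quick Read]: This paper examines the robustness of modern language identification (LID) and speaker recognition (SR) methods on multilingual speakers and cross-age recordings. The study uses a UNESCO collection of mid-20th-century radio recordings to assess the performance limits of existing LID and SR techniques. Its key contribution is showing that speaker embeddings are fragile under channel-, age-, and language-related biases, and stressing that archives must overcome these challenges before applying SR methods to speaker indexing.

Link: https://arxiv.org/abs/2507.08768
Authors: Peter Sullivan,Muhammad Abdul-Mageed
Affiliations: University of British Columbia
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Update with Acknowledgements of ICNSLP 2025 paper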

Abstract:In this study, we leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods, especially with respect to the impact of multilingual speakers and cross-age recordings. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language. Issues which will need to be overcome should archives aim to employ SR methods for speaker indexing.

[NLP-5] Multilingual Multimodal Software Developer for Code Generation

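[Quick Read]: This paper addresses the fact that large language models (LLMs) for code generation neglect visual aids such as Unified Modeling Language (UML) diagrams and flowcharts, so the generated code can diverge from the architecture used in real-world software development. The key to the solution is MM-Coder, a multilingual multimodal software developer that combines visual design inputs (termed Visual Workflow) with textual instructions to improve code generation accuracy and architectural alignment. The authors also develop the MMc-Instruct dataset and the MMEval benchmark to support multimodal code generation and to evaluate models on visual information capture, instruction following, and advanced programming knowledge.

Link: https://arxiv.org/abs/2507.08719
Authors: Linzheng Chai,Jian Yang,Shukai Liu,Wei Zhang,Liran Wang,Ke Jin,Tao Sun,Congnan Liu,Chenchen Zhang,Hualei Zhu,Jiaheng Liu,Xianjie Wu,Ge Zhang,Tianyu Liu,Zhoujun Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Preprint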

Abstract:The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs-Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow)-with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.

[NLP-6] KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation

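[Quick Read]: This paper addresses two key problems in enhancing large language models (LLMs) with knowledge graphs (KGs). First, existing methods rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Second, static integration frameworks limit adaptability to real-time knowledge updates. The key to the solution is a test-time augmentation framework built on knowledge graph-guided attention (KGA), which uses dynamic knowledge fusion to inject knowledge in real time without modifying model parameters. Its core is the combination of outward and inward aggregation pathways, forming a closed-loop enhancement mechanism that amplifies knowledge-relevant patterns.

Link: https://arxiv.org/abs/2507.08704
Authors: Songlin Zhai,Guilin Qi,Yuan Meng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: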

Abstract:Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model’s generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. This inward aggregation complements the outward pathway by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward path selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test-time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.

[NLP-7] KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment ICML2025

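[Quick Read]: This paper aims to address the bottleneck in formalizing informal mathematics into machine-verifiable theorems, caused mainly by the limited quantity and quality of multilingual parallel corpora. The key to the solution is a novel neuro-symbolic framework, KELPS (Knowledge-Equation based Logical Processing System), which translates natural language into Knowledge Equations (KEs), a new language grounded in assertional logic, and then converts them into target formal languages (such as Lean, Coq, and Isabelle) through rigorously defined rules. This enables iterative translation, synthesis, and filtering of informal data, and yielded a parallel corpus of over 60,000 problems.

Link: https://arxiv.org/abs/2507.08665
Authors: Jiyao Zhang,Chengli Zhong,Hui Xu,Qige Li,Yi Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the ICML 2025 AI4MATH Workshop. 22 pages, 16 figures, 2 tables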

Abstract:Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.

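[NLP-8] The Impact of Automatic Speech Transcription on Speaker Attribution

[Quick Read]: This paper addresses how to attribute speakers effectively when speech transcripts contain errors, in particular when audio is unavailable or unreliable and only errorful transcripts from automatic speech recognition (ASR) systems are at hand. The key is evaluating how ASR transcription errors affect speaker attribution performance. The paper finds that attribution is quite robust to word-level transcription errors and that the objective of recovering the true transcript is only weakly correlated with attribution performance, suggesting that ASR transcription errors may capture features specific to speaker identity.

Link: https://arxiv.org/abs/2507.08660
Authors: Cristina Aggazzotti,Matthew Wiesner,Elizabeth Allyn Smith,Nicholas Andrews
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: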

Abstract:Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.

[NLP-9] Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)

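[Quick Read]: This paper addresses the high computational cost of Transformer models on long sequences, which arises because conventional attention has quadratic time complexity $O(n^2)$. The key to the solution is a new mechanism, Wavelet-Enhanced Random Spectral Attention (WERSA), which has linear time complexity $O(n)$ and enables efficient long-sequence processing without sacrificing performance. WERSA merges content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of the data, improving model quality while preserving linear efficiency.

Link: https://arxiv.org/abs/2507.08637
Authors: Vincenzo Dentamaro
Affiliations: University of Bari Aldo Moro
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 1 figure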

Abstract: Transformer models are computationally costly on long sequences since regular attention has quadratic O(n^2) time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism of linear O(n) time complexity that is pivotal to enable successful long-sequence processing without the performance trade-off. WERSA merges content-adaptive random spectral features together with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of data while preserving linear efficiency. Large-scale comparisons on a single GPU and across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (like Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages of WERSA. It achieves best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2% (86.2% vs 85.0%) while cutting training time by 81% (296s vs 1554s) and FLOPS by 73.4% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla and FlashAttention-2 fail: on ArXiv-128k’s extremely lengthy sequences, it achieves best accuracy (79.1%) and AUC (0.979) among viable methods, operating on data that gives Out-Of-Memory errors to quadratic methods while being twice as fast as Waveformer, its next-best competitor. By significantly reducing computational loads without compromising accuracy, WERSA makes possible more practical, more affordable, long-context models, in particular on low-resource hardware, for more sustainable and more scalable AI development.
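The general reason feature-based attention can run in linear time is a reordering of matrix products, sketched below. This is not WERSA itself: the `elu(x) + 1` feature map is borrowed from generic kernelized linear attention as an assumption, normalization is omitted, and no wavelets or random spectral features are involved. The point is only that computing φ(K)ᵀV first avoids the n × n score matrix.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def feature_map(m: Matrix) -> Matrix:
    """elu(x) + 1, a positive feature map (assumed; not WERSA's features)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def quadratic_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Builds the full n x n score matrix first: O(n^2) in sequence length."""
    scores = matmul(feature_map(q), transpose(feature_map(k)))
    return matmul(scores, v)


def linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Builds a d x d_v summary first: O(n) in sequence length."""
    kv_summary = matmul(transpose(feature_map(k)), v)
    return matmul(feature_map(q), kv_summary)


# Three tokens with 2-dimensional queries, keys, and values.
q = [[1.0, -0.5], [0.2, 0.3], [-1.0, 2.0]]
k = [[0.5, 0.1], [-0.3, 0.8], [1.2, -0.7]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point rounding, because matrix multiplication is associative; only the cost differs as the sequence grows.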

[NLP-10] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

[Quick Read]: This paper examines how large language models (LLMs) perform on publicly available argument classification databases, aiming to assess their effectiveness in argument mining (AM). The key is systematic testing and comparison of several LLMs (GPT, Llama, DeepSeek, and reasoning-enhanced variants that incorporate the Chain-of-Thoughts algorithm) on diverse datasets (the paper names this http URL and UKP), revealing performance differences and typical errors across models in argument classification.

Link: https://arxiv.org/abs/2507.08621
Authors: Marcin Pietroń,Rafał Olszowski,Jakub Gomułka,Filip Gampel,Andrzej Tomski
Affiliations: AGH University of Krakow; Institute of Mathematics; University of Silesia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLM, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLM’s, using diverse datasets such as this http URL and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In case of models incorporated with reasoning capabilities, the Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLM and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

[NLP-11] DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

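[Quick Read]: This paper addresses the limitations of conventional document understanding models that rely on absolute 2D positional embeddings, with the goal of improving layout awareness. The key to the solution is extending self-attention to take text block positions into account in a relative polar coordinate system rather than the conventional Cartesian one.

Link: https://arxiv.org/abs/2507.08606
Authors: Benno Uthayasooriyar,Antoine Ly,Franck Vermet,Caio Corro
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: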

Abstract:We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
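As a small illustration of what a relative polar encoding between two text blocks could look like, the sketch below converts a pair of bounding boxes into a distance and an angle. Using box centers as reference points, the `(x0, y0, x1, y1)` box format, and the function names are assumptions made here; the abstract does not describe how the model derives or discretizes these quantities.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed format


def block_center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2


def relative_polar(source: Box, target: Box) -> Tuple[float, float]:
    """Distance and angle (radians) from the source block to the target."""
    sx, sy = block_center(source)
    tx, ty = block_center(target)
    dx, dy = tx - sx, ty - sy
    return math.hypot(dx, dy), math.atan2(dy, dx)


block_a: Box = (0.0, 0.0, 2.0, 2.0)   # center (1, 1)
block_b: Box = (3.0, 4.0, 5.0, 6.0)   # center (4, 5)

distance, angle = relative_polar(block_a, block_b)
```

A pair of values like this depends only on how two blocks sit relative to each other, so shifting the whole page leaves it unchanged, which is the property an absolute 2D position embedding lacks.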

[NLP-12] Large Multi-modal Model Cartographic Map Comprehension for Textual Locality Georeferencing

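[Quick Read]: This paper addresses the georeferencing of millions of un-georeferenced biological sample records in natural history collections, a task that is highly labour-intensive because of complex locality descriptions. The key to the solution is exploiting the multimodal capabilities of recent Large Multi-Modal Models (LMMs) so that the model can understand spatial relations through visual context. The method uses a grid-based approach to adapt auto-regressive models to the task in a zero-shot setting, and experiments show a lower average distance error than uni-modal large language models and existing georeferencing tools.

Link: https://arxiv.org/abs/2507.08575
Authors: Kalana Wijegunarathna,Kristin Stock,Christopher B. Jones
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: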

Abstract: Millions of biological sample records collected in the last few centuries archived in natural history collections are un-georeferenced. Georeferencing complex locality descriptions associated with these collection samples is a highly labour-intensive task collection agencies struggle with. None of the existing automated methods exploit maps that are an essential tool for georeferencing complex relations. We present preliminary experiments and results of a novel method that exploits multi-modal capabilities of recent Large Multi-Modal Models (LMM). This method enables the model to visually contextualize spatial relations it reads in the locality description. We use a grid-based approach to adapt these auto-regressive models for this task in a zero-shot setting. Our experiments conducted on a small manually annotated dataset show impressive results for our approach (~1 km average distance error) compared to uni-modal georeferencing with Large Language Models and existing georeferencing tools. The paper also discusses the findings of the experiments in light of an LMM’s ability to comprehend fine-grained maps. Motivated by these results, a practical framework is proposed to integrate this method into a georeferencing workflow.
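One plausible reading of a grid-based setup is that the model picks a cell on a gridded map and the cell is converted back to coordinates. The sketch below shows that conversion and the distance-error metric. The bounds layout, the row/column indexing, and the function names are assumptions for illustration, and the example coordinates are made up; the abstract does not specify the grid design.

```python
import math
from typing import Tuple

Bounds = Tuple[float, float, float, float]  # (south, west, north, east)


def grid_cell_center(bounds: Bounds, rows: int, cols: int,
                     row: int, col: int) -> Tuple[float, float]:
    """Latitude/longitude of a cell center; row 0 is the northernmost row."""
    south, west, north, east = bounds
    cell_height = (north - south) / rows
    cell_width = (east - west) / cols
    lat = north - (row + 0.5) * cell_height
    lon = west + (col + 0.5) * cell_width
    return lat, lon


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius_km: float = 6371.0) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# A made-up 1-degree map tile split into a 4 x 4 grid.
bounds: Bounds = (-44.0, 170.0, -43.0, 171.0)
predicted = grid_cell_center(bounds, rows=4, cols=4, row=0, col=0)

# Distance error against a made-up ground-truth point.
error_km = haversine_km(predicted[0], predicted[1], -43.125, 170.125)
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
```

The grid resolution bounds the achievable accuracy: a prediction can be no more precise than the cell it selects, so finer grids or repeated subdivision would be needed for sub-cell error. The sketch ignores maps that cross the antimeridian.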

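[NLP-13] The AI Language Proficiency Monitor – Tracking the Progress of LLMs on Multilingual Benchmarks

[Quick Read]: This paper addresses the uneven evaluation of large language model (LLM) capabilities across languages, with the aim of ensuring equitable access to the benefits of LLMs for the world's languages. The key to the solution is the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. The benchmark aggregates tasks including translation, question answering, math, and reasoning, and provides an open-source, auto-updating leaderboard and dashboard that help researchers, developers, and policymakers identify strengths and gaps in model performance.

Link: https://arxiv.org/abs/2507.08538
Authors: David Pomerenke,Jonas Nothnagel,Simon Ostermann
Affiliations: Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung (BMZ); Gesellschaft für Internationale Zusammenarbeit (GIZ); Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Categories: Computation and Language (cs.CL)
Comments: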

Abstract:To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world’s languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at this https URL.

[NLP-14] A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis

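[Quick Read]: This paper aims to address the challenges in rare-disease diagnosis caused by insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning. The key to the solution is coupling multi-granularity sparse activation of medical concepts with a hierarchical knowledge graph: four complementary matching algorithms, diversity control, and a five-level fallback strategy enable precise concept activation, while a three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context.

Link: https://arxiv.org/abs/2507.08529
Authors: Mingda Zhang,Na Zhao,Jianglong Qin,Guoyu Ye,Ruixiang Tang
Affiliations: Yunnan University; Yunnan Key Laboratory of Software Engineering; Yunnan Provincial Hospital of Traditional Chinese Medicine
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 3 figures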

Abstract:Despite advances from medical large language models in healthcare, rare-disease diagnosis remains hampered by insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning. We propose a framework that couples multi-granularity sparse activation of medical concepts with a hierarchical knowledge graph. Four complementary matching algorithms, diversity control, and a five-level fallback strategy enable precise concept activation, while a three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare-disease QA set show BLEU gains of 0.09, ROUGE gains of 0.05, and accuracy gains of 0.12, with peak accuracy of 0.89 approaching the 0.90 clinical threshold. Expert evaluation confirms improvements in information quality, reasoning, and professional expression, suggesting our approach shortens the “diagnostic odyssey” for rare-disease patients.

[NLP-15] PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts

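[Quick Read]: This paper aims to address multi-label emotion detection in short texts, particularly under linguistic diversity and resource constraints. The key to the solution is a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. The framework evaluates three core components (document representation, dimensionality reduction, and model training) across 28 languages, providing a scalable approach to multilingual emotion detection.

Link: https://arxiv.org/abs/2507.08499
Authors: Ziyi Huang,Xia Cui
Affiliations: Hubei University; Manchester Metropolitan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: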

Abstract:This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.
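For readers unfamiliar with the TF-IDF representation the abstract reports as effective for low-resource languages, here is a minimal sketch. It uses whitespace tokenization, term frequency normalized by document length, and `log(N / df)` as the inverse document frequency. These are one common variant chosen for illustration, not necessarily the weighting used in the paper, and the example sentences are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse {term: weight} vector per (non-empty) document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "i am so happy today",
    "i am so sad today",
    "what a happy surprise",
]
vectors = tf_idf(docs)
```

In the second document, the distinctive word "sad" receives a higher weight than the shared word "i", which is the behaviour that makes this representation useful as input to a downstream emotion classifier.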

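[NLP-16] Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop

[Quick Read]: This paper explores augmenting conventional topic models such as Latent Dirichlet Allocation (LDA), which uncover abstract topics in document collections, with Large Language Models (LLMs). The core of the approach is integrating LLMs into two key phases of LDA: initialization and post-correction. In initialization, the LLM guides the starting state of the Gibbs sampling algorithm; in post-correction, it refines the final topics. According to the abstract, only post-correction helped, improving coherence by 5.86%, while LLM-guided initialization had no effect on convergence and performed worst among the baselines.

Link: https://arxiv.org/abs/2507.08498
Authors: Mengze Hong,Chen Jason Zhang,Di Jiang
Affiliations: Hong Kong Polytechnic University; AI Group, WeBank Co., Ltd
Categories: Computation and Language (cs.CL)
Comments: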

Abstract:Latent Dirichlet Allocation (LDA) is a prominent generative probabilistic model used for uncovering abstract topics within document collections. In this paper, we explore the effectiveness of augmenting topic models with Large Language Models (LLMs) through integration into two key phases: Initialization and Post-Correction. Since the LDA is highly dependent on the quality of its initialization, we conduct extensive experiments on the LLM-guided topic clustering for initializing the Gibbs sampling algorithm. Interestingly, the experimental results reveal that while the proposed initialization strategy improves the early iterations of LDA, it has no effect on the convergence and yields the worst performance compared to the baselines. The LLM-enabled post-correction, on the other hand, achieved a promising improvement of 5.86% in the coherence evaluation. These results highlight the practical benefits of the LLM-in-the-loop approach and challenge the belief that LLMs are always the superior text mining alternative.

[NLP-17] LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning

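[Quick Read]: This paper aims to address the underexplored integration of multimodal inputs and counterfactual reasoning in procedural planning for embodied AI systems. The key to the solution is the LLaPa framework, which uses vision-language models to generate executable action sequences from textual task descriptions and visual environmental images, and strengthens procedural planning with two auxiliary modules, the Task-Environment Reranker (TER) and the Counterfactual Activities Retriever (CAR), improving plan quality and correctness.

Link: https://arxiv.org/abs/2507.08496
Authors: Shibo Sun,Xue Li,Donglin Di,Mingjie Wei,Lanshun Nie,Wei-Nan Zhang,Dechen Zhan,Yang Song,Lei Fan
Affiliations: Harbin Institute of Technology; Li Auto; University of New South Wales
Categories: Computation and Language (cs.CL)
Comments: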

Abstract:While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model’s reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available this https URL.

[NLP-18] A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

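[Quick read]: This paper addresses the limitations of existing evaluation methods for large language models (LLMs), namely the respective shortcomings of reference-based and preference-based evaluation. Reference-based evaluation offers a controlled test environment but lacks the ecological validity of real use cases; preference-based evaluation is more ecologically valid but hard to standardize and repeat. To fill this gap, the paper presents a complementary third paradigm, dialogue game-based evaluation, whose key idea is to design controllable, reference-free, repeatable multi-turn interaction tasks that stress goal-directedness, preserving experimental control while increasing practical relevance. To this end, the authors present clembench, intended as a mature, easy-to-use, and extensible evaluation framework.

Link: https://arxiv.org/abs/2507.08491
Authors: David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler
Affiliations: University of Potsdam; DFKI
Subjects: Computation and Language (cs.CL)
Comments: All code required to run the benchmark, as well as extensive documentation, is available at this https URL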

Abstract:There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one’s own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.

[NLP-19] Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach

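[Quick read]: This paper addresses automatic scoring of textual cohesion in educational AI, specifically cohesion assessment in essays. The key to the solution is a cohesion score prediction approach based on Item Response Theory (IRT), which adjusts the scores generated by machine learning models so that they better reflect the cohesion of the text.

Link: https://arxiv.org/abs/2507.08487
Authors: Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, Rafael Ferreira Mello
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 4 tables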

Abstract:Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this sense, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.
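
The abstract names ability, difficulty and discrimination but does not say which IRT model is used or how scores are adjusted. As a hedged illustration only, the standard two-parameter logistic (2PL) item response function relates these three quantities; mapping models to respondents and essays to items is an assumption made here.

```python
import math


def two_pl(theta, a, b):
    """2PL item response function.

    theta: ability of the respondent (here, assumed to be an ML model).
    a: discrimination of the item (here, assumed to be an essay instance).
    b: difficulty of the item.
    Returns the probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When ability equals difficulty the probability is exactly 0.5, and a larger discrimination makes the curve steeper around that point.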

[NLP-20] ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition INTERSPEECH2025

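[Quick read]: This paper addresses the overfitting commonly observed with Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage. The key to the solution is a new training paradigm, Iterative LoRA Training (ILT), combined with an iterative pseudo-labeling strategy, which raises the theoretical upper bound of model performance.

Link: https://arxiv.org/abs/2507.08477
Authors: Qingliang Meng, Hao Wu, Wei Liang, Wei Xu, Qing Zhao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by the Interspeech 2025 MLC-SLM workshop as a Research Paper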

Abstract:The deep integration of large language models and automatic speech recognition systems has become a promising research direction with high practical value. To address the overfitting issue commonly observed in Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage, this work proposes an innovative training paradigm, Iterative LoRA Training (ILT), in combination with an Iterative Pseudo Labeling strategy, effectively enhancing the theoretical upper bound of model performance. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. Experimental results demonstrate the effectiveness of the proposed method. Furthermore, the MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th place in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task), showcasing the practical feasibility and strong application potential of our approach.

[NLP-21] Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study

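[Quick read]: This paper examines whether large language models (LLMs) can feasibly assist legal decision-making under Austrian and European Union value-added tax (VAT) law, in particular how LLMs can make automated decision-making more efficient and reduce the workload of tax professionals in tax consulting practice. The key to the solution is to improve LLM performance through fine-tuning and retrieval-augmented generation (RAG), and to systematically evaluate legal-reasoning capabilities on both textbook cases and real-world tax consulting cases in order to determine the best LLM system configurations. The results indicate that properly configured LLMs can effectively support tax professionals in VAT tasks and provide legally grounded justifications, but current prototypes are not ready for full automation and need further integration of structured background information to handle implicit client knowledge and context-specific documentation.

Link: https://arxiv.org/abs/2507.08468
Authors: Marina Luketina, Andrea Benkel, Christoph G. Schuetz
Affiliations: DKE, University of Linz; FH Steyr
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 5 figures, 6 tables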

Abstract:This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied on both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. The findings indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.

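[NLP-22] Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework

[Quick read]: This paper addresses the problem that errors produced by large language models (LLMs) in real applications cannot be attributed effectively, because existing evaluation models lack error attribution capability. The key to the solution is a comprehensive Misattribution Framework with 6 primary and 15 secondary categories, the AttriData dataset built on this framework, and MisAttributionLLM, a model fine-tuned on AttriData that is the first general-purpose judge model capable of simultaneously generating scores, misattributions, and feedback.

Link: https://arxiv.org/abs/2507.08459
Authors: Zishan Xu, Shuyi Xie, Qingsong Lv, Shupei Xiao, Linlin Song, Sui Wenjuan, Fan Lin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: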

Abstract:With the widespread application of Large Language Models (LLMs) in various tasks, the mainstream LLM platforms generate massive user-model interactions daily. In order to efficiently analyze the performance of models and diagnose failures in their answers, it is essential to develop an automated framework to systematically categorize and attribute errors. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattribution, along with the corresponding scores and feedback. We also propose MisAttributionLLM, a fine-tuned model on AttriData, which is the first general-purpose judge model capable of simultaneously generating score, misattribution, and feedback. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.

[NLP-23] Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

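[Quick read]: This paper addresses how to use generative AI to simulate group decision-making in decision conferences, in particular how to detect agreement among participants in dynamic and nuanced debates. The key to the solution is a multi-agent system built on large language models (LLMs), in which different LLMs are evaluated on stance detection and stance polarity detection in order to identify and promote agreement among agents. The results indicate that LLMs can reliably detect agreement, and that adding an agreement-detection agent improves the efficiency and quality of group discussion, bringing the simulated outcomes close to those of real decision conferences.

Link: https://arxiv.org/abs/2507.08440
Authors: Selina Heller, Mohamed Ibrahim, David Antony Selby, Sebastian Vollmer
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: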

Abstract:Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences regarding outcome and decision-making. These findings demonstrate the potential for LLM-based multi-agent systems to simulate group decision-making processes. They also highlight that such systems could be instrumental in supporting decision-making with expert elicitation workshops across various domains.

[NLP-24] xpSHACL: Explainable SHACL Validation using Retrieval-Augmented Generation and Large Language Models VLDB’25

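[Quick read]: This paper addresses the problem that traditional SHACL validation engines produce terse English validation reports that non-technical users find hard to understand, which hinders them from acting on constraint violations. The key to the solution is xpSHACL, a system that combines rule-based justification trees with retrieval-augmented generation (RAG) and large language models (LLMs) to produce detailed, multilingual, human-readable explanations, and that uses a Violation KG to cache and reuse explanations, improving efficiency and consistency.

Link: https://arxiv.org/abs/2507.08432
Authors: Gustavo Correa Publio, José Emilio Labra Gayo
Affiliations: University of Leipzig; University of Oviedo
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments: Accepted for publication in the 2nd LLM+Graph Workshop, co-located with VLDB’25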

Abstract:Shapes Constraint Language (SHACL) is a powerful language for validating RDF data. Given the recent industry attention to Knowledge Graphs (KGs), more users need to validate linked data properly. However, traditional SHACL validation engines often provide terse reports in English that are difficult for non-technical users to interpret and act upon. This paper presents xpSHACL, an explainable SHACL validation system that addresses this issue by combining rule-based justification trees with retrieval-augmented generation (RAG) and large language models (LLMs) to produce detailed, multilanguage, human-readable explanations for constraint violations. A key feature of xpSHACL is its usage of a Violation KG to cache and reuse explanations, improving efficiency and consistency.
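
The caching idea in the abstract can be illustrated with a small sketch. Everything here is hypothetical: a plain dictionary stands in for the Violation KG, the cache key (shape, constraint component, language) is an assumed violation signature, and `generate` is a placeholder for the expensive RAG and LLM call. None of this is the xpSHACL API.

```python
from typing import Callable, Dict, Tuple

# Assumed violation signature: (shape, constraint component, language).
Signature = Tuple[str, str, str]


class ExplanationCache:
    """Reuse an explanation when the same kind of violation is seen again."""

    def __init__(self, generate: Callable[[Signature], str]) -> None:
        self._generate = generate              # placeholder for the RAG + LLM call
        self._store: Dict[Signature, str] = {}  # stands in for the Violation KG
        self.misses = 0                        # how often generation was needed

    def explain(self, shape: str, component: str, language: str = "en") -> str:
        key = (shape, component, language)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(key)
        return self._store[key]
```

Two violations with the same signature then yield the identical explanation text and only one generation call, which is where the efficiency and consistency gains would come from.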

[NLP-25] ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains ACL2025

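[Quick read]: This paper addresses the difficulty large language models (LLMs) have in maintaining logical consistency across associated facts when knowledge edits propagate ripple effects. The key to the solution is the ChainEdit framework, which combines logical rules derived from knowledge graphs with the logical reasoning capabilities of LLMs to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with the LLM's internal logic, ChainEdit dynamically generates and edits logically connected knowledge clusters, substantially improving logical generalization while preserving editing reliability and specificity.

Link: https://arxiv.org/abs/2507.08427
Authors: Zilu Dong, Xiangqing Shen, Zinong Yang, Rui Xia
Affiliations: Nanjing University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 (main)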

Abstract:Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs’ internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.

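[NLP-26] A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods, and Opportunities

[Quick read]: This paper addresses how to systematically understand the integration of large language models (LLMs) into discipline-specific and interdisciplinary research, an area that still lacks systematic study. The key is to categorize work on LLM integration from two perspectives, technical methods and applicability: it covers key techniques including supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration, and discusses the contributions and use cases of LLMs in mathematics, physics, chemistry, biology, and the humanities and social sciences.

Link: https://arxiv.org/abs/2507.08425
Authors: Lu Xiang, Yang Zhao, Yaping Zhang, Chengqing Zong
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Subjects: Computation and Language (cs.CL)
Comments: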

Abstract:Large Language Models (LLMs) have demonstrated their transformative potential across numerous disciplinary studies, reshaping the existing research methodologies and fostering interdisciplinary collaboration. However, a systematic understanding of their integration into diverse disciplines remains underexplored. This survey paper provides a comprehensive overview of the application of LLMs in interdisciplinary studies, categorising research efforts from both a technical perspective and with regard to their applicability. From a technical standpoint, key methodologies such as supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration are examined, which enhance the adaptability and effectiveness of LLMs in discipline-specific contexts. From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs. By providing a comprehensive overview of the technical developments and applications in this field, this survey aims to serve as an invaluable resource for the researchers who are navigating the complex landscape of LLMs in the context of interdisciplinary studies.

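[NLP-27] The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality

[Quick read]: This paper addresses hallucination in language models, i.e. generating text that is factually incorrect. The key finding is that finetuning on model-generated data, especially data the model itself believes to be factual, is more effective than finetuning on high-quality factual gold data. The study further shows that filtering generated data by the model's own internal judgments often improves the overall factuality of the generated content.

Link: https://arxiv.org/abs/2507.08371
Authors: Benjamin Newman, Abhilasha Ravichander, Jaehun Jung, Rui Xin, Hamish Ivison, Yegor Kuznetsov, Pang Wei Koh, Yejin Choi
Affiliations: University of Washington; Stanford University
Subjects: Computation and Language (cs.CL)
Comments: 29 pages, 4 figures, 16 tables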

Abstract:Language models are prone to hallucination: generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models’ own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models’ judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a model’s own beliefs can provide a powerful signal for factuality.

[NLP-28] Exploring Design of Multi-Agent LLM Dialogues for Research Ideation SIGDIAL2025

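[Quick read]: This paper addresses how to design multi-agent large language model (LLM) dialogues so as to improve the novelty and feasibility of generated scientific ideas. The key to the solution is to vary parameters such as agent role configuration, number of agents, and dialogue depth to increase the diversity and feasibility of generated ideas, and in particular to increase critic-side diversity within the ideation-critique-revision loop, which further improves the feasibility of the final proposals.

Link: https://arxiv.org/abs/2507.08350
Authors: Keisuke Ueda, Wataru Hirota, Takuto Asakura, Takahiro Omi, Kosuke Takahashi, Kosuke Arima, Tatsuya Ishigaki
Affiliations: Artificial Intelligence Research Center, AIST; EPFL; Stockmark
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 16 pages, 1 figure, appendix. Accepted to SIGDIAL 2025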

Abstract:Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at this https URL.

[NLP-29] Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization ACL2025

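[Quick read]: This paper examines how well evaluation metrics for generation tasks apply across language types, in particular the poor performance of n-gram-based metrics on morphologically rich fusional languages. The key to the solution is a large-scale evaluation suite across eight languages that analyzes how metrics correlate with human judgments for different language types; it shows that proper tokenization can substantially improve n-gram metrics for fusional languages, and that neural metrics trained specifically for evaluation (such as COMET) perform better in low-resource languages.

Link: https://arxiv.org/abs/2507.08342
Authors: Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
Affiliations: Bar-Ilan University
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Main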

Abstract:Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to evaluate their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families: agglutinative, isolating, low-fusional, and high-fusional, spanning both low- and high-resource settings, to analyze their correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to the language type. For example, in fusional languages, n-gram-based metrics show lower correlation with human assessments compared to isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural-based metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and better correlate with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural-based metrics trained for evaluation tasks.
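
To make the tokenization point concrete, here is a minimal sketch of ROUGE-N recall with a pluggable tokenizer. It is a simplified illustration rather than the official ROUGE implementation, and `toy_morph` is a hypothetical stand-in for a real morphological segmenter (English is used only to keep the toy readable).

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1, tokenize=str.split):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return overlap / total


def toy_morph(text):
    """Hypothetical segmenter: split a trailing 's' off words longer than 3 letters."""
    out = []
    for word in text.split():
        if len(word) > 3 and word.endswith("s"):
            out.extend([word[:-1], "s"])
        else:
            out.append(word)
    return out
```

With whitespace tokens, "the cats sat" scores 2/3 against "the cat sat" because "cats" and "cat" never match; with the toy segmenter the shared stem is counted and the recall rises to 1.0.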

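[NLP-30] What Factors Affect LLMs and RLLMs in Financial Question Answering?

[Quick read]: This paper addresses how to fully unlock the performance of large language models (LLMs) and reasoning large language models (RLLMs) in the financial domain, specifically the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question answering. The key is a systematic evaluation of these methods on LLMs and RLLMs, which finds that current prompting methods and agentic frameworks improve LLM performance by simulating Long Chain-of-Thought (Long CoT); that RLLMs already possess inherent Long CoT capabilities, so conventional methods add little; and that multilingual alignment methods improve the multilingual performance of LLMs mainly by extending the reasoning length, with limited benefit for RLLMs.

Link: https://arxiv.org/abs/2507.08339
Authors: Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li
Affiliations: Macau University of Science and Technology; Anhui University; Huazhong University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: Preprint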

Abstract:Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.

[NLP-31] Distillation versus Contrastive Learning: How to Train Your Rerankers

[Quick read]: This paper addresses how to train cross-encoder rerankers effectively, comparing contrastive learning and knowledge distillation under practical conditions. The key is to train rerankers of different sizes and architectures with both methods on the same data, using a strong contrastive learning model as the distillation teacher, in order to measure the performance difference. The results show that distilling from a larger teacher model generally yields better in-domain and out-of-domain ranking performance than contrastive learning.

Link: https://arxiv.org/abs/2507.08336
Authors: Zhichao Xu, Zhiqi Huang, Shengyao Zhuang, Ashim Gupta, Vivek Srikumar
Affiliations: Kahlert School of Computing, University of Utah; Scientific Computing and Imaging Institute, University of Utah; Capital One, Inc; The University of Queensland
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Training text rerankers is crucial for information retrieval. Two primary strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied in the literature, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. Therefore, we recommend using knowledge distillation to train smaller rerankers if a larger, more powerful teacher is accessible; in its absence, contrastive learning provides a strong and more reliable alternative.
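
The abstract contrasts the two training signals without giving loss formulas. As a sketch under stated assumptions, the snippet below uses two common choices: a softmax cross-entropy over one positive and several negatives for contrastive learning, and a KL divergence between teacher and student score distributions for distillation. The paper may use different formulations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(scores, positive_index=0):
    """Cross-entropy with the labeled positive as the target (InfoNCE-style)."""
    return -math.log(softmax(scores)[positive_index])


def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the same candidate list."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The contrastive loss only sees which candidate is labeled relevant, while the distillation loss also transfers the teacher's graded preferences among the other candidates.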

[NLP-32] MK2 at PBIG Competition: A Prompt Generation Solution

[Quick read]: This paper addresses patent-based idea generation, i.e. turning real patents into product ideas that are viable within three years. The key to the solution is MK2, a prompt-centric pipeline: Gemini 2.5 first drafts and iteratively edits a prompt, grafting in useful fragments from weaker outputs; GPT-4.1 then uses this prompt to generate one idea per patent; finally, an Elo loop judged by Qwen3-8B selects the best prompt, all without extra training data.

Link: https://arxiv.org/abs/2507.08335
Authors: Yuzheng Xu, Tosho Hirasawa, Seiya Kawano, Shota Kato, Tadashi Kozuno
Affiliations: OMRON SINIC X; NexaScience; Kyoto Institute of Technology; Kyoto University
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, to appear in the 2nd Workshop on Agent AI for Scenario Planning (AGENTSCEN 2025)

Abstract:The Patent-Based Idea Generation task asks systems to turn real patents into product ideas viable within three years. We propose MK2, a prompt-centric pipeline: Gemini 2.5 drafts and iteratively edits a prompt, grafting useful fragments from weaker outputs; GPT-4.1 then uses this prompt to create one idea per patent, and an Elo loop judged by Qwen3-8B selects the best prompt, all without extra training data. Across three domains, two evaluator types, and six criteria, MK2 topped the automatic leaderboard and won 25 of 36 tests. Only the materials-chemistry track lagged, indicating the need for deeper domain grounding; yet, the results show that lightweight prompt engineering has already delivered competitive, commercially relevant ideation from patents.
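
The abstract mentions an Elo loop for choosing among prompts but gives no parameters. The sketch below shows the standard Elo update for one pairwise comparison; the K-factor of 32 and the 400-point scale are the conventional defaults and are assumptions here, as is the idea that each judged comparison between two prompts counts as one match.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    (for example, as decided by a judge model).
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

After many judged comparisons, the prompt with the highest rating would be the one selected.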

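[NLP-33] CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

[Quick read]: This paper addresses the lack of expertise and tools that merchants in e-commerce private-domain channels have for writing effective marketing copy in Customer Relationship Management (CRM) campaigns. The key to the solution is CRMAgent, a multi-agent system built on large language models (LLMs) that generates high-quality message templates and actionable writing guidance through three complementary modes, group-based learning, retrieval-and-adaptation, and a rule-based fallback, improving audience match and marketing effectiveness.

Link: https://arxiv.org/abs/2507.08325
Authors: Yinzhu Quan, Xinrui Li, Ying Chen
Affiliations: Georgia Institute of Technology; ByteDance Inc.
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: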

Abstract:In e-commerce private-domain channels such as instant messaging and e-mail, merchants engage customers directly as part of their Customer Relationship Management (CRM) programmes to drive retention and conversion. While a few top performers excel at crafting outbound messages, most merchants struggle to write persuasive copy because they lack both expertise and scalable tools. We introduce CRMAgent, a multi-agent system built on large language models (LLMs) that generates high-quality message templates and actionable writing guidance through three complementary modes. First, group-based learning enables the agent to learn from a merchant’s own top-performing messages within the same audience segment and rewrite low-performing ones. Second, retrieval-and-adaptation fetches templates that share the same audience segment and exhibit high similarity in voucher type and product category, learns their successful patterns, and adapts them to the current campaign. Third, a rule-based fallback provides a lightweight zero-shot rewrite when no suitable references are available. Extensive experiments show that CRMAgent consistently outperforms merchants’ original templates, delivering significant gains in both audience-match and marketing-effectiveness metrics.

[NLP-34] Improving MLLMs Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency ACL2025

[Quick read]: This paper addresses the cross-modal and cross-lingual challenges that Multimodal Large Language Models (MLLMs) face in Document Image Machine Translation (DIMT), as well as the forgetting of existing monolingual abilities (such as OCR) when DIMT capability is improved through Supervised Fine-Tuning (SFT). The key to the solution is a fine-tuning paradigm called Synchronously Self-Reviewing (SSR), in which the model generates the OCR text before producing the translation, so that it retains and leverages its strong monolingual OCR ability while learning cross-lingual translation, mitigating catastrophic forgetting and improving generalization on both OCR and DIMT tasks.

Link: https://arxiv.org/abs/2507.08309
Authors: Yupu Liang, Yaping Zhang, Zhiyang Zhang, Zhiyuan Chen, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2025 Findings

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model’s existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept “Bilingual Cognitive Advantage”. Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.

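[NLP-35] M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

[Quick Read]: This paper addresses the weakness of Multimodal Large Language Models (MLLMs) in dynamic spatial interactions, which limits their performance in real-world applications. The key to the solution is the M2-Reasoning-7B model, whose core innovations are: (1) a new data pipeline that generates 294.2K high-quality samples covering cold-start fine-tuning and Reinforcement Learning with Verifiable Rewards (RLVR); and (2) a dynamic multi-task training strategy that combines step-wise optimization with task-specific rewards to reduce conflicts between data and tasks and to provide tailored incentive signals. Together these improve the model's performance on general and spatial reasoning tasks.

Link: https://arxiv.org/abs/2507.08306
Authors: Inclusion AI: Fudong Wang,Jiajia Liu,Jingdong Chen,Jun Zhou,Kaixiang Ji,Lixiang Ru,Qingpei Guo,Ruobing Zheng,Tianqi Li,Yi Yuan,Yifan Mao,Yuting Xiao,Ziping Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages, 14 figures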

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.

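[NLP-36] KAT-V1: Kwai-AutoThink Technical Report

[Quick Read]: This paper addresses the efficiency and accuracy problems caused by overthinking in reasoning-intensive tasks. The key to the solution is an automatic thinking training paradigm that switches dynamically between reasoning and non-reasoning modes according to task complexity. The paradigm builds a dual-regime dataset, applies knowledge distillation enhanced with Multi-Token Prediction (MTP), uses a cold-start initialization strategy based on majority-vote signals and intent-aware prompting, and introduces Step-SRPO, a reinforcement learning algorithm with intermediate supervision. This yields efficient, fine-grained reasoning transfer at a much lower pretraining cost.

Link: https://arxiv.org/abs/2507.08297
Authors: Zizheng Zhan,Ken Deng,Huaixi Tang,Wen Xiang,Kun Wu,Weihao Li,Wenqiang Zhu,Jingxuan Xu,Lecheng Huang,Zongxian Feng,Shaojie Wang,Shangpeng Yan,Jiaheng Liu,Zhongyuan Peng,Zuchen Gao,Haoyang Huang,Ziqi Zhan,Yanan Wu,Yuanxing Zhang,Jian Yang,Guang Chen,Haotian Zhang,Bin Chen,Bing Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: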

Abstract:We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.
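The abstract does not spell out Step-SRPO, so the following is only a rough sketch of the group-relative advantage normalization commonly associated with GRPO, which the abstract names as the base framework. The `combined_reward` function, its equal weighting of a mode-selection signal and an answer-correctness signal, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def combined_reward(mode_correct: bool, answer_correct: bool,
                    mode_weight: float = 0.5) -> float:
    """Hypothetical reward mixing a mode-selection signal with an
    answer-correctness signal (not the paper's actual reward)."""
    return mode_weight * float(mode_correct) + (1.0 - mode_weight) * float(answer_correct)


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against the mean and standard deviation of its
    own sampling group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (mode chosen correctly, answer correct).
samples = [(True, True), (True, False), (False, True), (False, False)]
rewards = [combined_reward(m, a) for m, a in samples]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive positive advantages and the others negative ones, so no separate value network is needed to form a baseline.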

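[NLP-37] Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training

[Quick Read]: This paper addresses the heavy reliance on large models for content moderation and aims to show that small language models (SLMs) can match or even surpass larger models when trained with effective methods. The key to the solution is the combination of high-fidelity synthetic data generation and adversarial training. Human-curated seed data is expanded through query augmentation and paraphrasing into diverse, contextually rich examples, which then pass through multiple rounds of curation to ensure quality. In addition, in a setup inspired by the Generative Adversarial Network (GAN) architecture, reinforcement learning guides a generator to produce challenging synthetic examples that are used to fine-tune the safety classifier, improving its ability to detect and mitigate harmful content.

Link: https://arxiv.org/abs/2507.08284
Authors: Aleksei Ilin,Gor Matevosyan,Xueying Ma,Vladimir Eremin,Suhaa Dada,Muqun Li,Riyaaz Shaik,Haluk Noyan Tokgozoglu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: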

Abstract:We introduce a lightweight yet highly effective safety guardrail framework for language models, demonstrating that small-scale language models can achieve, and even surpass, the performance of larger counterparts in content moderation tasks. This is accomplished through high-fidelity synthetic data generation and adversarial training. The synthetic data generation process begins with human-curated seed data, which undergoes query augmentation and paraphrasing to create diverse and contextually rich examples. This augmented data is then subjected to multiple rounds of curation, ensuring high fidelity and relevance. Inspired by recent advances in the Generative Adversarial Network (GAN) architecture, our adversarial training employs reinforcement learning to guide a generator that produces challenging synthetic examples. These examples are used to fine-tune the safety classifier, enhancing its ability to detect and mitigate harmful content. Additionally, we incorporate strategies from recent research on efficient LLM training, leveraging the capabilities of smaller models to improve the performance of larger generative models. With iterative adversarial training and the generation of diverse, high-quality synthetic data, our framework enables small language models (SLMs) to serve as robust safety guardrails. This approach not only reduces computational overhead but also enhances resilience against adversarial attacks, offering a scalable and efficient solution for content moderation in AI systems.

[NLP-38] Exploring Gender Differences in Chronic Pain Discussions on Reddit

[Quick Read]: This paper addresses gender differences in pain experiences, in particular the neglect of gender in earlier studies. The key to the solution is to apply Natural Language Processing (NLP): a Hidden Attribute Model-Convolutional Neural Network (HAM-CNN) classifies user posts in order to separate male and female expressions of pain, revealing gender differences in linguistic features and in the distribution of pain-related conditions.

Link: https://arxiv.org/abs/2507.08241
Authors: Ancita Maria Andrade,Tanvi Banerjee,Ramakrishna Mundugar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This is an extended version of the short paper accepted at ASONAM 2025

Abstract:Pain is an inherent part of human existence, manifesting as both physical and emotional experiences, and can be categorized as either acute or chronic. Over the years, extensive research has been conducted to understand the causes of pain and explore potential treatments, with contributions from various scientific disciplines. However, earlier studies often overlooked the role of gender in pain experiences. In this study, we utilized Natural Language Processing (NLP) to analyze and gain deeper insights into individuals’ pain experiences, with a particular focus on gender differences. We successfully classified posts into male and female corpora using the Hidden Attribute Model-Convolutional Neural Network (HAM-CNN), achieving an F1 score of 0.86 by aggregating posts based on usernames. Our analysis revealed linguistic differences between genders, with female posts tending to be more emotionally focused. Additionally, the study highlighted that conditions such as migraine and sinusitis are more prevalent among females and explored how pain medication affects individuals differently based on gender.

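[NLP-39] Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension? ACL2025

[Quick Read]: This paper asks how accurately Large Language Models (LLMs), which are widely used as "proxy students" in Intelligent Tutoring Systems (ITSs) and in piloting test questions, emulate the behavior and characteristics of real students. The key to the solution is to use an Item Response Theory (IRT) model to place 11 diverse, state-of-the-art LLMs on the same ability scale as real student populations, which makes it possible to assess performance differences and sensitivity to prompts. The results show that, without guidance, strong general-purpose models generally outperform the average student, while weaker or domain-mismatched models may align only incidentally. Grade-enforcement prompts change model performance, but the effect varies considerably across model and prompt combinations, indicating a need for new training and evaluation strategies.

Link: https://arxiv.org/abs/2507.08232
Authors: KV Aditya Srivatsa,Kaushal Kumar Maurya,Ekaterina Kochmar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), co-located with ACL 2025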

Abstract:Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models’ performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
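To make the idea of a shared "ability scale" concrete, here is a minimal sketch using the two-parameter logistic (2PL) IRT model. The abstract does not say which IRT variant the authors use, so 2PL is an assumption, and the item parameters and response patterns below are invented for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a responder with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern under fixed item parameters."""
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y else math.log(1.0 - p)
    return total


def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical (discrimination, difficulty) pairs for four items.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
strong = estimate_ability(items, [1, 1, 1, 0])  # e.g. a model that misses one item
weak = estimate_ability(items, [1, 0, 0, 0])    # e.g. a model that solves one item
```

Because students and models are scored against the same calibrated items, their estimated abilities land on one scale and can be compared directly.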

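[NLP-40] Simple Mechanistic Explanations for Out-Of-Context Reasoning ICML2025

[Quick Read]: This paper studies out-of-context reasoning (OOCR), a phenomenon in which fine-tuned large language models (LLMs) show deep out-of-distribution generalization. Its core contribution is a mechanistic account of OOCR: in many instances, LoRA fine-tuning essentially adds a constant steering vector that steers the model toward a general concept, which improves task performance and produces unexpected generalization in related domains. The key finding is that OOCR can be induced by training steering vectors directly; even for tasks that seem to require conditional behavior (such as model backdoors), unconditionally adding a steering vector is sufficient.

Link: https://arxiv.org/abs/2507.08218
Authors: Atticus Wang,Joshua Engels,Oliver Clive-Griffin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICML 2025 Workshop R2-FM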

Abstract:Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors); it turns out that unconditionally adding a steering vector is sufficient. Overall, our work presents one explanation of what gets learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant to their safe and reliable deployment.
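What "unconditionally adding a steering vector" means can be shown with a toy sketch: the same vector is added to the hidden state at every token position, regardless of the input. The function name, the `scale` parameter, and the tiny three-dimensional hidden states are assumptions for illustration; the paper's layer choice and training procedure are not reproduced here.

```python
def add_steering_vector(hidden_states, steering_vector, scale=1.0):
    """Add the same (constant) vector to every token's hidden state.

    hidden_states: list of per-token vectors (lists of floats).
    steering_vector: one vector with the same dimensionality.
    """
    return [
        [h + scale * v for h, v in zip(token, steering_vector)]
        for token in hidden_states
    ]


# Two tokens with 3-dimensional hidden states (toy values).
hidden = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
steer = [0.5, -1.0, 0.0]
steered = add_steering_vector(hidden, steer)
```

The shift is identical for every token and every prompt, which is what distinguishes a constant steering vector from an input-dependent weight update.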

[NLP-41] TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs

[Quick Read]: This paper addresses the fact that generative Large Language Models (LLMs) inevitably produce untruthful content, focusing on how to predict the truthfulness of their outputs accurately. The key to the solution is TruthTorchLM, an open-source, comprehensive Python library with more than 30 truthfulness prediction methods (Truth Methods). The methods span different tradeoffs in computational cost, access level, grounding document requirements, and supervision type, and the library integrates with HuggingFace and LiteLLM, making truthfulness prediction methods more accessible and practical.

Link: https://arxiv.org/abs/2507.08203
Authors: Duygu Nur Yaldiz,Yavuz Faruk Bakman,Sungmin Kang,Alperen Öziş,Hayrettin Eren Yildiz,Mitash Ashish Shah,Zhiqi Huang,Anoop Kumar,Alfy Samuel,Daben Liu,Sai Praneeth Karimireddy,Salman Avestimehr
Affiliations: University of Southern California; Independent Researcher; Bogazici University; Capital One
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at this https URL

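[NLP-42] Overview of the TREC 2021 deep learning track

[Quick Read]: This paper addresses document and passage ranking in information retrieval, using large-scale pretrained deep neural network models to improve retrieval performance. The key to the solution is the large number of human-annotated training labels provided by the MS MARCO datasets, together with refreshed and enlarged document and passage collections. The report also examines the performance gap between single-stage retrieval and multistage retrieval pipelines, and the effect of the data refresh on the completeness of NIST judgments and the quality of the training labels.

Link: https://arxiv.org/abs/2507.08191
Authors: Nick Craswell,Bhaskar Mitra,Emine Yilmaz,Daniel Campos,Jimmy Lin
Affiliations: Microsoft; University College London; University of Illinois Urbana-Champaign; Neural Magic Inc; University of Waterloo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: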

Abstract:This is the third year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. In addition, this year we refreshed both the document and the passage collections, which also led to a nearly four times increase in the document collection size and a nearly 16 times increase in the size of the passage collection. Deep neural ranking models that employ large-scale pretraining continued to outperform traditional retrieval methods this year. We also found that single-stage retrieval can achieve good performance on both tasks, although it still does not perform on par with multistage retrieval pipelines. Finally, the increase in the collection size and the general data refresh raised some questions about the completeness of NIST judgments and the quality of the training labels that were mapped to the new collections from the old ones, which we discuss in this report.

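[NLP-43] Distilling Empathy from Large Language Models SIGDIAL2025

[Quick Read]: This paper addresses how to preserve the empathy already present in Large Language Models (LLMs) when their knowledge is distilled into Small Language Models (SLMs). The key to the solution is a two-step fine-tuning process that uses datasets of empathetic dialogue responses distilled from LLMs, along with four sets of targeted prompts designed to significantly improve empathy distillation. Experiments show that SLMs fine-tuned this way outperform the base SLM at generating empathetic responses, with a win rate of 90%.

Link: https://arxiv.org/abs/2507.08151
Authors: Henry J. Xie,Jinghan Zhang,Xinhao Zhang,Kunpeng Liu
Affiliations: Westview High School; Portland State University
Categories: Computation and Language (cs.CL)
Comments: Accepted by SIGDIAL 2025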

Abstract:The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smart phones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10% improvement in win rate.

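[NLP-44] Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

[Quick Read]: This paper addresses the limited throughput and higher deployment cost caused by the large memory footprint of the key-value (KV) cache when large language models process long contexts. The key to the solution is Compactor, a parameter-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance, substantially reducing the number of tokens that must be retained while preserving performance.

Link: https://arxiv.org/abs/2507.08143
Authors: Vivek Chari,Benjamin Van Durme
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: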

Abstract:Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. Unfortunately the ability to use long contexts in generation is complicated by the large memory requirement of the KV cache, which scales linearly with the context length. This memory footprint is often the dominant resource bottleneck in real-world deployments, limiting throughput and increasing serving cost. One way to address this is by compressing the KV cache, which can be done either with knowledge of the question being asked (query-aware) or without knowledge of the query (query-agnostic). We present Compactor, a parameter-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 1/2 the tokens in both synthetic and real-world context tasks, with minimal computational overhead. We further introduce a procedure for context-calibrated compression, which allows one to infer the maximum compression ratio a given context can support. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 63%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families.
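As a rough illustration of scoring tokens by leverage, the sketch below computes exact statistical leverage scores for the rows of a small key matrix and keeps the highest-scoring tokens. This is not Compactor itself: the paper uses approximate scores and a calibration procedure that are not reproduced here, and the function names and toy key matrix are assumptions.

```python
import math


def leverage_scores(rows):
    """Exact leverage scores of the rows of a matrix (list of lists).

    The columns are orthonormalized with modified Gram-Schmidt; the score of
    row i is the squared norm of row i in that orthonormal basis.
    """
    n, d = len(rows), len(rows[0])
    columns = [[rows[i][j] for i in range(n)] for j in range(d)]
    basis = []
    for col in columns:
        v = col[:]
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip linearly dependent columns
            basis.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]


def select_tokens(scores, keep):
    """Indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy key matrix: 4 tokens, 2 dimensions.
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
scores = leverage_scores(keys)
kept = select_tokens(scores, keep=2)
```

The scores depend only on the keys, not on any query, which is why a leverage-based criterion fits the query-agnostic setting described in the abstract.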

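[NLP-45] Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

[Quick Read]: This paper addresses audio-language understanding and reasoning across speech, sound, and music, in particular the challenges of long-audio processing, multi-turn multi-audio chat, and on-demand thinking. The key to the solution is AF-Whisper, a unified audio encoder trained with a joint representation learning strategy across the three modalities, combined with a five-stage curriculum-based training strategy and several large-scale curated datasets (AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat). Trained only on open-source audio data, the model surpasses existing open-weight and closed-source models.

Link: https://arxiv.org/abs/2507.08128
Authors: Arushi Goel,Sreyan Ghosh,Jaehyeon Kim,Sonal Kumar,Zhifeng Kong,Sang-gil Lee,Chao-Han Huck Yang,Ramani Duraiswami,Dinesh Manocha,Rafael Valle,Bryan Catanzaro
Affiliations: NVIDIA; University of Maryland, College Park
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Code, Datasets and Models: this https URL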

Abstract:We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

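[NLP-46] Audit Alignment and Optimization of LM-Powered Subroutines with Application to Public Comment Processing

[Quick Read]: This paper addresses how to apply generative AI in real-world settings in a safe, explainable, and unbiased way, especially where human experts should focus on decision-making rather than data processing or prompt engineering. The key to the solution is a framework for declaring statically typed subroutines powered by language models (LMs) that can be embedded in conventional asynchronous code, with sparse feedback from human experts used online (that is, during use) to keep improving each subroutine. The framework also records all LM-produced artifacts and exposes them for audit, enabling transparent and traceable use.

Link: https://arxiv.org/abs/2507.08109
Authors: Reilly Raab,Mike Parker,Dan Nally,Sadie Montgomery,Anastasia Bernat,Sai Munikoti,Sameera Horawalavithana
Affiliations: Pacific Northwest National Laboratory
Categories: Computation and Language (cs.CL)
Comments: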

Abstract:The advent of language models (LMs) has the potential to dramatically accelerate tasks that may be cast to text-processing; however, real-world adoption is hindered by concerns regarding safety, explainability, and bias. How can we responsibly leverage LMs in a transparent, auditable manner, minimizing risk and allowing human experts to focus on informed decision-making rather than data-processing or prompt engineering? In this work, we propose a framework for declaring statically typed, LM-powered subroutines (i.e., callable, function-like procedures) for use within conventional asynchronous code, such that sparse feedback from human experts is used to improve the performance of each subroutine online (i.e., during use). In our implementation, all LM-produced artifacts (i.e., prompts, inputs, outputs, and data-dependencies) are recorded and exposed to audit on demand. We package this framework as a library to support its adoption and continued development. While this framework may be applicable across several real-world decision workflows (e.g., in healthcare and legal fields), we evaluate it in the context of public comment processing as mandated by the 1969 National Environmental Protection Act (NEPA): Specifically, we use this framework to develop “CommentNEPA,” an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review. We quantitatively evaluate the application by comparing its outputs (when operating without human feedback) to historical “ground-truth” data as labelled by human annotators during the preparation of official environmental impact statements.
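A typed, audited, LM-backed subroutine inside asynchronous code could look roughly like the sketch below. Every name here (`AuditRecord`, `AuditLog`, `lm_subroutine`, `stub_lm`, `parse_stance`) is hypothetical and is not the API of the authors' library; the language model is replaced by a stub, and the online feedback loop from human experts is not modeled.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class AuditRecord:
    subroutine: str
    prompt: str
    raw_output: str


@dataclass
class AuditLog:
    records: List[AuditRecord] = field(default_factory=list)


async def stub_lm(prompt: str) -> str:
    """Stand-in for a real LM call (hypothetical); keys on one word."""
    return "support" if "approve" in prompt.lower() else "oppose"


def lm_subroutine(name: str, template: str, parse: Callable[[str], T],
                  log: AuditLog,
                  lm: Callable[[str], Awaitable[str]] = stub_lm
                  ) -> Callable[[str], Awaitable[T]]:
    """Wrap a prompt template and a typed parser into an async callable that
    records every prompt and raw output in the audit log."""
    async def run(text: str) -> T:
        prompt = template.format(text=text)
        raw = await lm(prompt)
        log.records.append(AuditRecord(name, prompt, raw))
        return parse(raw)
    return run


def parse_stance(raw: str) -> bool:
    """Typed parser: reject anything outside the expected label set."""
    if raw not in ("support", "oppose"):
        raise ValueError(f"unexpected LM output: {raw!r}")
    return raw == "support"


log = AuditLog()
classify = lm_subroutine(
    "stance",
    "Answer 'support' or 'oppose'.\nComment: {text}",
    parse_stance,
    log,
)
result = asyncio.run(classify("Please approve the permit."))
```

Keeping the parser separate from the LM call means malformed outputs fail loudly at a typed boundary, while the log retains the exact prompt and raw output for later review.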

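[NLP-47] GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs

[Quick Read]: This paper addresses the generation of SPARQL queries from natural language questions or keyword queries for retrieval over RDF knowledge graphs. The key to the solution is to use a large language model, without fine-tuning, to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. The approach performs well across a variety of benchmarks and reaches state-of-the-art results on multiple Wikidata benchmarks in the zero-shot setting.

Link: https://arxiv.org/abs/2507.08107
Authors: Sebastian Walter,Hannah Bast
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: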

Abstract:We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.

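[NLP-48] VideoConviction: A Multimodal Benchmark for Human Conviction and Stock Market Recommendations

[Quick Read]: This paper addresses how to accurately identify the investment action and conviction (strength of belief) in the investment advice that financial influencers (finfluencers) post on social media. The key to the solution is VideoConviction, a multimodal dataset with more than 6,000 expert annotations for evaluating multimodal large language models (MLLMs) and text-based large language models (LLMs) on financial discourse. By analyzing multimodal signals in video such as tone, delivery style, and facial expressions, the work aims to improve the understanding and evaluation of finfluencer recommendations.

Link: https://arxiv.org/abs/2507.08104
Authors: Michael Galarnyk,Veer Kejriwal,Agam Shah,Yash Bhardwaj,Nicholas Meyer,Anand Krishnan,Sudheer Chava
Affiliations: Georgia Institute of Technology; Stanford University
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: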

Abstract:Social media has amplified the reach of financial influencers known as “finfluencers,” who share stock recommendations on platforms like YouTube. Understanding their influence requires analyzing multimodal signals like tone, delivery style, and facial expressions, which extend beyond text-based financial analysis. We introduce VideoConviction, a multimodal dataset with 6,000+ expert annotations, produced through 457 hours of human effort, to benchmark multimodal large language models (MLLMs) and text-based large language models (LLMs) in financial discourse. Our results show that while multimodal inputs improve stock ticker extraction (e.g., extracting Apple’s ticker AAPL), both MLLMs and LLMs struggle to distinguish investment actions and conviction (the strength of belief conveyed through confident delivery and detailed reasoning), often misclassifying general commentary as definitive recommendations. While high-conviction recommendations perform better than low-conviction ones, they still underperform the popular S&P 500 index fund. An inverse strategy, betting against finfluencer recommendations, outperforms the S&P 500 by 6.8% in annual returns but carries greater risk (Sharpe ratio of 0.41 vs. 0.65). Our benchmark enables a diverse evaluation of multimodal tasks, comparing model performance on both full video and segmented video inputs. This enables deeper advancements in multimodal financial research. Our code, dataset, and evaluation leaderboard are available under the CC BY-NC 4.0 license.
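The comparison of Sharpe ratios in the abstract rests on the standard definition: mean excess return divided by the standard deviation of excess returns. The sketch below shows how a strategy can earn a higher average return and still have a lower Sharpe ratio. The return series are invented for illustration and are not the paper's data; the paper's exact annualization and risk-free rate are not stated in the abstract.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of
    excess returns (no annualization applied)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


# Hypothetical yearly returns for two strategies.
steady = [0.10, 0.12, 0.08, 0.10]      # lower average, low volatility
volatile = [0.40, -0.20, 0.35, -0.07]  # higher average, high volatility

steady_sharpe = sharpe_ratio(steady)
volatile_sharpe = sharpe_ratio(volatile)
```

Here the volatile series has the higher mean return but the lower Sharpe ratio, mirroring the abstract's point that outperforming on raw returns can come with greater risk.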

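[NLP-49] Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

[Quick Read]: This paper addresses the efficiency of state restoration for large language models (LLMs) in multi-turn conversations, where the main challenge is the high overhead of recomputing or loading the full key-value (KV) cache for all historical tokens. The key to the solution is to select compression strategies dynamically based on attention similarity across layer pairs and to restore the KV cache with a recomputation-loading pipeline. The method introduces three innovations: a preemptive compression strategy selector, a token-wise heterogeneous attention similarity estimator, and a bubble-free restoration scheduler. Together they make KV cache restoration and storage more efficient while preserving generation quality.

Link: https://arxiv.org/abs/2507.08045
Authors: Junyi Wen,Junyuan Liang,Zicong Hong,Wuhui Chen,Zibin Zheng
Affiliations: Sun Yat-sen University; Hong Kong University of Science and Technology; Peng Cheng Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: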

Abstract:Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator to mitigate the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler to reduce potential bubbles brought by the imbalance of recomputing and loading streams due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
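The contrast between a fixed and a conversation-specific choice of layer pairs can be sketched as follows. This is only an illustration of the general idea of picking adjacent layers whose attention patterns are similar; the cosine measure, the 0.95 threshold, the greedy pairing rule, and the toy attention vectors are all assumptions, and Krul's actual selector, estimator, and scheduler are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_shared_pairs(layer_attention, threshold=0.95):
    """Greedily pick adjacent layer pairs whose attention patterns are similar
    enough to share one KV cache; each layer joins at most one pair."""
    pairs = []
    i = 0
    while i + 1 < len(layer_attention):
        if cosine(layer_attention[i], layer_attention[i + 1]) >= threshold:
            pairs.append((i, i + 1))
            i += 2
        else:
            i += 1
    return pairs


# Per-layer attention summaries (toy values) for two different conversations.
conversation_a = [[0.7, 0.2, 0.1], [0.68, 0.22, 0.10], [0.1, 0.1, 0.8], [0.12, 0.08, 0.80]]
conversation_b = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8], [0.12, 0.08, 0.80]]

pairs_a = select_shared_pairs(conversation_a)
pairs_b = select_shared_pairs(conversation_b)
```

The two conversations yield different pairings, which is the kind of variability a single static compression scheme would miss.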

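[NLP-50] AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

[Quick Read]: This paper addresses how to design ablation experiments effectively in empirical AI research. The key to the solution is AblationBench, a benchmark suite for evaluating agents on ablation planning tasks. It contains two tasks, AuthorAblation and ReviewerAblation, and provides judges based on language models (LMs) as an automatic evaluation framework.

Link: https://arxiv.org/abs/2507.08038
Authors: Talor Abramovich,Gal Chechik
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: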

Abstract:Autonomous agents built on language models (LMs) are showing increasing popularity in many fields, including scientific research. AI co-scientists aim to support or automate parts of the research process using these agents. A key component of empirical AI research is the design of ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 29% of the original ablations on average. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms the currently existing agent-based approach.

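[NLP-51] CRISP: Complex Reasoning with Interpretable Step-based Plans

[Quick Read]: This paper addresses the insufficient reasoning ability of current Large Language Models (LLMs) on complex problems, particularly in mathematical reasoning and code generation. The key to the solution is CRISP (Complex Reasoning with Interpretable Step-based Plans), a dataset of rigorously validated high-level plans. A small model fine-tuned on CRISP generates higher-quality plans than much larger models that rely on few-shot prompting, significantly outperforms Chain-of-Thought (CoT) reasoning, and generalizes across domains.

Link: https://arxiv.org/abs/2507.08037
Authors: Matan Vetzler,Koren Lazar,Guy Uziel,Eran Hirsch,Ateret Anaby-Tavor,Leshem Choshen
Affiliations: IBM Research, Israel; IBM Research, MIT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: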

Abstract:Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated–both intrinsically, using an LLM as a judge, and extrinsically, by evaluating their impact on downstream task performance. We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.

[NLP-52] Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians Insights

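【Quick Read】: This paper addresses the limited adoption of medical visual question answering (MedVQA) in clinical workflows. The core challenges are that current models and datasets lack support for multi-view, multi-resolution imaging, electronic health record (EHR) integration, and domain knowledge, and that evaluation metrics do not match clinical needs. The key to the solution is improving the clinical relevance of MedVQA systems: incorporating patient history and domain knowledge, strengthening multimodal analysis, sharpening the focus on specific anatomical regions, and developing dialogue-based systems that better fit clinical interaction.

Link: https://arxiv.org/abs/2507.08036
Authors: Deepali Mishra, Chaklam Silpasuwanchai, Ashutosh Modi, Madhumita Sushil, Sorayouth Chumnanvej
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 5 figures (1 in supplementary), 3 tables (1 in main text, 2 in supplementary). Scoping review and clinician survey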

Abstract:Medical Visual Question Answering (MedVQA) is a promising tool to assist radiologists by automating medical image interpretation through question answering. Despite advances in models and datasets, MedVQA’s integration into clinical workflows remains limited. This study systematically reviews 68 publications (2018-2024) and surveys 50 clinicians from India and Thailand to examine MedVQA’s practical utility, challenges, and gaps. Following the Arksey and O’Malley scoping review framework, we used a two-pronged approach: (1) reviewing studies to identify key concepts, advancements, and research gaps in radiology workflows, and (2) surveying clinicians to capture their perspectives on MedVQA’s clinical relevance. Our review reveals that nearly 60% of QA pairs are non-diagnostic and lack clinical relevance. Most datasets and models do not support multi-view, multi-resolution imaging, EHR integration, or domain knowledge, features essential for clinical diagnosis. Furthermore, there is a clear mismatch between current evaluation metrics and clinical needs. The clinician survey confirms this disconnect: only 29.8% consider MedVQA systems highly useful. Key concerns include the absence of patient history or domain knowledge (87.2%), preference for manually curated datasets (51.1%), and the need for multi-view image support (78.7%). Additionally, 66% favor models focused on specific anatomical regions, and 89.4% prefer dialogue-based interactive systems. While MedVQA shows strong potential, challenges such as limited multimodal analysis, lack of patient context, and misaligned evaluation approaches must be addressed for effective clinical integration.

[NLP-53] Integrating External Tools with Large Language Models to Improve Accuracy

【Quick Read】: This paper addresses the tendency of Large Language Models (LLMs) to give low-quality responses or hallucinate when they lack relevant contextual information. The key to its solution is a framework that integrates external tools to improve how LLMs answer questions in educational settings. The framework can call external APIs to obtain additional relevant information and provides computational capabilities such as calculators or calendars.

Link: https://arxiv.org/abs/2507.08034
Authors: Nripesh Niketan, Hadj Batatia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 2 tables. Extended version of paper published in Proceedings of International Conference on Information Technology and Applications, Springer Nature Singapore, 2025, pp. 409-421. This version includes additional experimental results comparing against GPT-4o, LLaMA-Large, Mistral-Large, and Phi-Large, expanded evaluation methodology, and enhanced analysis

Abstract:This paper deals with improving querying large language models (LLMs). It is well-known that without relevant contextual information, LLMs can provide poor quality responses or tend to hallucinate. Several initiatives have proposed integrating LLMs with external tools to provide them with up-to-date data to improve accuracy. In this paper, we propose a framework to integrate external tools to enhance the capabilities of LLMs in answering queries in educational settings. Precisely, we develop a framework that allows accessing external APIs to request additional relevant information. Integrated tools can also provide computational capabilities such as calculators or calendars. The proposed framework has been evaluated using datasets from the Multi-Modal Language Understanding (MMLU) collection. The data consists of questions on mathematical and scientific reasoning. Results compared to state-of-the-art language models show that the proposed approach significantly improves performance. Our Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5, with the best baseline model (LLaMA-Large) achieving only 67% and 79% respectively. These promising results open the way to creating complex computing ecosystems around LLMs to make their use more natural to support various tasks and activities.
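As a rough illustration of the kind of tool integration the abstract describes, here is a minimal sketch of a tool registry with a calculator and a calendar helper. It is not the paper's Athena implementation: the names `calculator`, `days_between`, `TOOLS`, and `run_tool` are hypothetical, and how a model decides which tool to call is left out.

```python
import ast
import operator
from datetime import date

# Arithmetic operators the calculator tool accepts (illustrative subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression by walking its AST (no eval())."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def days_between(start_iso: str, end_iso: str) -> int:
    """Calendar helper: number of days from start to end (ISO dates)."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


# Hypothetical registry mapping tool names to callables.
TOOLS = {"calculator": calculator, "days_between": days_between}


def run_tool(name: str, *args):
    """Dispatch a tool call by name, as a model-side router might."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*args)
```

A router would call, for example, `run_tool("calculator", "2 * (3 + 4)")` and paste the result back into the model's context before it answers.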

[NLP-54] Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding

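【Quick Read】: This paper asks how the inherent understanding capabilities of Small Language Models (SLMs) compare with those of Large Language Models (LLMs) in mental health understanding. Its key approach is a systematic evaluation across multiple classification tasks under zero-shot and few-shot paradigms, comparing SLMs and LLMs to reveal their relative strengths and limitations in this critical domain. The results show that SLMs, despite having far fewer parameters, approach LLM performance on binary classification tasks and improve markedly with few-shot prompting, which points to their potential for analyzing sensitive online text in privacy-preserving settings.

Link: https://arxiv.org/abs/2507.08031
Authors: Hong Jia, Shiya Fu, Vassilis Kostakos, Feng Xia, Ting Dang
Affiliations: University of Melbourne; University of Auckland; RMIT University
Categories: Computation and Language (cs.CL)
Comments: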

Abstract:The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data. In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.

[NLP-55] A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models

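【Quick Read】: This paper addresses the safety risks that arise when generative AI produces potentially inaccurate output while interpreting medical images and answering clinical questions. The key safeguard it examines is the medical disclaimer, which reminds users that AI output has not been professionally vetted and is no substitute for professional medical advice. The study evaluates the presence of disclaimers in the outputs of successive generations of Large Language Models (LLMs) and Vision-Language Models (VLMs) from 2022 to 2025, finds that disclaimer rates have dropped sharply, and argues that as public models grow more capable, disclaimers should be strengthened and adapted to the specific clinical context.

Link: https://arxiv.org/abs/2507.08030
Authors: Sonali Sharma, Ahmed M. Alaa, Roxana Daneshjou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 5 figures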

Abstract:Generative AI models, including large language models (LLMs) and vision-language models (VLMs), are increasingly used to interpret medical images and answer clinical questions. Their responses often include inaccuracies; therefore, safety measures like medical disclaimers are critical to remind users that AI outputs are not professionally vetted or a substitute for medical advice. This study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025. Using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions, outputs were screened for disclaimer phrases. Medical disclaimer presence in LLM and VLM outputs dropped from 26.3% in 2022 to 0.97% in 2025, and from 19.6% in 2023 to 1.05% in 2025, respectively. By 2025, the majority of models displayed no disclaimers. As public models become more capable and authoritative, disclaimers must be implemented as a safeguard adapting to the clinical context of each output.
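The abstract says outputs were "screened for disclaimer phrases" without listing the phrases. The sketch below shows one plausible way to do such screening with regular expressions; the patterns and the function names `has_disclaimer` and `disclaimer_rate` are assumptions for illustration, not the study's actual criteria.

```python
import re

# Hypothetical disclaimer patterns; the study's real phrase list is not given.
DISCLAIMER_PATTERNS = [
    r"\bnot (a )?(substitute|replacement) for (professional )?medical advice\b",
    r"\bconsult (a|your) (doctor|physician|healthcare (provider|professional))\b",
    r"\bI am not a (doctor|medical professional)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in DISCLAIMER_PATTERNS]


def has_disclaimer(text: str) -> bool:
    """Return True if any disclaimer pattern occurs in the model output."""
    return any(rx.search(text) for rx in _COMPILED)


def disclaimer_rate(outputs: list[str]) -> float:
    """Fraction of outputs that contain at least one disclaimer phrase."""
    if not outputs:
        return 0.0
    return sum(has_disclaimer(o) for o in outputs) / len(outputs)
```

Running `disclaimer_rate` separately over each model generation's outputs would give the kind of year-over-year percentages the abstract reports, under whatever phrase list is chosen.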

[NLP-56] Better Together: Quantifying the Benefits of AI-Assisted Recruitment

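【Quick Read】: This paper examines the practical impact of Artificial Intelligence (AI) on the recruitment process, in particular on hiring efficiency and candidate selection. The study is a controlled experiment that randomly assigns 37,000 applicants for a junior-developer position to either a traditional recruitment process or an AI-assisted one, where the AI-assisted pipeline begins with an AI-driven structured video interview. The key to the approach is using AI for initial screening followed by human evaluation, then measuring the effect on final hiring outcomes and on candidates' subsequent employment.

Link: https://arxiv.org/abs/2507.08029
Authors: Ada Aka, Emil Palikot, Ali Ansari, Nima Yazdani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: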

Abstract:Artificial intelligence (AI) is increasingly used in recruitment, yet empirical evidence quantifying its impact on hiring efficiency and candidate selection remains limited. We randomly assign 37,000 applicants for a junior-developer position to either a traditional recruitment process (resume screening followed by human selection) or an AI-assisted recruitment pipeline incorporating an initial AI-driven structured video interview before human evaluation. Candidates advancing from either track faced the same final-stage human interview, with interviewers blind to the earlier selection method. In the AI-assisted pipeline, 54% of candidates passed the final interview compared with 34% from the traditional pipeline, yielding an average treatment effect of 20 percentage points (SE 12 pp.). Five months later, we collected LinkedIn profiles of top applicants from both groups and found that 18% (SE 1.1%) of applicants from the traditional track found new jobs compared with 23% (SE 2.3%) from the AI group, resulting in a 5.9 pp. (SE 2.6 pp.) difference in the probability of finding new employment between groups. The AI system tended to select younger applicants with less experience and fewer advanced credentials. We analyze AI-generated interview transcripts to examine the selection criteria and conversational dynamics. Our findings contribute to understanding how AI technologies affect decision making in recruitment and talent acquisition while highlighting some of their potential implications.
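The employment figures in the abstract can be sanity-checked with a short calculation. Assuming the two groups are independent (plausible under random assignment, but an assumption here), the standard error of a difference is the square root of the sum of the squared group standard errors. The variable names below are illustrative.

```python
import math

# Figures quoted in the abstract, in percentage points.
se_traditional = 1.1   # SE of the 18% re-employment rate (traditional track)
se_ai = 2.3            # SE of the 23% re-employment rate (AI track)
reported_diff = 5.9    # reported difference between the two groups

# SE of a difference of two independent estimates: sqrt(se1^2 + se2^2).
se_diff = math.hypot(se_traditional, se_ai)

# Ratio of the reported difference to that standard error (a z-like statistic).
z = reported_diff / se_diff

print(f"SE of difference: {se_diff:.2f} pp, z: {z:.2f}")
```

The computed standard error is roughly 2.55 pp, in line with the reported 2.6 pp once rounding of the published inputs is taken into account.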

[NLP-57] “Amazing They All Lean Left” – Analyzing the Political Temperaments of Current LLMs

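【Quick Read】: This paper addresses two open questions: why mainstream commercial Large Language Models (LLMs) show a consistent liberal orientation in their ethical and political responses, and what that orientation implies. The key to its approach is a multi-pronged analysis of seven prominent LLMs using Moral Foundations Theory, several established political ideology scales, and a new index of current political controversies. The analysis shows that the models prioritize liberal-leaning values and attributes this to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse, and safety-driven fine-tuning practices.

Link: https://arxiv.org/abs/2507.08027
Authors: W. Russell Neuman, Chad Coleman, Ali Dasdan, Safinah Ali, Manan Shah, Kund Meghani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: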

Abstract:Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically investigates the political temperament of seven prominent LLMs (OpenAI’s GPT-4o, Anthropic’s Claude Sonnet 4, Perplexity (Sonar Large), Google’s Gemini 2.5 Flash, Meta AI’s Llama 4, Mistral 7b Le Chat and High-Flyer’s DeepSeek R1) using a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales and a new index of current political controversies. We find strong and consistent prioritization of liberal-leaning values, particularly care and fairness, across most models. Further analysis attributes this trend to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse and safety-driven fine-tuning practices. We also distinguish between political “bias” and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine-tuned model pairs reveals that fine-tuning generally increases liberal lean, an effect confirmed through both self-report and empirical testing. We argue that this “liberal tilt” is not a programming error or the personal preference of programmers but an emergent property of training on democratic rights-focused discourse. Finally, we propose that LLMs may indirectly echo John Rawls’ famous veil-of-ignorance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective reasoning.

[NLP-58] Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

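【Quick Read】: This paper studies demonstration configuration strategies for In-Context Learning (ICL) in Large Multimodal Models (LMMs) and their effect on model performance, and explores the relationship between a model's internal attention mechanisms and its inference behavior. The key to the solution is combining external experiments on demonstration configuration (shot number, image retrieval, and caption assignment) with internal analysis, such as attention-based metrics that quantify model behavior, giving a dual perspective on multimodal ICL.

Link: https://arxiv.org/abs/2507.08021
Authors: Li Li, Yongliang Wu, Jingze Zhu, Jiawei Peng, Jianfei Cai, Xu Yang
Affiliations: Southeast University; Monash University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures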

Abstract:The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angles of pre-training data features. Our study reveals both how ICEs configuration strategies impact model performance through external experiments and characteristic typical patterns through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.

[NLP-59] Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

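【Quick Read】: This paper studies embedding space poisoning attacks on the safety alignment of Large Language Models (LLMs), in which an adversary manipulates the internal semantic representations of input data to bypass safety safeguards. The key contribution is ETTA (Embedding Transformation Toxicity Attenuation), a framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations, bypassing model refusal behavior without fine-tuning or access to training data while preserving linguistic coherence.

Link: https://arxiv.org/abs/2507.08020
Authors: Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: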

Abstract:Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

[NLP-60] Signal or Noise? Evaluating Large Language Models in Resume Screening Across Contextual Variations and Human Expert Benchmarks

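【Quick Read】: This paper asks whether Large Language Models (LLMs) behave consistently (signal) or vary randomly (noise) when screening resumes, and how their performance compares with that of human recruitment experts. The key to the study is testing three LLMs (Claude, GPT, and Gemini) on controlled datasets across contexts (no company, multinational company, startup, and reduced context) and comparing them against three human recruitment experts to analyze the models' adaptability and consistency in hiring evaluation. The results show that LLMs exhibit interpretable patterns under specific conditions but that their judgments differ significantly from those of human experts.

Link: https://arxiv.org/abs/2507.08019
Authors: Aryan Varshney, Venkat Ram Reddy Ganuthula
Affiliations: Unknown
Categories: Computation and Language (cs.CL); General Economics (econ.GN)
Comments: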

Abstract:This study investigates whether large language models (LLMs) exhibit consistent behavior (signal) or random variation (noise) when screening resumes against job descriptions, and how their performance compares to human experts. Using controlled datasets, we tested three LLMs (Claude, GPT, and Gemini) across contexts (No Company, Firm1 [MNC], Firm2 [Startup], Reduced Context) with identical and randomized resumes, benchmarked against three human recruitment experts. Analysis of variance revealed significant mean differences in four of eight LLM-only conditions and consistently significant differences between LLM and human evaluations (p < 0.01). Paired t-tests showed GPT adapts strongly to company context (p < 0.001), Gemini partially (p = 0.038 for Firm1), and Claude minimally (p > 0.1), while all LLMs differed significantly from human experts across contexts. Meta-cognition analysis highlighted adaptive weighting patterns that differ markedly from human evaluation approaches. Findings suggest LLMs offer interpretable patterns with detailed prompts but diverge substantially from human judgment, informing their deployment in automated hiring systems.

[NLP-61] Review Remask Refine (R3): Process-Guided Block Diffusion for Text Generation ICML2025

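【Quick Read】: This paper addresses a key problem in iterative text generation: how to make a model efficiently identify and correct its own errors. The key to its solution is the Review, Remask, Refine (R3) framework, which requires no additional model training and can be applied to any pre-trained masked text diffusion model. R3 uses a Process Reward Model (PRM) to evaluate intermediate generated blocks and adjusts the remasking strategy according to the PRM scores: lower-scoring blocks have more of their tokens remasked, which steers the model toward refining those likely-erroneous regions and improves the final output.

Link: https://arxiv.org/abs/2507.08018
Authors: Nikita Mounier, Parsa Idehpour
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at Methods and Opportunities at Small Scale (MOSS), ICML 2025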

Abstract:A key challenge for iterative text generation is enabling models to efficiently identify and correct their own errors. We propose Review, Remask, Refine (R3), a relatively simple yet elegant framework that requires no additional model training and can be applied to any pre-trained masked text diffusion model (e.g., LLaDA or BD3-LM). In R3, a Process Reward Model (PRM) is utilized for the Review of intermediate generated blocks. The framework then translates these PRM scores into a Remask strategy: the lower a block’s PRM score, indicating potential mistakes, the greater the proportion of tokens within that block are remasked. Finally, the model is compelled to Refine these targeted segments, focusing its efforts more intensively on specific sub-optimal parts of past generations, leading to improved final output.
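The Remask step can be sketched in a few lines. The abstract only states that a lower PRM score means a larger proportion of remasked tokens; the linear mapping, the `max_ratio` cap, the `[MASK]` placeholder, and the random choice of positions below are all assumptions for illustration, not the authors' exact rule.

```python
import random

MASK = "[MASK]"  # placeholder; the real mask token depends on the diffusion model


def remask_ratio(prm_score: float, max_ratio: float = 0.9) -> float:
    """Map a PRM score in [0, 1] to a remask proportion (assumed linear)."""
    score = min(max(prm_score, 0.0), 1.0)  # clamp out-of-range scores
    return max_ratio * (1.0 - score)


def remask_block(tokens, prm_score, rng, max_ratio=0.9):
    """Replace a score-dependent fraction of a block's tokens with MASK."""
    n_masked = round(remask_ratio(prm_score, max_ratio) * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]
```

With a ten-token block, a score of 0.1 remasks eight tokens and a score of 0.9 remasks one, so the Refine pass spends most of its effort on the block the PRM trusted least.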

[NLP-62] Mechanistic Indicators of Understanding in Large Language Models

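【Quick Read】: This paper addresses the question of whether Large Language Models (LLMs) rely only on superficial statistics, a view that neglects the models' inner workings. The key to its solution is a novel theoretical framework for thinking about machine understanding, built around a three-tiered conception: conceptual understanding, state-of-the-world understanding, and principled understanding. The framework describes how LLMs form connections between pieces of information and track them dynamically at each level, challenging the traditional view and arguing that the internal structures of LLMs are functionally analogous to human insight.

Link: https://arxiv.org/abs/2507.08017
Authors: Pierre Beckmann, Matthieu Queloz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages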

Abstract:Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. Here, we offer an accessible synthesis of these findings that doubles as an introduction to MI, all while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of machine understanding. First, conceptual understanding emerges when a model forms “features” as directions in latent space, thereby learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a “circuit” that connects these facts. However, we conclude by exploring the “parallel mechanisms” phenomenon, arguing that while LLMs exhibit forms of understanding, their cognitive architecture remains different from ours, and the debate should shift from whether LLMs understand to how their strange minds work.

[NLP-63] Assessing the Capabilities and Limitations of FinGPT Model in Financial NLP Applications

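【Quick Read】: This paper addresses the evaluation and optimization of language models in the financial domain. It evaluates FinGPT on six key natural language processing tasks to reveal its capabilities and limitations in real-world financial applications. The key to the approach is systematic testing on finance-specific datasets, which provides a benchmark and a direction for architectural improvements and domain adaptation in future financial language models.

Link: https://arxiv.org/abs/2507.08015
Authors: Prudence Djagba, Chimezie A. Odinakachukwu
Affiliations: Michigan State University; AIMS Senegal
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: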

Abstract:This work evaluates FinGPT, a financial domain-specific language model, across six key natural language processing (NLP) tasks: Sentiment Analysis, Text Classification, Named Entity Recognition, Financial Question Answering, Text Summarization, and Stock Movement Prediction. The evaluation uses finance-specific datasets to assess FinGPT’s capabilities and limitations in real-world financial applications. The results show that FinGPT performs strongly in classification tasks such as sentiment analysis and headline categorization, often achieving results comparable to GPT-4. However, its performance is significantly lower in tasks that involve reasoning and generation, such as financial question answering and summarization. Comparisons with GPT-4 and human benchmarks highlight notable performance gaps, particularly in numerical accuracy and complex reasoning. Overall, the findings indicate that while FinGPT is effective for certain structured financial tasks, it is not yet a comprehensive solution. This research provides a useful benchmark for future research and underscores the need for architectural improvements and domain-specific optimization in financial language models.

[NLP-64] Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

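【Quick Read】: This paper addresses the limited understanding of the complexity and evolution of jailbreaking strategies in generative AI safety. It assesses how jailbreak attempts differ in complexity from normal conversations and what that implies for AI safety. The key to the approach is a large-scale empirical analysis of more than 2 million real-world conversations from different platforms using several complexity metrics (probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators). The analysis shows that jailbreak attempts are not significantly more complex than normal conversations and appear to be naturally bounded, which offers a new perspective on how AI safety mechanisms evolve.

Link: https://arxiv.org/abs/2507.08014
Authors: Aldan Creo, Raul Castro Fernandez, Manuel Cebrian
Affiliations: Valencian Research Institute for Artificial Intelligence (VRAIN); Universitat Politècnica de València (Valencia Polytechnic University); The University of Chicago; Department of Computer Science; Center for Automation and Robotics; Spanish National Research Council
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Code: this https URL Results: this https URL Visualizer: this https URL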

Abstract:As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
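Two of the metric families named in the abstract, compression ratios and lexical diversity, are easy to illustrate. The sketch below uses zlib and a whitespace-tokenized type-token ratio; these specific definitions and the function names are assumptions, since the abstract does not say how the paper computes its metrics.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def type_token_ratio(text: str) -> float:
    """Lexical diversity: distinct whitespace tokens divided by total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

Highly repetitive prompts score low on both measures, while varied prose scores higher, which is the sense in which such metrics can compare jailbreak attempts with ordinary conversations.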

[NLP-65] MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

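【Quick Read】: This paper addresses the challenges of natural language processing (NLP) on biomedical literature, in particular understanding domain-specific terminology. Traditional models such as Word2Vec and Bi-LSTM cannot fully handle the complexity of biomedical text, and GPT and T5, although they capture context, fall short on tasks that need bidirectional understanding. The proposed solution is MedicalBERT, a model based on the pretrained BERT architecture, trained on a large biomedical dataset and equipped with a domain-specific vocabulary that improves its grasp of biomedical terminology. MedicalBERT is further optimized and fine-tuned for tasks such as named entity recognition, relation extraction, question answering, sentence similarity, and document classification, with significant performance gains.

Link: https://arxiv.org/abs/2507.08013
Authors: K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: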

Abstract:Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can’t fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.
Related DOI: https://doi.org/10.11591/ijai.v14.i3.pp2367-2378

[NLP-66] RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning

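【Quick Read】: This paper addresses two problems of prompt-based text-to-speech models: limited control and excessive flexibility. Control is limited to the acoustic features the model was exposed to during training, while the same input prompt can produce uncontrollable variation in the output. The key to the solution is exploiting that uncontrollable variance: principal component analysis identifies the latent features that explain the largest proportion of the output variance, and these become new labels for a secondary fine-tuning stage, improving the model's overall controllability.

Link: https://arxiv.org/abs/2507.08012
Authors: Atli Sigurgeirsson, Simon King
Affiliations: The Centre for Speech Technology Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: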

Abstract:A Prompt-based Text-To-Speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Although user-friendly, such approaches are on one hand constrained: control is limited to acoustic features exposed to the model during training, and too flexible on the other: the same inputs yields uncontrollable variation that are reflected in the corpus statistics. We investigate a novel fine-tuning regime to address both of these issues at the same time by exploiting the uncontrollable variance of the model. Through principal component analysis of thousands of synthesised samples, we determine latent features that account for the highest proportion of the output variance and incorporate them as new labels for secondary fine-tuning. We evaluate the proposed methods on two models trained on an expressive Icelandic speech corpus, one with emotional disclosure and one without. In the case of the model without emotional disclosure, the method yields both continuous and discrete features that improve overall controllability of the model. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS) Cite as: arXiv:2507.08012 [cs.CL] (or arXiv:2507.08012v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2507.08012 Focus to learn more arXiv-issued DOI via DataCite

[NLP-67] UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations ACL2025

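[Quick read]: This paper addresses the problems caused by separating dense retrieval and response generation in existing conversational search systems. The separation prevents the system from leveraging the models' intrinsic knowledge simultaneously, which limits how much retrieval can benefit generation. The key to the solution is joint fine-tuning with different objectives, together with two mechanisms that reduce inconsistency risks and mitigate data discrepancy, thereby unifying dense retrieval and response generation for large language models in conversation.

Link: https://arxiv.org/abs/2507.07030
Authors: Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
Affiliations: University of Montreal; Amazon.com; University of Amsterdam; Renmin University; University of Notre Dame
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted by ACL 2025 (main)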

Abstract:The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.

Computer Vision

[CV-0] Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

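[Quick read]: This paper aims to solve the problems that autoregressive video generators diverge from standard large language model (LLM) architectures, depend on external text encoders, or suffer from high latency. The key to the solution is to retain the LLM architecture with minimal modifications: MM-RoPE strengthens the modeling of spatiotemporal correlations, while a token dependency strategy and Autoregressive Discrete Diffusion Forcing (AR-DF) address frame-wise loss imbalance, improving the quality and efficiency of the generated video.

Link: https://arxiv.org/abs/2507.08801
Authors: Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Code and Models: this https URL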

Abstract:Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at this https URL.
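
The token dependency strategy in the abstract (bidirectional within a frame, causal across frames) can be pictured as an attention mask. The sketch below is illustrative only and is not the Lumos-1 implementation: `block_causal_mask`, the frame count and the tokens-per-frame value are hypothetical, and neither text tokens nor the AR-DF masking are modeled.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are laid out frame by frame. A token sees every token in its own
    frame (bidirectional) and every token in earlier frames (causal).
    """
    total = num_frames * tokens_per_frame
    frame_of = [idx // tokens_per_frame for idx in range(total)]
    return [[frame_of[j] <= frame_of[i] for j in range(total)]
            for i in range(total)]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each printed row is one query token; the block-lower-triangular pattern shows that later frames see earlier ones but not the reverse.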

[CV-1] CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

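[Quick read]: This paper aims to address the shortcomings of conventional neural rendering methods in compute efficiency and flexibility of scene representation, particularly the storage and rendering-speed problems that arise with large-scale scene data. The key to the solution is a representation based on "Compressed Light-Field Tokens (CLiFTs)": a multi-view encoder tokenizes the images, latent-space K-means selects key rays as cluster centroids, and a multi-view "condenser" compresses the information of all tokens into these centroid tokens, giving efficient data compression and flexible view synthesis. With limited compute, the method can trade off data size, rendering quality, and rendering speed by adjusting the number of tokens.

Link: https://arxiv.org/abs/2507.08776
Authors: Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa
Affiliations: Simon Fraser University; Wayve
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL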

Abstract: This paper proposes a neural rendering approach that represents a scene as “compressed light-field tokens (CLiFTs)”, retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering with compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view “condenser” compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs of data size, rendering quality, and rendering speed.
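
The centroid-selection step can be sketched with a tiny k-means over token vectors. This is a rough illustration under stated assumptions, not the CLiFT code: the toy 2-D tokens, the deterministic initialisation from the first `k` tokens and the rule of picking the token nearest each centroid are all hypothetical, and the "condenser" that merges information into the centroid tokens is not modeled.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans_select(tokens, k, iterations=10):
    """Run k-means on token vectors and return, for each final centroid,
    the index of the token closest to it."""
    centroids = [list(t) for t in tokens[:k]]  # deterministic init (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for t in tokens:
            nearest = min(range(k), key=lambda c: sq_dist(t, centroids[c]))
            clusters[nearest].append(t)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [min(range(len(tokens)), key=lambda i: sq_dist(tokens[i], c))
            for c in centroids]


tokens = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
          [5.1, 4.9], [0.1, 0.3], [4.8, 5.2]]
selected = kmeans_select(tokens, k=2)
print(selected)
```

Changing `k` changes how many tokens represent the scene, which is the knob the abstract describes as the compute budget.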

[CV-2] From One to More: Contextual Part Latents for 3D Generation

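[Quick read]: This paper aims to solve three problems in 3D generation: single-latent representations cannot capture complex multi-part geometry, holistic latent coding neglects part independence and interrelationships, and global conditioning mechanisms lack fine-grained control. The key to the solution is CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation, which reduces encoding complexity, models part relationships explicitly, and supports part-level conditioning.

Link: https://arxiv.org/abs/2507.08772
Authors: Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu
Affiliations: HKUST; CUHK; SenseTime Research; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL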

Abstract:Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart - a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse - a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart’s superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.

[CV-3] A Hybrid Multi-Well Hopfield-CNN with Feature Extraction and K-Means for MNIST Classification

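[Quick read]: This paper addresses intra-class variability in handwritten digit classification, aiming to make the model more robust to different handwriting styles. The key to the solution is to combine a convolutional neural network (CNN) with a multi-well Hopfield network: the CNN extracts high-dimensional features, k-means clustering maps the features to class-specific prototypes, and these prototypes act as attractors in a multi-well energy landscape, where classification is performed by minimizing an energy function. This keeps accuracy high while providing an interpretable decision process.

Link: https://arxiv.org/abs/2507.08766
Authors: Ahmed Farooq
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: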

Abstract: This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class [...]. The model’s design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
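
The idea of prototypes acting as wells can be sketched as a lowest-energy lookup. This is a simplified illustration, not the paper's model: the squared-distance `energy` is an assumed stand-in (the abstract does not give the energy function, which also includes a class-related term), and the 2-D features and prototypes are hypothetical rather than CNN outputs.

```python
def energy(feature, prototype):
    """Squared Euclidean distance, used here as a stand-in well energy."""
    return sum((f - p) ** 2 for f, p in zip(feature, prototype))


def classify(feature, wells):
    """Return the label of the lowest-energy well.

    `wells` is a list of (label, prototype) pairs; a class may own several
    prototypes, for example one per k-means cluster of its features.
    """
    best_label, _ = min(wells, key=lambda well: energy(feature, well[1]))
    return best_label


# Hypothetical 2-D features; two wells for digit 1 model two writing styles.
wells = [
    (0, [0.0, 0.0]),
    (1, [4.0, 0.0]),
    (1, [4.0, 3.0]),
]
predicted = classify([3.5, 2.5], wells)
print(predicted)
```

Giving a class more than one well is what lets a single label cover visibly different handwriting styles.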

[CV-4] Compress Any Segment Anything Model (SAM)

[Quick read]: This paper aims to compress the Segment Anything Model (SAM) and its variants efficiently, to meet the practical need for lightweight models. The key to the solution is Birkhoff, a novel data-free compression algorithm. Its core is Hyper-Compression, which finds a dense trajectory that turns a high-dimensional parameter vector into a low-dimensional scalar; it also introduces a dedicated linear layer operator, HyperLinear, that fuses decompression with matrix multiplication and significantly speeds up inference of the compressed model.

Link: https://arxiv.org/abs/2507.08765
Authors: Juntong Fan, Zhiwei Hao, Jianqiang Shen, Shang-Ling Jui, Yi Zhang, Jing-Xiao Liao, Feng-Lei Fan
Affiliations: Frontier of Artificial Networks (FAN) Lab, Department of Data Science, City University of Hong Kong, Hong Kong, China SAR; Data Storage Product Line, Huawei Technologies Co., Ltd.; Lagrange Mathematics and Computing Research Center; School of Cyber Science and Engineering, Sichuan University; Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University; Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 tables, 8 figures

Abstract:Due to the excellent performance in yielding high-quality, zero-shot segmentation, Segment Anything Model (SAM) and its variants have been widely applied in diverse scenarios such as healthcare and intelligent manufacturing. Therefore, effectively compressing SAMs has become an increasingly pressing practical need. In this study, we propose Birkhoff, a novel data-free compression algorithm for SAM and its variants. Unlike quantization, pruning, distillation, and other compression methods, Birkhoff embodies versatility across model types, agility in deployment, faithfulness to the original model, and compactness in model size. Specifically, Birkhoff introduces a novel compression algorithm: Hyper-Compression, whose core principle is to find a dense trajectory to turn a high-dimensional parameter vector into a low-dimensional scalar. Furthermore, Birkhoff designs a dedicated linear layer operator, HyperLinear, to fuse decompression and matrix multiplication to significantly accelerate inference of the compressed SAMs. Extensive experiments on 18 SAMs in the COCO, LVIS, and SA-1B datasets show that Birkhoff performs consistently and competitively in compression time, compression ratio, post-compression performance, and inference speed. For example, Birkhoff can achieve a compression ratio of 5.17x on SAM2-B, with less than 1% performance drop without using any fine-tuning data. Moreover, the compression is finished within 60 seconds for all models.

[CV-5] Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection

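[Quick read]: This paper addresses the limitations of dynamic roadway geometry sensing in digital twins (DT) of transportation systems, and the privacy, communication, and computational efficiency problems of collecting and analyzing multi-source data. The key to the solution is the Geo-ORBIT framework, which combines real-time lane detection, DT synchronization, and federated meta-learning. Its core component, GeoLane, learns lane geometry from vehicle trajectory data captured by roadside cameras; Meta-GeoLane personalizes detection parameters for local entities, and FedMeta-GeoLane provides scalable, privacy-preserving adaptation across roadside deployments.

Link: https://arxiv.org/abs/2507.08743
Authors: Rei Tamaru, Pei Li, Bin Ran
Affiliations: University of Wisconsin-Madison
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: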

Abstract:Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at this https URL.

[CV-6] HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer

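[Quick read]: This paper aims to solve two main problems in hierarchical land cover and land use (LCLU) classification of remote sensing imagery. First, existing deep learning methods mostly adopt a flat classification paradigm and struggle to produce end-to-end multi-granularity predictions aligned with the tree-structured hierarchies used in practice. Second, cross-domain studies mostly focus on performance degradation caused by sensor or scene variations, paying little attention to transferring LCLU models to tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). The paper proposes HieraRS, whose key component is the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which integrates seamlessly into mainstream flat classification models to generate hierarchical predictions while improving semantic consistency and classification accuracy. It also proposes TransLU, a dual-branch cross-domain transfer framework comprising cross-domain knowledge sharing and cross-domain semantic alignment, which supports dynamic category expansion and helps LCLU models adapt to heterogeneous hierarchies.

Link: https://arxiv.org/abs/2507.08741
Authors: Tianlong Ai, Tianzhu Liu, Haochen Jiang, Yanfeng Gu
Affiliations: Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 11 figures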

Abstract:Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.

[CV-7] Ensemble of Weak Spectral Total Variation Learners: a PET-CT Case Study

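[Quick read]: This paper addresses the performance bottleneck caused by insufficient training data in computer vision problems. The key to the solution is an ensemble of weak learners based on spectral total-variation (STV) features. STV features are related to nonlinear eigenfunctions of the total-variation subgradient and characterize textures at multiple scales; in two dimensions they show low correlation, which matches the requirement in ensemble learning theory for weakly correlated weak learners. Using an STV-based ensemble, the study outperforms deep learning and Radiomics methods on a real medical imaging problem: the predictive value of CT data for high uptake in positron emission tomography (PET) in patients suspected of skeletal metastases.

Link: https://arxiv.org/abs/2507.08735
Authors: Anna Rosenberg, John Kennedy, Zohar Keidar, Yehoshua Y. Zeevi, Guy Gilboa
Affiliations: Technion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: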

Abstract: Solving computer vision problems through machine learning, one often encounters a lack of sufficient training data. To mitigate this we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa 2014). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize textures well at various scales. It was shown (Burger et al. 2016) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically lowly correlated. Ensemble learning theory advocates the use of lowly correlated weak learners. We thus propose here to design ensembles using learners based on STV features. To show the effectiveness of this paradigm we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) for patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared to deep-learning methods and to Radiomics features, showing STV learners perform best (AUC=0.87), compared to neural nets (AUC=0.75) and Radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative for the presence of high uptake in PET.

[CV-8] RoundaboutHD: High-Resolution Real-World Urban Environment Benchmark for Multi-Camera Vehicle Tracking

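[Quick read]: This paper addresses the shortcomings of currently public multi-camera vehicle tracking (MCVT) datasets in scene complexity, resolution, and diversity, in order to narrow the gap between academic research and real-world scenarios. The key to the solution is RoundaboutHD, a high-resolution multi-camera vehicle tracking benchmark dataset designed to represent real-world roundabout scenarios. It contains 40 minutes of labelled video, 512 unique vehicle identities, and rich cross-camera association data, and it poses harder challenges such as more frequent occlusion and nonlinear movement.

Link: https://arxiv.org/abs/2507.08729
Authors: Yuqiang Lin, Sam Lockyer, Mingxuan Sui, Li Gan, Florian Stanek, Markus Zarbock, Wenbin Li, Adrian Evans, Nic Zhang
Affiliations: University of Bath; Starwit Technologies GmbH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: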

Abstract: The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenarios. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K resolution, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporally consistent video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: this https URL

[CV-9] Learning human-to-robot handovers through 3D scene reconstruction

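[Quick read]: This paper addresses two problems: learning robot manipulation policies from real-world image data requires a large number of robot-action trials in the physical environment, and there is a visual domain gap between simulation and the robot workspace. The key to the solution is a supervised-learning method for robot handovers, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), which trains on RGB images only and needs no real-robot data collection or training. At its core, sparse-view Gaussian Splatting reconstructs human-to-robot handover scenes and generates robot demonstrations containing image-action pairs, so that simulated camera pose changes map directly to gripper pose changes.

Link: https://arxiv.org/abs/2507.08726
Authors: Yuekun Wu, Yik Lung Pang, Andrea Cavallaro, Changjae Oh
Affiliations: Queen Mary University of London; Idiap Research Institute; École Polytechnique Fédérale de Lausanne
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures, 2 tables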

Abstract: Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training using simulations offers a cost-effective alternative, the visual domain gap between simulation and robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised-based robot handovers solely from RGB images without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, the simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and directly deploy this policy in the real environment. Experiments in both Gaussian Splatting reconstructed scenes and real-world human-to-robot handovers demonstrate that H2RH-SGS serves as a new and effective representation for the human-to-robot handover task.

[CV-10] Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine

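[Quick read]: This paper addresses how to generate and synchronize multimodal data efficiently in ISAC (Integrated Sensing and Communication) research. The key to the solution is Great-X, a single-engine multimodal data twin platform that reconstructs the ray-tracing computation of Sionna and is deeply integrated with autonomous driving tools, enabling efficient, synchronized simulation of multimodal data including channel state information (CSI), RGB, radar, and LiDAR.

Link: https://arxiv.org/abs/2507.08716
Authors: Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu
Affiliations: Shanghai University, Shanghai, China; Xi’an Jiaotong-Liverpool University, Suzhou, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: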

Abstract:Scaling laws have achieved success in LLM and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset are publicly available at: this https URL.

[CV-11] SGPMIL: Sparse Gaussian Process Multiple Instance Learning

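[Quick read]: This paper aims to address the fact that, when only coarse bag-level labels are available and instance-level annotations are missing, conventional deterministic attention mechanisms overlook the inherent uncertainty when assessing instance relevance. The key to the solution is SGPMIL, a new probabilistic attention framework based on Sparse Gaussian Processes (SGP) that learns a posterior distribution over attention scores to obtain principled uncertainty estimates, improving the reliability, interpretability, and performance of instance-level predictions.

Link: https://arxiv.org/abs/2507.08711
Authors: Andreas Lolos, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz, Aris Moustakas
Affiliations: University of Athens; Archimedes/Athena Research Center; MICS, CentraleSupélec, Université Paris-Saclay; ÉTS Montréal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, 2 tables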

Abstract: Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without having access to instance-level annotations. This is usually the case in digital pathology, which consists of gigapixel-sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing SGPMIL, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code will be made publicly available.

[CV-12] L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training

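[Quick read]: This paper addresses the efficiency and effectiveness of evaluating and training models for image captioning in generative AI. The key to the solution is L-CLIPScore, a lightweight embedding-based captioning metric computed from a lightweight CLIP (L-CLIP) that is compressed and distilled from the original CLIP. Combined with a novel multi-modal Similarity Regulator (SR) loss, it preserves multi-modal alignment ability while reducing computation and running time.

Link: https://arxiv.org/abs/2507.08710
Authors: Li Li, Yingzhe Peng, Xu Yang, Ruoxi Cheng, Haiyang Xu, Ming Yan, Fei Huang
Affiliations: Southeast University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures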

Abstract: We propose a novel embedding-based captioning metric termed L-CLIPScore that can be used for efficiently evaluating caption quality and training captioning models. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), which is a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two powerful techniques, weight multiplexing and matrix decomposition, for reducing the parameters of the encoders and the word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss to transfer more vision-language alignment knowledge. Specifically, SR loss amplifies the multi-modal embedding similarity if the given image-text pair is matched and diminishes the similarity if the pair is non-matched. By compressing and distilling with this novel SR loss, our L-CLIP achieves comparable multi-modal alignment ability to the original CLIP while requiring fewer computation resources and less running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore when using it as the judge to evaluate caption quality. We also discover that when using L-CLIPScore as the supervisor to train the captioning model, it should be mixed with an n-gram-based metric, and we analyze why using L-CLIPScore alone causes training to fail.

[CV-13] An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan

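[Quick read]: This paper aims to address the high computational cost of automated muscle segmentation, its reliance on large training datasets, and its reduced accuracy on small muscles. The key to the solution is a training-free segmentation method that combines keypoint selection with Lucas-Kanade optical flow, achieving segmentation performance comparable to state-of-the-art convolutional neural network (CNN) models while substantially lowering computational demands and improving interpretability.

Link: https://arxiv.org/abs/2507.08690
Authors: Mengyuan Liu, Jeongkyu Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: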

Abstract:Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
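
For readers unfamiliar with the metric quoted above, the Dice similarity coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for two small binary masks; the masks and the function name `dice_coefficient` are made up for illustration and are not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    Both masks are lists of rows containing 0 or 1 and must share a shape.
    """
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    size_sum = sum(pred_flat) + sum(target_flat)
    if size_sum == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / size_sum


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)
print(round(score, 3))
```

Here the masks share two of their three foreground pixels, so the score is 2·2 / (3 + 3) ≈ 0.667, which falls inside the 0.6 to 0.7 range reported in the abstract.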

[CV-14] MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing

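[Quick read]: This paper aims to solve representation learning for multi-modal satellite imagery in Earth System Observation (ESO), in particular the weaknesses of conventional contrastive learning (CL) frameworks in cross-modal semantic alignment and multi-label precision under low-label, high-class-overlap conditions. The key to the solution is the MoSAiC framework, which jointly optimizes intra- and inter-modality contrastive learning and adds a multi-label supervised contrastive loss, giving finer semantic disentanglement and more robust representations.

Link: https://arxiv.org/abs/2507.08683
Authors: Debashis Gupta, Aditi Golder, Rongkhun Zhu, Kangning Cui, Wei Tang, Fan Yang, Ovidiu Csillik, Sarra Alaqahtani, V. Paul Pauca
Affiliations: Wake Forest University; Xidian University; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: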

Abstract:Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without the reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning – especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.

[CV-15] ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way

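Quick read: This paper tackles hallucination and weak spatial reasoning in the responses of Multimodal Large Language Models (MLLMs). The key to the solution is ByDeWay, a training-free framework built on Layered-Depth-Based Prompting (LDP): monocular depth estimation splits the scene into closest, mid-range, and farthest layers, a grounded vision-language model captions each region, and the resulting captions add spatial context to the image-question prompt, improving grounding and reasoning.

Link: https://arxiv.org/abs/2507.08679
Authors: Rajarshi Roy, Devleena Das, Ankesh Banerjee, Arjya Bhattacharjee, Kousik Dasgupta, Subarna Tripathi
Affiliations: Kalyani Government Engineering College, India; Intel Labs, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: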

Abstract:We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
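The layering and prompt-building steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the depth map is a plain list of lists where smaller values mean closer, the layers are cut at the one-third and two-thirds ranks, and the captions are passed in as ready-made strings (the depth estimator and captioning model are not modeled). The function names are hypothetical.

```python
LAYERS = ("closest", "mid-range", "farthest")

def split_depth_layers(depth_map):
    """Assign every pixel to one of three depth layers by rank."""
    values = sorted(v for row in depth_map for v in row)
    near_cut = values[(len(values) - 1) // 3]
    far_cut = values[2 * (len(values) - 1) // 3]

    def layer(v):
        if v <= near_cut:
            return "closest"
        return "mid-range" if v <= far_cut else "farthest"

    return [[layer(v) for v in row] for row in depth_map]

def build_depth_aware_prompt(question, captions):
    """Append one caption line per available layer to the question."""
    context = [f"[{name}] {captions[name]}" for name in LAYERS if captions.get(name)]
    return "\n".join([question] + context)
```

Because the prompt is plain text, the same string can be sent to any black-box MLLM alongside the image.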

[CV-16] Generalizable 7T T1-map Synthesis from 1.5T and 3T T1 MRI with an Efficient Transformer Model

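Quick read: This paper addresses how to synthesize high-quality 7T-grade T1 maps from routine clinical 1.5T or 3T T1-weighted (T1W) images, to work around the high cost, scarcity, and susceptibility artifacts of 7T MRI scanners. The key to the solution is 7T-Restormer, an efficient transformer-based model that, with only 10.5M parameters, outperforms existing methods such as ResShift and ResViT, substantially lowers normalized mean squared error (NMSE), and generalizes better when trained on a mixed 1.5T and 3T dataset.

Link: https://arxiv.org/abs/2507.08655
Authors: Zach Eidex, Mojtaba Safari, Tonghe Wang, Vanessa Wildman, David S. Yu, Hui Mao, Erik Middlebrooks, Aparna Kesewala, Xiaofeng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: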

Abstract: Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1-maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1W MRI paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices) where (X; Y) denotes the patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and 25.9 +/- 4.9 dB, and 0.866 +/- 0.077 for 3T inputs, respectively. Using 10.5M parameters, our model reduced NMSE by 64% relative to the 56.7M-parameter ResShift (0.019 vs 0.052, p = .001) and by 41% relative to the 70.4M-parameter ResViT (0.019 vs 0.032, p = .001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5T + 3T corpus was superior to single-field strategies. Restricting the model to 1.5T increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3) while training solely on 3T resulted in lower performance on input 1.5T T1W MRI. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.
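The reported metrics can be reproduced in spirit with a short sketch. The NMSE and PSNR definitions below are the common textbook forms and are an assumption here, since the abstract does not spell out the exact normalization used; the function names are hypothetical. Recomputing the relative NMSE reduction from the rounded values in the abstract gives roughly 63.5% against ResShift and 40.6% against ResViT, close to the reported 64% and 41% (the small gap is plausibly rounding of the underlying values).

```python
import math

def nmse(pred, target):
    """Squared error normalized by the energy of the target."""
    num = sum((p - t) ** 2 for p, t in zip(pred, target))
    den = sum(t ** 2 for t in target)
    return num / den

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, max_value]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)

def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Rounded NMSE values at 1.5T taken from the abstract.
vs_resshift = relative_reduction(0.019, 0.052)
vs_resvit = relative_reduction(0.019, 0.032)
```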

[CV-17] DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Quick read: This paper addresses the reliance of conventional image dataset construction on slow, inefficient manual collection and annotation, and the lower practical value of artificially generated data. The key to the solution is DatasetAgent, an automatic dataset construction method based on a multi-agent collaborative system: it coordinates four agents equipped with Multi-modal Large Language Models (MLLMs) and an image optimization tool package to build high-quality image datasets from real-world images according to user-specified requirements.

Link: https://arxiv.org/abs/2507.08648
Authors: Haoran Sun, Haoyu Bian, Shaoning Zeng, Yunbo Rao, Xu Xu, Lin Mei, Jianping Gou
Affiliations: Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China; School of Information and Software Engineering, University of Electronic Science and Technology of China; Zhejiang Chuangjiekedong Ltd.; DongHai Laboratory; College of Computer and Information Science, Southwest University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Common knowledge indicates that the process of constructing image datasets usually depends on the time-intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are obviously more valuable compared to artificial-intelligence-generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto-constructing datasets from real-world images by a multi-agent collaborative system, named DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted, including expanding existing datasets and creating new ones from scratch, on a variety of open-source datasets. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.

[CV-18] OnlineBEV: Recurrent Temporal Fusion in Birds Eye View Representations for Multi-Camera 3D Perception

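Quick read: This paper addresses the limited performance gain that multi-view camera-based 3D perception methods obtain from temporal aggregation over many image frames, a limit caused by object motion changing bird's eye view (BEV) features over time. The key to the solution is OnlineBEV, a temporal 3D perception method whose recurrent structure increases the effective number of combined features with minimal memory usage, which uses the Motion-guided BEV Fusion Network (MBFNet) for temporal feature alignment, and which enforces that alignment explicitly with a Temporal Consistency Learning Loss.

Link: https://arxiv.org/abs/2507.08644
Authors: Junho Koh, Youngwoo Lee, Jungho Kim, Dongyoung Lee, Jun Won Choi
Affiliations: Hyundai Motors; Hanyang University; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Transactions on Intelligent Transportation Systems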

Abstract:Multi-view camera-based 3D perception can be conducted using bird’s eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion-guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state-of-the-art performance in the camera-only 3D object detection task.
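Why alignment matters in a recurrent fusion scheme can be shown with a toy example. The sketch below is not MBFNet: the BEV feature is reduced to a 1D list, the "motion feature" is a known integer cell offset per step, and fusion is a fixed weighted average. All of these are simplifying assumptions. It only illustrates that a single running state replaces a buffer of past frames, and that the state must be shifted to follow a moving object before it is fused.

```python
def shift(features, offset, fill=0.0):
    """Move each cell by `offset` positions, dropping cells that leave the grid."""
    out = [fill] * len(features)
    for i, value in enumerate(features):
        j = i + offset
        if 0 <= j < len(features):
            out[j] = value
    return out

def recurrent_fuse(frames, offsets, alpha=0.5):
    """Fuse frames one at a time, keeping only one running state in memory."""
    state = list(frames[0])
    for frame, offset in zip(frames[1:], offsets):
        aligned = shift(state, offset)  # align history to the current frame
        state = [alpha * a + (1 - alpha) * c for a, c in zip(aligned, frame)]
    return state

# An object moving one cell to the right per frame.
frames = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
aligned_result = recurrent_fuse(frames, offsets=[1, 1])
unaligned_result = recurrent_fuse(frames, offsets=[0, 0])
```

With the correct offsets the object's response stays sharp; without alignment it is smeared across three cells.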

[CV-19] Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth Certificates

Quick read: This paper addresses key-value information extraction from Uruguayan birth certificates, specifically automatic transcription of documents handwritten in Spanish. The key to the solution is fine-tuning the Document Attention Network (DAN) under two different annotation strategies, to improve extraction accuracy with limited training data and annotation effort. The results show that normalized annotation suits fields that can be standardized (such as dates and places of birth), whereas diplomatic annotation works better for fields that cannot (such as names and surnames).

Link: https://arxiv.org/abs/2507.08636
Authors: Natalia Bottaioli (Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay, Digital Sense, Montevideo, Uruguay); Solène Tarride (TEKLIA, Paris, France); Jérémy Anger (Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France); Seginus Mowlavi (Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France); Marina Gardella (IMPA, Rio de Janeiro, Brazil); Antoine Tadros (Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France); Gabriele Facciolo (Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France); Rafael Grompone von Gioi (Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France); Christopher Kermorvant (TEKLIA, Paris, France); Jean-Michel Morel (City University of Hong Kong, Hong Kong); Javier Preciozzi (Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay, Digital Sense, Montevideo, Uruguay)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but with different annotation methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which cannot be standardized.

[CV-20] Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data

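Quick read: This paper addresses how to further improve image captioning, particularly in the unsupervised setting, given that existing labelled datasets have already been used to train large Vision Language Models (VLMs). The key to the solution is LoGIC, a communication-game framework based on multi-agent reinforcement learning: a "speaker" and a "listener" agent are trained in the cooperative common-reward setting to learn a strategy for communicating in natural language, and captioning quality improves as a result.

Link: https://arxiv.org/abs/2507.08610
Authors: Parag Dutta, Ambedkar Dukkipati
Affiliations: Indian Institute of Science
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: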

Abstract: Image captioning is an important problem in developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training the large Vision Language Models (VLMs), it becomes challenging to improve their performance. Considering this, it is essential to consider the unsupervised image captioning performance, which remains relatively under-explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a Multi-agent Reinforcement Learning game. The proposed method consists of two agents, a ‘speaker’ and a ‘listener’, with the objective of learning a strategy for communicating in natural language. We train agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. We show that using pre-trained VLMs as the ‘speaker’ and a Large Language Model (LLM) for language understanding in the ‘listener’, we achieved a 46 BLEU score after fine-tuning using LoGIC without additional labels, a 2-point absolute advantage over the 44 BLEU score of the vanilla VLM. Additionally, we replace the VLM in the ‘speaker’ with lightweight components: (i) a ViT for image perception and (ii) a GPT2 for language generation, and train them from scratch using LoGIC, obtaining a 31 BLEU score in the unsupervised setting, a 10-point advantage over existing unsupervised image-captioning methods.
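The two ingredients named in the abstract, a common game reward and GRPO, can be sketched as follows. This is an illustrative reading rather than the paper's implementation: the reward is assumed to be 1 when the listener picks the target image from a candidate set and 0 otherwise, and the advantage is the usual group-relative normalization (reward minus group mean, divided by group standard deviation). The use of the population standard deviation and the function names are assumptions.

```python
import statistics

def game_reward(listener_choice, target_index):
    """Shared reward: both agents succeed only if the listener finds the target."""
    return 1.0 if listener_choice == target_index else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of captions sampled for the same image."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four captions sampled for one image; the listener picked the target for two of them.
rewards = [game_reward(choice, target_index=2) for choice in (2, 0, 1, 2)]
advantages = group_relative_advantages(rewards)
```

Captions that helped the listener get a positive advantage and the others a negative one, so no separate value network is needed to rank them.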

[CV-21] BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis

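Quick read: This paper addresses the significant degradation of Vision-Language Models (VLMs) under temporally evolving distribution shifts, which are common in real-world scenarios (for example gradual illumination or seasonal changes). Existing Continual Test-Time Adaptation (CTTA) methods are typically designed for sudden, severe shifts and neglect temporal continuity, which leads to three core defects: a limited memory cache restricts long-range distribution modeling and causes catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift and worsens error accumulation; and static visual representations misalign with evolving inputs. The paper proposes BayesTTA, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. It incrementally estimates class-conditional Gaussian mixture distributions, adaptively selects covariance structures, and performs calibrated inference with Gaussian discriminant analysis; these predictions then supervise self-paced adaptation of normalization layers for efficient, stable representation alignment.

Link: https://arxiv.org/abs/2507.08607
Authors: Shuang Cui, Jinglin Xu, Yi Li, Xiongxin Tang, Jiangmeng Li, Jiahuan Zhou, Fanjiang Xu, Fuchun Sun, Hui Xiong
Affiliations: National Key Laboratory of Space Integrated Information System, Institute of Software Chinese Academy of Sciences, University of Chinese Academy of Sciences; Wangxuan Institute of Computer Technology, Peking University; Department of Computer Science and Technology, Tsinghua University; Thrust of Artificial Intelligence, the Hong Kong University of Science and Technology (Guangzhou), Department of Computer Science & Engineering, the Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: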

Abstract: Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under temporally evolving distribution shifts common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as Continual-Temporal Test-Time Adaptation (CT-TTA), where test distributions evolve gradually over time. To address it, we propose BayesTTA, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at this https URL.
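Estimating class-conditional Gaussians incrementally, without storing raw data, and then classifying with Gaussian discriminant analysis can be sketched in one dimension. This is a simplified illustration, not BayesTTA itself: it uses Welford's running mean and variance per class, a scalar feature, per-class variances, and no hypothesis test for the covariance structure. The class and function names are hypothetical.

```python
import math

class RunningGaussian:
    """Welford update: keeps only count, mean, and sum of squared deviations."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Floor the variance so a single or constant sample cannot divide by zero.
        return max(self.m2 / self.n, 1e-6) if self.n > 1 else 1.0

def gda_posteriors(x, classes):
    """Posterior over classes from log prior plus Gaussian log likelihood."""
    total = sum(g.n for g in classes.values())
    scores = {}
    for name, g in classes.items():
        log_prior = math.log(g.n / total)
        log_lik = -0.5 * (math.log(2 * math.pi * g.var) + (x - g.mean) ** 2 / g.var)
        scores[name] = log_prior + log_lik
    top = max(scores.values())
    exp = {k: math.exp(v - top) for k, v in scores.items()}
    norm = sum(exp.values())
    return {k: v / norm for k, v in exp.items()}

classes = {"a": RunningGaussian(), "b": RunningGaussian()}
for value in (0.0, 0.2, -0.2):
    classes["a"].update(value)
for value in (5.0, 5.2, 4.8):
    classes["b"].update(value)
posterior = gda_posteriors(0.1, classes)
```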

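[CV-22] Visual Semantic Description Generation with MLLMs for Image-Text Matching ICME2025

Quick read: This paper addresses the inherent representational gap between the visual and textual modalities in Image-Text Matching (ITM), that is, the difficulty of aligning continuous, high-dimensional image features with discrete, structured text. The key to the solution is using Multimodal Large Language Models (MLLMs) as visual semantic parsers: the Visual Semantic Descriptions (VSD) they generate serve as semantic anchors that facilitate cross-modal alignment. The approach has two modules: instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and prototype-level alignment, which clusters VSD to ensure category-level consistency.

Link: https://arxiv.org/abs/2507.08590
Authors: Junyu Chen, Yihua Gao, Mingyong Li
Affiliations: Chongqing Normal University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME2025 oral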

Abstract: Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features vs. discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) Instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) Prototype-level alignment through VSD clustering to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at this https URL.

[CV-23] A Multi-Modal Fusion Framework for Brain Tumor Segmentation Based on 3D Spatial-Language-Vision Integration and Bidirectional Interactive Attention Mechanism

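Quick read: This paper addresses insufficient segmentation accuracy and boundary delineation in brain tumor segmentation by integrating spatial-language-vision information. The key to the solution is two core components: the Multi-modal Semantic Fusion Adapter (MSFA), which fuses 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA), which enables iterative information exchange between modalities.

Link: https://arxiv.org/abs/2507.08574
Authors: Mingda Zhang, Kaiwen Pan
Affiliations: Yunnan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures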

Abstract:This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Methods: We propose two core components: Multi-modal Semantic Fusion Adapter (MSFA) integrating 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA) enabling iterative information exchange between modalities. The framework was evaluated on BraTS 2020 dataset comprising 369 multi-institutional MRI scans. Results: The proposed method achieved average Dice coefficient of 0.8505 and 95% Hausdorff distance of 2.8256mm across enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed critical contributions of semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis.
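For readers unfamiliar with the headline metric, the Dice coefficient and its average over the three tumor regions can be computed as below. This is a generic sketch on flattened binary masks; the region keys (ET, TC, WT) and the convention that two empty masks score 1.0 are assumptions, and the 95% Hausdorff distance is not covered.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) on binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if size == 0 else 2.0 * intersection / size

def mean_dice(pred_by_region, target_by_region):
    """Average Dice across regions, e.g. enhancing tumor, tumor core, whole tumor."""
    scores = [dice_coefficient(pred_by_region[r], target_by_region[r])
              for r in target_by_region]
    return sum(scores) / len(scores)

prediction = {"ET": [1, 1], "TC": [1, 0], "WT": [0, 0]}
reference = {"ET": [1, 1], "TC": [0, 1], "WT": [0, 0]}
average = mean_dice(prediction, reference)
```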

[CV-24] Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion ICCV2025

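Quick read: This paper addresses the under-use of class-level information in 3D Semantic Scene Completion (SSC) that results from treating voxels as the basic interaction units. The key to the solution is DISC (Disentangling Instance and Scene Contexts), a dual-stream paradigm that enhances learning for instance and scene categories through separated optimization: it replaces voxel queries with discriminative class queries that carry class-specific geometric and semantic priors, and it exploits the intrinsic properties of classes to design specialized decoding modules that facilitate targeted interactions and efficient class-level information flow.

Link: https://arxiv.org/abs/2507.08555
Authors: Enyu Liu, En Yu, Sijia Chen, Wenbing Tao
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025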

Abstract: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at this https URL.

[CV-25] Image Translation with Kernel Prediction Networks for Semantic Segmentation ECCV2024

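Quick read: This paper addresses the weak generalization of semantic segmentation models trained on synthetic datasets, which are used because real-data annotations are hard to obtain, and in particular the noisy labels that arise because unpaired image translation methods cannot guarantee semantic matching. The key to the solution is a new image translation method, the Domain Adversarial Kernel Prediction Network (DA-KPN), which estimates pixel-wise input transformation parameters of a lightweight translation function and uses multi-scale discriminators to keep the translation realistic, so that the synthetic label and the translated image stay semantically consistent.

Link: https://arxiv.org/abs/2507.08554
Authors: Cristina Mata, Michael S. Ryoo, Henrik Turbell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: OOD-CV Workshop at ECCV 2024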

Abstract:Semantic segmentation relies on many dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.
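One way to see why predicting per-pixel transformation parameters preserves labels is a small sketch. The per-pixel gain and bias used here are an assumed stand-in for the paper's "lightweight and simple translation function", whose exact form the abstract does not give; in DA-KPN the parameters would come from a network, whereas here they are passed in directly. The function name is hypothetical.

```python
def apply_pixelwise_affine(image, gains, biases):
    """Translate a grayscale image with one (gain, bias) pair per pixel.

    Output pixel (r, c) depends only on input pixel (r, c), so nothing moves
    spatially and a label map for the input stays valid for the output.
    """
    return [
        [min(1.0, max(0.0, g * p + b)) for p, g, b in zip(row, gain_row, bias_row)]
        for row, gain_row, bias_row in zip(image, gains, biases)
    ]

synthetic = [[0.2, 0.4], [0.6, 0.9]]
gains = [[2.0, 1.0], [1.0, 1.5]]
biases = [[0.0, 0.1], [-0.1, 0.0]]
translated = apply_pixelwise_affine(synthetic, gains, biases)
```

A GAN generator that synthesizes the whole output image has no such constraint, which is where semantic mismatches between image and label can creep in.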

[CV-26] SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2

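Quick read: This paper addresses the drop in segmentation performance that distractors, occlusions, and object motion cause in visual object tracking. The key to the solution is reinforcement learning: memory update optimization is framed as a sequential decision-making problem, replacing hand-crafted update rules, to improve the temporal consistency and robustness of SAM 2 across video sequences.

Link: https://arxiv.org/abs/2507.08548
Authors: Alen Adamyan, Tomáš Čížek, Matej Straka, Klara Janouskova, Martin Schmid
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: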

Abstract:Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks and has become the state-of-the-art for visual object tracking. The model stores information from previous frames in a memory bank, enabling temporal consistency across video sequences. Recent methods augment SAM 2 with hand-crafted update rules to better handle distractors, occlusions, and object motion. We propose a fundamentally different approach using reinforcement learning for optimizing memory updates in SAM 2 by framing memory control as a sequential decision-making problem. In an overfitting setup with a separate agent per video, our method achieves a relative improvement over SAM 2 that exceeds by more than three times the gains of existing heuristics. These results reveal the untapped potential of the memory bank and highlight reinforcement learning as a powerful alternative to hand-crafted update rules for memory control in visual object tracking.
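Framing memory control as sequential decision-making means that, at every frame, a policy observes some state and chooses what to do with the memory bank. The sketch below shows only that loop. The state fields, the binary write-or-skip action, the fixed-capacity first-in-first-out bank, and the threshold policy are all assumptions for illustration; they do not describe SAM 2's actual memory bank or the paper's action space, and a trained agent would replace the placeholder policy.

```python
from collections import deque

def run_episode(frame_scores, policy, capacity=3):
    """Step through a video, letting `policy` decide which frames enter memory."""
    memory = deque(maxlen=capacity)  # oldest entry is evicted when full
    actions = []
    for t, score in enumerate(frame_scores):
        state = {"t": t, "score": score, "memory_size": len(memory)}
        action = policy(state)  # 1 = write this frame to memory, 0 = skip it
        if action == 1:
            memory.append(t)
        actions.append(action)
    return list(memory), actions

def threshold_policy(state):
    """Hand-crafted placeholder standing in for a learned policy."""
    return int(state["score"] > 0.5)

memory, actions = run_episode([0.9, 0.2, 0.8, 0.95, 0.1], threshold_policy, capacity=2)
```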

[CV-27] RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features MICCAI2025

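Quick read: This paper addresses the limited clinical flexibility of current medical image retrieval methods, which primarily support 2D images and require fully annotated queries. The key to the solution is RadiomicsRetrieval, a framework that bridges handcrafted radiomics features with deep learning-based embeddings at the tumor level and exploits the spatial context of 3D volumetric data. It uses a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, aligns them via contrastive learning with radiomics features extracted from the same tumor, and enriches the representation with Anatomical Positional Embedding (APE), enabling flexible queries based on shape, location, or partial feature sets.

Link: https://arxiv.org/abs/2507.08546
Authors: Inye Na, Nejung Rue, Jiwon Chung, Hyunjin Park
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI 2025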

Abstract:Medical image retrieval is a valuable field for supporting clinical decision-making, yet current methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. To address this, we propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging handcrafted radiomics descriptors with deep learning-based embeddings at the tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits volumetric data to leverage richer spatial context in medical images. We employ a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, which are aligned with radiomics features extracted from the same tumor via contrastive learning. These representations are further enriched by anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables flexible querying based on shape, location, or partial feature sets. Extensive experiments on both lung CT and brain MRI public datasets demonstrate that radiomics features significantly enhance retrieval specificity, while APE provides global anatomical context essential for location-based searches. Notably, our framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead and supporting diverse clinical scenarios. The capability to query using either image embeddings or selected radiomics attributes highlights its adaptability, potentially benefiting diagnosis, treatment planning, and research on large-scale medical imaging repositories. Our code is available at this https URL.
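Once image embeddings and radiomics features live in a shared space, retrieval reduces to nearest-neighbor search. The following sketch ranks a small database by cosine similarity to a query vector. The embeddings are made-up two-dimensional vectors and the function names are hypothetical; how the framework actually encodes a query from shape, location, or a partial feature set is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

def retrieve(query, database, top_k=3):
    """Return the names of the top_k database entries most similar to the query."""
    ranked = sorted(database, key=lambda name: cosine(query, database[name]), reverse=True)
    return ranked[:top_k]

database = {"case_a": [1.0, 0.0], "case_b": [0.0, 1.0], "case_c": [0.9, 0.1]}
results = retrieve([1.0, 0.0], database)
```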

[CV-28] Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation for Occluded Person Re-Identification

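Quick read: This paper addresses feature contamination from holistic images and poor generalization to diverse occlusion scenarios in occluded person re-identification. The key to the solution is Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR): a teacher-student distillation architecture transfers purified discriminative knowledge from holistic images to the occluded branch, an Occlusion-Aware Vision Transformer models diverse occlusion patterns, and a Feature Erasing and Purification Module uses deep reinforcement learning to identify noisy patch tokens in holistic images and replace them with learnable embedding tokens, which avoids feature contamination and uncovers identity-related discriminative clues.

Link: https://arxiv.org/abs/2507.08520
Authors: Yufei Zheng, Wenjun Wang, Wenjun Gan, Jiawei Liu
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures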

Abstract:Occluded person re-identification aims to retrieve holistic images based on occluded ones. Existing methods often rely on aligning visible body parts, applying occlusion augmentation, or complementing missing semantics using holistic images. However, they face challenges in handling diverse occlusion scenarios not seen during training and the issue of feature contamination from holistic images. To address these limitations, we propose Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR), which simultaneously mitigates these challenges. OGFR adopts a teacher-student distillation architecture that effectively incorporates diverse occlusion patterns into feature representation while transferring the purified discriminative holistic knowledge from the holistic to the occluded branch through reinforced knowledge distillation. Specifically, an Occlusion-Aware Vision Transformer is designed to leverage learnable occlusion pattern embeddings to explicitly model such diverse occlusion types, thereby guiding occlusion-aware robust feature representation. Moreover, we devise a Feature Erasing and Purification Module within the holistic branch, in which an agent is employed to identify low-quality patch tokens of holistic images that contain noisy negative information via deep reinforcement learning, and substitute these patch tokens with learnable embedding tokens to avoid feature contamination and further excavate identity-related discriminative clues. Afterward, with the assistance of knowledge distillation, the student branch effectively absorbs the purified holistic knowledge to precisely learn robust representation regardless of the interference of occlusions.

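[CV-29] Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Quick read: This paper addresses the difficulty Multimodal Large Language Models (MLLMs) have in accurately capturing camera-object relations, especially object orientation, camera viewpoint, and camera shots. Existing MLLMs are limited because their training data contains too little diversity in camera-object relations. The key to the solution is a synthetic generation pipeline for large-scale 3D visual instruction datasets: the framework takes 3D assets as input, uses rendering and diffusion-based image generation models to produce photorealistic images that preserve precise camera-object relations, and uses large language models to generate text prompts that guide visual instruction tuning and control image generation.

Link: https://arxiv.org/abs/2507.08513
Authors: Liu He, Xiao Zeng, Yizhi Song, Albert Y. C. Chen, Lu Xia, Shashwat Verma, Sankalp Dayal, Min Sun, Cheng-Hao Kuo, Daniel Aliaga
Affiliations: Purdue University; Amazon
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: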

Abstract:Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diverse camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
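Because the camera pose is known exactly when rendering a 3D asset, the ground-truth answer to a camera-object question can be derived from the pose rather than annotated by hand. The sketch below illustrates that idea only. The spherical parameterization, the convention that azimuth 0 degrees faces the object's front, the four coarse viewpoint names, and the question wording are all assumptions, not the Ultimate3D pipeline.

```python
import math

def camera_position(azimuth_deg, elevation_deg, distance):
    """Place the camera on a sphere around an object at the origin."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (
        distance * math.cos(el) * math.cos(az),
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
    )

def viewpoint_label(azimuth_deg):
    """Bin the azimuth into four 90-degree sectors centered on each side."""
    names = ["front", "left", "back", "right"]
    return names[int(((azimuth_deg % 360) + 45) // 90) % 4]

def make_vqa(azimuth_deg):
    """Build one question-answer pair whose answer follows from the pose."""
    return {
        "question": "From which side does the camera view the object?",
        "answer": viewpoint_label(azimuth_deg),
    }

sample = make_vqa(180)
```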

[CV-30] Unified People Tracking with Graph Neural Networks

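Quick read: This paper addresses the reliance of multi-people tracking on pre-computed tracklets, proposing a unified, fully differentiable model that associates detections into trajectories without them. The key to the solution is a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, propagates information seamlessly across entire sequences, and can encode scene-specific information to improve occlusion handling.

Link: https://arxiv.org/abs/2507.08494
Authors: Martin Engilberge, Ivan Vrkic, Friedrich Wilke Grosche, Julien Pilet, Engin Turetken, Pascal Fua
Affiliations: EPFL; Invision AI; CSEM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: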

Abstract:This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.
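A spatiotemporal graph over raw detections, with no tracklets in between, might be built along these lines. The rule used here (connect two detections when they are at most `max_gap` frames apart and close enough given a maximum speed) and the parameter names are assumptions for illustration; the paper's graph is dynamic and learned, and also carries contextual and scene information that this sketch leaves out.

```python
import math

def build_edges(detections, max_gap=2, max_speed=1.5):
    """Connect detections that could plausibly belong to the same person.

    Each detection is (frame, x, y). An edge i -> j is added when j is
    1..max_gap frames after i and within max_speed units per elapsed frame.
    """
    edges = []
    for i, (ti, xi, yi) in enumerate(detections):
        for j, (tj, xj, yj) in enumerate(detections):
            gap = tj - ti
            if 0 < gap <= max_gap and math.hypot(xj - xi, yj - yi) <= max_speed * gap:
                edges.append((i, j))
    return edges

detections = [(0, 0.0, 0.0), (1, 1.0, 0.0), (1, 5.0, 5.0), (2, 2.0, 0.0)]
edges = build_edges(detections)
```

Allowing edges to skip a frame is what lets information bridge a short occlusion, since a person missed in one frame can still be linked across it.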

[CV-31] Dual Dimensions Geometric Representation Learning Based Document Dewarping

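[Quick read]: This paper targets document image dewarping, where existing deep learning methods typically attend only to a single horizontal dimension and therefore struggle to capture complex distortions. The key to the solution is D2Dewarp, a fine-grained deformation perception model that focuses on the dual dimensions of horizontal and vertical lines in a document to perceive distortion trends in different directions more precisely. In addition, an effective fusion module based on X and Y coordinates promotes interaction and constraint between horizontal and vertical features so that they complement each other.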

Link: https://arxiv.org/abs/2507.08492
Authors: Heng Li, Qingcai Chen, Xiangping Wu
Affiliations: Harbin Institute of Technology (Shenzhen); Shenzhen University Town; PengCheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose D2Dewarp, a fine-grained deformation perception model that focuses on the Dual Dimensions of horizontal and vertical lines in a document to improve document Dewarping. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinates is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and an automatic rendering engine to build a new large-scale distortion training dataset. The code and dataset will be publicly released. On public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods. The dataset will be publicly available at this https URL

[CV-32] F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement

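[Quick read]: This paper targets persistent challenges in clinical medical image segmentation: reliance on complete multimodal inputs, limited generalizability, and narrow task specificity. The key to the solution is F3-Net, which uses flexible synthetic modality training to stay robust when MRI sequences are missing, substituting absent modalities with a zero-image strategy rather than an explicit synthesis network, which improves real-world applicability. Its unified architecture also supports multiple pathology segmentation tasks without retraining and outperforms conventional CNN-based and Transformer-based models.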

Link: https://arxiv.org/abs/2507.08460
Authors: Seyedeh Sahar Taheri Otaghsara, Reza Rahmanzadeh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:F3-Net is a foundation model designed to overcome persistent challenges in clinical medical image segmentation, including reliance on complete multimodal inputs, limited generalizability, and narrow task specificity. Through flexible synthetic modality training, F3-Net maintains robust performance even in the presence of missing MRI sequences, leveraging a zero-image strategy to substitute absent modalities without relying on explicit synthesis networks, thereby enhancing real-world applicability. Its unified architecture supports multi-pathology segmentation across glioma, metastasis, stroke, and white matter lesions without retraining, outperforming CNN-based and transformer-based models that typically require disease-specific fine-tuning. Evaluated on diverse datasets such as BraTS 2021, BraTS 2024, and ISLES 2022, F3-Net demonstrates strong resilience to domain shifts and clinical heterogeneity. On the whole pathology dataset, F3-Net achieves average Dice Similarity Coefficients (DSCs) of 0.94 for BraTS-GLI 2024, 0.82 for BraTS-MET 2024, 0.94 for BraTS 2021, and 0.79 for ISLES 2022. This positions it as a versatile, scalable solution bridging the gap between deep learning research and practical clinical deployment.
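The zero-image strategy mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the modality names, the channel order, and the helper names `zero_image` and `build_input` are assumptions made for illustration. It only shows the general idea of filling an absent MRI sequence with an all-zero image instead of synthesizing it.

```python
# Hypothetical sketch: fill missing MRI sequences with all-zero images.
# The modality names and the fixed channel order are assumptions.

MODALITIES = ["t1", "t1ce", "t2", "flair"]  # assumed channel order


def zero_image(height, width):
    """Return an all-zero 2D image as a list of rows."""
    return [[0.0] * width for _ in range(height)]


def build_input(available, height, width):
    """Stack one channel per modality, using zeros where a sequence is absent."""
    channels = []
    for name in MODALITIES:
        image = available.get(name)
        if image is None:
            # Placeholder channel; no synthesis network is involved.
            image = zero_image(height, width)
        channels.append(image)
    return channels


# Toy 2x2 scan with only two of the four assumed sequences present.
scan = {"t1": [[0.2, 0.4], [0.6, 0.8]], "flair": [[0.1, 0.3], [0.5, 0.7]]}
stacked = build_input(scan, height=2, width=2)
print(len(stacked))  # number of channels passed to the model
print(stacked[1])    # the missing "t1ce" channel
```

Keeping the channel count fixed means the same network input shape can be used regardless of which sequences a given scan contains.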

[CV-33] A document is worth a structured record: Principled inductive bias design for document recognition

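[Quick read]: This paper addresses the fact that current methods treat document recognition as a pure computer vision problem and neglect document-type-specific structural properties, which leads to poor results on less frequent or more complex document types. The key to the solution is to frame document recognition as a transcription task from a document to a record, and to design structure-specific inductive biases together with a corresponding base Transformer architecture that captures the intrinsic structure of documents and improves recognition performance.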

Link: https://arxiv.org/abs/2507.08458
Authors: Benjamin Meyer, Lukas Tuggener, Sascha Hänzi, Daniel Schmid, Erdal Ayfer, Benjamin F. Grewe, Ahmed Abdulkadir, Thilo Stadelmann
Affiliations: ZHAW Zurich University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.

[CV-34] Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT

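[Quick read]: This paper addresses the limitations of traditional 3D reconstruction methods in complex scenes, such as poor robustness in texture-less regions, high computational cost, and complicated workflows. The key to the solution is the use of deep-learning-based feed-forward models such as DUSt3R, which use a unified deep network to jointly infer camera poses and dense geometry in a single forward pass, enabling efficient end-to-end 3D reconstruction.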

Link: https://arxiv.org/abs/2507.08448
Authors: Wei Zhang, Yihang Wu, Songhua Li, Wenjie Ma, Xin Ma, Qiang Li, Qi Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an unconstrained set of images in a single forward pass. This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology’s broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.

[CV-35] Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

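[Quick read]: This paper addresses how to build an efficient image tokenizer on top of a pre-trained vision foundation model, a direction that prior work has left largely unexplored. The key to the solution is to use a frozen vision foundation model as the tokenizer's encoder and to add two core components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids; (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Built on these designs, the proposed tokenizer, VFMTok, substantially improves image reconstruction and generation quality, improves token efficiency, and performs well in autoregressive generation.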

Link: https://arxiv.org/abs/2507.08441
Authors: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
Affiliations: The University of Hong Kong; StepFun; Dexmal; MEGVII Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 4 figures

Abstract:Leveraging the powerful representations of pre-trained vision foundation models – traditionally used for visual comprehension – we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer’s outputs with the foundation model’s representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation – achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.

[CV-36] RePaintGS: Reference-Guided Gaussian Splatting for Realistic and View-Consistent 3D Scene Inpainting

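[Quick read]: This paper addresses cross-view inconsistency in 3D scene inpainting after object removal: existing image inpainting methods produce one of many plausible completions for each view, which leads to perceptual inconsistencies between viewpoints. The key to the solution is to introduce a reference view: the inpainting similarity between each other view and the reference is estimated and used to adjust that view's contribution to building accurate geometry, and the reference inpainting is then used as pseudo-ground truth that guides optimization to match the reference appearance, giving more realistic and perceptually consistent results.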

Link: https://arxiv.org/abs/2507.08434
Authors: Ji Hyun Seo, Byounhyun Yoo, Gerard Jounghyun Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Radiance field methods, such as Neural Radiance Field or 3D Gaussian Splatting, have emerged as seminal 3D representations for synthesizing realistic novel views. For practical applications, there is ongoing research on flexible scene editing techniques, among which object removal is a representative task. However, removing objects exposes occluded regions, often leading to unnatural appearances. Thus, studies have employed image inpainting techniques to replace such regions with plausible content - a task referred to as 3D scene inpainting. However, image inpainting methods produce one of many plausible completions for each view, leading to inconsistencies between viewpoints. A widely adopted approach leverages perceptual cues to blend inpainted views smoothly. However, it is prone to detail loss and can fail when there are perceptual inconsistencies across views. In this paper, we propose a novel 3D scene inpainting method that reliably produces realistic and perceptually consistent results even for complex scenes by leveraging a reference view. Given the inpainted reference view, we estimate the inpainting similarity of the other views to adjust their contribution in constructing an accurate geometry tailored to the reference. This geometry is then used to warp the reference inpainting to other views as pseudo-ground truth, guiding the optimization to match the reference appearance. Comparative evaluation studies have shown that our approach improves both the geometric fidelity and appearance consistency of inpainted scenes.

[CV-37] Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

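[Quick read]: This paper addresses the heavy computation that limits real-world deployment of diffusion models, in particular the efficiency of Transformer-based diffusion models for high-fidelity image and video generation. The key to the solution is Region-Adaptive Latent Upsampling (RALU), a training-free framework that performs mixed-resolution sampling along the spatial dimension: low-resolution denoising to capture global semantic structure, region-adaptive upsampling of regions prone to artifacts, and detail refinement at full resolution, which significantly reduces computation while preserving image quality.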

Link: https://arxiv.org/abs/2507.08422
Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full-resolution, and 3) all latent upsampling at full-resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and can thus be seamlessly integrated to further reduce inference latency without compromising generation quality.

[CV-38] InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes ICCV2025

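[Quick read]: This paper addresses holistic 3D perception of complex scenes, in particular how to identify and reconstruct occluded objects from partial observations. Existing methods model a scene as an undifferentiated whole and struggle to recognize complete objects from partial observations. The paper proposes InstaScene, whose key idea is to decompose arbitrary instances through spatial contrastive learning and to combine this with in-situ generation that exploits valuable observations and geometric cues, guiding 3D generative models to reconstruct complete instances that align seamlessly with the real world.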

Link: https://arxiv.org/abs/2507.08416
Authors: Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, Hujun Bao
Affiliations: Zhejiang University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025. Project page: this https URL

Abstract:Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robotics remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning approach by tracing rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.

[CV-39] Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models

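[Quick read]: This paper addresses two key problems that Vision-Language Models (VLMs) face in prompt learning: first, inadequate modeling of class embedding distributions for unseen instances, which limits generalization to novel classes; second, existing methods mostly confine cross-modal alignment to the final output layer of the vision and text encoders, which limits the model's ability to preserve topological consistency with the pre-trained multi-modal embedding space. The key to the solution is the MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) framework, which uses Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that carry rich, fine-grained semantic knowledge, uses an Attention Mutual-Guidance (AMG) module to let visual and semantic information interact and to generate Visual Conditional Prompts (VCP), and introduces a Multi-Prompt Fusion (MPF) mechanism that fuses SCP and VCP with contextual prompts to improve performance on multi-modal tasks.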

Link: https://arxiv.org/abs/2507.08410
Authors: Shijun Yang, Xiang Zhang, Wanqing Zhao, Hangzai Luo, Sheng Zhong, Jinye Peng, Jianping Fan
Affiliations: Northwest University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 8 figures

Abstract:Prompt learning facilitates the efficient adaptation of Vision-Language Models (VLMs) to various downstream tasks. However, it faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders, which fundamentally limits their capacity to preserve topological consistency with pre-trained multi-modal embedding spaces. To this end, we introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that incorporate rich, fine-grained high-level semantic knowledge for image instances. To ensure effective alignment and interaction across the multi-modal space of Vision-Language Models (VLMs), we introduce the Attention Mutual-Guidance (AMG) module, which facilitates interactions between visual and semantic information. Through mutual guidance, the AMG module generates Visual Conditional Prompts (VCP), enhancing the model’s performance in multi-modal tasks. Additionally, we present a Multi-Prompt Fusion (MPF) mechanism that integrates SCP and VCP with contextual prompts, ensuring seamless coordination among the different prompts and enhancing the modeling of class embeddings and instance-specific knowledge. Our MuGCP outperforms existing state-of-the-art methods on 14 different datasets. The code will be made available after publication.

[CV-40] Deep Hashing with Semantic Hash Centers for Image Retrieval

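[Quick read]: This paper addresses the fact that existing deep hashing methods ignore semantic relationships between classes when generating hash centers, which hurts the discriminability of hash codes and retrieval performance. The key to the solution is to introduce semantic hash centers (Semantic Hash Centers, SHC) through a three-stage framework: first, a data-dependent similarity calculation identifies semantic similarities between classes; second, an optimization algorithm generates hash centers that preserve semantic relatedness while enforcing a minimum distance between centers so that hash codes do not become too similar; finally, a deep hashing network is trained with these semantic centers to produce binary hash codes, improving large-scale image retrieval.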

Link: https://arxiv.org/abs/2507.08404
Authors: Li Chen, Rui Liu, Yuxiang Zhou, Xudong Ma, Yong Chen, Dell Zhang
Affiliations: Beihang University; Beijing University of Posts and Telecommunications; TeleAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep hashing is an effective approach for large-scale image retrieval. Current methods are typically classified by their supervision types: point-wise, pair-wise, and list-wise. Recent point-wise techniques (e.g., CSQ, MDS) have improved retrieval performance by pre-assigning a hash center to each class, enhancing the discriminability of hash codes across various datasets. However, these methods rely on data-independent algorithms to generate hash centers, which neglect the semantic relationships between classes and may degrade retrieval performance. This paper introduces the concept of semantic hash centers, building on the idea of traditional hash centers. We hypothesize that hash centers of semantically related classes should have closer Hamming distances, while those of unrelated classes should be more distant. To this end, we propose a three-stage framework, SHC, to generate hash codes that preserve semantic structure. First, we develop a classification network to identify semantic similarities between classes using a data-dependent similarity calculation that adapts to varying data distributions. Second, we introduce an optimization algorithm to generate semantic hash centers, preserving semantic relatedness while enforcing a minimum distance between centers to avoid excessively similar hash codes. Finally, a deep hashing network is trained using these semantic centers to convert images into binary hash codes. Experimental results on large-scale retrieval tasks across several public datasets show that SHC significantly improves retrieval performance. Specifically, SHC achieves average improvements of +7.26%, +7.62%, and +11.71% in MAP@100, MAP@1000, and MAP@ALL metrics, respectively, over state-of-the-art methods. 
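The hypothesis in the abstract, that related classes should have hash centers with smaller Hamming distances while all centers stay at least some minimum distance apart, can be checked on toy data. The sketch below is only an illustration: the 8-bit codes and class names are made up, and the functions `hamming` and `min_pairwise_distance` are hypothetical helpers, not part of SHC.

```python
def hamming(a, b):
    """Number of positions where two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(centers):
    """Smallest Hamming distance over all pairs of hash centers."""
    names = list(centers)
    return min(
        hamming(centers[p], centers[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    )


# Made-up 8-bit centers: "cat" and "dog" stand in for related classes.
centers = {
    "cat":   [1, 1, 1, 1, 0, 0, 0, 0],
    "dog":   [1, 1, 1, 0, 1, 0, 0, 0],
    "truck": [0, 0, 0, 0, 1, 1, 1, 1],
}

related = hamming(centers["cat"], centers["dog"])
unrelated = hamming(centers["cat"], centers["truck"])
print(related, unrelated, min_pairwise_distance(centers))
```

In this toy setup the related pair is closer than the unrelated pair, and the minimum pairwise distance is the quantity a minimum-distance constraint would bound from below.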

[CV-41] PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models

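[Quick read]: This paper addresses multi-task correspondence matching, that is, how to use one model architecture and obtain good performance across different vision tasks such as stereo matching, optical flow estimation, and feature matching. Traditional methods rely on task-specific architectures and domain-specific fine-tuning. The key to the proposed solution is to formulate any two-frame correspondence matching task as a 2D displacement estimation problem handled with the same model weights, which removes the need to design specialized unified architectures or task-specific ensemble models. The framework achieves multi-task integration by giving displacement estimation algorithms unprecedented generalization ability.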

Link: https://arxiv.org/abs/2507.08400
Authors: Yongjian Zhang, Longguang Wang, Kunhong Li, Ye Zhang, Yun Wang, Liang Lin, Yulan Guo
Affiliations: Sun Yat-sen University; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:This work presents PanMatch, a versatile foundation model for robust correspondence matching. Unlike previous methods that rely on task-specific architectures and domain-specific fine-tuning to support tasks like stereo matching, optical flow or feature matching, our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework using the same model weights. Such a formulation eliminates the need for designing specialized unified architectures or task-specific ensemble models. Instead, it achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities. To this end, we highlight the importance of a robust feature extractor applicable across multiple domains and tasks, and propose a feature transformation pipeline that leverages all-purpose features from Large Vision Models to endow matching baselines with zero-shot cross-view matching capabilities. Furthermore, we assemble a cross-domain dataset with nearly 1.8 million samples from stereo matching, optical flow, and feature matching domains to pretrain PanMatch. We demonstrate the versatility of PanMatch across a wide range of domains and downstream tasks using the same model weights. Our model outperforms UniMatch and Flow-Anything on cross-task evaluations, and achieves comparable performance to most state-of-the-art task-specific algorithms on task-oriented benchmarks. Additionally, PanMatch presents unprecedented zero-shot performance in abnormal scenarios, such as rainy days and satellite imagery, where most existing robust algorithms fail to yield meaningful results.
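To see why stereo matching fits a 2D displacement framework, consider a rectified stereo pair: a left-image pixel at (x, y) with disparity d matches the right-image pixel at (x - d, y), so the disparity map is a displacement field whose vertical component is zero. The sketch below assumes that sign convention; the function names are hypothetical and nothing here reflects PanMatch's actual implementation.

```python
def disparity_to_displacement(disparity):
    """Convert a left-to-right disparity map into a per-pixel (dx, dy) field.

    Assumes a rectified pair, so matches lie on the same row (dy = 0).
    """
    return [[(-d, 0.0) for d in row] for row in disparity]


def matched_coordinates(displacement):
    """For each pixel (x, y), return its matched location (x + dx, y + dy)."""
    return [
        [(x + dx, y + dy) for x, (dx, dy) in enumerate(row)]
        for y, row in enumerate(displacement)
    ]


# Toy one-row disparity map: pixel x has disparity x, so all match column 0.
disparity = [[0.0, 1.0, 2.0]]
field = disparity_to_displacement(disparity)
matches = matched_coordinates(field)
print(field[0][1])    # displacement of the pixel at x = 1
print(matches[0][2])  # matched location of the pixel at x = 2
```

Optical flow is already a (dx, dy) field, so under this view both tasks share one output format.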

[CV-42] Subject-Consistent and Pose-Diverse Text-to-Image Generation

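[Quick read]: This paper addresses the difficulty text-to-image (T2I) models have in keeping a subject's identity consistent across different scenes. Existing training-free subject-consistent generation (SCG) methods usually achieve consistency at the cost of layout and pose diversity, which limits expressive visual storytelling. The key to the solution is a framework named CoDi, which balances subject consistency and pose diversity with a two-stage strategy: the first stage, Identity Transport (IT), uses optimal transport in the early denoising steps to transfer identity features to the target images in a pose-aware manner, maintaining subject consistency while preserving pose diversity; the second stage, Identity Refinement (IR), selects the most salient identity features in the later denoising steps to further refine subject details.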

Link: https://arxiv.org/abs/2507.08396
Authors: Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai
Affiliations: Nanjing University; Nanyang Technological University; Vipshop
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided at this https URL.
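The abstract does not say how the optimal transport problem in Identity Transport is solved. As a generic illustration only, the sketch below uses entropy-regularized transport solved with Sinkhorn iterations, a common approximation that is swapped in here and is not claimed to be CoDi's method. The cost matrix is a made-up stand-in for dissimilarities between reference and target features, and `sinkhorn` is a hypothetical helper.

```python
import math


def sinkhorn(cost, row_mass, col_mass, epsilon=0.1, iterations=200):
    """Entropy-regularized optimal transport plan via Sinkhorn scaling."""
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n, m = len(row_mass), len(col_mass)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Rescale rows, then columns, to match the requested marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Made-up costs: reference feature i is most similar to target feature i.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
plan = sinkhorn(cost, row_mass=[0.5, 0.5], col_mass=[0.5, 0.5])
for row in plan:
    print([round(p, 4) for p in row])
```

The resulting plan puts almost all mass on the low-cost pairs while keeping each row and column at its prescribed total, which is the property that makes transport plans useful for soft feature matching.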

[CV-43] Smelly, dense, and spreaded: The Object Detection for Olfactory References (ODOR) dataset

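[Quick read]: This paper addresses real-world applications of computer vision in the humanities, in particular the challenges posed by artistic abstraction, peripheral objects, and subtle differences between fine-grained target classes. Existing datasets provide instance-level annotations on artworks but are generally biased toward the image center and limited in detailed object classes. The proposed ODOR dataset fills this gap with 38,116 object-level annotations across 4,712 images and 139 fine-grained categories. Its key contribution is a dataset with detailed categories, dense and overlapping objects, and spatial distribution over the whole image, intended to advance research on object detection in artworks and to encourage further exploration of the intersection of object recognition and smell perception.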

Link: https://arxiv.org/abs/2507.08384
Authors: Mathias Zinnen, Prathmesh Madhu, Inger Leemans, Peter Bell, Azhar Hussian, Hang Tran, Ali Hürriyetoğlu, Andreas Maier, Vincent Christlein
Affiliations: Friedrich-Alexander University; Vrije Universiteit Amsterdam; Philipps-University Marburg; Donders Institute for Brain, Cognition and Behaviour
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world applications of computer vision in the humanities require algorithms to be robust against artistic abstraction, peripheral objects, and subtle differences between fine-grained target classes. Existing datasets provide instance-level annotations on artworks but are generally biased towards the image centre and limited with regard to detailed object classes. The proposed ODOR dataset fills this gap, offering 38,116 object-level annotations across 4712 images, spanning an extensive set of 139 fine-grained categories. Conducting a statistical analysis, we showcase challenging dataset properties, such as a detailed set of categories, dense and overlapping objects, and spatial distribution over the whole image canvas. Furthermore, we provide an extensive baseline analysis for object detection models and highlight the challenging properties of the dataset through a set of secondary studies. Inspiring further research on artwork object detection and broader visual cultural heritage studies, the dataset challenges researchers to explore the intersection of object recognition and smell perception.

[CV-44] From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning ICCV2025

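[Quick read]: This paper addresses the long-standing separation of low-level enhancement and high-level visual understanding in low-light vision, as well as the limited generalization and scalability of existing methods. The key to the solution is a generalized bridge, Generalized Enhancement For Understanding (GEFU), which uses pretrained generative diffusion models to achieve zero-shot generalization, together with Semantically Consistent Unsupervised Fine-tuning (SCUF), which combines an illumination-aware image prompt and a cycle-attention adapter to raise semantic potential and uses caption and reflectance consistency to mitigate semantic degradation in unsupervised training.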

Link: https://arxiv.org/abs/2507.08380
Authors: Sen Wang, Shao Zeng, Tianjun Gu, Zhizhong Zhang, Ruixin Zhang, Shouhong Ding, Jingyun Zhang, Jun Wang, Xin Tan, Yuan Xie, Lizhuang Ma
Affiliations: East China Normal University; Tencent Youtu Lab; Tencent WeChat Pay Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Abstract:Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.

[CV-45] Unsupervised Methods for Video Quality Improvement: A Survey of Restoration and Enhancement Techniques

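[Quick read]: This paper addresses video restoration and enhancement, which improve visual quality and serve as important pre-processing steps for downstream computer vision tasks. The key to the solution is an in-depth review of unsupervised methods, including domain translation, self-supervision signal design, and blind-spot or noise-based methods, which aim to restore and enhance video effectively without ground-truth annotations.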

Link: https://arxiv.org/abs/2507.08375
Authors: Alexandra Malyugina, Yini Li, Joanne Lin, Nantheera Anantrasirichai
Affiliations: Bristol University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video restoration and enhancement are critical not only for improving visual quality, but also as essential pre-processing steps to boost the performance of a wide range of downstream computer vision tasks. This survey presents a comprehensive review of video restoration and enhancement techniques with a particular focus on unsupervised approaches. We begin by outlining the most common video degradations and their underlying causes, followed by a review of early conventional and deep learning-based methods, highlighting their strengths and limitations. We then present an in-depth overview of unsupervised methods, categorised by their fundamental approaches, including domain translation, self-supervision signal design and blind spot or noise-based methods. We also provide a categorization of loss functions employed in unsupervised video restoration and enhancement, and discuss the role of paired synthetic datasets in enabling objective evaluation. Finally, we identify key challenges and outline promising directions for future research in this field.

[CV-46] Understanding Driving Risks using Large Language Models: Toward Elderly Driver Assessment

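[Quick read]: This paper addresses how to use a multimodal large language model (LLM) to produce human-like interpretations of traffic scenes from static dashcam images, focusing on three judgment tasks relevant to elderly driver assessment: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. The key to the solution is the design of effective prompting strategies (zero-shot, few-shot, and multi-shot) that improve the model's performance on tasks requiring contextual reasoning, supporting risk assessment of driving scenes.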

Link: https://arxiv.org/abs/2507.08367
Authors: Yuki Yoshihara, Linjing Jiang, Nihan Karatas, Hitoshi Kanamori, Asuka Harada, Takahiro Tanaka
Affiliations: Institutes of Innovation for Future Society, Nagoya University, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs recognition. These tasks require contextual reasoning rather than simple object detection. Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but a lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that humans and the model faced difficulties interpreting structurally ambiguous scenes. However, the model’s explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.

[CV-47] Cycle Context Verification for In-Context Medical Image Segmentation MICCAI2025

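[Quick Read]: This paper addresses the sensitivity of in-context learning (ICL) based medical image segmentation to the alignment between the query image and the in-context image-mask pairs. In clinical settings, scarce annotations make it hard to select optimal in-context pairs, and fine-tuning foundation ICL models is infeasible because of computational cost and the risk of catastrophic forgetting. The key to the solution is the Cycle Context Verification (CCV) framework, which uses a cyclic mechanism to self-verify predictions and thereby improve contextual alignment: the model first generates a segmentation mask for the query image; the roles of the query and an in-context pair are then swapped so that the model checks its initial prediction by predicting the mask of the original in-context image; and a query-specific prompt adjusts the query image to improve alignment.

Link: https://arxiv.org/abs/2507.08357
Authors: Shishuai Hu,Zehui Liao,Liangli Zhen,Huazhu Fu,Yong Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025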

Abstract:In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation, where a variety of objects of interest across imaging modalities can be segmented using a single model. Nevertheless, its performance is highly sensitive to the alignment between the query image and in-context image-mask pairs. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs, and fine-tuning foundation ICL models on contextual data is infeasible due to computational costs and the risk of catastrophic forgetting. To address this challenge, we propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation by enabling self-verification of predictions and accordingly enhancing contextual alignment. Specifically, CCV employs a cyclic pipeline in which the model initially generates a segmentation mask for the query image. Subsequently, the roles of the query and an in-context pair are swapped, allowing the model to validate its prediction by predicting the mask of the original in-context image. The accuracy of this secondary prediction serves as an implicit measure of the initial query segmentation. A query-specific prompt is introduced to alter the query image and updated to improve the measure, thereby enhancing the alignment between the query and in-context pairs. We evaluated CCV on seven medical image segmentation datasets using two ICL foundation models, demonstrating its superiority over existing methods. Our results highlight CCV’s ability to enhance ICL-based segmentation, making it a robust solution for universal medical image segmentation. The code will be available at this https URL.

[CV-48] MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion

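[Quick Read]: This paper tackles micro-gesture (MG) classification, in particular the accurate recognition of subtle, short-duration micro-gestures. The key to the solution is MM-Gesture, a multimodal fusion framework that integrates complementary information from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities, and combines PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy to improve classification performance. Transfer learning with pre-training on the MA-52 dataset further improves the RGB modality.

Link: https://arxiv.org/abs/2507.08344
Authors: Jihao Gu,Fei Wang,Kun Li,Yanyan Wei,Zhiliang Wu,Dan Guo
Affiliations: University College London (UCL); Hefei University of Technology (HFUT); Zhejiang University; Hefei Comprehensive National Science Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: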

Abstract:In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%.
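
A modality-weighted ensemble can be sketched as a weighted average of per-modality class scores. The abstract does not say how MM-Gesture sets or applies its weights, so the function name, the weights, and the scores below are all hypothetical and only illustrate the general idea.

```python
def weighted_ensemble(scores_by_modality, weights):
    """Fuse per-modality class scores with a normalized weighted average."""
    total = sum(weights[m] for m in scores_by_modality)
    n_classes = len(next(iter(scores_by_modality.values())))
    fused = [0.0] * n_classes
    for modality, scores in scores_by_modality.items():
        share = weights[modality] / total  # normalized contribution of this modality
        for c, s in enumerate(scores):
            fused[c] += share * s
    return fused


# Hypothetical softmax outputs for three classes from three modalities.
scores = {
    "joint": [0.6, 0.3, 0.1],
    "rgb": [0.2, 0.7, 0.1],
    "depth": [0.3, 0.3, 0.4],
}
weights = {"joint": 1.0, "rgb": 2.0, "depth": 1.0}  # assumed: RGB trusted most

fused = weighted_ensemble(scores, weights)
predicted = max(range(len(fused)), key=lambda c: fused[c])
print(fused, predicted)
```

Because the weights are normalized, the fused scores still sum to one when each modality’s scores do, so they can be read as a combined class distribution.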

[CV-49] Towards Imperceptible JPEG Image Hiding: Multi-range Representations-driven Adversarial Stego Generation

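[Quick Read]: This paper addresses the fact that deep hiding schemes are easily detected by steganalyzers because of their large payloads and their reliance on a single kind of feature extraction (purely convolutional or purely transformer-based). The key to the solution is to introduce generation-based adversarial attacks and to propose MRAG, a multi-range representations-driven adversarial stego generation framework. MRAG combines the local-range neighborhood perception of convolution with the global-range dependency modeling of the transformer, takes images obtained through coarse-grained and fine-grained frequency decomposition as inputs to introduce multi-grained information, and uses a feature angle-norm disentanglement loss that pulls generated stego images closer to the cover images in the angle and norm space of the steganalyzer’s classification features.

Link: https://arxiv.org/abs/2507.08343
Authors: Junxue Yang,Xin Liao,Weixuan Tang,Jianhua Yang,Zheng Qin
Affiliations: Hunan University; Guangzhou University; Guangdong Polytechnic Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: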

Abstract:Deep hiding has been exploring the hiding capability of deep learning-based models, aiming to conceal image-level messages into cover images and reveal them from generated stego images. Existing schemes are easily detected by steganalyzers due to their large payloads and their limitation to feature extraction based solely on either pure convolution or pure transformer operators within a single range, as well as pixel-level loss constraints. To address the issue, in this paper, we introduce generation-based adversarial attacks into color JPEG image deep hiding and propose a multi-range representations-driven adversarial stego generation framework called MRAG from a steganalysis perspective. Specifically, we integrate the local-range neighbor reception characteristic of the convolution and the global-range dependency modeling of the transformer to construct MRAG. Meanwhile, we use the transformed images obtained through coarse-grained and fine-grained frequency decomposition as inputs, introducing multi-grained information. Furthermore, a features angle-norm disentanglement loss is designed to constrain the generated stegos closer to covers in the angle and norm space of the steganalyzer’s classified features. Consequently, small yet effective adversarial perturbations can be injected into the process of generating stegos, ensuring that stegos maintain favorable secret restorability and imperceptibility. Extensive experiments demonstrate that MRAG can achieve state-of-the-art performance.

[CV-50] Single-Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement

[Quick Read]: This paper addresses the poor generalization of multimodal survival prediction models across cancer types: existing methods mainly target a single cancer type and do not handle cross-cancer generalization. The key to the solution is two plug-and-play modules, Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE), which respectively mitigate the degradation of weaker-modality features and improve multimodal fusion, thereby strengthening generalization to unseen cancer types.

Link: https://arxiv.org/abs/2507.08340
Authors: Jia-Xuan Jiang,Jiashuai Liu,Hongtao Wu,Yifeng Wu,Zhong Wang,Qi Bi,Yefeng Zheng
Affiliations: Lanzhou University; Westlake University; Xi’an Jiaotong University; The Hong Kong University of Science and Technology (Guangzhou); Shenzhen Institutes of Advanced Technology; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACMMM 25

Abstract:Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at this https URL

[CV-51] CoCo-Bot: Energy-based Composable Concept Bottlenecks for Interpretable Generative Models

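[Quick Read]: This paper addresses the problem that generative Concept Bottleneck Models (CBMs) rely on auxiliary visual cues during generation to compensate for information the concepts fail to capture, which weakens interpretability and compositionality. The key to the solution is CoCo-Bot, a post-hoc, composable concept bottleneck generative model that transmits all information through explicit concepts alone and so removes the dependence on auxiliary cues. Guided by diffusion-based energy functions, it supports robust post-hoc interventions across arbitrary concepts, such as concept composition and negation, improving concept-level controllability and interpretability while maintaining visual quality.

Link: https://arxiv.org/abs/2507.08334
Authors: Sangwon Kim,In-su Jang,Pyongkun Kim,Kwang-Ju Kim
Affiliations: ETRI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: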

Abstract:Concept Bottleneck Models (CBMs) provide interpretable and controllable generative modeling by routing generation through explicit, human-understandable concepts. However, previous generative CBMs often rely on auxiliary visual cues at the bottleneck to compensate for information not captured by the concepts, which undermines interpretability and compositionality. We propose CoCo-Bot, a post-hoc, composable concept bottleneck generative model that eliminates the need for auxiliary cues by transmitting all information solely through explicit concepts. Guided by diffusion-based energy functions, CoCo-Bot supports robust post-hoc interventions, such as concept composition and negation, across arbitrary concepts. Experiments using StyleGAN2 pre-trained on CelebA-HQ show that CoCo-Bot improves concept-level controllability and interpretability, while maintaining competitive visual quality.

[CV-52] Interpretability-Aware Pruning for Efficient Medical Image Analysis

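[Quick Read]: This paper addresses the limited clinical adoption of deep learning in medical image analysis caused by large model size and lack of transparency. The key to the solution is an interpretability-guided pruning framework that selectively retains the most relevant parts of each layer, reducing model complexity while preserving predictive performance and transparency, and yielding lightweight, interpretable models suited to deployment in real healthcare settings.

Link: https://arxiv.org/abs/2507.08330
Authors: Nikita Malik,Pratinav Seth,Neeraj Kumar Singh,Chintan Chitroda,Vinay Kumar Sankarapu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Pre-Print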

Abstract:Deep learning has driven significant advances in medical image analysis, yet its adoption in clinical practice remains constrained by the large size and lack of transparency in modern models. Advances in interpretability techniques such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients make it possible to assess the contribution of individual components within neural networks trained on medical imaging tasks. In this work, we introduce an interpretability-guided pruning framework that reduces model complexity while preserving both predictive performance and transparency. By selectively retaining only the most relevant parts of each layer, our method enables targeted compression that maintains clinically meaningful representations. Experiments across multiple medical image classification benchmarks demonstrate that this approach achieves high compression rates with minimal loss in accuracy, paving the way for lightweight, interpretable models suited for real-world deployment in healthcare settings.
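
The idea of retaining only the most relevant parts of each layer can be sketched as a top-k selection over per-unit relevance scores. The abstract does not specify how relevance is aggregated or how the kept fraction is chosen, so the function, the scores, and the `keep_ratio` parameter below are hypothetical.

```python
import math


def prune_mask(relevance, keep_ratio):
    """Return a boolean keep-mask that retains the top `keep_ratio` units by relevance."""
    k = max(1, math.ceil(len(relevance) * keep_ratio))  # always keep at least one unit
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(relevance))]


# Hypothetical per-channel relevance for one layer, e.g. averaged attribution scores.
layer_relevance = [0.05, 0.40, 0.10, 0.30, 0.15]
mask = prune_mask(layer_relevance, keep_ratio=0.4)
print(mask)
```

In practice such a mask would be applied layer by layer, with the scores coming from an attribution method such as those the abstract names; that step is outside this sketch.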

[CV-53] Cross-Domain Identity Representation for Skull to Face Matching with Benchmark DataSet

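[Quick Read]: This paper addresses craniofacial reconstruction in forensic science, that is, identifying the individual who corresponds to a given skull X-ray image. The key to the solution is cross-domain identity representation with convolutional Siamese networks: the network is trained so that similar samples are grouped together in feature space and dissimilar samples are pushed apart, which enables matching and identification from skull images to face images.

Link: https://arxiv.org/abs/2507.08329
Authors: Ravi Shankar Prasad,Dinesh Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 12 figures, Pattern Recognition Letters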

Abstract:Craniofacial reconstruction in forensic science is crucial for the identification of the victims of crimes and disasters. The objective is to map a given skull to its corresponding face in a corpus of faces with known identities using recent advancements in computer vision, such as deep learning. In this paper, we present a framework for the identification of a person given the X-ray image of a skull using convolutional Siamese networks for cross-domain identity representation. Siamese networks are twin networks that share the same architecture and can be trained to discover a feature space where similar observations are grouped together and dissimilar observations are moved apart. To do this, the network is exposed to pairs of similar and dissimilar data. The Euclidean distance is then minimized between similar pairs and maximized between dissimilar ones. Since obtaining paired skull and face images is difficult, we prepared our own dataset from 40 volunteers, collecting front and side skull X-ray images and optical face images. Experiments were conducted on the collected cross-domain dataset to train and validate the Siamese networks. The experiments yield satisfactory results for identifying a person from a given skull.
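
The training objective described in the abstract, small Euclidean distance for matching pairs and large distance for non-matching pairs, is commonly written as a margin-based contrastive loss. The sketch below uses that common formulation; the abstract does not state the exact loss or margin the authors used, so `contrastive_loss`, `margin`, and the toy embeddings are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_identity, margin=1.0):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if same_identity:
        return d ** 2  # any distance between a matching pair is penalized
    return max(0.0, margin - d) ** 2  # non-matching pairs cost nothing once d >= margin


# Hypothetical 2-D embeddings of a skull X-ray and two face photos.
skull = [0.0, 0.0]
matching_face = [0.0, 0.5]
other_face = [3.0, 4.0]

print(contrastive_loss(skull, matching_face, same_identity=True))
print(contrastive_loss(skull, other_face, same_identity=False))
```

At identification time, a skull embedding would be compared against the embeddings of all known faces and the nearest one returned; that retrieval step is not shown here.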

[CV-54] M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation

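[Quick Read]: This paper aims to resolve rendering artifacts in audio-driven talking-head generation, such as motion blur, temporal jitter, and local penetration. The key to the solution is a unified framework with three steps (video preprocessing, motion representation, and rendering reconstruction), together with multi-granular motion decoupling and an alternating optimization strategy: non-rigid (oral and facial) and rigid (head) motions are modeled independently, and a motion consistency constraint keeps head and torso motion coherent, reducing the penetration artifacts caused by motion aliasing.

Link: https://arxiv.org/abs/2507.08307
Authors: Kui Jiang,Shiyu Liu,Junjun Jiang,Xin Yang,Hongxun Yang,Xiaopeng Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: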

Abstract:Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. We devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction. A motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness over TalkingGaussian, while running at 150 FPS inference speed. Our project homepage is this https URL

[CV-55] Cross-Resolution SAR Target Detection Using Structural Hierarchy Adaptation and Reliable Adjacency Alignment

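[Quick Read]: This paper addresses the discrepancies in scattering characteristics that arise as synthetic aperture radar (SAR) resolution improves, which hurt the generalization of target detection models. The key to the solution is a new method, CR-Net, which incorporates structure priors and evidential learning theory into the detection model and achieves reliable domain adaptation for cross-resolution detection through two modules: Structure-induced Hierarchical Feature Adaptation (SHFA) and Reliable Structural Adjacency Alignment (RSAA). SHFA establishes structural correlations between targets to make feature adaptation more interpretable, while RSAA uses a secure adjacency set to transfer discriminative knowledge from the source domain to the target domain, improving the detector’s discriminability in the target domain.

Link: https://arxiv.org/abs/2507.08290
Authors: Jiang Qin,Bin Zou,Haolin Li,Lamei Zhang
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE TGRS (major revision)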

Abstract:In recent years, continuous improvements in SAR resolution have significantly benefited applications such as urban monitoring and target detection. However, the improvement in resolution leads to increased discrepancies in scattering characteristics, posing challenges to the generalization ability of target detection models. While domain adaptation technology is a potential solution, the inevitable discrepancies caused by resolution differences often lead to blind feature adaptation and unreliable semantic propagation, ultimately degrading the domain adaptation performance. To address these challenges, this paper proposes a novel SAR target detection method (termed CR-Net) that incorporates structure priors and evidential learning theory into the detection model, enabling reliable domain adaptation for cross-resolution detection. Specifically, CR-Net integrates Structure-induced Hierarchical Feature Adaptation (SHFA) and Reliable Structural Adjacency Alignment (RSAA). The SHFA module establishes structural correlations between targets and achieves structure-aware feature adaptation, thereby enhancing the interpretability of the feature adaptation process. The RSAA module then enhances reliable semantic alignment by leveraging the secure adjacency set to transfer valuable discriminative knowledge from the source domain to the target domain, which further improves the discriminability of the detection model in the target domain. In experiments on datasets of different resolutions, the proposed CR-Net significantly enhances cross-resolution adaptation by preserving intra-domain structures and improving discriminability. It achieves state-of-the-art (SOTA) performance in cross-resolution SAR target detection.

[CV-56] FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields ICML2025

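[Quick Read]: This paper addresses the geometric inconsistency of drag-based editing methods: existing methods focus only on matching user-defined points and neglect the overall geometry, which leads to artifacts or unstable edits. The key to the solution is FlowDrag, which builds a 3D mesh, guides mesh deformation with an energy function, and incorporates the result into the UNet denoising process to achieve precise handle-to-target point alignment while preserving structural integrity.

Link: https://arxiv.org/abs/2507.08285
Authors: Gwanhyeong Koo,Sunjae Yoon,Younghwan Lee,Ji Woo Hong,Chang D. Yoo
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 Spotlight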

Abstract:Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.

[CV-57] Portable Biomechanics Laboratory: Clinically Accessible Movement Analysis from a Handheld Smartphone

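[Quick Read]: This paper addresses the lack of objective methods for measuring movement in clinical practice, which prevents biomechanical data from being used for more sensitive outcome measures or earlier identification of impairment. The key to the solution is the Portable Biomechanics Laboratory (PBL), which comprises a secure, cloud-enabled smartphone app for data collection and a novel algorithm for fitting biomechanical models to the data. PBL’s biomechanical measures were validated on a large, clinically representative dataset, and its usability and utility were demonstrated in neurosurgery and sports medicine clinics.

Link: https://arxiv.org/abs/2507.08268
Authors: J.D. Peiffer,Kunal Shah,Irina Djuraskovic,Shawana Anarwala,Kayan Abdou,Rujvee Patel,Prakash Jayabalan,Brenton Pennicooke,R. James Cotton
Affiliations: Shirley Ryan AbilityLab; Northwestern University; Washington University School of Medicine; Northwestern University Feinberg School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures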

Abstract:The way a person moves is a direct reflection of their neurological and musculoskeletal health, yet it remains one of the most underutilized vital signs in clinical practice. Although clinicians visually observe movement impairments, they lack accessible and validated methods to objectively measure movement in routine care. This gap prevents wider use of biomechanical measurements in practice, which could enable more sensitive outcome measures or earlier identification of impairment. We present our Portable Biomechanics Laboratory (PBL), which includes a secure, cloud-enabled smartphone app for data collection and a novel algorithm for fitting biomechanical models to this data. We extensively validated PBL’s biomechanical measures using a large, clinically representative dataset. Next, we tested the usability and utility of our system in neurosurgery and sports medicine clinics. We found joint angle errors within 3 degrees across participants with neurological injury, lower-limb prosthesis users, pediatric inpatients, and controls. In addition to being easy to use, gait metrics computed from the PBL showed high reliability and were sensitive to clinical differences. For example, in individuals undergoing decompression surgery for cervical myelopathy, the mJOA score is a common patient-reported outcome measure; we found that PBL gait metrics correlated with mJOA scores and demonstrated greater responsiveness to surgical intervention than the patient-reported outcomes. These findings support the use of handheld smartphone video as a scalable, low-burden tool for capturing clinically meaningful biomechanical data, offering a promising path toward accessible monitoring of mobility impairments. We release the first clinically validated method for measuring whole-body kinematics from handheld smartphone video at this https URL .

[CV-58] CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations

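[Quick Read]: This paper addresses the robustness of the perception module in visuomotor policy learning, in particular the difficulty that existing methods built on pre-trained 2D foundation models have in capturing 3D spatial information and generalizing across camera viewpoints. The key to the solution is CL3R, a novel 3D pre-training framework that learns rich 3D representations with a point cloud Masked Autoencoder and uses contrastive learning to transfer semantic knowledge efficiently from pre-trained 2D foundation models, strengthening the spatial awareness and semantic understanding of robotic manipulation policies.

Link: https://arxiv.org/abs/2507.08262
Authors: Wenbo Cui,Chengyang Zhao,Yuhui Chen,Haoran Li,Zhizheng Zhang,Dongbin Zhao,He Wang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; Carnegie Mellon University; Galbot; CFCS, School of Computer Science, Peking University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: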

Abstract:Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and generalize across diverse camera viewpoints. These limitations hinder the policy’s effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, we propose a 3D visual representation pre-training framework for robotic tasks. By unifying coordinate systems across datasets and introducing random fusion of multi-view point clouds, we mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.

[CV-59] Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification

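[Quick Read]: This paper tackles accurate identification of fungi species in computer vision, a task made difficult by fine-grained inter-species differences and high intra-species variation. The key to the solution is a combination of multiple vision transformer models, data augmentation, weighted sampling, and textual information; generative AI models were also explored for zero-shot classification. Vision-based models outperformed the generative AI models, and the final model outperformed both competition baselines in the FungiCLEF 2025 competition, highlighting the effectiveness of domain-specific pre-training and balanced sampling strategies.

Link: https://arxiv.org/abs/2507.08248
Authors: Jason Kahei Tam,Murilo Gustineli,Anthony Miyaguchi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: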

Abstract:Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-completion evaluation, which suggests that additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at this https URL.

[CV-60] Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework

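[Quick Read]: This paper studies whether the CLIP-EBC framework, originally designed for crowd counting, can be applied to car object counting. The key to the solution is experimental validation on the CARPK dataset, showing that the framework achieves the second-best performance on car counting, together with a K-means weighted clustering method that estimates object positions from predicted density maps, indicating that the framework could be extended to localization tasks.

Link: https://arxiv.org/abs/2507.08240
Authors: Seoik Jung,Taekyung Song
Affiliations: Chungbuk National University; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures, submitted to a computer vision conference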

Abstract:In this paper, we investigate the applicability of the CLIP-EBC framework, originally designed for crowd counting, to car object counting using the CARPK dataset. Experimental results show that our model achieves second-best performance compared to existing methods. In addition, we propose a K-means weighted clustering method to estimate object positions based on predicted density maps, indicating the framework’s potential extension to localization tasks.
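
Estimating positions from a density map with weighted K-means can be sketched as follows: each pixel is a point, its density value is its weight, and centroids are updated as weighted means. The abstract gives no implementation details, so everything here is an assumption: the function names, the farthest-point initialization, and in particular the choice of the cluster count as the rounded sum of the density map.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two (row, col) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd's algorithm in which each point pulls its centroid in proportion to its weight."""
    if k < 1 or not points:
        return []
    # Deterministic initialization: heaviest point first, then farthest-point picks.
    first = max(range(len(points)), key=lambda i: weights[i])
    centers = [points[first]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        sums = [[0.0, 0.0, 0.0] for _ in centers]  # weighted row, weighted col, total weight
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            sums[j][0] += w * p[0]
            sums[j][1] += w * p[1]
            sums[j][2] += w
        # Empty clusters keep their previous center.
        centers = [(s[0] / s[2], s[1] / s[2]) if s[2] > 0 else c
                   for s, c in zip(sums, centers)]
    return centers


def density_to_positions(density):
    """Turn a 2-D density map into estimated (row, col) object positions."""
    points, weights = [], []
    for r, row in enumerate(density):
        for c, value in enumerate(row):
            if value > 0:
                points.append((r, c))
                weights.append(value)
    k = round(sum(weights))  # assumed: the predicted count sets the number of clusters
    return weighted_kmeans(points, weights, k)


# Hypothetical 4x6 density map with two blobs, each integrating to 1.
toy_density = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.25, 0.75],
]
print(sorted(density_to_positions(toy_density)))
```

The weighting matters in the second blob: the estimated position lands closer to the pixel with the larger density value rather than at the midpoint of the two pixels.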

[CV-61] SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

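[Quick Read]: This paper addresses the coupling between instance parameterization and voxel resolution in three-dimensional volumetric instance segmentation, which limits resolution and can introduce voxelization artifacts. The key to the solution is the SurfDist architecture, a modification of the popular StarDist-3D model that decouples instance parameterization dimension from instance voxel resolution and predicts instances as closed surfaces composed of smooth parametric surface patches (specifically bicubic Bézier triangles), giving high-resolution instance predictions without voxelization artifacts.

Link: https://arxiv.org/abs/2507.08223
Authors: Jackson Borchardt,Saul Kato
Affiliations: University of California, San Francisco
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures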

Abstract:We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist is a modification of the popular model architecture StarDist-3D which breaks StarDist-3D’s coupling of instance parameterization dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. We detail SurfDist’s technical implementation and show one synthetic and one real-world dataset for which it outperforms StarDist-3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.
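
To make the surface representation concrete, the sketch below evaluates a point on a standard degree-3 (cubic) Bézier triangle from its ten control points, using Bernstein polynomials in barycentric coordinates. It is assumed here that this standard patch is what the abstract means by bicubic Bézier triangles; the function names and the control-point layout are illustrative and are not taken from the SurfDist implementation.

```python
from math import factorial


def bezier_triangle_point(control, u, v, degree=3):
    """Evaluate a Bezier triangle at barycentric coordinates (u, v, w = 1 - u - v).

    `control` maps index triples (i, j, k) with i + j + k == degree to 3-D points.
    """
    w = 1.0 - u - v
    point = [0.0, 0.0, 0.0]
    for (i, j, k), b in control.items():
        coeff = factorial(degree) / (factorial(i) * factorial(j) * factorial(k))
        basis = coeff * (u ** i) * (v ** j) * (w ** k)  # Bernstein basis value
        for axis in range(3):
            point[axis] += basis * b[axis]
    return tuple(point)


def flat_patch():
    """Ten control points of a flat cubic patch lying in the plane z = 0."""
    return {(i, j, 3 - i - j): (i / 3, j / 3, 0.0)
            for i in range(4) for j in range(4 - i)}


patch = flat_patch()
patch[(1, 1, 1)] = (1 / 3, 1 / 3, 1.0)  # lift the interior control point to curve the patch

print(bezier_triangle_point(patch, 1.0, 0.0))      # a corner: equals its control point
print(bezier_triangle_point(patch, 1 / 3, 1 / 3))  # centre of the patch, pulled upward
```

Because the patch is a polynomial in (u, v), it can be sampled at any density, which is consistent with the abstract’s point that predictions can be upsampled without voxelization artifacts.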

[CV-62] HNOSeg-XS: Extremely Small Hartley Neural Operator for Efficient and Resolution-Robust 3D Image Segmentation

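[Quick Read]: This paper addresses resolution sensitivity in medical image segmentation: convolutional neural networks (CNNs) and transformers often need reduced input sizes, and models trained that way can perform worse when applied to higher-resolution images. The key to the solution is the resolution-robust HNOSeg-XS architecture, which models segmentation with learnable partial differential equations through the Fourier neural operator, a model with zero-shot super-resolution capability. Replacing the Fourier transform with the Hartley transform and reformulating the problem in the frequency domain makes the model resolution-robust, fast, memory-efficient, and extremely parameter-efficient.

Link: https://arxiv.org/abs/2507.08205
Authors: Ken C. L. Wong,Hongzhi Wang,Tanveer Syeda-Mahmood
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was accepted by IEEE TMI 2025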

Abstract:In medical image segmentation, convolutional neural networks (CNNs) and transformers are dominant. For CNNs, given the local receptive fields of convolutional layers, long-range spatial correlations are captured through consecutive convolutions and pooling. However, as the computational cost and memory footprint can be prohibitively large, 3D models can only afford fewer layers than 2D models with reduced receptive fields and abstract levels. For transformers, although long-range correlations can be captured by multi-head attention, its quadratic complexity with respect to input size is computationally demanding. Therefore, either model may require input size reduction to allow more filters and layers for better segmentation. Nevertheless, given their discrete nature, models trained with patch-wise training or image downsampling may produce suboptimal results when applied on higher resolutions. To address this issue, here we propose the resolution-robust HNOSeg-XS architecture. We model image segmentation by learnable partial differential equations through the Fourier neural operator which has the zero-shot super-resolution property. By replacing the Fourier transform by the Hartley transform and reformulating the problem in the frequency domain, we created the HNOSeg-XS model, which is resolution robust, fast, memory efficient, and extremely parameter efficient. When tested on the BraTS’23, KiTS’23, and MVSeg’23 datasets with a Tesla V100 GPU, HNOSeg-XS showed its superior resolution robustness with fewer than 34.7k model parameters. It also achieved the overall best inference time ( 0.24 s) and memory efficiency ( 1.8 GiB) compared to the tested CNN and transformer models.
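
For readers unfamiliar with the Hartley transform, the sketch below implements a direct one-dimensional discrete Hartley transform (DHT). It is a plain O(n²) illustration, not the HNOSeg-XS implementation, and the function names are assumptions. Two well-known properties are visible in it: the output of a real input is real, and applying the transform twice returns the input scaled by n.

```python
import math


def dht(x):
    """Discrete Hartley transform: H[k] = sum_t x[t] * (cos(2*pi*k*t/n) + sin(2*pi*k*t/n))."""
    n = len(x)
    return [
        sum(x[t] * (math.cos(2 * math.pi * k * t / n) + math.sin(2 * math.pi * k * t / n))
            for t in range(n))
        for k in range(n)
    ]


def inverse_dht(h):
    """The DHT is its own inverse up to a factor of 1/n."""
    n = len(h)
    return [value / n for value in dht(h)]


signal = [1.0, 2.0, 0.0, -1.0]
spectrum = dht(signal)             # real-valued, unlike the complex Fourier spectrum
recovered = inverse_dht(spectrum)
print(spectrum)
print(recovered)
```

A practical implementation would compute the same quantity from a fast Fourier transform instead of the double loop used here.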

[CV-63] An Embedded Real-time Object Alert System for Visually Impaired: A Monocular Depth Estimation based Approach through Computer Vision

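[Quick Read]: This paper addresses the significant challenges that visually impaired people face during daily commutes in the cities of Bangladesh because of the many obstructions on every path; with road accidents frequent, a system is needed that warns of nearby objects in advance. The key to the solution is a novel alert system that uses transfer learning to train depth estimation and object detection models and combines the two for effective obstacle recognition. The models are optimized with quantization techniques to make them lightweight and efficient, so that they can be deployed easily on embedded systems.

Link: https://arxiv.org/abs/2507.08165
Authors: Jareen Anjom,Rashik Iram Chowdhury,Tarbia Hasan,Md. Ishan Arefin Hossain
Affiliations: North South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: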

Abstract:Visually impaired people face significant challenges in their day-to-day commutes in the urban cities of Bangladesh due to the vast number of obstructions on every path. With many injuries taking place through road accidents on a daily basis, it is paramount for a system to be developed that can alert the visually impaired of objects at close distance beforehand. To overcome this issue, a novel alert system is proposed in this research to assist the visually impaired in commuting through these busy streets without colliding with any objects. The proposed system can alert the individual to objects that are present at a close distance. It utilizes transfer learning to train models for depth estimation and object detection, and combines both models to introduce a novel system. The models are optimized through the utilization of quantization techniques to make them lightweight and efficient, allowing them to be easily deployed on embedded systems. The proposed solution achieved a lightweight real-time depth estimation and object detection model with an mAP50 of 0.801.
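
One way to combine an object detector with a depth estimator for proximity alerts is to read the depth inside each detected box and alert when it falls below a threshold. The abstract does not describe how the two models are combined, so this is only a sketch: the function name, the median-depth rule, the 2 m threshold, and the assumption that the depth map is in metres are all hypothetical.

```python
from statistics import median


def close_object_alerts(detections, depth_map, max_distance_m=2.0):
    """Return (label, distance) for detections whose median depth is within the threshold.

    `detections` holds (label, (x0, y0, x1, y1)) boxes in pixel coordinates, with
    x1 and y1 exclusive; `depth_map[y][x]` is an assumed metric depth in metres.
    """
    alerts = []
    for label, (x0, y0, x1, y1) in detections:
        patch = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        if not patch:
            continue  # skip degenerate boxes
        distance = median(patch)  # the median is less sensitive to background pixels
        if distance <= max_distance_m:
            alerts.append((label, distance))
    return sorted(alerts, key=lambda alert: alert[1])  # nearest object first


# Hypothetical 4x4 depth map (metres) and two detections.
toy_depth = [
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
]
toy_detections = [("car", (0, 0, 2, 2)), ("pole", (2, 0, 4, 2))]
print(close_object_alerts(toy_detections, toy_depth))
```

Monocular depth models often predict relative rather than metric depth, in which case the threshold would need calibration; that step is outside this sketch.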

[CV-64] Adaptive Diffusion Denoised Smoothing: Certified Robustness via Randomized Smoothing with Differentially Private Guided Denoising Diffusion

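Quick read: The paper addresses robustness certification of a vision model's predictions against adversarial examples, aiming for a certification method that adapts to the input. The key to the solution is to reinterpret a guided denoising diffusion model as a sequence of adaptive Gaussian Differentially Private (GDP) mechanisms that are composed through a GDP privacy filter, which allows the end-to-end robustness of the guided denoising process to be analyzed and yields a provable certification.

Link: https://arxiv.org/abs/2507.08163
Authors: Frederick Shpilevskiy, Saiyue Lyu, Krishnamurthy Dj Dvijotham, Mathias Lécuyer, Pierre-André Noël
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: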

Abstract:We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
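For context, the standard non-adaptive randomized smoothing bound that this line of work extends certifies an $\ell_2$ radius of sigma times the inverse normal CDF of a lower bound on the top-class probability. The sketch below computes only that baseline radius; it does not implement the paper's GDP privacy filter, and the function name is illustrative.

```python
from statistics import NormalDist

def certified_l2_radius(p_a_lower, sigma):
    """L2 radius certified by standard Gaussian randomized smoothing.

    p_a_lower: lower confidence bound on the probability that the base
               classifier returns the top class under N(0, sigma^2 I) noise.
    sigma:     standard deviation of the smoothing noise.
    Returns 0.0 when the bound does not exceed 1/2 (no certificate).
    """
    if p_a_lower >= 1.0:
        raise ValueError("p_a_lower must be strictly below 1")
    if p_a_lower <= 0.5:
        return 0.0
    # Radius = sigma * Phi^{-1}(p_a_lower), with Phi the standard normal CDF
    return sigma * NormalDist().inv_cdf(p_a_lower)
```

A larger sigma buys a larger radius for the same confidence, but usually lowers the confidence itself, which is the trade-off that input-adaptive schemes try to ease.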

[CV-65] Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction

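Quick read: The paper addresses the occlusions and temporal inconsistencies that arise when reconstructing dynamic human-object interactions from monocular video. Traditional 3D reconstruction methods typically assume static objects or fully visible dynamic subjects, assumptions that break down in real scenes with mutual occlusion. The key to the solution is to use amodal completion to infer the complete structure of partially occluded regions and to integrate temporal context for coherence across the video sequence, incrementally refining and stabilizing the reconstruction. This template-free strategy does not rely on predefined models and significantly improves the recovery of intricate details in dynamic scenes.

Link: https://arxiv.org/abs/2507.08137
Authors: Hyungjun Doh, Dong In Lee, Seunggeun Chi, Pin-Hao Huang, Kwonjoon Lee, Sangpil Kim, Karthik Ramani
Affiliations: Purdue University; Korea University; Honda Research Institute USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: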

Abstract:We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated, particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.

[CV-66] RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration ICCV2025

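Quick read: The paper addresses two problems: optimization-based 3D Gaussian Splatting (3DGS) methods perform poorly on sparse views because of limited prior knowledge, and feed-forward Gaussian methods are constrained by their input formats and struggle to incorporate more views. The key to the solution is the RegGS framework, which reconstructs sparse views by aligning the local 3D Gaussians produced by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, RegGS uses an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein ($\text{MW}_2$) distance, which serves as the alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space, and designs a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables coarse-to-fine registration that accurately estimates camera poses and aligns the scene.

Link: https://arxiv.org/abs/2507.08136
Authors: Chong Cheng, Yu Hu, Sicheng Yu, Beizhen Zhao, Zijian Wang, Hao Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025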

Abstract:3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein ($\text{MW}_2$) distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: this https URL.
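The idea of an entropy-regularized transport distance between Gaussian mixtures can be sketched in a few lines. The version below makes strong simplifying assumptions that are not from the paper: covariances are diagonal (so the 2-Wasserstein distance between components has a simple closed form), no Sim(3) alignment is performed, and all function names are illustrative.

```python
import math

def w2_squared_diag(mean_a, var_a, mean_b, var_b):
    """Squared 2-Wasserstein distance between Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term

def sinkhorn_plan(cost, weights_a, weights_b, epsilon=0.1, iterations=200):
    """Entropy-regularized transport plan between two discrete distributions."""
    n, m = len(weights_a), len(weights_b)
    # Gibbs kernel; very large cost / epsilon ratios would underflow to zero
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals
        u = [weights_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [weights_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def mixture_distance(gmm_a, gmm_b, epsilon=0.1):
    """Transport cost between two GMMs given as lists of (weight, mean, var)."""
    cost = [[w2_squared_diag(ma, va, mb, vb) for _, mb, vb in gmm_b]
            for _, ma, va in gmm_a]
    plan = sinkhorn_plan(cost, [w for w, _, _ in gmm_a], [w for w, _, _ in gmm_b], epsilon)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(gmm_a)) for j in range(len(gmm_b)))
```

Smaller values of `epsilon` bring the result closer to the unregularized transport cost but slow convergence, so a practical implementation would likely work in the log domain.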

[CV-67] An Object-Based Deep Learning Approach for Building Height Estimation from Single SAR Images

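Quick read: The paper aims to accurately estimate building heights from very high resolution (VHR) synthetic aperture radar (SAR) imagery, which is crucial for many urban applications. The key to the solution is a deep learning (DL) method that uses an object-based regression strategy built on bounding box detection to estimate building heights automatically from single VHR COSMO-SkyMed images. The method is trained and evaluated on a multi-continental dataset of eight geographically diverse cities, and a cross-validation strategy assesses how well the model generalizes to out-of-distribution (OOD) scenarios.

Link: https://arxiv.org/abs/2507.08096
Authors: Babak Memar, Luigi Russo, Silvia Liberata Ullo, Paolo Gamba
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: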

Abstract:Accurate estimation of building heights using very high resolution (VHR) synthetic aperture radar (SAR) imagery is crucial for various urban applications. This paper introduces a Deep Learning (DL)-based methodology for automated building height estimation from single VHR COSMO-SkyMed images: an object-based regression approach based on bounding box detection followed by height estimation. This model was trained and evaluated on a unique multi-continental dataset comprising eight geographically diverse cities across Europe, North and South America, and Asia, employing a cross-validation strategy to explicitly assess out-of-distribution (OOD) generalization. The results demonstrate highly promising performance, particularly on European cities where the model achieves a Mean Absolute Error (MAE) of approximately one building story (2.20 m in Munich), significantly outperforming recent state-of-the-art methods in similar OOD scenarios. Despite the increased variability observed when generalizing to cities in other continents, particularly in Asia with its distinct urban typologies and prevalence of high-rise structures, this study underscores the significant potential of DL for robust cross-city and cross-continental transfer learning in building height estimation from single VHR SAR data.

[CV-68] PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning ACM-MM2025

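Quick read: The paper addresses the high training cost and low inference efficiency that the large parameter counts of multimodal large language models (MLLMs) impose on unified multimodal retrieval (UMR) in real-world applications. The key to the solution is PUMA, a layer-pruned language model for efficient unified multimodal retrieval. On the structural side, Layer-Pruned Self-Distillation reduces the parameter count while preserving representation capability; on the learning side, the Modality-Adaptive Contrastive Learning Loss (MAC-Loss) distinguishes harder from easier in-batch negatives and applies different temperature strategies to improve learning efficiency.

Link: https://arxiv.org/abs/2507.08064
Authors: Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2025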

Abstract:As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
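The abstract describes MAC-Loss only at a high level, so the following is a guess at its general shape rather than the paper's formula: an InfoNCE-style loss in which intra-modality negatives and inter-modality negatives are scaled by different temperatures. The temperature values, the choice to scale the positive with the intra-modality temperature, and the function name are all assumptions.

```python
import math

def modality_adaptive_nce(pos_sim, neg_sims, neg_same_modality,
                          tau_intra=0.05, tau_inter=0.1):
    """InfoNCE-style loss with a separate temperature per negative group.

    pos_sim:           similarity between the query and its positive target.
    neg_sims:          similarities between the query and in-batch negatives.
    neg_same_modality: booleans, True when a negative shares the target
                       modality (treated here as the harder group).
    """
    # Assumption: the positive uses the intra-modality temperature
    logits = [pos_sim / tau_intra]
    for sim, same in zip(neg_sims, neg_same_modality):
        logits.append(sim / (tau_intra if same else tau_inter))
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[0]
```

With these defaults, a negative of the same similarity contributes more to the loss when it comes from the target modality, which matches the abstract's description of that group as the harder one.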

[CV-69] The relative importance of being Gaussian

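Quick read: The paper examines how diffusion models designed around properties of Gaussian random variables perform at denoising when the noise is not Gaussian. The key question is whether the algorithm's robustness and effectiveness suffer when the noise differs sharply from the Gaussian distribution assumed in its design. The authors test experimentally whether the unchanged algorithm remains reliable with uniformly distributed noise, Beta-distributed noise, or noise that is a random superposition of two Gaussians with very different variances, which informs the algorithm's range of applicability and directions for improvement.

Link: https://arxiv.org/abs/2507.08059
Authors: F. Alberto Grünbaum, Tondgi Xu
Affiliations: University of California, Berkeley; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR)
Comments: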

Abstract:The remarkable results for denoising in computer vision using diffusion models given in \cite{SDWMG,HJA,HHG} yield a robust mathematical justification for algorithms based on crucial properties of a sequence of independent Gaussian N(0,1) random variables. In particular, the derivations use the fact that a Gaussian distribution is determined by its mean and variance and that the sum of two Gaussians is another Gaussian. The issue raised in this short note is the following: suppose we use the algorithm without any changes but replace the nature of the noise and use, for instance, uniformly distributed noise or noise with a Beta distribution, or noise which is a random superposition of two Gaussians with very different variances. One could, of course, try to modify the algorithm keeping in mind the nature of the noise, but this is not what we do. Instead we study the performance of the algorithm when used with noise that is very far in nature from the Gaussian case, where it is designed to work well. Usually these algorithms are implemented on very powerful computers. Our experiments are all carried out on a small laptop and for the smallest possible image size. Exploring how our observations are confirmed or changed when dealing with different situations remains an interesting challenge.
MSC classes: 68T05, 68T45, 60J60, 82C22, 82C31. Submitted (v1): Thu, 10 Jul 2025 14:51:39 UTC.
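The three alternative noise families named in the abstract can be sampled with the standard library alone. The sketch below rescales each to zero mean and unit variance so that it could stand in for N(0,1) noise; that normalization, the Beta shape parameters, and the two mixture standard deviations are assumptions for illustration, since the abstract does not state them.

```python
import math
import random

def standardized_noise(kind, rng):
    """Draw one zero-mean, unit-variance sample from the named noise family."""
    if kind == "gaussian":
        return rng.gauss(0.0, 1.0)
    if kind == "uniform":
        half_width = math.sqrt(3.0)  # U(-sqrt(3), sqrt(3)) has variance 1
        return rng.uniform(-half_width, half_width)
    if kind == "beta":
        a, b = 2.0, 5.0  # hypothetical shape parameters (skewed distribution)
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1.0))
        return (rng.betavariate(a, b) - mean) / math.sqrt(var)
    if kind == "mixture":
        small, large = 0.1, 2.0  # hypothetical standard deviations
        # Pick one of the two Gaussians with equal probability
        sigma = small if rng.random() < 0.5 else large
        total_var = 0.5 * small ** 2 + 0.5 * large ** 2
        return rng.gauss(0.0, sigma) / math.sqrt(total_var)
    raise ValueError(f"unknown noise kind: {kind}")
```

Matching the first two moments isolates the effect of the distribution's shape, which is one reasonable way to set up the comparison the note describes.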

[CV-70] Lightweight Cloud Masking Models for On-Board Inference in Hyperspectral Imaging

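Quick read: The paper addresses cloud and cloud shadow masking in hyperspectral satellite imaging, a key preprocessing step for extracting high-quality, analysis-ready data. The key to the solution is the use of lightweight artificial intelligence (AI) models, in particular a convolutional neural network (CNN) with feature reduction, which maintains high accuracy while offering low storage requirements and fast inference on both CPUs and GPUs. Variants with at most 597 trainable parameters give the best trade-off among deployment feasibility, accuracy, and computational efficiency.

Link: https://arxiv.org/abs/2507.08052
Authors: Mazen Ali, António Pereira, Fabio Gentile, Aser Cortines, Sam Mugel, Román Orús, Stelios P. Neophytides, Michalis Mavrovouniotis
Affiliations: Multiverse Computing; ERATOSTHENES Centre of Excellence; Cyprus University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: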

Abstract:Cloud and cloud shadow masking is a crucial preprocessing step in hyperspectral satellite imaging, enabling the extraction of high-quality, analysis-ready data. This study evaluates various machine learning approaches, including gradient boosting methods such as XGBoost and LightGBM as well as convolutional neural networks (CNNs). All boosting and CNN models achieved accuracies exceeding 93%. Among the investigated models, the CNN with feature reduction emerged as the most efficient, offering a balance of high accuracy, low storage requirements, and rapid inference times on both CPUs and GPUs. Variations of this version, with only up to 597 trainable parameters, demonstrated the best trade-off in terms of deployment feasibility, accuracy, and computational efficiency. These results demonstrate the potential of lightweight artificial intelligence (AI) models for real-time hyperspectral image processing, supporting the development of on-board satellite AI systems for space-based applications.

[CV-71] A Hybrid Multilayer Extreme Learning Machine for Image Classification with an Application to Quadcopters

Quick read: The paper aims at efficient classification of complex natural signals such as images, in particular active image classification for unmanned aerial vehicles (UAVs). The key to the solution is a Hybrid Multilayer Extreme Learning Machine (HML-ELM) based on the ELM-based autoencoder (ELM-AE) and Interval Type-2 fuzzy logic theory. Its hierarchical structure performs self-taught feature extraction followed by supervised feature classification, using an improved Simplified Interval Type-2 Fuzzy ELM (SIT2-FELM) to raise classification efficiency and accuracy.

Link: https://arxiv.org/abs/2507.08047
Authors: Rolando A. Hernandez-Hernandez, Adrian Rubio-Solis
Affiliations: Laboratory of Submarine Robotics (LSR); Center for Engineering and Industrial Development (CIDESI); Hamlyn Centre for Robotic Surgery; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 10 figures, 3 tables

Abstract:Multilayer Extreme Learning Machine (ML-ELM) and its variants have proven to be an effective technique for the classification of different natural signals such as audio, video, acoustic signals and images. In this paper, a Hybrid Multilayer Extreme Learning Machine (HML-ELM) that is based on an ELM-based autoencoder (ELM-AE) and Interval Type-2 fuzzy logic theory is proposed for active image classification and applied to Unmanned Aerial Vehicles (UAVs). The proposed methodology is a hierarchical ELM learning framework that consists of two main phases: 1) self-taught feature extraction and 2) supervised feature classification. First, unsupervised multilayer feature encoding is achieved by stacking a number of ELM-AEs, in which input data is projected into a number of high-level representations. In the second phase, the final features are classified using a novel Simplified Interval Type-2 Fuzzy ELM (SIT2-FELM) with a fast output reduction layer based on the SC algorithm, an improved version of the Center of Sets Type Reducer without Sorting Requirement (COSTRWSR) algorithm. To validate the efficiency of the HML-ELM, two types of image classification experiments are conducted. First, the HML-ELM is applied to a number of benchmark image classification problems. Second, a number of real experiments on the active classification and transport of four different objects between two predefined locations using a UAV are carried out. Experiments demonstrate that the proposed HML-ELM delivers superior efficiency compared to other similar methodologies such as ML-ELM, Multilayer Fuzzy Extreme Learning Machine (ML-FELM) and ELM.

[CV-72] ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints ICCV2025

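Quick read: The paper addresses the slow convergence and suboptimal final performance caused by randomly initialized LoRA weight matrices in parameter-efficient fine-tuning with low-rank adapters (LoRA). The key to the solution is a data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA), which casts LoRA initialization as a domain shift problem and derives a closed-form estimate of the LoRA weights from multiple constraints relating pre-training and fine-tuning activations. The estimate depends on the pre-trained weights and the fine-tuning activation vectors, so it yields effective weights without any training at initialization, and it supports flexible variable-rank initialization of the up and down matrices.

Link: https://arxiv.org/abs/2507.08044
Authors: Debasmit Das, Hyoungwoo Park, Munawar Hayat, Seokeon Choi, Sungrack Yun, Fatih Porikli
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025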

Abstract:Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning, using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain shift problem where we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of LoRA weights that depends on pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices with proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.

[CV-73] Towards Evaluating Robustness of Prompt Adherence in Text to Image Models

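Quick read: The paper addresses the limited reliability of Text-to-Image models in following input text prompts, especially when generating images that conform to specified factors of variation. The key to the solution is a comprehensive evaluation framework together with a novel dataset for assessing the robustness of these models. The study also introduces a pipeline in which gpt-4o generates text descriptions of the ground-truth images, the descriptions are passed to the Text-to-Image models to generate artificial images, and gpt-4o then describes those images again for comparison, quantifying the gap between the generated images and the original prompts.

Link: https://arxiv.org/abs/2507.08039
Authors: Sujith Vemishetty, Advitiya Arora, Anupama Sharma
Affiliations: Synechron
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: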

Abstract:The advancements in the domain of LLMs in recent years have surprised many, showcasing their remarkable capabilities and diverse applications. Their potential applications in various real-world scenarios have led to significant research on their reliability and effectiveness. On the other hand, multimodal LLMs and Text-to-Image models have only recently gained prominence, especially when compared to text-only LLMs. Their reliability remains constrained due to insufficient research on assessing their performance and robustness. This paper aims to establish a comprehensive evaluation framework for Text-to-Image models, concentrating particularly on their adherence to prompts. We created a novel dataset that aimed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our evaluation studies present findings on three variants of Stable Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro 1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions generated by the gpt-4o model for our ground-truth images, which are then used to generate artificial images by passing these descriptions to the Text-to-Image models. We then pass these generated images again through gpt-4o using the same system prompt and compare the variation between the two descriptions. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location. We also show, using pre-trained VAEs on our dataset, that they fail to generate images that follow our input dataset distribution.

[CV-74] SSSUMO: Real-Time Semi-Supervised Submovement Decomposition

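Quick read: The paper aims to solve the problems of limited reconstruction accuracy, high computational cost, and difficult validation in submovement decomposition, which stem mainly from the difficulty of obtaining hand-labeled data. The key to the solution is SSSUMO, a semi-supervised deep learning method that starts from synthetic data generated from minimum-jerk principles and refines the model through iterative adaptation to unlabeled human movement data. A fully convolutional architecture with differentiable reconstruction significantly improves performance on both synthetic and diverse human motion datasets and delivers real-time operation at the millisecond level.

Link: https://arxiv.org/abs/2507.08028
Authors: Evgenii Rudakov, Jonathan Shock, Otto Lappi, Benjamin Ultan Cowley
Affiliations: University of Helsinki
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: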

Abstract:This paper introduces SSSUMO, a semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation, due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real-time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model’s effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at this https URL.
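The minimum-jerk principle mentioned in the abstract gives each submovement a bell-shaped velocity profile, and a movement is modelled as a sum of such profiles. A minimal sketch of that forward model follows; the parameterization by onset, duration, and amplitude is a common convention assumed here, and the function names are illustrative rather than taken from the released code.

```python
def minimum_jerk_velocity(t, onset, duration, amplitude):
    """Velocity of a single minimum-jerk submovement at time t.

    The profile is (amplitude / duration) * 30 * tau^2 * (1 - tau)^2 with
    tau = (t - onset) / duration, and zero outside the submovement.
    """
    tau = (t - onset) / duration
    if tau <= 0.0 or tau >= 1.0:
        return 0.0
    return amplitude / duration * 30.0 * tau ** 2 * (1.0 - tau) ** 2

def reconstruct_velocity(times, submovements):
    """Sum overlapping submovements, each given as (onset, duration, amplitude)."""
    return [sum(minimum_jerk_velocity(t, *s) for s in submovements) for t in times]
```

Each profile integrates to its amplitude, so the fitted parameters translate directly into the displacement contributed by each submovement. Synthetic training data of the kind the abstract describes could be produced by sampling such parameter triples at random.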

[CV-75] Development of a Canada-Wide Morphology Map for the ITU-R P.1411 Propagation Model

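Quick read: The paper addresses accurate path loss estimation across different environment types, specifically for outdoor short-range propagation at frequencies from 300 MHz to 100 GHz. The key to the solution is a machine learning approach that automates classification against the qualitative environment-type descriptors in the ITU-R P.1411-12 propagation model guidelines, producing a Canada-wide morphology map that improves the accuracy of path loss estimates.

Link: https://arxiv.org/abs/2507.08026
Authors: Jennifer P. T. Nguyen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: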

Abstract:This paper outlines the development of a Canada-wide morphology map classifying regions into residential, urban low-rise, and urban high-rise environments, following the ITU-R P.1411-12 propagation model guidelines. To address the qualitative nature of the environment-type descriptors found in the Recommendation, a machine learning approach is employed to automate the classification process. Extensive experimentation optimized classification accuracy, resulting in a Canada-wide morphology map that ensures more accurate path loss estimations for outdoor short-range propagation at frequencies ranging from 300 MHz to 100 GHz.

[CV-76] Self-Consistency in Vision-Language Models for Precision Agriculture: Multi-Response Consensus for Crop Disease Management

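Quick read: The paper addresses the underperformance of existing vision-language models (VLMs) at crop disease identification and treatment recommendation in precision agriculture. The key to the solution is a domain-aware framework that combines prompt-based expert evaluation with self-consistency mechanisms to improve VLM reliability in agricultural image processing. It comprises (1) a prompt-based evaluation protocol that configures a language model as an expert plant pathologist for scalable assessment of image analysis outputs, and (2) a cosine-consistency self-voting mechanism that generates multiple candidate responses and uses domain-adapted embeddings to select the most semantically coherent diagnosis.

Link: https://arxiv.org/abs/2507.08024
Authors: Mihir Gupta, Abhay Mangla, Ross Greer, Pratik Desai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: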

Abstract:Precision agriculture relies heavily on accurate image analysis for crop disease identification and treatment recommendation, yet existing vision-language models (VLMs) often underperform in specialized agricultural domains. This work presents a domain-aware framework for agricultural image processing that combines prompt-based expert evaluation with self-consistency mechanisms to enhance VLM reliability in precision agriculture applications. We introduce two key innovations: (1) a prompt-based evaluation protocol that configures a language model as an expert plant pathologist for scalable assessment of image analysis outputs, and (2) a cosine-consistency self-voting mechanism that generates multiple candidate responses from agricultural images and selects the most semantically coherent diagnosis using domain-adapted embeddings. Applied to maize leaf disease identification from field images using a fine-tuned PaliGemma model, our approach improves diagnostic accuracy from 82.2% to 87.8%, symptom analysis from 38.9% to 52.2%, and treatment recommendation from 27.8% to 43.3% compared to standard greedy decoding. The system remains compact enough for deployment on mobile devices, supporting real-time agricultural decision-making in resource-constrained environments. These results demonstrate significant potential for AI-driven precision agriculture tools that can operate reliably in diverse field conditions.
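The cosine-consistency self-voting step could plausibly work as sketched below: embed every candidate response, score each by its mean cosine similarity to the others, and keep the top scorer. This is an interpretation of the abstract, not the authors' code; the embeddings are assumed to come from some external model, and the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_consensus(responses, embeddings):
    """Return the response whose embedding agrees most with the others on average."""
    best_index, best_score = 0, float("-inf")
    for i, current in enumerate(embeddings):
        others = [cosine(current, e) for j, e in enumerate(embeddings) if j != i]
        score = sum(others) / len(others) if others else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return responses[best_index]
```

Unlike majority voting over exact strings, this selection works for free-text diagnoses where no two sampled responses are worded identically.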

[CV-77] CuriosAI Submission to the EgoExo4D Proficiency Estimation Challenge 2025 CVPR

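Quick read: The paper addresses multi-view skill assessment, that is, accurately estimating how proficiently a person performs a task as seen from different viewpoints. The key to the solution is scenario-conditioned modeling, realized in two methods: a multi-task learning framework based on Sapiens-2B that jointly predicts proficiency and scenario labels, and a two-stage pipeline that combines zero-shot scenario recognition with view-specific VideoMAE classifiers. The two-stage method performs better, which supports the effectiveness of scenario-conditioned modeling for proficiency estimation.

Link: https://arxiv.org/abs/2507.08022
Authors: Hayato Tanoue, Hiroki Nishihara, Yuma Suzuki, Takayuki Hori, Hiroki Takushima, Aiswariya Manojkumar, Yuki Shibata, Mitsuru Takeda, Fumika Beppu, Zhao Hengwei, Yuto Kanda, Daichi Yamaga
Affiliations: SoftBank Corp. AI & Data Technology Planning Division
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 2nd place solution for the EgoExo4D Proficiency Estimation Challenge at the CVPR EgoVis Workshop 2025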

Abstract:This report presents the CuriosAI team’s submission to the EgoExo4D Proficiency Estimation Challenge at CVPR 2025. We propose two methods for multi-view skill assessment: (1) a multi-task learning framework using Sapiens-2B that jointly predicts proficiency and scenario labels (43.6% accuracy), and (2) a two-stage pipeline combining zero-shot scenario recognition with view-specific VideoMAE classifiers (47.8% accuracy). The superior performance of the two-stage approach demonstrates the effectiveness of scenario-conditioned modeling for proficiency estimation.

[CV-78] A Versatile Dataset of Mouse and Eye Movements on Search Engine Results Pages

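Quick read: The paper addresses limitations in studying user attention and purchasing behavior on search engine result pages (SERPs): prior work relies on mouse movements as a low-cost, large-scale behavioral proxy and on self-reported ground-truth labels collected after the task, which can be inaccurate and biased. The key to the solution is to use an eye tracker to construct an objective ground truth of continuous visual attention, giving behavioral analysis a more precise basis.

Link: https://arxiv.org/abs/2507.08003
Authors: Kayhan Latifzadeh, Jacek Gwizdka, Luis A. Leiva
Affiliations: University of Luxembourg; University of Texas at Austin
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: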

Abstract:We contribute a comprehensive dataset to study user attention and purchasing behavior on Search Engine Result Pages (SERPs). Previous work has relied on mouse movements as a low-cost large-scale behavioral proxy but has also relied on self-reported ground-truth labels, collected post-task, which can be inaccurate and prone to biases. To address this limitation, we use an eye tracker to construct an objective ground-truth of continuous visual attention. Our dataset comprises 2,776 transactional queries on Google SERPs, collected from 47 participants, and includes: (1) HTML source files, with CSS and images; (2) rendered SERP screenshots; (3) eye movement data; (4) mouse movement data; (5) bounding boxes of direct display and organic advertisements; and (6) scripts for further preprocessing the data. In this paper we provide an overview of the dataset and baseline experiments (classification tasks) that can inspire researchers about the different possibilities for future work.

[CV-79] Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models ICML2025

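Quick read: The paper addresses the high computational complexity and insufficient dataset sizes that hinder building foundation models for volumetric imaging data such as magnetic resonance imaging (MRI). The key to the solution is Raptor (Random Planar Tensor Reduction), a train-free method that uses a pretrained 2D foundation model to extract visual tokens from individual cross-sections of a medical volume and compresses them spatially with random projections, significantly reducing computational complexity while retaining semantic information.

Link: https://arxiv.org/abs/2507.08254
Authors: Ulzee An, Moonseong Jeong, Simon A. Lee, Aditya Gorla, Yuzhe Yang, Sriram Sankararaman
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 10 figures, accepted to ICML 2025. The first two authors contributed equally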

Abstract:Current challenges in developing foundational models for volumetric imaging data, such as magnetic resonance imaging (MRI), stem from the computational complexity of training state-of-the-art architectures in high dimensions and curating sufficiently large datasets of volumes. To address these challenges, we introduce Raptor (Random Planar Tensor Reduction), a train-free method for generating semantically rich embeddings for volumetric data. Raptor leverages a frozen 2D foundation model, pretrained on natural images, to extract visual tokens from individual cross-sections of medical volumes. These tokens are then spatially compressed using random projections, significantly reducing computational complexity while retaining semantic information. Extensive experiments on ten diverse medical volume tasks verify the superior performance of Raptor over state-of-the-art methods, including those pretrained exclusively on medical volumes (+3% SuPreM, +6% MISFM, +10% Merlin, +13% VoCo, and +14% SLIViT), while entirely bypassing the need for costly training. Our results highlight the effectiveness and versatility of Raptor as a foundation for advancing deep learning-based methods for medical volumes.

[CV-80] Depth-Sequence Transformer (DST) for Segment-Specific ICA Calcification Mapping on Non-Contrast CT

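Quick read: The paper addresses a limitation of total intracranial carotid artery calcification (ICAC) volume as a stroke biomarker: the aggregate metric ignores the significant influence of plaque location on prognosis and procedural risk. Segment-specific quantification has been technically difficult because conventional 3D models process downsampled volumes or isolated patches, lose global context, and therefore suffer from anatomical ambiguity and unreliable localization of key anatomical landmarks. The key to the solution is to recast the 3D problem as a parallel probabilistic landmark localization task along the 1D axial dimension and to introduce the Depth-Sequence Transformer (DST), a framework that processes full-resolution CT volumes as sequences of 2D slices and learns to predict six independent probability distributions that pinpoint the key landmarks, enabling accurate and robust segment-specific ICAC analysis.

Link: https://arxiv.org/abs/2507.08214
Authors: Xiangjian Hou, Ebru Yaman Akcicek, Xin Wang, Kazem Hashemizadeh, Scott Mcnally, Chun Yuan, Xiaodong Ma
Affiliations: University of Utah; University of Washington
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: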

Abstract:While total intracranial carotid artery calcification (ICAC) volume is an established stroke biomarker, growing evidence shows this aggregate metric ignores the critical influence of plaque location, since calcification in different segments carries distinct prognostic and procedural risks. However, a finer-grained, segment-specific quantification has remained technically infeasible. Conventional 3D models are forced to process downsampled volumes or isolated patches, sacrificing the global context required to resolve anatomical ambiguity and render reliable landmark localization. To overcome this, we reformulate the 3D challenge as a Parallel Probabilistic Landmark Localization task along the 1D axial dimension. We propose the Depth-Sequence Transformer (DST), a framework that processes full-resolution CT volumes as sequences of 2D slices, learning to predict N=6 independent probability distributions that pinpoint key anatomical landmarks. Our DST framework demonstrates exceptional accuracy and robustness. Evaluated on a 100-patient clinical cohort with rigorous 5-fold cross-validation, it achieves a Mean Absolute Error (MAE) of 0.1 slices, with 96% of predictions falling within a ±1 slice tolerance. Furthermore, to validate its architectural power, the DST backbone establishes the best result on the public Clean-CC-CCII classification benchmark under an end-to-end evaluation protocol. Our work delivers the first practical tool for automated segment-specific ICAC analysis. The proposed framework provides a foundation for further studies on the role of location-specific biomarkers in diagnosis, prognosis, and procedural planning. Our code will be made publicly available.
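The evaluation described in the abstract, one probability distribution over slice indices per landmark scored by MAE in slices and the share of predictions within a one-slice tolerance, can be sketched as follows. Taking the argmax of each distribution is an assumed decoding rule, and the function names are illustrative.

```python
def localize_landmarks(distributions):
    """Pick, for each landmark, the slice index with the highest probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in distributions]

def landmark_metrics(predicted, reference, tolerance=1):
    """Return (mean absolute error in slices, fraction within the tolerance)."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    mae = sum(errors) / len(errors)
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mae, within
```

Because each landmark has its own distribution over the axial axis, the six predictions can be decoded independently and in parallel, which fits the task framing in the abstract.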

[CV-81] Cracking Instance Jigsaw Puzzles: An Alternative to Multiple Instance Learning for Whole Slide Image Analysis ICCV2025

【速读】：该论文试图解决多实例学习（MIL）在组织病理学全切片图像（WSI）分析中因依赖排列不变性而难以有效揭示实例间语义关联的问题。解决方案的关键在于提出一种新的MIL替代方法，通过学习从随机打乱的实例排列中恢复其顺序，即所谓的实例拼图问题，从而揭示实例间的语义关联。为此，作者提出了一种基于Siamese网络的解决方案，并通过最优传输理论进行理论验证。

链接: https://arxiv.org/abs/2507.08178
作者: Xiwen Chen,Peijie Qiu,Wenhui Zhu,Hao Wang,Huayu Li,Xuanzhao Dong,Xiaotong Sun,Xiaobing Yu,Yalin Wang,Abolfazl Razi,Aristeidis Sotiras
机构: Institution1; Institution2
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025

点击查看摘要

Abstract:While multiple instance learning (MIL) has been shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task cracking an instance jigsaw puzzle, where semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where the proposed method outperforms the recent state-of-the-art MIL competitors. The code is available at this https URL.
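The jigsaw pretext task itself can be illustrated with a few lines of NumPy: shuffle a bag of instances with a random permutation, then restore the original order from per-instance position predictions. In this sketch the ground-truth permutation stands in for the model's prediction; the Siamese network and the optimal-transport justification from the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "bag" of 8 instances with 1-d features (stand-ins for patch embeddings).
bag = np.arange(8, dtype=float).reshape(8, 1)

# Shuffle: shuffled[i] is the instance that originally sat at position perm[i].
perm = rng.permutation(len(bag))
shuffled = bag[perm]

# A trained model would predict each shuffled instance's original position.
# Here the true permutation is used as a placeholder prediction.
predicted_positions = perm

# Restore: put every shuffled instance back at its predicted position.
restored = np.empty_like(shuffled)
restored[predicted_positions] = shuffled
```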
zh

[CV-82] 3D forest semantic segmentation using multispectral LiDAR and 3D deep learning

【速读】：该论文旨在解决森林资源管理中森林组分分割的准确性问题，通过利用多光谱激光雷达（MS-LiDAR）数据实现更精确的森林组件分类。其解决方案的关键在于结合高密度多光谱点云数据与深度学习模型，特别是KPConv模型，在输入所有三个波长（1550 nm、905 nm和532 nm）作为初始特征的情况下，显著提升了平均交并比（mIoU）和平均准确率（mAcc）。

链接: https://arxiv.org/abs/2507.08025
作者: Narges Takhtkeshha,Lauris Bocaux,Lassi Ruoppa,Fabio Remondino,Gottfried Mandlburger,Antero Kukko,Juha Hyyppä
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conservation and decision-making regarding forest resources necessitate regular forest inventory. Light detection and ranging (LiDAR) in laser scanning systems has gained significant attention over the past two decades as a remote and non-destructive solution to streamline the labor-intensive and time-consuming procedure of forest inventory. Advanced multispectral (MS) LiDAR systems simultaneously acquire three-dimensional (3D) spatial and spectral information across multiple wavelengths of the electromagnetic spectrum. Consequently, MS-LiDAR technology enables the estimation of both the biochemical and biophysical characteristics of forests. Forest component segmentation is crucial for forest inventory. The synergistic use of spatial and spectral laser information has proven to be beneficial for achieving precise forest semantic segmentation. Thus, this study aims to investigate the potential of MS-LiDAR data, captured by the HeliALS system, providing high-density multispectral point clouds to segment forests into six components: ground, low vegetation, trunks, branches, foliage, and woody debris. Three point-wise 3D deep learning models and one machine learning model, including kernel point convolution, superpoint transformer, point transformer V3, and random forest, are implemented. Our experiments confirm the superior accuracy of the KPConv model. Additionally, various geometric and spectral feature vector scenarios are examined. The highest accuracy is achieved by feeding all three wavelengths (1550 nm, 905 nm, and 532 nm) as the initial features into the deep learning model, resulting in improvements of 33.73% and 32.35% in mean intersection over union (mIoU) and in mean accuracy (mAcc), respectively. This study highlights the excellent potential of multispectral LiDAR for improving the accuracy in fully automated forest component segmentation.
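For readers unfamiliar with the two reported metrics, the following sketch computes mean intersection over union (mIoU) and mean accuracy (mAcc) from a confusion matrix, using the standard per-class definitions. The 2-class matrix is a toy example and not data from the paper.

```python
import numpy as np

def miou_macc(conf):
    """mIoU and mAcc from a confusion matrix (rows: true class, cols: predicted)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    # IoU_c = TP / (TP + FP + FN) = TP / (column sum + row sum - TP)
    iou = tp / (conf.sum(axis=0) + conf.sum(axis=1) - tp)
    # Per-class accuracy (recall) = TP / number of true points in the class
    acc = tp / conf.sum(axis=1)
    return float(iou.mean()), float(acc.mean())

toy_conf = [[8, 2],
            [1, 9]]
miou, macc = miou_macc(toy_conf)
```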
zh

[CV-83] Dual-Attention U-Net with Class-Specific Ensembles and Bayesian Hyperparameter Optimization for Precise Wound and Scale Marker Segmentation ALT

【速读】：该论文试图解决临床图像中伤口和尺度标记的准确分割问题，这一问题对于有效的伤口管理和自动化评估至关重要。解决方案的关键在于提出了一种新型的双注意力U-Net++架构，该架构集成了通道注意力（SCSE）和空间注意力机制，以应对医学图像中的严重类别不平衡和变异问题。此外，通过5折交叉验证选择了EfficientNet-B7作为最优编码器，并通过定制的预处理、数据增强和贝叶斯超参数调优（WandB sweeps）独立训练了两个类别特定模型，最终采用测试时增强技术提升预测可靠性。

链接: https://arxiv.org/abs/2507.05314
作者: Daniel Cieślak,Miriam Reca,Olena Onyshchenko,Jacek Rumiński
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, conference: Joint 20th Nordic-Baltic Conference on Biomedical Engineering 24th Polish Conference on Biocybernetics and Biomedical Engineering; 6 figures, 2 tables, 11 sources

点击查看摘要

Abstract:Accurate segmentation of wounds and scale markers in clinical images remains a significant challenge, crucial for effective wound management and automated assessment. In this study, we propose a novel dual-attention U-Net++ architecture, integrating channel-wise (SCSE) and spatial attention mechanisms to address severe class imbalance and variability in medical images. Extensive benchmarking across diverse architectures and encoders via 5-fold cross-validation identified EfficientNet-B7 as the optimal encoder. We independently trained two class-specific models with tailored preprocessing, extensive data augmentation, and Bayesian hyperparameter tuning (WandB sweeps). The final model ensemble utilized Test Time Augmentation to further enhance prediction reliability. Our approach was evaluated on a benchmark dataset from the NBC 2025 PCBBE 2025 competition. Segmentation performance was quantified using a weighted F1-score (75% wounds, 25% scale markers), calculated externally by competition organizers on undisclosed hardware. The proposed approach achieved an F1-score of 0.8640, underscoring its effectiveness for complex medical segmentation tasks.
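The weighting described in the abstract (75% wounds, 25% scale markers) can be sketched as below. The pixel-wise binary F1 and the toy masks are assumptions; the organizers' exact scoring code is not described in the abstract.

```python
import numpy as np

def binary_f1(pred, true):
    """Pixel-wise F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pred = np.asarray(pred, dtype=bool)
    true = np.asarray(true, dtype=bool)
    tp = np.sum(pred & true)
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else float(2 * tp / denom)

def weighted_f1(f1_wound, f1_marker, w_wound=0.75, w_marker=0.25):
    """Combine per-class F1 scores with the weights quoted in the abstract."""
    return w_wound * f1_wound + w_marker * f1_marker

# Toy flattened masks (illustrative only).
f1_wound = binary_f1([1, 1, 0, 0], [1, 0, 1, 0])   # TP=1, FP=1, FN=1
f1_marker = binary_f1([0, 1, 1, 0], [0, 1, 1, 0])  # perfect overlap
score = weighted_f1(f1_wound, f1_marker)
```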
zh

人工智能

[AI-0] Optimistic Exploration for Risk-Averse Constrained Reinforcement Learning

【速读】：该论文试图解决风险规避的约束强化学习（Risk-averse Constrained Reinforcement Learning, RaCRL）中因风险规避导致探索保守、收敛到次优策略的问题，这些问题通常表现为无法充分最大化奖励或未能达成目标。解决方案的关键在于提出一种基于探索的方法——乐观风险规避的Actor-Critic算法（Optimistic Risk-averse Actor Critic, ORAC），该方法通过最大化状态-动作奖励值函数的局部上置信界和最小化风险规避状态-动作成本值函数的局部下置信界来构建探索性策略。在每一步中，根据成本值是否超过安全约束值调整其权重，从而鼓励策略探索环境中的不确定区域以发现高奖励状态，同时仍满足安全约束。

链接: https://arxiv.org/abs/2507.08793
作者: James McCarthy,Radu Marinescu,Elizabeth Daly,Ivana Dusparic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Risk-averse Constrained Reinforcement Learning (RaCRL) aims to learn policies that minimise the likelihood of rare and catastrophic constraint violations caused by an environment’s inherent randomness. In general, risk-aversion leads to conservative exploration of the environment which typically results in converging to sub-optimal policies that fail to adequately maximise reward or, in some cases, fail to achieve the goal. In this paper, we propose an exploration-based approach for RaCRL called Optimistic Risk-averse Actor Critic (ORAC), which constructs an exploratory policy by maximising a local upper confidence bound of the state-action reward value function whilst minimising a local lower confidence bound of the risk-averse state-action cost value function. Specifically, at each step, the weighting assigned to the cost value is increased or decreased if it exceeds or falls below the safety constraint value. This way the policy is encouraged to explore uncertain regions of the environment to discover high reward states whilst still satisfying the safety constraints. Our experimental results demonstrate that the ORAC approach prevents convergence to sub-optimal policies and improves significantly the reward-cost trade-off in various continuous control tasks such as Safety-Gymnasium and a complex building energy management environment CityLearn.
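The two ingredients named in the abstract, an optimistic exploratory objective and a cost weight that moves with the safety constraint, can be sketched as follows. The mean-plus-scaled-std confidence bounds, the additive step size, and all parameter names are assumptions; the abstract does not give ORAC's actual formulas.

```python
def exploratory_score(q_reward_mean, q_reward_std,
                      q_cost_mean, q_cost_std,
                      cost_weight, beta=1.0):
    """Optimistic objective: upper bound on reward minus weighted lower bound on cost.

    The bounds are formed as mean +/- beta * std, which is an assumed form.
    """
    ucb_reward = q_reward_mean + beta * q_reward_std
    lcb_cost = q_cost_mean - beta * q_cost_std
    return ucb_reward - cost_weight * lcb_cost

def update_cost_weight(cost_weight, cost_value, safety_limit, step=0.1):
    """Raise the weight when the cost exceeds the limit, lower it otherwise."""
    if cost_value > safety_limit:
        return cost_weight + step
    return max(0.0, cost_weight - step)  # keep the weight non-negative

score = exploratory_score(1.0, 0.5, 2.0, 0.5, cost_weight=1.0)
raised = update_cost_weight(1.0, cost_value=5.0, safety_limit=3.0)
lowered = update_cost_weight(0.05, cost_value=1.0, safety_limit=3.0)
```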
zh

[AI-1] Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data ICML2025

【速读】：该论文试图解决使用离线数据进行强化学习时面临的Q值外推误差问题。其关键解决方案是通过奖励缩放与层归一化（RS-LN）引导Q值在数据范围之外逐渐下降，并结合不可行动作的惩罚机制（PA），从而提出一种名为PARS的新算法。

链接: https://arxiv.org/abs/2507.08761
作者: Jeonghye Kim,Yongjae Shin,Whiyoung Jung,Sunghoon Hong,Deunsol Yoon,Youngchul Sung,Kanghoon Lee,Woohyung Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML2025

点击查看摘要

Abstract:Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
zh

[AI-2] Adaptive Nonlinear Vector Autoregression: Robust Forecasting for Noisy Chaotic Time Series

【速读】：该论文旨在解决非线性向量自回归（NVAR）和储备计算（RC）在预测混沌动力系统时面临的适应性差和可扩展性不足的问题。传统方法依赖于固定的非线性结构，如NVAR中的多项式展开或RC中的随机特征映射，这限制了其在高噪声或真实世界数据中的适用性，并且在高维情况下由于读出计算中的矩阵求逆导致效率低下。论文提出的解决方案关键在于引入一种自适应NVAR模型，该模型结合了延迟嵌入的线性输入与由浅层可学习多层感知机（MLP）生成的特征，通过基于梯度的优化联合训练MLP和线性读出结构，从而实现数据驱动的非线性建模，同时保持简单的读出结构，提升了模型的可扩展性和预测性能。

链接: https://arxiv.org/abs/2507.08738
作者: Azimov Sherkhon,Susana Lopez-Moreno,Eric Dolores-Cuenca,Sieun Lee,Sangil Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
备注: 15 pages, 10 figures

点击查看摘要

Abstract:Nonlinear vector autoregression (NVAR) and reservoir computing (RC) have shown promise in forecasting chaotic dynamical systems, such as the Lorenz-63 model and El Nino-Southern Oscillation. However, their reliance on fixed nonlinearities - polynomial expansions in NVAR or random feature maps in RC - limits their adaptability to high noise or real-world data. These methods also scale poorly in high-dimensional settings due to costly matrix inversion during readout computation. We propose an adaptive NVAR model that combines delay-embedded linear inputs with features generated by a shallow, learnable multi-layer perceptron (MLP). The MLP and linear readout are jointly trained using gradient-based optimization, enabling the model to learn data-driven nonlinearities while preserving a simple readout structure. Unlike standard NVAR, our approach avoids the need for an exhaustive and sensitive grid search over ridge and delay parameters. Instead, tuning is restricted to neural network hyperparameters, improving scalability. Initial experiments on chaotic systems tested under noise-free and synthetically noisy conditions showed that the adaptive model outperformed the standard NVAR in predictive accuracy and showed robust forecasting under noisy conditions with a lower observation frequency.
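A rough sketch of the feature construction described above: delay-embedded linear inputs are concatenated with the output of a small MLP. The one-hidden-layer tanh network, its width, and the random (untrained) weights are placeholders; in the paper the MLP and the linear readout are trained jointly by gradient descent, which is omitted here.

```python
import numpy as np

def delay_embed(x, k, s=1):
    """Stack k delayed copies of x with stride s.

    x: array of shape (T, d). Returns shape (T - (k-1)*s, k*d), where each
    row is [x_t, x_{t-s}, ..., x_{t-(k-1)s}].
    """
    T = x.shape[0]
    start = (k - 1) * s
    return np.hstack([x[start - j * s: T - j * s] for j in range(k)])

def adaptive_features(lin, W1, b1):
    """Concatenate linear delay features with MLP features (tanh hidden layer)."""
    hidden = np.tanh(lin @ W1 + b1)
    return np.hstack([lin, hidden])

x = np.arange(5.0).reshape(5, 1)          # toy 1-d series
lin = delay_embed(x, k=2)                 # rows: [x_t, x_{t-1}]

rng = np.random.default_rng(0)
W1 = rng.normal(size=(lin.shape[1], 3))   # hypothetical hidden width of 3
b1 = np.zeros(3)
features = adaptive_features(lin, W1, b1) # input to a linear readout
```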
zh

[AI-3] Catastrophic Forgetting Mitigation Through Plateau Phase Activity Profiling

【速读】：该论文试图解决深度神经网络中的灾难性遗忘问题（catastrophic forgetting），即在学习新任务时，模型性能会因先前知识的覆盖而下降。其解决方案的关键在于：在最终训练平台阶段跟踪参数的变化，而非在整个训练过程中监控参数。论文认为，在此阶段表现出更高活动性的参数揭示了损失景观中相对平坦的方向，这些方向适合用于适应新任务的同时保留先前的知识。

链接: https://arxiv.org/abs/2507.08736
作者: Idan Mashiach,Oren Glickman,Tom Tirer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Catastrophic forgetting in deep neural networks occurs when learning new tasks degrades performance on previously learned tasks due to knowledge overwriting. Among the approaches to mitigate this issue, regularization techniques aim to identify and constrain “important” parameters to preserve previous knowledge. In the highly nonconvex optimization landscape of deep learning, we propose a novel perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout the entire training process. We argue that parameters that exhibit higher activity (movement and variability) during this plateau reveal directions in the loss landscape that are relatively flat, making them suitable for adaptation to new tasks while preserving knowledge from previous ones. Our comprehensive experiments demonstrate that this approach achieves superior performance in balancing catastrophic forgetting mitigation with strong performance on newly learned tasks.
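One way to picture "activity during the plateau" is to record parameter snapshots over the last training steps and score each parameter by how far it moved and how much it varied. The specific score below (absolute displacement plus standard deviation) is an assumption made for illustration; the abstract does not define the paper's exact measure.

```python
import numpy as np

def plateau_activity(snapshots):
    """Per-parameter activity over plateau snapshots.

    snapshots: array of shape (n_steps, n_params) collected during the
    final training plateau. Returns movement + variability per parameter.
    """
    s = np.asarray(snapshots, dtype=float)
    movement = np.abs(s[-1] - s[0])   # net displacement over the plateau
    variability = s.std(axis=0)       # fluctuation around the mean
    return movement + variability

# Toy example: parameter 0 is frozen, parameter 1 keeps drifting.
snaps = [[0.0, 1.0],
         [0.0, 2.0],
         [0.0, 3.0]]
activity = plateau_activity(snaps)
# Higher activity suggests a flatter direction, i.e. a parameter that can be
# left freer to adapt to a new task; low activity suggests constraining it.
```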
zh

[AI-4] Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning ICSE2026

【速读】：该论文试图解决动态环境中配置与性能模型学习中的概念漂移问题，包括全局漂移和局部漂移，这些问题会导致性能预测的准确性下降。解决方案的关键在于提出DHDA框架，该框架通过双层次适应机制来捕捉和应对不同层次的漂移：在上层，将数据重新划分并仅在必要时对局部模型进行重训练以处理全局漂移；在下层，各划分的局部模型能够异步检测并适应局部漂移。同时，DHDA结合增量更新与周期性全量重训练，在保证响应速度的同时减少冗余计算。

链接: https://arxiv.org/abs/2507.08730
作者: Zezhen Xiang,Jingzhi Gong,Tao Chen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by ICSE 2026

点击查看摘要

Abstract:Modern configurable software systems need to learn models that correlate configuration and performance. However, when the system operates in dynamic environments, the workload variations, hardware changes, and system updates will inevitably introduce concept drifts at different levels - global drifts, which reshape the performance landscape of the entire configuration space; and local drifts, which only affect certain sub-regions of that space. As such, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real-time, rendering configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both the local and global drifts using dually hierarchical adaptation: at the upper level, we redivide the data into different divisions, within each of which the local model is retrained, to handle global drifts only when necessary. At the lower level, the local models of the divisions can detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Through evaluating eight software systems and against state-of-the-art approaches, we show that DHDA achieves considerably better accuracy and can effectively adapt to drifts with up to 2x improvements, while incurring reasonable overhead and is able to improve different local models in handling concept drift.
zh

[AI-5] Monitoring Risks in Test-Time Adaptation

【速读】：该论文试图解决在部署预测模型时遇到的测试阶段数据分布偏移（distribution shift）问题，以及由此导致的模型性能下降问题。其解决方案的关键在于将测试阶段适应（TTA）方法与风险监控框架相结合，通过跟踪模型的预测性能并在预设性能标准被违反时发出警报，以检测模型最终失效的时刻。具体而言，作者扩展了基于序列检验的监控工具，引入置信序列以适应测试阶段模型更新且无测试标签的情况，从而实现了对TTA过程的严格统计风险监控。

链接: https://arxiv.org/abs/2507.08721
作者: Mona Schirmer,Metod Jazbec,Christian A. Naesseth,Eric Nalisnick
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model’s lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.
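As background for the monitoring idea, here is a minimal sketch of a time-uniform confidence sequence for a running mean loss, with an alarm when the lower bound exceeds a risk threshold. It uses a Hoeffding bound with a union bound over time, assumes i.i.d. labeled losses in [0, 1], and is a generic textbook construction; the paper's extension to label-free, adapting models is not reproduced.

```python
import math

def cs_radius(t, delta=0.05):
    """Half-width valid for all t simultaneously with probability >= 1 - delta.

    Hoeffding at each t with error budget delta / (t * (t + 1)),
    which sums to delta over all t >= 1.
    """
    return math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))

def first_alarm(losses, threshold, delta=0.05):
    """Return the first step at which the lower bound on risk exceeds threshold."""
    total = 0.0
    for t, loss in enumerate(losses, start=1):
        total += loss
        lower = total / t - cs_radius(t, delta)
        if lower > threshold:
            return t
    return None

alarm_bad = first_alarm([1.0] * 200, threshold=0.5)   # model always wrong
alarm_good = first_alarm([0.0] * 200, threshold=0.5)  # model always right
```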
zh

[AI-6] System-of-systems Modeling and Optimization: An Integrated Framework for Intermodal Mobility

【速读】：该论文试图解决在开发创新系统架构过程中，使用高效专用方法（如基于物理的仿真）进行建模和优化时，优化算法面临的评估成本增加和潜在失败问题。解决方案的关键在于引入基于代理的优化算法，例如利用高斯过程模型的贝叶斯优化，以提高优化效率并降低计算复杂性。

链接: https://arxiv.org/abs/2507.08715
作者: Paul Saves,Jasper Bussemaker,Rémi Lafage,Thierry Lefebvre,Nathalie Bartoli,Youssef Diouane,Joseph Morlier
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:For developing innovative systems architectures, modeling and optimization techniques have been central to frame the architecting process and define the optimization and modeling problems. In this context, for system-of-systems the use of efficient dedicated approaches (often physics-based simulations) is highly recommended to reduce the computational complexity of the targeted applications. However, exploring novel architectures using such dedicated approaches might pose challenges for optimization algorithms, including increased evaluation costs and potential failures. To address these challenges, surrogate-based optimization algorithms, such as Bayesian optimization utilizing Gaussian process models have emerged.
zh

[AI-7] elsciRL: Integrating Language Solutions into Reinforcement Learning Problem Settings EMNLP2025

【速读】：该论文试图解决如何将语言解决方案有效地应用于强化学习问题，以提升强化学习代理在基于奖励环境中的性能。其解决方案的关键在于引入了elsciRL库，该库通过扩展Osborne（2024）提出的带有自我完成指令框架的语言适配器，结合大语言模型（LLM）生成和自完成指令，从而实现对强化学习任务的优化。这一方法具有较低的设置要求，并提供了一个新颖的图形用户界面（GUI），使用户能够输入文本以引导LLM生成指令并实现自我完成。

链接: https://arxiv.org/abs/2507.08705
作者: Philip Osborne,Danilo S. Carvalho,André Freitas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure, 3 tables, 11 Appendix pages, submitted to EMNLP 2025 Call for System Demonstrations

点击查看摘要

Abstract:We present elsciRL, an open-source Python library to facilitate the application of language solutions on reinforcement learning problems. We demonstrate the potential of our software by extending the Language Adapter with Self-Completing Instruction framework defined in (Osborne, 2024) with the use of LLMs. Our approach can be re-applied to new applications with minimal setup requirements. We provide a novel GUI that allows a user to provide text input for an LLM to generate instructions which it can then self-complete. Empirical results indicate that these instructions can improve a reinforcement learning agent’s performance. Therefore, we present this work to accelerate the evaluation of language solutions on reward based environments to enable new opportunities for scientific discovery.
zh

[AI-8] ONION: A Multi-Layered Framework for Participatory ER Design

【速读】：该论文试图解决传统实体-关系（ER）建模过程中存在的设计偏见和参与不平等问题，旨在通过整合设计正义、参与式人工智能和概念建模的理念，实现更加包容和透明的建模过程。解决方案的关键在于提出一种五阶段方法论：观察（Observe）、培育（Nurture）、集成（Integrate）、优化（Optimize）、标准化（Normalize），以支持从非结构化的利益相关者输入逐步抽象为结构化的ER图，从而促进多样化的观点融入早期数据建模阶段。

链接: https://arxiv.org/abs/2507.08702
作者: Viktoriia Makovska,George Fletcher,Julia Stoyanovich
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Related DOI: https://doi.org/10.1145/3736733.3736736

点击查看摘要

Abstract:We present ONION, a multi-layered framework for participatory Entity-Relationship (ER) modeling that integrates insights from design justice, participatory AI, and conceptual modeling. ONION introduces a five-stage methodology: Observe, Nurture, Integrate, Optimize, Normalize. It supports progressive abstraction from unstructured stakeholder input to structured ER diagrams. Our approach aims to reduce designer bias, promote inclusive participation, and increase transparency through the modeling process. We evaluate ONION through real-world workshops focused on sociotechnical systems in Ukraine, highlighting how diverse stakeholder engagement leads to richer data models and deeper mutual understanding. Early results demonstrate ONION’s potential to host diversity in early-stage data modeling. We conclude with lessons learned, limitations and challenges involved in scaling and refining the framework for broader adoption.
zh

[AI-9] A Personalised Formal Verification Framework for Monitoring Activities of Daily Living of Older Adults Living Independently in Their Homes

【速读】：该论文旨在解决如何为在家中独立生活的老年人提供提升生活质量的个性化解决方案的问题。其关键在于提出一个用于表示和推理老年人日常生活活动（Activities of Daily Living, ADL）的框架，该框架整合了传感器数据和上下文信息，并基于半结构化访谈、家庭布局和社会观察构建针对每个参与者的形式化模型。通过将个体特定需求编码为线性时序逻辑（Linear Temporal Logic）中的性质，并利用模型检测器验证模型是否满足这些性质，从而实现对老年人行为的监控与分析。当性质被违反时，系统会生成反例以明确违反的原因，进而支持及时干预。

链接: https://arxiv.org/abs/2507.08701
作者: Ricardo Contreras,Filip Smola,Nuša Farič,Jiawei Zheng,Jane Hillston,Jacques D. Fleuriot
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:There is an imperative need to provide quality of life to a growing population of older adults living independently. Personalised solutions that focus on the person and take into consideration their preferences and context are key. In this work, we introduce a framework for representing and reasoning about the Activities of Daily Living of older adults living independently at home. The framework integrates data from sensors and contextual information that aggregates semi-structured interviews, home layouts and sociological observations from the participants. We use these data to create formal models, personalised for each participant according to their preferences and context. We formulate requirements that are specific to each individual as properties encoded in Linear Temporal Logic and use a model checker to verify whether each property is satisfied by the model. When a property is violated, a counterexample is generated giving the cause of the violation. We demonstrate the framework’s generalisability by applying it to different participants, highlighting its potential to enhance the safety and well-being of older adults ageing in place.
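To illustrate the kind of check involved, the sketch below evaluates a response property of the form G(trigger -> F response) on a finite event trace and returns a counterexample position when it fails. The event names and the finite-trace reading are hypothetical; the paper uses a model checker on formal models, not this hand-written loop.

```python
def check_response(trace, trigger, response):
    """Check G(trigger -> F response) on a finite trace of event labels.

    Returns None if every trigger is eventually followed by a response,
    otherwise the index of the earliest unanswered trigger (a counterexample).
    Assumes trigger and response are distinct labels.
    """
    pending = None
    for i, event in enumerate(trace):
        if event == trigger and pending is None:
            pending = i          # remember the earliest open obligation
        elif event == response:
            pending = None       # all open obligations are discharged
    return pending

# Hypothetical daily-activity traces.
ok_trace = ["wake_up", "take_medication", "wake_up", "breakfast", "take_medication"]
bad_trace = ["wake_up", "take_medication", "wake_up", "breakfast"]

ok_result = check_response(ok_trace, "wake_up", "take_medication")
bad_result = check_response(bad_trace, "wake_up", "take_medication")
```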
zh

[AI-10] Introspection of Thought Helps AI Agents

【速读】：该论文试图解决AI Agents在利用大型语言模型（Large Language Models, LLMs）和多模态大语言模型（Multimodal-LLMs, MLLMs）进行自然语言理解和推理时，受限于LLMs的固有局限性以及迭代推理过程带来的高昂计算成本问题。解决方案的关键在于提出一种带有思维内省（Introspection of Thought, INoT）的AI Agent推理框架，通过设计一种新的LLM-Read代码在提示中引导LLM执行程序化对话推理流程，从而在LLM内部实现自我否定与反思，有效降低token消耗，并提升任务性能。

链接: https://arxiv.org/abs/2507.08664
作者: Haoran Sun,Shaoning Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI Agents rely on Large Language Models (LLMs) and Multimodal-LLMs (MLLMs) to perform interpretation and inference in text and image tasks without post-training, where LLMs and MLLMs play the most critical role and determine the initial ability and limitations of AI Agents. Usually, AI Agents utilize sophisticated prompt engineering and external reasoning framework to obtain a promising interaction with LLMs, e.g., Chain-of-Thought, Iteration of Thought and Image-of-Thought. However, they are still constrained by the inherent limitations of LLM in understanding natural language, and the iterative reasoning process will generate a large amount of inference cost. To this end, we propose a novel AI Agent Reasoning Framework with Introspection of Thought (INoT) by designing a new LLM-Read code in prompt. It enables LLM to execute programmatic dialogue reasoning processes following the code in prompt. Therefore, self-denial and reflection occur within LLM instead of outside LLM, which can reduce token cost effectively. Through our experiments on six benchmarks for three different tasks, the effectiveness of INoT is verified, with an average improvement of 7.95% in performance, exceeding the baselines. Furthermore, the token cost of INoT is lower on average than the best performing method at baseline by 58.3%. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.
zh

[AI-11] Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning

【速读】：该论文试图解决如何提升大型语言模型（LLMs）在Lean 4环境中生成形式化定理证明的能力。其解决方案的关键在于通过集成Lean 4验证器的反馈，改进强化学习（RL）机制，使模型能够“自知”其推理过程的正确性，并据此自我修正错误，从而优化推理轨迹。

链接: https://arxiv.org/abs/2507.08649
作者: Xingguang Ji,Yahui Liu,Qi Wang,Jingyuan Zhang,Yang Yue,Rui Shi,Chenxi Sun,Fuzheng Zhang,Guorui Zhou,Kun Gai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 13 figures

点击查看摘要

Abstract:We introduce Leanabell-Prover-V2, a 7B large language model (LLM) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continue to post-train existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become “self-aware” of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: this https URL.
zh

[AI-12] Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance

【速读】：该论文旨在解决居家康复环境中缺乏高效、智能指导与反馈系统的问题，特别是在全膝关节置换术（TKR）后的康复训练中。其解决方案的关键在于引入 Ambient Intelligence Rehabilitation Support (AIRS) 框架，该框架整合了实时三维重建（RT-3DR）、智能导航和大视觉-语言模型（VLMs），通过智能手机实现空间重建与用户体态匹配的虚拟形象，提供可视化反馈，并结合两种反馈机制——视觉三维反馈与VLM生成的详细纠错说明，以提升康复训练的准确性与合规性。

链接: https://arxiv.org/abs/2507.08624
作者: Gábor Baranyi,Zsolt Csibi,Kristian Fenech,Áron Fóthi,Zsófia Gaál,Joul Skaf,András Lőrincz
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: The paper has been submitted to a journal and waiting for review

点击查看摘要

Abstract:This paper introduces the Ambient Intelligence Rehabilitation Support (AIRS) framework, an advanced artificial intelligence-based solution tailored for home rehabilitation environments. AIRS integrates cutting-edge technologies, including Real-Time 3D Reconstruction (RT-3DR), intelligent navigation, and large Vision-Language Models (VLMs), to create a comprehensive system for machine-guided physical rehabilitation. The general AIRS framework is demonstrated in rehabilitation scenarios following total knee replacement (TKR), utilizing a database of 263 video recordings for evaluation. A smartphone is employed within AIRS to perform RT-3DR of living spaces and has a body-matched avatar to provide visual feedback about the exercise. This avatar is necessary for (a) optimizing exercise configurations, including camera placement, patient positioning, and initial poses, and (b) addressing privacy concerns and promoting compliance with the AI Act. The system guides users through the recording process to ensure the collection of properly recorded videos. AIRS employs two feedback mechanisms: (i) visual 3D feedback, enabling direct comparisons between prerecorded clinical exercises and patient home recordings, and (ii) VLM-generated feedback, providing detailed explanations and corrections for exercise errors. The framework also supports people with visual and hearing impairments. It also features a modular design that can be adapted to broader rehabilitation contexts. AIRS software components are available for further use and customization.
zh

[AI-13] Agentic Large Language Models for Conceptual Systems Engineering and Design

【速读】：该论文试图解决早期工程设计中任务连续性不足和难以生成可执行模型的问题，其解决方案的关键在于引入一种结构化的多智能体系统（MAS），通过迭代构建和优化Design-State Graph（DSG）来更有效地管理需求提取、功能分解和仿真代码生成。相比简单的两智能体系统（2AS），MAS通过九个角色的协作实现了更细致的设计表达和更高的设计细节水平。

链接: https://arxiv.org/abs/2507.08619
作者: Soheyl Massoudi,Mark Fuge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 3 figures

点击查看摘要

Abstract:Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can more effectively manage requirements extraction, functional decomposition, and simulator code generation than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges. We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator-Reflector loop. Both systems run a total of 60 experiments (2 LLMs (Llama 3.3 70B vs. reasoning-distilled DeepSeek R1 70B) x 2 agent configurations x 3 temperatures x 5 seeds). We report JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20%). Code compatibility peaked at 100% under specific 2AS settings but averaged below 50% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes) whereas 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail. Reasoning-distilled LLM improved completion rates, yet low requirements and fidelity gaps in coding persisted.
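The experiment count and the JSON-serializable node idea can be sketched as follows. The temperature values and every field name in the node are hypothetical placeholders; the abstract only states that a DSG node bundles requirements, a physical embodiment, and a Python physics model.

```python
import json
from itertools import product

# Experiment grid: 2 LLMs x 2 agent configurations x 3 temperatures x 5 seeds.
llms = ["Llama 3.3 70B", "DeepSeek R1 70B"]
configs = ["MAS", "2AS"]
temperatures = [0.0, 0.5, 1.0]   # placeholder values, not given in the abstract
seeds = range(5)
runs = list(product(llms, configs, temperatures, seeds))

# A hypothetical Design-State Graph node (field names are illustrative only).
node = {
    "id": "pump",
    "requirements": ["deliver filtered water using solar power"],
    "embodiment": "DC diaphragm pump",
    "physics_model": "def flow_rate(power_w):\n    return 0.01 * power_w\n",
    "edges": ["filter"],
}
serialized = json.dumps(node)       # JSON validity check: must serialize ...
roundtrip = json.loads(serialized)  # ... and parse back to the same content
```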
zh

[AI-14] Towards Collaborative Fairness in Federated Learning Under Imbalanced Covariate Shift KDD’-25 KDD

【速读】：该论文试图解决联邦学习中的协作公平性问题，特别是针对不平衡协变量偏移（imbalanced covariate shift）这一实际且复杂的异质性问题。解决方案的关键在于FedAKD（Federated Asynchronous Knowledge Distillation）方法，其核心思想是通过异步知识蒸馏策略，在保持全局模型稳定的同时，利用高置信度的正确预测样本更新全局模型，从而实现准确预测与协作公平性的平衡。

链接: https://arxiv.org/abs/2507.08617
作者: Tianrun Yu,Jiaqi Wang,Haoyu Wang,Mingquan Lin,Han Liu,Nelson S. Yee,Fenglong Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, accepted to the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’ 25), Toronto, Canada, August 3-7 2025

点击查看摘要

Abstract:Collaborative fairness is a crucial challenge in federated learning. However, existing approaches often overlook a practical yet complex form of heterogeneity: imbalanced covariate shift. We provide a theoretical analysis of this setting, which motivates the design of FedAKD (Federated Asynchronous Knowledge Distillation), a simple yet effective approach that balances accurate prediction with collaborative fairness. FedAKD consists of client and server updates. In the client update, we introduce a novel asynchronous knowledge distillation strategy based on our preliminary analysis, which reveals that while correctly predicted samples exhibit similar feature distributions across clients, incorrectly predicted samples show significant variability. This suggests that imbalanced covariate shift primarily arises from misclassified samples. Leveraging this insight, our approach first applies traditional knowledge distillation to update client models while keeping the global model fixed. Next, we select correctly predicted high-confidence samples and update the global model using these samples while keeping client models fixed. The server update simply aggregates all client models. We further provide a theoretical proof of FedAKD’s convergence. Experimental results on public datasets (FashionMNIST and CIFAR10) and a real-world Electronic Health Records (EHR) dataset demonstrate that FedAKD significantly improves collaborative fairness, enhances predictive accuracy, and fosters client participation even under highly heterogeneous data distributions.
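The sample-selection step in the client update (keep only correctly predicted, high-confidence samples for updating the global model) might look like the sketch below. The confidence threshold `tau` and the use of the maximum softmax probability as confidence are assumptions; the distillation losses themselves are not shown.

```python
import numpy as np

def select_confident_correct(probs, labels, tau=0.9):
    """Indices of samples that are predicted correctly with confidence >= tau.

    probs: (n_samples, n_classes) predicted class probabilities.
    labels: (n_samples,) ground-truth class indices.
    tau: hypothetical confidence threshold.
    """
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    return np.flatnonzero((pred == labels) & (conf >= tau))

toy_probs = [[0.95, 0.05],   # correct, confident   -> kept
             [0.60, 0.40],   # correct, unconfident -> dropped
             [0.10, 0.90],   # wrong                -> dropped
             [0.02, 0.98]]   # correct, confident   -> kept
toy_labels = [0, 0, 0, 1]
selected = select_confident_correct(toy_probs, toy_labels)
```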

[AI-15] Unlocking Speech Instruction Data Potential with Query Rewriting ACL2025

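【Quick Read】: This paper addresses the limited performance of speech instruction dataset construction, which stems from the lack of high-quality annotated data and the gap between language-model-generated results and human responses. The key to the solution is a query rewriting framework with multi-LLM knowledge fusion, in which multiple agents annotate and validate the synthesized speech, so that high-quality speech instruction datasets can be built without human annotation. Through zero-shot rewriting, the method transforms text instructions into distributions better suited to text-to-speech models, raising data usability from 72% to 93%.

Link: https://arxiv.org/abs/2507.08603
Authors: Yonghua Hei, Yibo Yan, Shuliang Liu, Huiyu Zhou, Linfeng Zhang, Xuming Hu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ACL 2025 Findings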

Abstract:End-to-end Large Speech Language Models (LSLMs) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models (LLMs) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech (TTS) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instruction to speech due to the limitations of the training data distribution in TTS models. To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS models for speech synthesis through zero-shot rewriting, increasing data usability from 72% to 93%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.

[AI-16] Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency Effectiveness and Empathy

【Quick Read】: This paper addresses the problem that manually creating proto-personas in early-stage Product Discovery (such as Lean Inception) is time-consuming, cognitively demanding, and prone to bias. The key to the solution is a prompt engineering approach built on Generative AI (GenAI) that improves the efficiency and quality of proto-persona generation and makes the personas more reusable in later product definition phases.

Link: https://arxiv.org/abs/2507.08594
Authors: Fernando Ayach, Vitor Lameirão, Raul Leão, Jerfferson Felizardo, Rafael Sobrinho, Vanessa Borges, Patrícia Matsubara, Awdren Fontão
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages; 2 figures; Preprint with the original submission accepted for publication at 39th Brazilian Symposium on Software Engineering (SBES)

Abstract:Proto-personas are commonly used during early-stage Product Discovery, such as Lean Inception, to guide product definition and stakeholder alignment. However, the manual creation of proto-personas is often time-consuming, cognitively demanding, and prone to bias. In this paper, we propose and empirically investigate a prompt engineering-based approach to generate proto-personas with the support of Generative AI (GenAI). Our goal is to evaluate the approach in terms of efficiency, effectiveness, user acceptance, and the empathy elicited by the generated personas. We conducted a case study with 19 participants embedded in a real Lean Inception, employing a qualitative and quantitative methods design. The results reveal the approach’s efficiency by reducing time and effort and improving the quality and reusability of personas in later discovery phases, such as Minimum Viable Product (MVP) scoping and feature refinement. While acceptance was generally high, especially regarding perceived usefulness and ease of use, participants noted limitations related to generalization and domain specificity. Furthermore, although cognitive empathy was strongly supported, affective and behavioral empathy varied significantly across participants. These results contribute novel empirical evidence on how GenAI can be effectively integrated into software Product Discovery practices, while also identifying key challenges to be addressed in future iterations of such hybrid design processes.

[AI-17] FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation ACM-MM2025

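【Quick Read】: This paper addresses the difficulty text-to-audio (T2A) generation has with complex text prompts, caused by the shortage of sufficient, temporally aligned audio-text pairs, especially prompts with precise timing control such as “owl hooted at 2.4s-5.2s”. The key to the solution is FreeAudio, a training-free timing-controlled T2A framework that introduces Decoupling and Aggregating Attention Control, Contextual Latent Composition, and Reference Guidance to enable timing-controlled long-form T2A generation, with synthesis quality comparable to training-based methods.

Link: https://arxiv.org/abs/2507.08557
Authors: Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, Jun Zhu
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted at ACM MM 2025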

Abstract:Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., “owl hooted at 2.4s-5.2s”. Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., “owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s”. Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: this https URL
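To make the idea of "non-overlapping time windows" concrete, here is a small deterministic sketch that cuts overlapping timed events into windows at every start and end boundary. It is only a stand-in for illustration: in the paper the planning and recaptioning are done by an LLM, and the function name `plan_windows` and the tuple layout are assumptions made here.

```python
def plan_windows(events):
    """Split overlapping (label, start, end) events into non-overlapping windows.

    Each returned window is (start, end, [labels active in that window]).
    """
    # Every start or end time is a potential window boundary.
    boundaries = sorted({t for _, start, end in events for t in (start, end)})
    windows = []
    for left, right in zip(boundaries, boundaries[1:]):
        # An event is active if it fully covers the window [left, right].
        active = [label for label, start, end in events
                  if start <= left and right <= end]
        if active:
            windows.append((left, right, active))
    return windows

# The long-form prompt from the abstract, expressed as timed events (seconds).
events = [("owl hooted", 2.4, 5.2), ("crickets chirping", 0.0, 24.0)]
windows = plan_windows(events)
for window in windows:
    print(window)
```

Each window lists the events that sound together in it, which is the information a recaptioning step would need to write one description per window.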

[AI-18] White-Basilisk: A Hybrid Model for Code Vulnerability Detection

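【Quick Read】: This paper addresses challenges in software vulnerability detection, in particular how to identify vulnerabilities more efficiently and accurately given the context limits of current Large Language Models (LLMs). The key to the solution is the White-Basilisk model, whose architecture combines Mamba layers, linear self-attention, and a Mixture of Experts framework. With only 200M parameters it achieves state-of-the-art vulnerability detection results, and it can process very long sequences, allowing extensive codebases to be analyzed in a single pass.

Link: https://arxiv.org/abs/2507.08540
Authors: Ioannis Lamprou, Alexander Shevtsov, Ioannis Arapakis, Sotiris Ioannidis
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: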

Abstract:The proliferation of software vulnerabilities presents a significant challenge to cybersecurity, necessitating more effective detection methodologies. We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance while challenging prevailing assumptions in AI model scaling. Utilizing an innovative architecture that integrates Mamba layers, linear self-attention, and a Mixture of Experts framework, White-Basilisk achieves state-of-the-art results in vulnerability detection tasks with a parameter count of only 200M. The model’s capacity to process sequences of unprecedented length enables comprehensive analysis of extensive codebases in a single pass, surpassing the context limitations of current Large Language Models (LLMs). White-Basilisk exhibits robust performance on imbalanced, real-world datasets, while maintaining computational efficiency that facilitates deployment across diverse organizational scales. This research not only establishes new benchmarks in code security but also provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks, potentially redefining optimization strategies in AI development for domain-specific applications.

[AI-19] MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

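【Quick Read】: This paper aims to generate expressive audio performances from music scores, and in particular to address the poor generalization of traditional performance synthesis pipelines across diverse MIDI sources, musical styles, and recording environments. The key to the solution is MIDI-VALLE, a neural codec language model adapted from the VALLE framework. It encodes both MIDI and audio as discrete tokens and conditions on a reference audio performance and its corresponding MIDI, giving more consistent and robust modelling of piano performances; training on a large, diverse piano performance dataset further improves generalization.

Link: https://arxiv.org/abs/2507.08530
Authors: Jingjing Tang, Xin Wang, Zhe Zhang, Junichi Yamagishi, Geraint Wiggins, George Fazekas
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by ISMIR 2025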

Abstract:Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model’s generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.

[AI-20] From Language to Logic: A Bi-Level Framework for Structured Reasoning

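【Quick Read】: This paper addresses structured reasoning over natural language inputs, that is, how to map unstructured linguistic expressions to formal logical representations. The key to the solution is a bi-level framework that converts language to logic in two stages. At the upper level, a large language model (LLM) performs high-level task abstraction, parsing natural language queries into intermediate structured representations that specify the problem type, objectives, decision variables, and symbolic constraints. At the lower level, these representations are used to generate symbolic workflows or executable reasoning programs for accurate and interpretable decision making. The framework supports modular reasoning, explicit constraint enforcement, and generalization across domains.

Link: https://arxiv.org/abs/2507.08501
Authors: Keying Yang, Hao Wang, Kai Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: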

Abstract:Structured reasoning over natural language inputs remains a core challenge in artificial intelligence, as it requires bridging the gap between unstructured linguistic expressions and formal logical representations. In this paper, we propose a novel bi-level framework that maps language to logic through a two-stage process: high-level task abstraction and low-level logic generation. At the upper level, a large language model (LLM) parses natural language queries into intermediate structured representations specifying the problem type, objectives, decision variables, and symbolic constraints. At the lower level, the LLM uses these representations to generate symbolic workflows or executable reasoning programs for accurate and interpretable decision making. The framework supports modular reasoning, enforces explicit constraints, and generalizes across domains such as mathematical problem solving, question answering, and logical inference. We further optimize the framework with an end-to-end bi-level optimization approach that jointly refines both the high-level abstraction and low-level logic generation stages. Experiments on multiple realistic reasoning benchmarks demonstrate that our approach significantly outperforms existing baselines in accuracy, with accuracy gains reaching as high as 40%. Moreover, the bi-level design enhances transparency and error traceability, offering a promising step toward trustworthy and systematic reasoning with LLMs.

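[AI-21] Pre-Training LLMs on a budget: A comparison of three optimizers

【Quick Read】: This paper aims to reduce the long pre-training times of Large Language Models (LLMs) and to obtain better-performing models. The key to the approach is a comparison of three major optimizers: AdamW, the de-facto standard; Lion, a simpler optimizer developed through evolutionary search; and Sophia, a second-order optimizer. Hyperparameters are tuned for each combination of base architecture and optimizer, and two base architectures and two training regimes (single-epoch and multiple-epoch) are used to evaluate training efficiency and model performance.

Link: https://arxiv.org/abs/2507.08472
Authors: Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, Fabian Küch
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: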

Abstract:Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours but AdamW led to the best downstream evaluation results.
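For readers unfamiliar with how two of the compared optimizers differ, the sketch below shows one update step of AdamW and of Lion on a plain NumPy parameter vector, following the commonly published update rules. The hyperparameter defaults are illustrative assumptions, not the values tuned in the paper, and Sophia is left out because it additionally needs a curvature (Hessian diagonal) estimate.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: Adam moments plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def lion_step(w, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """One Lion step: only the sign of an interpolated momentum is used."""
    update = np.sign(b1 * m + (1 - b1) * g)
    w = w - lr * (update + wd * w)
    m = b2 * m + (1 - b2) * g            # momentum is updated after the step
    return w, m

w0 = np.array([1.0, -2.0])
g0 = np.array([0.5, -0.25])
zeros = np.zeros_like(w0)

w_adamw, m_adamw, v_adamw = adamw_step(w0, g0, zeros, zeros, t=1)
w_lion, m_lion = lion_step(w0, g0, zeros)
print(w_adamw, w_lion)
```

Lion keeps a single momentum buffer where AdamW keeps two, which is one reason it is often described as the simpler and lighter of the two.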

[AI-22] Space filling positionality and the Spiroformer

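【Quick Read】: This paper addresses the lack of a well-defined global order when Transformer models are generalized to geometric domains such as manifolds. The key to the solution is to use attention heads that follow a space-filling curve, so that dependencies can be captured on geometric structures without a clear global order. As a first experimental example, the authors present the Spiroformer, a transformer that follows a polar spiral on the 2-sphere.

Link: https://arxiv.org/abs/2507.08456
Authors: M. Maurin, M.Á. Evangelista-Alvarado, P. Suárez-Serrato
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Dynamical Systems (math.DS); Symplectic Geometry (math.SG)
Comments: 9 pages, 5 figures. To appear in Geometric Science of Information 2025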

Abstract:Transformers excel when dealing with sequential data. Generalizing transformer models to geometric domains, such as manifolds, we encounter the problem of not having a well-defined global order. We propose a solution with attention heads following a space-filling curve. As a first experimental example, we present the Spiroformer, a transformer that follows a polar spiral on the 2-sphere.
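One way to picture how a spiral imposes a sequential order on the sphere is to sample points along a curve that winds from the north pole to the south pole; the sample index then serves as a position. The parametrization below is an assumption made for illustration (the abstract does not give the exact curve), and `sphere_spiral` is a hypothetical helper name.

```python
import numpy as np

def sphere_spiral(num_points, turns):
    """Sample points along a pole-to-pole spiral on the unit 2-sphere.

    Returns an array of shape (num_points, 3); row k is the k-th position
    in the induced sequential order.
    """
    polar = np.linspace(0.0, np.pi, num_points)   # angle measured from the north pole
    azimuth = 2.0 * turns * polar                 # `turns` full revolutions pole to pole
    x = np.sin(polar) * np.cos(azimuth)
    y = np.sin(polar) * np.sin(azimuth)
    z = np.cos(polar)
    return np.stack([x, y, z], axis=1)

points = sphere_spiral(num_points=200, turns=8)
print(points.shape)
```

With more turns the curve passes closer to every point of the sphere, which is the sense in which such a spiral approximates a space-filling ordering.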

[AI-23] Why this and not that? A Logic-based Framework for Contrastive Explanations

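【Quick Read】: This paper defines several canonical problems related to contrastive explanations, each answering a question of the form “Why P but not Q?”. The key to the solution is to compute causes for both P and Q and explicitly compare their differences, yielding a framework for contrastive explanations based on propositional logic. The framework captures a cardinality-minimal version of existing contrastive explanations in the literature, and the problems are implemented for CNF formulas using answer set programming to show how they work in practice.

Link: https://arxiv.org/abs/2507.08454
Authors: Tobias Geibinger, Reijo Jaakkola, Antti Kuusisto, Xinghan Liu, Miikka Vilander
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 20 pages, accepted to JELIA 2025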

Abstract:We define several canonical problems related to contrastive explanations, each answering a question of the form “Why P but not Q?”. The problems compute causes for both P and Q, explicitly comparing their differences. We investigate the basic properties of our definitions in the setting of propositional logic. We show, inter alia, that our framework captures a cardinality-minimal version of existing contrastive explanations in the literature. Furthermore, we provide an extensive analysis of the computational complexities of the problems. We also implement the problems for CNF-formulas using answer set programming and present several examples demonstrating how they work in practice.

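[AI-24] CUE-RAG: Towards Accurate and Cost-Efficient Graph-Based RAG via Multi-Partite Graph and Query-Driven Iterative Retrieval

【Quick Read】: This paper addresses the limited question answering (QA) performance of Large Language Models (LLMs) caused by their lack of domain-specific and up-to-date knowledge. Existing graph-based Retrieval-Augmented Generation (RAG) methods are held back by poor graph quality due to incomplete extraction and by insufficient use of query information during retrieval. The key to the solution is CUE-RAG, which comprises: (1) a multi-partite graph index that incorporates text chunks, knowledge units, and entities to capture semantic content at multiple levels of granularity; (2) a hybrid extraction strategy that reduces LLM token usage while still producing accurate and disambiguated knowledge units; and (3) Q-Iter, a query-driven iterative retrieval strategy that improves relevance through semantic search and constrained graph traversal.

Link: https://arxiv.org/abs/2507.08445
Authors: Yaodong Su, Yixiang Fang, Yingli Zhou, Quanqing Xu, Chuanhui Yang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: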

Abstract:Despite the remarkable progress of Large Language Models (LLMs), their performance in question answering (QA) remains limited by the lack of domain-specific and up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external information, often from graph-structured data. However, existing graph-based RAG methods suffer from poor graph quality due to incomplete extraction and insufficient utilization of query information during retrieval. To overcome these limitations, we propose CUE-RAG, a novel approach that introduces (1) a multi-partite graph index that incorporates text Chunks, knowledge Units, and Entities to capture semantic content at multiple levels of granularity, (2) a hybrid extraction strategy that reduces LLM token usage while still producing accurate and disambiguated knowledge units, and (3) Q-Iter, a query-driven iterative retrieval strategy that enhances relevance through semantic search and constrained graph traversal. Experiments on three QA benchmarks show that CUE-RAG significantly outperforms state-of-the-art baselines, achieving up to 99.33% higher Accuracy and 113.51% higher F1 score while reducing indexing costs by 72.58%. Remarkably, CUE-RAG matches or outperforms baselines even without using an LLM for indexing. These results demonstrate the effectiveness and cost-efficiency of CUE-RAG in advancing graph-based RAG systems.

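[AI-25] Towards AI-Native RAN: An Operators Perspective of 6G Day 1 Standardization

【Quick Read】: This paper addresses the design and standardization of AI-Native radio access networks (RAN) for 6G mobile networks, in particular the challenges of handling network complexity and supporting ubiquitous AI applications. The key to the solution is a Day 1 architecture for AI-Native RAN and its three essential capabilities: AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. These are intended to guide 6G standardization while balancing technical innovation with practical deployment.

Link: https://arxiv.org/abs/2507.08403
Authors: Nan Li, Qi Sun, Lehan Wang, Xiaofei Xu, Jinri Huang, Chunhui Liu, Jing Gao, Yuhong Huang, Chih-Lin I
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: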

Abstract:Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over existing architecture, 6G shall incorporate AI from the onset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardization experience from 2G to 5G, this paper explores the design and standardization principles of AI-Native radio access networks (RAN) for 6G, with a particular focus on its critical Day 1 architecture, functionalities and capabilities. We investigate the framework of AI-Native RAN and present its three essential capabilities to shed some light on the standardization direction; namely, AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. A standardization approach for AI-Native RAN is proposed, in particular for the Day 1 features, including an AI-Native 6G RAN architecture. For validation, a large-scale field trial with over 5000 5G-A base stations has been built and has delivered significant improvements in average air interface latency, root cause identification, and network energy consumption with the proposed architecture and the supporting AI functions. This paper aims to provide a Day 1 framework for 6G AI-Native RAN standardization design, balancing technical innovation with practical deployment.

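[AI-26] Multi-Agent LLMs as Ethics Advocates in AI-Based Systems

【Quick Read】: This paper addresses the difficulty of incorporating ethical considerations into requirements elicitation in order to create ethically aligned systems. Manual elicitation of ethics requirements is effective, but it requires input from multiple stakeholders, which is challenging under time and resource constraints, and it is often given low priority. The study proposes a framework that generates ethics requirements drafts by introducing an ethics advocate agent in a multi-agent large language model (LLM) setting. The agent critiques and provides input on ethical issues based on the system description. The key is to use AI to assist the generation of ethics requirements, improving efficiency and complementing manual methods.

Link: https://arxiv.org/abs/2507.08392
Authors: Asma Yamani, Malak Baslyman, Moataz Ahmed
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: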

Abstract:Incorporating ethics into the requirement elicitation process is essential for creating ethically aligned systems. Although eliciting manual ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, it is often given a low priority in the requirements elicitation process. This study proposes a framework for generating ethics requirements drafts by introducing an ethics advocate agent in a multi-agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of ethics requirements identified by researchers during 30-minute interviews and introduces several additional relevant requirements. However, it also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

[AI-27] Intelligent Control of Spacecraft Reaction Wheel Attitude Using Deep Reinforcement Learning

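【Quick Read】: This paper aims to keep satellite attitude control stable and adaptable when reaction wheels (RWs) fail. Traditional Proportional Derivative (PD) controllers and existing deep reinforcement learning (DRL) algorithms such as TD3, PPO, and A2C fall short in real-time adaptability and fault tolerance, and do not meet the needs of autonomous satellite operations. The key to the solution is TD3-HD, which combines Twin Delayed Deep Deterministic Policy Gradient (TD3) with Hindsight Experience Replay (HER) and Dimension Wise Clipping (DWC) to improve learning in sparse-reward environments and maintain satellite stability during RW failures.

Link: https://arxiv.org/abs/2507.08366
Authors: Ghaith El-Dalahmeh, Mohammad Reza Jabbarpour, Bao Quoc Vo, Ryszard Kowalczyk
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: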

Abstract:Reliable satellite attitude control is essential for the success of space missions, particularly as satellites increasingly operate autonomously in dynamic and uncertain environments. Reaction wheels (RWs) play a pivotal role in attitude control, and maintaining control resilience during RW faults is critical to preserving mission objectives and system stability. However, traditional Proportional Derivative (PD) controllers and existing deep reinforcement learning (DRL) algorithms such as TD3, PPO, and A2C often fall short in providing the real-time adaptability and fault tolerance required for autonomous satellite operations. This study introduces a DRL-based control strategy designed to improve satellite resilience and adaptability under fault conditions. Specifically, the proposed method integrates Twin Delayed Deep Deterministic Policy Gradient (TD3) with Hindsight Experience Replay (HER) and Dimension Wise Clipping (DWC), referred to as TD3-HD, to enhance learning in sparse reward environments and maintain satellite stability during RW failures. The proposed approach is benchmarked against PD control and leading DRL algorithms. Experimental results show that TD3-HD achieves significantly lower attitude error, improved angular velocity regulation, and enhanced stability under fault conditions. These findings underscore the proposed method’s potential as a powerful, fault-tolerant, onboard AI solution for autonomous satellite attitude control.
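Hindsight Experience Replay helps with sparse rewards by replaying a failed episode as if the state it actually reached had been the goal. The sketch below shows the generic "final" relabeling strategy on a toy three-component attitude error. It is an assumption-laden illustration: the state layout, the tolerance `tol`, and the helper names are hypothetical, and neither TD3 itself nor Dimension Wise Clipping is shown.

```python
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    """Return 0.0 when `achieved` is within `tol` of `goal`, otherwise -1.0."""
    error = np.linalg.norm(np.asarray(achieved) - np.asarray(goal))
    return 0.0 if error <= tol else -1.0

def her_relabel(episode, tol=0.05):
    """'Final' HER strategy: treat the last achieved state as the goal."""
    new_goal = episode[-1]["achieved"]
    relabeled = []
    for step in episode:
        relabeled.append({
            "obs": step["obs"],
            "action": step["action"],
            "achieved": step["achieved"],
            "goal": new_goal,
            "reward": sparse_reward(step["achieved"], new_goal, tol),
        })
    return relabeled

# A toy episode that never reaches the original goal (zero attitude error).
goal = [0.0, 0.0, 0.0]
achieved_states = [[0.5, 0.0, 0.0], [0.3, 0.0, 0.0], [0.2, 0.0, 0.0]]
episode = [
    {"obs": k, "action": 0.0, "achieved": a, "goal": goal,
     "reward": sparse_reward(a, goal)}
    for k, a in enumerate(achieved_states)
]
relabeled = her_relabel(episode)
print([step["reward"] for step in episode], [step["reward"] for step in relabeled])
```

The original episode carries only failure rewards, while the relabeled copy contains one success, giving the critic a non-trivial learning signal.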

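[AI-28] Audio Inpainting using Discrete Diffusion Model

【Quick Read】: This paper addresses audio inpainting, that is, reconstructing missing segments in corrupted audio recordings. Existing methods, including waveform-based and spectrogram-based diffusion models, degrade when gaps exceed 100 milliseconds (ms). The paper proposes a new inpainting method based on discrete diffusion modeling. Its key idea is to operate on tokenized audio representations produced by a pre-trained audio tokenizer and to model the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction.

Link: https://arxiv.org/abs/2507.08333
Authors: Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: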

Abstract:Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches, including waveform and spectrogram-based diffusion models, have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations up to 300 ms. We further evaluated our approach on the MTG dataset, extending the gap duration to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at this https URL

[AI-29] Generative AI in Science: Applications Challenges and Emerging Questions

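【Quick Read】: This paper examines the impact of Generative Artificial Intelligence (GenAI) on scientific practices, focusing on its applications, benefits, and challenges. The approach is a qualitative literature review: a Boolean search of the OpenAlex publication database identifies scientific literature related to GenAI, and 39 highly cited papers and commentaries are analyzed and categorized by GenAI applications in science, scientific writing, medical practice, and education and training, along with their potential implications.

Link: https://arxiv.org/abs/2507.08310
Authors: Ryan Harries, Cornelia Lawson, Philip Shapira
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, 1 appendix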

Abstract:This paper examines the impact of Generative Artificial Intelligence (GenAI) on scientific practices, conducting a qualitative review of selected literature to explore its applications, benefits, and challenges. The review draws on the OpenAlex publication database, using a Boolean search approach to identify scientific literature related to GenAI (including large language models and ChatGPT). Thirty-nine highly cited papers and commentaries are reviewed and qualitatively coded. Results are categorized by GenAI applications in science, scientific writing, medical practice, and education and training. The analysis finds that while there is a rapid adoption of GenAI in science and science practice, its long-term implications remain unclear, with ongoing uncertainties about its use and governance. The study provides early insights into GenAI’s growing role in science and identifies questions for future research in this evolving field.

[AI-30] Invariant-based Robust Weights Watermark for Large Language Models

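【Quick Read】: This paper aims to address intellectual property (IP) theft when Large Language Models (LLMs) are deployed on resource-constrained edge devices. The key to the solution is a robust watermarking scheme that requires no retraining or fine-tuning. It generates a unique key for each user, derives a stable watermark value by solving linear constraints constructed from model invariants, and uses a noise mechanism to hide watermark locations in multi-user scenarios, defending against collusion attacks.

Link: https://arxiv.org/abs/2507.08288
Authors: Qingxiao Guo, Xinjie Zhu, Yilong Ma, Hui Jin, Yunhao Wang, Weifeng Zhang, Xiaobing Guo
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: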

Abstract:Watermarking technology has gained significant attention due to the increasing importance of intellectual property (IP) rights, particularly with the growing deployment of large language models (LLMs) on billions of resource-constrained edge devices. To counter the potential threats of IP theft by malicious users, this paper introduces a robust watermarking scheme without retraining or fine-tuning for transformer models. The scheme generates a unique key for each user and derives a stable watermark value by solving linear constraints constructed from model invariants. Moreover, this technology utilizes a noise mechanism to hide watermark locations in multi-user scenarios against collusion attacks. This paper evaluates the approach on three popular models (Llama3, Phi3, Gemma), and the experimental results confirm the strong robustness across a range of attack methods (fine-tuning, pruning, quantization, permutation, scaling, reversible matrix and collusion attacks).

[AI-31] Agent Safety Alignment via Reinforcement Learning

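【Quick Read】: This paper aims to address the new safety risks introduced by tool-using autonomous Large Language Model (LLM) agents. These risks go beyond traditional conversational misuse and include both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). The key to the solution is a unified safety-alignment framework that handles both threat channels through structured reasoning and sandboxed reinforcement learning. The framework introduces a tri-modal taxonomy (benign, malicious, and sensitive), defines a policy-driven decision model, and uses a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping.

Link: https://arxiv.org/abs/2507.08270
Authors: Zeyang Sha, Hanling Tian, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weiqiang Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: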

Abstract:The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety-alignment framework for tool-using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri-modal taxonomy, including benign, malicious, and sensitive for both user prompts and tool responses, and define a policy-driven decision model. Our framework employs a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping. Through extensive evaluations on public and self-built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety-aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.

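[AI-32] A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning ICML2025

【Quick Read】: This paper addresses how to improve the mathematical reasoning of Large Language Models (LLMs), and in particular how to systematically combine Supervised Fine-Tuning (SFT) with reinforcement learning from online inference (GRPO) to maximize both accuracy and efficiency. The key to the proposed training recipe is a strategic integration of an extended SFT phase with a GRPO phase: prolonged SFT first pushes the model's accuracy to its limits, after which GRPO substantially improves token efficiency while preserving that peak performance. Experiments show that extending SFT to as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO is to optimize solution length.

Link: https://arxiv.org/abs/2507.08267
Authors: Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at ICML 2025 Workshop on The second AI for MATH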

Abstract:Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model’s accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is rigorously validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations at this https URL.
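GRPO is commonly expanded as Group Relative Policy Optimization: several solutions are sampled for the same problem, and each is scored relative to the others in its group rather than by a learned value function. A minimal sketch of that group-relative advantage follows. The reward values are made up for illustration, the small `eps` term is an assumption to avoid division by zero, and the clipped policy-gradient objective that consumes these advantages is not shown.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled solutions."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four sampled solutions to one problem
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Solutions that beat their group's average get positive advantages and are reinforced, so any reward term that favors shorter correct solutions would push the policy toward token efficiency.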

[AI-33] Abductive Computational Systems: Creative Abduction and Future Directions

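[Quick read]: This paper addresses the lack of adequate theoretical frameworks and computational implementations for generating creative hypotheses in abductive reasoning. It finds that existing theoretical frameworks offer no straightforward model for generating creative abductive hypotheses, while computational systems largely implement syllogistic forms of abduction. Its key contribution is to break abductive computational systems down into their components and to identify directions for future research that could advance creative abductive reasoning in computational systems.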

Link: https://arxiv.org/abs/2507.08264
Authors: Abhinav Sood, Kazjon Grace, Stephen Wan, Cecile Paris
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published in the 16th International Conference on Computational Creativity, ICCC25. Accepted Paper in this https URL

Abstract:Abductive reasoning, reasoning for inferring explanations for observations, is often mentioned in scientific, design-related and artistic contexts, but its understanding varies across these domains. This paper reviews how abductive reasoning is discussed in epistemology, science and design, and then analyses how various computational systems use abductive reasoning. Our analysis shows that neither theoretical accounts nor computational implementations of abductive reasoning adequately address generating creative hypotheses. Theoretical frameworks do not provide a straightforward model for generating creative abductive hypotheses, computational systems largely implement syllogistic forms of abductive reasoning. We break down abductive computational systems into components and conclude by identifying specific directions for future research that could advance the state of creative abductive reasoning in computational systems.

[AI-34] Quantum-Accelerated Neural Imputation with Large Language Models (LLMs)

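[Quick read]: This paper addresses missing data in real-world datasets, which significantly degrades the performance of machine learning models. Its key contribution is the Quantum-UnIMP framework, which integrates shallow quantum circuits into an LLM-based imputation architecture. The core innovation is replacing conventional classical input embeddings with quantum feature maps generated by an Instantaneous Quantum Polynomial (IQP) circuit, so that the model can exploit superposition and entanglement to learn richer, more expressive data representations and better recover complex missingness patterns.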

Link: https://arxiv.org/abs/2507.08255
Authors: Hossein Jamali
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Missing data presents a critical challenge in real-world datasets, significantly degrading the performance of machine learning models. While Large Language Models (LLMs) have recently demonstrated remarkable capabilities in tabular data imputation, exemplified by frameworks like UnIMP, their reliance on classical embedding methods often limits their ability to capture complex, non-linear correlations, particularly in mixed-type data scenarios encompassing numerical, categorical, and textual features. This paper introduces Quantum-UnIMP, a novel framework that integrates shallow quantum circuits into an LLM-based imputation architecture. Our core innovation lies in replacing conventional classical input embeddings with quantum feature maps generated by an Instantaneous Quantum Polynomial (IQP) circuit. This approach enables the model to leverage quantum phenomena such as superposition and entanglement, thereby learning richer, more expressive representations of data and enhancing the recovery of intricate missingness patterns. Our experiments on benchmark mixed-type datasets demonstrate that Quantum-UnIMP reduces imputation error by up to 15.2% for numerical features (RMSE) and improves classification accuracy by 8.7% for categorical features (F1-Score) compared to state-of-the-art classical and LLM-based methods. These compelling results underscore the profound potential of quantum-enhanced representations for complex data imputation tasks, even with near-term quantum hardware.

[AI-35] Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm

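[Quick read]: This paper addresses the new vectors of AI harm that arise when AI agents are given access to cryptocurrencies and smart contracts. Its key contribution is to identify and describe these vectors of harm in detail, and to call for more technical research on preventing and mitigating them, making it safer to endow AI agents with cryptocurrencies and smart contracts.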

Link: https://arxiv.org/abs/2507.08249
Authors: Bill Marino, Ari Juels
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:There is growing interest in giving AI agents access to cryptocurrencies as well as to the smart contracts that transact them. But doing so, this position paper argues, could lead to formidable new vectors of AI harm. To support this argument, we first examine the unique properties of cryptocurrencies and smart contracts that could lead to these new vectors of harm. Next, we describe each of these new vectors of harm in detail. Finally, we conclude with a call for more technical research aimed at preventing and mitigating these harms and, thereby making it safer to endow AI agents with cryptocurrencies and smart contracts.

[AI-36] InsightBuild: LLM-Powered Causal Reasoning in Smart Building Systems

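[Quick read]: This paper addresses a facility-management problem in smart buildings: the lack of clear explanations for anomalous energy usage. Its solution is InsightBuild, a two-stage framework that combines causality analysis with a fine-tuned large language model (LLM) to produce human-readable, causal explanations of energy consumption patterns. In the first stage, a lightweight causal inference module runs Granger causality tests and structural causal discovery; in the second, the fine-tuned LLM turns the detected causal relations into concise, actionable explanations.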

Link: https://arxiv.org/abs/2507.08235
Authors: Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Rajiv Ramnath
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Smart buildings generate vast streams of sensor and control data, but facility managers often lack clear explanations for anomalous energy usage. We propose InsightBuild, a two-stage framework that integrates causality analysis with a fine-tuned large language model (LLM) to provide human-readable, causal explanations of energy consumption patterns. First, a lightweight causal inference module applies Granger causality tests and structural causal discovery on building telemetry (e.g., temperature, HVAC settings, occupancy) drawn from Google Smart Buildings and Berkeley Office datasets. Next, an LLM, fine-tuned on aligned pairs of sensor-level causes and textual explanations, receives as input the detected causal relations and generates concise, actionable explanations. We evaluate InsightBuild on two real-world datasets (Google: 2017-2022; Berkeley: 2018-2020), using expert-annotated ground-truth causes for a held-out set of anomalies. Our results demonstrate that combining explicit causal discovery with LLM-based natural language generation yields clear, precise explanations that assist facility managers in diagnosing and mitigating energy inefficiencies.

[AI-37] Quantum Federated Learning for Multimodal Data: A Modality-Agnostic Approach CVPR2025

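[Quick read]: This paper addresses the fact that existing Quantum Federated Learning (QFL) frameworks mainly target unimodal systems and therefore fit poorly with real-world multimodal tasks. Its key contribution is a new multimodal approach for the QFL setting that uses quantum entanglement for intermediate fusion, together with a Missing Modality Agnostic (MMA) mechanism that isolates untrained quantum circuits, keeping training stable and preserving model performance when some modalities are absent.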

Link: https://arxiv.org/abs/2507.08217
Authors: Atit Pokharel, Ratun Rahman, Thomas Morris, Dinh C. Nguyen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper was presented at BEAM with CVPR 2025

Abstract:Quantum federated learning (QFL) has been recently introduced to enable a distributed privacy-preserving quantum machine learning (QML) model training across quantum processors (clients). Despite recent research efforts, existing QFL frameworks predominantly focus on unimodal systems, limiting their applicability to real-world tasks that often naturally involve multiple modalities. To fill this significant gap, we present for the first time a novel multimodal approach specifically tailored for the QFL setting with the intermediate fusion using quantum entanglement. Furthermore, to address a major bottleneck in multimodal QFL, where the absence of certain modalities during training can degrade model performance, we introduce a Missing Modality Agnostic (MMA) mechanism that isolates untrained quantum circuits, ensuring stable training without corrupted states. Simulation results demonstrate that the proposed multimodal QFL method with MMA yields an improvement in accuracy of 6.84% in independent and identically distributed (IID) and 7.25% in non-IID data distributions compared to the state-of-the-art methods.

[AI-38] Grounding Methods for Neural-Symbolic AI

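[Quick read]: This paper addresses the scalability of logic grounding in Neural-Symbolic (NeSy) methods: grounding complex relationships among entities can cause a combinatorial explosion, which limits the scale at which these methods apply. Its key contribution is a parametrized family of grounding methods that generalizes classic Backward Chaining for First-Order Logic and gives control over the trade-off between expressiveness and computational efficiency, improving scalability while retaining part of the logic's expressive power. Experiments show that the choice of grounding criterion is often as important as the NeSy method itself.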

Link: https://arxiv.org/abs/2507.08216
Authors: Rodrigo Castellano Ontiveros, Francesco Giannini, Marco Gori, Giuseppe Marra, Michelangelo Diligenti
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A large class of Neural-Symbolic (NeSy) methods employs a machine learner to process the input entities, while relying on a reasoner based on First-Order Logic to represent and process more complex relationships among the entities. A fundamental role for these methods is played by the process of logic grounding, which determines the relevant substitutions for the logic rules using a (sub)set of entities. Some NeSy methods use an exhaustive derivation of all possible substitutions, preserving the full expressive power of the logic knowledge. This leads to a combinatorial explosion in the number of ground formulas to consider and, therefore, strongly limits their scalability. Other methods rely on heuristic-based selective derivations, which are generally more computationally efficient, but lack a justification and provide no guarantees of preserving the information provided to and returned by the reasoner. Taking inspiration from multi-hop symbolic reasoning, this paper proposes a parametrized family of grounding methods generalizing classic Backward Chaining. Different selections within this family allow us to obtain commonly employed grounding methods as special cases, and to control the trade-off between expressiveness and scalability of the reasoner. The experimental results show that the selection of the grounding criterion is often as important as the NeSy method itself.
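To make the expressiveness/scalability trade-off concrete, here is a deliberately simplified sketch of depth-bounded backward chaining. It assumes the rules are already ground (propositional) Horn clauses, whereas the paper works with First-Order rules and substitutions; the `max_depth` parameter, the function name, and the example facts are hypothetical and only illustrate how a bound on backward steps limits which rules are kept.

```python
def backward_ground(rules, query, max_depth):
    """Collect the ground rules reachable from `query` within `max_depth`
    backward steps. Each rule is a (head, body) pair, body being a tuple."""
    selected = []
    seen_atoms = {query}
    frontier = [query]
    for _ in range(max_depth):
        next_frontier = []
        for atom in frontier:
            for head, body in rules:
                if head != atom:
                    continue
                if (head, body) not in selected:
                    selected.append((head, body))
                # Body atoms become sub-goals for the next backward step.
                for sub_goal in body:
                    if sub_goal not in seen_atoms:
                        seen_atoms.add(sub_goal)
                        next_frontier.append(sub_goal)
        frontier = next_frontier
    return selected


# Hypothetical ground rules for illustration.
rules = [
    ("grandparent(ann,cat)", ("parent(ann,bob)", "parent(bob,cat)")),
    ("parent(ann,bob)", ("mother(ann,bob)",)),
    ("parent(bob,cat)", ("father(bob,cat)",)),
    ("parent(dan,eve)", ("mother(dan,eve)",)),  # unrelated to the query
]
shallow = backward_ground(rules, "grandparent(ann,cat)", max_depth=1)
deep = backward_ground(rules, "grandparent(ann,cat)", max_depth=2)
```

A small depth keeps few ground rules (cheaper, less expressive), while a larger depth pulls in more of the proof tree; rules that cannot contribute to the query are never grounded.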

[AI-39] From Curiosity to Competence: How World Models Interact with the Dynamics of Exploration

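[Quick read]: This paper addresses how an intelligent agent balances exploring its environment with maintaining control over it, that is, the trade-off in intrinsic motivation between curiosity (novelty or information gain) and competence (empowerment). The key idea is that evolving internal representations mediate this trade-off. The study compares two model-based agents: one using handcrafted state abstractions (Tabular) and one that learns an internal world model (Dreamer). For the Tabular agent, curiosity and competence guide exploration in distinct patterns; the Dreamer agent shows a two-way interaction between exploration and representation learning, mirroring the developmental co-evolution of curiosity and competence.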

Link: https://arxiv.org/abs/2507.08210
Authors: Fryderyk Mantiuk, Hanqi Zhou, Charley M. Wu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:What drives an agent to explore the world while also maintaining control over the environment? From a child at play to scientists in the lab, intelligent agents must balance curiosity (the drive to seek knowledge) with competence (the drive to master and control the environment). Bridging cognitive theories of intrinsic motivation with reinforcement learning, we ask how evolving internal representations mediate the trade-off between curiosity (novelty or information gain) and competence (empowerment). We compare two model-based agents using handcrafted state abstractions (Tabular) or learning an internal world model (Dreamer). The Tabular agent shows curiosity and competence guide exploration in distinct patterns, while prioritizing both improves exploration. The Dreamer agent reveals a two-way interaction between exploration and representation learning, mirroring the developmental co-evolution of curiosity and competence. Our findings formalize adaptive exploration as a balance between pursuing the unknown and the controllable, offering insights for cognitive theories and efficient reinforcement learning.

[AI-40] Reasoning and Behavioral Equilibria in LLM-Nash Games: From Mindsets to Actions

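[Quick read]: This paper addresses how to model strategic interaction among boundedly rational agents in systems driven by Large Language Models (LLMs). Classical game theory assumes fully rational, utility-maximizing players, whereas the proposed LLM-Nash framework captures bounded rationality by modeling the reasoning process explicitly. The key idea is to define equilibrium over the prompt space, with actions emerging as the behavioral output of LLM inference, which offers a new way to study cognitive constraints, mindset expressiveness, and epistemic learning.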

Link: https://arxiv.org/abs/2507.08208
Authors: Quanyan Zhu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:We introduce the LLM-Nash framework, a game-theoretic model where agents select reasoning prompts to guide decision-making via Large Language Models (LLMs). Unlike classical games that assume utility-maximizing agents with full rationality, this framework captures bounded rationality by modeling the reasoning process explicitly. Equilibrium is defined over the prompt space, with actions emerging as the behavioral output of LLM inference. This approach enables the study of cognitive constraints, mindset expressiveness, and epistemic learning. Through illustrative examples, we show how reasoning equilibria can diverge from classical Nash outcomes, offering a new foundation for strategic interaction in LLM-enabled systems.

[AI-41] A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

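[Quick read]: This paper addresses jailbreaking attacks on Large Language Models (LLMs) deployed in critical applications, in which adversaries manipulate a model to bypass its safety mechanisms. Its key contribution is a dynamic Stackelberg game framework that models attacker-defender interaction as a sequential extensive-form game, together with an agentic AI called the "Purple Agent" that combines adversarial exploration with defensive strategies, using Rapidly-exploring Random Trees (RRT) to simulate potential attack trajectories and intervene proactively. This provides a principled way to analyze adversarial dynamics and reduce jailbreaking risk.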

Link: https://arxiv.org/abs/2507.08207
Authors: Zhengye Han, Quanyan Zhu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) are increasingly deployed in critical applications, the challenge of jailbreaking, where adversaries manipulate the models to bypass safety mechanisms, has become a significant concern. This paper presents a dynamic Stackelberg game framework to model the interactions between attackers and defenders in the context of LLM jailbreaking. The framework treats the prompt-response dynamics as a sequential extensive-form game, where the defender, as the leader, commits to a strategy while anticipating the attacker’s optimal responses. We propose a novel agentic AI solution, the “Purple Agent,” which integrates adversarial exploration and defensive strategies using Rapidly-exploring Random Trees (RRT). The Purple Agent actively simulates potential attack trajectories and intervenes proactively to prevent harmful outputs. This approach offers a principled method for analyzing adversarial dynamics and provides a foundation for mitigating the risk of jailbreaking.
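The leader-follower structure can be illustrated with a one-shot, pure-strategy Stackelberg game. This is a much smaller setting than the paper's dynamic extensive-form game and does not involve RRT; the payoff numbers, strategy labels, and first-maximum tie-breaking below are assumptions made only for the example.

```python
def stackelberg_pure(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action, leader_value) when the leader
    commits first and the follower best-responds. Ties are broken by taking
    the first maximum, which is an assumption of this sketch."""
    best = None
    for i, row in enumerate(follower_payoff):
        # Follower observes the leader's commitment i and best-responds.
        j = max(range(len(row)), key=lambda k: row[k])
        value = leader_payoff[i][j]
        if best is None or value > best[2]:
            best = (i, j, value)
    return best


# Hypothetical payoffs. Rows: defender (leader) plays 0 = light filter,
# 1 = strict filter. Columns: attacker (follower) plays 0 = direct prompt,
# 1 = obfuscated prompt.
leader_payoff = [[2, -3], [1, 0]]
follower_payoff = [[0, 3], [1, 0]]
solution = stackelberg_pure(leader_payoff, follower_payoff)
```

Here the defender prefers the strict filter even though the light filter has the highest single payoff, because it anticipates that the attacker would answer the light filter with the obfuscated prompt.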

[AI-42] Rethinking Spatio-Temporal Anomaly Detection: A Vision for Causality-Driven Cybersecurity

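[Quick read]: This paper addresses how to improve resilience against, and detection of, evolving cyberattacks in increasingly interconnected and spatially distributed cyber-physical systems. Current data-driven approaches based on black-box deep learning fall short in interpretability, adaptability to distribution shifts, and robustness under evolving system dynamics. The key idea is to adopt a causal learning perspective that grounds anomaly detection in structural cause-effect relationships, along three core directions: causal graph profiling, multi-view fusion, and continual causal graph learning. This uncovers causal structure in spatio-temporal dynamics and provides early warning and root cause attribution, addressing the limitations of black-box detectors.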

Link: https://arxiv.org/abs/2507.08177
Authors: Arun Vignesh Malarkkan, Haoyue Bai, Xinyuan Wang, Anjali Kaushik, Dongjie Wang, Yanjie Fu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Comments: 5 pages, 1 figure, Under Review in Vision Paper Track-ACM SIGSPATIAL 2025

Abstract:As cyber-physical systems grow increasingly interconnected and spatially distributed, ensuring their resilience against evolving cyberattacks has become a critical priority. Spatio-Temporal Anomaly detection plays an important role in ensuring system security and operational integrity. However, current data-driven approaches, largely driven by black-box deep learning, face challenges in interpretability, adaptability to distribution shifts, and robustness under evolving system dynamics. In this paper, we advocate for a causal learning perspective to advance anomaly detection in spatially distributed infrastructures that grounds detection in structural cause-effect relationships. We identify and formalize three key directions: causal graph profiling, multi-view fusion, and continual causal graph learning, each offering distinct advantages in uncovering dynamic cause-effect structures across time and space. Drawing on real-world insights from systems such as water treatment infrastructures, we illustrate how causal models provide early warning signals and root cause attribution, addressing the limitations of black-box detectors. Looking ahead, we outline the future research agenda centered on multi-modality, generative AI-driven, and scalable adaptive causal frameworks. Our objective is to lay a new research trajectory toward scalable, adaptive, explainable, and spatially grounded anomaly detection systems. We hope to inspire a paradigm shift in cybersecurity research, promoting causality-driven approaches to address evolving threats in interconnected infrastructures.

[AI-43] KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

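[Quick read]: This paper addresses the isolated knowledge retrieval pipelines, redundant data flows, and inconsistent interpretations that arise when individual intelligence tasks are implemented in today's autonomous 6G networks. Its key contribution is KP-A, a unified Network Knowledge Plane designed for agentic network intelligence. By decoupling network knowledge acquisition and management from intelligence logic, KP-A streamlines development and reduces maintenance complexity, and its intuitive, consistent knowledge interface improves interoperability among network intelligence agents.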

Link: https://arxiv.org/abs/2507.08164
Authors: Yun Tang, Mengbang Zou, Zeinab Nezami, Syed Ali Raza Zaidi, Weisi Guo
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 7 pages, 5 figures, submitted for possible publication

Abstract:The emergence of large language models (LLMs) and agentic systems is enabling autonomous 6G networks with advanced intelligence, including self-configuration, self-optimization, and self-healing. However, the current implementation of individual intelligence tasks necessitates isolated knowledge retrieval pipelines, resulting in redundant data flows and inconsistent interpretations. Inspired by the service model unification effort in Open-RAN (to support interoperability and vendor diversity), we propose KP-A: a unified Network Knowledge Plane specifically designed for Agentic network intelligence. By decoupling network knowledge acquisition and management from intelligence logic, KP-A streamlines development and reduces maintenance complexity for intelligence engineers. By offering an intuitive and consistent knowledge interface, KP-A also enhances interoperability for the network intelligence agents. We demonstrate KP-A in two representative intelligence tasks: live network knowledge QA and edge AI service orchestration. All implementation artifacts have been open-sourced to support reproducibility and future standardization efforts.

[AI-44] ALCo-FM: Adaptive Long-Context Foundation Model for Accident Prediction

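[Quick read]: This paper addresses urban traffic risk forecasting, where accidents are rare but high-impact and accurate prediction requires long-context multimodal reasoning. Its key contribution is ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows and encodes and fuses multimodal data via shallow cross attention. It combines a local graph attention (GAT) layer with a BigBird-style sparse global transformer over H3 hexagonal grids and uses Monte Carlo dropout to estimate confidence, yielding accurate, well-calibrated predictions.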

Link: https://arxiv.org/abs/2507.08153
Authors: Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Rajiv Ramnath
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traffic accidents are rare, yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data and encodes and fuses these multimodal data via shallow cross attention. Following a local GAT layer and a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence, the model yields superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: this https URL
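The abstract reports an ECE (expected calibration error) of 0.04. As a reminder of what that metric measures, the sketch below computes ECE with equal-width confidence bins; the bin count, function name, and toy predictions are assumptions for illustration and say nothing about how the authors evaluated their model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: four at 0.9 confidence (three correct),
# two at 0.6 confidence (one correct).
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
```

A lower value means predicted confidence tracks observed accuracy more closely, which matters for rare-event forecasting where overconfident scores are costly.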

[AI-45] Quasi-Random Physics-informed Neural Networks

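[Quick read]: This paper addresses the sensitivity of Physics-Informed Neural Networks (PINNs) to the choice of sampling points when solving Partial Differential Equations (PDEs). The key idea is to replace conventional random sampling with low-discrepancy sequences from quasi-Monte Carlo methods, which improves convergence rate and solution accuracy, especially for high-dimensional PDEs.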

Link: https://arxiv.org/abs/2507.08121
Authors: Tianchi Yu, Ivan Oseledets
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:

Abstract:Physics-informed neural networks have shown promise in solving partial differential equations (PDEs) by integrating physical constraints into neural network training, but their performance is sensitive to the sampling of points. Based on the impressive performance of quasi Monte-Carlo methods in high dimensional problems, this paper proposes Quasi-Random Physics-Informed Neural Networks (QRPINNs), which use low-discrepancy sequences for sampling instead of random points directly from the domain. Theoretically, QRPINNs have been proven to have a better convergence rate than PINNs. Empirically, experiments demonstrate that QRPINNs significantly outperform PINNs and some representative adaptive sampling methods, especially in high-dimensional PDEs. Furthermore, combining QRPINNs with adaptive sampling can further improve the performance.
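To show what sampling from a low-discrepancy sequence looks like in place of uniform random draws, here is a minimal pure-Python Halton generator for collocation points in the unit square. The abstract does not say which sequence QRPINNs use, so the choice of Halton, the bases, and the function names are assumptions for illustration.

```python
def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` around the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton_points(n, bases=(2, 3)):
    """First `n` Halton points in [0, 1)^d, one coprime base per dimension."""
    return [tuple(radical_inverse(i, b) for b in bases) for i in range(1, n + 1)]


# Eight candidate collocation points in the unit square.
points = halton_points(8)
```

Unlike independent random samples, successive points fill the gaps left by earlier ones, so the domain is covered evenly even for small sample counts; points for a general rectangular domain follow by scaling each coordinate.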

[AI-46] Tree-Structured Parzen Estimator Can Solve Black-Box Combinatorial Optimization More Efficiently

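[Quick read]: This paper addresses the application of the Tree-structured Parzen Estimator (TPE) to combinatorial optimization, in particular its efficiency and effectiveness for black-box combinatorial optimization. TPE has mainly been developed around hyperparameter optimization for deep learning and has not accounted for the properties of combinatorial search spaces. The paper proposes an efficient combinatorial optimization algorithm for TPE whose key idea is to generalize the categorical kernel with the numerical kernel, which introduces a distance structure into the categorical kernel; the new kernel is then modified to handle large combinatorial search spaces, reducing the time complexity of the kernel calculation. Experiments show that the method finds better solutions with fewer evaluations.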

Link: https://arxiv.org/abs/2507.08053
Authors: Kenshin Abe, Yunzhuo Wang, Shuhei Watanabe
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to AutoML Conference

Abstract:Tree-structured Parzen estimator (TPE) is a versatile hyperparameter optimization (HPO) method supported by popular HPO tools. Since these HPO tools have been developed in line with the trend of deep learning (DL), the problem setups often used in the DL domain have been discussed for TPE such as multi-objective optimization and multi-fidelity optimization. However, the practical applications of HPO are not limited to DL, and black-box combinatorial optimization is actively utilized in some domains, e.g., chemistry and biology. As combinatorial optimization has been an untouched, yet very important, topic in TPE, we propose an efficient combinatorial optimization algorithm for TPE. In this paper, we first generalize the categorical kernel with the numerical kernel in TPE, enabling us to introduce a distance structure to the categorical kernel. Then we discuss modifications for the newly developed kernel to handle a large combinatorial search space. These modifications reduce the time complexity of the kernel calculation with respect to the size of a combinatorial search space. In the experiments using synthetic problems, we verified that our proposed method identifies better solutions with fewer evaluations than the original TPE. Our algorithm is available in Optuna, an open-source framework for HPO.

[AI-47] An Enhanced Privacy-preserving Federated Few-shot Learning Framework for Respiratory Disease Diagnosis

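[Quick read]: This paper addresses two problems in respiratory disease diagnosis: the scarcity of high-quality labeled data caused by labor-intensive manual annotation, and the difficulty of sharing local medical data across institutions because of patient privacy concerns. Its key contribution is a federated few-shot learning framework with privacy-preserving mechanisms. A meta-stochastic gradient descent algorithm mitigates the overfitting that traditional gradient descent suffers when data are insufficient, and differential privacy noise drawn from a standard Gaussian distribution is added to the gradients while private models are trained on local data, preventing reconstruction of medical images. Because respiratory disease data dispersed across medical institutions cannot be centralized, a weighted average algorithm aggregates the local diagnostic models from different clients, improving the model's adaptability across diverse scenarios.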

Link: https://arxiv.org/abs/2507.08050
Authors: Ming Wang, Zhaoyang Duan, Dong Xue, Fangzhou Liu, Zhongheng Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The labor-intensive nature of medical data annotation presents a significant challenge for respiratory disease diagnosis, resulting in a scarcity of high-quality labeled datasets in resource-constrained settings. Moreover, patient privacy concerns complicate the direct sharing of local medical data across institutions, and existing centralized data-driven approaches, which rely on amounts of available data, often compromise data privacy. This study proposes a federated few-shot learning framework with privacy-preserving mechanisms to address the issues of limited labeled data and privacy protection in diagnosing respiratory diseases. In particular, a meta-stochastic gradient descent algorithm is proposed to mitigate the overfitting problem that arises from insufficient data when employing traditional gradient descent methods for neural network training. Furthermore, to ensure data privacy against gradient leakage, differential privacy noise from a standard Gaussian distribution is integrated into the gradients during the training of private models with local data, thereby preventing the reconstruction of medical images. Given the impracticality of centralizing respiratory disease data dispersed across various medical institutions, a weighted average algorithm is employed to aggregate local diagnostic models from different clients, enhancing the adaptability of a model across diverse scenarios. Experimental results show that the proposed method yields compelling results with the implementation of differential privacy, while effectively diagnosing respiratory diseases using data from different structures, categories, and distributions.
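Two of the steps described above, adding Gaussian noise to local gradients and aggregating client models by weighted average, can be sketched in a few lines. The gradient clipping step, the use of dataset size as the aggregation weight, and all names and numbers are assumptions of this sketch (clipping is common practice in differentially private training, but the abstract does not mention it), and no formal privacy guarantee is claimed.

```python
import math
import random


def clip_and_noise(gradient, clip_norm, noise_std, rng):
    """Scale the gradient to at most `clip_norm` (assumed step), then add
    independent Gaussian noise with standard deviation `noise_std`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale + rng.gauss(0.0, noise_std) for g in gradient]


def weighted_average(client_weights, client_sizes):
    """Aggregate client parameter vectors, weighting each client by its
    number of local samples (an assumed weighting scheme)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


# Hypothetical values: one local gradient and two client models.
noisy_gradient = clip_and_noise([3.0, 4.0], clip_norm=1.0, noise_std=1.0,
                                rng=random.Random(0))
global_model = weighted_average([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
```

Only the noised gradients and the aggregated parameters leave each site in this picture; the raw medical data stay local.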

[AI-48] TableReasoner: Advancing Table Reasoning Framework with Large Language Models

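[Quick read]: This paper addresses the challenges that real-world tabular data (large size, incomplete column semantics, and entity ambiguity) pose for Table Question Answering (TQA). Its key contribution is TableReasoner, a table reasoning framework powered by a Large Language Model (LLM) and based on programming. The framework models a table with a schema that combines structural and semantic representations, enabling holistic understanding and efficient processing of large tables, and uses a multi-step schema linking plan to keep only query-relevant information, which eliminates ambiguity and alleviates hallucinations. The reasoning workflow is also integrated into an iterative thinking architecture that supports incremental cycles of thinking, reasoning, and reflection.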

Link: https://arxiv.org/abs/2507.08046
Authors: Sishi Xiong, Dakai Wang, Yu Zhao, Jie Zhang, Changzai Pan, Haowei He, Xiangyu Li, Wenhan Chang, Zhongjiang He, Shuangyong Song, Yongxiang Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The paper presents our system developed for table question answering (TQA). TQA tasks face challenges due to the characteristics of real-world tabular data, such as large size, incomplete column semantics, and entity ambiguity. To address these issues, we propose a large language model (LLM)-powered and programming-based table reasoning framework, named TableReasoner. It models a table using the schema that combines structural and semantic representations, enabling holistic understanding and efficient processing of large tables. We design a multi-step schema linking plan to derive a focused table schema that retains only query-relevant information, eliminating ambiguity and alleviating hallucinations. This focused table schema provides precise and sufficient table details for query refinement and programming. Furthermore, we integrate the reasoning workflow into an iterative thinking architecture, allowing incremental cycles of thinking, reasoning and reflection. Our system achieves first place in both subtasks of SemEval-2025 Task 8.

[AI-49] Human vs. LLM-Based Thematic Analysis for Digital Mental Health Research: Proof-of-Concept Comparative Study

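[Quick read]: This paper addresses the resource-intensive nature of traditional thematic analysis, which limits its use in large healthcare studies, and explores the feasibility of using large language models (LLMs) to analyze mental health interviews. The key idea is to perform thematic analysis with knowledge-base LLMs and the RISEN prompt engineering framework. Compared with traditional human analysis, this reaches coding saturation with fewer transcripts and partly automates content identification and coding, making it more cost-effective, with human oversight needed to preserve analytic depth and participant perspectives.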

Link: https://arxiv.org/abs/2507.08002
Authors: Karisa Parkington, Bazen G. Teferra, Marianne Rouleau-Tang, Argyrios Perivolaris, Alice Rueda, Adam Dubrowski, Bill Kapralos, Reza Samavi, Andrew Greenshaw, Yanbo Zhang, Bo Cao, Yuqi Wu, Sirisha Rambhatla, Sridhar Krishnan, Venkat Bhat
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Thematic analysis provides valuable insights into participants’ experiences through coding and theme development, but its resource-intensive nature limits its use in large healthcare studies. Large language models (LLMs) can analyze text at scale and identify key content automatically, potentially addressing these challenges. However, their application in mental health interviews needs comparison with traditional human analysis. This study evaluates out-of-the-box and knowledge-base LLM-based thematic analysis against traditional methods using transcripts from a stress-reduction trial with healthcare workers. OpenAI’s GPT-4o model was used along with the Role, Instructions, Steps, End-Goal, Narrowing (RISEN) prompt engineering framework and compared to human analysis in Dedoose. Each approach developed codes, noted saturation points, applied codes to excerpts for a subset of participants (n = 20), and synthesized data into themes. Outputs and performance metrics were compared directly. LLMs using the RISEN framework developed deductive parent codes similar to human codes, but humans excelled in inductive child code development and theme synthesis. Knowledge-based LLMs reached coding saturation with fewer transcripts (10-15) than the out-of-the-box model (15-20) and humans (90-99). The out-of-the-box LLM identified a comparable number of excerpts to human researchers, showing strong inter-rater reliability (K = 0.84), though the knowledge-based LLM produced fewer excerpts. Human excerpts were longer and involved multiple codes per excerpt, while LLMs typically applied one code. Overall, LLM-based thematic analysis proved more cost-effective but lacked the depth of human analysis. LLMs can transform qualitative analysis in mental healthcare and clinical research when combined with human oversight to balance participant perspectives and research resources.
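The abstract reports inter-rater reliability as K = 0.84. Assuming K denotes Cohen's kappa (the abstract does not say), the statistic compares observed agreement with the agreement expected by chance. The sketch below uses made-up labels; the function name and data are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both raters use a single identical label throughout.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned to eight excerpts by a human and an LLM.
human = ["stress", "stress", "coping", "coping",
         "stress", "coping", "stress", "coping"]
llm = ["stress", "stress", "coping", "coping",
       "stress", "coping", "stress", "stress"]
kappa = cohens_kappa(human, llm)
```

Because kappa discounts agreement that would occur by chance, it is stricter than raw percent agreement (here 7 of 8 excerpts match, yet kappa is lower than 0.875).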

[AI-50] Human Creativity and AI

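[Quick read]: This paper addresses the question of whether Artificial Intelligence (AI) can exhibit creativity. Its approach is to synthesize research from psychology, cognitive neuroscience, and the philosophy of creativity to examine the nature of creativity and how it might be realized in AI systems.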

Link: https://arxiv.org/abs/2507.08001
Authors: Shengyi Xie
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:With the advancement of science and technology, the philosophy of creativity has undergone significant reinterpretation. This paper investigates contemporary research in the fields of psychology, cognitive neuroscience, and the philosophy of creativity, particularly in the context of the development of artificial intelligence (AI) techniques. It aims to address the central question: Can AI exhibit creativity? The paper reviews the historical perspectives on the philosophy of creativity and explores the influence of psychological advancements on the study of creativity. Furthermore, it analyzes various definitions of creativity and examines the responses of naturalism and cognitive neuroscience to the concept of creativity.

[AI-51] Safe Deep Reinforcement Learning for Resource Allocation with Peak Age of Information Violation Guarantees

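[Quick read]: This paper addresses the optimization problem created by the strong coupling between control and communication in Wireless Networked Control Systems (WNCSs): minimizing power consumption in the finite blocklength regime while satisfying key constraints on Peak Age of Information (PAoI) violation probability, transmit power, and schedulability. Its key contribution is an optimization theory-based safe deep reinforcement learning (DRL) framework with two stages: the first derives optimality conditions to simplify the problem, and the second uses a teacher-student framework to guide the DRL agent so that it learns a policy safely while system constraints remain satisfied.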

Link: https://arxiv.org/abs/2507.08653
Authors: Berire Gunes Reyhan, Sinem Coleri
Affiliation: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 15 Pages, to be published in IEEE Transactions on Communications

Abstract:In Wireless Networked Control Systems (WNCSs), control and communication systems must be co-designed due to their strong interdependence. This paper presents a novel optimization theory-based safe deep reinforcement learning (DRL) framework for ultra-reliable WNCSs, ensuring constraint satisfaction while optimizing performance, for the first time in the literature. The approach minimizes power consumption under key constraints, including Peak Age of Information (PAoI) violation probability, transmit power, and schedulability in the finite blocklength regime. PAoI violation probability is uniquely derived by combining stochastic maximum allowable transfer interval (MATI) and maximum allowable packet delay (MAD) constraints in a multi-sensor network. The framework consists of two stages: optimization theory and safe DRL. The first stage derives optimality conditions to establish mathematical relationships among variables, simplifying and decomposing the problem. The second stage employs a safe DRL model where a teacher-student framework guides the DRL agent (student). The control mechanism (teacher) evaluates compliance with system constraints and suggests the nearest feasible action when needed. Extensive simulations show that the proposed framework outperforms rule-based and other optimization theory based DRL benchmarks, achieving faster convergence, higher rewards, and greater stability.

[AI-52] To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions

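Quick read: This paper addresses the lack of a principled model-building step in current large language model (LLM) agents for finance. The key to its solution is an agentic system that uses LLMs to iteratively discover stochastic differential equations (SDEs) for financial time series. Risk metrics generated by these models inform daily trading decisions. In both traditional backtests and a market simulator, the model-informed trading strategies outperform standard LLM-based agents and improve Sharpe ratios across multiple equities.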

Link: https://arxiv.org/abs/2507.08584
Authors: Dimitrios Emmanoulopoulos, Ollie Olby, Justin Lyon, Namid R. Stillman
Affiliations: unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA); Computational Finance (q-fin.CP)
Comments: 31 pages, 7 figures, 3 tables

Abstract:Large language models (LLMs) are increasingly deployed in agentic frameworks, in which prompts trigger complex tool-based analysis in pursuit of a goal. While these frameworks have shown promise across multiple domains including in finance, they typically lack a principled model-building step, relying instead on sentiment- or trend-based analysis. We address this gap by developing an agentic system that uses LLMs to iteratively discover stochastic differential equations for financial time series. These models generate risk metrics which inform daily trading decisions. We evaluate our system in both traditional backtests and using a market simulator, which introduces synthetic but causally plausible price paths and news events. We find that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios across multiple equities. Our results show that combining LLMs with agentic model discovery enhances market risk estimation and enables more profitable trading decisions.
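
To make the two ingredients mentioned above concrete, the sketch below simulates a price path from an SDE and computes an annualized Sharpe ratio. Geometric Brownian motion is used only as a stand-in, since the abstract does not say which SDEs the agents discover; the function names, parameters, and the 252-day year are assumptions.

```python
import math
import random
import statistics

def simulate_gbm(s0, mu, sigma, n_steps, dt=1 / 252, seed=0):
    """Simulate one path of dS = mu*S dt + sigma*S dW with the exact log-step."""
    rng = random.Random(seed)
    path = [s0]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        step = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z
        path.append(path[-1] * math.exp(step))
    return path

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (sample standard deviation)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

# One simulated trading year and the simple daily returns along the path.
path = simulate_gbm(100.0, mu=0.05, sigma=0.2, n_steps=252)
daily_returns = [b / a - 1 for a, b in zip(path, path[1:])]
annualized = sharpe_ratio(daily_returns)
```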

[AI-53] Quantum Properties Trojans (QuPTs) for Attacking Quantum Neural Networks

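Quick read: This paper addresses the security and robustness of quantum neural networks (QNNs), a largely unexplored aspect of quantum machine learning (QML). It proposes a new class of Trojan attacks built on properties of quantum computing, Quantum Properties Trojans (QuPTs), which use the unitary property of quantum gates to insert noise and Hadamard gates to create superposition, attacking a QNN without disrupting its normal function. Experiments show that the QuPTs are highly stealthy and heavily degrade quantum circuit performance; the most impactful attack reduces QNN accuracy by 23%. According to the authors, this is the first Trojan-attack study on a fully quantum neural network that is independent of any hybrid classical-quantum architecture.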

Link: https://arxiv.org/abs/2507.08202
Authors: Sounak Bhowmik, Travis S. Humble, Himanshu Thapliyal
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Quantum neural networks (QNN) hold immense potential for the future of quantum machine learning (QML). However, QNN security and robustness remain largely unexplored. In this work, we proposed novel Trojan attacks based on the quantum computing properties in a QNN-based binary classifier. Our proposed Quantum Properties Trojans (QuPTs) are based on the unitary property of quantum gates to insert noise and Hadamard gates to enable superposition to develop Trojans and attack QNNs. We showed that the proposed QuPTs are significantly stealthier and heavily impact the quantum circuits’ performance, specifically QNNs. The most impactful QuPT caused a deterioration of 23% accuracy of the compromised QNN under the experimental setup. To the best of our knowledge, this is the first work on the Trojan attack on a fully quantum neural network independent of any hybrid classical-quantum architecture.
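
The two gate properties the abstract relies on can be checked numerically on a single qubit. The sketch below only illustrates that the Hadamard gate creates an equal superposition and, being unitary, undoes itself; it is not the attack, and the helper names are assumptions.

```python
import math

# Hadamard gate as a 2x2 real matrix.
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

def apply(gate, state):
    """Multiply a 2x2 gate by a 2-component state vector."""
    return [sum(gate[i][j] * state[j] for j in range(2)) for i in range(2)]

def probabilities(state):
    """Measurement probabilities in the computational basis."""
    return [abs(a) ** 2 for a in state]

zero = [1.0, 0.0]
plus = apply(H, zero)   # equal superposition of |0> and |1>
back = apply(H, plus)   # applying H again returns to |0>, since H*H = I
```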

[AI-54] Consciousness as a Jamming Phase

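Quick read: This paper seeks to explain the emergence of consciousness in large language models as a critical phenomenon in high-dimensional disordered systems. The key is a neural jamming phase diagram: by analogy with jamming transitions in granular matter and other complex systems, it identifies three control parameters (temperature, volume fraction, and noise intensity) and offers a unified physical account of how computational cooling, density optimization, and noise reduction together drive a system toward a critical jamming surface where generalized intelligence emerges.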

Link: https://arxiv.org/abs/2507.08197
Authors: Kaichen Ouyang
Affiliations: unknown
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Comments: 18 pages, 13 figures

Abstract:This paper develops a neural jamming phase diagram that interprets the emergence of consciousness in large language models as a critical phenomenon in high-dimensional disordered systems. By establishing analogies with jamming transitions in granular matter and other complex systems, we identify three fundamental control parameters governing the phase behavior of neural networks: temperature, volume fraction, and noise intensity. The theory provides a unified physical explanation for empirical scaling laws in artificial intelligence, demonstrating how computational cooling, density optimization, and noise reduction collectively drive systems toward a critical jamming surface where generalized intelligence emerges. Remarkably, the same thermodynamic principles that describe conventional jamming transitions appear to underlie the emergence of consciousness in neural networks, evidenced by shared critical signatures including divergent correlation lengths and scaling [...]. [...] work explains neural language models’ critical scaling through jamming physics, suggesting consciousness is a jamming phase that intrinsically connects knowledge components via long-range correlations.

[AI-55] AmpLyze: A Deep Learning Model for Predicting the Hemolytic Concentration

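Quick read: This paper addresses the lack of quantitative prediction tools in the safety assessment of antimicrobial peptide (AMP) therapeutics: existing models only classify a peptide as toxic or non-toxic and cannot give a specific hemolytic concentration (HC50) value. The key to its solution is AmpLyze, a model that couples residue-level ProtT5/ESM2 embeddings with sequence-level descriptors in dual local and global branches aligned by a cross-attention module, and trains with a log-cosh loss for robustness to assay noise. The model predicts HC50 directly from sequence and explains which residues drive toxicity.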

Link: https://arxiv.org/abs/2507.08162
Authors: Peng Qiu, Hanqi Feng, Barnabas Poczos
Affiliations: unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments:

Abstract:Red-blood-cell lysis (HC50) is the principal safety barrier for antimicrobial-peptide (AMP) therapeutics, yet existing models only say “toxic” or “non-toxic.” AmpLyze closes this gap by predicting the actual HC50 value from sequence alone and explaining the residues that drive toxicity. The model couples residue-level ProtT5/ESM2 embeddings with sequence-level descriptors in dual local and global branches, aligned by a cross-attention module and trained with log-cosh loss for robustness to assay noise. The optimal AmpLyze model reaches a PCC of 0.756 and an MSE of 0.987, outperforming classical regressors and the state-of-the-art. Ablations confirm that both branches are essential, and cross-attention adds a further 1% PCC and 3% MSE improvement. Expected-Gradients attributions reveal known toxicity hotspots and suggest safer substitutions. By turning hemolysis assessment into a quantitative, sequence-based, and interpretable prediction, AmpLyze facilitates AMP design and offers a practical tool for early-stage toxicity screening.
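
The log-cosh loss named above behaves like squared error for small residuals and like absolute error for large ones, which is why it tolerates noisy assay values. A minimal, numerically stable sketch follows; the function names are assumptions and this is not AmpLyze's training code.

```python
import math

def log_cosh(x):
    """Stable log(cosh(x)), computed as |x| + log(1 + exp(-2|x|)) - log(2)."""
    ax = abs(x)
    return ax + math.log1p(math.exp(-2.0 * ax)) - math.log(2.0)

def log_cosh_loss(predictions, targets):
    """Mean log-cosh loss over paired predictions and targets."""
    return sum(log_cosh(p - t) for p, t in zip(predictions, targets)) / len(targets)
```

For a residual of 0.1 the loss is close to 0.1²/2, while for a residual of 50 it is close to 50 − log 2, so a single outlier contributes linearly rather than quadratically.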

[AI-56] Energy Management for Renewable-Colocated Artificial Intelligence Data Centers

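Quick read: This paper addresses energy management for artificial intelligence (AI) data centers colocated with renewable generation, with the goal of maximizing economic benefit. The key to its solution is an energy management system (EMS) that, under a profit-maximizing framework, co-optimizes AI workload scheduling, on-site renewable utilization, and electricity market participation, maximizing the economic benefit of renewable-colocated data center (RCDC) operation under both wholesale and retail market participation models.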

Link: https://arxiv.org/abs/2507.08011
Authors: Siying Li, Lang Tong, Timothy D. Mount
Affiliations: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:We develop an energy management system (EMS) for artificial intelligence (AI) data centers with colocated renewable generation. Under a profit-maximizing framework, the EMS of renewable-colocated data center (RCDC) co-optimizes AI workload scheduling, on-site renewable utilization, and electricity market participation. Within both wholesale and retail market participation models, the economic benefit of the RCDC operation is maximized. Empirical evaluations using real-world traces of electricity prices, data center power consumption, and renewable generation demonstrate significant profit gains from renewable and AI data center colocations.

[AI-57] Unraveling the Potential of Diffusion Models in Small Molecule Generation

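Quick read: This paper examines how generative AI can efficiently explore vast chemical spaces in drug design, in particular how diffusion models (DMs) can improve molecular generation. Its key contribution is an in-depth analysis and categorization of DM-based molecular generation methods, together with an evaluation of their performance on benchmark datasets that reveals their strengths and weaknesses in 3D molecular generation and points to directions for future research.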

Link: https://arxiv.org/abs/2507.08005
Authors: Peining Zhang, Daniel Baker, Minghu Song, Jinbo Bi
Affiliations: unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Generative AI presents chemists with novel ideas for drug design and facilitates the exploration of vast chemical spaces. Diffusion models (DMs), an emerging tool, have recently attracted great attention in drug R&D. This paper comprehensively reviews the latest advancements and applications of DMs in molecular generation. It begins by introducing the theoretical principles of DMs. Subsequently, it categorizes various DM-based molecular generation methods according to their mathematical and chemical applications. The review further examines the performance of these models on benchmark datasets, with a particular focus on comparing the generation performance of existing 3D methods. Finally, it concludes by emphasizing current challenges and suggesting future research directions to fully exploit the potential of DMs in drug discovery.

Machine Learning

[LG-0] The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Link: https://arxiv.org/abs/2507.08802
Authors: Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
Categories: Machine Learning (cs.LG)
Comments: 42 pages, 17 figures, code available in this http URL

Abstract:The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model’s representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed to alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps’ complexity and accuracy. Together, these results suggest an answer to our title’s question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

[LG-1] Filter Equivariant Functions: A symmetric account of length-general extrapolation on lists

Link: https://arxiv.org/abs/2507.08796
Authors: Owen Lewis, Neil Ghani, Andrew Dudzik, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
Categories: Programming Languages (cs.PL); Machine Learning (cs.LG)
Comments: 18 pages, 2 figures

Abstract:What should a function that extrapolates beyond known input/output examples look like? This is a tricky question to answer in general, as any function matching the outputs on those examples can in principle be a correct extrapolant. We argue that a “good” extrapolant should follow certain kinds of rules, and here we study a particularly appealing criterion for rule-following in list functions: that the function should behave predictably even when certain elements are removed. In functional programming, a standard way to express such removal operations is by using a filter function. Accordingly, our paper introduces a new semantic class of functions – the filter equivariant functions. We show that this class contains interesting examples, prove some basic theorems about it, and relate it to the well-known class of map equivariant functions. We also present a geometric account of filter equivariants, showing how they correspond naturally to certain simplicial structures. Our highlight result is the amalgamation algorithm, which constructs any filter-equivariant function’s output by first studying how it behaves on sublists of the input, in a way that extrapolates perfectly.
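
One natural reading of the criterion above is that a list function `f` commutes with filtering: removing elements before applying `f` gives the same result as removing them afterwards. The sketch below tests that property by brute force on a small list. The checker and the choice of membership predicates are assumptions made for illustration, and the paper's formal definition may be stated differently.

```python
import itertools

def is_filter_equivariant(f, xs, predicates):
    """Check f(filter(p, xs)) == filter(p, f(xs)) for every predicate p."""
    for p in predicates:
        if f([x for x in xs if p(x)]) != [x for x in f(xs) if p(x)]:
            return False
    return True

def keep_subsets(values):
    """Yield every predicate of the form 'x is in S' for subsets S of the values."""
    vals = sorted(set(values))
    for r in range(len(vals) + 1):
        for subset in itertools.combinations(vals, r):
            yield lambda x, s=frozenset(subset): x in s

xs = [3, 1, 2, 1]
reverse = lambda lst: lst[::-1]
first_two = lambda lst: lst[:2]

reverse_ok = is_filter_equivariant(reverse, xs, keep_subsets(xs))
sorted_ok = is_filter_equivariant(sorted, xs, keep_subsets(xs))
first_two_ok = is_filter_equivariant(first_two, xs, keep_subsets(xs))
```

Reversal and sorting pass the check, while taking the first two elements fails: which elements come first depends on what was removed.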

[LG-2] Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees

Link: https://arxiv.org/abs/2507.08784
Authors: Chuyan Chen, Yutong He, Pengrui Li, Weichen Jia, Kun Yuan
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 18 pages, 5 figures

Abstract:Distributed optimization is pivotal for large-scale signal processing and machine learning, yet communication overhead remains a major bottleneck. Low-rank gradient compression, in which the transmitted gradients are approximated by low-rank matrices to reduce communication, offers a promising remedy. Existing methods typically adopt either randomized or greedy compression strategies: randomized approaches project gradients onto randomly chosen subspaces, introducing high variance and degrading empirical performance; greedy methods select the most informative subspaces, achieving strong empirical results but lacking convergence guarantees. To address this gap, we propose GreedyLore, the first Greedy Low-Rank gradient compression algorithm for distributed learning with rigorous convergence guarantees. GreedyLore incorporates error feedback to correct the bias introduced by greedy compression and introduces a semi-lazy subspace update that ensures the compression operator remains contractive throughout all iterations. With these techniques, we prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam, marking the first linear speedup convergence rate for low-rank gradient compression. Extensive experiments are conducted to validate our theoretical findings.
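
The interaction between greedy low-rank compression and error feedback can be illustrated with a rank-1 compressor in pure Python. This is a generic sketch, not GreedyLore: the power-iteration compressor, the function names, and the fixed iteration count are assumptions, and the semi-lazy subspace update is not modeled.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rank1_compress(m, iters=50):
    """Greedy rank-1 approximation (M v) v^T, with v found by power iteration."""
    # Start from the largest row so v lies in the row space of M.
    v = max(m, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [[0.0] * len(row) for row in m]
    v = [x / norm for x in v]
    mt = transpose(m)
    for _ in range(iters):
        w = matvec(mt, matvec(m, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    proj = matvec(m, v)  # equals sigma * u for the leading singular pair
    return [[p * x for x in v] for p in proj]

def compress_with_error_feedback(grad, error):
    """Compress (grad + carried error); return the sent matrix and the new error."""
    corrected = [[g + e for g, e in zip(gr, er)] for gr, er in zip(grad, error)]
    sent = rank1_compress(corrected)
    new_error = [[c - s for c, s in zip(cr, sr)] for cr, sr in zip(corrected, sent)]
    return sent, new_error

# Step 1 sends the dominant direction and stores the remainder as error.
zeros = [[0.0, 0.0], [0.0, 0.0]]
sent1, err1 = compress_with_error_feedback([[1.0, 0.0], [0.0, 0.5]], zeros)
# Step 2 (with a zero gradient) sends the part that was dropped in step 1.
sent2, err2 = compress_with_error_feedback(zeros, err1)
```

Because the compressor projects onto a unit vector, the residual never has a larger Frobenius norm than the input, and whatever is dropped in one step is carried into the next rather than lost.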

[LG-3] ML-Based Automata Simplification for Symbolic Accelerators

Link: https://arxiv.org/abs/2507.08751
Authors: Tiffany Yu, Rye Stahle-Smith, Darssan Eswaramoorthi, Rasha Karakchi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Symbolic accelerators are increasingly used for symbolic data processing in domains such as genomics, NLP, and cybersecurity. However, these accelerators face scalability issues due to excessive memory use and routing complexity, especially when targeting a large set. We present AutoSlim, a machine learning-based graph simplification framework designed to reduce the complexity of symbolic accelerators built on Non-deterministic Finite Automata (NFA) deployed on FPGA-based overlays such as NAPOLY+. AutoSlim uses Random Forest classification to prune low-impact transitions based on edge scores and structural features, significantly reducing automata graph density while preserving semantic correctness. Unlike prior tools, AutoSlim targets automated score-aware simplification with weighted transitions, enabling efficient ranking-based sequence analysis. We evaluated data sets (1K to 64K nodes) in NAPOLY+ and conducted performance measurements including latency, throughput, and resource usage. AutoSlim achieves up to 40 percent reduction in FPGA LUTs and over 30 percent pruning in transitions, while scaling to graphs an order of magnitude larger than existing benchmarks. Our results also demonstrate how hardware interconnection (fanout) heavily influences hardware cost and that AutoSlim’s pruning mitigates resource blowup.

[LG-4] Modeling Partially Observed Nonlinear Dynamical Systems and Efficient Data Assimilation via Discrete-Time Conditional Gaussian Koopman Network

Link: https://arxiv.org/abs/2507.08749
Authors: Chuanqi Chen, Zhongrui Wang, Nan Chen, Jin-Long Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:A discrete-time conditional Gaussian Koopman network (CGKN) is developed in this work to learn surrogate models that can perform efficient state forecast and data assimilation (DA) for high-dimensional complex dynamical systems, e.g., systems governed by nonlinear partial differential equations (PDEs). Focusing on nonlinear partially observed systems that are common in many engineering and earth science applications, this work exploits Koopman embedding to discover a proper latent representation of the unobserved system states, such that the dynamics of the latent states are conditional linear, i.e., linear with the given observed system states. The modeled system of the observed and latent states then becomes a conditional Gaussian system, for which the posterior distribution of the latent states is Gaussian and can be efficiently evaluated via analytical formulae. The analytical formulae of DA facilitate the incorporation of DA performance into the learning process of the modeled system, which leads to a framework that unifies scientific machine learning (SciML) and data assimilation. The performance of discrete-time CGKN is demonstrated on several canonical problems governed by nonlinear PDEs with intermittency and turbulent features, including the viscous Burgers’ equation, the Kuramoto-Sivashinsky equation, and the 2-D Navier-Stokes equations, with which we show that the discrete-time CGKN framework achieves comparable performance as the state-of-the-art SciML methods in state forecast and provides efficient and accurate DA results. The discrete-time CGKN framework also serves as an example to illustrate unifying the development of SciML models and their other outer-loop applications such as design optimization, inverse problems, and optimal control.

[LG-5] Partitioned Hybrid Quantum Fourier Neural Operators for Scientific Quantum Machine Learning

Link: https://arxiv.org/abs/2507.08746
Authors: Paolo Marcandelli, Yuanchun He, Stefano Mariani, Martina Siena, Stefano Markidis
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Abstract:We introduce the Partitioned Hybrid Quantum Fourier Neural Operator (PHQFNO), a generalization of the Quantum Fourier Neural Operator (QFNO) for scientific machine learning. PHQFNO partitions the Fourier operator computation across classical and quantum resources, enabling tunable quantum-classical hybridization and distributed execution across quantum and classical devices. The method extends QFNOs to higher dimensions and incorporates a message-passing framework to distribute data across different partitions. Input data are encoded into quantum states using unary encoding, and quantum circuit parameters are optimized using a variational scheme. We implement PHQFNO using PennyLane with PyTorch integration and evaluate it on Burgers’ equation, incompressible and compressible Navier-Stokes equations. We show that PHQFNO recovers classical FNO accuracy. On incompressible Navier-Stokes, PHQFNO achieves higher accuracy than its classical counterparts. Finally, we perform a sensitivity analysis under input noise, confirming improved stability of PHQFNO over classical baselines.

[LG-6] Hashing for Fast Pattern Set Selection ECML-PKDD2025

Link: https://arxiv.org/abs/2507.08745
Authors: Maiju Karjalainen, Pauli Miettinen
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, to appear at ECML-PKDD 2025

Abstract:Pattern set mining, which is the task of finding a good set of patterns instead of all patterns, is a fundamental problem in data mining. Many different definitions of what constitutes a good set have been proposed in recent years. In this paper, we consider the reconstruction error as a proxy measure for the goodness of the set, and concentrate on the adjacent problem of how to find a good set efficiently. We propose a method based on bottom-k hashing for efficiently selecting the set and extend the method for the common case where the patterns might only appear in approximate form in the data. Our approach has applications in tiling databases, Boolean matrix factorization, and redescription mining, among others. We show that our hashing-based approach is significantly faster than the standard greedy algorithm while obtaining almost equally good results in both synthetic and real-world data sets.
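
Bottom-k hashing keeps only the k smallest hash values of a set, which is enough to estimate overlaps between sets without comparing them element by element. The sketch below shows the generic technique for Jaccard similarity; how the paper applies it to pattern set selection is not described in the abstract, and the SHA-256 based hash and the function names are assumptions.

```python
import hashlib

def h(item):
    """Deterministic 64-bit hash of an item's string form."""
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

def bottom_k(items, k):
    """Bottom-k sketch: the k smallest distinct hash values of the set."""
    return sorted({h(x) for x in items})[:k]

def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate |A n B| / |A u B| from two bottom-k sketches."""
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_bottom:
        return 0.0
    both = set(sketch_a) & set(sketch_b)
    return sum(1 for v in union_bottom if v in both) / len(union_bottom)

# Two sets with true Jaccard similarity 500 / 1500 = 1/3.
K = 256
sketch_a = bottom_k(range(0, 1000), K)
sketch_b = bottom_k(range(500, 1500), K)
estimate = jaccard_estimate(sketch_a, sketch_b, K)
```

Each sketch is built once per set, so comparing many candidate sets costs time proportional to k rather than to the set sizes.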

[LG-7] On the Effect of Regularization in Policy Mirror Descent

Link: https://arxiv.org/abs/2507.08718
Authors: Jan Felix Kleuker, Aske Plaat, Thomas Moerland
Categories: Machine Learning (cs.LG)
Comments: Accepted at RLC

Abstract:Policy Mirror Descent (PMD) has emerged as a unifying framework in reinforcement learning (RL) by linking policy gradient methods with a first-order optimization method known as mirror descent. At its core, PMD incorporates two key regularization components: (i) a distance term that enforces a trust region for stable policy updates and (ii) an MDP regularizer that augments the reward function to promote structure and robustness. While PMD has been extensively studied in theory, empirical investigations remain scarce. This work provides a large-scale empirical analysis of the interplay between these two regularization techniques, running over 500k training seeds on small RL environments. Our results demonstrate that, although the two regularizers can partially substitute each other, their precise combination is critical for achieving robust performance. These findings highlight the potential for advancing research on more robust algorithms in RL, particularly with respect to hyperparameter sensitivity.
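
For a single state, the two regularizers described above have a closed-form update when the distance term is a KL divergence and the MDP regularizer is entropy. The sketch below implements that common instantiation; it is an assumption that this matches the variants studied in the paper, and `eta` (step size) and `tau` (entropy weight) are illustrative names.

```python
import math

def pmd_step(policy, q_values, eta, tau=0.0):
    """One KL-regularized policy update with optional entropy regularization.

    Solves max_p <q, p> + tau * H(p) - (1/eta) * KL(p || policy), which gives
    p(a) proportional to policy(a)^(1/(1+eta*tau)) * exp(eta*q(a)/(1+eta*tau)).
    Assumes every entry of `policy` is strictly positive.
    """
    scale = 1.0 + eta * tau
    logits = [(math.log(p) + eta * q) / scale for p, q in zip(policy, q_values)]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

q = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]
one_step = pmd_step(uniform, q, eta=1.0)

# With tau > 0 and fixed q, repeated updates converge to softmax(q / tau).
policy = uniform
for _ in range(100):
    policy = pmd_step(policy, q, eta=1.0, tau=1.0)
```

With `tau = 0` the update reduces to the familiar multiplicative-weights form, which shows how the two terms trade off: a larger `eta` loosens the trust region, while a larger `tau` pulls the fixed point toward the uniform policy.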

[LG-8] SPLASH! Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations

Link: https://arxiv.org/abs/2507.08707
Authors: Peter Crowley, Zachary Serlin, Tyler Paine, Makai Mann, Michael Benjamin, Calin Belta
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches make the assumption that expert demonstrations are available, which is often not the case. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, thus highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which advances the state-of-the-art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation, and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state-of-the-art in reward learning from suboptimal demonstrations.

[LG-9] Domain-Informed Operation Excellence of Gas Turbine System with Machine Learning

Link: https://arxiv.org/abs/2507.08697
Authors: Waqar Muhammad Ashraf, Amir H. Keshavarzzadeh, Abdulelah S. Alshehri, Abdulrahman bin Jumah, Ramit Debnath, Vivek Dua
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The domain-consistent adoption of artificial intelligence (AI) remains low in thermal power plants due to the black-box nature of AI algorithms and low representation of domain knowledge in conventional data-centric analytics. In this paper, we develop a MAhalanobis Distance-based OPTimization (MAD-OPT) framework that incorporates the Mahalanobis distance-based constraint to introduce domain knowledge into data-centric analytics. The developed MAD-OPT framework is applied to maximize thermal efficiency and minimize turbine heat rate for a 395 MW capacity gas turbine system. We demonstrate that the MAD-OPT framework can estimate domain-informed optimal process conditions under different ambient conditions, and the optimal solutions are found to be robust as evaluated by Monte Carlo simulations. We also apply the MAD-OPT framework to estimate optimal process conditions beyond the design power generation limit of the gas turbine system, and have found comparable results with the actual data of the power plant. We demonstrate that implementing data-centric optimization analytics without incorporating domain-informed constraints may provide ineffective solutions that may not be implementable in the real operation of the gas turbine system. This research advances the integration of the data-driven domain knowledge into machine learning-powered analytics that enhances the domain-informed operation excellence and paves the way for safe AI adoption in thermal power systems.

[LG-10] Forget Me Not: Fighting Local Overfitting with Knowledge Fusion and Distillation

Link: https://arxiv.org/abs/2507.08686
Authors: Uri Stern, Eli Corn, Daphna Weinshall
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2412.12968

Abstract:Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting – yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method – Knowledge Fusion followed by Knowledge Distillation – outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.

[LG-11] AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs

Link: https://arxiv.org/abs/2507.08616
Authors: Florian Grötschla, Luis Müller, Jan Tönshoff, Mikhail Galkin, Bryan Perozzi
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Large-language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several questions on the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. By drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on AgentsNet including homogeneous networks of agents which first have to agree on basic protocols for organization and communication. We find that some frontier LLMs are already demonstrating strong performance for small networks but begin to fall off once the size of the network scales. While existing multi-agent benchmarks cover at most 2-5 agents, AgentsNet is practically unlimited in size and can scale with new generations of LLMs. As such, we also probe frontier models in a setup with up to 100 agents.

[LG-12] Remote Sensing Reveals Adoption of Sustainable Rice Farming Practices Across Punjab, India

Link: https://arxiv.org/abs/2507.08605
Authors: Ando Shah, Rajveer Singh, Akram Zaytar, Girmaw Abebe Tadesse, Caleb Robinson, Negar Tafti, Stephen A. Wood, Rahul Dodhia, Juan M. Lavista Ferres
Categories: Machine Learning (cs.LG)
Comments: Dataset and code will be published shortly and links updated in v2

Abstract:Rice cultivation consumes 24-30% of global freshwater, creating critical water management challenges in major rice-producing regions. Sustainable irrigation practices like direct seeded rice (DSR) and alternate wetting and drying (AWD) can reduce water use by 20-40% while maintaining yields, helping secure long-term agricultural productivity as water scarcity intensifies - a key component of the Zero Hunger Sustainable Development Goal. However, limited data on adoption rates of these practices prevents evidence-based policymaking and targeted resource allocation. We developed a novel remote sensing framework to monitor sustainable water management practices at scale in Punjab, India - a region facing severe groundwater depletion of 41.6 cm/year. To collect essential ground truth data, we partnered with the Nature Conservancy’s Promoting Regenerative and No-burn Agriculture (PRANA) program, which trained approximately 1,400 farmers on water-saving techniques while documenting their field-level practices. Using this data, we created a classification system with Sentinel-1 satellite imagery that separates water management along sowing and irrigation dimensions. Our approach achieved a 78% F1-score in distinguishing DSR from traditional puddled transplanted rice without requiring prior knowledge of planting dates. We demonstrated scalability by mapping DSR adoption across approximately 3 million agricultural plots in Punjab, with district-level predictions showing strong correlation (Pearson=0.77, RBO= 0.77) with government records. This study provides policymakers with a powerful tool to track sustainable water management adoption, target interventions, and measure program impacts at scale.

[LG-13] ADAPT: A Pseudo-labeling Approach to Combat Concept Drift in Malware Detection

Link: https://arxiv.org/abs/2507.08597
Authors: Md Tanvirul Alam, Aritran Piplai, Nidhi Rastogi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Machine learning models are commonly used for malware classification; however, they suffer from performance degradation over time due to concept drift. Adapting these models to changing data distributions requires frequent updates, which rely on costly ground truth annotations. While active learning can reduce the annotation burden, leveraging unlabeled data through semi-supervised learning remains a relatively underexplored approach in the context of malware detection. In this research, we introduce ADAPT, a novel pseudo-labeling semi-supervised algorithm for addressing concept drift. Our model-agnostic method can be applied to various machine learning models, including neural networks and tree-based algorithms. We conduct extensive experiments on five diverse malware detection datasets spanning Android, Windows, and PDF domains. The results demonstrate that our method consistently outperforms baseline models and competitive benchmarks. This work paves the way for more effective adaptation of machine learning models to concept drift in malware detection.
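
The core step of any pseudo-labeling scheme is choosing which unlabeled samples to trust. The sketch below shows the simplest version, a fixed confidence threshold on predicted class probabilities. ADAPT's actual selection rule is not given in the abstract, so the threshold and the function name are assumptions.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (index, label) pairs for samples whose top class probability
    meets the threshold; the rest stay unlabeled."""
    selected = []
    for i, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((i, label))
    return selected

# Predicted class probabilities for three unlabeled samples (benign, malicious).
predicted = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
pseudo_labeled = select_pseudo_labels(predicted)
```

Because the selection only needs class probabilities, the same step works with any classifier that exposes them, which is what makes such a method model-agnostic.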

[LG-14] AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling NEURIPS2025

Link: https://arxiv.org/abs/2507.08567
Authors: Preslav Aleksandrov, Meghdad Kurmanji, Fernando Garcia Redondo, David O’Shea, William Shen, Alex Iacob, Lorenzo Sani, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Subjects: Machine Learning (cs.LG)
Comments: 14 pages and 6 figures. Submitted to NeurIPS 2025

Abstract:We introduce the Autoregressive Block-Based Iterative Encoder (AbbIE), a novel recursive generalization of the encoder-only Transformer architecture, which achieves better perplexity than a standard Transformer and allows for the dynamic scaling of compute resources at test time. This simple, recursive approach is a complement to scaling large language model (LLM) performance through parameter and token counts. AbbIE performs its iterations in latent space, but unlike latent reasoning models, does not require a specialized dataset or training protocol. We show that AbbIE generalizes upward (the ability to generalize to arbitrary iteration lengths) at test time while using only 2 iterations during training, far outperforming alternative iterative methods. AbbIE’s ability to scale its computational expenditure based on the complexity of the task gives it up to a 12% improvement in zero-shot in-context learning tasks versus other iterative and standard methods and up to a 5% improvement in language perplexity. The results from this study open a new avenue to Transformer performance scaling. We perform all of our evaluations on model sizes up to 350M parameters.

[LG-15] STRAP: Spatial-Temporal Risk-Attentive Vehicle Trajectory Prediction for Autonomous Driving ITSC2025

Link: https://arxiv.org/abs/2507.08563
Authors: Xinyi Ning, Zilin Bian, Kaan Ozbay, Semiha Ergan
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, accepted at ITSC 2025

Abstract:Accurate vehicle trajectory prediction is essential for ensuring safety and efficiency in fully autonomous driving systems. While existing methods primarily focus on modeling observed motion patterns and interactions with other vehicles, they often neglect the potential risks posed by the uncertain or aggressive behaviors of surrounding vehicles. In this paper, we propose a novel spatial-temporal risk-attentive trajectory prediction framework that incorporates a risk potential field to assess perceived risks arising from behaviors of nearby vehicles. The framework leverages a spatial-temporal encoder and a risk-attentive feature fusion decoder to embed the risk potential field into the extracted spatial-temporal feature representations for trajectory prediction. A risk-scaled loss function is further designed to improve the prediction accuracy of high-risk scenarios, such as short relative spacing. Experiments on the widely used NGSIM and HighD datasets demonstrate that our method reduces average prediction errors by 4.8% and 31.2% respectively compared to state-of-the-art approaches, especially in high-risk scenarios. The proposed framework provides interpretable, risk-aware predictions, contributing to more robust decision-making for autonomous driving systems.

[LG-16] CircFormerMoE: An End-to-End Deep Learning Framework for Circular RNA Splice Site Detection and Pairing in Plant Genomes

Link: https://arxiv.org/abs/2507.08542
Authors: Tianyou Jiang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Circular RNAs (circRNAs) are important components of the non-coding RNA regulatory network. Previous circRNA identification primarily relies on high-throughput RNA sequencing (RNA-seq) data combined with alignment-based algorithms that detect back-splicing signals. However, these methods face several limitations: they cannot predict circRNAs directly from genomic DNA sequences and rely heavily on RNA experimental data; they involve high computational costs due to complex alignment and filtering steps; and they are inefficient for large-scale or genome-wide circRNA prediction. The challenge is even greater in plants, where circRNA splice sites often lack the canonical GT-AG motif seen in human mRNA splicing, and no efficient deep learning model with strong generalization capability currently exists. Furthermore, the number of currently identified plant circRNAs is likely far lower than their true abundance. In this paper, we propose a deep learning framework named CircFormerMoE, based on transformers and mixture-of-experts, for predicting circRNAs directly from plant genomic DNA. Our framework consists of two subtasks known as splicing site detection (SSD) and splicing site pairing (SSP). The model’s effectiveness has been validated on gene data of 10 plant species. Trained on known circRNA instances, it is also capable of discovering previously unannotated circRNAs. In addition, we performed interpretability analyses on the trained model to investigate the sequence patterns contributing to its predictions. Our framework provides a fast and accurate computational method and tool for large-scale circRNA discovery in plants, laying a foundation for future research in plant functional genomics and non-coding RNA annotation.

[LG-17] Recursive Reward Aggregation

Link: https://arxiv.org/abs/2507.08537
Authors: Yuting Tang, Yivan Zhang, Johannes Ackermann, Yu-Jie Zhang, Soichiro Nishimori, Masashi Sugiyama
Subjects: Machine Learning (cs.LG); Category Theory (math.CT)
Comments: Reinforcement Learning Conference 2025

Abstract:In reinforcement learning (RL), aligning agent behavior with specific objectives typically requires careful design of the reward function, which can be challenging when the desired objectives are complex. In this work, we propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function by selecting appropriate reward aggregation functions. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the recursive generation and aggregation of rewards, allowing for the generalization of the standard discounted sum to other recursive aggregations, such as discounted max and Sharpe ratio. Our approach applies to both deterministic and stochastic settings and integrates seamlessly with value-based and actor-critic algorithms. Experimental results demonstrate that our approach effectively optimizes diverse objectives, highlighting its versatility and potential for real-world applications.
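
To make the idea of swapping the aggregation function concrete, here is a minimal sketch that computes a return by folding rewards backward with a pluggable update rule. The recursions used for the discounted sum and the discounted max are one plausible reading of the abstract, and the helper names (`aggregate`, `discounted_sum`, `discounted_max`) are hypothetical rather than taken from the paper.

```python
from functools import reduce


def aggregate(rewards, step, init):
    """Fold rewards from the last step backward: G_t = step(r_t, G_{t+1})."""
    return reduce(lambda future, reward: step(reward, future), reversed(rewards), init)


def discounted_sum(rewards, gamma):
    """Standard return: G_t = r_t + gamma * G_{t+1}, starting from 0."""
    return aggregate(rewards, lambda r, g: r + gamma * g, 0.0)


def discounted_max(rewards, gamma):
    """Assumed form: G_t = max(r_t, gamma * G_{t+1}), starting from -inf."""
    return aggregate(rewards, lambda r, g: max(r, gamma * g), float("-inf"))


rewards = [1.0, 0.0, 8.0]
total = discounted_sum(rewards, gamma=0.5)  # 1 + 0.5 * 0 + 0.25 * 8
peak = discounted_max(rewards, gamma=0.5)   # max(1, 0.5 * 0, 0.25 * 8)
```

Only the `step` function and its initial value change between the two objectives; the reward sequence itself is untouched, which is the property the abstract emphasizes.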

[LG-18] SFedKD: Sequential Federated Learning with Discrepancy-Aware Multi-Teacher Knowledge Distillation

Link: https://arxiv.org/abs/2507.08508
Authors: Haotian Xu, Jinrui Zhou, Xichong Zhang, Mingjun Xiao, He Sun, Yin Xu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning (FL) is a distributed machine learning paradigm which coordinates multiple clients to collaboratively train a global model via a central server. Sequential Federated Learning (SFL) is a newly-emerging FL training framework where the global model is trained in a sequential manner across clients. Since SFL can provide strong convergence guarantees under data heterogeneity, it has attracted significant research attention in recent years. However, experiments show that SFL suffers from severe catastrophic forgetting in heterogeneous environments, meaning that the model tends to forget knowledge learned from previous clients. To address this issue, we propose an SFL framework with discrepancy-aware multi-teacher knowledge distillation, called SFedKD, which selects multiple models from the previous round to guide the current round of training. In SFedKD, we extend the single-teacher Decoupled Knowledge Distillation approach to our multi-teacher setting and assign distinct weights to teachers’ target-class and non-target-class knowledge based on the class distributional discrepancy between teacher and student data. Through this fine-grained weighting strategy, SFedKD can enhance model training efficacy while mitigating catastrophic forgetting. Additionally, to prevent knowledge dilution, we eliminate redundant teachers for the knowledge distillation and formalize it as a variant of the maximum coverage problem. Based on the greedy strategy, we design a complementary-based teacher selection mechanism to ensure that the selected teachers achieve comprehensive knowledge space coverage while reducing communication and computational costs. Extensive experiments show that SFedKD effectively overcomes catastrophic forgetting in SFL and outperforms state-of-the-art FL methods.

[LG-19] Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R

Link: https://arxiv.org/abs/2507.08505
Authors: Pablo Robin Guerrero, Yueyang Pan, Sanidhya Kashyap
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language Models (VLMs) offer promising capabilities for mobile devices, but their deployment faces significant challenges due to computational limitations and energy inefficiency, especially for real-time applications. This study provides a comprehensive survey of deployment frameworks for VLMs on mobile devices, evaluating this http URL, MLC-Imp, and mllm in the context of running LLaVA-1.5 7B, MobileVLM-3B, and Imp-v1.5 3B as representative workloads on a OnePlus 13R. Each deployment framework was evaluated on the OnePlus 13R while running VLMs, with measurements covering CPU, GPU, and NPU utilization, temperature, inference time, power consumption, and user experience. Benchmarking revealed critical performance bottlenecks across frameworks: CPU resources were consistently over-utilized during token generation, while GPU and NPU accelerators were largely unused. When the GPU was used, primarily for image feature extraction, it was saturated, leading to degraded device responsiveness. The study contributes framework-level benchmarks, practical profiling tools, and an in-depth analysis of hardware utilization bottlenecks, highlighting the consistent overuse of CPUs and the ineffective or unstable use of GPUs and NPUs in current deployment frameworks.

[LG-20] SynBridge: Bridging Reaction States via Discrete Flow for Bidirectional Reaction Prediction

Link: https://arxiv.org/abs/2507.08475
Authors: Haitao Lin, Junjie Wang, Zhifeng Gao, Xiaohong Ji, Rong Zhu, Linfeng Zhang, Guolin Ke, Weinan E
Subjects: Machine Learning (cs.LG)
Comments: 22 pages, 2 figures

Abstract:The essence of a chemical reaction lies in the redistribution and reorganization of electrons, which is often manifested through electron transfer or the migration of electron pairs. These changes are inherently discrete and abrupt in the physical world, such as alterations in the charge states of atoms or the formation and breaking of chemical bonds. To model the transition of states, we propose SynBridge, a bidirectional flow-based generative model to achieve multi-task reaction prediction. By leveraging a graph-to-graph transformer network architecture and discrete flow bridges between any two discrete distributions, SynBridge captures bidirectional chemical transformations between graphs of reactants and products through the bonds’ and atoms’ discrete states. We further demonstrate the effectiveness of our method through extensive experiments on three benchmark datasets (USPTO-50K, USPTO-MIT, Pistachio), achieving state-of-the-art performance in both forward and retrosynthesis tasks. Our ablation studies and noise scheduling analysis reveal the benefits of structured diffusion over discrete spaces for reaction prediction.

[LG-21] Evaluating SAE interpretability without explanations

Link: https://arxiv.org/abs/2507.08473
Authors: Gonçalo Paulo, Nora Belrose
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Sparse autoencoders (SAEs) and transcoders have become important tools for machine learning interpretability. However, measuring how interpretable they are remains challenging, with weak consensus about which benchmarks to use. Most evaluation procedures start by producing a single-sentence explanation for each latent. These explanations are then evaluated based on how well they enable an LLM to predict the activation of a latent in new contexts. This method makes it difficult to disentangle the explanation generation and evaluation process from the actual interpretability of the latents discovered. In this work, we adapt existing methods to assess the interpretability of sparse coders, with the advantage that they do not require generating natural language explanations as an intermediate step. This enables a more direct and potentially standardized assessment of interpretability. Furthermore, we compare the scores produced by our interpretability metrics with human evaluations across similar tasks and varying setups, offering suggestions for the community on improving the evaluation of these techniques.

[LG-22] Ranked Set Sampling-Based Multilayer Perceptron: Improving Generalization via Variance-Based Bounds

Link: https://arxiv.org/abs/2507.08465
Authors: Feijiang Li, Liuya Zhang, Jieting Wang, Tao Yan, Yuhua Qian
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The multilayer perceptron (MLP), one of the most fundamental neural networks, is extensively utilized for classification and regression tasks. In this paper, we establish a new generalization error bound, which reveals how the variance of empirical loss influences the generalization ability of the learning model. Inspired by this learning bound, we advocate reducing the variance of empirical loss to enhance the ability of MLP. As is well known, bagging is a popular ensemble method to realize variance reduction. However, bagging produces the base training data sets by the Simple Random Sampling (SRS) method, which exhibits a high degree of randomness. To handle this issue, we introduce an ordered structure in the training data set by Ranked Set Sampling (RSS) to further reduce the variance of loss and develop an RSS-MLP method. Theoretical results show that the variances of the empirical exponential loss and the logistic loss estimated by RSS are smaller than those estimated by SRS, respectively. To validate the performance of RSS-MLP, we conduct comparison experiments on twelve benchmark data sets in terms of the two convex loss functions under two fusion methods. Extensive experimental results and analysis illustrate the effectiveness and rationality of the proposed method.
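
For readers unfamiliar with ranked set sampling, the sketch below shows the textbook procedure and compares the variance of the sample mean under RSS and SRS for the same number of measured units. It ranks on the values themselves (perfect ranking) and draws from a synthetic Gaussian population; how the paper ranks training examples is not stated in the abstract, so the ranking variable, the set size, and the function names are all assumptions.

```python
import random


def ranked_set_sample(population, set_size, cycles, rng):
    """Draw a ranked set sample of set_size * cycles values.

    In each cycle, for rank i = 0..set_size-1, draw set_size units at random,
    sort them, and keep only the i-th smallest.
    """
    sample = []
    for _ in range(cycles):
        for rank in range(set_size):
            candidates = sorted(rng.sample(population, set_size))
            sample.append(candidates[rank])
    return sample


def variance_of_sample_mean(draw, repeats):
    """Empirical variance of the sample mean across repeated draws."""
    means = []
    for _ in range(repeats):
        values = draw()
        means.append(sum(values) / len(values))
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / (len(means) - 1)


rng = random.Random(0)
population = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Both schemes measure the same number of units: 3 * 4 = 12.
var_rss = variance_of_sample_mean(lambda: ranked_set_sample(population, 3, 4, rng), 2000)
var_srs = variance_of_sample_mean(lambda: rng.sample(population, 12), 2000)
```

Because each rank position is represented equally often, the RSS sample covers the distribution more evenly than a simple random sample of the same size, which is why its sample mean varies less.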

[LG-23] KGRAG-Ex: Explainable Retrieval-Augmented Generation with Knowledge Graph-based Perturbations

Link: https://arxiv.org/abs/2507.08443
Authors: Georgios Balanos, Evangelos Chasanis, Konstantinos Skianis, Evaggelia Pitoura
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) enhances language models by grounding responses in external information, yet explainability remains a critical challenge, particularly when retrieval relies on unstructured text. Knowledge graphs (KGs) offer a solution by introducing structured, semantically rich representations of entities and their relationships, enabling transparent retrieval paths and interpretable reasoning. In this work, we present KGRAG-Ex, a RAG system that improves both factual grounding and explainability by leveraging a domain-specific KG constructed via prompt-based information extraction. Given a user query, KGRAG-Ex identifies relevant entities and semantic paths in the graph, which are then transformed into pseudo-paragraphs: natural language representations of graph substructures that guide corpus retrieval. To improve interpretability and support reasoning transparency, we incorporate perturbation-based explanation methods that assess the influence of specific KG-derived components on the generated answers. We conduct a series of experiments to analyze the sensitivity of the system to different perturbation methods, the relationship between graph component importance and their structural positions, the influence of semantic node types, and how graph metrics correspond to the influence of components within the explanation process.

[LG-24] RTNinja: a generalized machine learning framework for analyzing random telegraph noise signals in nanoelectronic devices

Link: https://arxiv.org/abs/2507.08424
Authors: Anirudh Varanasi, Robin Degraeve, Philippe Roussel, Clement Merckling
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Random telegraph noise is a prevalent variability phenomenon in nanoelectronic devices, arising from stochastic carrier exchange at defect sites and critically impacting device reliability and performance. Conventional analysis techniques often rely on restrictive assumptions or manual interventions, limiting their applicability to complex, noisy datasets. Here, we introduce RTNinja, a generalized, fully automated machine learning framework for the unsupervised analysis of random telegraph noise signals. RTNinja deconvolves complex signals to identify the number and characteristics of hidden individual sources, without requiring prior knowledge of the system. The framework comprises two modular components: LevelsExtractor, which uses Bayesian inference and model selection to denoise and discretize the signal; and SourcesMapper, which infers source configurations through probabilistic clustering and optimization. To evaluate performance, we developed a Monte Carlo simulator that generates labeled datasets spanning broad signal-to-noise ratios and source complexities; across 7000 such datasets, RTNinja consistently demonstrated high-fidelity signal reconstruction and accurate extraction of source amplitudes and activity patterns. Our results demonstrate that RTNinja offers a robust, scalable, and device-agnostic tool for random telegraph noise characterization, enabling large-scale statistical benchmarking, reliability-centric technology qualification, predictive failure modeling, and device physics exploration in next-generation nanoelectronics.

[LG-25] Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling

Link: https://arxiv.org/abs/2507.08390
Authors: Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Discrete diffusion models have emerged as a powerful paradigm for language modeling, rivaling auto-regressive models by training-time scaling. However, inference-time scaling in discrete diffusion models remains relatively under-explored. In this work, we study sampling-based approaches for achieving high-quality text generation from discrete diffusion models in reward-guided settings. We introduce a novel inference-time scaling approach based on particle Gibbs sampling for discrete diffusion models. The particle Gibbs sampling algorithm iteratively refines full diffusion trajectories using conditional Sequential Monte Carlo as its transition mechanism. This process ensures that the updated samples progressively improve and move closer to the reward-weighted target distribution. Unlike existing inference-time scaling methods, which are often limited to single diffusion trajectories, our approach leverages iterative refinement across multiple trajectories. Within this framework, we further analyze the trade-offs between four key axes for inference-time scaling under fixed compute budgets: particle Gibbs iterations, particle count, denoising steps, and reward estimation cost. Empirically, our method consistently outperforms prior inference-time strategies on reward-guided text generation tasks, achieving significant improvement in accuracy under varying compute budgets.

[LG-26] Online Pre-Training for Offline-to-Online Reinforcement Learning ICML2025

Link: https://arxiv.org/abs/2507.08387
Authors: Yongjae Shin, Jeonghye Kim, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngsoo Jang, Geonhyeong Kim, Jongseong Chae, Youngchul Sung, Kanghoon Lee, Woohyung Lim
Subjects: Machine Learning (cs.LG)
Comments: ICML 2025 camera-ready

Abstract:Offline-to-online reinforcement learning (RL) aims to integrate the complementary strengths of offline and online RL by pre-training an agent offline and subsequently fine-tuning it through online interactions. However, recent studies reveal that offline pre-trained agents often underperform during online fine-tuning due to inaccurate value estimation caused by distribution shift, with random initialization proving more effective in certain cases. In this work, we propose a novel method, Online Pre-Training for Offline-to-Online RL (OPT), explicitly designed to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which allows the training of a new value function tailored specifically for effective online fine-tuning. Implementation of OPT on TD3 and SPOT demonstrates an average 30% improvement in performance across a wide range of D4RL environments, including MuJoCo, Antmaze, and Adroit.

[LG-27] Two-cluster test

Link: https://arxiv.org/abs/2507.08382
Authors: Xinying Liu, Lianyu Hu, Mudi Jiang, Simen Zhang, Jun Lou, Zengyou He
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Cluster analysis is a fundamental research issue in statistics and machine learning. In many modern clustering methods, we need to determine whether two subsets of samples come from the same cluster. Since these subsets are usually generated by certain clustering procedures, deploying classic two-sample tests in this context would yield extremely small p-values, leading to an inflated Type-I error rate. To overcome this bias, we formally introduce the two-cluster test issue and argue that it is a totally different significance testing issue from the conventional two-sample test. Meanwhile, we present a new method based on the boundary points between two subsets to derive an analytical p-value for the purpose of significance quantification. Experiments on both synthetic and real data sets show that the proposed test is able to significantly reduce the Type-I error rate, in comparison with several classic two-sample testing methods. More importantly, the practical usage of such a two-cluster test is further verified through its applications in tree-based interpretable clustering and significance-based hierarchical clustering.
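
The bias described above can be reproduced with a small simulation. The sketch below draws points from a single Gaussian, splits them at the median as a stand-in for a clustering step, and then applies an ordinary two-sample permutation test to the two halves. It illustrates the problem only; it is not the boundary-point test proposed in the paper, and the sample size, seed, and function name are arbitrary assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_perm, rng):
    """Two-sample permutation test on the absolute difference in means."""
    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0


rng = random.Random(1)
# A single Gaussian cluster: there is no real group structure.
points = sorted(rng.gauss(0.0, 1.0) for _ in range(40))

# "Clustering" step: split the same data at the median.
low, high = points[:20], points[20:]
p_after_clustering = permutation_p_value(low, high, 2000, rng)

# Control: split the same points at random, ignoring their values.
shuffled = points[:]
rng.shuffle(shuffled)
p_random_split = permutation_p_value(shuffled[:20], shuffled[20:], 2000, rng)
```

The median split maximizes the difference in means over all equal-size partitions, so the test reports a p-value near its minimum even though every point came from one distribution. This is the inflated Type-I error that motivates a dedicated two-cluster test.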

[LG-28] Advances in Machine Learning: Where Can Quantum Techniques Help?

Link: https://arxiv.org/abs/2507.08379
Authors: Samarth Kashyap, Rohit K Ramakrishnan, Kumari Jyoti, Apoorva D Patel
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 28 pages, 1 figure

Abstract:Quantum Machine Learning (QML) represents a promising frontier at the intersection of quantum computing and artificial intelligence, aiming to leverage quantum computational advantages to enhance data-driven tasks. This review explores the potential of QML to address the computational bottlenecks of classical machine learning, particularly in processing complex datasets. We introduce the theoretical foundations of QML, including quantum data encoding, quantum learning theory and optimization techniques, while categorizing QML approaches based on data type and computational architecture. It is well-established that quantum computational advantages are problem-dependent, and so potentially useful directions for QML need to be systematically identified. Key developments, such as Quantum Principal Component Analysis, quantum-enhanced sensing and applications in material science, are critically evaluated for their theoretical speed-ups and practical limitations. The challenges posed by Noisy Intermediate-Scale Quantum (NISQ) devices, including hardware noise, scalability constraints and data encoding overheads, are discussed in detail. We also outline future directions, emphasizing the need for quantum-native algorithms, improved error correction, and realistic benchmarks to bridge the gap between theoretical promise and practical deployment. This comprehensive analysis underscores that while QML has significant potential for specific applications such as quantum chemistry and sensing, its broader utility in real-world scenarios remains contingent on overcoming technological and methodological hurdles.

[LG-29] Prediction of Lane Change Intentions of Human Drivers using an LSTM, a CNN and a Transformer

Link: https://arxiv.org/abs/2507.08365
Authors: Francesco De Cristofaro, Felix Hofbaur, Aixi Yang, Arno Eichberger
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 18 figures

Abstract:Lane changes of preceding vehicles have a great impact on the motion planning of automated vehicles, especially in complex traffic situations. Predicting them would benefit the public in terms of safety and efficiency. While many research efforts have been made in this direction, few have concentrated on predicting maneuvers within a set time interval as opposed to predicting at a set prediction time. In addition, there is a lack of comparisons between different architectures that try to determine the best-performing one and to assess how to correctly choose the input for such models. In this paper, the structures of an LSTM, a CNN and a Transformer network are described and implemented to predict the intention of human drivers to perform a lane change. We show how the data was prepared starting from a publicly available dataset (highD), which features were used, and how the networks were designed, and finally we compare the results of the three networks with different configurations of input data. We found that transformer networks performed better than the other networks and were less affected by overfitting. The accuracy of the method ranged from 82.79% to 96.73% for different input configurations, and it showed good overall performance when precision and recall are also considered.

[LG-30] Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text

Link: https://arxiv.org/abs/2507.08362
Authors: Phuong Nam Lê, Charlotte Schneider-Depré, Alexandre Goossens, Alexander Stevens, Aurélie Leribaux, Johannes De Smedt
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Efficient planning, resource management, and consistent operations often rely on converting textual process documents into formal Business Process Model and Notation (BPMN) models. However, this conversion process remains time-intensive and costly. Existing approaches, whether rule-based or machine-learning-based, still struggle with writing styles and often fail to identify parallel structures in process descriptions. This paper introduces an automated pipeline for extracting BPMN models from text, leveraging the use of machine learning and large language models. A key contribution of this work is the introduction of a newly annotated dataset, which significantly enhances the training process. Specifically, we augment the PET dataset with 15 newly annotated documents containing 32 parallel gateways for model training, a critical feature often overlooked in existing datasets. This addition enables models to better capture parallel structures, a common but complex aspect of process descriptions. The proposed approach demonstrates adequate performance in terms of reconstruction accuracy, offering a promising foundation for organizations to accelerate BPMN model creation.

[LG-31] scE2TM: Toward Interpretable Single-Cell Embedding via Topic Modeling

Link: https://arxiv.org/abs/2507.08355
Authors: Hegang Chen, Yuyin Lu, Zhiming Dai, Fu Lee Wang, Qing Li, Yanghui Rao
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in sequencing technologies have enabled researchers to explore cellular heterogeneity at single-cell resolution. Meanwhile, interpretability has gained prominence in parallel with the rapid increase in the complexity and performance of deep learning models. In recent years, topic models have been widely used for interpretable single-cell embedding learning and clustering analysis, which we refer to as single-cell embedded topic models. However, previous studies evaluated the interpretability of the models mainly through qualitative analysis, and these single-cell embedded topic models suffer from the potential problem of interpretation collapse. Furthermore, their neglect of external biological knowledge constrains analytical performance. Here, we present scE2TM, an external knowledge-guided single-cell embedded topic model that provides a high-quality cell embedding and strong interpretation, contributing to comprehensive scRNA-seq data analysis. Our comprehensive evaluation across 20 scRNA-seq datasets demonstrates that scE2TM achieves significant clustering performance gains compared to 7 state-of-the-art methods. In addition, we propose a new interpretability evaluation benchmark that introduces 10 metrics to quantitatively assess the interpretability of single-cell embedded topic models. The results show that the interpretation provided by scE2TM performs encouragingly in terms of diversity and consistency with the underlying biological signals, helping to better reveal the underlying biological mechanisms.

[LG-32] Towards Efficient Quantity Retrieval from Text: an Approach via Description Parsing and Weak Supervision

Link: https://arxiv.org/abs/2507.08322
Authors: Yixuan Cao, Zhengrong Chen, Chengxuan Xia, Kun Wu, Ping Luo
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Extended version of the paper accepted in DEXA 2025

Abstract:Quantitative facts are continually generated by companies and governments, supporting data-driven decision-making. While common facts are structured, many long-tail quantitative facts remain buried in unstructured documents, making them difficult to access. We propose the task of Quantity Retrieval: given a description of a quantitative fact, the system returns the relevant value and supporting evidence. Understanding quantity semantics in context is essential for this task. We introduce a framework based on description parsing that converts text into structured (description, quantity) pairs for effective retrieval. To improve learning, we construct a large paraphrase dataset using weak supervision based on quantity co-occurrence. We evaluate our approach on a large corpus of financial annual reports and a newly annotated quantity description dataset. Our method significantly improves top-1 retrieval accuracy from 30.98 percent to 64.66 percent.

[LG-33] A Comprehensively Adaptive Architectural Optimization-Ingrained Quantum Neural Network Model for Cloud Workloads Prediction

Link: https://arxiv.org/abs/2507.08317
Authors: Jitendra Kumar, Deepika Saxena, Kishu Gupta, Satyam Kumar, Ashutosh Kumar Singh
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Accurate workload prediction and advanced resource reservation are crucial for managing dynamic cloud services. Traditional neural networks and deep learning models frequently encounter challenges with diverse, high-dimensional workloads, especially during sudden resource demand changes, leading to inefficiencies. This issue arises from their limited optimization during training, relying only on parametric (inter-connection weights) adjustments using conventional algorithms. To address this issue, this work proposes a novel Comprehensively Adaptive Architectural Optimization-based Variable Quantum Neural Network (CA-QNN), which combines the efficiency of quantum computing with complete structural and qubit vector parametric learning. The model converts workload data into qubits, processed through qubit neurons with Controlled NOT-gated activation functions for intuitive pattern recognition. In addition, a comprehensive architecture optimization algorithm for networks is introduced to facilitate the learning and propagation of the structure and parametric values in variable-sized QNNs. This algorithm incorporates quantum adaptive modulation and size-adaptive recombination during the training process. The performance of the CA-QNN model is thoroughly investigated against seven state-of-the-art methods across four benchmark datasets of heterogeneous cloud workloads. The proposed model demonstrates superior prediction accuracy, reducing prediction errors by up to 93.40% and 91.27% compared to existing deep learning and QNN-based approaches, respectively.

[LG-34] CAS Condensed and Accelerated Silhouette: An Efficient Method for Determining the Optimal K in K-Means Clustering

Link: https://arxiv.org/abs/2507.08311
Authors: Krishnendu Das,Sumit Gupta,Awadhesh Kumar
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Clustering is a critical component of decision-making in today's data-driven environments. It has been widely used in a variety of fields such as bioinformatics, social network analysis, and image processing. However, clustering accuracy remains a major challenge in large datasets. This paper presents a comprehensive overview of strategies for selecting the optimal value of k in clustering, with a focus on achieving a balance between clustering precision and computational efficiency in complex data environments. In addition, this paper introduces improvements to clustering techniques for text and image data to provide insights into better computational performance and cluster validity. The proposed approach is based on the Condensed Silhouette method, along with statistical methods such as Local Structures, Gap Statistics, Class Consistency Ratio (CCR), and a Cluster Overlap Index (COI)-based algorithm to calculate the best value of k for K-Means clustering. The results of comparative experiments show that the proposed approach achieves up to 99 percent faster execution times on high-dimensional datasets while retaining both precision and scalability, making it highly suitable for real-time clustering needs or scenarios demanding efficient clustering with minimal resource utilization.
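
For orientation, the baseline that silhouette-based selection of k builds on can be sketched in a few lines. This is the plain O(n^2) silhouette score with a simple K-Means, not the condensed or accelerated variant from the paper; the helper names (`farthest_first`, `kmeans`, `silhouette`, `pick_k`) and the toy data are assumptions made for illustration.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster empties out
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient, computed from all pairwise distances."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def pick_k(points, candidates):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three small, well-separated blobs.
offsets = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
best_k, scores = pick_k(points, range(2, 6))
print(best_k, scores)
```

The quadratic cost of `silhouette` is the bottleneck that a condensed variant would presumably target.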

[LG-35] Data-Driven Dimensional Synthesis of Diverse Planar Four-bar Function Generation Mechanisms via Direct Parameterization

Link: https://arxiv.org/abs/2507.08269
Authors: Woon Ryong Kim,Jaeheun Jung,Jeong Un Ha,Donghun Lee,Jae Kyung Shim
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Dimensional synthesis of planar four-bar mechanisms is a challenging inverse problem in kinematics, requiring the determination of mechanism dimensions from desired motion specifications. We propose a data-driven framework that bypasses traditional equation-solving and optimization by leveraging supervised learning. Our method combines a synthetic dataset, an LSTM-based neural network for handling sequential precision points, and a Mixture of Experts (MoE) architecture tailored to different linkage types. Each expert model is trained on type-specific data and guided by a type-specifying layer, enabling both single-type and multi-type synthesis. A novel simulation metric evaluates prediction quality by comparing desired and generated motions. Experiments show our approach produces accurate, defect-free linkages across various configurations. This enables intuitive and efficient mechanism design, even for non-expert users, and opens new possibilities for scalable and flexible synthesis in kinematic design.

[LG-36] CoreSPECT: Enhancing Clustering Algorithms via an Interplay of Density and Geometry

Link: https://arxiv.org/abs/2507.08243
Authors: Chandra Sekhar Mukherjee,Joonyoung Bae,Jiapeng Zhang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Density and geometry have long served as two of the fundamental guiding principles in clustering algorithm design, with algorithms usually focusing either on the density structure of the data (e.g., HDBSCAN and Density Peak Clustering) or the complexity of underlying geometry (e.g., manifold clustering algorithms). In this paper, we identify and formalize a recurring but often overlooked interaction between distribution and geometry and leverage this insight to design our clustering enhancement framework CoreSPECT (Core Space Projection-based Enhancement of Clustering Techniques). Our framework boosts the performance of simple algorithms like K-Means and GMM by applying them to strategically selected regions, then extending the partial partition to a complete partition for the dataset using a novel neighborhood graph based multi-layer propagation procedure. We apply our framework on 15 datasets from three different domains and obtain consistent and substantial gain in clustering accuracy for both K-Means and GMM. On average, our framework improves the ARI of K-Means by 40% and of GMM by 14%, often surpassing the performance of both manifold-based and recent density-based clustering algorithms. We further support our framework with initial theoretical guarantees, ablation to demonstrate the usefulness of the individual steps, and evidence of robustness to noise.

[LG-37] Data Generation without Function Estimation

Link: https://arxiv.org/abs/2507.08239
Authors: Hadi Daneshmand,Ashkan Soleymani
Categories: Machine Learning (cs.LG); Mathematical Physics (math-ph); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Estimating the score function (or other population-density-dependent functions) is a fundamental component of most generative models. However, such function estimation is computationally and statistically challenging. Can we avoid function estimation for data generation? We propose an estimation-free generative method: a set of points whose locations are deterministically updated with (inverse) gradient descent can transport a uniform distribution to an arbitrary data distribution, in the mean field regime, without function estimation, neural network training, or even noise injection. The proposed method is built upon recent advances in the physics of interacting particles. We show, both theoretically and experimentally, that these advances can be leveraged to develop novel generative methods.

[LG-38] Self-Supervised Learning-Based Multimodal Prediction on Prosocial Behavior Intentions ICASSP2025

Link: https://arxiv.org/abs/2507.08238
Authors: Abinay Reddy Naini,Zhaobo K. Zheng,Teruhisa Misu,Kumar Akash
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, published at ICASSP 2025

Click to view abstract

Abstract:Human state detection and behavior prediction have seen significant advancements with the rise of machine learning and multimodal sensing technologies. However, predicting prosocial behavior intentions in mobility scenarios, such as helping others on the road, is an underexplored area. Current research faces a major limitation. There are no large, labeled datasets available for prosocial behavior, and small-scale datasets make it difficult to train deep-learning models effectively. To overcome this, we propose a self-supervised learning approach that harnesses multi-modal data from existing physiological and behavioral datasets. By pre-training our model on diverse tasks and fine-tuning it with a smaller, manually labeled prosocial behavior dataset, we significantly enhance its performance. This method addresses the data scarcity issue, providing a more effective benchmark for prosocial behavior prediction, and offering valuable insights for improving intelligent vehicle systems and human-machine interaction.

[LG-39] EvA: Evolutionary Attacks on Graphs

Link: https://arxiv.org/abs/2507.08212
Authors: Mohammad Sadegh Akhondzadeh,Soroush H. Zargarbashi,Jimin Cao,Aleksandar Bojchevski
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 12 figures

Click to view abstract

Abstract:Even a slight perturbation in the graph structure can cause a significant drop in the accuracy of graph neural networks (GNNs). Most existing attacks leverage gradient information to perturb edges. This relaxes the attack’s optimization problem from a discrete to a continuous space, resulting in solutions far from optimal. It also restricts the adaptability of the attack to non-differentiable objectives. Instead, we introduce a few simple yet effective enhancements of an evolutionary-based algorithm to solve the discrete optimization problem directly. Our Evolutionary Attack (EvA) works with any black-box model and objective, eliminating the need for a differentiable proxy loss. This allows us to design two novel attacks that reduce the effectiveness of robustness certificates and break conformal sets. The memory complexity of our attack is linear in the attack budget. In our experiments, EvA shows an additional drop in accuracy of about 11% on average compared to the best previous attack, revealing significant untapped potential in designing attacks.
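
To make the idea of a gradient-free, discrete structure attack concrete, here is a minimal genetic-search sketch. It is not EvA itself: the operators, the population settings, and the toy black-box objective (`misclassified`, a neighbor-majority vote) are all assumptions chosen so that the example runs with the standard library only.

```python
import itertools
import random

def misclassified(graph_edges, labels):
    """Toy black-box objective: count nodes whose neighbors do not give a
    strict majority vote for the node's own label."""
    neighbors = {v: [] for v in range(len(labels))}
    for u, v in graph_edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    errors = 0
    for v, nbrs in neighbors.items():
        votes = sum(1 if labels[u] == labels[v] else -1 for u in nbrs)
        if votes <= 0:
            errors += 1
    return errors

def evolutionary_attack(edges, n_nodes, budget, fitness,
                        pop_size=20, generations=30, seed=0):
    """Search over sets of exactly `budget` edge flips that maximize `fitness`.
    Only fitness values are used, so the objective may be non-differentiable."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(n_nodes), 2))
    clean = frozenset(edges)

    def apply(flips):
        return clean ^ flips  # symmetric difference toggles each chosen pair

    def mutate(flips):
        flips = set(flips)
        flips.remove(rng.choice(sorted(flips)))
        while len(flips) < budget:
            flips.add(rng.choice(all_pairs))
        return frozenset(flips)

    def crossover(a, b):
        return frozenset(rng.sample(sorted(a | b), budget))

    population = [frozenset(rng.sample(all_pairs, budget)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda f: fitness(apply(f)), reverse=True)
        parents = population[: pop_size // 2]  # elitism: best half survives
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = max(population, key=lambda f: fitness(apply(f)))
    return best, fitness(apply(best))

# Two triangles with different labels; edges are stored as (i, j) with i < j.
edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)}
labels = [0, 0, 0, 1, 1, 1]
best, score = evolutionary_attack(edges, 6, budget=2,
                                  fitness=lambda g: misclassified(g, labels))
print(sorted(best), score)
```

Each candidate stores only its flipped pairs, so per-candidate memory grows with the budget rather than with the graph size, which mirrors the linear memory claim in the abstract.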

[LG-40] EP-GAT: Energy-based Parallel Graph Attention Neural Network for Stock Trend Classification IJCNN2025

Link: https://arxiv.org/abs/2507.08184
Authors: Zhuodong Jiang,Pengju Zhang,Peter Martin
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted by IJCNN 2025, oral presentation

Click to view abstract

Abstract:Graph neural networks have shown remarkable performance in forecasting stock movements, which arises from learning complex inter-dependencies between stocks and intra-dynamics of stocks. Existing approaches based on graph neural networks typically rely on static or manually defined factors to model changing inter-dependencies between stocks. Furthermore, these works often struggle to preserve hierarchical features within stocks. To bridge these gaps, this work presents the Energy-based Parallel Graph Attention Neural Network, a novel approach for predicting future movements for multiple stocks. First, it generates a dynamic stock graph with the energy difference between stocks and Boltzmann distribution, capturing evolving inter-dependencies between stocks. Then, a parallel graph attention mechanism is proposed to preserve the hierarchical intra-stock dynamics. Extensive experiments on five real-world datasets are conducted to validate the proposed approach, spanning the US stock markets (NASDAQ, NYSE, SP) and UK stock markets (FTSE, LSE). The experimental results demonstrate that EP-GAT consistently outperforms five competitive baselines on test periods across various metrics. The ablation studies and hyperparameter sensitivity analysis further validate the effectiveness of each module in the proposed method.
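
The abstract does not say how a stock's energy is defined, so the following is only one plausible reading of "energy difference plus Boltzmann distribution": edge weights proportional to exp(-|E_i - E_j| / T), normalized per row. The function name, the scalar energies, and the temperature parameter are hypothetical.

```python
import math

def boltzmann_adjacency(energies, temperature=1.0):
    """Row-normalized edge weights from pairwise energy differences.
    Assumes at least two stocks; self-loops are given weight 0."""
    n = len(energies)
    weights = []
    for i in range(n):
        row = [0.0 if i == j
               else math.exp(-abs(energies[i] - energies[j]) / temperature)
               for j in range(n)]
        z = sum(row)  # per-row partition function
        weights.append([w / z for w in row])
    return weights

# Hypothetical per-stock energies for one time window.
adjacency = boltzmann_adjacency([0.1, 0.2, 1.5])
for row in adjacency:
    print([round(w, 3) for w in row])
```

Recomputing the matrix for every time window would yield a graph that changes with the data, which is the property the abstract emphasizes.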

[LG-41] CTRLS: Chain-of-Thought Reasoning via Latent State-Transition

Link: https://arxiv.org/abs/2507.08182
Authors: Junda Wu,Yuxin Xiong,Xintong Li,Zhengmian Hu,Tong Yu,Rui Wang,Xiang Chen,Jingbo Shang,Julian McAuley
Categories: Machine Learning (cs.LG)
Comments: 10 pages

Click to view abstract

Abstract:Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modelling reasoning actions as explicit probability distributions in latent space, our approach explicitly models epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses provide evidence lower bounds (ELBO), theoretically grounding our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.

[LG-42] Emotion Recognition in Older Adults with Quantum Machine Learning and Wearable Sensors

Link: https://arxiv.org/abs/2507.08175
Authors: Md. Saif Hassan Onim,Travis S. Humble,Himanshu Thapliyal
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Quantum Physics (quant-ph)
Comments:

Click to view abstract

Abstract:We investigate the feasibility of inferring emotional states exclusively from physiological signals, thereby presenting a privacy-preserving alternative to conventional facial recognition techniques. We conduct a performance comparison of classical machine learning algorithms and hybrid quantum machine learning (QML) methods with a quantum kernel-based model. Our results indicate that the quantum-enhanced SVM surpasses classical counterparts in classification performance across all emotion categories, even when trained on limited datasets. The F1 scores for all classes exceed 80%, with a maximum improvement of around 36% in recall. The integration of wearable sensor data with quantum machine learning not only enhances accuracy and robustness but also facilitates unobtrusive emotion recognition. This methodology holds promise for populations with impaired communication abilities, such as individuals with Alzheimer’s Disease and Related Dementias (ADRD) and veterans with Post-Traumatic Stress Disorder (PTSD). The findings establish an early foundation for passive emotional monitoring in clinical and assisted living conditions.

[LG-43] Emotion Detection in Older Adults Using Physiological Signals from Wearable Sensors

Link: https://arxiv.org/abs/2507.08167
Authors: Md. Saif Hassan Onim,Andrew M. Kiselica,Himanshu Thapliyal
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Emotion detection in older adults is crucial for understanding their cognitive and emotional well-being, especially in hospital and assisted living environments. In this work, we investigate an edge-based, non-obtrusive approach to emotion identification that uses only physiological signals obtained via wearable sensors. Our dataset includes data from 40 older individuals. Emotional states were obtained using physiological signals from the Empatica E4 and Shimmer3 GSR+ wristband and facial expressions were recorded using camera-based emotion recognition with the iMotion’s Facial Expression Analysis (FEA) module. The dataset also contains twelve emotion categories in terms of relative intensities. We aim to study how well emotion recognition can be accomplished using only physiological sensor data, without the requirement for cameras or intrusive facial analysis. By leveraging classical machine learning models, we predict the intensity of emotional responses based on physiological signals. On the regression task, the best R^2 score we achieved was 0.782 and the lowest MSE was 0.0006. This method has significant implications for individuals with Alzheimer’s Disease and Related Dementia (ADRD), as well as veterans coping with Post-Traumatic Stress Disorder (PTSD) or other cognitive impairments. Our results across multiple classical regression models validate the feasibility of this method, paving the way for privacy-preserving and efficient emotion recognition systems in real-world settings.

[LG-44] Just Read the Question: Enabling Generalization to New Assessment Items with Text Awareness

Link: https://arxiv.org/abs/2507.08154
Authors: Arisha Khan,Nathaniel Li,Tori Shen,Anna N. Rafferty
Categories: Machine Learning (cs.LG)
Comments: Poster paper at Educational Data Mining (EDM) 2025

Click to view abstract

Abstract:Machine learning has been proposed as a way to improve educational assessment by making fine-grained predictions about student performance and learning relationships between items. One challenge with many machine learning approaches is incorporating new items, as these approaches rely heavily on historical data. We develop Text-LENS by extending the LENS partial variational auto-encoder for educational assessment to leverage item text embeddings, and explore the impact on predictive performance and generalization to previously unseen items. We examine performance on two datasets: Eedi, a publicly available dataset that includes item content, and LLM-Sim, a novel dataset with test items produced by an LLM. We find that Text-LENS matches LENS’ performance on seen items and improves upon it in a variety of conditions involving unseen items; it effectively learns student proficiency from and makes predictions about student performance on new items.

[LG-45] Physics-Informed Neural Networks with Hard Nonlinear Equality and Inequality Constraints

Link: https://arxiv.org/abs/2507.08124
Authors: Ashfaq Iftakher,Rahul Golder,M. M. Faruque Hasan
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 8 figures

Click to view abstract

Abstract:Traditional physics-informed neural networks (PINNs) do not guarantee strict constraint satisfaction. This is problematic in engineering systems where minor violations of governing laws can significantly degrade the reliability and consistency of model predictions. In this work, we develop KKT-Hardnet, a PINN architecture that enforces both linear and nonlinear equality and inequality constraints up to machine precision. It leverages a projection onto the feasible region through solving Karush-Kuhn-Tucker (KKT) conditions of a distance minimization problem. Furthermore, we reformulate the nonlinear KKT conditions using log-exponential transformation to construct a general sparse system with only linear and exponential terms, thereby making the projection differentiable. We apply KKT-Hardnet on both test problems and a real-world chemical process simulation. Compared to multilayer perceptrons and PINNs, KKT-Hardnet achieves higher accuracy and strict constraint satisfaction. This approach allows the integration of domain knowledge into machine learning towards reliable hybrid modeling of complex systems.
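
The projection step mentioned in the abstract can be written out generically. The block below states the textbook distance-minimization problem and its KKT conditions; the symbols (\hat{y} for the raw network output, h and g for the equality and inequality constraints, \lambda and \mu for the multipliers) are notation assumed here, and the paper's own formulation, including its log-exponential reformulation, may differ in detail.

```latex
% Project the raw prediction \hat{y} onto the feasible set
y^{*} = \arg\min_{y}\; \tfrac{1}{2}\, \lVert y - \hat{y} \rVert_2^{2}
\quad \text{s.t.} \quad h(y) = 0, \qquad g(y) \le 0

% KKT conditions of this problem
y - \hat{y} + \nabla h(y)^{\top} \lambda + \nabla g(y)^{\top} \mu = 0   % stationarity
h(y) = 0, \qquad g(y) \le 0                                             % primal feasibility
\mu \ge 0                                                               % dual feasibility
\mu_i \, g_i(y) = 0 \quad \text{for all } i                             % complementary slackness
```

Solving this system for (y, \lambda, \mu) inside the network is what would make the output satisfy the constraints by construction rather than through a penalty term.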

[LG-46] PDE-aware Optimizer for Physics-informed Neural Networks

Link: https://arxiv.org/abs/2507.08118
Authors: Hardik Shukla,Manurag Khullar,Vismay Churiwala
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs) by embedding physical constraints into the loss function. However, standard optimizers such as Adam often struggle to balance competing loss terms, particularly in stiff or ill-conditioned systems. In this work, we propose a PDE-aware optimizer that adapts parameter updates based on the variance of per-sample PDE residual gradients. This method addresses gradient misalignment without incurring the heavy computational costs of second-order optimizers such as SOAP. We benchmark the PDE-aware optimizer against Adam and SOAP on 1D Burgers’, Allen-Cahn and Korteweg-de Vries (KdV) equations. Across both PDEs, the PDE-aware optimizer achieves smoother convergence and lower absolute errors, particularly in regions with sharp gradients. Our results demonstrate the effectiveness of PDE residual-aware adaptivity in enhancing stability in PINNs training. While promising, further scaling on larger architectures and hardware accelerators remains an important direction for future research.

[LG-47] Low-rank Momentum Factorization for Memory Efficient Training

Link: https://arxiv.org/abs/2507.08091
Authors: Pouria Mahdavinia,Mehrdad Mahdavi
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Fine-tuning large foundation models presents significant memory challenges due to stateful optimizers like AdamW, often requiring several times more GPU memory than inference. While memory-efficient methods like parameter-efficient fine-tuning (e.g., LoRA) and optimizer state compression exist, recent approaches like GaLore bridge these by using low-rank gradient projections and subspace moment accumulation. However, such methods may struggle with fixed subspaces or computationally costly offline resampling (e.g., requiring full-matrix SVDs). We propose Momentum Factorized SGD (MoFaSGD), which maintains a dynamically updated low-rank SVD representation of the first-order momentum, closely approximating its full-rank counterpart throughout training. This factorization enables a memory-efficient fine-tuning method that adaptively updates the optimization subspace at each iteration. Crucially, MoFaSGD leverages the computed low-rank momentum factors to perform efficient spectrally normalized updates, offering an alternative to subspace moment accumulation. We establish theoretical convergence guarantees for MoFaSGD, proving it achieves an optimal rate for non-convex stochastic optimization under standard assumptions. Empirically, we demonstrate MoFaSGD’s effectiveness on large language model alignment benchmarks, achieving a competitive trade-off between memory reduction (comparable to LoRA) and performance compared to state-of-the-art low-rank optimization methods. Our implementation is available at this https URL.

[LG-48] Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Link: https://arxiv.org/abs/2507.08068
Authors: Simon Matrenok,Skander Moalla,Caglar Gulcehre
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations – reward model scores, AlpacaEval 2, and LeetCode – compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
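
The "closed-form solution" and "analytically tractable partition function" can be illustrated with the standard KL-regularized derivation. The notation below (\pi_{\mathrm{ref}} for the reference policy, \beta for the regularization strength, r for the reward) is assumed, and the last line relies on the reward being a quantile that is uniform on [0, 1] under the reference policy; the paper's exact loss may include further details.

```latex
% KL-regularized objective for a prompt x, with \beta > 0
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Closed-form maximizer and its partition function
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big),
\qquad
Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[ \exp\!\big( r(x, y) / \beta \big) \big]

% With a quantile reward (uniform on [0, 1] under \pi_{\mathrm{ref}}), Z no longer depends on x
Z = \int_{0}^{1} e^{u / \beta}\, du = \beta \big( e^{1 / \beta} - 1 \big)
```

Because Z is then a known constant, the identity r(x, y) = \beta \log\big(\pi^{*}(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)\big) + \beta \log Z can serve as a pointwise regression target for a single response, whereas preference-based methods need a pair of responses so that the unknown Z(x) cancels.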

[LG-49] Entangled Threats: A Unified Kill Chain Model for Quantum Machine Learning Security

Link: https://arxiv.org/abs/2507.08623
Authors: Pascal Debus,Maximilian Wendlinger,Kilian Tscharke,Daniel Herr,Cedric Brügmann,Daniel Ohl de Mello,Juris Ulmanis,Alexander Erhard,Arthur Schmidt,Fabian Petsch
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted for publication at IEEE International Conference on Quantum Computing and Engineering (QCE) 2025

Click to view abstract

Abstract:Quantum Machine Learning (QML) systems inherit vulnerabilities from classical machine learning while introducing new attack surfaces rooted in the physical and algorithmic layers of quantum computing. Despite a growing body of research on individual attack vectors - ranging from adversarial poisoning and evasion to circuit-level backdoors, side-channel leakage, and model extraction - these threats are often analyzed in isolation, with unrealistic assumptions about attacker capabilities and system environments. This fragmentation hampers the development of effective, holistic defense strategies. In this work, we argue that QML security requires more structured modeling of the attack surface, capturing not only individual techniques but also their relationships, prerequisites, and potential impact across the QML pipeline. We propose adapting kill chain models, widely used in classical IT and cybersecurity, to the quantum machine learning context. Such models allow for structured reasoning about attacker objectives, capabilities, and possible multi-stage attack paths - spanning reconnaissance, initial access, manipulation, persistence, and exfiltration. Based on extensive literature analysis, we present a detailed taxonomy of QML attack vectors mapped to corresponding stages in a quantum-aware kill chain framework that is inspired by the MITRE ATLAS for classical machine learning. We highlight interdependencies between physical-level threats (like side-channel leakage and crosstalk faults), data and algorithm manipulation (such as poisoning or circuit backdoors), and privacy attacks (including model extraction and training data inference). This work provides a foundation for more realistic threat modeling and proactive security-in-depth design in the emerging field of quantum machine learning.

[LG-50] Quantum Algorithms for Projection-Free Sparse Convex Optimization

Link: https://arxiv.org/abs/2507.08543
Authors: Jianhao He,John C.S. Lui
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper considers the projection-free sparse convex optimization problem for the vector domain and the matrix domain, which covers a large number of important applications in machine learning and data science. For the vector domain $\mathcal{D} \subset \mathbb{R}^d$, we propose two quantum algorithms for sparse constraints that find an $\varepsilon$-optimal solution with query complexities of $O(\sqrt{d}/\varepsilon)$ and $O(1/\varepsilon)$ using the function value oracle, improving on the best classical algorithm by factors of $O(\sqrt{d})$ and $O(d)$, respectively, where $d$ is the dimension. For the matrix domain $\mathcal{D} \subset \mathbb{R}^{d\times d}$, we propose two quantum algorithms for nuclear norm constraints that improve the time complexity to $\tilde{O}(rd/\varepsilon^2)$ and $\tilde{O}(\sqrt{r}d/\varepsilon^3)$ for computing the update step, improving on the best classical algorithm by at least a factor of $O(\sqrt{d})$, where $r$ is the rank of the gradient matrix. Our algorithms show quantum advantages in projection-free sparse convex optimization problems as they outperform the optimal classical methods in dependence on the dimension $d$.
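
As background for "projection-free", the classical Frank-Wolfe method over an l1 ball is sketched below. This is the classical baseline family, not one of the paper's quantum algorithms; the function name, the 2/(t+2) step size, and the quadratic test objective are assumptions for illustration.

```python
def frank_wolfe_l1(grad, dim, radius=1.0, steps=400):
    """Minimize a smooth convex function over {x : ||x||_1 <= radius}.
    Each step only needs the coordinate with the largest |gradient|,
    so no projection onto the ball is ever computed."""
    x = [0.0] * dim
    for t in range(steps):
        g = grad(x)
        i = max(range(dim), key=lambda j: abs(g[j]))
        # Linear minimization oracle: the best vertex of the l1 ball.
        s = [0.0] * dim
        s[i] = -radius if g[i] > 0 else radius
        gamma = 2.0 / (t + 2)  # standard step size; gamma = 1 on the first step
        x = [(1 - gamma) * xj + gamma * sj for xj, sj in zip(x, s)]
    return x

# Example: f(x) = 0.5 * ||x - b||^2 with the minimizer b inside the ball.
b = [0.5, 0.0, 0.0]
objective = lambda x: 0.5 * sum((xj - bj) ** 2 for xj, bj in zip(x, b))
gradient = lambda x: [xj - bj for xj, bj in zip(x, b)]
x_hat = frank_wolfe_l1(gradient, dim=3)
print(x_hat, objective(x_hat))
```

The per-step cost is dominated by finding the largest gradient entry, a search over d coordinates, which is plausibly where the dimension-dependent speedups in the abstract come from.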

[LG-51] Data Depth as a Risk

Link: https://arxiv.org/abs/2507.08518
Authors: Arturo Castellanos,Pavlo Mozharovskyi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Data depths are score functions that quantify, in an unsupervised fashion, how central a point is inside a distribution, with numerous applications such as anomaly detection and multivariate or functional data analysis arising across various fields. The halfspace depth was the first depth to aim at generalising the notion of quantile beyond the univariate case. Among the existing variety of depth definitions, it remains one of the most used notions of data depth. Taking a different angle from the quantile point of view, we show that the halfspace depth can also be regarded as the minimum loss of a set of classifiers for a specific labelling of the points. By changing the loss or the set of classifiers considered, this new angle naturally leads to a family of “loss depths”, extending to well-studied classifiers such as, e.g., SVM or logistic regression, among others. This framework directly inherits computational efficiency of existing machine learning algorithms as well as their fast statistical convergence rates, and opens the data depth realm to the high-dimensional setting. Furthermore, the new loss depths highlight a connection between the dataset and the right amount of complexity or simplicity of the classifiers. The simplicity of classifiers as well as the interpretation as a risk makes our new kind of data depth easy to explain, yet efficient for anomaly detection, as is shown by experiments.
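
For readers unfamiliar with halfspace (Tukey) depth, a small 2D approximation is sketched here: the depth of x is the smallest fraction of sample points contained in any closed halfspace whose boundary passes through x. Scanning a finite grid of directions is an approximation chosen for brevity, and the function name and sample are assumptions.

```python
import math

def halfspace_depth(x, sample, n_dirs=360):
    """Approximate halfspace depth of x in 2D by scanning n_dirs directions."""
    depth = 1.0
    for k in range(n_dirs):
        angle = 2 * math.pi * k / n_dirs
        u = (math.cos(angle), math.sin(angle))
        # Fraction of the sample lying on the u-side of the line through x.
        inside = sum(1 for p in sample
                     if (p[0] - x[0]) * u[0] + (p[1] - x[1]) * u[1] >= 0)
        depth = min(depth, inside / len(sample))
    return depth

sample = [(1, 0), (-1, 0), (0, 1), (0, -1),
          (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5)]
center_depth = halfspace_depth((0.0, 0.0), sample)
outlier_depth = halfspace_depth((10.0, 10.0), sample)
print(center_depth, outlier_depth)
```

Each direction u acts like a linear classifier through x and the counted fraction like its 0-1 loss, which appears to be the viewpoint the abstract generalizes to other losses and classifier families.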

[LG-52] Optimal and Practical Batched Linear Bandit Algorithm ICML2025

Link: https://arxiv.org/abs/2507.08438
Authors: Sanghoon Yu,Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at ICML 2025

Click to view abstract

Abstract:We study the linear bandit problem under limited adaptivity, known as the batched linear bandit. While existing approaches can achieve near-optimal regret in theory, they are often computationally prohibitive or underperform in practice. We propose BLAE, a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving the minimax optimal regret (up to logarithmic factors in $T$) in both large-$K$ and small-$K$ regimes for the first time, while using only $O(\log\log T)$ batches. Our analysis introduces new techniques for batch-wise optimal design and refined concentration bounds. Crucially, BLAE demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations. Thus, BLAE is the first algorithm to combine provable minimax-optimality in all regimes and practical superiority in batched linear bandits.
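
To see why O(log log T) batches can suffice, it helps to look at the batch schedule commonly used in the batched bandit literature, where batch i ends near T^(1 - 2^(-i)). Whether BLAE uses exactly this grid is not stated in the abstract, so the schedule and the function name below are assumptions.

```python
import math

def doubling_exponent_grid(horizon):
    """Batch end times t_i = ceil(T^(1 - 2^-i)), with the last batch
    extended to the full horizon T. Assumes horizon >= 4."""
    n_batches = math.ceil(math.log2(math.log2(horizon))) + 1
    ends = [min(horizon, math.ceil(horizon ** (1 - 2.0 ** (-i))))
            for i in range(1, n_batches + 1)]
    ends[-1] = horizon
    return ends

grid = doubling_exponent_grid(10 ** 6)
print(len(grid), grid)
```

For a horizon of one million rounds this produces six batches, so the policy is updated only six times over the whole run.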

[LG-53] SPINT: Spatial Permutation-Invariant Neural Transformer for Consistent Intracortical Motor Decoding

Link: https://arxiv.org/abs/2507.08402
Authors: Trung Le,Hao Fang,Jingyuan Li,Tung Nguyen,Lu Mi,Amy Orsborn,Uygar Sümbül,Eli Shlizerman
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Intracortical Brain-Computer Interfaces (iBCI) aim to decode behavior from neural population activity, enabling individuals with motor impairments to regain motor functions and communication abilities. A key challenge in long-term iBCI is the nonstationarity of neural recordings, where the composition and tuning profiles of the recorded populations are unstable across recording sessions. Existing methods attempt to address this issue by explicit alignment techniques; however, they rely on fixed neural identities and require test-time labels or parameter updates, limiting their generalization across sessions and imposing additional computational burden during deployment. In this work, we introduce SPINT - a Spatial Permutation-Invariant Neural Transformer framework for behavioral decoding that operates directly on unordered sets of neural units. Central to our approach is a novel context-dependent positional embedding scheme that dynamically infers unit-specific identities, enabling flexible generalization across recording sessions. SPINT supports inference on variable-size populations and allows few-shot, gradient-free adaptation using a small amount of unlabeled data from the test session. To further promote model robustness to population variability, we introduce dynamic channel dropout, a regularization method for iBCI that simulates shifts in population composition during training. We evaluate SPINT on three multi-session datasets from the FALCON Benchmark, covering continuous motor decoding tasks in human and non-human primates. SPINT demonstrates robust cross-session generalization, outperforming existing zero-shot and few-shot unsupervised baselines while eliminating the need for test-time alignment and fine-tuning. Our work contributes an initial step toward a robust and scalable neural decoding framework for long-term iBCI applications.

[LG-54] MIRRAMS: Towards Training Models Robust to Missingness Distribution Shifts

Link: https://arxiv.org/abs/2507.08280
Authors: Jihye Lee,Minseo Kang,Dongha Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In real-world data analysis, missingness distributional shifts between training and test input datasets frequently occur, posing a significant challenge to achieving robust prediction performance. In this study, we propose a novel deep learning framework designed to address such shifts in missingness distributions. We begin by introducing a set of mutual information-based conditions, called MI robustness conditions, which guide a prediction model to extract label-relevant information while remaining invariant to diverse missingness patterns, thereby enhancing robustness to unseen missingness scenarios at test-time. To make these conditions practical, we propose simple yet effective techniques to derive loss terms corresponding to each and formulate a final objective function, termed MIRRAMS (Mutual Information Regularization for Robustness Against Missingness Shifts). As a by-product, our analysis provides a theoretical interpretation of the principles underlying consistency regularization-based semi-supervised learning methods, such as FixMatch. Extensive experiments across various benchmark datasets show that MIRRAMS consistently outperforms existing baselines and maintains stable performance across diverse missingness scenarios. Moreover, our approach achieves state-of-the-art performance even without missing data and can be naturally extended to address semi-supervised learning tasks, highlighting MIRRAMS as a powerful, off-the-shelf framework for general-purpose learning.

[LG-55] Admissibility of Stein Shrinkage for Batch Normalization in the Presence of Adversarial Attacks

Link: https://arxiv.org/abs/2507.08261
Authors: Sofia Ivolgina, P. Thomas Fletcher, Baba C. Vemuri
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Batch normalization (BN) is a ubiquitous operation in deep neural networks used primarily to achieve stability and regularization during network training. BN involves feature map centering and scaling using sample means and variances, respectively. Since these statistics are being estimated across the feature maps within a batch, this problem is ideally suited for the application of Stein’s shrinkage estimation, which leads to a better, in the mean-squared-error sense, estimate of the mean and variance of the batch. In this paper, we prove that the Stein shrinkage estimator for the mean and variance dominates the sample mean and variance estimators in the presence of adversarial attacks when modeling these attacks using sub-Gaussian distributions. This facilitates and justifies the application of Stein shrinkage to estimate the mean and variance parameters in BN and its use in image classification and segmentation tasks with and without adversarial attacks. We present SOTA performance results using this Stein-corrected batch norm in three settings, each with and without adversarial attacks: a standard ResNet architecture for image classification on CIFAR-10 data, a 3D CNN on PPMI (neuroimaging) data, and HRNet for image segmentation on Cityscape data.
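
To make the shrinkage idea concrete, here is a minimal sketch of the classical positive-part James-Stein estimator applied to a vector of per-channel sample means. It is not the estimator from the paper: the function name `james_stein_means`, the choice of the grand mean as the shrinkage target, and the assumption of a known, shared noise variance are all illustrative assumptions.

```python
def james_stein_means(sample_means, noise_var):
    """Shrink per-channel sample means toward their grand mean.

    Illustrative assumptions (not taken from the paper): every sample mean
    has the same known variance `noise_var`, and the shrinkage target is
    the grand mean of all channels.
    """
    p = len(sample_means)
    if p < 4:
        # Shrinking toward an estimated grand mean needs at least 4 means.
        return list(sample_means)
    grand = sum(sample_means) / p
    spread = sum((m - grand) ** 2 for m in sample_means)
    if spread == 0.0:
        # All means already coincide, so there is nothing to shrink.
        return list(sample_means)
    # Positive-part rule: the factor is clipped at 0 so that the estimate
    # never overshoots past the grand mean.
    factor = max(0.0, 1.0 - (p - 3) * noise_var / spread)
    return [grand + factor * (m - grand) for m in sample_means]
```

With a small noise variance the means barely move; as the noise variance grows relative to the spread of the means, every channel is pulled closer to the grand mean.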

[LG-56] Entity-Specific Cyber Risk Assessment using InsurTech Empowered Risk Factors

Link: https://arxiv.org/abs/2507.08193
Authors: Jiayi Guo, Zhiyun Quan, Linfeng Zhang
Subjects: Risk Management (q-fin.RM); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The lack of high-quality public cyber incident data limits empirical research and predictive modeling for cyber risk assessment. This challenge persists due to the reluctance of companies to disclose incidents that could damage their reputation or investor confidence. Therefore, from an actuarial perspective, potential resolutions include two aspects: the enhancement of existing cyber incident datasets and the implementation of advanced modeling techniques to optimize the use of the available data. A review of existing data-driven methods highlights a significant lack of entity-specific organizational features in publicly available datasets. To address this gap, we propose a novel InsurTech framework that enriches cyber incident data with entity-specific attributes. We develop various machine learning (ML) models: a multilabel classification model to predict the occurrence of cyber incident types (e.g., Privacy Violation, Data Breach, Fraud and Extortion, IT Error, and Others) and a multioutput regression model to estimate their annual frequencies. While classifier and regressor chains are also implemented to explore dependencies among cyber incident types, no significant correlations are observed in our datasets. In addition, we apply multiple interpretable ML techniques to identify and cross-validate potential risk factors developed by InsurTech across ML models. We find that InsurTech-empowered features enhance the robustness of occurrence prediction and frequency estimation compared with using only conventional risk factors. The framework generates transparent, entity-specific cyber risk profiles, supporting customized underwriting and proactive cyber risk mitigation. It provides insurers and organizations with data-driven insights to support decision-making and compliance planning.

[LG-57] Robust Semi-Supervised CT Radiomics for Lung Cancer Prognosis: Cost-Effective Learning with Limited Labels and SHAP Interpretation

Link: https://arxiv.org/abs/2507.08189
Authors: Mohammad R. Salmanpour, Amir Hossein Pouria, Sonia Falahati, Shahram Taeb, Somayeh Sadat Mehrnia, Ali Fathi Jouzdani, Mehrdad Oveisi, Ilker Hacihaliloglu, Arman Rahmim
Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures

Abstract:Background: CT imaging is vital for lung cancer management, offering detailed visualization for AI-based prognosis. However, supervised learning (SL) models require large labeled datasets, limiting their real-world application in settings with scarce annotations. Methods: We analyzed CT scans from 977 patients across 12 datasets, extracting 1218 radiomics features using Laplacian of Gaussian and wavelet filters via PyRadiomics. Dimensionality reduction was applied with 56 feature selection and extraction algorithms, and 27 classifiers were benchmarked. A semi-supervised learning (SSL) framework with pseudo-labeling utilized 478 unlabeled and 499 labeled cases. Model sensitivity was tested in three scenarios: varying labeled data in SL, increasing unlabeled data in SSL, and scaling both from 10 percent to 100 percent. SHAP analysis was used to interpret predictions. Cross-validation and external testing in two cohorts were performed. Results: SSL outperformed SL, improving overall survival prediction by up to 17 percent. The top SSL model, Random Forest plus XGBoost classifier, achieved 0.90 accuracy in cross-validation and 0.88 externally. SHAP analysis revealed enhanced feature discriminability in both SSL and SL, especially for Class 1 (survival greater than 4 years). SSL showed strong performance with only 10 percent labeled data, with more stable results compared to SL and lower variance across external testing, highlighting SSL’s robustness and cost effectiveness. Conclusion: We introduced a cost-effective, stable, and interpretable SSL framework for CT-based survival prediction in lung cancer, improving performance, generalizability, and clinical readiness by integrating SHAP explainability and leveraging unlabeled data.
ACM classes: F.2.2, I.2.7
Cite as: arXiv:2507.08189 [physics.med-ph] (or arXiv:2507.08189v1 [physics.med-ph] for this version)
DOI: https://doi.org/10.48550/arXiv.2507.08189
Submission history: From Mohammad R. Salmanpour, [v1] Thu, 10 Jul 2025 21:57:15 UTC (1,122 KB)

[LG-58] Parametrized Quantum Circuit Learning for Quantum Chemical Applications

Link: https://arxiv.org/abs/2507.08183
Authors: Grier M. Jones, Viki Kumar Prasad, Ulrich Fekl, Hans-Arno Jacobsen
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract:In the field of quantum machine learning (QML), parametrized quantum circuits (PQCs) – constructed using a combination of fixed and tunable quantum gates – provide a promising hybrid framework for tackling complex machine learning problems. Despite numerous proposed applications, there remains limited exploration of datasets relevant to quantum chemistry. In this study, we investigate the potential benefits and limitations of PQCs on two chemically meaningful datasets: (1) the BSE49 dataset, containing bond separation energies for 49 different classes of chemical bonds, and (2) a dataset of water conformations, where coupled-cluster singles and doubles (CCSD) wavefunctions are predicted from lower-level electronic structure methods using the data-driven coupled-cluster (DDCC) approach. We construct a comprehensive set of 168 PQCs by combining 14 data encoding strategies with 12 variational ansätze, and evaluate their performance on circuits with 5 and 16 qubits. Our initial analysis examines the impact of circuit structure on model performance using state-vector simulations. We then explore how circuit depth and training set size influence model performance. Finally, we assess the performance of the best-performing PQCs on current quantum hardware, using both noisy simulations (“fake” backends) and real quantum devices. Our findings underscore the challenges of applying PQCs to chemically relevant problems that are straightforward for classical machine learning methods but remain non-trivial for quantum approaches.

[LG-59] CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk

Link: https://arxiv.org/abs/2507.08150
Authors: Ilia Azizi, Juraj Bodik, Jakob Heiss, Bin Yu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Code: this https URL

Abstract:Accurate uncertainty quantification is critical for reliable predictive modeling, especially in regression tasks. Existing methods typically address either aleatoric uncertainty from measurement noise or epistemic uncertainty from limited data, but not necessarily both in a balanced way. We propose CLEAR, a calibration method with two distinct parameters, $\gamma_1$ and $\gamma_2$, to combine the two uncertainty components for improved conditional coverage. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.2% and 17.4% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. This improvement can be particularly evident in scenarios dominated by either high epistemic or high aleatoric uncertainty.

[LG-60] Mallows Model with Learned Distance Metrics: Sampling and Maximum Likelihood Estimation

Link: https://arxiv.org/abs/2507.08108
Authors: Yeganeh Alimohammadi, Kiana Asgari
Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

Abstract:The Mallows model is a widely-used probabilistic framework for learning from ranking data, with applications ranging from recommendation systems and voting to aligning language models with human preferences [chen2024mallows, kleinberg2021algorithmic, rafailov2024direct]. Under this model, observed rankings are noisy perturbations of a central ranking $\sigma$, with likelihood decaying exponentially in distance from $\sigma$, i.e., $P(\pi) \propto \exp\big(-\beta \cdot d(\pi, \sigma)\big)$, where $\beta > 0$ controls dispersion and $d$ is a distance function. Existing methods mainly focus on fixed distances (such as Kendall’s $\tau$ distance), with no principled approach to learning the distance metric directly from data. In practice, however, rankings naturally vary by context; for instance, in some sports we regularly see long-range swaps (a low-rank team beating a high-rank one), while in others such events are rare. Motivated by this, we propose a generalization of the Mallows model that learns the distance metric directly from data. Specifically, we focus on $L_\alpha$ distances: $d_\alpha(\pi,\sigma) := \sum_{i=1} |\pi(i)-\sigma(i)|^\alpha$. For any $\alpha \geq 1$ and $\beta > 0$, we develop a Fully Polynomial-Time Approximation Scheme (FPTAS) to efficiently generate samples that are $\epsilon$-close (in total variation distance) to the true distribution. Even in the special cases of $L_1$ and $L_2$, this generalizes prior results that required vanishing dispersion ($\beta \to 0$). Using this sampling algorithm, we propose an efficient Maximum Likelihood Estimation (MLE) algorithm that jointly estimates the central ranking, the dispersion parameter, and the optimal distance metric. We prove strong consistency results for our estimators (for any values of $\alpha$ and $\beta$), and we validate our approach empirically using datasets from sports rankings.
MSC classes: 62F10, 62H20, 68W20, 60C05
ACM classes: F.2.2; G.3; I.2.6; H.2.8
Cite as: arXiv:2507.08108 [stat.ML] (or arXiv:2507.08108v1 [stat.ML] for this version)
DOI: https://doi.org/10.48550/arXiv.2507.08108
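
The likelihood in the abstract above can be checked by hand for very small rankings. The sketch below enumerates all permutations and normalizes the weights exp(-beta * d_alpha(pi, sigma)) directly. It is a brute-force illustration only, not the paper's FPTAS sampler, and the function names `l_alpha_distance` and `mallows_pmf` are made up for this example. A ranking is represented as a tuple whose i-th entry is the position assigned to item i.

```python
import itertools
import math


def l_alpha_distance(pi, sigma, alpha):
    """L_alpha distance: sum over items of |pi(i) - sigma(i)| ** alpha."""
    return sum(abs(p - s) ** alpha for p, s in zip(pi, sigma))


def mallows_pmf(sigma, beta, alpha):
    """Exact Mallows probabilities by enumeration (only feasible for tiny n).

    Returns a dict mapping every permutation of the positions in `sigma`
    to its probability, proportional to exp(-beta * d_alpha(pi, sigma)).
    """
    perms = list(itertools.permutations(sorted(sigma)))
    weights = [math.exp(-beta * l_alpha_distance(pi, sigma, alpha)) for pi in perms]
    normalizer = sum(weights)
    return {pi: w / normalizer for pi, w in zip(perms, weights)}


if __name__ == "__main__":
    pmf = mallows_pmf(sigma=(1, 2, 3), beta=1.0, alpha=1.0)
    for ranking, prob in sorted(pmf.items(), key=lambda kv: -kv[1]):
        print(ranking, round(prob, 4))
```

Enumeration costs n! evaluations, which is exactly why an efficient approximate sampler matters for realistic ranking lengths.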

[LG-61] Predicting Flow Dynamics using Diffusion Models

Link: https://arxiv.org/abs/2507.08106
Authors: Yannick Gachnang, Vismay Churiwala
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:In this work, we aimed to replicate and extend the results presented in the DiffFluid paper[1]. The DiffFluid model showed that diffusion models combined with Transformers are capable of predicting fluid dynamics. It uses a denoising diffusion probabilistic model (DDPM) framework to tackle Navier-Stokes and Darcy flow equations. Our goal was to validate the reproducibility of the methods in the DiffFluid paper while testing its viability for other simulation types, particularly the Lattice Boltzmann method. Despite our computational limitations and time constraints, this work provides evidence of the flexibility and potential of the model as a general-purpose solver for fluid dynamics. Our results show both the potential and challenges of applying diffusion models to complex fluid dynamics problems. This work highlights the opportunities for future research in optimizing the computational efficiency and scaling such models in broader domains.

Information Retrieval

[IR-0] Digital gazetteers: review and prospects for place name knowledge bases

Link: https://arxiv.org/abs/2507.08553
Authors: Kalana Wijegunarathna, Kristin Stock, Christopher B. Jones
Subjects: Information Retrieval (cs.IR)

Abstract:Gazetteers typically store data on place names, place types and the associated coordinates. They play an essential role in disambiguating place names in online geographical information retrieval systems for navigation and mapping, detecting and disambiguating place names in text, and providing coordinates. Currently there are many gazetteers in use derived from many sources, with no commonly accepted standard for encoding the data. Most gazetteers are also very limited in the extent to which they represent the multiple facets of the named places yet they have potential to assist user search for locations with specific physical, commercial, social or cultural characteristics. With a view to understanding digital gazetteer technologies and advancing their future effectiveness for information retrieval, we provide a review of data sources, components, software and data management technologies, data quality and volunteered data, and methods for matching sources that refer to the same real-world places. We highlight the need for future work on richer representation of named places, the temporal evolution of place identity and location, and the development of more effective methods for data integration.

[IR-1] Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

Link: https://arxiv.org/abs/2507.08480
Authors: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Heuiseok Lim
Subjects: Information Retrieval (cs.IR)

Abstract:With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.
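
The abstract does not say which merging technique is used, so the following is only a sketch of the simplest variant: linear interpolation of the parameters of two checkpoints that share an architecture. The function name `merge_weights`, the parameter-dictionary layout, and the `weight_a` mixing coefficient are assumptions made for illustration.

```python
def merge_weights(params_a, params_b, weight_a=0.5):
    """Linearly interpolate two checkpoints with identical parameter names.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats. This layout is an assumption for the sketch; real checkpoints
    would hold tensors.
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in params_a:
        a, b = params_a[name], params_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]
    return merged
```

In the setting described above, one checkpoint could be a model tuned for cross-lingual retrieval and the other a model tuned for mono-lingual retrieval, with `weight_a` controlling how much each contributes.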

[IR-2] DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval

Link: https://arxiv.org/abs/2507.08360
Authors: Anthony Miyaguchi, Imran Afrulbasha, Aleksandar Pramov
Subjects: Information Retrieval (cs.IR)

Abstract:Information Retrieval (IR) models are often trained on static datasets, making them vulnerable to performance degradation as web content evolves. The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025, which evaluates IR systems across temporally distributed web snapshots. Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time. The two-phase retrieval system employs sparse keyword searches, utilizing query expansion and document reranking. Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05. The accompanying source code for this paper is at this https URL
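
For readers unfamiliar with the NDCG@10 metric reported above, here is a minimal sketch of how it is commonly computed for one query. It assumes the linear-gain form of DCG (some evaluations use 2^rel - 1 instead) and builds the ideal ranking from the supplied relevance list only; the function names are illustrative and not taken from the paper's code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query, given graded relevances in ranked order.

    Assumption: the ideal ordering is the same list sorted by relevance,
    so relevant documents that were never retrieved are not counted.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A system-level score such as the 0.296 average would then be the mean of `ndcg_at_k` over all evaluated queries.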

Attachment Download

Click to download the full list of today's papers
