本篇博文主要内容为 2025-07-15 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-07-15)
今日共更新908篇论文,其中:
- 自然语言处理共96篇(Computation and Language (cs.CL))
- 人工智能共251篇(Artificial Intelligence (cs.AI))
- 计算机视觉共199篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共284篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
【速读】: 该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在具身化设置中的有效性问题,即在需要在线交互和主动场景理解的环境中,VLMs表现出的局限性。其解决方案的关键在于引入EmRACE-3K数据集,该数据集包含3,000多个语言引导任务,这些任务位于使用Unreal Engine和UnrealCV-Zoo框架构建的多样化、逼真环境中,涵盖了导航、物体操作和多阶段目标执行等具身挑战。通过EmRACE-3K,研究者建立了评估VLMs在探索、动态空间语义推理和多阶段目标执行三个关键维度上的具身推理能力的基准。
链接: https://arxiv.org/abs/2507.10548
作者: Mingxian Lin,Wei Huang,Yitang Li,Chengjie Jiang,Kui Wu,Fangwei Zhong,Shengju Qian,Xin Wang,Xiaojuan Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project page: this https URL
Abstract:Recent advanced vision-language models(VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent’s intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset’s effectiveness in enabling the development of embodied reasoning capabilities.
zh
[NLP-1] REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
【速读】: 该论文试图解决当前大型推理模型(Large Reasoning Models, LRMs)在评估方法上的局限性,即现有基准测试主要通过顺序测试单个问题来评估模型的推理能力,导致数据污染风险高、题目难度不足以及无法有效评估模型在多上下文压力下的表现。解决方案的关键是提出REST(Reasoning Evaluation through Simultaneous Testing)框架,该框架通过同时暴露LRMs于多个问题,以更真实地模拟现实世界的推理需求,并重点评估模型在上下文优先级分配、跨问题干扰抵抗和动态认知负荷管理等方面的性能。
链接: https://arxiv.org/abs/2507.10541
作者: Zhuoshi Pan,Qizhi Pei,Yu Li,Qiyao Sun,Zinan Tang,H. Vicky Zhao,Conghui He,Lijun Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注: REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that concurrently exposes LRMs to multiple problems simultaneously
Abstract:Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting critical limitations: (1) vulnerability to data contamination and less challenging (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing costly and perpetual creation of new questions with large human efforts, (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that concurrently exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key mechanistic insights emerge from our analysis: (1) the “overthinking trap” is a critical factor contributing to the performance degradation; (2) the models trained with “long2short” technique preserve more accuracy of their single-problem performance under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
zh
[NLP-2] CodeJudgeBench: Benchmarking LLM -as-a-Judge for Coding Tasks
【速读】: 该论文试图解决在代码生成场景中,大型语言模型作为评判者(LLM-as-a-Judge)的有效性与可靠性问题,尤其是缺乏专门的评估基准导致该方法在代码任务中的表现未被充分研究。解决方案的关键在于引入CodeJudgeBench,这是一个专门为评估LLM-as-a-Judge模型在代码生成、代码修复和单元测试生成三个关键任务上的性能而设计的基准。通过该基准,研究揭示了思维型模型相较于非思维型模型在代码评判任务中的显著优势,并探索了提升评判效果的最优提示策略。
链接: https://arxiv.org/abs/2507.10535
作者: Hongchao Jiang,Yiming Chen,Yushi Cao,Hung-yi Lee,Robby T. Tan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Dataset is available at this https URL
Abstract:Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that using pair-wise comparison outperforms scalar point-wise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
zh
[NLP-3] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练中依赖可能存在数据泄露的基准测试所导致的性能评估不可靠问题。其解决方案的关键在于引入一个生成器,用于创建完全合成的算术问题数据集,即RandomCalculation,该数据集避免了数据污染,从而能够更准确地评估RL方法的有效性。通过使用无泄漏的数据集,研究证明只有准确的奖励信号才能持续提升模型性能,而噪声或错误的奖励信号则无效。
链接: https://arxiv.org/abs/2507.10532
作者: Mingqi Wu,Zhihao Zhang,Qiaole Dong,Zhiheng Xi,Jun Zhao,Senjie Jin,Xiaoran Fan,Yuhao Zhou,Yanwei Fu,Qin Liu,Songyang Zhang,Qi Zhang
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of California, Davis (加州大学戴维斯分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages
Abstract:The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
zh
[NLP-4] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
【速读】: 该论文试图解决大规模语言模型在训练和部署过程中面临的计算与内存开销过高的问题,特别是如何同时实现参数效率和自适应计算。其解决方案的关键在于引入Mixture-of-Recursions (MoR),这是一种统一框架,将参数共享与自适应计算整合到一个递归Transformer结构中。MoR通过在递归步骤中复用共享的层栈实现参数效率,同时利用轻量级路由器动态分配不同递归深度给单个令牌,从而实现令牌级别的自适应计算。此外,MoR通过仅对当前递归深度下活跃的令牌进行二次注意力计算,并选择性缓存其键值对,进一步提升了内存访问效率。
链接: https://arxiv.org/abs/2507.10524
作者: Sangmin Bae,Yujin Kim,Reza Bayat,Sungnyun Kim,Jiyoun Ha,Tal Schuster,Adam Fisch,Hrayr Harutyunyan,Ziwei Ji,Aaron Courville,Se-Young Yun
机构: KAIST AI(韩国科学技术院人工智能); Mila(蒙特利尔学习算法研究所); Google Cloud(谷歌云); Google DeepMind(谷歌深度思维); Google Research(谷歌研究院); Université de Montréal(蒙特利尔大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 36 pages, 9 figures, 14 tables, codes at this https URL
Abstract:Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
zh
[NLP-5] DeepResearchtextEco: A Recursive Agent ic Workflow for Complex Scientific Question Answering in Ecology
【速读】: 该论文试图解决科学文献综述过程中检索多样性与深度不足的问题,以及传统检索增强生成流程在用户可控性、透明推理和参数配置方面的局限性。其解决方案的关键在于提出一种基于代理的大型语言模型(LLM)系统——DeepResearch^Eco,该系统支持递归、深度和广度可控的研究问题探索,实现了领域特定证据的高通量整合,同时保持分析严谨性。
链接: https://arxiv.org/abs/2507.10522
作者: Jennifer D’Souza,Endres Keno Sander,Andrei Aioanei
机构: TIB Leibniz Information Centre for Science and Technology, Hannover, Germany(德国汉诺威莱布尼茨信息中心); Leibniz University Hannover, Germany(德国汉诺威莱布尼茨大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 12 pages, 3 figures
Abstract:We introduce DeepResearch ^\textEco , a novel agentic LLM-based system for automated scientific synthesis that supports recursive, depth- and breadth-controlled exploration of original research questions – enhancing search diversity and nuance in the retrieval of relevant scientific literature. Unlike conventional retrieval-augmented generation pipelines, DeepResearch enables user-controllable synthesis with transparent reasoning and parameter-driven configurability, facilitating high-throughput integration of domain-specific evidence while maintaining analytical rigor. Applied to 49 ecological research questions, DeepResearch achieves up to a 21-fold increase in source integration and a 14.9-fold rise in sources integrated per 1,000 words. High-parameter settings yield expert-level analytical depth and contextual diversity. Source code available at: this https URL. Comments: 12 pages, 3 figures Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA) Cite as: arXiv:2507.10522 [cs.AI] (or arXiv:2507.10522v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.10522 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-6] Can You Detect the Difference?
【速读】: 该论文试图解决如何有效检测扩散模型生成的文本(LLaDA)与自回归模型生成的文本(LLaMA)的问题,特别是在现有基于自回归模型的检测方法在面对扩散模型生成文本时表现出较高的误漏检率。论文的关键解决方案在于提出需要开发针对扩散模型的检测方法,包括构建混合模型、识别扩散特有的风格特征以及采用稳健的水印技术,以提高对扩散生成文本的检测能力。
链接: https://arxiv.org/abs/2507.10475
作者: İsmail Tarım,Aytuğ Onan
机构: İzmir Katip Çelebi Univ.(伊兹密尔卡蒂普切莱比大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 2 tables. Code and data: this https URL . Cross-list requested to cs.AI for AI-safety relevance
Abstract:The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is unknown. We present the first systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2 000 samples. Perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores show that LLaDA closely mimics human text in perplexity and burstiness, yielding high false-negative rates for AR-oriented detectors. LLaMA shows much lower perplexity but reduced lexical fidelity. Relying on any single metric fails to separate diffusion outputs from human writing. We highlight the need for diffusion-aware detectors and outline directions such as hybrid models, diffusion-specific stylometric signatures, and robust watermarking.
zh
[NLP-7] MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking
【速读】: 该论文旨在解决传统招聘流程中由于时间和资源限制导致的简历筛选与候选人初步筛选的瓶颈问题。其解决方案的关键在于引入一种基于机器学习的自动化机器人流程(MLAR),该框架利用大型语言模型(LLMs)在三个层次上进行操作:第一层从职位发布中提取关键特征,第二层解析申请人简历以识别教育背景、工作经验和技能,第三层进行相似性匹配。通过先进的语义算法实现特征匹配,从而高效地识别最佳候选人。
链接: https://arxiv.org/abs/2507.10472
作者: Mohamed T. Younes,Omar Walid,Mai Hassan,Ali Hamdi
机构: MSA University (MSA大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper introduces an innovative Applicant Tracking System (ATS) enhanced by a novel Robotic process automation (RPA) framework or as further referred to as MLAR. Traditional recruitment processes often encounter bottlenecks in resume screening and candidate shortlisting due to time and resource constraints. MLAR addresses these challenges employing Large Language Models (LLMs) in three distinct layers: extracting key characteristics from job postings in the first layer, parsing applicant resume to identify education, experience, skills in the second layer, and similarity matching in the third layer. These features are then matched through advanced semantic algorithms to identify the best candidates efficiently. Our approach integrates seamlessly into existing RPA pipelines, automating resume parsing, job matching, and candidate notifications. Extensive performance benchmarking shows that MLAR outperforms the leading RPA platforms, including UiPath and Automation Anywhere, in high-volume resume-processing tasks. When processing 2,400 resumes, MLAR achieved an average processing time of 5.4 seconds per resume, reducing processing time by approximately 16.9% compared to Automation Anywhere and 17.1% compared to UiPath. These results highlight the potential of MLAR to transform recruitment workflows by providing an efficient, accurate, and scalable solution tailored to modern hiring needs.
zh
[NLP-8] From BERT to Qwen : Hate Detection across architectures
【速读】: 该论文试图解决在线平台在遏制仇恨言论时面临的挑战,即如何在不过度审查合法讨论的前提下有效识别仇恨言论。解决方案的关键在于评估不同模型家族(经典双向变换器编码器与新一代超大规模自回归语言模型)在真实网络互动语料库上的仇恨言论检测性能,以验证模型规模的增加是否能实际提升检测效果。
链接: https://arxiv.org/abs/2507.10468
作者: Ariadna Mon,Saúl Fenollosa,Jon Lecumberri
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 4 pages, 5 figures. EE-559 Deep Learning course project (Group 11)
Abstract:Online platforms struggle to curb hate speech without over-censoring legitimate discourse. Early bidirectional transformer encoders made big strides, but the arrival of ultra-large autoregressive LLMs promises deeper context-awareness. Whether this extra scale actually improves practical hate-speech detection on real-world text remains unverified. Our study puts this question to the test by benchmarking both model families, classic encoders and next-generation LLMs, on curated corpora of online interactions for hate-speech detection (Hate or No Hate).
zh
[NLP-9] Referential ambiguity and clarification requests: comparing human and LLM behaviour
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在任务导向对话中生成澄清问题的能力及其与指代歧义和任务不确定性之间的关系问题。其解决方案的关键在于构建一个整合了Minecraft Dialogue Corpus中两种现有标注的新型语料库,该语料库将指代与歧义的标注和基于结构化的语篇关系理论(SDRT)的澄清信息统一为一种共同格式,从而为研究澄清问题及其与歧义的关系提供了必要的数据支持。通过该语料库,作者对比了LLMs与人类生成的澄清问题,分析了两者在面对歧义时的行为差异,并探讨了推理能力对LLMs生成澄清问题的影响。
链接: https://arxiv.org/abs/2507.10445
作者: Chris Madge,Matthew Purver,Massimo Poesio
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In this work we examine LLMs’ ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus – one for reference and ambiguity in reference, and one for SDRT including clarifications – into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs’ ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.
zh
[NLP-10] From Sequence to Structure: Uncovering Substructure Reasoning in Transformers
【速读】: 该论文试图解决的问题是:如何让仅包含解码器的Transformer架构理解嵌入在文本中的图结构。解决方案的关键在于提出了一种名为Induced Substructure Filtration (ISF) 的机制,通过该机制能够捕捉多层Transformer中子结构识别的过程,并验证了LLMs中ISF过程的一致内部动态,从而揭示了Transformer处理图数据时进行子结构提取的新视角。
链接: https://arxiv.org/abs/2507.10435
作者: Xinnan Dai,Kai Yang,Jay Revolinsky,Kai Guo,Aoran Wang,Bohang Zhang,Jiliang Tang
机构: Michigan State University (密歇根州立大学); Peking University (北京大学); University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.
zh
[NLP-11] Multiple Choice Learning of Low Rank Adapters for Language Modeling
【速读】: 该论文试图解决语言模型在给定上下文时生成多样且合理句子延续的问题,这一问题本质上是一个病态问题(ill-posed problem),因为给定的上下文可能有多个同样合理的未来可能性。解决方案的关键在于提出LoRA-MCL训练方案,该方案结合了Multiple Choice Learning (MCL) 和 Winner-Takes-All (WTA) 损失函数,通过Low-Rank Adaptation (LoRA) 有效处理歧义,并在实际的视觉和音频字幕任务中验证了方法在生成输出的多样性和相关性上的优越表现。
链接: https://arxiv.org/abs/2507.10419
作者: Victor Letzelter,Hugo Malard,Mathieu Fontaine,Gaël Richard,Slim Essid,Andrei Bursuc,Patrick Pérez
机构: Télécom Paris, Institut Polytechnique de Paris (Télécom Paris, 巴黎综合理工学院); Valeo.ai (瓦莱奥AI); Kyutai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All (WTA) loss to efficiently handle ambiguity through Low-Rank Adaptation (LoRA). We provide a theoretical interpretation of applying Multiple Choice Learning to Language Modeling, assuming the data is generated from a mixture of distributions. To illustrate the proposed approach, we use data sampled from mixtures of Markov chains. We then demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
zh
[NLP-12] xt-to-Remote-Sensing-Image Retrieval beyond RGB Sources
【速读】: 该论文试图解决从大规模卫星档案中检索相关影像的问题,特别是在灾难响应和长期气候监测等应用中,现有文本到图像检索系统主要局限于RGB数据,未能充分利用其他传感器提供的独特物理信息,如合成孔径雷达(SAR)的全天候结构敏感性或光学多光谱数据的光谱特征。解决方案的关键在于引入CrisisLandMark语料库以及提出CLOSP(Contrastive Language Optical SAR Pretraining)框架,通过文本作为桥梁将未配对的光学与SAR图像对齐到统一的嵌入空间,从而显著提升检索性能,并通过统一训练策略将光学领域的丰富语义知识迁移至SAR图像解释中。
链接: https://arxiv.org/abs/2507.10403
作者: Daniele Rege Cambrin,Lorenzo Vaiani,Giuseppe Gallipoli,Luca Cagliero,Paolo Garza
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:
Abstract:Retrieving relevant imagery from vast satellite archives is crucial for applications like disaster response and long-term climate monitoring. However, most text-to-image retrieval systems are limited to RGB data, failing to exploit the unique physical information captured by other sensors, such as the all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the spectral signatures in optical multispectral data. To bridge this gap, we introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1 SAR and Sentinel-2 multispectral images paired with structured textual annotations for land cover, land use, and crisis events harmonized from authoritative land cover systems (CORINE and Dynamic World) and crisis-specific sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining), a novel framework that uses text as a bridge to align unpaired optical and SAR images into a unified embedding space. Our experiments show that CLOSP achieves a new state-of-the-art, improving retrieval nDGC by 54% over existing models. Additionally, we find that the unified training strategy overcomes the inherent difficulty of interpreting SAR imagery by transferring rich semantic knowledge from the optical domain with indirect interaction. Furthermore, GeoCLOSP, which integrates geographic coordinates into our framework, creates a powerful trade-off between generality and specificity: while the CLOSP excels at general semantic tasks, the GeoCLOSP becomes a specialized expert for retrieving location-dependent crisis events and rare geographic features. This work highlights that the integration of diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives.
zh
[NLP-13] Devanagari Handwritten Character Recognition using Convolutional Neural Network
【速读】: 该论文旨在解决手写Devanagari字符识别的问题,特别是在缺乏有效数字化工具的情况下,如何从Devanagari脚本图像中自动提取手写印度语字符。其解决方案的关键在于采用一种基于两个深度卷积神经网络层的深度学习方法,通过优化模型结构以提高识别率,并针对Devanagari手写文本识别(DHTR)进行配置。该方法利用了包含36个类别的Devanagari手写字符数据集(DHCD),每个类别包含1700张用于训练和测试的图像,最终在测试阶段达到了96.36%的准确率。
链接: https://arxiv.org/abs/2507.10398
作者: Diksha Mehta,Prateek Mehta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 6 figures
Abstract:Handwritten character recognition is getting popular among researchers because of its possible applications in facilitating technological search engines, social media, recommender systems, etc. The Devanagari script is one of the oldest language scripts in India that does not have proper digitization tools. With the advancement of computing and technology, the task of this research is to extract handwritten Hindi characters from an image of Devanagari script with an automated approach to save time and obsolete data. In this paper, we present a technique to recognize handwritten Devanagari characters using two deep convolutional neural network layers. This work employs a methodology that is useful to enhance the recognition rate and configures a convolutional neural network for effective Devanagari handwritten text recognition (DHTR). This approach uses the Devanagari handwritten character dataset (DHCD), an open dataset with 36 classes of Devanagari characters. Each of these classes has 1700 images for training and testing purposes. This approach obtains promising results in terms of accuracy by achieving 96.36% accuracy in testing and 99.55% in training time.
zh
[NLP-14] Meanings are like Onions: a Layered Approach to Metaphor Processing
【速读】: 该论文试图解决如何在计算系统中实现对隐喻意义的深层次、上下文敏感的理解问题。传统方法往往仅关注表面关联,而未能捕捉隐喻背后的复杂认知过程。其解决方案的关键在于提出一个分层的隐喻处理模型,该模型将意义视为多层结构,包括(1)内容分析、(2)概念融合以及(3)语用意图,通过统一这些层次形成一个形式化框架,从而支持超越表层关联的隐喻意义表示与推理。
链接: https://arxiv.org/abs/2507.10354
作者: Silvia Cappa,Anna Sofia Lippolis,Stefano Zoia
机构: Institute for Cognitive Sciences and Technologies (ISTC), CNR, Rome, Italy; University of Bologna, Italy; University of Turin, Italy
类目: Computation and Language (cs.CL)
备注:
Abstract:Metaphorical meaning is not a flat mapping between concepts, but a complex cognitive phenomenon that integrates multiple levels of interpretation. In this paper, we propose a stratified model of metaphor processing that treats meaning as an onion: a multi-layered structure comprising (1) content analysis, (2) conceptual blending, and (3) pragmatic intentionality. This three-dimensional framework allows for a richer and more cognitively grounded approach to metaphor interpretation in computational systems. At the first level, metaphors are annotated through basic conceptual elements. At the second level, we model conceptual combinations, linking components to emergent meanings. Finally, at the third level, we introduce a pragmatic vocabulary to capture speaker intent, communicative function, and contextual effects, aligning metaphor understanding with pragmatic theories. By unifying these layers into a single formal framework, our model lays the groundwork for computational methods capable of representing metaphorical meaning beyond surface associations, toward deeper, more context-sensitive reasoning.
zh
[NLP-15] Using AI to replicate human experimental results: a motion study
【速读】: 该论文试图解决如何评估大规模语言模型(Large Language Models, LLMs)在语言研究中作为可靠分析工具的潜力,特别是其在捕捉时间表达中情感意义生成方面的能力。论文提出的解决方案的关键在于通过四组心理语言学实验,比较人类参与者与LLMs在情感意义涌现、情感极性变化、情绪语境中的动词选择以及句子与表情符号关联等方面的响应一致性,结果表明人类与AI在评分模式和分类选择上表现出高度相关性(如Spearman’s rho = .73-.96),证明LLMs能够有效辅助传统的人类实验,从而扩展研究规模而不损害解释的有效性。
链接: https://arxiv.org/abs/2507.10342
作者: Rosa Illan Castillo,Javier Valenzuela
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper explores the potential of large language models (LLMs) as reliable analytical tools in linguistic research, focusing on the emergence of affective meanings in temporal expressions involving manner-of-motion verbs. While LLMs like GPT-4 have shown promise across a range of tasks, their ability to replicate nuanced human judgements remains under scrutiny. We conducted four psycholinguistic studies (on emergent meanings, valence shifts, verb choice in emotional contexts, and sentence-emoji associations) first with human participants and then replicated the same tasks using an LLM. Results across all studies show a striking convergence between human and AI responses, with statistical analyses (e.g., Spearman’s rho = .73-.96) indicating strong correlations in both rating patterns and categorical choices. While minor divergences were observed in some cases, these did not alter the overall interpretative outcomes. These findings offer compelling evidence that LLMs can augment traditional human-based experimentation, enabling broader-scale studies without compromising interpretative validity. This convergence not only strengthens the empirical foundation of prior human-based findings but also opens possibilities for hypothesis generation and data expansion through AI. Ultimately, our study supports the use of LLMs as credible and informative collaborators in linguistic inquiry.
zh
[NLP-16] Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach ACL
【速读】: 该论文试图解决自然语言处理(Natural Language Processing, NLP)模型在面对对抗性攻击时的脆弱性问题,特别是针对词替换攻击的鲁棒性不足。其解决方案的关键在于引入一种基于增长界矩阵(Growth Bound Matrices, GBM)的正则化技术,通过减少输入扰动对模型输出的影响来提升模型的鲁棒性。该方法在LSTM、S4和CNN三种架构上进行了验证,并在多个基准数据集上取得了显著的对抗鲁棒性提升。
链接: https://arxiv.org/abs/2507.10330
作者: Mohammed Bouri,Adnane Saoud
机构: Mohammed VI Polytechnic University (摩洛哥穆罕默德六世理工大学); CID Development (CID 开发); Morocco (摩洛哥)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ACL Findings 2025
Abstract:Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) providing the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Codes are available at this https URL
zh
[NLP-17] Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation ECAI2025
【速读】: 该论文试图解决在使用预训练大语言模型(Large Language Models, LLMs)进行复杂任务求解时,如何有效优化提示(prompt)的问题。现有方法多针对需要少量提示模板的任务,并且主要在大型高能力LLMs上进行评估,而对需要详细信息的复杂任务以及较小模型的敏感性关注不足。该论文提出了一种基于进化搜索的自动化离散提示优化方法,其关键在于通过两个阶段实现提示优化:第一阶段利用语法引导的遗传编程,通过组合句法、字典和LLM-based的提示编辑函数生成提示生成程序;第二阶段则通过局部搜索进一步微调最优程序的性能。
链接: https://arxiv.org/abs/2507.10326
作者: Muzhaffar Hazman,Minh-Khoi Pham,Shweta Soundararajan,Goncalo Mordido,Leonardo Custode,David Lynch,Giorgio Cruciata,Yucheng Shi,Hongmeng Song,Wang Chao,Pan Yue,Aleksandar Milenovic,Alexandros Agapitos
机构: Huawei Technologies, Ireland Research Centre(华为技术爱尔兰研究中心); University of Galway(爱尔兰高威大学); Dublin City University(都柏林城市大学); Technological University Dublin(都柏林理工学院)
类目: Computation and Language (cs.CL)
备注: Accepted for Publication at ECAI 2025
Abstract:Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.
zh
[NLP-18] FaceLLM : A Multimodal Large Language Model for Face Understanding ICCV2025
【速读】: 该论文试图解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理特定领域视觉线索(如面部图像)时能力受限的问题,尤其是缺乏大规模标注的面部图像-文本数据集导致模型难以理解面部结构、表情、情绪和人口统计特征等细节。解决方案的关键在于提出一种弱监督的训练数据生成管道,利用具有属性感知提示的ChatGPT生成高质量的图像问答对,从而构建名为FairFaceGPT的多样化面部属性数据集,并基于此训练出专门用于面部图像理解的多模态大语言模型FaceLLM。
链接: https://arxiv.org/abs/2507.10300
作者: Hatef Otroshi Shahreza,Sébastien Marcel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted in ICCV 2025 workshops
Abstract:Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes including expression, pose, skin texture, and forensic information. Our experiments demonstrate that FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. This work highlights the potential of synthetic supervision via language models for building domain-specialized MLLMs, and sets a precedent for trustworthy, human-centric multimodal AI systems. FairFaceGPT dataset and pretrained FaceLLM models are publicly available in the project page.
zh
[NLP-19] Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在阿拉伯语自然语言处理(NLP)应用中对区域方言和文化细微差别的理解不足问题。解决方案的关键是构建了一个名为\texttt{Absher}的全面基准测试集,该基准涵盖超过18,000道多选题,覆盖六个类别:意义、真/假、填空、语境使用、文化解释和地点识别。这些题目来源于沙特阿拉伯不同地区收集的方言词汇、短语和谚语数据集,旨在评估LLMs在沙特主要方言中的表现,并揭示其在文化推理或语境理解任务中的性能差距。
链接: https://arxiv.org/abs/2507.10216
作者: Renad Al-Monef,Hassan Alhuzali,Nora Alturayeif,Ashwag Alasmari
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings like Saudi Arabia. This paper introduces \textttAbsher, a comprehensive benchmark specifically designed to assess LLMs performance across major Saudi dialects. \textttAbsher comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including multilingual and Arabic-specific models. We also provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Our findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLMs performance in real-world Arabic applications.
zh
[NLP-20] Abusive text transformation using LLM s
【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)将包含仇恨言论和脏话的滥用文本(如推文和评论)转换为非滥用文本,同时保留原文的意图。其解决方案的关键在于评估不同LLMs在识别滥用文本并进行转换的能力,确保转换后的文本在情感和语义上与原文本保持一致,从而实现内容净化而不损失信息核心。
链接: https://arxiv.org/abs/2507.10177
作者: Rohitash Chandra,Jiyong Choi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although Large Language Models (LLMs) have demonstrated significant advancements in natural language processing tasks, their effectiveness in the classification and transformation of abusive text into non-abusive versions remains an area for exploration. In this study, we aim to use LLMs to transform abusive text (tweets and reviews) featuring hate speech and swear words into non-abusive text, while retaining the intent of the text. We evaluate the performance of two state-of-the-art LLMs, such as Gemini, GPT-4o, DeekSeek and Groq, on their ability to identify abusive text. We them to transform and obtain a text that is clean from abusive and inappropriate content but maintains a similar level of sentiment and semantics, i.e. the transformed text needs to maintain its message. Afterwards, we evaluate the raw and transformed datasets with sentiment analysis and semantic analysis. Our results show Groq provides vastly different results when compared with other LLMs. We have identified similarities between GPT-4o and DeepSeek-V3.
zh
[NLP-21] ask-Based Flexible Feature Distillation for LLM s
【速读】: 该论文试图解决传统特征知识蒸馏(Feature Distillation)方法中教师模型与学生模型隐藏层维度不一致导致的灵活性受限问题,以及引入线性投影器所带来的额外参数和下游任务性能下降的问题。其解决方案的关键在于提出一种基于任务的特征蒸馏方法,通过识别教师模型中对特定下游任务最相关的隐藏单元,并直接将这些激活信息蒸馏到学生模型中,从而实现不同隐藏尺寸模型之间的知识迁移,且无需引入任何新参数。
链接: https://arxiv.org/abs/2507.10155
作者: Khouloud Saadi,Di Wang
机构: KAUST(沙特国王阿卜杜拉科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Knowledge Distillation (KD) in general and feature distillation in particular are promising techniques for reducing the high computational demand of large language models (LLMs). However, traditional feature KD methods typically assume that the teacher and the student share the same hidden size, limiting the flexibility of the student’s architecture. A common solution to this problem involves training a linear projector to align their feature spaces, but this introduces additional parameters that must be learned from scratch and often degrades performance on downstream tasks, especially in generative settings. To address this issue, in this work, we propose a novel task-based feature distillation method that enables knowledge transfer between teacher and student models with different hidden layer dimensions, without introducing any new parameters. Leveraging the insight that only a subset of LLM components contribute significantly to a specific downstream task, our approach identifies the most task-relevant hidden units in the teacher and directly distills their activations to the student. Our method is flexible and easily integrates with other distillation frameworks. Empirical results show consistent improvements over prior approaches across diverse tasks, including classification, instruction-following, and summarization, achieving up to a 3% performance gain over the linear projection baseline.
zh
[NLP-22] Fusing Large Language Models with Temporal Transformers for Time Series Forecasting
【速读】: 该论文试图解决将生成式 AI (Generative AI) 与传统 Transformer 模型结合以提升时间序列预测 (Time Series Forecasting, TSF) 性能的问题。现有基于大语言模型 (Large Language Models, LLMs) 的方法在处理连续数值时间序列数据时表现不佳,而单纯的 Transformer 模型又难以捕捉高层次语义模式。解决方案的关键在于设计一种新型的 Transformer 架构,通过融合 LLM 提取的高层语义表示与 Transformer 编码的时间信息,形成混合表示,从而同时捕捉历史时间动态和语义变化模式,提升预测精度。
链接: https://arxiv.org/abs/2507.10098
作者: Chen Su,Yuanhe Tian,Qinyu Liu,Jun Zhang,Yan Song
机构: University of Science and Technology of China (中国科学技术大学); University of Washington (华盛顿大学); Beijing Northern Computility InterConnection Co., Ltd. (北京北算互联科技有限公司); ENN Group Co., Ltd. (中节能集团有限公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recently, large language models (LLMs) have demonstrated powerful capabilities in performing various tasks and thus are applied by recent studies to time series forecasting (TSF) tasks, which predict future values with the given historical time series. Existing LLM-based approaches transfer knowledge learned from text data to time series prediction using prompting or fine-tuning strategies. However, LLMs are proficient at reasoning over discrete tokens and semantic patterns but are not initially designed to model continuous numerical time series data. The gaps between text and time series data lead LLMs to achieve inferior performance to a vanilla Transformer model that is directly trained on TSF data. However, the vanilla Transformers often struggle to learn high-level semantic patterns. In this paper, we design a novel Transformer-based architecture that complementarily leverages LLMs and vanilla Transformers, so as to integrate the high-level semantic representations learned by LLMs into the temporal information encoded by time series Transformers, where a hybrid representation is obtained by fusing the representations from the LLM and the Transformer. The resulting fused representation contains both historical temporal dynamics and semantic variation patterns, allowing our model to predict more accurate future values. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approach.
zh
[NLP-23] Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning ACL2025
【速读】: 该论文试图解决在复杂推理任务中,传统参数高效微调(PEFT)方法如Representation Fine-tuning (ReFT) 因仅修改固定位置的表示而效果不佳的问题。其解决方案的关键在于识别并优化那些在推理过程中起关键作用的表示,通过信息流分析确定这些关键表示,并在监督学习框架下对它们进行低秩线性子空间中的动态优化,同时冻结基础模型,从而显著提升推理性能。
链接: https://arxiv.org/abs/2507.10085
作者: Chenxi Huang,Shaotian Yan,Liang Xie,Binbin Lin,Sinan Fan,Yue Xin,Deng Cai,Chen Shen,Jieping Ye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025
Abstract:Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
zh
[NLP-24] Cultural Bias in Large Language Models : Evaluating AI Agents through Moral Questionnaires
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在文化道德框架表示上的不足问题,即它们是否真正代表了人类价值观,还是仅仅对不同价值观进行了平均处理。研究通过在19个文化背景下应用道德基础问卷(Moral Foundations Questionnaire),揭示了AI生成内容与人类道德直觉之间的显著差异。其解决方案的关键在于指出,尽管模型规模增大,但并未显著提升文化表征的准确性,表明单纯依赖模型规模或提示工程无法有效捕捉 culturally-specific 的道德直觉,强调需要更基于数据的对齐目标和评估指标以确保AI系统能够反映多元的人类价值观。
链接: https://arxiv.org/abs/2507.10073
作者: Simon Münker
机构: Tier University (蒂尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15pages, 1 figure, 2 tables
Abstract:Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs’ origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn’t consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.
zh
[NLP-25] GeLaCo: An Evolutionary Approach to Layer Compression
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLM)在部署和使用中面临的计算需求大导致的障碍问题。其解决方案的关键在于提出GeLaCo,一种基于进化算法的层坍塌压缩方法,通过种群搜索和模块级相似性适应度函数实现对压缩解空间的高效探索,同时支持单目标和多目标进化压缩搜索,从而建立压缩与质量之间的帕累托前沿。
链接: https://arxiv.org/abs/2507.10059
作者: David Ponce,Thierry Etchegoyhen,Javier Del Ser
机构: Fundación Vicomtech, Basque Research and Technology Alliance (BRTA); TECNALIA. Basque Research and Technology Alliance (BRTA); University of the Basque Country UPV/EHU
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLM) have achieved remarkable performance across a large number of tasks, but face critical deployment and usage barriers due to substantial computational requirements. Model compression methods, which aim to reduce model size while preserving its capacity, are an important means to mitigate these issues. Promising approaches along these lines, such as structured pruning, typically require costly empirical search for optimal variants and may run the risk of ignoring better solutions. In this work we introduce GeLaCo, an evolutionary approach to LLM compression via layer collapse. Our approach supports an efficient exploration of the compression solution space via population-based search and a module-wise similarity fitness function capturing attention, feed-forward, and hidden state representations. GeLaCo also supports both single and multi-objective evolutionary compression search, establishing the first Pareto frontier along compression and quality axes. We evaluate GeLaCo solutions via both perplexity-based and generative evaluations over foundational and instruction-tuned models, outperforming state-of-the-art alternatives.
zh
[NLP-26] PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization
【速读】: 该论文试图解决科学论文检索中基于文档的检索问题,即根据长格式查询文档而非短查询字符串来识别相关论文。现有方法通常依赖摘要进行嵌入并计算相似性,但摘要仅提供稀疏且高层次的总结。解决方案的关键在于提出PRISM,这是一种新的文档到文档检索方法,通过为查询和候选论文引入多个细粒度表示,将查询文档分解为多个特定方面视图并分别嵌入,再与分段相似性的候选文档进行匹配,以考虑其多维特征。
链接: https://arxiv.org/abs/2507.10057
作者: Sangwoo Park,Jinheon Baek,Soyeong Jeong,Sung Ju Hwang
机构: KAIST(韩国科学技术院); DeepAuto.ai(DeepAuto.ai)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity across them, although abstracts provide only sparse and high-level summaries. To address this, we propose PRISM, a novel document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. In particular, each query paper is decomposed into multiple aspect-specific views and individually embedded, which are then matched against candidate papers similarity segmented to consider their multifaceted dimensions. Moreover, we present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available. Then, experimental results show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
zh
[NLP-27] Automating SPARQL Query Translations between DBpedia and Wikidata
【速读】: 该论文试图解决知识图谱(Knowledge Graph, KG)互操作性中的一个问题,即是否能够利用最先进的大型语言模型(Large Language Models, LLMs)自动将SPARQL查询在不同的KG模式之间进行翻译。其解决方案的关键在于通过构建两个基准测试集,分别对DBpedia与Wikidata、以及DBLP与OpenAlex之间的SPARQL查询进行对齐,并基于不同大小和架构的开放LLMs(如Llama-3-8B、DeepSeek-R1-Distill-Llama-70B和Mistral-Large-Instruct-2407)进行零样本、少样本及两种思维链变体的测试,以评估LLMs在SPARQL到SPARQL翻译任务中的表现。
链接: https://arxiv.org/abs/2507.10045
作者: Malte Christian Bartels,Debayan Banerjee,Ricardo Usbeck
机构: Leuphana University of Lüneburg (吕讷堡大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 2 figues. Paper accepted at SEMANTiCS 2025 conference happening on September 2025
Abstract:This paper investigates whether state-of-the-art Large Language Models (LLMs) can automatically translate SPARQL between popular Knowledge Graph (KG) schemas. We focus on translations between the DBpedia and Wikidata KG, and later on DBLP and OpenAlex KG. This study addresses a notable gap in KG interoperability research by rigorously evaluating LLM performance on SPARQL-to-SPARQL translation. Two benchmarks are assembled, where the first align 100 DBpedia-Wikidata queries from QALD-9-Plus; the second contains 100 DBLP queries aligned to OpenAlex, testing generalizability beyond encyclopaedic KGs. Three open LLMs: Llama-3-8B, DeepSeek-R1-Distill-Llama-70B, and Mistral-Large-Instruct-2407 are selected based on their sizes and architectures and tested with zero-shot, few-shot, and two chain-of-thought variants. Outputs were compared with gold answers, and resulting errors were categorized. We find that the performance varies markedly across models and prompting strategies, and that translations for Wikidata to DBpedia work far better than translations for DBpedia to Wikidata.
zh
[NLP-28] Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect
【速读】: 该论文试图解决视觉-语言模型(VLMs)是否能够以类似人类认知的方式整合跨模态信息的问题。其解决方案的关键在于对两种主流的CLIP变体——ResNet和Vision Transformer(ViT)进行系统性再评估,并采用两种互补的方法:一种是基于提示的评估方法,利用概率衡量模型偏好;另一种是引入Grad-CAM技术,以新颖方式解释模型在形状-词语匹配任务中的视觉注意力机制。通过这些方法,研究揭示了当前模型在表现“bouba-kiki效应”上的不足,从而为理解VLMs在跨模态概念理解方面的局限性提供了实证依据。
链接: https://arxiv.org/abs/2507.10013
作者: Tom Kouwenhoven,Kiana Shahrasbi,Tessa Verhoef
机构: Leiden Institute of Advanced Computer Science (莱顿先进计算机科学研究所); Leiden University (莱顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like “bouba” with round shapes and “kiki” with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as model preference, and we use Grad-CAM as a novel way to interpret visual attention in shape-word matching tasks. Our findings show that these models do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both models lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models’ responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
zh
[NLP-29] Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media
【速读】: 该论文试图解决现有研究在预测个体后续自杀风险方面的不足,特别是未能有效捕捉心理状态的动态变化以及忽略保护性因素的问题。其解决方案的关键在于提出一种结合风险因素与保护性因素动态影响的新型框架,通过联合学习两者对用户自杀风险转变的影响,从而更准确地预测自杀风险。此外,研究构建了一个包含12年Reddit帖子及全面标注的保护性因素感知数据集,并引入了动态因素影响学习方法,以捕捉风险与保护性因素随时间变化的影响,提升模型的预测性能和可解释性。
链接: https://arxiv.org/abs/2507.10008
作者: Jun Li,Xiangmeng Wang,Haoyang Li,Yifei Yan,Hong Va Leong,Ling Feng,Nancy Xiaonan Yu,Qing Li
机构: IEEE Publication Technology Department; The Hong Kong Polytechnic University (香港理工大学); Tsinghua University (清华大学); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Suicide is a critical global health issue that requires urgent attention. Even though prior work has revealed valuable insights into detecting current suicide risk on social media, little attention has been paid to developing models that can predict subsequent suicide risk over time, limiting their ability to capture rapid fluctuations in individuals’ mental state transitions. In addition, existing work ignores protective factors that play a crucial role in suicide risk prediction, focusing predominantly on risk factors alone. Protective factors such as social support and coping strategies can mitigate suicide risk by moderating the impact of risk factors. Therefore, this study proposes a novel framework for predicting subsequent suicide risk by jointly learning the dynamic influence of both risk factors and protective factors on users’ suicide risk transitions. We propose a novel Protective Factor-Aware Dataset, which is built from 12 years of Reddit posts along with comprehensive annotations of suicide risk and both risk and protective factors. We also introduce a Dynamic Factors Influence Learning approach that captures the varying impact of risk and protective factors on suicide risk transitions, recognizing that suicide risk fluctuates over time according to established psychological theories. Our thorough experiments demonstrate that the proposed model significantly outperforms state-of-the-art models and large language models across three datasets. In addition, the proposed Dynamic Factors Influence Learning provides interpretable weights, helping clinicians better understand suicidal patterns and enabling more targeted intervention strategies.
zh
[NLP-30] On The Role of Intentionality in Knowledge Representation: Analyzing Scene Context for Cognitive Agents with a Tiny Language Model
【速读】: 该论文试图解决在科学与技术领域中对“意图”(intent)实际意义关注不足的问题,特别是在生成式 AI (Generative AI) 领域中如何识别文本中的隐含意图。其解决方案的关键在于利用过程一致性(process coherence)作为指导,通过检测多尺度异常来评估数据中的潜在“意向性”(intentionality),并借助时空一致性(spacetime coherence)将内容分为“有意”部分和“环境背景”。这种方法能够在计算成本极低的情况下实现对隐含意图的初步解释,而无需依赖大规模训练或推理能力。
链接: https://arxiv.org/abs/2507.10000
作者: Mark Burgess
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Since Searle’s work deconstructing intent and intentionality in the realm of philosophy, the practical meaning of intent has received little attention in science and technology. Intentionality and context are both central to the scope of Promise Theory’s model of Semantic Spacetime, used as an effective Tiny Language Model. One can identify themes and concepts from a text, on a low level (without knowledge of the specific language) by using process coherence as a guide. Any agent process can assess superficially a degree of latent intentionality' in data by looking for anomalous multi-scale anomalies and assessing the work done to form them. Scale separation can be used to sort parts into
intended’ content and `ambient context’, using the spacetime coherence as a measure. This offers an elementary but pragmatic interpretation of latent intentionality for very low computational cost, and without reference to extensive training or reasoning capabilities. The process is well within the reach of basic organisms as it does not require large scale artificial probabilistic batch processing. The level of concept formation depends, however, on the memory capacity of the agent.
zh
[NLP-31] xtOmics-Guided Diffusion for Hit-like Molecular Generation
【速读】: 该论文试图解决靶点特异性药物发现中缺乏异构数据和统一框架的问题,以实现具有治疗潜力的类药分子生成。解决方案的关键在于提出TextOmics基准,建立组学表达与分子文本描述之间的一一对应关系,并构建异构数据集以促进分子表示对齐;在此基础上,进一步提出ToDi生成框架,通过联合条件作用于组学表达和分子文本描述,生成生物相关且化学有效的类药分子,其核心是利用两个编码器(OmicsEn和TextEn)捕捉多层级生物和语义关联,并采用条件扩散模型(DiffGen)实现可控生成。
链接: https://arxiv.org/abs/2507.09982
作者: Hang Yuan,Chen Li,Wenjun Ma,Yuncheng Jiang
机构: South China Normal University(华南师范大学); The University of Osaka(大阪大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Hit-like molecular generation with therapeutic potential is essential for target-specific drug discovery. However, the field lacks heterogeneous data and unified frameworks for integrating diverse molecular representations. To bridge this gap, we introduce TextOmics, a pioneering benchmark that establishes one-to-one correspondences between omics expressions and molecular textual descriptions. TextOmics provides a heterogeneous dataset that facilitates molecular generation through representations alignment. Built upon this foundation, we propose ToDi, a generative framework that jointly conditions on omics expressions and molecular textual descriptions to produce biologically relevant, chemically valid, hit-like molecules. ToDi leverages two encoders (OmicsEn and TextEn) to capture multi-level biological and semantic associations, and develops conditional diffusion (DiffGen) for controllable generation. Extensive experiments confirm the effectiveness of TextOmics and demonstrate ToDi outperforms existing state-of-the-art approaches, while also showcasing remarkable potential in zero-shot therapeutic molecular generation. Sources are available at: this https URL.
zh
[NLP-32] ny Reward Models ICML
【速读】: 该论文试图解决在基于人类反馈的强化学习(RLHF)中,大型解码器语言模型作为奖励模型时存在的推理成本过高的问题。其解决方案的关键在于提出TinyRM,这是一个参数量仅为400百万的小型双向掩码语言模型(MLM),通过结合FLAN风格提示、方向低秩适应(DoRA)和层冻结技术,在RewardBench任务上实现了与参数量大得多的模型相当的性能,从而展示了轻量级双向架构在偏好建模中的高效性和可扩展性。
链接: https://arxiv.org/abs/2507.09973
作者: Sarah Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2025 ICML Efficient Systems for Foundation Models Workshop
Abstract:Large decoder-based language models have become the dominant architecture for reward modeling in reinforcement learning from human feedback (RLHF). However, as reward models are increasingly deployed in test-time strategies, their inference costs become a growing concern. We present TinyRM, a family of small, bidirectional masked language models (MLMs) with as few as 400 million parameters, that rival the capabilities of models over 175 times larger on reasoning and safety preference modeling tasks. TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench, despite using significantly fewer resources. Our experiments suggest that small models benefit from domain-specific tuning strategies, particularly in reasoning, where lightweight finetuning methods are especially effective. While challenges remain in building generalist models and conversational preference modeling, our preliminary results highlight the promise of lightweight bidirectional architectures as efficient, scalable alternatives for preference modeling.
zh
[NLP-33] Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
【速读】: 该论文试图解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在文本分块策略中难以捕捉足够语义信息的问题,因为这些方法未考虑文本的底层结构。解决方案的关键在于集成层次化文本分割与聚类,以生成更具语义连贯性的文本块。在推理过程中,该框架通过利用段级和簇级的向量表示进行信息检索,从而提高检索到更精确和上下文相关信息的可能性。
链接: https://arxiv.org/abs/2507.09935
作者: Hai Toan Nguyen,Tien Dat Nguyen,Viet Ha Nguyen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
zh
[NLP-34] MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora
【速读】: 该论文试图解决在生成式检索中持续更新基于模型的索引时面临的挑战,即在资源受限条件下进行全量重新训练计算成本过高且不切实际。其解决方案的关键在于提出MixLoRA-DSI框架,该框架结合了可扩展的低秩适应(Low-Rank Adaptation)专家混合与逐层分布外(out-of-distribution, OOD)驱动的扩展策略,通过仅在检测到大量OOD文档时选择性引入新专家,实现了次线性参数增长,从而在保持性能的同时显著降低参数开销和训练成本。
链接: https://arxiv.org/abs/2507.09924
作者: Tuan-Luc Huynh,Thuy-Trang Vu,Weiqing Wang,Trung Le,Dragan Gašević,Yuan-Fang Li,Thanh-Toan Do
机构: Monash University(莫纳什大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.
zh
[NLP-35] ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models ACM-MM2025
【速读】: 该论文试图解决当前视频理解方法主要依赖文本信息进行推理,而忽视了视频中实际存在的视觉模态的问题。其解决方案的关键在于提出一种新的视频推理范式——Video-Text Interleaved CoT (ViTCoT),该范式通过将视频与文本信息交织在一起,实现更符合人类认知过程的视频推理,从而提升视频理解性能。
链接: https://arxiv.org/abs/2507.09876
作者: Yongheng Zhang,Xu Liu,Ruihan Tao,Qiguang Chen,Hao Fei,Wanxiang Che,Libo Qin
机构: School of Computer Science and Engineering, Central South University(计算机科学与工程学院,中南大学); Research Center for SCIR, Harbin Institute of Technology(SCIR研究中心,哈尔滨工业大学); NExT Research Center, National University of Singapore(NExT研究中心,新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACM MM 2025
Abstract:Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To the end, first, we construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.
zh
[NLP-36] Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
【速读】: 该论文试图解决大型语言模型在未见过的任务中通过上下文学习实现任务级泛化的机制问题。其解决方案的关键在于揭示模型内部用于执行类似“off-by-one addition”这类反事实任务的功能诱导机制,该机制通过多个注意力头并行协作,共同实现对+1函数的归纳,并且这种机制在更广泛的任务中具有可重用性和组合性。
链接: https://arxiv.org/abs/2507.09875
作者: Qinyuan Ye,Robin Jia,Xiang Ren
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL
Abstract:Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models’ internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model’s generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
zh
[NLP-37] nyTroupe: An LLM -powered Multiagent Persona Simulation Toolkit
【速读】: 该论文试图解决当前多智能体系统(MAS)在模拟真实人类行为方面的不足,特别是缺乏细粒度角色定义、群体采样功能、实验支持和集成验证等关键能力。其解决方案的关键在于提出TinyTroupe,一个基于大型语言模型(LLM)的仿真工具包,它支持详细的个性特征定义(如国籍、年龄、职业、人格、信仰和行为)并通过多种LLM驱动机制实现程序化控制,从而能够简洁地表述实际感兴趣的个体或群体行为问题,并提供有效的解决手段。
链接: https://arxiv.org/abs/2507.09788
作者: Paulo Salem,Robert Sim,Christopher Olsen,Prerit Saxena,Rafael Barcelos,Yi Ding
机构: Microsoft Corporation(微软公司)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 9 pages. Preprint to be submitted to peer-review
Abstract:Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation – with its distinctive challenges and opportunities – remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe’s components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at this https URL.
zh
[NLP-38] Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News
【速读】: 该论文试图解决点击诱饵(clickbait)定义不明确及检测数据集缺乏标准化的问题,其核心在于明确区分点击诱饵与其他类似现象(如夸张报道或承诺与内容不符的标题)。解决方案的关键是提出一个新的点击诱饵定义,即通过故意省略部分信息以激发读者好奇心、吸引注意力并促使点击的技巧。此外,研究者通过细化概念边界和标注标准,提出了新的点击诱饵检测数据集构建方法,并发布了首个面向西班牙语的开源数据集TA1C,以提高检测任务的客观性和可重复性。
链接: https://arxiv.org/abs/2507.09777
作者: Gabriel Mordecki,Guillermo Moncecchi,Javier Couto
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:We revise the definition of clickbait, which lacks current consensus, and argue that the creation of a curiosity gap is the key concept that distinguishes clickbait from other related phenomena such as sensationalism and headlines that do not deliver what they promise or diverge from the article. Therefore, we propose a new definition: clickbait is a technique for generating headlines and teasers that deliberately omit part of the information with the goal of raising the readers’ curiosity, capturing their attention and enticing them to click. We introduce a new approach to clickbait detection datasets creation, by refining the concept limits and annotations criteria, minimizing the subjectivity in the decision as much as possible. Following it, we created and release TA1C (for Te Ahorré Un Click, Spanish for Saved You A Click), the first open source dataset for clickbait detection in Spanish. It consists of 3,500 tweets coming from 18 well known media sources, manually annotated and reaching a 0.825 Fleiss’ K inter annotator agreement. We implement strong baselines that achieve 0.84 in F1-score.
zh
[NLP-39] EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions RAID2025
【速读】: 该论文试图解决从黑客论坛中提取可操作的网络安全威胁情报这一挑战,因其内容具有非结构化和噪声大的特点。解决方案的关键在于提出一种无监督框架,利用基于Transformer的嵌入表示并结合对比学习进行微调,以自动检测、聚类和优先排序安全事件,从而在不依赖预定义关键词的情况下识别如零日漏洞披露或恶意软件发布等事件,并通过量化指标对事件进行每日排序,提升威胁响应的效率与准确性。
链接: https://arxiv.org/abs/2507.09762
作者: Yasir Ech-Chammakhy,Anas Motii,Anass Rabii,Jaafar Chbili
机构: Mohammed VI Polytechnic University (穆罕默德六世理工大学); Deloitte Morocco Cyber Center (德勤摩洛哥网络安全中心); Deloitte Conseil (德勤咨询)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for publication at the 28th International Symposium on Research in Attacks, Intrusions, and Defenses (RAID 2025)
Abstract:Hacker forums provide critical early warning signals for emerging cybersecurity threats, but extracting actionable intelligence from their unstructured and noisy content remains a significant challenge. This paper presents an unsupervised framework that automatically detects, clusters, and prioritizes security events discussed across hacker forum posts. Our approach leverages Transformer-based embeddings fine-tuned with contrastive learning to group related discussions into distinct security event clusters, identifying incidents like zero-day disclosures or malware releases without relying on predefined keywords. The framework incorporates a daily ranking mechanism that prioritizes identified events using quantifiable metrics reflecting timeliness, source credibility, information completeness, and relevance. Experimental evaluation on real-world hacker forum data demonstrates that our method effectively reduces noise and surfaces high-priority threats, enabling security analysts to mount proactive responses. By transforming disparate hacker forum discussions into structured, actionable intelligence, our work addresses fundamental challenges in automated threat detection and analysis.
zh
[NLP-40] Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding ACL2025
【速读】: 该论文试图解决传统课程学习方法依赖人工定义的难度度量(如文本长度)可能无法准确反映模型自身认知水平的问题。其解决方案的关键在于提出一种自适应课程学习范式,该范式利用预训练语言模型(PLM)自身预测的难度分数来优先选择微调样本,从而实现更有效的模型训练。
链接: https://arxiv.org/abs/2507.09758
作者: Qi Feng,Yihong Liu,Hinrich Schütze
机构: Center for Information and Language Processing, LMU Munich (信息与语言处理中心,慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 23 figures. To appear in ACL 2025 Student Research Workshop (SRW)
Abstract:Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics – such as text length – which may not accurately reflect the model’s own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.
zh
[NLP-41] Sound and Complete Neuro-symbolic Reasoning with LLM -Grounded Interpretations
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成输出时表现出逻辑一致性不足的问题,尤其是在形式推理中如何有效利用其广泛的知识覆盖。论文提出的解决方案的关键在于将LLM直接集成到一种次协调逻辑(paraconsistent logic)的形式语义解释函数中,从而构建一个神经符号推理的理论框架,既利用了LLM的知识,又保持了底层逻辑系统的可靠性和完备性。
链接: https://arxiv.org/abs/2507.09751
作者: Bradley P. Allen,Prateek Chhikara,Thomas Macaulay Ferguson,Filip Ilievski,Paul Groth
机构: University of Amsterdam(阿姆斯特丹大学); University of Southern California(南加州大学); Rensselaer Polytechnic Institute(伦斯勒理工学院); Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 29 pages, 9 tables, 3 figures. Accepted to the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025)
Abstract:Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but they exhibit problems with logical consistency in the output they generate. How can we harness LLMs’ broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We provide experimental evidence for the feasibility of the method by evaluating the function using datasets created from several short-form factuality benchmarks. Unlike prior work, our method offers a theoretical framework for neuro-symbolic reasoning that leverages an LLM’s knowledge while preserving the underlying logic’s soundness and completeness properties.
zh
[NLP-42] Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)内部表征如何组织与语义理解相关的信息这一问题。其解决方案的关键在于发现高阶语义信息始终位于低维子空间中,并且这些子空间在不同领域间呈现出线性可分的表示。这种几何结构使得在隐藏空间中进行简单的因果干预成为可能,例如通过单一向量方向捕获如链式思维等推理模式。
链接: https://arxiv.org/abs/2507.09709
作者: Baturay Saglam,Paul Kassianik,Blaine Nelson,Sajana Weerawardhena,Yaron Singer,Amin Karbasi
机构: Yale University (耶鲁大学); Foundation AI – Cisco Systems Inc (Foundation AI – 思科系统公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. \baturayHowever, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors \unicodex2013 even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.
zh
[NLP-43] MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLM s
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨文化场景中表现出的文化偏见和有限的跨文化理解能力问题。其解决方案的关键在于提出MCEval,一个新颖的多语言评估框架,该框架通过动态文化问题构建以及通过反事实重述(Counterfactual Rephrasing)和混杂因素重述(Confounder Rephrasing)实现因果分析,从而系统地评估不同语言情景下的文化意识和文化偏见。
链接: https://arxiv.org/abs/2507.09701
作者: Shulin Huang,Linyi Yang,Yue Zhang
机构: Zhejiang University (浙江大学); School of Engineering, Westlake University (西湖大学工程学院); Southern University of Science and Technology (南方科技大学); Institute of Advanced Technology, Westlake Institute for Advanced Study (西湖研究院先进科技研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models exhibit cultural biases and limited cross-cultural understanding capabilities, particularly when serving diverse global user populations. We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction and enables causal analysis through Counterfactual Rephrasing and Confounder Rephrasing. Our comprehensive evaluation spans 13 cultures and 13 languages, systematically assessing both cultural awareness and cultural bias across different linguistic scenarios. The framework provides 39,897 cultural awareness instances and 17,940 cultural bias instances. Experimental results reveal performance disparities across different linguistic scenarios, demonstrating that optimal cultural performance is not only linked to training data distribution, but also is related to language-culture alignment. The evaluation results also expose the fairness issue, where approaches appearing successful in the English scenario create substantial disadvantages. MCEval represents the first comprehensive multilingual cultural evaluation framework that provides deeper insights into LLMs’ cultural understanding.
zh
[NLP-44] owards Concise and Adaptive Thinking in Large Reasoning Models: A Survey
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在处理简单问题时生成冗长且不必要的推理链,导致计算资源浪费和响应时间增加的问题。解决方案的关键在于缩短冗长的推理链,并根据输入难度学习快速思维与慢速思维之间的自适应推理机制。
链接: https://arxiv.org/abs/2507.09662
作者: Jason Zhu,Hongyu Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large reasoning models (LRMs) like OpenAI o1 and DeepSeek R1 have demonstrated impressive performance on complex reasoning tasks like mathematics and programming with long Chain-of-Thought (CoT) reasoning sequences (slow-thinking), compared with traditional large language models (fast-thinking). However, these reasoning models also face a huge challenge that generating unnecessarily lengthy and redundant reasoning chains even for trivial questions. This phenomenon leads to a significant waste of inference resources, increases the response time for simple queries, and hinders the practical application of LRMs in real-world products. To this end, it is crucial to shorten lengthy reasoning chains and learn adaptive reasoning between fast and slow thinking based on input difficulty. In this survey, we provide a comprehensive overview of recent progress in concise and adaptive thinking for efficient reasoning of LRMs, including methodologies, benchmarks, and challenges for future exploration. We hope this survey can help researchers quickly understand the landscape of this field and inspire novel adaptive thinking ideas to facilitate better usage of LRMs.
zh
[NLP-45] Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?
【速读】: 该论文试图解决泰国法律问答任务中检索增强生成(Retrieval-Augmented Generation, RAG)系统性能受限的问题,尤其是在需要复杂法律推理的场景下。其关键解决方案是通过Group-Relative Policy Optimization (GRPO) 方法对大语言模型(LLM)进行对齐,以提升法律引用准确性与回答质量。该方法利用BGE-M3嵌入作为成本高效的语义相似性奖励机制,显著降低了计算开销,相比大型语言模型裁判减少了最多2.5倍的资源消耗。实验结果表明,GRPO在NitiBench基准测试中实现了相对于基础模型90%的引用F1分数提升,并在联合质量指标上比指令微调提高了31%,同时在复杂法律推理任务中表现出更强的鲁棒性。
链接: https://arxiv.org/abs/2507.09638
作者: Pawitsapak Akarajaradwong,Chompakorn Chaksangchaichot,Pirat Pothavorn,Attapol Thamrongrattanarit-Rutherford,Ekapol Chuangsuwanich,Sarana Nutanong
机构: VISAI AI(Visai人工智能); Chulalongkorn University(朱拉隆功大学); Vidyasirimedhi Institute of Science and Technology(维迪亚西里梅迪技术研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:The Retrieval-Augmented Generation (RAG) systems’ performance on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
zh
[NLP-46] An Exploration of Knowledge Editing for Arabic
【速读】: 该论文试图解决知识编辑(Knowledge Editing, KE)在形态丰富的语言如阿拉伯语中的行为尚不明确的问题。其关键解决方案是评估四种KE方法(ROME、MEMIT、ICE和LTE)在阿拉伯语的ZsRE和Counterfact基准上的表现,并探索多语言和跨语言设置下的效果。研究发现基于参数的方法在跨语言泛化上表现不佳,而经过指令微调的方法更为稳健;此外,将Learning-To-Edit (LTE)扩展至多语言设置,并通过阿拉伯语与英语的联合训练提升了编辑能力和迁移性能。
链接: https://arxiv.org/abs/2507.09629
作者: Basel Mousi,Nadir Durrani,Fahim Dalvi
机构: Qatar Computing Research Institute, HBKU, Doha, Qatar
类目: Computation and Language (cs.CL)
备注:
Abstract:While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training for LTE data to support future research.
zh
[NLP-47] SpreadPy: A Python tool for modelling spreading activation and superdiffusion in cognitive multiplex networks
【速读】: 该论文试图解决如何通过模拟认知单层网络和多层网络中的激活扩散过程,来理解结构与功能之间的关系及其在认知、心理和临床现象中的体现问题。解决方案的关键在于开发SpreadPy这一Python库,它能够进行数值仿真,并将结果与知识建模中的基础理论进行比较,从而系统地研究激活动力学如何反映相关现象。其核心创新在于利用 empirically derived 或 theoretical 网络对认知过程进行建模,提供对个体差异和认知障碍的机制性洞察。
链接: https://arxiv.org/abs/2507.09628
作者: Salvatore Citraro,Edith Haim,Alessandra Carini,Cynthia S. Q. Siew,Giulio Rossetti,Massimo Stella
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce SpreadPy as a Python library for simulating spreading activation in cognitive single-layer and multiplex networks. Our tool is designed to perform numerical simulations testing structure-function relationships in cognitive processes. By comparing simulation results with grounded theories in knowledge modelling, SpreadPy enables systematic investigations of how activation dynamics reflect cognitive, psychological and clinical phenomena. We demonstrate the library’s utility through three case studies: (1) Spreading activation on associative knowledge networks distinguishes students with high versus low math anxiety, revealing anxiety-related structural differences in conceptual organization; (2) Simulations of a creativity task show that activation trajectories vary with task difficulty, exposing how cognitive load modulates lexical access; (3) In individuals with aphasia, simulated activation patterns on lexical networks correlate with empirical error types (semantic vs. phonological) during picture-naming tasks, linking network structure to clinical impairments. SpreadPy’s flexible framework allows researchers to model these processes using empirically derived or theoretical networks, providing mechanistic insights into individual differences and cognitive impairments. The library is openly available, supporting reproducible research in psychology, neuroscience, and education research.
zh
[NLP-48] NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance
【速读】: 该论文试图解决通用句子嵌入模型在捕捉金融领域专业语义时的不足,特别是在低资源语言如韩语中的表现问题,这主要源于领域特定术语、时间意义变化以及双语词汇不对齐等问题。解决方案的关键在于引入NMIXX(Neural eMbeddings for Cross-lingual eXploration of Finance),这是一个经过18.8K高置信度三元组微调的跨语言嵌入模型,这些三元组包括领域内同义句、基于语义变化类型的困难负例以及精确的韩英翻译对。同时,论文还发布了KorFinSTS,一个包含1,921对韩语金融文本的STS基准,用于揭示通用基准可能忽略的细微差别。
链接: https://arxiv.org/abs/2507.09601
作者: Hanwool Lee,Sara Yu,Yewon Hwang,Jonghyun Choi,Heejae Ahn,Sungbum Jung,Youngjae Yu
机构: Samsung Fire & Marine Insurance(三星火灾海上保险); KB Securities(KB证券); Netmarble(网马); Yonsei University(延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注: Under Review
Abstract:General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX’s multilingual bge-m3 variant achieves Spearman’s rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance. Comments: Under Review Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP) Cite as: arXiv:2507.09601 [cs.CL] (or arXiv:2507.09601v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2507.09601 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-49] MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
【速读】: 该论文试图解决当前文本到图像生成模型在精确视觉控制、多模态输入平衡以及复杂多模态图像生成中的训练效率问题。其解决方案的关键在于提出MENTOR框架,该框架采用两阶段训练范式:第一阶段通过多模态对齐建立像素级和语义级的精细对齐,第二阶段通过多模态指令微调平衡多模态输入的整合并增强生成可控性,从而在不依赖辅助适配器或交叉注意力模块的情况下实现高效的多模态图像生成。
链接: https://arxiv.org/abs/2507.09574
作者: Haozhe Zhao,Zefan Cai,Shuzheng Si,Liang Chen,Jiuxiang Gu,Wen Xiao,Junjie Hu
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Tsinghua University (清华大学); Peking University (北京大学); Adobe Research (Adobe 研究院); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages,12 figures
Abstract:Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: this https URL
zh
[NLP-50] Adapting Definition Modeling for New Languages: A Case Study on Belarusian
【速读】: 该论文试图解决如何利用现有的生成式 AI (Generative AI) 模型来支持尚无足够资源的语种(如白俄罗斯语)的定义建模问题。其解决方案的关键在于提出一个包含43,150个定义的新数据集,并验证了通过少量数据即可适应现有定义建模系统,但同时也指出当前自动评估指标在捕捉模型性能方面仍存在不足。
链接: https://arxiv.org/abs/2507.09536
作者: Daniela Kazakouskaya,Timothee Mickus,Janine Siewert
机构: University of Helsinki(赫尔辛基大学)
类目: Computation and Language (cs.CL)
备注: To appear at SlavicNLP 2025
Abstract:Definition modeling, the task of generating new definitions for words in context, holds great prospect as a means to assist the work of lexicographers in documenting a broader variety of lects and languages, yet much remains to be done in order to assess how we can leverage pre-existing models for as-of-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting a definition modeling systems requires minimal amounts of data, but that there currently are gaps in what automatic metrics do capture.
zh
[NLP-51] How Important is `Perfect English for Machine Translation Prompts?
【速读】: 该论文试图解决用户提示中的错误和扰动对大型语言模型(Large Language Models, LLMs)在机器翻译和机器翻译评估任务中性能的影响问题。其解决方案的关键在于系统性地评估人类可接受的和合成的提示错误如何影响LLMs的表现,并通过定量分析和定性洞察揭示不同类型的噪声对翻译质量的不同影响,特别是指出提示质量主要通过影响模型对指令的遵循程度而非直接作用于翻译质量本身。
链接: https://arxiv.org/abs/2507.09509
作者: Patrícia Schmidtová,Niyati Bafna,Seth Aycock,Gianluca Vico,Wiktor Kamzela,Katharina Hämmerl,Vilém Zouhar
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs’ performance on two related tasks: Machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt. The prompt quality strongly affects the translation performance: With many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2507.09509 [cs.CL] (or arXiv:2507.09509v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2507.09509 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-52] Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models ACL2025
【速读】: 该论文试图解决长上下文语言模型(Long-context Language Models, LCLMs)在长上下文引用任务中的能力不足问题,该任务要求模型能够将感兴趣的内容准确关联到长文本数据中的具体部分。解决方案的关键在于提出Ref-Long基准,通过设计需要模型识别引用特定关键词的文档索引的任务,强调关键词与文档之间的上下文关系,而非简单的信息检索。
链接: https://arxiv.org/abs/2507.09506
作者: Junjie Wu,Gefei Gu,Yanan Zheng,Dit-Yan Yeung,Arman Cohan
机构: Hong Kong University of Science and Technology (香港科技大学); Carnegie Mellon University (卡内基梅隆大学); Yale University (耶鲁大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Main Conference. First 2 authors contributed equally
Abstract:Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing – a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data – remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results of 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code can be found in https://github. com/wujunjie1998/Ref-Long.
zh
[NLP-53] GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities
【速读】: 该论文试图解决现代企业环境中多智能体系统在处理复杂、动态和多方面任务时所面临的协调不足、记忆复用有限以及任务分解能力薄弱的问题。其解决方案的关键在于提出一种基于协议驱动的端到端多智能体协作框架——GoalfyMax,该框架通过标准化的Agent-to-Agent(A2A)通信层实现异步、协议合规的智能体间协作,并引入Experience Pack(XP)架构作为分层记忆系统,以结构化方式保留任务推理过程与执行轨迹,从而支持持续学习与经验复用。
链接: https://arxiv.org/abs/2507.09497
作者: Siyi Wu,Zeyu Wang,Xinyuan Song,Zhengpeng Zhou,Lifan Sun,Tianyu Shi
机构: The University of Texas at Arlington (德克萨斯大学阿灵顿分校); University of California (加州大学); Emory University (埃默里大学); Shanghai Jiaotong University (上海交通大学); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Modern enterprise environments demand intelligent systems capable of handling complex, dynamic, and multi-faceted tasks with high levels of autonomy and adaptability. However, traditional single-purpose AI systems often lack sufficient coordination, memory reuse, and task decomposition capabilities, limiting their scalability in realistic settings. To address these challenges, we present \textbfGoalfyMax, a protocol-driven framework for end-to-end multi-agent collaboration. GoalfyMax introduces a standardized Agent-to-Agent (A2A) communication layer built on the Model Context Protocol (MCP), allowing independent agents to coordinate through asynchronous, protocol-compliant interactions. It incorporates the Experience Pack (XP) architecture, a layered memory system that preserves both task rationales and execution traces, enabling structured knowledge retention and continual learning. Moreover, our system integrates advanced features including multi-turn contextual dialogue, long-short term memory modules, and dynamic safety validation, supporting robust, real-time strategy adaptation. Empirical results on complex task orchestration benchmarks and case study demonstrate that GoalfyMax achieves superior adaptability, coordination, and experience reuse compared to baseline frameworks. These findings highlight its potential as a scalable, future-ready foundation for multi-agent intelligent systems.
zh
[NLP-54] Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis
【速读】: 该论文试图解决在社交媒体场景中进行基于方面的情感分析(Aspect-based Sentiment Analysis, ABSA)时,由于文本短、标注训练数据量小且分布不均衡(多数数据为正面情感)而导致的上下文信息学习困难问题。其解决方案的关键在于利用生成式AI(Generative AI)生成增强的训练数据,以构建规模更大且标签分布更平衡的数据集,从而更好地训练ABSA模型。同时,为了提升增强数据的质量,还引入了强化学习方法对数据增强过程进行优化。
链接: https://arxiv.org/abs/2507.09485
作者: Junjie Liu,Yuanhe Tian,Yan Song
机构: University of Science and Technology of China (中国科学技术大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios to identify the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA due to their strong context understanding capabilities, they still face challenges to learn the context information in the running text because of the short text, as well as the small and unbalanced labeled training data, where most data are labeled with positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when using LLMs to create synthetic training data, but faces challenges in ensuring a high quality of the augmented this http URL this paper, we propose an LLM-based ABSA approach with training data this http URL, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training data with larger size and balanced label distributions to better train an ABSA model. Meanwhile, in order to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation. this http URL results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, where superior performance is observed over strong baselines and most existing studies.
zh
[NLP-55] ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning
【速读】: 该论文试图解决现有研究中对讽刺生成的探索不足问题,特别是由于过度依赖文本模态、忽视视觉线索以及现有数据集中图像内容与讽刺意图不匹配所导致的局限性。其解决方案的关键在于引入M2SaG多模态讽刺生成数据集,并提出ViSP生成框架,该框架结合了近端策略优化(PPO)和对比学习,通过DIP提供的奖励分数引导讽刺文本生成,并利用对比学习增强模型对高奖励输出的偏好,从而提升整体生成质量和讽刺意图的表达。
链接: https://arxiv.org/abs/2507.09482
作者: Changli Wang,Rui Wu,Fang Yin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. % The dataset and code will be publicly available. Our dataset and code will be released at \textitthis https URL.
zh
[NLP-56] Evaluating LLM s on Sequential API Call Through Automated Test Generation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在使用外部API工具时的测试、评估和分析仍处于初级阶段的问题,尤其是现有基准测试依赖于人工收集的测试用例,难以自动验证语义正确性,并且忽略了实际应用中常见的顺序API调用之间的复杂交互。解决方案的关键是提出StateGen,一个自动化框架,通过结合基于状态机的API约束求解与验证、能量采样和控制流注入生成可执行程序,并通过两个LLM代理协作将其转换为类人类自然语言任务描述,从而构建了一个包含120个验证测试用例的基准StateEval。
链接: https://arxiv.org/abs/2507.09481
作者: Yuheng Huang,Da Song,Zhenlan Ji,Shuai Wang,Lei Ma
机构: The University of Tokyo (东京大学); University of Alberta (阿尔伯塔大学); Hong Kong University of Science and Technology (香港科技大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning across three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs.
zh
[NLP-57] owards Agent ic RAG with Deep Reasoning : A Survey of RAG Reasoning : A Survey of RAG-Reasoning Systems in LLM s
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在需要多步骤推理的问题上表现不足,以及纯推理导向方法容易产生幻觉或事实错误的问题。其解决方案的关键在于通过统一的推理-检索视角,将先进推理技术与检索增强生成(Retrieval-Augmented Generation, RAG)方法相结合,从而优化RAG的各个阶段,并利用不同类型的检索知识补充前提条件、扩展上下文以支持复杂推理,最终实现更高效、多模态适应、可信且以用户为中心的RAG-Reasoning系统。
链接: https://arxiv.org/abs/2507.09477
作者: Yangning Li,Weizhi Zhang,Yuyao Yang,Wei-Chieh Huang,Yaozu Wu,Junyu Luo,Yuanchen Bei,Henry Peng Zou,Xiao Luo,Yusheng Zhao,Chunkit Chan,Yankai Chen,Zhongfen Deng,Yinghui Li,Hai-Tao Zheng,Dongyuan Li,Renhe Jiang,Ming Zhang,Yangqiu Song,Philip S. Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: submitted to ARR May
Abstract:Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different type supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at this https URL.
zh
[NLP-58] he CoNLL-2013 Shared Task on Grammatical Error Correction
【速读】: 该论文试图解决语法错误修正(Grammatical Error Correction, GEC)问题。解决方案的关键在于定义任务规范、构建数据集、设计评估指标和评分工具,并分析参与团队所采用的各种方法,从而为该领域提供标准化的评测框架和实验基准。
链接: https://arxiv.org/abs/2507.09474
作者: Hwee Tou Ng,Siew Mei Wu,Yuanbin Wu,Christian Hadiwinoto,Joel Tetreault
机构: National University of Singapore (新加坡国立大学); Nuance Communications, Inc. (Nuance通信公司)
类目: Computation and Language (cs.CL)
备注: 12 pages
Abstract:The CoNLL-2013 shared task was devoted to grammatical error correction. In this paper, we give the task definition, present the data sets, and describe the evaluation metric and scorer used in the shared task. We also give an overview of the various approaches adopted by the participating teams, and present the evaluation results.
zh
[NLP-59] Enhancing Clinical Text Classification via Fine-Tuned DRAG ON Longformer Models
【速读】: 该论文试图解决临床文本分类中的性能优化问题,特别是针对医学案例描述的二分类任务。其关键解决方案包括对预训练的DRAGON Longformer基础模型进行超参数调优、领域特定的预处理以及架构调整,其中核心改进包括将序列长度从512个token增加到1024个token,学习率从1e-05调整为5e-06,训练轮数从5次延长至8次,并引入专业医学术语。这些改进显著提升了模型在准确率、精确率、召回率和F1分数等方面的性能。
链接: https://arxiv.org/abs/2507.09470
作者: Mingchuan Yang,Ziyuan Huang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 5 tables
Abstract:This study explores the optimization of the DRAGON Longformer base model for clinical text classification, specifically targeting the binary classification of medical case descriptions. A dataset of 500 clinical cases containing structured medical observations was used, with 400 cases for training and 100 for validation. Enhancements to the pre-trained joeranbosma/dragon-longformer-base-mixed-domain model included hyperparameter tuning, domain-specific preprocessing, and architectural adjustments. Key modifications involved increasing sequence length from 512 to 1024 tokens, adjusting learning rates from 1e-05 to 5e-06, extending training epochs from 5 to 8, and incorporating specialized medical terminology. The optimized model achieved notable performance gains: accuracy improved from 72.0% to 85.2%, precision from 68.0% to 84.1%, recall from 75.0% to 86.3%, and F1-score from 71.0% to 85.2%. Statistical analysis confirmed the significance of these improvements (p .001). The model demonstrated enhanced capability in interpreting medical terminology, anatomical measurements, and clinical observations. These findings contribute to domain-specific language model research and offer practical implications for clinical natural language processing applications. The optimized model’s strong performance across diverse medical conditions underscores its potential for broad use in healthcare settings.
zh
[NLP-60] DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
【速读】: 该论文试图解决当前在大型语言模型(Large Language Models, LLMs)中对数据归属方法(data attribution methods)进行系统性评估的不足。其解决方案的关键在于引入DATE-LM,这是一个通过真实世界LLM应用来评估数据归属方法的统一基准。DATE-LM通过三个关键任务——训练数据选择、毒性/偏见过滤和事实归属——来衡量归属质量,并设计为易于使用,支持研究人员在多种任务和LLM架构上进行大规模评估。
链接: https://arxiv.org/abs/2507.09424
作者: Cathy Jiao,Yijun Pan,Emily Xiao,Daisy Sheng,Niket Jain,Hanzhang Zhao,Ishita Dasgupta,Jiaqi W. Ma,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学); University of Michigan (密歇根大学); UIUC (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks – training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement. We hope DATE-LM serves as a foundation for future data attribution research in LLMs.
zh
[NLP-61] Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning
【速读】: 该论文试图解决在缺乏足够目标说话人 Lombard 语调数据的情况下,如何有效训练 Text-to-Speech (TTS) 系统的问题。其解决方案的关键在于利用语音转换 (Voice Conversion, VC) 技术,通过隐式声学特征条件控制策略,在保持 Lombard 语调的声学属性的同时转换说话人身份,从而实现与显式声学特征条件控制模型相当的可懂度提升,并维持说话人相似性。
链接: https://arxiv.org/abs/2507.09310
作者: Dominika Woszczyk,Manuel Sam Ribeiro,Thomas Merritt,Daniel Korzekwa
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Presented at Clarity Challenge 2023
Abstract:Text-to-Speech (TTS) systems in Lombard speaking style can improve the overall intelligibility of speech, useful for hearing loss and noisy conditions. However, training those models requires a large amount of data and the Lombard effect is challenging to record due to speaker and noise variability and tiring recording conditions. Voice conversion (VC) has been shown to be a useful augmentation technique to train TTS systems in the absence of recorded data from the target speaker in the target speaking style. In this paper, we are concerned with Lombard speaking style transfer. Our goal is to convert speaker identity while preserving the acoustic attributes that define the Lombard speaking style. We compare voice conversion models with implicit and explicit acoustic feature conditioning. We observe that our proposed implicit conditioning strategy achieves an intelligibility gain comparable to the model conditioned on explicit acoustic features, while also preserving speaker similarity.
zh
[NLP-62] ClaritySpeech: Dementia Obfuscation in Speech INTERSPEECH2025
【速读】: 该论文试图解决阿尔茨海默病(Alzheimer’s Disease, AD)患者因神经退行性疾病导致的言语模式改变所引发的沟通障碍和隐私问题,以及现有自动语音识别(Automatic Speech Recognition, ASR)技术在处理痴呆症和非典型言语时的不足。其解决方案的关键在于提出一种名为ClaritySpeech的框架,该框架结合了ASR、文本模糊化和零样本文本到语音(Zero-shot Text-to-Speech, TTS)技术,在低数据环境下无需微调即可纠正受痴呆影响的语音并保持说话人身份,从而提升语音识别准确率和隐私保护水平。
链接: https://arxiv.org/abs/2507.09282
作者: Dominika Woszczyk,Ranya Aloufi,Soteris Demetriou
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025
Abstract:Dementia, a neurodegenerative disease, alters speech patterns, creating communication barriers and raising privacy concerns. Current speech technologies, such as automatic speech transcription (ASR), struggle with dementia and atypical speech, further challenging accessibility. This paper presents a novel dementia obfuscation in speech framework, ClaritySpeech, integrating ASR, text obfuscation, and zero-shot text-to-speech (TTS) to correct dementia-affected speech while preserving speaker identity in low-data environments without fine-tuning. Results show a 16% and 10% drop in mean F1 score across various adversarial settings and modalities (audio, text, fusion) for ADReSS and ADReSSo, respectively, maintaining 50% speaker similarity. We also find that our system improves WER (from 0.73 to 0.08 for ADReSS and 0.15 for ADReSSo) and speech quality from 1.65 to ~2.15, enhancing privacy and accessibility.
zh
[NLP-63] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models ICCV2025
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗等安全关键场景中部署时面临的两个核心问题:对提示设计的敏感性以及在高置信度下生成错误响应的倾向。其解决方案的关键在于提出Prompt4Trust,这是一个针对MLLMs置信度校准的强化学习(Reinforcement Learning, RL)框架,通过训练一个轻量级的语言模型生成上下文感知的辅助提示,引导下游任务模型生成更准确反映预测精度的响应,从而提升模型在临床决策中的安全性和可信度。
链接: https://arxiv.org/abs/2507.09279
作者: Anita Kriz,Elizabeth Laura Janes,Xing Shen,Tal Arbel
机构: Centre for Intelligent Machines, McGill University (智能机器中心,麦吉尔大学); Mila – Quebec AI Institute (Mila – 魁北克人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint version. The peer-reviewed version of this paper has been accepted to ICCV 2025 Workshop CVAMD
Abstract:Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at this https URL.
zh
[NLP-64] Psychology-Driven Enhancement of Humour Translation
【速读】: 该论文试图解决生成式 AI 在幽默翻译任务中的不足,特别是由于语言干扰导致的幽默元素缺失问题。其解决方案的关键在于提出一种受心理学启发的幽默分解机制(Humour Decomposition Mechanism, HDM),该机制利用链式思维(Chain-of-Thought, CoT)模拟人类的思维过程,以优化翻译后幽默文本的可读性,并结合幽默理论进一步增强翻译文本中的幽默元素。
链接: https://arxiv.org/abs/2507.09259
作者: Yuchen Su,Yonghua Zhu,Yang Chen,Diana Benavides-Prado,Michael Witbrock
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Humour translation plays a vital role as a bridge between different cultures, fostering understanding and communication. Although most existing Large Language Models (LLMs) are capable of general translation tasks, these models still struggle with humour translation, which is especially reflected through linguistic interference and lacking humour in translated text. In this paper, we propose a psychology-inspired Humour Decomposition Mechanism (HDM) that utilises Chain-of-Thought (CoT) to imitate the ability of the human thought process, stimulating LLMs to optimise the readability of translated humorous texts. Moreover, we integrate humour theory in HDM to further enhance the humorous elements in the translated text. Our automatic evaluation experiments on open-source humour datasets demonstrate that our method significantly improves the quality of humour translation, yielding average gains of 7.75% in humour, 2.81% in fluency, and 6.13% in coherence of the generated text.
zh
[NLP-65] Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources
【速读】: 该论文旨在解决罗马化僧伽罗语到僧伽罗语的转写问题,通过提供一系列数据资源和算法来促进僧伽罗语自然语言处理(NLP)的研究。解决方案的关键在于构建一个全面的资源库,即Swa-bhasha Resource Hub,该库包含了2020年至2025年间开发的数据集和工具,特别针对罗马化僧伽罗语的转写任务,为训练转写模型和相关应用的开发提供了重要支持。
链接: https://arxiv.org/abs/2507.09245
作者: Deshan Sumanathilaka,Sameera Perera,Sachithya Dharmasiri,Maneesha Athukorala,Anuja Dilrukshi Herath,Rukshan Dias,Pasindu Gamage,Ruvan Weerasinghe,Y.H.P.P. Priyadarshana
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 Tables, 3 figures
Abstract:The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.
zh
[NLP-66] MetaClimage: A novel database of visual metaphors related to Climate Change with costs and benefits analysis
【速读】: 该论文试图解决视觉隐喻在气候变化传播中的影响问题,特别是其在理解难度、效果、情感唤醒及审美价值方面的表现。解决方案的关键在于构建了一个名为MetaClimage的新型数据库,该数据库包含气候变化图像中的隐喻图像与对应的真实图像,并通过人类评分获取了难度、效果、艺术质量和情感唤醒等指标,同时利用自然语言处理技术从参与者生成的标签中提取语义和情感变量。这一数据库为未来研究提供了基础资源,并揭示了视觉隐喻在认知负荷与积极效应之间的权衡。
链接: https://arxiv.org/abs/2507.09225
作者: Biagio Scalingi,Chiara Barattieri di San Pietro,Paolo Canal,Valentina Bambini
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 27 pages, 5 figures
Abstract:Visual metaphors of climate change (e.g., melting glaciers depicted as a melting ice grenade) are regarded as valuable tools for addressing the complexity of environmental challenges. However, few studies have examined their impact on communication, also due to scattered availability of material. Here, we present a novel database of Metaphors of Climate Change in Images (MetaClimage) this https URL, paired with literal images and enriched with human ratings. For each image, we collected values of difficulty, efficacy, artistic quality, and emotional arousal from human rating, as well as number of tags generated by participants to summarize the message. Semantic and emotion variables were further derived from the tags via Natural Language Processing. Visual metaphors were rated as more difficult to understand, yet more aesthetically pleasant than literal images, but did not differ in efficacy and arousal. The latter for visual metaphors, however, was higher in participants with higher Need For Cognition. Furthermore, visual metaphors received more tags, often referring to entities not depicted in the image, and elicited words with more positive valence and greater dominance than literal images. These results evidence the greater cognitive load of visual metaphors, which nevertheless might induce positive effects such as deeper cognitive elaboration and abstraction compared to literal stimuli. Furthermore, while they are not deemed as more effective and arousing, visual metaphors seem to generate superior aesthetic appreciation and a more positively valenced experience. Overall, this study contributes to understanding the impact of visual metaphors of climate change both by offering a database for future research and by elucidating a cost-benefit trade-off to take into account when shaping environmental communication.
zh
[NLP-67] Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
【速读】: 该论文试图解决藏语作为代表性低资源语言在现有大型语言模型中严重缺失的问题,主要原因是高质量训练语料的匮乏。解决方案的关键在于构建迄今为止最大的藏语预训练语料库,通过整合多样化数据源并应用专门针对藏语设计的数据清洗和处理流程,从而为模型训练提供充足且高质量的数据支持。基于此语料库,研究者对多语言基础模型进行了持续的预训练/微调,最终开发出Banzhida,一个在藏语生成式AI方面具有显著优势的多语言大语言模型。
链接: https://arxiv.org/abs/2507.09205
作者: Leiyu Pan,Bojian Xiong,Lei Yang,Renren Jin,Shaowei Zhang,Yue Chen,Ling Shi,Jiang Zhou,Junru Wu,Zhen Wang,Jianxiang Peng,Juesi Xiao,Tianyu Dong,Zhuowen Han,Zhuo Chen,Sangjee Dondrub,Caizang Tai,Haixing Zhao,Huaque Cairang,Suonan Cairang,Rou Te,Lengben Zhaxi,Gazang Zhaxi,Zhonglin Ye,Yuhui Zheng,Chunyan Peng,Secha Jia,Pema Tashi,Cizhen Jiacuo,Pema Dorjee,Hongkai Liu,Pema Yanggon,Tsehang Dorjee,Jiaxin Han,Qiongying Hu,Jilin Man,Huanke You,Yuqi Ren,Duo La,Deyi Xiong
机构: TJUNLP Lab, Tianjin University (天津大学); Qinghai Normal University (青海师范大学); Yuanhui AI Lab (元慧人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
zh
[NLP-68] Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对新任务或分布时性能下降的问题,这是因为模型在特定数据集上学习到的机制缺乏泛化能力。论文提出的解决方案关键在于通过识别并剪枝与数据集特定机制相关的神经元,从而增强模型的泛化能力。该方法利用集成梯度(Integrated Gradients)量化每个神经元对高置信度预测的影响,筛选出那些过度贡献于数据集特定性能但不支持可迁移推理的神经元,并通过选择性剪枝促使模型依赖更通用的表示。
链接: https://arxiv.org/abs/2507.09185
作者: Ameen Ali,Shahar Katz,Lior Wolf,Ivan Titov
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron’s influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
zh
[NLP-69] DLBAcalib: Robust Extrinsic Calibration for Non-Overlapping LiDARs Based on Dual LBA
【速读】: 该论文旨在解决多LiDAR系统中无需依赖重叠视场或精确初始参数估计的外参标定问题。其解决方案的关键在于提出了一种统一的优化框架,将LiDAR束平差(LBA)优化与鲁棒迭代精化相结合,通过连续扫描目标LiDAR和滑动窗口LiDAR束平差构建高精度参考点云地图,并将外参标定建模为联合LBA优化问题,从而有效缓解累积映射误差并实现抗异常值的参数估计。
链接: https://arxiv.org/abs/2507.09176
作者: Han Ye,Yuqiang Jin,Jinyuan Liu,Tao Li,Wen-An Zhang,Minglei Fu
机构: Zhejiang University of Technology (浙江理工大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注: 9 pages,14 figures
Abstract:Accurate extrinsic calibration of multiple LiDARs is crucial for improving the foundational performance of three-dimensional (3D) map reconstruction systems. This paper presents a novel targetless extrinsic calibration framework for multi-LiDAR systems that does not rely on overlapping fields of view or precise initial parameter estimates. Unlike conventional calibration methods that require manual annotations or specific reference patterns, our approach introduces a unified optimization framework by integrating LiDAR bundle adjustment (LBA) optimization with robust iterative refinement. The proposed method constructs an accurate reference point cloud map via continuous scanning from the target LiDAR and sliding-window LiDAR bundle adjustment, while formulating extrinsic calibration as a joint LBA optimization problem. This method effectively mitigates cumulative mapping errors and achieves outlier-resistant parameter estimation through an adaptive weighting mechanism. Extensive evaluations in both the CARLA simulation environment and real-world scenarios demonstrate that our method outperforms state-of-the-art calibration techniques in both accuracy and robustness. Experimental results show that for non-overlapping sensor configurations, our framework achieves an average translational error of 5 mm and a rotational error of 0.2°, with an initial error tolerance of up to 0.4 m/30°. Moreover, the calibration process operates without specialized infrastructure or manual parameter tuning. The code is open source and available on GitHub (\underlinethis https URL)
zh
[NLP-70] RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking
【速读】: 该论文试图解决多模态虚假信息在自动化事实核查系统中面临的挑战,特别是在声明模糊或缺乏足够上下文的情况下。其解决方案的关键在于提出RAMA框架,该框架通过三项核心创新实现多模态信息的验证:(1)战略查询生成,将多模态声明转化为精确的网络搜索查询;(2)从多样且权威来源中进行交叉验证证据聚合;(3)多智能体集成架构,利用多个多模态大语言模型和提示变体的互补优势。这些创新使得RAMA在基准数据集上表现出色,尤其在基于检索到的事实证据解决模糊或不可能的声明方面表现突出。
链接: https://arxiv.org/abs/2507.09174
作者: Shuo Yang,Zijian Yu,Zhenzhe Ying,Yuqin Dai,Guoqing Wang,Jun Lan,Jinfeng Xu,Jinze Li,Edith C.H. Ngai
机构: The University of Hong Kong (香港大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at this https URL.
zh
[NLP-71] PU-Lie: Lightweight Deception Detection in Imbalanced Diplomatic Dialogues via Positive-Unlabeled Learning
【速读】: 该论文试图解决战略对话中欺骗检测的问题,这一任务因语言的细微差别和欺骗性与真实性通信之间的极端类别不平衡而变得复杂。其解决方案的关键在于引入一种轻量级但有效的模型,该模型结合了冻结的BERT嵌入、可解释的语言学和游戏特定特征以及正-未标记(PU)学习目标。与传统的二分类器不同,PU-Lie专门针对仅有一小部分欺骗性消息被标记而大部分未被标记的情况,通过PU学习策略,模型在减少可训练参数超过650倍的同时实现了新的最佳宏观F1分数0.60,并强调了在该问题设置中准确检测欺骗性信息比识别真实信息更为关键。
链接: https://arxiv.org/abs/2507.09157
作者: Bhavinkumar Vinodbhai Kuwar,Bikrant Bikram Pratap Maurya,Priyanshu Gupta,Nitin Choudhury
机构: Indraprastha Institute of Information Technology Delhi (印度德里印德拉普拉斯塔姆信息技术学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Detecting deception in strategic dialogues is a complex and high-stakes task due to the subtlety of language and extreme class imbalance between deceptive and truthful communications. In this work, we revisit deception detection in the Diplomacy dataset, where less than 5% of messages are labeled deceptive. We introduce a lightweight yet effective model combining frozen BERT embeddings, interpretable linguistic and game-specific features, and a Positive-Unlabeled (PU) learning objective. Unlike traditional binary classifiers, PU-Lie is tailored for situations where only a small portion of deceptive messages are labeled, and the majority are unlabeled. Our model achieves a new best macro F1 of 0.60 while reducing trainable parameters by over 650x. Through comprehensive evaluations and ablation studies across seven models, we demonstrate the value of PU learning, linguistic interpretability, and speaker-aware representations. Notably, we emphasize that in this problem setting, accurately detecting deception is more critical than identifying truthful messages. This priority guides our choice of PU learning, which explicitly models the rare but vital deceptive class.
zh
[NLP-72] OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM /MLLM XRD Question Answering
【速读】: 该论文试图解决在材料科学领域中,小型模型因缺乏晶体学知识而难以准确回答X射线衍射(XRD)相关问题的挑战。其解决方案的关键在于构建OPENXRD系统,该系统通过GPT-4.5生成简洁且领域特定的文本支持内容,为小型模型提供关键概念的辅助理解,从而提升其在XRD问题上的推理能力。
链接: https://arxiv.org/abs/2507.09155
作者: Ali Vosoughi,Ayoub Shahnazari,Yufeng Xi,Zeliang Zhang,Griffin Hess,Chenliang Xu,Niaz Abdolrahim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 5 tables. Code and dataset available at this https URL . Project webpage: this https URL
Abstract:This work presents OPENXRD, an open-book pipeline designed for crystallography question answering, which integrates textual prompts with concise supporting content generated by GPT-4.5. Instead of using scanned textbooks, which may lead to copyright issues, OPENXRD generates compact, domain-specific references that help smaller models understand key concepts in X-ray diffraction (XRD). We evaluate OPENXRD on a well-defined set of 217 expert-level XRD questions by comparing different vision-language models, including GPT-4 and LLaVA-based frameworks such as Mistral, LLaMA, and QWEN, under both closed-book (without supporting material) and open-book (with supporting material) conditions. Our experimental results show significant accuracy improvements in models that use the GPT-4.5-generated summaries, particularly those with limited prior training in crystallography. OPENXRD uses knowledge from larger models to fill knowledge gaps in crystallography and shows that AI-generated texts can help smaller models reason more effectively in scientific tasks. While the current version of OPENXRD focuses on text-based inputs, we also explore future extensions such as adding real crystal diagrams or diffraction patterns to improve interpretation in specialized materials science contexts. Overall, OPENXRD shows that specialized open-book systems can be useful in materials science and provides a foundation for broader natural language processing (NLP) tools in critical scientific fields.
zh
[NLP-73] CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
【速读】: 该论文试图解决当前作为评估工具的大型语言模型(Large Language Models, LLMs)中的判官模型(judge models)存在的专业化狭窄和鲁棒性不足的问题,这些问题限制了其进行全面评估的能力。解决方案的关键在于提出CompassJudger-2,这是一个通过任务驱动的多领域数据收集策略实现的通用判官模型,其核心方法是通过可验证的奖励监督判断任务,并利用拒绝采样引导内在批判性推理,从而培养出稳健且可泛化的判断能力。此外,论文还引入了带有边界策略梯度损失的优化学习目标,以提升模型性能。
链接: https://arxiv.org/abs/2507.09104
作者: Taolin Zhang,Maosong Cao,Alexander Lam,Songyang Zhang,Kai Chen
机构: Shanghai AI Laboratory (上海人工智能实验室); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
zh
[NLP-74] AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data
【速读】: 该论文试图解决在决策对话中,专家由于实时性要求无法有效回顾和利用历史数据的问题。解决方案的关键在于构建一个基于检索的大型语言模型(Large Language Model, LLM)代理的管道,通过持续监听对话、识别问题与解决方案,并从嵌入式数据集中检索相关数据以生成简洁的洞察,从而实现在实时决策过程中利用过去数据的见解。
链接: https://arxiv.org/abs/2507.09100
作者: Mohammad Abolnejadian,Shakiba Amirshahi,Matthew Brehmer,Anamaria Crisan
机构: Cheriton School of Computer Science, University of Waterloo(查里顿计算机科学学院,滑铁卢大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages and 4 figures. Proceedings of the 7th ACM Conference on Conversational User Interfaces (CUI '25)
Abstract:In decision-making conversations, experts must navigate complex choices and make on-the-spot decisions while engaged in conversation. Although extensive historical data often exists, the real-time nature of these scenarios makes it infeasible for decision-makers to review and leverage relevant information. This raises an interesting question: What if experts could utilize relevant past data in real-time decision-making through insights derived from past data? To explore this, we implemented a conversational user interface, taking doctor-patient interactions as an example use case. Our system continuously listens to the conversation, identifies patient problems and doctor-suggested solutions, and retrieves related data from an embedded dataset, generating concise insights using a pipeline built around a retrieval-based Large Language Model (LLM) agent. We evaluated the prototype by embedding Health Canada datasets into a vector database and conducting simulated studies using sample doctor-patient dialogues, showing effectiveness but also challenges, setting directions for the next steps of our work.
zh
[NLP-75] DS@GT at Touché: Large Language Models for Retrieval-Augmented Debate
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在结构化辩论中的表现及其对辩论中话语的评估能力问题。解决方案的关键在于利用检索增强的辩论与评估框架,通过部署来自三家供应商的六种领先的公开可用模型,并基于四个关键指标——质量、数量、方式和关联性进行评估,以分析模型在提供相关论点和一致评价方面的能力。
链接: https://arxiv.org/abs/2507.09090
作者: Anthony Miyaguchi,Conor Johnston,Aaryan Potdar
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) demonstrate strong conversational abilities. In this Working Paper, we study them in the context of debating in two ways: their ability to perform in a structured debate along with a dataset of arguments to use and their ability to evaluate utterances throughout the debate. We deploy six leading publicly available models from three providers for the Retrieval-Augmented Debate and Evaluation. The evaluation is performed by measuring four key metrics: Quality, Quantity, Manner, and Relation. Throughout this task, we found that although LLMs perform well in debates when given related arguments, they tend to be verbose in responses yet consistent in evaluation. The accompanying source code for this paper is located at this https URL.
zh
[NLP-76] Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation EMNLP2025
【速读】: 该论文试图解决语音情感识别(Speech Emotion Recognition, SER)中由于语音模态固有的高帧率导致的语音大语言模型(SLLM)在信号处理和理解能力上的限制问题。传统输入令牌压缩方法忽略了多轮对话中情绪的连续性和惯性。该论文提出的动态参数记忆(Dynamic Parameter Memory, DPM)机制,通过结合上下文语义和句子级情感编码,使SLLM能够在有限的上下文窗口内处理无限长度的音频。DPM的关键在于在推理过程中逐步将句子级信息和情感编码到临时LoRA模块中,从而有效“记忆”上下文信息,显著提升了SLLM在处理长音频序列时的情感识别能力。
链接: https://arxiv.org/abs/2507.09076
作者: Jialong Mai,Xiaofen Xing,Yawei Li,Zhipeng Li,Jingyuan Xing,Xiangmin Xu
机构: South China University of Technology (华南理工大学); MiniMax (迷你宇宙); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: submitted to EMNLP 2025
Abstract:Recent research has focused on applying speech large language model (SLLM) to improve speech emotion recognition (SER). However, the inherently high frame rate in speech modality severely limits the signal processing and understanding capabilities of SLLM. For example, a SLLM with a 4K context window can only process 80 seconds of audio at 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLM overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLM. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively “memorize” the contextual information. We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLM when processing long audio sequences, achieving state-of-the-art performance.
zh
[NLP-77] OpenCodeReasoning -II: A Simple Test Time Scaling Approach via Self-Critique
【速读】: 该论文旨在解决代码生成与代码评述中的知识蒸馏问题,其核心挑战在于依赖大规模、高质量的数据集以提升模型性能。论文提出的关键解决方案是引入OpenCodeReasoning-II数据集,该数据集包含2.5M个问题-解法-评述三元组,规模约为之前最大公开代码推理数据集的两倍。此外,论文采用两阶段监督微调策略,第一阶段专注于代码生成的微调,第二阶段则联合训练代码生成与评述模型,从而显著提升了模型在竞赛编程任务中的表现。
链接: https://arxiv.org/abs/2507.09075
作者: Wasi Uddin Ahmad,Somshubra Majumdar,Aleksander Ficek,Sean Narenthiran,Mehrzad Samadi,Jocelyn Huang,Siddhartha Jain,Vahid Noroozi,Boris Ginsburg
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL)
备注: work in progress
Abstract:Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consists of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. In this work, we employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
zh
[NLP-78] ALIGN: Prompt-based Attribute Alignment for Reliable Responsible and Personalized LLM -based Decision-Making ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在作为决策辅助工具时,因用户价值观和偏好差异而导致的对齐与个性化问题。其解决方案的关键在于提出ALIGN系统,该系统通过基于提示的对齐方法,实现LLM决策者的动态个性化,核心特征包括稳健的配置管理、带有推理的结构化输出生成以及可交换的LLM后端算法实现,从而支持多种分析类型,并提供一个模块化的后端以方便算法集成。
链接: https://arxiv.org/abs/2507.09037
作者: Bharadwaj Ravichandran,David Joy,Paul Elliott,Brian Hu,Jadie Adams,Christopher Funk,Emily Veenhuis,Anthony Hoogs,Arslan Basharat
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages total (including appendix), ICML 2025 Workshop on Reliable and Responsible Foundation Models
Abstract:Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LLM alignment and personalization. Existing LLM comparison tools largely focus on benchmarking tasks, such as knowledge-based question answering. In contrast, our proposed ALIGN system focuses on dynamic personalization of LLM-based decision-makers through prompt-based alignment to a set of fine-grained attributes. Key features of our system include robust configuration management, structured output generation with reasoning, and several algorithm implementations with swappable LLM backbones, enabling different types of analyses. Our user interface enables a qualitative, side-by-side comparison of LLMs and their alignment to various attributes, with a modular backend for easy algorithm integration. Additionally, we perform a quantitative analysis comparing alignment approaches in two different domains: demographic alignment for public opinion surveys and value alignment for medical triage decision-making. The entire ALIGN framework is open source and will enable new research on reliable, responsible, and personalized LLM-based decision-makers.
zh
[NLP-79] Lizard: An Efficient Linearization Framework for Large Language Models
【速读】: 该论文旨在解决基于Transformer的大型语言模型(LLM)在上下文长度增加时面临的内存和计算瓶颈问题,这些问题主要由softmax注意力机制的二次复杂度以及不断增长的键值(KV)缓存引起。论文提出的解决方案——Lizard,其关键在于引入了一种近似softmax注意力但具有次二次复杂度的注意力机制,同时保留了输出质量。此外,Lizard通过集成受最新线性模型启发的门控模块,实现了自适应内存控制、常数内存推理、强长度泛化能力,并支持更灵活的模型设计。
链接: https://arxiv.org/abs/2507.09025
作者: Chien Van Nguyen,Ruiyi Zhang,Hanieh Deilamsalehy,Puneet Mathur,Viet Dac Lai,Haoliang Wang,Jayakumar Subramanian,Ryan A. Rossi,Trung Bui,Nikos Vlassis,Franck Dernoncourt,Thien Huu Nguyen
机构: University of Oregon (俄勒冈大学); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages
Abstract:We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model’s performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
zh
[NLP-80] Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery
【速读】: 该论文试图解决个体在视觉意象能力(imagery ability)上的差异如何影响其在Ganzflicker诱导的视觉幻觉中的体验内容。研究的关键在于利用自然语言处理工具分析超过4,000名参与者的自由文本描述,通过对比不同意象表型(imagery phenotype)个体的语言表达,揭示其在幻觉内容复杂性和类型上的差异,特别是发现强意象者描述更复杂的自然主义内容,而弱意象者则更多报告简单的几何图案。此外,研究还表明基于视觉-语言模型的嵌入方法比仅依赖文本的语言模型更能捕捉这些差异。
链接: https://arxiv.org/abs/2507.09011
作者: Ana Chkhaidze,Reshanne R. Reeder,Connor Gag,Anastasia Kiyonaga,Seana Coulson
机构: 未知
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
备注:
Abstract:A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Recent proposals regarding the imagery spectrum, that is, differences in the visual system of individuals with absent imagery, typical imagery, and vivid imagery, suggest these differences should impact the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind’s eye during Ganzflicker-induced hallucinations. Strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Embeddings from vision language models better captured these differences than text-only language models, and participants with stronger imagery used language with richer sensorimotor associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.
zh
[NLP-81] Semantic Source Code Segmentation using Small and Large Language Models
【速读】: 该论文试图解决在软件开发中对研究型R语言代码进行高效分割的问题,以支持知识检索和维护。传统的人工和语法分析方法在大规模代码库中已不再适用,尤其是在低资源语言如R及其研究领域(如社会科学、心理学)中。论文的关键解决方案是引入一种自动化、领域特定的代码分割方法,利用大型语言模型(LLMs)和小型语言模型(SLMs),并通过两种新颖的方法——基于上下文的逐行分析和基于范围的段落确定进行实验,最终表明基于上下文的逐行分析优于基于范围的方法。
链接: https://arxiv.org/abs/2507.08992
作者: Abdelhalim Dahou,Ansgar Scherp,Sebastian Kurten,Brigitte Mathiak,Madhu Chauhan
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: 18 pages, 4 figures
Abstract:Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology).This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science this http URL results show that context-based line-by-line analysis is superior over range-based this http URL smaller language models like CodeBERT and an encoder-only version of CodeT5+ are better than their LLM counterparts. Most notably, these two best-performing models did not see R code during pre-training versus the LLMs but were only fine-tuned on 4,130 lines of manually annotated code.
zh
[NLP-82] Application of CARE-SD text classifier tools to assess distribution of stigmatizing and doubt-marking language features in EHR
【速读】: 该论文试图解决电子健康记录(Electronic Health Records, EHR)中患者污名化语言的传播问题,特别是其在医疗团队中的持续现象。解决方案的关键在于通过扩展词典匹配和监督学习分类器识别EHR中的怀疑标志词和污名化标签的语言特征,并利用泊松回归模型评估这些语言特征的预测因素,从而揭示不同患者群体和医疗人员在使用污名化语言上的差异。
链接: https://arxiv.org/abs/2507.08969
作者: Drew Walker,Jennifer Love,Swati Rajwal,Isabel C Walker,Hannah LF Cooper,Abeed Sarker,Melvin Livingston III
机构: 未知
类目: Computation and Language (cs.CL)
备注: 3 Tables
Abstract:Introduction: Electronic health records (EHR) are a critical medium through which patient stigmatization is perpetuated among healthcare teams. Methods: We identified linguistic features of doubt markers and stigmatizing labels in MIMIC-III EHR via expanded lexicon matching and supervised learning classifiers. Predictors of rates of linguistic features were assessed using Poisson regression models. Results: We found higher rates of stigmatizing labels per chart among patients who were Black or African American (RR: 1.16), patients with Medicare/Medicaid or government-run insurance (RR: 2.46), self-pay (RR: 2.12), and patients with a variety of stigmatizing disease and mental health conditions. Patterns among doubt markers were similar, though male patients had higher rates of doubt markers (RR: 1.25). We found increased stigmatizing labels used by nurses (RR: 1.40), and social workers (RR: 2.25), with similar patterns of doubt markers. Discussion: Stigmatizing language occurred at higher rates among historically stigmatized patients, perpetuated by multiple provider types.
zh
[NLP-83] Self-Improving Model Steering
【速读】: 该论文试图解决传统模型 steering 方法依赖外部标注数据导致的适应性差和效果受限的问题。其解决方案的关键在于提出 SIMS,这是一个无需依赖外部监督的自提升模型 steering 框架,通过自主生成和迭代优化对比样本实现动态、上下文相关的模型对齐,同时引入提示排序和对比采样等策略以提升 steering 效果。
链接: https://arxiv.org/abs/2507.08967
作者: Rongyi Zhu,Yuhui Wang,Tanqiu Jiang,Jiacheng Liang,Ting Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 9 figures
Abstract:Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.
zh
[NLP-84] From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用场景中评估能力不足的问题,特别是针对韩国工业领域知识的覆盖与可靠性。其解决方案的关键在于构建两个面向韩国专业领域的基准测试集:KMMLU-Redux和KMMLU-Pro,分别基于韩国国家技术资格考试和国家职业执照考试,以确保评测内容的准确性和实用性,并通过去除关键错误提升数据集的可靠性。
链接: https://arxiv.org/abs/2507.08924
作者: Seokhee Hong,Sunkyoung Kim,Guijin Son,Soyeon Kim,Yeonjung Hong,Jinsik Lee
机构: LG AI Research (LG人工智能研究院); OnelineAI (OnelineAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We release our dataset publicly available.
zh
[NLP-85] Evaluating LLM s in Medicine: A Call for Rigor Transparency
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在医学问答任务中的评估局限性,特别是数据集的质量问题。其关键解决方案在于建立一个标准化的评估框架,以确保数据集的临床真实性、透明度和全面性,并通过机构与政策制定者之间的协作,推动更加严谨、无偏且反映临床复杂性的评估方法和数据集的发展。
链接: https://arxiv.org/abs/2507.08916
作者: Mahmoud Alwakeel,Aditya Nagori,Vijay Krishnamoorthy,Rishikesan Kamaleswaran
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Objectives: To evaluate the current limitations of large language models (LLMs) in medical question answering, focusing on the quality of datasets used for their evaluation. Materials and Methods: Widely-used benchmark datasets, including MedQA, MedMCQA, PubMedQA, and MMLU, were reviewed for their rigor, transparency, and relevance to clinical scenarios. Alternatives, such as challenge questions in medical journals, were also analyzed to identify their potential as unbiased evaluation tools. Results: Most existing datasets lack clinical realism, transparency, and robust validation processes. Publicly available challenge questions offer some benefits but are limited by their small size, narrow scope, and exposure to LLM training. These gaps highlight the need for secure, comprehensive, and representative datasets. Conclusion: A standardized framework is critical for evaluating LLMs in medicine. Collaborative efforts among institutions and policymakers are needed to ensure datasets and methodologies are rigorous, unbiased, and reflective of clinical complexities.
zh
[NLP-86] SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)系统中多语言安全对齐(safety alignment)不足的问题,特别是针对低资源语言中的不安全和越狱提示(jailbreak prompts)检测能力薄弱的问题。解决方案的关键在于提出SEALGuard,一个基于低秩适配(low-rank adaptation, LoRA)的多语言防护机制,并构建了SEALSBench数据集以支持多语言安全对齐的评估。通过这一方法,SEALGuard显著提升了对多语言不安全和越狱提示的检测性能。
链接: https://arxiv.org/abs/2507.08898
作者: Wenliang Shan,Michael Fu,Rui Yang,Chakkrit(Kla)Tantithamthavorn
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review at Information and Software Technology
Abstract:Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., ``How to create a bomb?‘’), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. SEALGuard advances the safety alignment of LLM systems by introducing an effective multilingual guardrail.
zh
[NLP-87] Overview of the TREC 2023 deep learning track
【速读】: 该论文旨在评估生成式 AI (Generative AI) 在信息检索任务中的表现,特别是在基于MS MARCO数据集的段落和文档排序任务中。其关键解决方案是通过使用经过微调的T5模型和GPT-4提示生成合成查询,并与传统的人工标注查询进行对比,以测试基于大型语言模型 (Large Language Model, LLM) 的提示方法是否优于以往的“nnlm”方法。研究发现,LLM提示方法在某些情况下表现更优,表明其在信息检索任务中的潜力。
链接: https://arxiv.org/abs/2507.08890
作者: Nick Craswell,Bhaskar Mitra,Emine Yilmaz,Hossein A. Rahmani,Daniel Campos,Jimmy Lin,Ellen M. Voorhees,Ian Soboroff
机构: Microsoft(微软); University College London(伦敦大学学院); Amazon(亚马逊); Snowflake(雪花); University of Waterloo(滑铁卢大学); NIST(美国国家标准与技术研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2507.08191
Abstract:This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year’s design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as primary and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out, unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual MS MARCO (human) queries from MS MARCO, this year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the “nnlm” approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of \tau=0.8487 . However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries. Comments: arXiv admin note: substantial text overlap with arXiv:2507.08191 Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2507.08890 [cs.IR] (or arXiv:2507.08890v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2507.08890 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-88] Less Stress More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers
【速读】: 该论文试图解决在遵守隐私保护法规(如GDPR)的前提下,对航空交通管制员(ATCO)语音数据进行压力检测的问题。解决方案的关键在于对ATCO语音数据进行匿名化处理,同时保持深度学习模型在压力检测任务中的高性能表现,实验结果表明,匿名化数据仍可支持高精度的压力检测模型构建。
链接: https://arxiv.org/abs/2507.08882
作者: Janaki Viswanathan,Alexander Blatt,Konrad Hagemann,Dietrich Klakow
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 8 pages, 2 figures, 4 tables, publication identification number (URN)- urn:nbn:de:101:1-2022122008393409239462, see archived online publication- this https URL Katalogeintrag: this https URL
Abstract:Air traffic control (ATC) demands multi-tasking under time pressure with high consequences of an error. This can induce stress. Detecting stress is a key point in maintaining the high safety standards of ATC. However, processing ATC voice data entails privacy restrictions, e.g. the General Data Protection Regulation (GDPR) law. Anonymizing the ATC voice data is one way to comply with these restrictions. In this paper, different architectures for stress detection for anonymized ATCO speech are evaluated. Our best networks reach a stress detection accuracy of 93.6% on an anonymized version of the Speech Under Simulated and Actual Stress (SUSAS) dataset and an accuracy of 80.1% on our anonymized ATC simulation dataset. This shows that privacy does not have to be an impediment in building well-performing deep-learning-based models.
zh
[NLP-89] Spatial ModernBERT: Spatial-Aware Transformer for Table and Key-Value Extraction in Financial Documents at Scale
【速读】: 该论文旨在解决从复杂财务文档中准确提取表格数据和键值对的问题,这对于审计、数据分析和自动化发票处理等业务流程至关重要。其解决方案的关键在于引入Spatial ModernBERT模型,该模型基于Transformer架构并增强了空间嵌入,通过三个分类头(标签头、列头和行头)实现对文本中标签、列索引和行类型的联合识别,并结合后处理方法利用B-I-IB标注策略合并令牌以重建表格布局和提取键值对。
链接: https://arxiv.org/abs/2507.08865
作者: Javis AI Team:Amrendra Singh,Maulik Shah,Dharshan Sampath
机构: Javis AI Team(贾维斯人工智能团队)
类目: Computation and Language (cs.CL)
备注:
Abstract:Extracting tables and key-value pairs from financial documents is essential for business workflows such as auditing, data analytics, and automated invoice processing. In this work, we introduce Spatial ModernBERT-a transformer-based model augmented with spatial embeddings-to accurately detect and extract tabular data and key-value fields from complex financial documents. We cast the extraction task as token classification across three heads: (1) Label Head, classifying each token as a label (e.g., PO Number, PO Date, Item Description, Quantity, Base Cost, MRP, etc.); (2) Column Head, predicting column indices; (3) Row Head, distinguishing the start of item rows and header rows. The model is pretrained on the PubTables-1M dataset, then fine-tuned on a financial document dataset, achieving robust performance through cross-entropy loss on each classification head. We propose a post-processing method to merge tokens using B-I-IB tagging, reconstruct the tabular layout, and extract key-value pairs. Empirical evaluation shows that Spatial ModernBERT effectively leverages both textual and spatial cues, facilitating highly accurate table and key-value extraction in real-world financial documents.
zh
[NLP-90] RAG Safety: Exploring Knowledge Poisoning Attacks to Retrieval-Augmented Generation
【速读】: 该论文试图解决知识图谱增强的生成(KG-RAG)方法在面对数据中毒攻击时的安全性问题,特别是针对知识图谱结构化和可编辑特性所带来的独特漏洞。解决方案的关键在于提出一种实用且隐蔽的攻击策略,该策略首先识别对抗性目标答案,然后在知识图谱中插入扰动三元组以构建误导性的推理链,从而增加KG-RAG系统在生成过程中检索并依赖这些扰动的可能性。
链接: https://arxiv.org/abs/2507.08862
作者: Tianzhe Zhao,Jiaoyan Chen,Yanchi Ru,Haiping Zhu,Nan Hu,Jun Liu,Qika Lin
机构: XJTU(西安交通大学); University of Manchester(曼彻斯特大学); Southeast University(东南大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 13 pages, 6 figures
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving external data to mitigate hallucinations and outdated knowledge issues. Benefiting from the strong ability in facilitating diverse data sources and supporting faithful reasoning, knowledge graphs (KGs) have been increasingly adopted in RAG systems, giving rise to KG-based RAG (KG-RAG) methods. Though RAG systems are widely applied in various applications, recent studies have also revealed its vulnerabilities to data poisoning attacks, where malicious information injected into external knowledge sources can mislead the system into producing incorrect or harmful responses. However, these studies focus exclusively on RAG systems using unstructured textual data sources, leaving the security risks of KG-RAG largely unexplored, despite the fact that KGs present unique vulnerabilities due to their structured and editable nature. In this work, we conduct the first systematic investigation of the security issue of KG-RAG methods through data poisoning attacks. To this end, we introduce a practical, stealthy attack setting that aligns with real-world implementation. We propose an attack strategy that first identifies adversarial target answers and then inserts perturbation triples to complete misleading inference chains in the KG, increasing the likelihood that KG-RAG methods retrieve and rely on these perturbations during generation. Through extensive experiments on two benchmarks and four recent KG-RAG methods, our attack strategy demonstrates strong effectiveness in degrading KG-RAG performance, even with minimal KG perturbations. In-depth analyses are also conducted to understand the safety threats within the internal stages of KG-RAG systems and to explore the robustness of LLMs against adversarial knowledge.
zh
[NLP-91] LoRA Is Slower Than You Think
【速读】: 该论文试图解决低秩适应(Low-Rank Adaptation, LoRA)在不同模型架构和训练设置中未能始终提供速度提升的问题。其解决方案的关键在于对LoRA性能进行系统分析,识别影响其加速效果的潜在因素,并基于研究结果提出更高效的大型语言模型(Large Language Models, LLMs)微调方法,以实现更稳定和显著的训练速度提升。
链接: https://arxiv.org/abs/2507.08833
作者: Seokmin Ko
机构: Yonsei University (延世大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Low-Rank Adaptation (LoRA) is one of the most widely used techniques for fine-tuning large language models (LLMs). By introducing a small number of trainable low-rank weight matrices, LoRA substantially reduces the number of parameters that need to be updated, offering significant advantages in memory consumption and computational efficiency compared to full fine-tuning. However, we observed that LoRA does not consistently provide speed improvements across all model architectures and training setups. Motivated by this inconsistency, we conduct a comprehensive analysis of LoRA’s performance and investigate the underlying factors limiting its speedup. Based on our findings, we propose several methods for more efficient fine-tuning of LLMs. We empirically evaluate these methods and compare them to LoRA, demonstrating that our approach achieves comparable or superior performance while delivering more consistent training speed improvements. Our work offers valuable insights and practical guidelines for practitioners seeking to optimize LLM fine-tuning under resource constraints.
zh
[NLP-92] hink Clearly: Improving Reasoning via Redundant Token Pruning
【速读】: 该论文试图解决大型语言模型在进行长文本推理时存在的推理路径冗余问题,这种冗余导致注意力分布广泛且不集中,尤其在产生错误答案时表现出更高的注意力稀疏性。解决方案的关键在于通过测量到一个特殊结束思考标记(end-of-thinking token)的token级注意力分数,系统地识别推理中的冗余,并采用结构感知剪枝策略,优先移除低贡献推理片段中的token,从而提升推理效率与准确性。
链接: https://arxiv.org/abs/2507.08806
作者: Daewon Choi,Jimin Lee,Jihoon Tack,Woomin Song,Saket Dingliwal,Sai Muralidhar Jayanthi,Bhavana Ganesh,Jinwoo Shin,Aram Galstyan,Sravan Babu Bodapati
机构: KAIST(韩国科学技术院); Korea University(韩国大学); Amazon AGI(亚马逊人工智能小组)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, particularly incorrect answers exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves performance through clear thinking, i.e., removing distraction. Specifically, we systematically identify reasoning redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction, then resume the reasoning generation. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematical competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent.
zh
[NLP-93] Principled Foundations for Preference Optimization
【速读】: 该论文试图解决生成式 AI (Generative AI) 模型在学习偏好过程中如何与理论框架建立联系的问题,特别是通过直接偏好优化(DPO)与贝叶斯损失函数(Savage)及随机选择理论(Doignon-Falmagne 和 Machina)之间的关系进行深入分析。其解决方案的关键在于揭示 DPO 作为连接上述两种理论的特定形式,并在广泛适用的框架下,支持选择理论中的弃权机制、机器学习中的非凸目标函数,以及 DPO 设置的一些重要扩展,如边界调整和长度修正。这一理论视角有助于理解 DPO 的运作机制及其应用范围,同时为探索其局限性及改进方法提供基础。
链接: https://arxiv.org/abs/2507.07855
作者: Wenxuan Zhou,Shujian Zhang,Brice Magdalou,John Lambert,Ehsan Amid,Richard Nock,Andrew Hard
机构: Google DeepMind(谷歌深度思维); CEE-M, Montpellier U.(CEE-M,蒙彼利埃大学); Google Research(谷歌研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:In this paper, we show that direct preference optimization (DPO) is a very specific form of a connection between two major theories in the ML context of learning from preferences: loss functions (Savage) and stochastic choice (Doignon-Falmagne and Machina). The connection is established for all of Savage’s losses and at this level of generality, (i) it includes support for abstention on the choice theory side, (ii) it includes support for non-convex objectives on the ML side, and (iii) it allows to frame for free some notable extensions of the DPO setting, including margins and corrections for length. Getting to understand how DPO operates from a general principled perspective is crucial because of the huge and diverse application landscape of models, because of the current momentum around DPO, but also – and importantly – because many state of the art variations on DPO definitely occupy a small region of the map that we cover. It also helps to understand the pitfalls of departing from this map, and figure out workarounds.
zh
[NLP-94] Natural Language-based Assessment of L2 Oral Proficiency using LLM s
【速读】: 该论文试图解决如何利用自然语言描述的评估标准(Natural Language-Based Assessment, NLA)来评估大型语言模型(LLM)在零样本设置下的表现问题。其解决方案的关键在于使用可解释的、广泛适用的语言描述符,通过仅依赖文本信息的方式对SI语料库中的响应进行评估,从而实现与人类评估相媲美的效果,并展现出在不同任务设置下的泛化能力和可解释性。
链接: https://arxiv.org/abs/2507.10200
作者: Stefano Bannò,Rao Ma,Mengjie Qian,Siyuan Tang,Kate Knill,Mark Gales
机构: ALTA Institute, Machine Intelligence Lab
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for the 10th Workshop on Speech and Language Technology in Education (SLaTE 2025)
Abstract:Natural language-based assessment (NLA) is an approach to second language assessment that uses instructions - expressed in the form of can-do descriptors - originally intended for human examiners, aiming to determine whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available SI Corpus in a zero-shot setting. Our results show that this approach - relying solely on textual information - achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
zh
[NLP-95] ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
【速读】: 该论文旨在解决语音对话生成中的挑战,包括真实对话轮换和不同说话人音色的区分,同时应对现有自回归模型在推理速度和稳定性上的不足。其解决方案的关键在于引入ZipVoice-Dialog,这是一个基于流匹配(flow matching)的非自回归零样本语音对话生成模型,核心设计包括说话人轮换嵌入以实现精确的对话轮换、课程学习策略以确保稳定的语音-文本对齐以及专门的策略以支持立体对话生成。
链接: https://arxiv.org/abs/2507.09318
作者: Han Zhu,Wei Kang,Liyong Guo,Zengwei Yao,Fangjun Kuang,Weiji Zhuang,Zhaoqing Li,Zhifeng Han,Dong Zhang,Xin Zhang,Xingchen Song,Long Lin,Daniel Povey
机构: Xiaomi Corp.(小米公司)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at this https URL.
zh
计算机视觉
[CV-0] Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder WWW
【速读】:该论文试图解决野生动物监测中个体动物手动识别效率低下的问题,特别是在缺乏标注数据的情况下。解决方案的关键在于提出一种完全自监督的方法,通过DINOv2框架在未标注的相机陷阱视频中学习鲁棒的黑猩猩面部嵌入表示,从而实现无需身份标签的开放集重识别任务。
链接: https://arxiv.org/abs/2507.10552
作者: Vladimir Iashin,Horace Lee,Dan Schofield,Andrew Zisserman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication. Project page, code and weights: this https URL
Abstract:Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, the manual identification of individual animals remains a significant bottleneck. This study introduces a fully self-supervised approach to learning robust chimpanzee face embeddings from unlabeled camera-trap footage. Leveraging the DINOv2 framework, we train Vision Transformers on automatically mined face crops, eliminating the need for identity labels. Our method demonstrates strong open-set re-identification performance, surpassing supervised baselines on challenging benchmarks such as Bossou, despite utilising no labelled data during training. This work underscores the potential of self-supervised learning in biodiversity monitoring and paves the way for scalable, non-invasive population studies.
zh
[CV-1] Quantize-then-Rectify: Efficient VQ-VAE Training
【速读】:该论文旨在解决高压缩率向量量化变分自编码器(VQ-VAE)训练计算成本高昂的问题,传统方法通常需要数千小时的GPU计算时间。其解决方案的关键在于利用预训练的变分自编码器(VAE),通过在VAE的容错阈值内控制量化噪声,高效地将其转换为VQ-VAE。论文提出的\textbf{Quantize-then-Rectify (ReVQ)}框架结合了通道多组量化以扩大代码本容量以及后修正模块以减轻量化误差,从而在保持重建质量的同时显著降低训练成本。
链接: https://arxiv.org/abs/2507.10547
作者: Borui Zhang,Qihang Rao,Wenzhao Zheng,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE’s tolerance threshold. We present \textbfQuantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbfchannel multi-group quantization to enlarge codebook capacity and a \textbfpost rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
zh
[CV-2] ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions SIGGRAPH2025
【速读】:该论文旨在解决生成高保真、实时动画的逼真三维头像的问题,特别是在渲染数字头像特写时展现面部微表情和细微动作的挑战。其解决方案的关键在于将局部定义的面部表情与3D高斯点云(3D Gaussian Splatting)相结合,通过基于区域的局部表达特征来驱动头像动态,并在区域级别合成3D高斯分布。该方法利用基于区域的几何3D人脸模型提取区域表达,并通过与Scaffold-GS的锚点耦合,学习如何将这些表达转化为局部动态皮肤外观和运动,从而实现高质量、实时的逼真三维头像生成。
链接: https://arxiv.org/abs/2507.10542
作者: Shivangi Aneja,Sebastian Weiss,Irene Baeza,Prashanth Chandran,Gaspard Zoss,Matthias Nießner,Derek Bradley
机构: Technical University of Munich(慕尼黑工业大学); DisneyResearch—StudiosZurichCAS(迪士尼研究院—苏黎世工作室); Swiss Federal Institute of Technology Zurich(苏黎世联邦理工学院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: (SIGGRAPH 2025) Paper Video: this https URL Project Page: this https URL
Abstract:Generating high-fidelity real-time animated sequences of photorealistic 3D head avatars is important for many graphics applications, including immersive telepresence and movies. This is a challenging problem particularly when rendering digital avatar close-ups for showing character’s facial microfeatures and expressions. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple locally-defined facial expressions with 3D Gaussian splatting to enable creating ultra-high fidelity, expressive and photorealistic 3D head avatars. In contrast to previous works that operate on a global expression space, we condition our avatar’s dynamics on patch-based local expression features and synthesize 3D Gaussians at a patch level. In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion by coupling the patches with anchor points of Scaffold-GS, a recent hierarchical scene representation. These anchors are then used to synthesize 3D Gaussians on-the-fly, conditioned by patch-expressions and viewing direction. We employ color-based densification and progressive training to obtain high-quality results and faster convergence for high resolution 3K training images. By leveraging patch-level expressions, ScaffoldAvatar consistently achieves state-of-the-art performance with visually natural motion, while encompassing diverse facial expressions and styles in real time.
zh
[CV-3] Scene-Aware Conversational ADAS with Generative AI for Real-Time Driver Assistance
【速读】:该论文试图解决当前高级驾驶辅助系统(ADAS)在场景上下文理解及通过自然语言与驾驶员交互方面的局限性。现有系统依赖于预定义逻辑,缺乏对话式交互支持,导致在动态环境或适应驾驶员意图时不够灵活。解决方案的关键在于提出一种名为Scene-Aware Conversational ADAS (SC-ADAS)的模块化框架,该框架整合了生成式AI组件,包括大语言模型、视觉到文本的解释以及结构化功能调用,以实现实时、可解释且自适应的驾驶辅助。
链接: https://arxiv.org/abs/2507.10500
作者: Kyungtae Han,Yitao Chen,Rohit Gupta,Onur Altintas
机构: Toyota Motor North America(丰田汽车北美公司)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:While autonomous driving technologies continue to advance, current Advanced Driver Assistance Systems (ADAS) remain limited in their ability to interpret scene context or engage with drivers through natural language. These systems typically rely on predefined logic and lack support for dialogue-based interaction, making them inflexible in dynamic environments or when adapting to driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a modular framework that integrates Generative AI components including large language models, vision-to-text interpretation, and structured function calling to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS supports multi-turn dialogue grounded in visual and sensor context, allowing natural language recommendations and driver-confirmed ADAS control. Implemented in the CARLA simulator with cloud-based Generative AI, the system executes confirmed user intents as structured ADAS commands without requiring model fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and revisited multi-turn interactions, highlighting trade-offs such as increased latency from vision-based context retrieval and token growth from accumulated dialogue history. These results demonstrate the feasibility of combining conversational reasoning, scene perception, and modular ADAS control to support the next generation of intelligent driver assistance.
zh
[CV-4] National level satellite-based crop field inventories in smallholder landscapes
【速读】:该论文旨在解决小农户农业可持续性政策设计中因对基础系统属性(如活跃耕地的空间分布和田块规模)理解有限而面临的挑战。其关键解决方案是整合超高空间分辨率(1.5米)的地球观测数据与深度迁移学习技术,以在国家尺度上精确提取复杂农业系统中的作物田块边界,同时保持最低的参考数据需求并提高模型的可迁移性。该方法生成了莫桑比克2023年的全国性田块数据集,覆盖约80万平方公里,实现了93%的整体准确率,并在田块级空间一致性上达到中位数交并比(IoU)0.81,显著提升了复杂小农户系统中大范围田块划分的精度。
链接: https://arxiv.org/abs/2507.10499
作者: Philippe Rufin,Pauline Lucie Hammer,Leon-Friedrich Thomas,Sá Nogueira Lisboa,Natasha Ribeiro,Almeida Sitoe,Patrick Hostert,Patrick Meyfroidt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The design of science-based policies to improve the sustainability of smallholder agriculture is challenged by a limited understanding of fundamental system properties, such as the spatial distribution of active cropland and field size. We integrate very high spatial resolution (1.5 m) Earth observation data and deep transfer learning to derive crop field delineations in complex agricultural systems at the national scale, while maintaining minimum reference data requirements and enhancing transferability. We provide the first national-level dataset of 21 million individual fields for Mozambique (covering ~800,000 km2) for 2023. Our maps separate active cropland from non-agricultural land use with an overall accuracy of 93% and balanced omission and commission errors. Field-level spatial agreement reached median intersection over union (IoU) scores of 0.81, advancing the state-of-the-art in large-area field delineation in complex smallholder systems. The active cropland maps capture fragmented rural regions with low cropland shares not yet identified in global land cover or cropland maps. These regions are mostly located in agricultural frontier regions which host 7-9% of the Mozambican population. Field size in Mozambique is very low overall, with half of the fields being smaller than 0.16 ha, and 83% smaller than 0.5 ha. Mean field size at aggregate spatial resolution (0.05°) is 0.32 ha, but it varies strongly across gradients of accessibility, population density, and net forest cover change. This variation reflects a diverse set of actors, ranging from semi-subsistence smallholder farms to medium-scale commercial farming, and large-scale farming operations. Our results highlight that field size is a key indicator relating to socio-economic and environmental outcomes of agriculture (e.g., food production, livelihoods, deforestation, biodiversity), as well as their trade-offs.
zh
[CV-5] Cameras as Relative Positional Encoding WWW
【速读】:该论文旨在解决多视角计算机视觉任务中如何有效利用相机几何关系以提升3D感知性能的问题。其解决方案的关键在于通过不同的相机条件化技术将视觉标记与3D空间进行对齐,其中核心创新是提出了一种新的相对编码方法——投影位置编码(Projective Positional Encoding, PRoPE),该方法能够捕捉完整的相机锥体信息,包括内参和外参,作为相对位置编码,从而更全面地建模视角间的几何关系。
链接: https://arxiv.org/abs/2507.10496
作者: Ruilong Li,Brent Yi,Junchen Liu,Hang Gao,Yi Ma,Angjoo Kanazawa
机构: UC Berkeley (加州大学伯克利分校); NVIDIA (英伟达); HKU (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose – Projective Positional Encoding (PRoPE) – that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
zh
[CV-6] BenchReAD: A systematic benchmark for retinal anomaly detection MICCAI2025
【速读】:该论文试图解决视网膜异常检测领域缺乏全面且公开可用的基准问题,这一问题限制了方法的公平评估与进一步发展。现有研究受限于异常类型单一、测试集饱和以及缺乏泛化性评估,同时多数基准仅关注单类监督方法,忽视了临床中大量存在的标注异常数据和未标注数据。论文提出的解决方案关键在于引入一个全面系统的视网膜异常检测基准,并通过结合解耦异常表示(DRA)与正常特征记忆(Normal Feature Memory)的NFM-DRA方法,有效缓解了在遇到未见异常时性能下降的问题,从而建立了新的最先进水平(SOTA)。
链接: https://arxiv.org/abs/2507.10492
作者: Chenyu Lian,Hong-Yu Zhou,Zhanli Hu,Jing Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: MICCAI 2025
Abstract:Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at this https URL.
zh
[CV-7] he Power of Certainty: How Confident Models Lead to Better Segmentation
【速读】:该论文旨在解决深度学习模型在结肠镜检查中进行息肉检测和精确分割时存在的过拟合问题以及跨数据集泛化能力不足的问题。其解决方案的关键在于提出一种基于置信度的自蒸馏方法,该方法通过利用训练过程中前一迭代的数据存储来计算批次内当前与前一迭代之间的损失,采用动态置信度系数,从而在不增加额外计算或内存消耗的情况下提升模型性能。
链接: https://arxiv.org/abs/2507.10490
作者: Tugberk Erol,Tuba Caglikantar,Duygu Sarikaya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures
Abstract:Deep learning models have been proposed for automatic polyp detection and precise segmentation of polyps during colonoscopy procedures. Although these state-of-the-art models achieve high performance, they often require a large number of parameters. Their complexity can make them prone to overfitting, particularly when trained on biased datasets, and can result in poor generalization across diverse datasets. Knowledge distillation and self-distillation are proposed as promising strategies to mitigate the limitations of large, over-parameterized models. These approaches, however, are resource-intensive, often requiring multiple models and significant memory during training. We propose a confidence-based self-distillation approach that outperforms state-of-the-art models by utilizing only previous iteration data storage during training, without requiring extra computation or memory usage during testing. Our approach calculates the loss between the previous and current iterations within a batch using a dynamic confidence coefficient. To evaluate the effectiveness of our approach, we conduct comprehensive experiments on the task of polyp segmentation. Our approach outperforms state-of-the-art models and generalizes well across datasets collected from multiple clinical centers. The code will be released to the public once the paper is accepted.
zh
[CV-8] Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation
【速读】:该论文旨在解决老年人跌倒检测的问题,以降低因跌倒导致的伤害风险并减少医疗成本和恢复时间。其解决方案的关键在于提出一种融合多种互补系统的框架,包括基于半监督联邦学习的跌倒检测系统(SF2D)、室内定位与导航系统以及基于视觉的人体跌倒识别系统。该框架通过多系统协同工作,实现了高准确性和可靠性,同时保障了用户隐私。
链接: https://arxiv.org/abs/2507.10474
作者: Seyed Alireza Rahimi Azghadi,Truong-Thanh-Hung Nguyen,Helene Fournier,Monica Wachowicz,Rene Richard,Francis Palma,Hung Cao
机构: Analytics Everywhere Lab, University of New Brunswick, Canada; SE+AI Research Lab, University of New Brunswick, Canada; National Research Council, New Brunswick, Canada; RMIT University, Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:The aging population is growing rapidly, and so is the danger of falls in older adults. A major cause of injury is falling, and detection in time can greatly save medical expenses and recovery time. However, to provide timely intervention and avoid unnecessary alarms, detection systems must be effective and reliable while addressing privacy concerns regarding the user. In this work, we propose a framework for detecting falls using several complementary systems: a semi-supervised federated learning-based fall detection system (SF2D), an indoor localization and navigation system, and a vision-based human fall recognition system. A wearable device and an edge device identify a fall scenario in the first system. On top of that, the second system uses an indoor localization technique first to localize the fall location and then navigate a robot to inspect the scenario. A vision-based detection system running on an edge device with a mounted camera on a robot is used to recognize fallen people. Each of the systems of this proposed framework achieves different accuracy rates. Specifically, the SF2D has a 0.81% failure rate equivalent to 99.19% accuracy, while the vision-based fallen people detection achieves 96.3% accuracy. However, when we combine the accuracy of these two systems with the accuracy of the navigation system (95% success rate), our proposed framework creates a highly reliable performance for fall detection, with an overall accuracy of 99.99%. Not only is the proposed framework safe for older adults, but it is also a privacy-preserving solution for detecting falls.
zh
[CV-9] GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space ICCV2025
【速读】:该论文试图解决基于视觉信息的图像时间戳预测问题,旨在通过图像内容确定拍摄时间,支持元数据校正、检索和数字取证等应用。其核心挑战在于视觉线索(如亮度、色调和阴影位置)高度依赖地理环境,使得时间预测与地理定位紧密相关。解决方案的关键是提出GT-Loc方法,该方法通过联合预测图像的拍摄时间(小时和月份)与地理坐标(GPS坐标),采用图像、时间和位置的独立编码器,并在共享的高维特征空间中对齐嵌入。为处理时间的周期性特性,引入了一种基于时间度量学习的目标函数,通过建模周期性环面表面的成对时间差异提供软目标,从而提升时间预测性能。
链接: https://arxiv.org/abs/2507.10473
作者: David G. Shatwell,Ishan Rajendrakumar Dave,Sirnam Swetha,Mubarak Shah
机构: University of Central Florida (佛罗里达中央大学); Adobe (Adobe)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICCV2025
Abstract:Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
zh
[CV-10] RefSTAR: Blind Facial Image Restoration with Reference Selection Transfer and Reconstruction
【速读】:该论文试图解决盲式面部图像恢复中由于未知复杂退化和人类对人脸的高度敏感性而导致的身份保留问题,特别是在引入细节纹理特征时的不当处理。其解决方案的关键在于有效融合高质量参考图像中的适当特征,提出了一种考虑参考选择、迁移和重建的新型方法(RefSTAR)。该方法通过构建参考选择模块(RefSel)和RefSel-HQ数据集进行训练,设计特征融合范式以避免传统交叉注意力操作中的平凡解,并引入参考图像重建机制以确保参考特征在输出图像中的存在,同时结合掩码重新设计循环一致性损失,从而实现更优的身份保留能力和参考特征迁移质量。
链接: https://arxiv.org/abs/2507.10470
作者: Zhicun Yin,Junjie Chen,Ming Liu,Zhixin Wang,Fan Li,Renjing Pei,Xiaoming Li,Rynson W.H. Lau,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Nanyang Technological University (南洋理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotating masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at this https URL.
zh
[CV-11] RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening
【速读】:该论文旨在解决遥感中图像融合问题,即如何将高分辨率全色(PAN)图像与低分辨率多光谱(MS)图像融合以生成高质量的融合产品。其解决方案的关键在于提出RAPNet架构,该架构引入了内容自适应卷积——Receptive-field Adaptive Pansharpening Convolution (RAPConv),通过生成空间自适应卷积核来响应局部特征上下文,从而提升空间细节提取的精度。此外,网络还集成了Pansharpening Dynamic Feature Fusion (PAN-DFF)模块,利用注意力机制实现空间细节增强与光谱保真度之间的最佳平衡。
链接: https://arxiv.org/abs/2507.10461
作者: Tao Tang,Chengxu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: To appear in the proceedings of the 6th International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2025). 5 pages, 6 figures
Abstract:Pansharpening refers to the process of integrating a high resolution panchromatic (PAN) image with a lower resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.
zh
[CV-12] CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
【速读】:该论文试图解决珊瑚礁图像分析中因领域专业知识需求而导致的视觉问答(VQA)应用困难问题。解决方案的关键在于构建一个专门针对珊瑚礁的大型VQA数据集——CoralVQA,该数据集包含来自三大洋的67种珊瑚属的12,805张真实图像及277,653对问题-答案对,旨在支持生态和健康状况的全面评估。为确保数据集的规模与专业质量,研究团队与海洋生物学家合作开发了半自动的数据构建流程。
链接: https://arxiv.org/abs/2507.10449
作者: Hongyong Han,Wei Wang,Gaowei Zhang,Mingjie Li,Yi Wang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Technology Innovation Center for South China Sea Remote Sensing, Surveying and Mapping Collaborative Application, Ministry of Natural Resources (南海遥感测绘协同应用技术创新中心,自然资源部); South China Sea Development Research Institute, Ministry of Natural Resources (南海发展研究院,自然资源部)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.
zh
[CV-13] 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
【速读】:该论文试图解决从视频中重建可动画化3D动物的问题,现有方法依赖于稀疏语义关键点来拟合参数化模型,但获取这些关键点耗时且在有限动物数据上训练的关键点检测器往往不可靠。解决方案的关键在于提出4D-Animal框架,该框架无需稀疏关键点标注即可从视频中重建可动画化3D动物,其核心包括一个将2D表示映射到SMAL参数的密集特征网络,以及一种结合轮廓、部件级、像素级和时间线索的分层对齐策略,从而提升拟合过程的效率和稳定性,并实现跨帧的准确且时间一致的重建。
链接: https://arxiv.org/abs/2507.10437
作者: Shanshan Zhong,Jiawei Peng,Zehan Zheng,Zhongzhan Huang,Wufei Ma,Guofeng Zhang,Qihao Liu,Alan Yuille,Jieneng Chen
机构: Johns Hopkins University (约翰霍普金斯大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. To address this, we propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process. Furthermore, we develop a hierarchical alignment strategy that integrates silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D visual models to produce accurate and temporally coherent reconstructions across frames. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code is released at this https URL.
zh
[CV-14] CLA: Latent Alignment for Online Continual Self-Supervised Learning
【速读】:该论文试图解决在线持续学习(Online Continual Learning, Online CL)场景下的模型遗忘问题,即在数据以小批量形式连续到达、模型需满足固定计算预算且无明确任务边界的情况下,如何保持模型对先前知识的遗忘。解决方案的关键在于提出一种名为持续潜在对齐(Continual Latent Alignment, CLA)的新型自监督学习策略,通过将当前模型学到的表示与过去表示进行对齐,从而减轻遗忘现象。
链接: https://arxiv.org/abs/2507.10434
作者: Giacomo Cignoni,Andrea Cossu,Alexandra Gomez-Villa,Joost van de Weijer,Antonio Carta
机构: University of Pisa(比萨大学); Computer Vision Center (CVC)(计算机视觉中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CoLLAs 2025 conference
Abstract:Self-supervised learning (SSL) is able to build latent representations that generalize well to unseen data. However, only a few SSL techniques exist for the online CL setting, where data arrives in small minibatches, the model must comply with a fixed computational budget, and task boundaries are absent. We introduce Continual Latent Alignment (CLA), a novel SSL strategy for Online CL that aligns the representations learned by the current model with past representations to mitigate forgetting. We found that our CLA is able to speed up the convergence of the training process in the online scenario, outperforming state-of-the-art approaches under the same computational budget. Surprisingly, we also discovered that using CLA as a pretraining protocol in the early stages of pretraining leads to a better final performance when compared to a full i.i.d. pretraining.
zh
[CV-15] xt-Visual Semantic Constrained AI-Generated Image Quality Assessment
【速读】:该论文旨在解决人工智能生成图像(AGI)质量评估中存在语义错位和细节感知缺失的问题。其解决方案的关键在于提出一种统一框架——文本-视觉语义约束的AI生成图像质量评估(SC-AGIQA),该框架通过引入两个核心模块:文本辅助语义对齐模块(TSAM)和频域细粒度退化感知模块(FFDPM),分别利用多模态大语言模型(MLLMs)进行语义一致性检查以及基于频域分析与感知敏感性加权来捕捉细微视觉失真,从而提升对文本-图像一致性和感知失真的综合评估能力。
链接: https://arxiv.org/abs/2507.10432
作者: Qiang Li,Qingsen Yan,Haojian Huang,Peng Wu,Haokui Zhang,Yanning Zhang
机构: Northwest Polytechnical University(西北工业大学); The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, Accepted at ACMMM 2025
Abstract:With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of their quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and details perception missing. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at this https URL.
zh
[CV-16] Numerically Computing Galois Groups of Minimal Problems
【速读】:该论文试图解决的是求解参数化代数方程组(parametric family of systems of algebraic equations)多个实例的问题,这一问题在计算机视觉领域中的鲁棒模型拟合方法(如“随机采样与共识”即RanSaC)中具有实际应用价值。解决方案的关键在于衡量这类参数化系统的内在难度,并探索可行的实用求解方法。
链接: https://arxiv.org/abs/2507.10407
作者: Timothy Duff
机构: University of Missouri - Columbia(密苏里大学哥伦比亚分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC); Algebraic Geometry (math.AG)
备注: abstract accompanying invited tutorial at ISSAC 2025; 10 pages w/ references
Abstract:I discuss a seemingly unlikely confluence of topics in algebra, numerical computation, and computer vision. The motivating problem is that of solving multiples instances of a parametric family of systems of algebraic (polynomial or rational function) equations. No doubt already of interest to ISSAC attendees, this problem arises in the context of robust model-fitting paradigms currently utilized by the computer vision community (namely “Random Sampling and Consensus”, aka “RanSaC”.) This talk will give an overview of work in the last 5+ years that aspires to measure the intrinsic difficulty of solving such parametric systems, and makes strides towards practical solutions.
zh
[CV-17] Improving Remote Sensing Classification using Topological Data Analysis and Convolutional Neural Networks
【速读】:该论文旨在解决深度学习模型在遥感分类任务中对纹理等局部特征的偏好,从而限制了其对更全局、几何结构信息的利用问题。解决方案的关键在于提出一种基于拓扑数据分析(Topological Data Analysis, TDA)的特征工程流程,并将其与深度学习模型进行简单融合,以增强模型对复杂数据集的描述能力。通过将TDA提取的拓扑特征引入ResNet18模型,显著提升了EuroSAT和RESISC45数据集上的分类性能,证明了TDA特征在没有显式拓扑结构的数据集上也能有效提升深度学习模型的泛化能力。
链接: https://arxiv.org/abs/2507.10381
作者: Aaryam Sharma
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 8 figures
Abstract:Topological data analysis (TDA) is a relatively new field that is gaining rapid adoption due to its robustness and ability to effectively describe complex datasets by quantifying geometric information. In imaging contexts, TDA typically models data as filtered cubical complexes from which we can extract discriminative features using persistence homology. Meanwhile, convolutional neural networks (CNNs) have been shown to be biased towards texture based local features. To address this limitation, we propose a TDA feature engineering pipeline and a simple method to integrate topological features with deep learning models on remote sensing classification. Our method improves the performance of a ResNet18 model on the EuroSAT dataset by 1.44% achieving 99.33% accuracy, which surpasses all previously reported single-model accuracies, including those with larger architectures, such as ResNet50 (2x larger) and XL Vision Transformers (197x larger). We additionally show that our method’s accuracy is 1.82% higher than our ResNet18 baseline on the RESISC45 dataset. To our knowledge, this is the first application of TDA features in satellite scene classification with deep learning. This demonstrates that TDA features can be integrated with deep learning models, even on datasets without explicit topological structures, thereby increasing the applicability of TDA. A clean implementation of our method will be made publicly available upon publication.
zh
[CV-18] st-Time Canonicalization by Foundation Models for Robust Perception ICML2025
【速读】:该论文试图解决现实世界视觉感知中对多种变换的不变性问题,当前方法依赖于专用架构或预定义增强的训练,限制了泛化能力。解决方案的关键在于提出FOCAL,一个基于数据驱动的测试时框架,通过利用基础模型中的互联网规模视觉先验来实现鲁棒感知。FOCAL通过生成并优化候选变换以达到视觉典型的“规范”视图,从而提升鲁棒性,而无需重新训练或改变架构。
链接: https://arxiv.org/abs/2507.10375
作者: Utkarsh Singhal,Ryan Feng,Stella X. Yu,Atul Prakash
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published at ICML 2025
Abstract:Real-world visual perception requires invariance to diverse transformations, yet current methods rely heavily on specialized architectures or training on predefined augmentations, limiting generalization. We propose FOCAL, a test-time, data-driven framework that achieves robust perception by leveraging internet-scale visual priors from foundation models. By generating and optimizing candidate transformations toward visually typical, “canonical” views, FOCAL enhances robustness without re-training or architectural changes. Our experiments demonstrate improved robustness of CLIP and SAM across challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations. We also highlight potential applications in active vision. Our approach challenges the assumption that transform-specific training is necessary, instead offering a scalable path to invariance. Our code is available at: this https URL.
zh
[CV-19] Fine-Grained Zero-Shot Object Detection ACM-MM’25
【速读】:该论文试图解决细粒度零样本目标检测(Fine-Grained Zero-Shot Object Detection, FG-ZSD)问题,即在零样本检测框架下,检测具有细微差异的不同类别对象。现有零样本目标检测(ZSD)方法主要针对视觉差异较大的粗粒度类别,而实际应用中常需处理细粒度场景,如不同种类的鸟类、鱼类和花卉。为解决这一问题,作者提出了一种名为MSHC的方法,其关键在于基于改进的两阶段检测器,并引入多层级语义感知嵌入对齐损失,以确保视觉空间与语义空间之间的紧密耦合。
链接: https://arxiv.org/abs/2507.10358
作者: Hongxu Ma,Chenbo Zhang,Lu Zhang,Jiaogen Zhou,Jihong Guan,Shuigeng Zhou
机构: Fudan University(复旦大学); Huaiyin Normal University(淮阴师范学院); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM’25
Abstract:Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD works are mainly coarse-grained object detection, where the classes are visually quite different, thus are relatively easy to distinguish. However, in real life we often have to face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished. For example, detecting different kinds of birds, fishes, and flowers. In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss, ensuring tight coupling between the visual and semantic spaces. Considering that existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset FGZSD-Birds, which contains 148,820 images falling into 36 orders, 140 families, 579 genera and 1432 species. Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models. Comments: Accepted by ACM MM’25 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.10358 [cs.CV] (or arXiv:2507.10358v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.10358 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-20] Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter
【速读】:该论文试图解决预训练视觉-语言模型(VLM)在下游任务中知识迁移时,传统确定性文本适配器无法充分捕捉类别文本描述的多样性及类间关系的问题。解决方案的关键在于引入随机图模型,构建一种新型顶点随机图适配器(VRGAdapter),通过顶点随机知识图(VRKG)同时建模每个类别的内在多样描述和类间关系,并利用概率信息传播学习上下文感知的分布表示,最终通过重参数化采样函数实现文本适配器的学习。该方法提供了一个更通用的适配器框架,传统基于图的适配器可视为其特例。
链接: https://arxiv.org/abs/2507.10355
作者: Bo Jiang,Xueyang Ze,Beibei Wang,Xixi Wang,Xixi Wan,Bin Luo
机构: Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University (安徽省多模态认知计算重点实验室,安徽大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ the deterministic textual feature adapter to refine each category textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. Obviously, traditional deterministic adapter model cannot adequately capture this varied semantic information. Also, it is desirable to exploit the inter-class relationships in VLM adapter. To address these issues, we propose to exploit random graph model into VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that, VRGAdapter provides a more general adapter solution that encompasses traditional graph-based adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
zh
[CV-21] FGSSNet: Feature-Guided Semantic Segmentation of Real World Floorplans
【速读】:该论文试图解决在建筑平面图中墙体分割的泛化能力不足的问题,旨在提升分割模型对不同场景下墙体特征的适应性。解决方案的关键在于提出FGSSNet架构,其核心创新是引入一个多头专用特征提取器,该提取器通过编码输入平面图中代表性墙体区域的纹理和宽度特征,生成压缩的潜在表示,并将其注入到U-Net的潜在空间中,从而指导分割过程。这种特征引导机制有效增强了模型对墙体特征的理解与分割精度。
链接: https://arxiv.org/abs/2507.10343
作者: Hugo Norrby,Gabriel Färm,Kevin Hernandez-Diaz,Fernando Alonso-Fernandez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at International Workshop on Artificial Intelligence and Pattern Recognition, IWAIPR 2025
Abstract:We introduce FGSSNet, a novel multi-headed feature-guided semantic segmentation (FGSS) architecture designed to improve the generalization ability of wall segmentation on floorplans. FGSSNet features a U-Net segmentation backbone with a multi-headed dedicated feature extractor used to extract domain-specific feature maps which are injected into the latent space of U-Net to guide the segmentation process. This dedicated feature extractor is trained as an encoder-decoder with selected wall patches, representative of the walls present in the input floorplan, to produce a compressed latent representation of wall patches while jointly trained to predict the wall width. In doing so, we expect that the feature extractor encodes texture and width features of wall patches that are useful to guide the wall segmentation process. Our experiments show increased performance by the use of such injected features in comparison to the vanilla U-Net, highlighting the validity of the proposed approach.
zh
[CV-22] xt Embedding Knows How to Quantize Text-Guided Diffusion Models ICCV2025
【速读】:该论文试图解决扩散模型在资源受限环境中因计算复杂度高而难以应用的问题,其解决方案的关键在于提出一种名为QLIP的新型量化方法,该方法利用文本提示指导每个时间步长下每一层的比特精度选择,从而实现更高效的量化。
链接: https://arxiv.org/abs/2507.10340
作者: Hongjae Lee,Myungjun Son,Dongjea Kang,Seung-Won Jung
机构: Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
zh
[CV-23] Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching ICCV2025
【速读】:该论文试图解决在将视觉基础模型引入图像特征匹配时存在的对齐问题(misalignment),这一问题源于基础模型侧重单图理解而特征匹配需要跨图理解的矛盾。具体表现为:1)常用基础模型生成的嵌入与特征匹配所需的最优嵌入存在差异;2)缺乏有效机制将单图理解能力转化为跨图理解。解决方案的关键在于提出一种名为IMD(Image feature Matching with a pre-trained Diffusion model)的框架,其核心是集成生成式扩散模型以捕捉实例级细节,并利用生成模型中的提示机制设计跨图交互提示模块,促进图像对之间的双向信息交互。
链接: https://arxiv.org/abs/2507.10318
作者: Yuhan Liu,Jingwen Fu,Yang Wu,Kangyi Wu,Pengna Li,Jiayi Wu,Sanping Zhou,Jingmin Xin
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Leveraging the vision foundation models has emerged as a mainstream paradigm that improves the performance of image feature matching. However, previous works have ignored the misalignment when introducing the foundation models into feature matching. The misalignment arises from the discrepancy between the foundation models focusing on single-image understanding and the cross-image understanding requirement of feature matching. Specifically, 1) the embeddings derived from commonly used foundation models exhibit discrepancies with the optimal embeddings required for feature matching; 2) lacking an effective mechanism to leverage the single-image understanding ability into cross-image understanding. A significant consequence of the misalignment is they struggle when addressing multi-instance feature matching problems. To address this, we introduce a simple but effective framework, called IMD (Image feature Matching with a pre-trained Diffusion model) with two parts: 1) Unlike the dominant solutions employing contrastive-learning based foundation models that emphasize global semantics, we integrate the generative-based diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in generative model as a natural tunnel, propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our proposed IMD establishes a new state-of-the-art in commonly evaluated benchmarks, and the superior improvement 12% in IMIM indicates our method efficiently mitigates the misalignment.
zh
[CV-24] Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
【速读】:该论文旨在解决手语翻译(Sign Language Translation, SLT)中依赖于成本高昂且难以全面捕捉连续手语复杂性的词素注释(gloss annotations)的问题。其解决方案的关键在于提出一种两阶段、双视觉编码器框架,通过对比视觉-语言预训练实现无需词素注释的手语翻译。在预训练阶段,采用两个互补的视觉主干网络,其输出通过对比目标与彼此以及句子级文本嵌入进行联合对齐;在下游任务中,融合视觉特征并输入到编码器-解码器模型中,从而提升了翻译性能。
链接: https://arxiv.org/abs/2507.10306
作者: Ozge Mercanoglu Sincan,Richard Bowden
机构: CVSSP, University of Surrey; Guildford UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT), will be held in conjunction with IVA’25
Abstract:Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
zh
[CV-25] DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLM s ICCV2025
【速读】:该论文旨在解决视频多模态大语言模型(video MLLMs)中视觉封装过程存在的语义模糊性和时间不连贯性问题。现有方法中广泛使用的线性投影器在处理视频时难以保持语义区分性和时间一致性,而现有的重采样结构虽具潜力,但尚未有有效的解决方案。论文提出的解决方案关键在于DisCo方法,其核心包含两个组件:(1)视觉概念判别器(VCD)模块,通过将视觉标记与视频中的判别性概念配对来赋予其独特的语义;(2)时间焦点校准器(TFC)模块,确保视觉标记在视频每一帧中对视频元素保持一致的时间关注。
链接: https://arxiv.org/abs/2507.10302
作者: Jiahe Zhao,Rongkun Zheng,Yi Wang,Helin Wang,Hengshuang Zhao
机构: University of Chinese Academy of Sciences (中国科学院大学); The University of Hong Kong (香港大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: this https URL.
zh
[CV-26] Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration
【速读】:该论文旨在解决在严重退化情况下,传统方法难以保留细粒度、身份特异性特征的问题,从而导致恢复的面部视频缺乏个体特征。其解决方案的关键在于引入IP-FVR方法,该方法通过使用高质量参考人脸图像作为视觉提示,在去噪过程中提供身份条件约束。IP-FVR利用解耦交叉注意力机制从参考图像中提取语义丰富的身份信息,确保结果的细节和身份一致性,并通过身份保持反馈学习方法和指数融合策略分别解决片段内和片段间的身份漂移问题,同时采用多流负提示增强恢复过程,提升面部属性的相关性并减少低质量或错误特征的生成。
链接: https://arxiv.org/abs/2507.10293
作者: Wenkang Han,Wang Lin,Yiyun Zhou,Qi Liu,Shulei Wang,Chang Yao,Jingyuan Chen
机构: Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MM 2025
Abstract:Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model’s attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration.
zh
[CV-27] FTCFormer: Fuzzy Token Clustering Transformer for Image Classification
【速读】:该论文试图解决传统基于Transformer的架构在图像嵌入时依赖于均匀网格化视觉标记,而忽视了图像区域的语义含义,从而导致特征表示不够优化的问题。解决方案的关键在于提出Fuzzy Token Clustering Transformer (FTCFormer),其核心是引入了一种基于聚类的下采样模块,能够根据语义意义动态生成视觉标记,而非仅依赖空间位置。该方法通过DPC-FKNN机制确定聚类中心、SCS用于标记分配以及Cmerge策略进行标记合并,实现了对不同语义重要性的区域进行更合理的标记分配。
链接: https://arxiv.org/abs/2507.10283
作者: Muyi Bao,Changyu Zeng,Yifan Wang,Zhengni Yang,Zimu Wang,Guangliang Cheng,Jun Qi,Wei Wang
机构: School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China; Department of Mathematical Sciences, University of Liverpool, Liverpool, United Kingdom; Department of Computer Science, University of Liverpool, Liverpool, United Kingdom
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on the semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more to represent semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline, achieving gains of improving 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets and 0.55% on four remote sensing datasets. The code is available at: this https URL.
zh
[CV-28] Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures ICCV2025
【速读】:该论文试图解决在物体中心场景中,由于背景纹理占据图像大部分区域而导致的相机位姿估计精度下降问题。解决方案的关键在于引入了Kaleidoscopic Background Attack(KBA),该方法利用相同片段生成具有多倍径向对称性的圆盘,这些圆盘在不同视角下保持高相似性,从而有效攻击位姿估计模型。此外,通过引入投影方向一致性损失优化这些 kaleidoscopic 片段,进一步提升了攻击效果。
链接: https://arxiv.org/abs/2507.10265
作者: Xinlong Ding,Hongwei Yu,Jiawei Li,Feifan Li,Yu Shang,Bochao Zou,Huimin Ma,Jiansheng Chen
机构: University of Science and Technology Beijing (北京科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025. Project page is available at this https URL
Abstract:Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
zh
[CV-29] ransferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks ECAI2025
【速读】:该论文试图解决语义分割中深度神经网络(Deep Neural Networks, DNNs)对纹理线索的依赖问题,从而降低纹理偏差并提高模型在常见图像损坏和对抗攻击下的鲁棒性。其解决方案的关键在于使用风格迁移(style transfer)技术,通过对人工图像区域进行风格变换(风格变化基于Voronoi细胞生成的随机区域),生成风格转移的数据集,以此训练语义分割DNN,使其减少对纹理特征的依赖,增强对形状特征的利用。
链接: https://arxiv.org/abs/2507.10239
作者: Ben Hamscher,Edgar Heinert,Annika Mütze,Kira Maag,Matthias Rottmann
机构: Heinrich-Heine-University Düsseldorf(海因里希·海涅大学杜塞尔多夫分校); University of Wuppertal(伍珀塔尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted at ECAI 2025
Abstract:Recent research has investigated the shape and texture biases of deep neural networks (DNNs) in image classification which influence their generalization capabilities and robustness. It has been shown that, in comparison to regular DNN training, training with stylized images reduces texture biases in image classification and improves robustness with respect to image corruptions. In an effort to advance this line of research, we examine whether style transfer can likewise deliver these two effects in semantic segmentation. To this end, we perform style transfer with style varying across artificial image areas. Those random areas are formed by a chosen number of Voronoi cells. The resulting style-transferred data is then used to train semantic segmentation DNNs with the objective of reducing their dependence on texture cues while enhancing their reliance on shape-based features. In our experiments, it turns out that in semantic segmentation, style transfer augmentation reduces texture bias and strongly increases robustness with respect to common image corruptions as well as adversarial attacks. These observations hold for convolutional neural networks and transformer architectures on the Cityscapes dataset as well as on PASCAL Context, showing the generality of the proposed method.
zh
[CV-30] Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?
【速读】:该论文试图解决AI-Generated Image Detection (AID)模型在真实场景下检测性能下降的问题,即当前AID模型在受控基准数据集上表现优异,但在面对实际社会媒体平台中的多样化和复杂图像时存在显著不足。解决方案的关键在于通过系统分析影响AID性能的四个核心因素:主干架构、训练数据组成、预处理策略以及数据增强组合,并基于此进行优化,从而在真实世界条件下实现了平均AUC提升26.87%。
链接: https://arxiv.org/abs/2507.10236
作者: Despina Konstantinidou,Dimitrios Karageorgiou,Christos Koutlis,Olga Papadopoulou,Emmanouil Schinas,Symeon Papadopoulos
机构: ITI - CERTH, Thessaloniki, Greece
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 4 figures
Abstract:The rapid advancement of generative technologies presents both unprecedented creative opportunities and significant challenges, particularly in maintaining social trust and ensuring the integrity of digital information. Following these concerns, the challenge of AI-Generated Image Detection (AID) becomes increasingly critical. As these technologies become more sophisticated, the quality of AI-generated images has reached a level that can easily deceive even the most discerning observers. Our systematic evaluation highlights a critical weakness in current AI-Generated Image Detection models: while they perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world variations. To assess this, we introduce ITW-SM, a new dataset of real and AI-generated images collected from major social media platforms. In this paper, we identify four key factors that influence AID performance in real-world scenarios: backbone architecture, training data composition, pre-processing strategies and data augmentation combinations. By systematically analyzing these components, we shed light on their impact on detection efficacy. Our modifications result in an average AUC improvement of 26.87% across various AID models under real-world conditions.
zh
[CV-31] Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
【速读】:该论文试图解决在图像特征空间中接近分布内(InD)数据的挑战性分布外(OOD)样本仍可能导致误分类的问题。其解决方案的关键在于利用基础模型生成具有边界对齐特性的合成OOD数据,通过迭代的图像修复过程结合多模态大语言模型(MLLM)的上下文提示,生成精细化的OOD样本,并基于OOD得分(如能量得分)进行噪声调整,从而有效采样于InD/OOD边界。随后,利用这些合成图像微调CLIP模型的图像编码器及从文本编码器获得的负标签特征,以增强近边界OOD样本与负标签之间的关联。
链接: https://arxiv.org/abs/2507.10225
作者: Jinglun Li,Kaixun Jiang,Zhaoyu Chen,Bo Lin,Yao Tang,Weifeng Ge,Wenqiang Zhang
机构: Fudan University(复旦大学); Shanghai Key Lab of Intelligent Information Processing(上海市智能信息处理重点实验室); College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院); JIIOV Technology(极目科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by 11.13%. Codes are available in this https URL.
zh
[CV-32] ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users ICCV’25
【速读】:该论文试图解决视觉感知在假肢步态分析中的挑战,特别是针对假肢的独特外观和新型运动模式导致的检测与分析困难。其解决方案的关键在于引入了一个多用途数据集ProGait,该数据集支持视频目标分割、2D人体姿态估计和步态分析等多种视觉任务,包含412段来自四位膝上截肢者在不同新适配假肢下行走试验的视频片段,提供了人类受试者使用股骨假肢时的存在性、轮廓、姿态及步态模式的信息。
链接: https://arxiv.org/abs/2507.10223
作者: Xiangyu Yin,Boyuan Yang,Weichen Liu,Qiyao Xue,Abrar Alamri,Goeran Fiedler,Wei Gao
机构: University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICCV’25
Abstract:Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations the ability to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prosthesis, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. Our code is available at this https URL and dataset at this https URL.
zh
[CV-33] Spatial Lifting for Dense Prediction
【速读】:该论文试图解决密集预测任务中模型参数量大、推理成本高以及监督机制不足的问题。其解决方案的关键在于提出空间提升(Spatial Lifting, SL)方法,通过将标准输入(如2D图像)提升到更高维空间,并利用为高维设计的网络(如3D U-Net)进行处理,从而在保持性能的同时显著减少模型参数数量和推理成本,同时在提升维度上生成内在结构化的输出,以支持密集监督和高效的预测质量评估。
链接: https://arxiv.org/abs/2507.10222
作者: Mingzhi Xu,Yizhe Zhang
机构: Nanjing University of Science and Technology(南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Preprint. Under review
Abstract:We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher-dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U-Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near-zero-additional-cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U-Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.
zh
[CV-34] Straighten Viscous Rectified Flow via Noise Optimization
【速读】:该论文旨在解决Reflow方法在单步或少步生成中生成高质量图像能力受限的问题,其核心问题在于Reflow构建的确定性耦合中噪声与图像的分布与真实图像存在差距。为解决这一问题,论文提出了一种新的解决方案——通过噪声优化的直化粘性修正流(VRFNO),其关键在于引入了两个创新点:一是历史速度项以增强轨迹区分度,使模型更准确地预测当前轨迹的速度;二是通过重参数化进行噪声优化,形成与真实图像的优化耦合,从而有效缓解Reflow方法的局限性。
链接: https://arxiv.org/abs/2507.10218
作者: Jimin Dai,Jiexi Yan,Jian Yang,Lei Luo
机构: Nanjing University of Science and Technology (南京理工大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Reflow operation aims to straighten the inference trajectories of the rectified flow during training by constructing deterministic couplings between noises and images, thereby improving the quality of generated images in single-step or few-step generation. However, we identify critical limitations in Reflow, particularly its inability to rapidly generate high-quality images due to a distribution gap between images in its constructed deterministic couplings and real images. To address these shortcomings, we propose a novel alternative called Straighten Viscous Rectified Flow via Noise Optimization (VRFNO), which is a joint training framework integrating an encoder and a neural velocity field. VRFNO introduces two key innovations: (1) a historical velocity term that enhances trajectory distinction, enabling the model to more accurately predict the velocity of the current trajectory, and (2) the noise optimization through reparameterization to form optimized couplings with real images which are then utilized for training, effectively mitigating errors caused by Reflow’s limitations. Comprehensive experiments on synthetic data and real datasets with varying resolutions show that VRFNO significantly mitigates the limitations of Reflow, achieving state-of-the-art performance in both one-step and few-step generation tasks.
zh
[CV-35] From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation
【速读】:该论文旨在解决个性化人体图像生成中属性保持不精确且不一致的问题,尤其是在身份和服装细节等方面。其解决方案的关键在于提出Wardrobe Polyptych LoRA,一种基于部件级别的可控模型,通过仅训练LoRA层来减轻推理阶段的计算负担,同时确保未见过主体的高保真合成。该方法通过将生成过程条件化于主体的衣橱并利用空间参考以减少信息丢失,从而提升生成图像的保真度和一致性。此外,引入选择性主体区域损失,使模型在训练过程中忽略部分参考图像,进一步提升生成结果与文本提示的一致性及主体完整性。
链接: https://arxiv.org/abs/2507.10217
作者: Jeongho Kim,Sunghyun Park,Hyoungwoo Park,Sungrack Yun,Jaegul Choo,Seokeon Cho
机构: Qualcomm AI Research†; Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures
Abstract:Recent diffusion models achieve personalization by learning specific subjects, allowing learned attributes to be integrated into generated images. However, personalized human image generation remains challenging due to the need for precise and consistent attribute preservation (e.g., identity, clothing details). Existing subject-driven image generation methods often require either (1) inference-time fine-tuning with few images for each new subject or (2) large-scale dataset training for generalization. Both approaches are computationally expensive and impractical for real-time applications. To address these limitations, we present Wardrobe Polyptych LoRA, a novel part-level controllable model for personalized human image generation. By training only LoRA layers, our method removes the computational burden at inference while ensuring high-fidelity synthesis of unseen subjects. Our key idea is to condition the generation on the subject’s wardrobe and leverage spatial references to reduce information loss, thereby improving fidelity and consistency. Additionally, we introduce a selective subject region loss, which encourages the model to disregard some of reference images during training. Our loss ensures that generated images better align with text prompts while maintaining subject integrity. Notably, our Wardrobe Polyptych LoRA requires no additional parameters at the inference stage and performs generation using a single model trained on a few training samples. We construct a new dataset and benchmark tailored for personalized human image generation. Extensive experiments show that our approach significantly outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis.
zh
[CV-36] Boosting Multimodal Learning via Disentangled Gradient Learning ICCV2025
【速读】:该论文试图解决多模态学习中因模态编码器与模态融合模块之间的优化冲突导致的性能下降问题,即多模态模型中的每个模态性能通常劣于单模态模型。解决方案的关键在于提出一种解耦梯度学习(Disentangled Gradient Learning, DGL)框架,通过截断从多模态损失反向传播到模态编码器的梯度,并用单模态损失的梯度替代,同时移除从单模态损失反向传播到模态融合模块的梯度,从而消除模态编码器与模态融合模块之间的梯度干扰,确保两者的独立优化过程。
链接: https://arxiv.org/abs/2507.10213
作者: Shicai Wei,Chunbo Luo,Yang Luo
机构: UESTC(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025
Abstract:Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at \hrefthis https URLthis https URL
zh
[CV-37] Is Micro-expression Ethnic Leaning?
【速读】:该论文试图解决种族背景在微表情(micro-expression)分析中的影响问题,挑战了埃克曼(Ekman)提出的情绪普遍性假设,即认为不同文化背景下的情绪表达是相同的。其解决方案的关键在于构建一个跨文化的微表情数据库,并通过算法标注种族标签,以支持对种族因素在情绪表达中的作用进行系统研究。此外,论文提出了一种将种族背景整合到情感特征学习过程中的框架,从而实现对微表情识别中种族差异的敏感识别。
链接: https://arxiv.org/abs/2507.10209
作者: Huai-Qian Khor,Yante Li,Xingxun Jiang,Guoying Zhao
机构: University of Oulu (奥卢大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:How much does ethnicity play its part in emotional expression? Emotional expression and micro-expression research probe into understanding human psychological responses to emotional stimuli, thereby revealing substantial hidden yet authentic emotions that can be useful in the event of diagnosis and interviews. While increased attention had been provided to micro-expression analysis, the studies were done under Ekman’s assumption of emotion universality, where emotional expressions are identical across cultures and social contexts. Our computational study uncovers some of the influences of ethnic background in expression analysis, leading to an argument that the emotional universality hypothesis is an overgeneralization from the perspective of manual psychological analysis. In this research, we propose to investigate the level of influence of ethnicity in a simulated micro-expression scenario. We construct a cross-cultural micro-expression database and algorithmically annotate the ethnic labels to facilitate the investigation. With the ethnically annotated dataset, we perform a prima facie study to compare mono-ethnicity and stereo-ethnicity in a controlled environment, which uncovers a certain influence of ethnic bias via an experimental way. Building on this finding, we propose a framework that integrates ethnic context into the emotional feature learning process, yielding an ethnically aware framework that recognises ethnicity differences in micro-expression recognition. For improved understanding, qualitative analyses have been done to solidify the preliminary investigation into this new realm of research. Code is publicly available at this https URL
zh
[CV-38] Improving Multimodal Learning via Imbalanced Learning ICCV2025
【速读】:该论文试图解决多模态学习中因模态间不平衡导致的性能下降问题,即多模态学习可能表现得不如单模态学习。解决方案的关键在于提出一种非对称表示学习(Asymmetric Representation Learning, ARL)策略,通过引入辅助正则化项来计算每个模态编码器的预测方差,并利用单模态方差计算系数以重新加权每个模态的优化过程,使模态依赖比例与模态方差比例成反比。此外,ARL还引入了每个模态的预测偏差,并与多模态损失联合优化,以最小化泛化误差。
链接: https://arxiv.org/abs/2507.10203
作者: Shicai Wei,Chunbo Luo,Yang Luo
机构: UESTC(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025
Abstract:Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that imbalanced dependency on each modality obeying the inverse ratio of their variances contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning(ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL. Code is available at \hrefthis https URLthis https URL
zh
[CV-39] A Training-Free Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images CVPR2025
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理高分辨率图像时,由于训练与测试分辨率不一致导致的细粒度定位和推理能力下降的问题。解决方案的关键在于提出一种无需训练、任务无关的两阶段框架——Extract Candidate then Predict (ECP),其核心思想是通过先利用下采样图像的粗略预测提取候选区域,再基于该区域进行最终预测,从而在保持细粒度视觉细节的同时缓解高分辨率数据带来的挑战。
链接: https://arxiv.org/abs/2507.10202
作者: Jaeseong Lee,Yeeun Choi,Heechan Choi,Hanjung Kim,Seonjoo Kim
机构: Yonsei University(延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2025 Workshop on Emergent Visual Abilities and Limits of Foundation Models
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with fixed image resolution to align with the pre-trained image encoder used in MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images-although ensuring consistency-compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying candidate region using the coarse prediction and then predicting the final output based on candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving +21.3%, +5.8%, +5.2% absolute improvement compared to baseline respectively, demonstrating its effectiveness. Code is available at this https URL.
zh
[CV-40] Minimizing the Pretraining Gap: Domain-aligned Text-Based Person Retrieval
【速读】:该论文旨在解决文本驱动的人体检索任务中,由于合成数据与真实数据之间存在显著领域差异(如光照、颜色和视角等)而导致的预训练-微调范式效果受限的问题。其解决方案的关键在于提出了一种统一的文本驱动人体检索流程,包含两个主要组件:面向图像级别的Domain-aware Diffusion (DaD) 和面向区域级别的Multi-granularity Relation Alignment (MRA),分别用于迁移图像分布和建立视觉区域与描述句子之间的对应关系,从而有效缩小领域差距并提升模型性能。
链接: https://arxiv.org/abs/2507.10195
作者: Shuyu Yang,Yaxiong Wang,Yongrui Li,Li Zhu,Zhedong Zheng
机构: Xi’an Jiaotong University (西安交通大学); Hefei University of Technology (合肥工业大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this work, we focus on text-based person retrieval, which aims to identify individuals based on textual descriptions. Given the significant privacy issues and the high cost associated with manual annotation, synthetic data has become a popular choice for pretraining models, leading to notable advancements. However, the considerable domain gap between synthetic pretraining datasets and real-world target datasets, characterized by differences in lighting, color, and viewpoint, remains a critical obstacle that hinders the effectiveness of the pretrain-finetune paradigm. To bridge this gap, we introduce a unified text-based person retrieval pipeline considering domain adaptation at both image and region levels. In particular, it contains two primary components, i.e., Domain-aware Diffusion (DaD) for image-level adaptation and Multi-granularity Relation Alignment (MRA) for region-level adaptation. As the name implies, Domain-aware Diffusion is to migrate the distribution of images from the pretraining dataset domain to the target real-world dataset domain, e.g., CUHK-PEDES. Subsequently, MRA performs a meticulous region-level alignment by establishing correspondences between visual regions and their descriptive sentences, thereby addressing disparities at a finer granularity. Extensive experiments show that our dual-level adaptation method has achieved state-of-the-art results on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, outperforming existing methodologies. The dataset, model, and code are available at this https URL.
zh
[CV-41] Learning Private Representations through Entropy-based Adversarial Training
【速读】:该论文试图解决在学习具有高预测能力表示的同时保护用户隐私的问题。其解决方案的关键在于提出一种对抗性表示学习方法,用于从学习到的表示中净化敏感内容,其中引入了一种熵的变体——焦点熵(focal entropy),以缓解基于熵的方法可能存在的信息泄露问题。
链接: https://arxiv.org/abs/2507.10194
作者: Tassilo Klein,Moin Nabi
机构: SAP SE( SAP股份公司); Apple(苹果)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:How can we learn a representation with high predictive power while preserving user privacy? We present an adversarial representation learning method for sanitizing sensitive content from the learned representation. Specifically, we introduce a variant of entropy - focal entropy, which mitigates the potential information leakage of the existing entropy-based approaches. We showcase feasibility on multiple benchmarks. The results suggest high target utility at moderate privacy leakage.
zh
[CV-42] SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis
【速读】:该论文试图解决传统坍落度测试(slump test)在实际应用中存在的人工操作、耗时且结果不一致的问题,这些问题限制了其在实时监测中的适用性。解决方案的关键在于提出SlumpGuard,这是一个基于人工智能(AI)的视频分析系统,能够自动从搅拌车出料口分析混凝土流动情况,从而实现实时评估混凝土工作性(workability)。该系统实现了无需人工干预的全批次检测,提高了质量控制的准确性和效率。
链接: https://arxiv.org/abs/2507.10171
作者: Youngmin Kim,Giyeong Oh,Kwangsoo Youm,Youngjae Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Concrete workability is essential for construction quality, with the slump test being the most common on-site method for its assessment. However, traditional slump testing is manual, time-consuming, and prone to inconsistency, limiting its applicability for real-time monitoring. To address these challenges, we propose SlumpGuard, an AI-powered, video-based system that automatically analyzes concrete flow from the truck chute to assess workability in real time. Our system enables full-batch inspection without manual intervention, improving both the accuracy and efficiency of quality control. We present the system design, a the construction of a dedicated dataset, and empirical results from real-world deployment, demonstrating the effectiveness of SlumpGuard as a practical solution for modern concrete quality assurance.
zh
[CV-43] Deep Recurrence for Dynamical Segmentation Models
【速读】:该论文试图解决人工神经网络在噪声环境下性能下降以及监督数据有限时泛化能力不足的问题。其解决方案的关键在于引入一种受预测编码启发的反馈机制,该机制通过从输出到输入的循环连接形成递归回路,使模型能够随着时间推移不断优化其内部状态,同时结合软最大投影和指数衰减两种生物启发的操作以确保反馈回路的稳定性。
链接: https://arxiv.org/abs/2507.10143
作者: David Calhas,Arlindo L. Oliveira
机构: INESC-ID (INESC-ID); Instituto Superior Tecnico (Instituto Superior Tecnico)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages
Abstract:While biological vision systems rely heavily on feedback connections to iteratively refine perception, most artificial neural networks remain purely feedforward, processing input in a single static pass. In this work, we propose a predictive coding inspired feedback mechanism that introduces a recurrent loop from output to input, allowing the model to refine its internal state over time. We implement this mechanism within a standard U-Net architecture and introduce two biologically motivated operations, softmax projection and exponential decay, to ensure stability of the feedback loop. Through controlled experiments on a synthetic segmentation task, we show that the feedback model significantly outperforms its feedforward counterpart in noisy conditions and generalizes more effectively with limited supervision. Notably, feedback achieves above random performance with just two training examples, while the feedforward model requires at least four. Our findings demonstrate that feedback enhances robustness and data efficiency, and offer a path toward more adaptive and biologically inspired neural architectures. Code is available at: this http URL.
zh
[CV-44] Probabilistic Human Intent Prediction for Mobile Manipulation: An Evaluation with Human-Inspired Constraints
【速读】:该论文试图解决在人机协作中准确推断人类意图的问题,以避免限制人类控制或引发人与机器人之间的冲突。其解决方案的关键是提出GUIDER(Global User Intent Dual-phase Estimation for Robots)框架,该框架通过维护两个耦合的信念层来跟踪导航目标和操作目标,并结合多模态感知与实时更新规则,实现对意图区域和物体的识别,而无需预定义目标。
链接: https://arxiv.org/abs/2507.10131
作者: Cesar Alan Contreras,Manolis Chiou,Alireza Rastegarpanah,Michal Szulik,Rustam Stolkin
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Submitted to Journal of Intelligent Robotic Systems (Under Review)
Abstract:Accurate inference of human intent enables human-robot collaboration without constraining human control or causing conflicts between humans and robots. We present GUIDER (Global User Intent Dual-phase Estimation for Robots), a probabilistic framework that enables a robot to estimate the intent of human operators. GUIDER maintains two coupled belief layers, one tracking navigation goals and the other manipulation goals. In the Navigation phase, a Synergy Map blends controller velocity with an occupancy grid to rank interaction areas. Upon arrival at a goal, an autonomous multi-view scan builds a local 3D cloud. The Manipulation phase combines U2Net saliency, FastSAM instance saliency, and three geometric grasp-feasibility tests, with an end-effector kinematics-aware update rule that evolves object probabilities in real-time. GUIDER can recognize areas and objects of intent without predefined goals. We evaluated GUIDER on 25 trials (five participants x five task variants) in Isaac Sim, and compared it with two baselines, one for navigation and one for manipulation. Across the 25 trials, GUIDER achieved a median stability of 93-100% during navigation, compared with 60-100% for the BOIR baseline, with an improvement of 39.5% in a redirection scenario (T5). During manipulation, stability reached 94-100% (versus 69-100% for Trajectron), with a 31.4% difference in a redirection task (T3). In geometry-constrained trials (manipulation), GUIDER recognized the object intent three times earlier than Trajectron (median remaining time to confident prediction 23.6 s vs 7.8 s). These results validate our dual-phase framework and show improvements in intent inference in both phases of mobile manipulation tasks.
zh
[CV-45] aming Modern Point Tracking for Speckle Tracking Echocardiography via Impartial Motion ICCV2025
【速读】:该论文试图解决在超声心动图中对可变形组织进行精确运动估计的问题,以实现更准确的心脏功能测量。传统方法如块匹配或光流在处理复杂的心脏运动时表现不佳,而现代点跟踪方法在该领域仍研究不足。论文的关键解决方案是通过分析真实B模式超声视频中的心脏运动,识别出不同视角下的方向性运动偏差,并通过改进训练流程和引入定制增强策略来减轻这种偏差,从而提升跟踪的鲁棒性和泛化能力。此外,还提出了一种轻量级网络,利用空间上下文的多尺度成本体积来挑战先进的时空点跟踪模型,实验表明该方法显著提升了模型性能,尤其是在分布外(OOD)案例中。
链接: https://arxiv.org/abs/2507.10127
作者: Md Abulkalam Azad,John Nyberg,Håvard Dalen,Bjørnar Grenne,Lasse Lovstakken,Andreas Østvik
机构: Norwegian University of Science and Technology (挪威科技大学); Clinic of Cardiology, St. Olavs Hospital (圣奥拉夫医院心血管科); SINTEF Digital (西门子数字)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVAMD workshop at ICCV 2025
Abstract:Accurate motion estimation for tracking deformable tissues in echocardiography is essential for precise cardiac function measurements. While traditional methods like block matching or optical flow struggle with intricate cardiac motion, modern point tracking approaches remain largely underexplored in this domain. This work investigates the potential of state-of-the-art (SOTA) point tracking methods for ultrasound, with a focus on echocardiography. Although these novel approaches demonstrate strong performance in general videos, their effectiveness and generalizability in echocardiography remain limited. By analyzing cardiac motion throughout the heart cycle in real B-mode ultrasound videos, we identify that a directional motion bias across different views is affecting the existing training strategies. To mitigate this, we refine the training procedure and incorporate a set of tailored augmentations to reduce the bias and enhance tracking robustness and generalization through impartial cardiac motion. We also propose a lightweight network leveraging multi-scale cost volumes from spatial context alone to challenge the advanced spatiotemporal point tracking models. Experiments demonstrate that fine-tuning with our strategies significantly improves models’ performances over their baselines, even for out-of-distribution (OOD) cases. For instance, EchoTracker boosts overall position accuracy by 60.7% and reduces median trajectory error by 61.5% across heart cycle phases. Interestingly, several point tracking models fail to outperform our proposed simple model in terms of tracking accuracy and generalization, reflecting their limitations when applied to echocardiography. Nevertheless, clinical evaluation reveals that these methods improve GLS measurements, aligning more closely with expert-validated, semi-automated tools and thus demonstrating better reproducibility in real-world applications.
zh
[CV-46] DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation ICCV2025
【速读】:该论文试图解决像素级标注成本高且耗时的问题,通过在少量标注图像和大量未标注图像上学习模型来实现半监督分割。其解决方案的关键在于利用两个专用的基础模型,通过增强识别(基于CLIP特征的零样本分类)和定位(基于SAM伪标签的类无关解码器预热)的解耦提升,从而在大规模类别体系和有限标注数据的半监督场景中表现出色。
链接: https://arxiv.org/abs/2507.10118
作者: Ivan Martinović,Josip Šarić,Marin Oršić,Matej Kristan,Siniša Šegvić
机构: Faculty of Electrical Engineering and Computing (电气工程与计算学院); Faculty of Computer and Information Science (计算机与信息科学学院); University of Zagreb (萨格勒布大学); University of Ljubljana (卢布尔雅那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 Findings Workshop
Abstract:Pixel-level annotation is expensive and time-consuming. Semi-supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi-supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask-transformer consistency with zero-shot classification of CLIP features. We enhance localization by class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi-supervised scenarios with large taxonomies and limited labeled data. Moreover, DEARLi outperforms the state of the art in semi-supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at this https URL.
zh
[CV-47] Glance-MCMT: A General MCMT Framework with Glance Initialization and Progressive Association
【速读】:该论文试图解决多摄像头多目标(MCMT)跟踪中跨视角的身份一致性问题,即在不同摄像头视角下保持目标身份的统一性。解决方案的关键在于利用轨迹和外观特征进行全局身份分配,通过初始阶段的轨迹-特征匹配初始化全局ID,并在后续帧中采用优先级全局匹配策略将新轨迹与现有全局ID进行关联,仅在无法找到足够相似的轨迹或特征匹配时才引入新的全局ID,从而实现跨视角的目标身份一致性。
链接: https://arxiv.org/abs/2507.10115
作者: Hamidreza Hashempoor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a multi-camera multi-target (MCMT) tracking framework that ensures consistent global identity assignment across views using trajectory and appearance cues. The pipeline starts with BoT-SORT-based single-camera tracking, followed by an initial glance phase to initialize global IDs via trajectory-feature matching. In later frames, new tracklets are matched to existing global identities through a prioritized global matching strategy. New global IDs are only introduced when no sufficiently similar trajectory or feature match is found. 3D positions are estimated using depth maps and calibration for spatial validation.
zh
[CV-48] FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
【速读】:该论文旨在解决CLIP模型在处理长文本输入(77个token)时表现不佳的问题,尤其是在下游任务中缺乏对长文本的有效表示能力。其解决方案的关键在于提出FIX-CLIP,包含三个创新模块:(1) 双分支训练流程,分别对齐短文本和长文本与掩码图像和原始图像,以增强长文本表征同时保持短文本能力;(2) 在Transformer层中引入带有单向掩码的可学习区域提示,用于区域信息提取;(3) 在中间编码器层中引入分层特征对齐模块,以提升多尺度特征的一致性。此外,通过收集30M图像并利用现有多模态大语言模型生成长文本描述进行训练,进一步提升了模型性能。
链接: https://arxiv.org/abs/2507.10095
作者: Bingchao Wang,Zhiwei Ning,Jianyu Ding,Xuanang Gao,Yin Li,Dongsheng Jiang,Jie Yang,Wei Liu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on under-stream tasks with long-text inputs (77 tokens). To remedy this issue, we propose FIX-CLIP which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP’s text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.
zh
[CV-49] A Transfer Learning-Based Method for Water Body Segmentation in Remote Sensing Imagery: A Case Study of the Zhada Tulin Area
【速读】:该论文旨在解决遥感图像水体分割中普遍存在的领域偏移(domain shift)和小样本量问题。其解决方案的关键在于提出并验证了一种基于SegFormer模型的两阶段迁移学习策略,首先在多样化的源域上训练基础分割模型,随后在目标域数据上进行微调,从而有效提升模型在复杂环境下的性能。
链接: https://arxiv.org/abs/2507.10084
作者: Haonan Chen(Tibet University),Xin Tong(Northwestern Polytechnical University)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, 2 tables
Abstract:To address the prevalent challenges of domain shift and small sample sizes in remote sensing image water body segmentation, this study proposes and validates a two-stage transfer learning strategy based on the SegFormer model. The approach begins by training a foundational segmentation model on a diverse source domain, where it achieves an Intersection over Union (IoU) of 68.80% on its validation set, followed by fine-tuning on data from the distinct target domain. Focusing on the Zhada Tulin area in Tibet – a region characterized by highly complex topography and spectral features – the experimental results demonstrate that this strategy significantly boosts the IoU for the water body segmentation task from 25.50% (for direct transfer) to 64.84%. This not only effectively resolves the model performance degradation caused by domain discrepancy but also provides an effective technical paradigm for high-precision thematic information extraction in data-scarce and environmentally unique remote sensing scenarios.
zh
[CV-50] Frequency Regulation for Exposure Bias Mitigation in Diffusion Models
【速读】:该论文旨在解决扩散模型在生成过程中受到的曝光偏差(exposure bias)问题。其解决方案的关键在于观察到预测噪声图像的能量在扩散过程中呈现下降趋势,并发现该能量减少在低频和高频子带中表现出不同的模式。基于这一关键观察,作者引入了一种基于小波变换的频域调控机制,分别调整低频和高频子带,从而有效缓解了曝光偏差带来的幅度变化问题。
链接: https://arxiv.org/abs/2507.10072
作者: Meng Yu,Kun Zhan
机构: Lanzhou University (兰州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM Multimedia 2025 accepted!
Abstract:Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of the predicted noisy images decreases during the diffusion process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) This energy reduction results in amplitude variations between the network-reconstructed clean data and the real clean data. Based on the first finding, we introduce a frequency-domain regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we provide a more accurate analysis of exposure bias in the two subbands. Our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and providing a robust solution to exposure bias across different model architectures. The source code is available at this https URL.
zh
[CV-51] LayLens: Improving Deepfake Understanding through Simplified Explanations
【速读】:该论文试图解决深度伪造(deepfake)识别过程中技术术语晦涩难懂、普通用户难以理解的问题。其解决方案的关键在于提出一种三阶段管道:首先利用先进的伪造定位模型进行可解释的深度伪造检测,其次通过视觉-语言模型将技术性解释简化为自然语言,最后借助引导图像编辑技术重建合理的原始图像。该方法有效弥合了模型推理与人类理解之间的差距。
链接: https://arxiv.org/abs/2507.10066
作者: Abhijeet Narang,Parul Gupta,Liuyijia Su,Abhinav Dhall
机构: Monash University (莫纳什大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This demonstration paper presents \mathbfLayLens , a tool aimed to make deepfake understanding easier for users of all educational backgrounds. While prior works often rely on outputs containing technical jargon, LayLens bridges the gap between model reasoning and human understanding through a three-stage pipeline: (1) explainable deepfake detection using a state-of-the-art forgery localization model, (2) natural language simplification of technical explanations using a vision-language model, and (3) visual reconstruction of a plausible original image via guided image editing. The interface presents both technical and layperson-friendly explanations in addition to a side-by-side comparison of the uploaded and reconstructed images. A user study with 15 participants shows that simplified explanations significantly improve clarity and reduce cognitive load, with most users expressing increased confidence in identifying deepfakes. LayLens offers a step toward transparent, trustworthy, and user-centric deepfake forensics.
zh
[CV-52] MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
【速读】:该论文试图解决从单目视频中实时合成4D动态新视角的问题,同时实现外观、几何和运动的统一建模。解决方案的关键在于使用像素对齐的高斯基元网格来表示动态3D场景,并显式监督其随时间变化的运动,从而在单一学习框架内实现视图合成、重建和3D点跟踪。
链接: https://arxiv.org/abs/2507.10065
作者: Chenguo Lin,Yuchen Lin,Panwang Pan,Yifan Yu,Honglei Yan,Katerina Fragkiadaki,Yadong Mu
机构: Peking University (北京大学); ByteDance (字节跳动); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion. This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.
zh
[CV-53] Lightweight Model for Poultry Disease Detection from Fecal Images Using Multi-Color Space Feature Optimization and Machine Learning
【速读】:该论文试图解决家禽养殖中因感染性疾病(如球虫病、沙门氏菌病和新城疫)导致的高脆弱性问题,提出了一种基于轻量级机器学习的方法,通过分析家禽粪便图像来检测这些疾病。解决方案的关键在于利用多颜色空间特征提取(RGB、HSV、LAB)以及多种颜色、纹理和形状描述符,并通过系统消融实验和主成分分析(PCA)与XGBoost特征选择进行维度约简,从而获得一个在准确性和计算效率之间取得平衡的紧凑全局特征集。该方法使用人工神经网络(ANN)分类器,在无需GPU的情况下实现了95.85%的准确率,展现出比深度学习模型(如Xception和MobileNetV3)更低的资源消耗和更高的可扩展性。
链接: https://arxiv.org/abs/2507.10056
作者: A. K. M. Shoriful Islam,Md. Rakib Hassan,Macbah Uddin,Md. Shahidur Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Poultry farming is a vital component of the global food supply chain, yet it remains highly vulnerable to infectious diseases such as coccidiosis, salmonellosis, and Newcastle disease. This study proposes a lightweight machine learning-based approach to detect these diseases by analyzing poultry fecal images. We utilize multi-color space feature extraction (RGB, HSV, LAB) and explore a wide range of color, texture, and shape-based descriptors, including color histograms, local binary patterns (LBP), wavelet transforms, and edge detectors. Through a systematic ablation study and dimensionality reduction using PCA and XGBoost feature selection, we identify a compact global feature set that balances accuracy and computational efficiency. An artificial neural network (ANN) classifier trained on these features achieved 95.85% accuracy while requiring no GPU and only 638 seconds of execution time in Google Colab. Compared to deep learning models such as Xception and MobileNetV3, our proposed model offers comparable accuracy with drastically lower resource usage. This work demonstrates a cost-effective, interpretable, and scalable alternative to deep learning for real-time poultry disease detection in low-resource agricultural settings.
zh
[CV-54] CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books
【速读】:该论文试图解决漫画书中页面流分割(Page Stream Segmentation, PSS)的问题,这是一个对于自动化内容理解至关重要的任务,是许多下游任务如角色分析、故事索引或元数据增强的必要前期步骤。论文提出了一种名为CoSMo的新颖多模态Transformer模型,其关键在于通过结合视觉和语言信息,在F1-Macro、全景质量以及流级别度量上显著优于传统基线和更大的通用视觉-语言模型,同时强调了视觉特征在漫画PSS宏观结构中的主导作用,并展示了多模态方法在解决复杂歧义问题上的优势。
链接: https://arxiv.org/abs/2507.10053
作者: Marc Serra Ortega,Emanuele Vivoli,Artemis Llabrés,Dimosthenis Karatzas
机构: Computer Vision Center and Universitat Autònoma de Barcelona(计算机视觉中心和巴塞罗那自治大学); MICC, University of Florence, Italy( MICC,佛罗伦萨大学,意大利)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.
zh
[CV-55] LifelongPR: Lifelong knowledge fusion for point cloud place recognition based on replay and prompt learning
【速读】:该论文试图解决点云场景识别(Point Cloud Place Recognition, PCPR)模型在持续学习(Continual Learning, CL)过程中面临的灾难性遗忘问题,从而提升模型在动态和多样化环境中的适应能力与可扩展性。解决方案的关键在于提出一种名为LifelongPR的新型持续学习框架,其核心包括两个方面:一是通过动态分配样本数量并选择空间多样性样本的重放样本选择方法,以缓解知识丢失;二是设计基于提示学习的轻量级提示模块与两阶段训练策略,以应对领域偏移并最小化遗忘。
链接: https://arxiv.org/abs/2507.10034
作者: Xianghong Zou,Jianping Li,Zhe Chen,Zhen Cao,Zhen Dong,Qiegen Liu,Bisheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Point cloud place recognition (PCPR) plays a crucial role in photogrammetry and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, i.e., the ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset’s information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly available at this https URL.
zh
[CV-56] Memory-Efficient Personalization of Text-to-Image Diffusion Models via Selective Optimization Strategies
【速读】:该论文试图解决在边缘设备上对文本到图像扩散模型进行记忆高效的个性化问题,以在保护用户隐私和有限计算资源的前提下实现高质量的微调。解决方案的关键在于提出一种选择性优化框架,该框架根据扩散过程的特性,自适应地在低分辨率图像上的反向传播(BP-low)和高分辨率图像上的零阶优化(ZO-high)之间进行选择。通过结合两种方法的优势,框架利用BP-low实现有效的个性化,同时使用ZO-high保持结构一致性,从而实现内存高效且高质量的微调。
链接: https://arxiv.org/abs/2507.10029
作者: Seokeon Choi,Sunghyun Park,Hyoungwoo Park,Jeongho Kim,Sungrack Yun
机构: Qualcomm AI Research(高通人工智能研究); Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Memory-efficient personalization is critical for adapting text-to-image diffusion models while preserving user privacy and operating within the limited computational resources of edge devices. To this end, we propose a selective optimization framework that adaptively chooses between backpropagation on low-resolution images (BP-low) and zeroth-order optimization on high-resolution images (ZO-high), guided by the characteristics of the diffusion process. As observed in our experiments, BP-low efficiently adapts the model to target-specific features, but suffers from structural distortions due to resolution mismatch. Conversely, ZO-high refines high-resolution details with minimal memory overhead but faces slow convergence when applied without prior adaptation. By complementing both methods, our framework leverages BP-low for effective personalization while using ZO-high to maintain structural consistency, achieving memory-efficient and high-quality fine-tuning. To maximize the efficacy of both BP-low and ZO-high, we introduce a timestep-aware probabilistic function that dynamically selects the appropriate optimization strategy based on diffusion timesteps. This function mitigates the overfitting from BP-low at high timesteps, where structural information is critical, while ensuring ZO-high is applied more effectively as training progresses. Experimental results demonstrate that our method achieves competitive performance while significantly reducing memory consumption, enabling scalable, high-quality on-device personalization without increasing inference latency.
zh
[CV-57] (Almost) Free Modality Stitching of Foundation Models
【速读】:该论文试图解决在构建多模态基础模型时,如何高效选择最优的单模态模型组合并训练对应的连接模块这一计算密集型问题。解决方案的关键在于提出Hypernetwork Model Alignment (Hyma),该方法利用超网络(hypernetwork)的参数预测能力,为N×M种单模态模型组合联合训练连接模块,从而显著降低最优模型对搜索的成本。
链接: https://arxiv.org/abs/2507.10015
作者: Jaisidh Singh,Diganta Misra,Boris Knyazev,Antonio Orvieto
机构: University of Tübingen(图宾根大学); Zuse School ELIZA(祖泽学校ELIZA); ELLIS Institute Tübingen(图宾根ELLIS研究所); MPI-IS Tübingen(图宾根马克斯·普朗克智能系统研究所); SAIT AI Lab Montréal(蒙特利尔SAIT人工智能实验室); Tübingen AI Center(图宾根人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Pre-print
Abstract:Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for N \times M combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by 10\times (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
zh
[CV-58] Binomial Self-Compensation: Mechanism and Suppression of Motion Error in Phase-Shifting Profilometry
【速读】:该论文旨在解决相位移全息测量(Phase Shifting Profilometry, PSP)在动态测量中因物体运动导致的相位误差问题。其关键解决方案是提出一种图像序列二项式自补偿(Image-Sequential Binomial Self-Compensation, I-BSC)方法,通过加权求和同质条纹图像而非连续相位帧,从而在减少计算复杂度的同时有效抑制运动误差,并实现接近单帧拍摄的深度图帧率。
链接: https://arxiv.org/abs/2507.10009
作者: Geyou Zhang,Kai Liu,Ce Zhu
机构: University of Electronic Science and Technology of China(中国电子科技大学); Sichuan University(四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Phase shifting profilometry (PSP) is widely used in high-precision 3D scanning due to its high accuracy, robustness, and pixel-wise handling. However, a fundamental assumption of PSP that the object should remain static does not hold in dynamic measurement, making PSP susceptible to object motion. To address this challenge, our proposed solution, phase-sequential binomial self-compensation (P-BSC), sums successive motion-affected phase frames weighted by binomial coefficients. This approach exponentially reduces the motion error in a pixel-wise and frame-wise loopable manner. Despite its efficacy, P-BSC suffers from high computational overhead and error accumulation due to its reliance on multi-frame phase calculations and weighted summations. Inspired by P-BSC, we propose an image-sequential binomial self-compensation (I-BSC) to weight sum the homogeneous fringe images instead of successive phase frames, which generalizes the BSC concept from phase sequences to image sequences. I-BSC computes the arctangent function only once, resolving both limitations in P-BSC. Extensive analysis, simulations, and experiments show that 1) the proposed BSC outperforms existing methods in reducing motion error while achieving a quasi-single-shot frame rate, i.e., depth map frame rate equals to the camera’s acquisition rate, enabling 3D reconstruction with high pixel-depth-temporal resolution; 2) compared to P-BSC, our I-BSC reduces the computational complexity by one polynomial order, thereby accelerating the computational frame rate by several to dozen times, while also reaching faster motion error convergence.
zh
[CV-59] Vision-Based Anti Unmanned Aerial Technology: Opportunities and Challenges
【速读】:该论文试图解决在复杂环境中实现高效且精确的反无人机(Anti-UAV)跟踪问题。解决方案的关键在于结合计算机视觉技术,尤其是多传感器数据融合与先进检测和跟踪算法的集成,以提升在军事侦察、环境监测、物流等应用场景下的反无人机跟踪性能。
链接: https://arxiv.org/abs/2507.10006
作者: Guanghai Ding,Yihua Ren,Yuting Liu,Qijun Zhao,Shuiwang Li
机构: Guilin University of Technology(桂林理工大学); Guangxi Key Laboratory of Embedded Technology and Intelligent System(广西嵌入式技术与智能系统重点实验室); Northwestern Polytechnical University(西北工业大学); JD Logistic(京东物流); Sichuan University(四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid advancement of UAV technology and its extensive application in various fields such as military reconnaissance, environmental monitoring, and logistics, achieving efficient and accurate Anti-UAV tracking has become essential. The importance of Anti-UAV tracking is increasingly prominent, especially in scenarios such as public safety, border patrol, search and rescue, and agricultural monitoring, where operations in complex environments can provide enhanced security. Current mainstream Anti-UAV tracking technologies are primarily centered around computer vision techniques, particularly those that integrate multi-sensor data fusion with advanced detection and tracking algorithms. This paper first reviews the characteristics and current challenges of Anti-UAV detection and tracking technologies. Next, it investigates and compiles several publicly available datasets, providing accessible links to support researchers in efficiently addressing related challenges. Furthermore, the paper analyzes the major vision-based and vision-fusion-based Anti-UAV detection and tracking algorithms proposed in recent years. Finally, based on the above research, this paper outlines future research directions, aiming to provide valuable insights for advancing the field.
zh
[CV-60] Leverag ing Swin Transformer for enhanced diagnosis of Alzheimers disease using multi-shell diffusion MRI
【速读】:该论文旨在通过利用多壳扩散磁共振成像(dMRI)数据中的微结构信息,支持阿尔茨海默病的早期诊断和淀粉样蛋白积累的检测,其解决方案的关键在于采用基于视觉变压器(Swin Transformer)的深度学习框架。该方法通过从扩散张量成像(DTI)和神经_ORIENTATION_distribution和密度成像(NODDI)中提取关键指标,并将其投影到二维平面以实现与ImageNet预训练模型的迁移学习,同时结合低秩适应技术以在有限标注的神经影像数据下有效调整变压器模型。
链接: https://arxiv.org/abs/2507.09996
作者: Quentin Dessain,Nicolas Delinte,Bernard Hanseeuw,Laurence Dricot,Benoît Macq
机构: ICTEAM Institute, UCLouvain; Institute of Neuroscience (IoNS), UCLouvain
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
备注:
Abstract:Objective: This study aims to support early diagnosis of Alzheimer’s disease and detection of amyloid accumulation by leveraging the microstructural information available in multi-shell diffusion MRI (dMRI) data, using a vision transformer-based deep learning framework. Methods: We present a classification pipeline that employs the Swin Transformer, a hierarchical vision transformer model, on multi-shell dMRI data for the classification of Alzheimer’s disease and amyloid presence. Key metrics from DTI and NODDI were extracted and projected onto 2D planes to enable transfer learning with ImageNet-pretrained models. To efficiently adapt the transformer to limited labeled neuroimaging data, we integrated Low-Rank Adaptation. We assessed the framework on diagnostic group prediction (cognitively normal, mild cognitive impairment, Alzheimer’s disease dementia) and amyloid status classification. Results: The framework achieved competitive classification results within the scope of multi-shell dMRI-based features, with the best balanced accuracy of 95.2% for distinguishing cognitively normal individuals from those with Alzheimer’s disease dementia using NODDI metrics. For amyloid detection, it reached 77.2% balanced accuracy in distinguishing amyloid-positive mild cognitive impairment/Alzheimer’s disease dementia subjects from amyloid-negative cognitively normal subjects, and 67.9% for identifying amyloid-positive individuals among cognitively normal subjects. Grad-CAM-based explainability analysis identified clinically relevant brain regions, including the parahippocampal gyrus and hippocampus, as key contributors to model predictions. Conclusion: This study demonstrates the promise of diffusion MRI and transformer-based architectures for early detection of Alzheimer’s disease and amyloid pathology, supporting biomarker-driven diagnostics in data-limited biomedical settings. Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM) Cite as: arXiv:2507.09996 [cs.CV] (or arXiv:2507.09996v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.09996 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Quentin Dessain [view email] [v1] Mon, 14 Jul 2025 07:31:40 UTC (739 KB)
zh
[CV-61] 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving WACV2026
【速读】:该论文试图解决自动驾驶中基于摄像头的目标检测系统在真实环境中的对抗威胁问题,现有2D和3D物理攻击方法在优化纹理时难以平衡物理真实性和攻击鲁棒性。解决方案的关键在于提出一种基于3D高斯的对抗攻击(3DGAA),该方法利用3D高斯点云(3DGS)的完整14维参数化,联合优化几何与外观属性,在物理可实现的方式下生成具有物理真实性和可迁移性的对抗目标,同时引入物理过滤模块和物理增强模块以提升攻击在现实条件下的泛化能力。
链接: https://arxiv.org/abs/2507.09993
作者: Yixun Zhang,Lizhi Wang,Junjun Zhao,Wending Zhao,Feng Zhou,Yonghao Dang,Jianqin Yin
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to WACV 2026
Abstract:Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. While existing 2D and 3D physical attacks typically optimize texture, they often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module to preserve geometric fidelity, and a physical augmentation module to simulate complex physical scenarios, thus enhancing attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA achieves to reduce the detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks. These results validate 3DGAA as a practical attack framework for evaluating the safety of perception systems in autonomous driving.
zh
[CV-62] Latent Diffusion Models with Masked AutoEncoders
【速读】:该论文试图解决生成式 AI (Generative AI) 中潜在扩散模型(Latent Diffusion Models, LDMs)的自编码器设计问题,特别是其潜在空间平滑性、感知压缩质量和重建质量这三个关键属性未能同时满足的问题。解决方案的关键在于提出变分掩码自编码器(Variational Masked AutoEncoders, VMAEs),该方法利用了掩码自编码器保持的层次化特征,并将其集成到LDM框架中,形成带有掩码自编码器的潜在扩散模型(Latent Diffusion Models with Masked AutoEncoders, LDMAEs),从而显著提升了图像生成的质量和计算效率。
链接: https://arxiv.org/abs/2507.09984
作者: Junho Lee,Jeongwoo Shin,Hyungwook Choi,Joonseok Lee
机构: Seoul National University (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In spite of remarkable potential of the Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Through comprehensive experiments, we demonstrate significantly enhanced image generation quality and computational efficiency.
zh
[CV-63] Uncertainty Quantification for Incomplete Multi-View Data Using Divergence Measures
【速读】:该论文旨在解决多视图分类与聚类任务中由于数据噪声或损坏导致的多视图集成与最终决策可靠性问题。现有方法通常依赖Kullback-Leibler(KL)散度来估计网络预测的不确定性,但忽略了不同模态之间的领域差异。其解决方案的关键在于提出基于Hölder散度的KPHD-Net,通过变分Dirichlet分布表示类别概率分布,建模不同视图的证据,并结合Dempster-Shafer证据理论(DST)以提升不确定性估计效果,同时引入DST与卡尔曼滤波的融合机制,进一步增强最终融合结果的可靠性。
链接: https://arxiv.org/abs/2507.09980
作者: Zhipeng Xue,Yan Zhang,Ming Li,Chun Li,Yue Liu,Fei Yu
机构: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); MSU-BIT-SMBU Joint Research Center of Applied Mathematics, Shenzhen MSU-BIT University; School of Optics and Photonics, Beijing Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing multi-view classification and clustering methods typically improve task accuracy by leveraging and fusing information from different views. However, ensuring the reliability of multi-view integration and final decisions is crucial, particularly when dealing with noisy or corrupted data. Current methods often rely on Kullback-Leibler (KL) divergence to estimate uncertainty of network predictions, ignoring domain gaps between different modalities. To address this issue, KPHD-Net, based on Hölder divergence, is proposed for multi-view classification and clustering tasks. Generally, our KPHD-Net employs a variational Dirichlet distribution to represent class probability distributions, models evidences from different views, and then integrates it with Dempster-Shafer evidence theory (DST) to improve uncertainty estimation effects. Our theoretical analysis demonstrates that Proper Hölder divergence offers a more effective measure of distribution discrepancies, ensuring enhanced performance in multi-view learning. Moreover, Dempster-Shafer evidence theory, recognized for its superior performance in multi-view fusion tasks, is introduced and combined with the Kalman filter to provide future state estimations. This integration further enhances the reliability of the final fusion results. Extensive experiments show that the proposed KPHD-Net outperforms the current state-of-the-art methods in both classification and clustering tasks regarding accuracy, robustness, and reliability, with theoretical guarantees.
zh
[CV-64] 4D-MISR: A unified model for low-dose super-resolution imaging via feature fusion
【速读】:该论文试图解决电子显微镜在观察对电子束敏感的材料(如蛋白质和二维材料)时因辐射损伤导致的使用限制问题。解决方案的关键在于借鉴遥感领域多图像超分辨率(MISR)的原理,通过融合多个低分辨率、亚像素位移的视图,并利用集成合成多角度观测特征的卷积神经网络(CNN)进行重建,从而实现从超低剂量数据中获得原子尺度的超分辨率成像。
链接: https://arxiv.org/abs/2507.09953
作者: Zifei Wang,Zian Mao,Xiaoya He,Xi Huang,Haoran Zhang,Chun Cheng,Shufen Chu,Tingzheng Hou,Xiaoqin Zeng,Yujun Xie
机构: University of Notre Dame(圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While electron microscopy offers crucial atomic-resolution insights into structure-property relationships, radiation damage severely limits its use on beam-sensitive materials like proteins and 2D materials. To overcome this challenge, we push beyond the electron dose limits of conventional electron microscopy by adapting principles from multi-image super-resolution (MISR) that have been widely used in remote sensing. Our method fuses multiple low-resolution, sub-pixel-shifted views and enhances the reconstruction with a convolutional neural network (CNN) that integrates features from synthetic, multi-angle observations. We developed a dual-path, attention-guided network for 4D-STEM that achieves atomic-scale super-resolution from ultra-low-dose data. This provides robust atomic-scale visualization across amorphous, semi-crystalline, and crystalline beam-sensitive specimens. Systematic evaluations on representative materials demonstrate comparable spatial resolution to conventional ptychography under ultra-low-dose conditions. Our work expands the capabilities of 4D-STEM, offering a new and generalizable method for the structural analysis of radiation-vulnerable materials.
zh
[CV-65] Can GPT -4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis
【速读】:该论文试图解决时尚零售中产品属性识别的问题,特别是针对细粒度时尚属性的识别能力。现有研究较少探索大型语言模型(LLMs)在这一任务上的表现。论文的关键解决方案是通过零样本评估方法,对当前最先进的LLMs如GPT-4o-mini和Gemini 2.0 Flash进行性能测试,并利用DeepFashion-MultiModal数据集在18个时尚属性类别上进行评估,以分析其在产品属性任务中的表现及局限性。
链接: https://arxiv.org/abs/2507.09950
作者: Shubham Shukla,Kunal Sonalkar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures
Abstract:The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products depending on the business process. Quality attribution improves the customer experience as they navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the ‘discovery experience’ of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (this https URL) to evaluate these models in the attribution tasks of fashion products. Our study evaluates these models across 18 categories of fashion attributes, offering insight into where these models excel. We only use images as the sole input for product information to create a constrained environment. Our analysis shows that Gemini 2.0 Flash demonstrates the strongest overall performance with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution-related tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
zh
[CV-66] ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
【速读】:该论文旨在解决密集音频-视觉事件定位(DAVE)中存在的时间边界定位不准确以及跨模态语义鸿沟的问题,同时解决事件间相关性建模不足导致的复杂场景下并发事件推理能力有限的问题。其解决方案的关键在于引入多阶段语义引导和多事件关系建模,分别实现对音频-视觉事件的层次化语义理解以及事件依赖关系的自适应提取,具体通过事件感知语义引导网络(ESG-Net)中的早期语义交互(ESI)模块和依赖专家混合(MoDE)模块来实现。
链接: https://arxiv.org/abs/2507.09945
作者: Huilai Li,Yonghao Dang,Ying Xing,Yiming Wang,Jianqin Yin
机构: Beijing University of Posts and Telecommunications(北京邮电大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dense audio-visual event localization (DAVE) aims to identify event categories and locate the temporal boundaries in untrimmed videos. Most studies only employ event-related semantic constraints on the final outputs, lacking cross-modal semantic bridging in intermediate layers. This causes modality semantic gap for further fusion, making it difficult to distinguish between event-related content and irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model to infer concurrent events among complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our eventaware semantic guided network (ESG-Net) includes a early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applys multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixture of experts with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released on this https URL.
zh
[CV-67] Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
【速读】:该论文旨在解决数据稀缺场景下模型过拟合与数据集不平衡问题,从而影响有效检测与分割性能。现有方法虽采用生成模型合成更多训练样本,但生成样本往往重复或简单,无法提供针对下游模型弱点的“关键信息”,且通常需要为不同对象单独训练,导致计算效率低下。该论文提出的解决方案——Crucial-Diff框架,其关键在于集成两个核心模块:场景无关特征提取器(SAFE)和弱点感知样本挖掘器(WASM),其中SAFE通过统一特征提取器捕捉目标信息,WASM则利用下游模型检测结果的反馈生成难以检测的样本,并将其与SAFE模块输出融合,从而生成多样且高质量的训练数据。
链接: https://arxiv.org/abs/2507.09915
作者: Siyue Yao,Mingjie Sun,Eng Gee Lim,Ran Yi,Baojiang Zhong,Moncef Gabbouj
机构: Suzhou Jiaotong University(苏州交通大学); Soochow University(苏州大学); Shanghai Jiao Tong University(上海交通大学); Tampere University(坦佩雷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The scarcity of data in various scenarios, such as medical, industry and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ the generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide “crucial information” that targets the downstream model’s weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of downstream model, which is then fused with the output of SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code will be released after acceptance.
zh
[CV-68] IGD: Instructional Graphic Design with Multimodal Layer Generation ICCV2025
【速读】:该论文试图解决传统图形设计方法在创意性和智能性上的不足,以及现有基于扩散模型的图形设计方法在可编辑性和视觉文本可读性方面的缺陷。解决方案的关键在于提出Instructional Graphic Designer (IGD),其通过结合参数化渲染和图像素材生成的新范式,实现仅需自然语言指令即可快速生成具有可编辑灵活性的多模态图层,并利用多模态大语言模型(MLLM)进行属性预测、图层排序与布局,同时采用扩散模型生成图像内容,从而支持复杂图形设计任务的可扩展性和可扩展性。
链接: https://arxiv.org/abs/2507.09910
作者: Yadong Qu,Shancheng Fang,Yuxin Wang,Xiaorui Wang,Zhineng Chen,Hongtao Xie,Yongdong Zhang
机构: University of Science and Technology of China (中国科学技术大学); YuanShi Technology (元始科技); Institute of Trustworthy Embodied AI, Fudan University (复旦大学可信具身人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate multimodal layers with editable flexibility with only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. The superior experimental results demonstrate that IGD offers a new solution for graphic design.
zh
[CV-69] Measuring the Impact of Rotation Equivariance on Aerial Object Detection ICCV2025
【速读】:该论文旨在解决航空图像目标检测中旋转等变性(rotation equivariance)的重要性及其对检测性能的影响问题。现有方法大多依赖数据增强或构造近似旋转等变的网络结构,但受限于下采样过程对严格旋转等变性的破坏,难以实现真正的旋转等变性。论文的关键解决方案是构建一个严格满足旋转等变性的主干网络和颈部网络,并引入多分支头网络以在减少参数量的同时提升检测精度,从而提出了一种基于严格旋转等变性的单阶段检测器MessDet。
链接: https://arxiv.org/abs/2507.09896
作者: Xiuyu Wu,Xinhao Wang,Xiubin Zhu,Lan Yang,Jiyuan Liu,Xingchen Hu
机构: Xidian University(西安电子科技大学); National University of Defense Technology(国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Due to the arbitrary orientation of objects in aerial images, rotation equivariance is a critical property for aerial object detectors. However, recent studies on rotation-equivariant aerial object detection remain scarce. Most detectors rely on data augmentation to enable models to learn approximately rotation-equivariant features. A few detectors have constructed rotation-equivariant networks, but due to the breaking of strict rotation equivariance by typical downsampling processes, these networks only achieve approximately rotation-equivariant backbones. Whether strict rotation equivariance is necessary for aerial image object detection remains an open question. In this paper, we implement a strictly rotation-equivariant backbone and neck network with a more advanced network structure and compare it with approximately rotation-equivariant networks to quantitatively measure the impact of rotation equivariance on the performance of aerial image detectors. Additionally, leveraging the inherently grouped nature of rotation-equivariant features, we propose a multi-branch head network that reduces the parameter count while improving detection accuracy. Based on the aforementioned improvements, this study proposes the Multi-branch head rotation-equivariant single-stage Detector (MessDet), which achieves state-of-the-art performance on the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an exceptionally low parameter count.
zh
[CV-70] MCGA: Mixture of Codebooks Hyperspectral Reconstruction via Grayscale-Aware Attention
【速读】:该论文试图解决从RGB图像重建高光谱图像(HSI)的问题,旨在为各种基于视觉的应用提供一种成本效益高的解决方案。现有方法通常直接利用复杂的注意力机制学习RGB到HSI的映射,忽视了从低维到高维信息转换的固有挑战。该论文提出的解决方案关键在于采用两阶段方法MCGA:第一阶段通过多尺度VQ-VAE从异构HSI数据集中学习光谱模式,提取混合代码本(MoC);第二阶段通过查询MoC中的特征来替代潜在HSI表示,从而优化RGB到HSI的映射,融入先验知识而非强制高维变换。此外,引入了灰度感知注意力和量化自注意力机制,以适应高光谱重建需求,并提出了基于熵的测试时适应策略以提高实际场景下的鲁棒性。
链接: https://arxiv.org/abs/2507.09885
作者: Zhanjiang Yang,Lijun Sun,Jiawei Dong,Xiaoxin An,Yang Liu,Meng Li
机构: Shenzhen Technology University (深圳技术大学); Swansea University (斯旺西大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing hyperspectral images (HSI) from RGB images is a cost-effective solution for various vision-based applications. However, most existing learning-based hyperspectral reconstruction methods directly learn the RGB-to-HSI mapping using complex attention mechanisms, neglecting the inherent challenge of transitioning from low-dimensional to high-dimensional information. To address this limitation, we propose a two-stage approach, MCGA, which first learns spectral patterns before estimating the mapping. In the first stage, a multi-scale VQ-VAE learns representations from heterogeneous HSI datasets, extracting a Mixture of Codebooks (MoC). In the second stage, the RGB-to-HSI mapping is refined by querying features from the MoC to replace latent HSI representations, incorporating prior knowledge rather than forcing a direct high-dimensional transformation. To further enhance reconstruction quality, we introduce Grayscale-Aware Attention and Quantized Self-Attention, which adaptively adjust feature map intensities to meet hyperspectral reconstruction requirements. This physically motivated attention mechanism ensures lightweight and efficient HSI recovery. Moreover, we propose an entropy-based Test-Time Adaptation strategy to improve robustness in real-world scenarios. Extensive experiments demonstrate that our method, MCGA, achieves state-of-the-art performance. The code and models will be released at this https URL
zh
[CV-71] Counterfactual Visual Explanation via Causally-Guided Adversarial Steering
【速读】:该论文试图解决当前反事实视觉解释方法在生成反事实图像时忽视图像生成过程中的因果关系和虚假相关性,导致反事实图像出现意外变化,从而降低解释质量的问题。解决方案的关键在于引入一种名为CECAS的新框架,该框架首先利用基于因果引导的对抗方法生成反事实解释,并创新性地整合因果视角以避免在反事实样本中对虚假因素进行不必要的扰动。
链接: https://arxiv.org/abs/2507.09881
作者: Yiran Qiao,Disheng Liu,Yiren Lu,Yu Yin,Mengnan Du,Jing Ma
机构: Case Western Reserve University (凯斯西储大学); New Jersey Institute of Technology (新泽西理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent work on counterfactual visual explanations has contributed to making artificial intelligence models more explainable by providing visual perturbation to flip the prediction. However, these approaches neglect the causal relationships and the spurious correlations behind the image generation process, which often leads to unintended alterations in the counterfactual images and renders the explanations with limited quality. To address this challenge, we introduce a novel framework CECAS, which first leverages a causally-guided adversarial method to generate counterfactual explanations. It innovatively integrates a causal perspective to avoid unwanted perturbations on spurious factors in the counterfactuals. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches across multiple benchmark datasets and ultimately achieves a balanced trade-off among various aspects of validity, sparsity, proximity, and realism.
zh
[CV-72] OpenHuman4D: Open-Vocabulary 4D Human Parsing
【速读】:该论文旨在解决动态3D人体表征在虚拟和扩展现实应用中的挑战,特别是现有人体部件分割方法受限于闭集数据集依赖和较长的推理时间,从而限制了其适用性。其解决方案的关键在于提出首个4D人体解析框架,通过减少推理时间和引入开放词汇能力来同时应对上述问题,核心创新包括基于掩码的视频目标跟踪以建立时空对应关系、设计用于新目标识别和缓解跟踪失败的掩码验证模块,以及结合记忆条件注意力和逻辑值均衡的4D掩码融合模块。
链接: https://arxiv.org/abs/2507.09880
作者: Keito Suzuki,Bang Du,Runfa Blark Li,Kunyao Chen,Lei Wang,Peng Liu,Ning Bi,Truong Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding dynamic 3D human representation has become increasingly critical in virtual and extended reality applications. However, existing human part segmentation methods are constrained by reliance on closed-set datasets and prolonged inference times, which significantly restrict their applicability. In this paper, we introduce the first 4D human parsing framework that simultaneously addresses these challenges by reducing the inference time and introducing open-vocabulary capabilities. Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends the support to 4D human-centric video with three key innovations: 1) We adopt mask-based video object tracking to efficiently establish spatial and temporal correspondences, avoiding the necessity of segmenting all frames. 2) A novel Mask Validation module is designed to manage new target identification and mitigate tracking failures. 3) We propose a 4D Mask Fusion module, integrating memory-conditioned attention and logits equalization for robust embedding fusion. Extensive experiments demonstrate the effectiveness and flexibility of the proposed method on 4D human-centric parsing tasks, achieving up to 93.3% acceleration compared to the previous state-of-the-art method, which was limited to parsing fixed classes.
zh
[CV-73] SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
【速读】:该论文旨在解决音频-视觉双人交互虚拟人类生成这一新兴领域中的数据与基准问题。其关键解决方案是构建了SpeakerVid-5M数据集,这是首个针对音频-视觉双人交互虚拟人类生成的大规模、高质量数据集,包含超过8,743小时的视频片段,覆盖多种交互类型,并通过交互类型和数据质量两个维度进行结构化划分,以支持广泛的2D虚拟人类任务。同时,论文还提供了基于自回归(AR)的视频聊天基线模型及相应的评估指标和测试数据,作为未来研究的基准 VidChatBench。
链接: https://arxiv.org/abs/2507.09862
作者: Youliang Zhang,Zhaoyang Li,Duomin Wang,Jiahe Zhang,Deyu Zhou,Zixin Yin,Xili Dai,Gang Yu,Xiu Li
机构: Tsinghua University (清华大学); StepFun; The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:
Abstract:The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: this https URL
zh
[CV-74] A Survey on MLLM -based Visually Rich Document Understanding: Methods Challenges and Emerging Trends
【速读】:该论文旨在解决视觉丰富文档理解(Visually-Rich Document Understanding, VRDU)中如何有效融合文本、视觉和版式特征,并提升多模态大语言模型(Multimodal Large Language Models, MLLMs)在文档信息提取与解释任务中的性能问题。其解决方案的关键在于三个方面:一是设计高效的特征编码与融合方法,以捕捉文档中的多模态信息;二是探索有效的训练范式,包括预训练策略、指令-响应微调以及不同模型模块的可训练性;三是构建适用于预训练、指令微调和监督微调的高质量数据集。这些关键要素共同推动了VRDU系统的效率、泛化能力和鲁棒性的提升。
链接: https://arxiv.org/abs/2507.09861
作者: Yihao Ding,Siwen Luo,Yue Dai,Yanbei Jiang,Zechuan Li,Geoffrey Martin,Yifan Peng
机构: The University of Western Australia(西澳大学); The University of Melbourne(墨尔本大学); Cornell University(康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Work in progress
Abstract:Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. Recently, Multimodal Large Language Models (MLLMs) have shown remarkable potential in this domain, leveraging both Optical Character Recognition (OCR)-dependent and OCR-free frameworks to extract and interpret information in document images. This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components: (1) methods for encoding and fusing textual, visual, and layout features; (2) training paradigms, including pretraining strategies, instruction-response tuning, and the trainability of different model modules; and (3) datasets utilized for pretraining, instruction-tuning, and supervised fine-tuning. Finally, we discuss the challenges and opportunities in this evolving field and propose future directions to advance the efficiency, generalizability, and robustness of VRDU systems.
zh
[CV-75] Hierarchical Abstraction Enables Human-Like 3D Object Recognition in Deep Learning Models
【速读】:该论文试图解决的问题是:尽管深度学习模型在从3D形状中识别物体方面表现出接近人类的性能,但其是否形成了与人类视觉用于物体识别相似的3D形状表征仍不明确。论文提出的解决方案关键在于使用基于视觉变换器(point transformer)的模型,该模型通过支持3D形状的分层抽象机制,能够更好地解释人类在处理3D点云数据时的表现。
链接: https://arxiv.org/abs/2507.09830
作者: Shuhao Fu,Philip J. Kellman,Hongjing Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Both humans and deep learning models can recognize objects from 3D shapes depicted with sparse visual information, such as a set of points randomly sampled from the surfaces of 3D objects (termed a point cloud). Although deep learning models achieve human-like performance in recognizing objects from 3D shapes, it remains unclear whether these models develop 3D shape representations similar to those used by human vision for object recognition. We hypothesize that training with 3D shapes enables models to form representations of local geometric structures in 3D shapes. However, their representations of global 3D object shapes may be limited. We conducted two human experiments systematically manipulating point density and object orientation (Experiment 1), and local geometric structure (Experiment 2). Humans consistently performed well across all experimental conditions. We compared two types of deep learning models, one based on a convolutional neural network (DGCNN) and the other on visual transformers (point transformer), with human performance. We found that the point transformer model provided a better account of human performance than the convolution-based model. The advantage mainly results from the mechanism in the point transformer model that supports hierarchical abstraction of 3D shapes.
zh
[CV-76] VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding
【速读】:该论文试图解决自动驾驶系统中对弱势道路使用者(Vulnerable Road Users, VRUs)安全性的保障问题,特别是在高风险交通场景下,如何量化评估多模态大语言模型(Multimodal Large Language Models, MLLMs)的推理能力。解决方案的关键在于提出VRU-Accident,这是一个大规模的视觉-语言基准,包含1K真实交通事故视频、6K多选问答对以及1K密集场景描述,涵盖六个安全关键类别,提供丰富的细粒度标注,以捕捉事故的空间-时间动态和因果语义,从而为MLLMs在复杂安全关键场景中的性能评估提供标准化工具。
链接: https://arxiv.org/abs/2507.09815
作者: Younggun Kim,Ahmed S. Abdelrahman,Mohamed Abdel-Aty
机构: University of Central Florida (佛罗里达中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 11 figures, 5 tables
Abstract:Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal consequences. While multimodal large language models (MLLMs) have shown promise in enhancing scene understanding and decision making in autonomous vehicles, there is currently no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical scenarios involving VRUs. To address this gap, we present VRU-Accident, a large-scale vision-language benchmark designed to evaluate MLLMs in high-risk traffic scenarios involving VRUs. VRU-Accident comprises 1K real-world dashcam accident videos, annotated with 6K multiple-choice question-answer pairs across six safety-critical categories (with 24K candidate options and 3.4K unique answer choices), as well as 1K dense scene descriptions. Unlike prior works, our benchmark focuses explicitly on VRU-vehicle accidents, providing rich, fine-grained annotations that capture both spatial-temporal dynamics and causal semantics of accidents. To assess the current landscape of MLLMs, we conduct a comprehensive evaluation of 17 state-of-the-art models on the multiple-choice VQA task and on the dense captioning task. Our findings reveal that while MLLMs perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability.
zh
[CV-77] NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection ICCV2025
【速读】:该论文试图解决基于负标签的零样本异常检测方法在检测分布内样本时误将其识别为分布外(OOD)样本的问题,以及在处理同时匹配多个分布内和负标签的图像时存在的局限性。解决方案的关键在于提出NegRefine框架,通过引入过滤机制从负标签集中排除子类别标签和专有名词,并结合多匹配感知的评分函数,动态调整与图像匹配的多个标签的贡献,从而实现更稳健的分布内与OOD样本分离。
链接: https://arxiv.org/abs/2507.09795
作者: Amirhossein Ansari,Ke Wang,Pulei Xiong
机构: Simon Fraser University (西蒙 Fraser大学); National Research Council Canada (加拿大国家研究委员会)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICCV 2025
Abstract:Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. Source code is available at this https URL.
zh
[CV-78] CADmium: Fine-Tuning Code Language Models for Text-Driven Sequential CAD Design
【速读】:该论文试图解决计算机辅助设计(CAD)建模过程中手动且耗时的问题,旨在通过自动化手段提升CAD设计效率。其解决方案的关键在于利用大规模语言模型(LLMs)的能力,结合一个包含超过170k个标注CAD模型的大型数据集,该数据集通过基于GPT-4.1的管道生成高质量、类人描述。在此基础上,对强大的代码-语言模型(code-LLMs)进行微调,以从自然语言描述中生成结构化的CAD序列,从而实现文本条件下的CAD生成。
链接: https://arxiv.org/abs/2507.09792
作者: Prashant Govindarajan,Davide Baldelli,Jay Pathak,Quentin Fournier,Sarath Chandar
机构: Mila – Quebec AI Institute (Mila – 魁北克人工智能研究所); Polytechnique Montréal (蒙特利尔理工学院); Ansys (安斯赛); Chandar Research Lab (查恩德研究实验室)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computer-aided design (CAD) is the digital construction of 2D and 3D objects, and is central to a wide range of engineering and manufacturing applications like automobile and aviation. Despite its importance, CAD modeling remains largely a time-intensive, manual task. Recent works have attempted to automate this process with small transformer-based models and handcrafted CAD sequence representations. However, there has been little effort to leverage the potential of large language models (LLMs) for sequential CAD design. In this work, we introduce a new large-scale dataset of more than 170k CAD models annotated with high-quality, human-like descriptions generated with our pipeline based on GPT-4.1. Using this dataset, we fine-tune powerful code-LLMs to generate CAD sequences represented in a JSON-based format from natural language descriptions, demonstrating the viability and effectiveness of this approach for text-conditioned CAD generation. Because simple metrics often fail to reflect the quality of generated objects, we introduce geometric and topological metrics based on sphericity, mean curvature, and Euler characteristic to provide richer structural insights. Our experiments and ablation studies on both synthetic and human-annotated data demonstrate that CADmium is able to automate CAD design, drastically speeding up the design of new objects. The dataset, code, and fine-tuned models are available online.
zh
[CV-79] Pairwise Alignment Compatibility for Arbitrarily Irregular Image Frag ments
【速读】:该论文旨在解决现实生活中复杂几何属性的碎片在拼图重建算法中的配对兼容性计算问题,传统方法往往无法有效处理此类问题或依赖于受限的碎片形状假设。其解决方案的关键在于提出一种高效的混合(几何与图像)方法,用于计算碎片对的最优对齐,无需对碎片的形状、尺寸或图像内容做出任何假设。
链接: https://arxiv.org/abs/2507.09767
作者: Ofir Itzhak Shahar,Gur Elkin,Ohad Ben-Shahar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pairwise compatibility calculation is at the core of most fragments-reconstruction algorithms, in particular those designed to solve different types of the jigsaw puzzle problem. However, most existing approaches fail, or aren’t designed to deal with fragments of realistic geometric properties one encounters in real-life puzzles. And in all other cases, compatibility methods rely strongly on the restricted shapes of the fragments. In this paper, we propose an efficient hybrid (geometric and pictorial) approach for computing the optimal alignment for pairs of fragments, without any assumptions about their shapes, dimensions, or pictorial content. We introduce a new image fragments dataset generated via a novel method for image fragmentation and a formal erosion model that mimics real-world archaeological erosion, along with evaluation metrics for the compatibility task. We then embed our proposed compatibility into an archaeological puzzle-solving framework and demonstrate state-of-the-art neighborhood-level precision and recall on the RePAIR 2D dataset, directly reflecting compatibility performance improvements.
zh
[CV-80] Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation ICCV2025
【速读】:该论文旨在解决基于预训练2D扩散模型的文本到3D生成中,变分得分蒸馏(VSD)方法在实际应用中可能面临的收敛缓慢和病态问题。其解决方案的关键在于发现并调整引入的得分模型与3D模型之间的优化顺序,通过让得分模型前瞻性地考虑当前3D状态,从而实现更合理的梯度修正。为进一步提升稳定性,论文提出使用线性化变体进行得分蒸馏,即L²-VSD,该方法可高效实现,并在多个实验中验证了其优于现有方法的性能。
链接: https://arxiv.org/abs/2507.09748
作者: Yu Lei,Bingde Liu,Qingsong Xie,Haonan Lu,Zhijie Deng
机构: Shanghai Jiao Tong University (上海交通大学); OPPO AI Center (OPPO人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite the theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatching problem between LoRA and 3D distributions in practical implementation. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state and hence yields more reasonable corrections. Nevertheless, naive lookahead VSD may suffer from unstable training in practice due to the potential over-fitting. To address this, we propose to use a linearized variant of the model for score distillation, giving rise to the Linearized Lookahead Variational Score Distillation ( L^2 -VSD). L^2 -VSD can be realized efficiently with forward-mode autodiff functionalities of existing deep learning libraries. Extensive experiments validate the efficacy of L^2 -VSD, revealing its clear superiority over prior score distillation-based methods. We also show that our method can be seamlessly incorporated into any other VSD-based text-to-3D framework.
zh
[CV-81] Universal Physics Simulation: A Foundational Diffusion Approach
【速读】:该论文试图解决传统物理模拟方法在通用性和物理规律发现上的局限性问题,特别是传统物理信息神经网络(PINNs)和有限差分方法需要显式数学公式化控制方程,从而限制了其泛化能力和新物理规律的发现潜力。解决方案的关键在于提出一种基于草图引导的扩散变换器方法,将计算物理重新定义为条件生成问题,通过空间边界条件引导生成物理上准确的稳态解,利用增强的扩散变换器架构与新颖的空间关系编码技术,实现从边界到平衡状态的直接映射,并在无需先验物理编码的情况下达到SSIM 0.8的电磁场生成性能。
链接: https://arxiv.org/abs/2507.09733
作者: Bradley Camburn
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures. Foundational AI model for universal physics simulation using sketch-guided diffusion transformers. Achieves SSIM 0.8 on electromagnetic field generation without requiring a priori physics encoding
Abstract:We present the first foundational AI model for universal physics simulation that learns physical laws directly from boundary-condition data without requiring a priori equation encoding. Traditional physics-informed neural networks (PINNs) and finite-difference methods necessitate explicit mathematical formulation of governing equations, fundamentally limiting their generalizability and discovery potential. Our sketch-guided diffusion transformer approach reimagines computational physics by treating simulation as a conditional generation problem, where spatial boundary conditions guide the synthesis of physically accurate steady-state solutions. By leveraging enhanced diffusion transformer architectures with novel spatial relationship encoding, our model achieves direct boundary-to-equilibrium mapping and is generalizable to diverse physics domains. Unlike sequential time-stepping methods that accumulate errors over iterations, our approach bypasses temporal integration entirely, directly generating steady-state solutions with SSIM 0.8 while maintaining sub-pixel boundary accuracy. Our data-informed approach enables physics discovery through learned representations analyzable via Layer-wise Relevance Propagation (LRP), revealing emergent physical relationships without predetermined mathematical constraints. This work represents a paradigm shift from AI-accelerated physics to AI-discovered physics, establishing the first truly universal physics simulation framework. Comments: 10 pages, 3 figures. Foundational AI model for universal physics simulation using sketch-guided diffusion transformers. Achieves SSIM 0.8 on electromagnetic field generation without requiring a priori physics encoding Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) MSC classes: 68T07, 65M06, 78M34 ACMclasses: I.2.6; I.4.8; J.2 Cite as: arXiv:2507.09733 [cs.LG] (or arXiv:2507.09733v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.09733 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-82] Visual Homing in Outdoor Robots Using Mushroom Body Circuits and Learning Walks
【速读】:该论文旨在解决自主导航中的视觉归巢问题,即如何在有限的感官输入和少量学习步态的情况下实现鲁棒的定位与返回。其解决方案的关键在于首次在现实世界中实现了基于侧化蘑菇体(Mushroom Body, MB)架构的视觉归巢系统,通过利用角向路径积分(PI)信号的符号将全景视图分类为“目标在左”和“目标在右”的记忆库,从而在自然户外环境中实现稳定归巢。此外,系统还引入了一个第五个MB输出神经元(MBON),用于编码目标视图以控制速度,实现了精确的到达目标行为。
链接: https://arxiv.org/abs/2507.09725
作者: Gabriel G. Gattaux,Julien R. Serres,Franck Ruffier,Antoine Wystrach
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Published by Springer Nature with the 14th bioinspired and biohybrid systems conference in Sheffield, and presented at the conference in July 2025
Abstract:Ants achieve robust visual homing with minimal sensory input and only a few learning walks, inspiring biomimetic solutions for autonomous navigation. While Mushroom Body (MB) models have been used in robotic route following, they have not yet been applied to visual homing. We present the first real-world implementation of a lateralized MB architecture for visual homing onboard a compact autonomous car-like robot. We test whether the sign of the angular path integration (PI) signal can categorize panoramic views, acquired during learning walks and encoded in the MB, into “goal on the left” and “goal on the right” memory banks, enabling robust homing in natural outdoor settings. We validate this approach through four incremental experiments: (1) simulation showing attractor-like nest dynamics; (2) real-world homing after decoupled learning walks, producing nest search behavior; (3) homing after random walks using noisy PI emulated with GPS-RTK; and (4) precise stopping-at-the-goal behavior enabled by a fifth MB Output Neuron (MBON) encoding goal-views to control velocity. This mimics the accurate homing behavior of ants and functionally resembles waypoint-based position control in robotics, despite relying solely on visual input. Operating at 8 Hz on a Raspberry Pi 4 with 32x32 pixel views and a memory footprint under 9 kB, our system offers a biologically grounded, resource-efficient solution for autonomous visual homing.
zh
[CV-83] oken Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI
【速读】:该论文试图解决当前Token compression技术在Vision Transformer(ViT)中的两个关键问题:缺乏统一的分类与比较框架,以及现有方法在结构压缩的Transformer上的有效性尚未明确。其解决方案的关键在于构建首个系统性的分类体系,并对标准和紧凑型ViT架构进行实验评估,以揭示token压缩方法在不同设计下的性能差异,从而为未来在边缘AI和AI代理应用中适配token优化技术提供依据。
链接: https://arxiv.org/abs/2507.09702
作者: Phat Nguyen,Ngai-Man Cheung
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Due to the quadratic computational complexity with respect to the token sequence length, these methods aim to remove less informative tokens before the attention layers to improve inference throughput. While numerous studies have explored various accuracy-efficiency trade-offs on large-scale ViTs, two critical gaps remain. First, there is a lack of unified survey that systematically categorizes and compares token compression approaches based on their core strategies (e.g., pruning, merging, or hybrid) and deployment settings (e.g., fine-tuning vs. plug-in). Second, most benchmarks are limited to standard ViT models (e.g., ViT-B, ViT-L), leaving open the question of whether such methods remain effective when applied to structurally compressed transformers, which are increasingly deployed on resource-constrained edge devices. To address these gaps, we present the first systematic taxonomy and comparative study of token compression methods, and we evaluate representative techniques on both standard and compact ViT architectures. Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs. These findings not only provide practical insights but also pave the way for future research on adapting token optimization techniques to compact transformer-based networks for edge AI and AI agent applications.
zh
[CV-84] ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments ACM-MM2025
【速读】:该论文试图解决跨多学科科学实验的自动评论生成问题,旨在减少教师在准备实验说明时所需的时间和专业知识依赖。其解决方案的关键在于构建了首个针对实验评论生成的专用数据集\textitExpInstruct,并提出了ExpStar模型,该模型通过检索增强机制自适应地访问、评估和利用外部知识,从而实现细粒度且具有洞察力的实验评论生成。
链接: https://arxiv.org/abs/2507.09693
作者: Jiali Chen,Yujie Jia,Zihan Wu,Jinyu Yang,Jianpeng Chen,Xusen Hei,Jiayuan Xie,Yi Cai,Qing Li
机构: South China University of Technology (华南理工大学); The Hong Kong Polytechnic University (香港理工大学); Key Laboratory of Big Data and Intelligent Robot Ministry of Education (教育部大数据与智能机器人重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM 2025
Abstract:Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct \textitExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7\textitK step-level commentaries across 21 scientific subjects from 3 core disciplines (\ie, science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (\eg, chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.
zh
[CV-85] Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model
【速读】:该论文旨在解决高分辨率数字高程模型(Digital Elevation Model, DEM)生成的难题,特别是在缺乏足够高分辨率数据的情况下,如何准确估计全球范围内的绝对高程。其关键解决方案是引入基于提示的单目深度估计技术,并结合视觉Transformer编码器与LiDAR衍生的DEM进行微调,同时采用灵活的提示策略,实现了从低分辨率SRTM数据和高分辨率RGB影像中生成高精度、高分辨率的DEM,从而在100倍分辨率提升(30米至30厘米)上超越了现有方法。
链接: https://arxiv.org/abs/2507.09681
作者: Osher Rafaeli,Tal Svoray,Ariel Nahlieli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 18 pages
Abstract:High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, making its conversion to a seamless DEM restricted. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30-m to 30-cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: this https URL.
zh
[CV-86] VST-Pose: A Velocity-Integrated Spatiotem-poral Attention Network for Human WiFi Pose Estimation
【速读】:该论文试图解决在室内环境中进行连续、精确的人体姿态估计问题,尤其关注于非视觉方法的隐私保护与穿透性优势。其解决方案的关键在于提出一种名为VST-Pose的深度学习框架,该框架引入了ViSTA-Former,这是一种具有双流结构的时空注意力主干网络,能够分别捕捉人体关节之间的结构关系和时间依赖性。此外,通过集成速度建模分支,进一步提升了对细微人体运动的敏感度,从而增强了细粒度运动表征能力。
链接: https://arxiv.org/abs/2507.09672
作者: Xinyu Zhang,Zhonghao Ye,Jingwei Zhang,Xiang Tian,Zhisheng Liang,Shipeng Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, 8 tables. WiFi CSI, VST-Pose framework + ViSTA-Former dual-stream attention backbone. Code: this https URL
Abstract:WiFi-based human pose estimation has emerged as a promising non-visual alternative approaches due to its pene-trability and privacy advantages. This paper presents VST-Pose, a novel deep learning framework for accurate and continuous pose estimation using WiFi channel state information. The proposed method introduces ViSTA-Former, a spatiotemporal attention backbone with dual-stream architecture that adopts a dual-stream architecture to separately capture temporal dependencies and structural relationships among body joints. To enhance sensitivity to subtle human motions, a velocity modeling branch is integrated into the framework, which learns short-term keypoint dis-placement patterns and improves fine-grained motion representation. We construct a 2D pose dataset specifically designed for smart home care scenarios and demonstrate that our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. Further evaluation on the public MMFi dataset confirms the model’s robustness and effectiveness in 3D pose estimation tasks. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Our codes are available in this https URL.
zh
[CV-87] EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR IJCAI
【速读】:该论文旨在解决增强现实(AR)和虚拟现实(VR)中人机交互场景下的精准眼动估计问题,其核心挑战包括运动模糊、眼睑遮挡以及训练与测试域之间的差异。现有方法在这些情况下难以提取鲁棒特征,导致性能不佳。论文提出的解决方案是EyeSeg,其关键在于设计了一个不确定性感知的眼部分割框架,通过贝叶斯不确定性学习,在封闭集先验下对后验分布进行建模,从而量化分割不确定性,并利用不确定性分数对多个眼动估计结果进行加权融合,提升系统在复杂场景下的鲁棒性。
链接: https://arxiv.org/abs/2507.09649
作者: Zhengyuan Peng,Jianqing Xu,Shen Li,Jiazhen Ji,Yuge Huang,Jingyun Zhang,Jinmin Li,Shouhong Ding,Rizen Guo,Xin Tan,Lizhuang Ma
机构: Shanghai Jiao Tong University (上海交通大学); Tencent (腾讯); National University of Singapore (新加坡国立大学); Tsinghua University (清华大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IJCAI
Abstract:Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels and empirically outperforms existing methods in downstream tasks, such as gaze estimation. EyeSeg outputs an uncertainty score and the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves to be effective especially under motion blur, eyelid occlusion and cross-domain challenges. Moreover, empirical results suggest that EyeSeg achieves segmentation improvements of MIoU, E1, F1, and ACC surpassing previous approaches. The code is publicly available at this https URL.
zh
[CV-88] Disentanglement and Assessment of Shortcuts in Ophthalmological Retinal Imaging Exams
【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)筛查中因传统成像技术成本高、可及性差而带来的诊断局限性,同时关注人工智能(Artificial Intelligence, AI)模型在公平性和泛化能力方面的潜在问题。其解决方案的关键在于评估基于图像训练的模型在DR预测中的性能与公平性,并探索解耦(disentanglement)作为减轻偏见的技术手段的有效性。研究使用了多样化的mBRSET眼底数据集,对三种模型(ConvNeXt V2、DINOv2和Swin V2)进行了训练与评估,以分析解耦敏感属性(Sensitive Attributes, SAs)对DR预测的影响。
链接: https://arxiv.org/abs/2507.09640
作者: Leonor Fernandes,Tiago Gonçalves,João Matos,Luis Filipe Nakayama,Jaime S. Cardoso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages. Under review
Abstract:Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. While screening reduces the risk of blindness, traditional imaging is often costly and inaccessible. Artificial intelligence (AI) algorithms present a scalable diagnostic solution, but concerns regarding fairness and generalization persist. This work evaluates the fairness and performance of image-trained models in DR prediction, as well as the impact of disentanglement as a bias mitigation technique, using the diverse mBRSET fundus dataset. Three models, ConvNeXt V2, DINOv2, and Swin V2, were trained on macula images to predict DR and sensitive attributes (SAs) (e.g., age and gender/sex). Fairness was assessed between subgroups of SAs, and disentanglement was applied to reduce bias. All models achieved high DR prediction performance in diagnosing (up to 94% AUROC) and could reasonably predict age and gender/sex (91% and 77% AUROC, respectively). Fairness assessment suggests disparities, such as a 10% AUROC gap between age groups in DINOv2. Disentangling SAs from DR prediction had varying results, depending on the model selected. Disentanglement improved DINOv2 performance (2% AUROC gain), but led to performance drops in ConvNeXt V2 and Swin V2 (7% and 3%, respectively). These findings highlight the complexity of disentangling fine-grained features in fundus imaging and emphasize the importance of fairness in medical imaging AI to ensure equitable and reliable healthcare solutions.
zh
[CV-89] Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI
【速读】:该论文旨在解决急性中风(stroke)类型(包括缺血性、出血性和无中风)的早期准确诊断问题,特别是在急诊环境中,以提高患者预后。其解决方案的关键在于采用先进的深度学习模型MaxViT(一种状态最优的Vision Transformer)进行多类中风分类,并结合数据增强技术提升模型泛化能力与处理类别不平衡问题。此外,通过集成可解释人工智能(Explainable Artificial Intelligence, XAI)方法,特别是Grad-CAM++,提供模型决策的可视化解释,从而增强人工智能模型的透明度和临床可信度。
链接: https://arxiv.org/abs/2507.09630
作者: Shomukh Qari,Maha A. Thafar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 figures
Abstract:Stroke is one of the leading causes of death globally, making early and accurate diagnosis essential for improving patient outcomes, particularly in emergency settings where timely intervention is critical. CT scans are the key imaging modality because of their speed, accessibility, and cost-effectiveness. This study proposed an artificial intelligence framework for multiclass stroke classification (ischemic, hemorrhagic, and no stroke) using CT scan images from a dataset provided by the Republic of Turkey’s Ministry of Health. The proposed method adopted MaxViT, a state-of-the-art Vision Transformer, as the primary deep learning model for image-based stroke classification, with additional transformer variants (vision transformer, transformer-in-transformer, and ConvNext). To enhance model generalization and address class imbalance, we applied data augmentation techniques, including synthetic image generation. The MaxViT model trained with augmentation achieved the best performance, reaching an accuracy and F1-score of 98.00%, outperforming all other evaluated models and the baseline methods. The primary goal of this study was to distinguish between stroke types with high accuracy while addressing crucial issues of transparency and trust in artificial intelligence models. To achieve this, Explainable Artificial Intelligence (XAI) was integrated into the framework, particularly Grad-CAM++. It provides visual explanations of the model’s decisions by highlighting relevant stroke regions in the CT scans and establishing an accurate, interpretable, and clinically applicable solution for early stroke detection. This research contributed to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and enhancing access to timely and optimal stroke diagnosis in emergency departments, thereby saving more lives.
zh
[CV-90] Lightweight Deep Learning-Based Channel Estimation for RIS-Aided Extremely Large-Scale MIMO Systems on Resource-Limited Edge Devices
【速读】:该论文旨在解决超大规模多输入多输出(XL-MIMO)系统中级联信道估计的可扩展性与实际部署难题,这些问题由于天线和智能反射面(RIS)元素数量的增加而变得更加严峻,导致数据量激增、计算复杂度上升、硬件要求提高以及能耗增大。论文提出的解决方案关键在于设计一种轻量级深度学习框架,通过利用信道的空间相关性,引入基于块(patch)的训练机制,将输入维度降低至块级表示,同时保留关键信息,从而实现大规模系统的可扩展训练,显著提升估计精度并降低计算复杂度。
链接: https://arxiv.org/abs/2507.09627
作者: Muhammad Kamran Saeed,Ashfaq Khokhar,Shakil Ahmed
机构: Iowa State University (爱荷华州立大学)
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Next-generation wireless technologies such as 6G aim to meet demanding requirements such as ultra-high data rates, low latency, and enhanced connectivity. Extremely Large-Scale MIMO (XL-MIMO) and Reconfigurable Intelligent Surface (RIS) are key enablers, with XL-MIMO boosting spectral and energy efficiency through numerous antennas, and RIS offering dynamic control over the wireless environment via passive reflective elements. However, realizing their full potential depends on accurate Channel State Information (CSI). Recent advances in deep learning have facilitated efficient cascaded channel estimation. However, the scalability and practical deployment of existing estimation models in XL-MIMO systems remain limited. The growing number of antennas and RIS elements introduces a significant barrier to real-time and efficient channel estimation, drastically increasing data volume, escalating computational complexity, requiring advanced hardware, and resulting in substantial energy consumption. To address these challenges, we propose a lightweight deep learning framework for efficient cascaded channel estimation in XL-MIMO systems, designed to minimize computational complexity and make it suitable for deployment on resource-constrained edge devices. Using spatial correlations in the channel, we introduce a patch-based training mechanism that reduces the dimensionality of input to patch-level representations while preserving essential information, allowing scalable training for large-scale systems. Simulation results under diverse conditions demonstrate that our framework significantly improves estimation accuracy and reduces computational complexity, regardless of the increasing number of antennas and RIS elements in XL-MIMO systems.
zh
[CV-91] Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection
【速读】:该论文试图解决工业制造中异常检测任务因异常样本稀缺而导致的定位和分类效果受限的问题。其解决方案的关键在于提出一种基于区域引导的少样本异常图像-掩码对生成框架Generate Aligned Anomaly (GAA),该框架利用预训练潜在扩散模型的强大先验,通过Localized Concept Decomposition和Adaptive Multi-Round Anomaly Clustering等技术,实现语义一致、空间对齐且具有高真实感的异常数据生成。
链接: https://arxiv.org/abs/2507.09619
作者: Yilin Lu,Jianghang Lin,Linhuang Xie,Kai Zhao,Yansong Qu,Shengchuan Zhang,Liujuan Cao,Rongrong Ji
机构: Xiamen University (厦门大学); VIVO
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.
zh
[CV-92] MLoRQ: Bridging Low-Rank and Quantization for Transformer Compression
【速读】:该论文旨在解决在资源受限的边缘设备上部署基于Transformer的神经网络所带来的挑战,这一挑战通常通过低秩近似和混合精度量化等技术来应对。其解决方案的关键在于提出了一种名为Mixed Low-Rank and Quantization (MLoRQ)的新方法,该方法将低秩近似与量化技术相结合,并采用两阶段优化过程来确定每层的最佳位宽和秩分配,以满足预定义的内存约束。该方法包括层内优化和层间优化,同时可选地引入一种改进的自适应舍入技术以减少联合低秩近似和量化带来的压缩误差。
链接: https://arxiv.org/abs/2507.09616
作者: Ofir Gordon,Ariel Lapid,Elad Cohen,Yarden Yagil,Arnon Netzer,Hai Victor Habi
机构: Sony Semiconductor Israel(索尼半导体以色列)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deploying transformer-based neural networks on resource-constrained edge devices presents a significant challenge. This challenge is often addressed through various techniques, such as low-rank approximation and mixed-precision quantization. In this work, we introduce Mixed Low-Rank and Quantization (MLoRQ), a novel method that integrates both techniques. MLoRQ employs a two-stage optimization process to determine optimal bit-width and rank assignments for each layer, adhering to predefined memory constraints. This process includes: (i) an intra-layer optimization that identifies potentially optimal compression solutions out of all low-rank and quantization combinations; (ii) an inter-layer optimization that assigns bit-width precision and rank to each layer while ensuring the memory constraint is met. An optional final step applies a sequential optimization process using a modified adaptive rounding technique to mitigate compression-induced errors in joint low-rank approximation and quantization. The method is compatible and can be seamlessly integrated with most existing quantization algorithms. MLoRQ shows state-of-the-art results with up to 15% performance improvement, evaluated on Vision Transformers for image classification, object detection, and instance segmentation tasks.
zh
[CV-93] owards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score
【速读】:该论文旨在解决在细粒度分类任务中,基于视觉-语言模型(VLM)的无监督适应(Unsupervised Adaptation, UA)方法中存在的两个关键问题:一是依赖固定对齐分数无法捕捉细微的类别差异,二是使用计算成本高的伪标签策略限制了可扩展性。其解决方案的关键在于提出了一种名为细粒度对齐与交互优化(Fine-grained Alignment and Interaction Refinement, FAIR)的方法,通过动态对齐局部图像特征与描述性语言嵌入,并利用类描述锚点(Class Description Anchors, CDA)定义学习对齐分数(Learned Alignment Score, LAS),从而实现更准确、具有类别区分能力的伪标签生成,提升无监督适应性能。
链接: https://arxiv.org/abs/2507.09615
作者: Eman Ali,Sathira Silva,Chetan Arora,Muhammad Haris Khan
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); IIT Delhi (印度理工学院德里分校); Alexandria University (亚历山大大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.
zh
[CV-94] Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive ICCV2025
【速读】:该论文旨在解决交互式分割(Interactive Segmentation, IS)中存在的效率与精度之间的权衡问题:密集令牌方法虽然在精度和细节保留方面表现优异,但在CPU设备上处理速度极慢;而Segment Anything Model(SAM)虽通过稀疏提示令牌实现了快速推理,但牺牲了分割质量。论文提出的解决方案——Inter2Former,其关键在于优化密集令牌处理中的计算分配,具体包括四个核心改进:动态提示嵌入(Dynamic Prompt Embedding, DPE)、动态混合注意力(Dynamic Hybrid Attention, DHA)、混合专家(Hybrid Mixture of Experts, HMoE)以及动态局部上采样(Dynamic Local Upsampling, DLU),从而在保持高精度的同时显著提升CPU上的计算效率。
链接: https://arxiv.org/abs/2507.09612
作者: You Huang,Lichao Chen,Jiayi Ji,Liujuan Cao,Shengchuan Zhang,Rongrong Ji
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N2)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.
zh
[CV-95] Demystifying Flux Architecture
【速读】:该论文试图解决FLUX.1模型的架构与训练细节不透明的问题,从而支持其作为未来研究和开发基础的广泛应用。解决方案的关键在于通过逆向工程方法,从源代码中直接解析并揭示FLUX.1的架构特性,以弥补官方技术文档缺失所带来的信息不足。
链接: https://arxiv.org/abs/2507.09595
作者: Or Greenberg
机构: Hebrew University of Jerusalem (希伯来大学); General Motors R&D (通用汽车研发)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:FLUX.1 is a diffusion-based text-to-image generation model developed by Black Forest Labs, designed to achieve faithful text-image alignment while maintaining high image quality and diversity. FLUX is considered state-of-the-art in text-to-image generation, outperforming popular models such as Midjourney, DALL-E 3, Stable Diffusion 3 (SD3), and SDXL. Although publicly available as open source, the authors have not released official technical documentation detailing the model’s architecture or training setup. This report summarizes an extensive reverse-engineering effort aimed at demystifying FLUX’s architecture directly from its source code, to support its adoption as a backbone for future research and development. This document is an unofficial technical report and is not published or endorsed by the original developers or their affiliated institutions.
zh
[CV-96] Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation
【速读】:该论文试图解决手术视频分割中由于手术器械快速移动、频繁遮挡以及复杂的器械-组织交互导致的SAM2框架性能下降问题。解决方案的关键在于引入Memory Augmented (MA)-SAM2,这是一种无需训练的视频目标分割策略,其核心是新颖的上下文感知和遮挡鲁棒的记忆模型,能够在不增加额外参数或进行额外训练的情况下提升分割的准确性和鲁棒性。
链接: https://arxiv.org/abs/2507.09577
作者: Ming Yin,Fu Wang,Xujiong Ye,Yanda Meng,Zeyu Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical video segmentation is a critical task in computer-assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2’s greedy selection memory design are amplified by the unique properties of surgical videos-rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction-resulting in diminished performance in the segmentation of complex, long videos. To address these challenges, we introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy, featuring novel context-aware and occlusion-resilient memory models. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. Employing a multi-target, single-loop, one-prompt inference further enhances the efficiency of the tracking process in multi-instrument videos. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, demonstrating its potential for practical surgical applications.
zh
[CV-97] WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending
【速读】:该论文旨在解决艺术字体生成中交互性不足的问题,现有方法在局部编辑、迭代优化、多字符组合和开放式提示理解方面存在局限。其解决方案的关键在于引入WordCraft系统,该系统集成了扩散模型,并包含无需训练的区域注意力机制以实现多区域精确生成,以及噪声融合技术以支持连续优化而不损害视觉质量。此外,通过集成大语言模型解析和结构化用户提示,提升了生成过程的灵活性和意图驱动性。
链接: https://arxiv.org/abs/2507.09573
作者: Zhe Wang,Jingbo Zhang,Tianyi Wei,Wanchao Su,Can Wang
机构: Jiangxi University of Finance and Economics(江西财经大学); Tencent Robotics X Lab(腾讯机器人实验室); Nanyang Technological University(南洋理工大学); Monash University(莫纳什大学); Hong Kong University(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 16 figures
Abstract:Artistic typography aims to stylize input characters with visual effects that are both creative and legible. Traditional approaches rely heavily on manual design, while recent generative models, particularly diffusion-based methods, have enabled automated character stylization. However, existing solutions remain limited in interactivity, lacking support for localized edits, iterative refinement, multi-character composition, and open-ended prompt interpretation. We introduce WordCraft, an interactive artistic typography system that integrates diffusion models to address these limitations. WordCraft features a training-free regional attention mechanism for precise, multi-region generation and a noise blending that supports continuous refinement without compromising visual quality. To support flexible, intent-driven generation, we incorporate a large language model to parse and structure both concrete and abstract user prompts. These components allow our framework to synthesize high-quality, stylized typography across single- and multi-character inputs across multiple languages, supporting diverse user-centered workflows. Our system significantly enhances interactivity in artistic typography synthesis, opening up creative possibilities for artists and designers.
zh
[CV-98] Prompt Engineering in Segment Anything Model: Methodologies Applications and Emerging Challenges
【速读】:该论文试图解决生成式 AI (Generative AI) 在图像分割任务中,尤其是 Segment Anything Model (SAM) 及其变体中,提示工程(prompt engineering)技术的系统性研究与总结问题。其解决方案的关键在于提出一种结构化的框架,以系统组织和分析提示工程的方法学、应用场景及核心挑战,从而推动基础模型在分割任务中的进一步发展与应用。
链接: https://arxiv.org/abs/2507.09562
作者: Yidong Jiang
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The Segment Anything Model (SAM) has revolutionized image segmentation through its innovative prompt-based approach, yet the critical role of prompt engineering in its success remains underexplored. This paper presents the first comprehensive survey focusing specifically on prompt engineering techniques for SAM and its variants. We systematically organize and analyze the rapidly growing body of work in this emerging field, covering fundamental methodologies, practical applications, and key challenges. Our review reveals how prompt engineering has evolved from simple geometric inputs to sophisticated multimodal approaches, enabling SAM’s adaptation across diverse domains including medical imaging and remote sensing. We identify unique challenges in prompt optimization and discuss promising research directions. This survey fills an important gap in the literature by providing a structured framework for understanding and advancing prompt engineering in foundation models for segmentation.
zh
[CV-99] EHPE: A Segmented Architecture for Enhanced Hand Pose Estimation
【速读】:该论文旨在解决3D手部姿态估计中因误差累积导致远端关节(如指骨末节尖端,TIP)和腕关节预测不准确的问题,这些问题会引发姿态估计中的错位和伪影,从而降低整体重建质量。其解决方案的关键在于提出一种分段架构(EHPE),通过局部提取TIP和腕关节来缓解误差累积的影响,并在此基础上进一步减少所有关节的预测误差。该方法包含两个关键阶段:在TIP和腕关节提取阶段(TW-stage)中估计TIP和腕关节位置以提供初始准确的关节配置;在先验引导关节估计阶段(PG-stage)中使用双分支交互网络对剩余关节的位置进行精调。
链接: https://arxiv.org/abs/2507.09560
作者: Bolun Zheng,Xinjie Liu,Qianyu Zhang,Canjin Wang,Fangni Chen,Mingen Xu
机构: Hangzhou Dianzi University (杭州电子科技大学); Xinhua Zhiyun Technology Co., Ltd. (新华智云科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. The accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of Distal Phalanx Tip (TIP) and Wrist in predicting hand joints overall and often fail to account for the phenomenon of error accumulation for distal joints in gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in the pose estimation and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform local extraction of TIP and wrist, thus alleviating the effect of error accumulation on TIP prediction and further reduce the predictive errors for all joints on this basis. EHPE consists of two key stages: In the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration; In the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-arts performance. Code is available at this https URL.
zh
[CV-100] SeqCSIST: Sequential Closely-Spaced Infrared Small Target Unmixing
【速读】:该论文试图解决远距离紧密排列的红外小目标(Closely-Spaced Infrared Small Target, CSIST)在红外图像中呈现为混合点的问题,旨在通过子像素定位的方式检测所有目标。其解决方案的关键在于提出了一种基于时序的CSIST解混任务,并构建了SeqCSIST数据集及相应的工具包以支持该任务的研究。此外,论文还提出了Deformable Refinement Network (DeRefNet) 模型,引入了时序可变形特征对齐(Temporal Deformable Feature Alignment, TDFA)模块,实现帧间信息的自适应聚合,从而提升检测精度。
链接: https://arxiv.org/abs/2507.09556
作者: Ximeng Zhai,Bohan Xu,Yaohong Chen,Hao Wang,Kehua Guo,Yimian Dai
机构: XiôÇÖan Institute of Optics and Precision Mechanics, Chinese Academy of Sciences(西安光学精密机械研究所,中国科学院); Henan University of Technology(河南工业大学); Central South University(中南大学); Nankai University(南开大学); NKIARI(深圳福田NKIARI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TGRS
Abstract:Due to the limitation of the optical lens focal length and the resolution of the infrared detector, distant Closely-Spaced Infrared Small Target (CSIST) groups typically appear as mixing spots in the infrared image. In this paper, we propose a novel task, Sequential CSIST Unmixing, namely detecting all targets in the form of sub-pixel localization from a highly dense CSIST group. However, achieving such precise detection is an extremely difficult challenge. In addition, the lack of high-quality public datasets has also restricted the research progress. To this end, firstly, we contribute an open-source ecosystem, including SeqCSIST, a sequential benchmark dataset, and a toolkit that provides objective evaluation metrics for this special task, along with the implementation of 23 relevant methods. Furthermore, we propose the Deformable Refinement Network (DeRefNet), a model-driven deep learning framework that introduces a Temporal Deformable Feature Alignment (TDFA) module enabling adaptive inter-frame information aggregation. To the best of our knowledge, this work is the first endeavor to address the CSIST Unmixing task within a multi-frame paradigm. Experiments on the SeqCSIST dataset demonstrate that our method outperforms the state-of-the-art approaches with mean Average Precision (mAP) metric improved by 5.3%. Our dataset and toolkit are available from this https URL.
zh
[CV-101] DRPCA-Net: Make Robust PCA Great Again for Infrared Small Target Detection
【速读】:该论文旨在解决红外小目标检测中现有端到端卷积模型因追求性能而忽视可解释性、参数效率和泛化能力的问题,特别是这些模型通常忽略了红外小目标的内在稀疏性先验。解决方案的关键在于重新引入基于模型的鲁棒主成分分析(RPCA)框架,并提出动态RPCA网络(DRPCA-Net),该网络通过集成稀疏感知先验到可学习架构中,结合轻量级超网络实现动态展开机制,使模型能够根据输入场景自适应生成迭代参数,从而提升在不同背景下的鲁棒性和泛化能力。此外,设计了动态残差组模块以更好地捕捉背景中的上下文变化,提高低秩估计精度和小目标分离效果。
链接: https://arxiv.org/abs/2507.09541
作者: Zihao Xiong,Fei Zhou,Fengyi Wu,Shuai Yuan,Maixia Fu,Zhenming Peng,Jian Yang,Yimian Dai
机构: Henan University of Technology (河南工业大学); University of Electronic Science and Technology of China (电子科技大学); Xidian University (西安电子科技大学); Nankai University (南开大学); NKIARI (NKIARI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TGRS
Abstract:Infrared small target detection plays a vital role in remote sensing, industrial monitoring, and various civilian applications. Despite recent progress powered by deep learning, many end-to-end convolutional models tend to pursue performance by stacking increasingly complex architectures, often at the expense of interpretability, parameter efficiency, and generalization. These models typically overlook the intrinsic sparsity prior of infrared small targets–an essential cue that can be explicitly modeled for both performance and efficiency gains. To address this, we revisit the model-based paradigm of Robust Principal Component Analysis (RPCA) and propose Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture. Unlike conventional deep unfolding methods that rely on static, globally learned parameters, DRPCA-Net introduces a dynamic unfolding mechanism via a lightweight hypernetwork. This design enables the model to adaptively generate iteration-wise parameters conditioned on the input scene, thereby enhancing its robustness and generalization across diverse backgrounds. Furthermore, we design a Dynamic Residual Group (DRG) module to better capture contextual variations within the background, leading to more accurate low-rank estimation and improved separation of small targets. Extensive experiments on multiple public infrared datasets demonstrate that DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Code is available at this https URL.
zh
[CV-102] VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization
【速读】:该论文试图解决多模态大语言模型(MLLM)在处理密集文档时表现不佳以及依赖于随图像尺寸扩展的视觉分词方法导致计算和内存效率低的问题。其解决方案的关键在于引入VDInstruct,该模型通过将空间区域检测与语义特征提取分离,并采用内容感知的分词策略,根据文档复杂度生成相应数量的token,从而在保留关键结构的同时减少冗余token,提升了文档理解的效率和效果。
链接: https://arxiv.org/abs/2507.09531
作者: Son Nguyen,Giang Nguyen,Hung Dao,Thao Do,Daeyoung Kim
机构: KAIST(韩国科学技术院); Auburn University(奥本大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review
Abstract:Key Information Extraction (KIE) underpins the understanding of visual documents (e.g., receipts and contracts) by extracting precise semantic content and accurately capturing spatial structure. Yet existing multimodal large language models (MLLMs) often perform poorly on dense documents and rely on vision tokenization approaches that scale with image size, leading to redundant computation and memory inefficiency. To address these challenges, we introduce VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction. Central to our model is a content-aware tokenization strategy: rather than fragmenting the entire image uniformly, it generates tokens in proportion to document complexity, preserving critical structure while eliminating wasted tokens. Leveraging a three-stage training paradigm, our model achieves state-of-the-art (SOTA) results on KIE benchmarks, matching or exceeding the accuracy of leading approaches while reducing the number of image tokens by roughly 3.6x. In zero-shot evaluations, VDInstruct surpasses strong baselines-such as DocOwl 1.5-by +5.5 F1 points, highlighting its robustness to unseen documents. These findings show that content-aware tokenization combined with explicit layout modeling offers a promising direction forward for document understanding. Data, source code, and model weights will be made publicly available.
zh
[CV-103] When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training ICCV2025
【速读】:该论文试图解决无配对去雾方法中由于生成器的有限传输映射能力导致的性能受限问题,这限制了其在无配对训练范式中的有效性。解决方案的关键在于提出DehazeSB框架,该框架基于Schrödinger Bridge,通过最优传输(Optimal Transport, OT)理论直接连接模糊图像与清晰图像的分布,从而在更少步骤内实现从模糊到清晰图像的最优传输映射,生成高质量结果。
链接: https://arxiv.org/abs/2507.09524
作者: Yunwei Lan,Zhigao Cui,Xin Luo,Chang Liu,Nian Wang,Menglin Zhang,Yanzhao Su,Dong Liu
机构: Rocket Force University of Engineering (火箭军工程大学); MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China (教育部脑启发智能感知与认知重点实验室,中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025
Abstract:Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator’s limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality results. To ensure the consistency of structural information and details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method’s superiority. Code: this https URL.
zh
[CV-104] QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models ICML
【速读】:该论文试图解决基于状态空间模型(State Space Models, SSMs)的视觉主干网络在四方向扫描过程中存在的空间冗余问题,从而提升计算效率。解决方案的关键在于提出QuarterMap,这是一种后训练激活剪枝方法,通过在扫描前移除冗余的空间激活,并利用最近邻上采样恢复维度,从而在不重新训练模型的情况下提高吞吐量。
链接: https://arxiv.org/abs/2507.09514
作者: Tien-Yu Chi,Hung-Yueh Chiang,Diana Marculescu,Kai-Chiang Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by Efficient Systems for Foundation Models Workshop at the International Conference on Machine Learning (ICML) 2025
Abstract:State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.
zh
[CV-105] Online Micro-gesture Recognition Using Data Augmentation and Spatial-Temporal Attention
【速读】:该论文旨在解决微动作在线识别(Micro-gesture Online Recognition)问题,即在未剪辑视频中定位多个微动作实例的时间位置并识别其类别。与传统的时间动作检测相比,该任务更强调微动作类别的区分以及对每个实例起止时间的精确识别。由于微动作通常是自发的人类动作,其差异性大于其他类型的人类动作,因此更具挑战性。论文提出的解决方案的关键在于采用手工设计的数据增强和时空注意力机制,以提升模型对微动作分类和定位的准确性。
链接: https://arxiv.org/abs/2507.09512
作者: Pengyu Liu,Kun Li,Fei Wang,Yanyan Wei,Junhui She,Dan Guo
机构: Hefei University of Technology (合肥工业大学); Zhejiang University (浙江大学); Ministry of Education (教育部); Hefei Comprehensive National Science Center (合肥综合性国家科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
Abstract:In this paper, we introduce the latest solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track of the IJCAI 2025 MiGA Challenge. The Micro-gesture Online Recognition task is a highly challenging problem that aims to locate the temporal positions and recognize the categories of multiple micro-gesture instances in untrimmed videos. Compared to traditional temporal action detection, this task places greater emphasis on distinguishing between micro-gesture categories and precisely identifying the start and end times of each instance. Moreover, micro-gestures are typically spontaneous human actions, with greater differences than those found in other human actions. To address these challenges, we propose hand-crafted data augmentation and spatial-temporal attention to enhance the model’s ability to classify and localize micro-gestures more accurately. Our solution achieved an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. As a result, our method ranked first in the Micro-gesture Online Recognition track.
zh
[CV-106] Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations ACM-MM2025
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在无标注数据情况下面临分布偏移时,测试阶段适应(Test-Time Adaptation, TTA)性能下降的问题。其关键解决方案是提出一种可靠测试阶段适应方法(Reliable Test-time Adaptation, ReTA),该方法通过两种互补策略提升适应过程的可靠性:首先,引入一致性感知的熵重加权(Consistency-aware Entropy Reweighting, CER),利用预测一致性对熵进行加权,以提高缓存质量;其次,提出多样性驱动的分布校准(Diversity-driven Distribution Calibration, DDC),通过建模类别文本嵌入为多变量高斯分布,实现更具适应性的决策边界。
链接: https://arxiv.org/abs/2507.09500
作者: Yiwen Liang,Hui Chen,Yizhe Xiong,Zihan Zhou,Mengyao Lyu,Zijia Lin,Shuaicheng Niu,Sicheng Zhao,Jungong Han,Guiguang Ding
机构: Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 33rd ACM International Conference on Multimedia(ACM MM 2025)
Abstract:Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs’ performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under challenging real-world distribution shifts.
zh
[CV-107] SDTN and TRN: Adaptive Spectral-Spatial Feature Extraction for Hyperspectral Image Classification
【速读】:该论文旨在解决高光谱图像分类中面临的高维数据处理、光谱-空间冗余以及标记样本稀缺等问题,这些问题导致传统方法性能欠佳。其解决方案的关键在于提出自适应张量正则化网络(SDTN),通过结合张量分解与正则化机制,动态调整张量秩以实现针对数据复杂性的最优特征表示;在此基础上进一步构建了张量正则化网络(TRN),将SDTN提取的特征整合到轻量级网络中,从而在多尺度上捕捉光谱-空间特征,有效提升分类精度并显著降低计算复杂度。
链接: https://arxiv.org/abs/2507.09492
作者: Fuyin Ye,Erwen Yao,Jianyong Chen,Fengmei He,Junxiang Zhang,Lihao Ni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures
Abstract:Hyperspectral image classification plays a pivotal role in precision agriculture, providing accurate insights into crop health monitoring, disease detection, and soil analysis. However, traditional methods struggle with high-dimensional data, spectral-spatial redundancy, and the scarcity of labeled samples, often leading to suboptimal performance. To address these challenges, we propose the Self-Adaptive Tensor- Regularized Network (SDTN), which combines tensor decomposition with regularization mechanisms to dynamically adjust tensor ranks, ensuring optimal feature representation tailored to the complexity of the data. Building upon SDTN, we propose the Tensor-Regularized Network (TRN), which integrates the features extracted by SDTN into a lightweight network capable of capturing spectral-spatial features at multiple scales. This approach not only maintains high classification accuracy but also significantly reduces computational complexity, making the framework highly suitable for real-time deployment in resource-constrained environments. Experiments on PaviaU datasets demonstrate significant improvements in accuracy and reduced model parameters compared to state-of-the-art methods.
zh
[CV-108] GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
【速读】:该论文试图解决现有视频基准测试难以评估大型视觉-语言模型(LVLMs)是否能够进行深层次的视频思考,而非仅依赖于表面的帧级分析的问题。解决方案的关键在于引入GLIMPSE基准,该基准专门设计用于评估LVLMs是否能真正通过视频进行思考,其核心特点是强调超越静态图像线索的全面视频理解,所有问题均需观看完整视频并基于全视频上下文进行推理,而非仅依赖部分帧或文本信息。
链接: https://arxiv.org/abs/2507.09491
作者: Yiyang Zhou,Linjie Li,Shi Qiu,Zhengyuan Yang,Yuyang Zhao,Siwei Han,Yangfan He,Kangqi Li,Haonian Ji,Zihao Zhao,Haibo Tong,Lijuan Wang,Huaxiu Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures
Abstract:Existing video benchmarks often resemble image-based benchmarks, with question types like “What actions does the person perform throughout the video?” or “What color is the woman’s dress in the video?” For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context-this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.
zh
[CV-109] HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space
【速读】:该论文试图解决如何更高效地训练模型以捕捉和利用视觉-语义层次结构的问题。解决方案的关键在于提出一种名为Hyperbolic Masked Image and Distillation Network (HMID-Net)的新方法,该方法在双曲空间中集成Masked Image Modeling (MIM)和知识蒸馏技术,并引入了一种专门设计的蒸馏损失函数,以促进双曲空间中的有效知识迁移。
链接: https://arxiv.org/abs/2507.09487
作者: Changli Wang,Fang Yin,Jiafeng Liu,Rui Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual and semantic concepts are often structured in a hierarchical manner. For instance, textual concept `cat’ entails all images of cats. A recent study, MERU, successfully adapts multimodal learning techniques from Euclidean space to hyperbolic space, effectively capturing the visual-semantic hierarchy. However, a critical question remains: how can we more efficiently train a model to capture and leverage this hierarchy? In this paper, we propose the \textitHyperbolic Masked Image and Distillation Network (HMID-Net), a novel and efficient method that integrates Masked Image Modeling (MIM) and knowledge distillation techniques within hyperbolic space. To the best of our knowledge, this is the first approach to leverage MIM and knowledge distillation in hyperbolic space to train highly efficient models. In addition, we introduce a distillation loss function specifically designed to facilitate effective knowledge transfer in hyperbolic space. Our experiments demonstrate that MIM and knowledge distillation techniques in hyperbolic space can achieve the same remarkable success as in Euclidean space. Extensive evaluations show that our method excels across a wide range of downstream tasks, significantly outperforming existing models like MERU and CLIP in both image classification and retrieval.
zh
[CV-110] CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning
【速读】:该论文试图解决参数高效微调(PEFT)基础上的持续学习(CL)方法在面对误导性任务标识符(task-ids)时产生的决策模糊问题,这一问题源于独立训练的子模块之间的特征子空间错位。解决方案的关键在于提出Cross-subspace Knowledge Alignment and Aggregation (CKAA)框架,其核心创新包括:(1) 双层次知识对齐(DKA),通过跨子空间的类内特征分布对齐和特征模拟过程学习鲁棒的全局分类器,使模型在训练过程中能够区分正确与错误子空间的特征;(2) 基于任务置信度的适配器混合(TC-MoA),通过任务置信度分数自适应聚合相关子模块的任务特定知识,避免在误导性任务标识符预测中产生过度自信。
链接: https://arxiv.org/abs/2507.09471
作者: Lingfeng He,De Cheng,Zhiheng Ma,Huaijie Wang,Dingwen Zhang,Nannan Wang,Xinbo Gao
机构: Xidian University(西安电子科技大学); Shenzhen University of Advanced Technology(深圳先进技术大学); Northwestern Polytechnical University(西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recently, parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. They typically allocate a unique sub-module for learning each task, with a task recognizer to select the appropriate sub-modules for testing images. However, due to the feature subspace misalignment from independently trained sub-modules, these methods tend to produce ambiguous decisions under misleading task-ids. To address this, we propose Cross-subspace Knowledge Alignment and Aggregation (CKAA), a novel framework that enhances model robustness against misleading task-ids through two key innovations: (1) Dual-level Knowledge Alignment (DKA): By aligning intra-class feature distributions across different subspaces and learning a robust global classifier through a feature simulation process, DKA enables the model to distinguish features from both correct and incorrect subspaces during training. (2) Task-Confidence-guided Mixture of Adapters (TC-MoA): A robust inference scheme that adaptively aggregates task-specific knowledge from relevant sub-modules based on task-confidence scores, avoiding overconfidence in misleading task-id predictions. Extensive experiments demonstrate that CKAA outperforms existing PEFT-based CL methods.
zh
[CV-111] SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation
【速读】:该论文试图解决3D点云实例分割(instance segmentation)中缺乏有效融合多模态信息与弱监督学习的问题,其解决方案的关键在于提出SegVec3D框架,该框架通过整合注意力机制、嵌入学习和跨模态对齐,构建层次化特征提取器以增强几何结构建模,并利用对比聚类实现无监督实例分割,同时在共享语义空间中对齐3D数据与自然语言查询,从而支持零样本检索。
链接: https://arxiv.org/abs/2507.09459
作者: Zhihan Kang,Boyu Wang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Undergraduate Theis; 12 pages, 6 figures
Abstract:We propose SegVec3D, a novel framework for 3D point cloud instance segmentation that integrates attention mechanisms, embedding learning, and cross-modal alignment. The approach builds a hierarchical feature extractor to enhance geometric structure modeling and enables unsupervised instance segmentation via contrastive clustering. It further aligns 3D data with natural language queries in a shared semantic space, supporting zero-shot retrieval. Compared to recent methods like Mask3D and ULIP, our method uniquely unifies instance segmentation and multimodal understanding with minimal supervision and practical deployability.
zh
[CV-112] RACER: Efficient Object Re-Identification in Networked Cameras through Adaptive Query Processing
【速读】:该论文旨在解决在大规模摄像头网络中高效进行跨摄像头重识别(Re-ID)查询的问题,特别是针对现有系统Spatula在处理大规模网络时因局部摄像头历史导致的准确性不足以及缺乏自适应查询处理机制的问题。其解决方案的关键在于提出Tracer,一个基于自适应查询处理框架的新型视频数据库管理系统(VDBMS),通过训练循环神经网络来建模长期历史相关性,以选择每个时间步最优的摄像头进行处理,并结合概率自适应搜索模型,在高召回率约束下通过增量搜索窗口和探索-利用策略动态更新采样概率,从而提升查询效率与准确性。
链接: https://arxiv.org/abs/2507.09448
作者: Pramod Chunduri,Yao Lu,Joy Arulraj
机构: Georgia Institute of Technology (佐治亚理工学院); National University of Singapore (新加坡国立大学)
类目: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Efficiently re-identifying and tracking objects across a network of cameras is crucial for applications like traffic surveillance. Spatula is the state-of-the-art video database management system (VDBMS) for processing Re-ID queries. However, it suffers from two limitations. Its spatio-temporal filtering scheme has limited accuracy on large camera networks due to localized camera history. It is not suitable for critical video analytics applications that require high recall due to a lack of support for adaptive query processing. In this paper, we present Tracer, a novel VDBMS for efficiently processing Re-ID queries using an adaptive query processing framework. Tracer selects the optimal camera to process at each time step by training a recurrent network to model long-term historical correlations. To accelerate queries under a high recall constraint, Tracer incorporates a probabilistic adaptive search model that processes camera feeds in incremental search windows and dynamically updates the sampling probabilities using an exploration-exploitation strategy. To address the paucity of benchmarks for the Re-ID task due to privacy concerns, we present a novel synthetic benchmark for generating multi-camera Re-ID datasets based on real-world traffic distribution. Our evaluation shows that Tracer outperforms the state-of-the-art cross-camera analytics system by 3.9x on average across diverse datasets. Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.09448 [cs.DB] (or arXiv:2507.09448v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2507.09448 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-113] Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions ICCV2025
【速读】:该论文试图解决3D多人员运动预测中的复杂问题,主要挑战在于对个体过去运动的依赖以及代理之间的交互建模,而有效建模这些交互通常会带来高昂的计算成本。解决方案的关键在于设计一种计算高效的模型,通过简化空间和时间交互来降低复杂度。具体而言,该方法采用轻量级的双分支结构分别学习个体和多人的局部与全局表示,并引入一种新的跨层级交互块以整合来自两个分支的空间和时间表示,同时显式地引入人与人之间的空间距离嵌入来增强交互建模效果。
链接: https://arxiv.org/abs/2507.09446
作者: Yuanhong Zheng,Ruixuan Yu,Jian Sun
机构: Shandong University (山东大学); Xi’an Jiaotong University (西安交通大学); Peking University (北京大学); Pazhou Laboratory (琶洲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate the spatial inter-person distance embedding. With above efficient temporal and spatial design, we achieve state-of-the-art performance for multiple metrics on standard datasets of CMU-Mocap, MuPoTS-3D, and 3DPW, while significantly reducing the computational cost. Code is available at this https URL.
zh
[CV-114] RectifiedHR: High-Resolution Diffusion via Energy Profiling and Adaptive Guidance Scheduling
【速读】:该论文试图解决扩散模型在高分辨率图像合成中面临的能量不稳定性和引导伪影问题,这些问题会降低视觉质量。解决方案的关键在于分析采样过程中的潜在能量景观,并提出自适应无分类器引导(CFG)调度策略,通过随时间调节引导强度来保持稳定的能量轨迹,从而实现更高的稳定性分数和一致性指标。
链接: https://arxiv.org/abs/2507.09441
作者: Ankit Sanjyal
机构: Fordham University (福特汉姆大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Pages, 10 Figures, Pre-Print Version, Code Available at: this https URL
Abstract:High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior.
zh
[CV-115] Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data
【速读】:该论文旨在解决自主空间飞行应用中天体表面地形特征的检测与跟踪问题,特别是针对传统基于光度测定法的流程在计算资源受限的抗辐射系统上存在的处理速度慢、泛化能力差及依赖大量先验成像数据等局限性。其解决方案的关键在于提出了一种基于实时执行的轻量级神经网络架构的在位地标跟踪方法,通过改进的领域自适应方法实现利用低成本获取的训练数据识别天体地形特征,并引入一种新颖的注意力对齐公式以学习在显著视角变化下仍能保持对应关系的鲁棒特征表示,从而构建了一个性能优于现有最先进技术的统一系统。
链接: https://arxiv.org/abs/2507.09420
作者: Timothy Chase Jr,Karthik Dantu
机构: University at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Presented at the RSS Space Robotics Workshop 2025. Poster available online at this https URL
Abstract:The detection and tracking of celestial surface terrain features are crucial for autonomous spaceflight applications, including Terrain Relative Navigation (TRN), Entry, Descent, and Landing (EDL), hazard analysis, and scientific data collection. Traditional photoclinometry-based pipelines often rely on extensive a priori imaging and offline processing, constrained by the computational limitations of radiation-hardened systems. While historically effective, these approaches typically increase mission costs and duration, operate at low processing rates, and have limited generalization. Recently, learning-based computer vision has gained popularity to enhance spacecraft autonomy and overcome these limitations. While promising, emerging techniques frequently impose computational demands exceeding the capabilities of typical spacecraft hardware for real-time operation and are further challenged by the scarcity of labeled training data for diverse extraterrestrial environments. In this work, we present novel formulations for in-situ landmark tracking via detection and description. We utilize lightweight, computationally efficient neural network architectures designed for real-time execution on current-generation spacecraft flight processors. For landmark detection, we propose improved domain adaptation methods that enable the identification of celestial terrain features with distinct, cheaply acquired training data. Concurrently, for landmark description, we introduce a novel attention alignment formulation that learns robust feature representations that maintain correspondence despite significant landmark viewpoint variations. Together, these contributions form a unified system for landmark tracking that demonstrates superior performance compared to existing state-of-the-art techniques.
zh
[CV-116] GreenCrossingAI: A Camera Trap/Computer Vision Pipeline for Environmental Science Research Groups
【速读】:该论文试图解决野生动物研究中相机陷阱(Camera Trap)数据处理与管理的挑战,尤其是在资源有限的小型研究团队中应用机器学习/人工智能(ML/AI)工具的问题。解决方案的关键在于构建一个低资源的本地处理流程,该流程集成了针对小型研究团队定制的ML/AI能力,提供了可访问的数据传输、推理和评估方法,从而帮助研究人员从不断增长的相机陷阱数据集中提取有意义的见解。
链接: https://arxiv.org/abs/2507.09410
作者: Bernie Boscoe,Shawn Johnson,Andrea Osborn,Chandler Campbell,Karen Mager
机构: Southern Oregon University (南方俄勒冈大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This is the preprint version of the paper in Practice and Experience in Advanced Research Computing, PEARC25
Abstract:Camera traps have long been used by wildlife researchers to monitor and study animal behavior, population dynamics, habitat use, and species diversity in a non-invasive and efficient manner. While data collection from the field has increased with new tools and capabilities, methods to develop, process, and manage the data, especially the adoption of ML/AI tools, remain challenging. These challenges include the sheer volume of data generated, the need for accurate labeling and annotation, variability in environmental conditions affecting data quality, and the integration of ML/AI tools into existing workflows that often require domain-specific customization and computational resources. This paper provides a guide to a low-resource pipeline to process camera trap data on-premise, incorporating ML/AI capabilities tailored for small research groups with limited resources and computational expertise. By focusing on practical solutions, the pipeline offers accessible approaches for data transmission, inference, and evaluation, enabling researchers to discover meaningful insights from their ever-increasing camera trap datasets.
zh
[CV-117] Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision Agriculture
【速读】:该论文试图解决农作物疾病检测效率低、准确性不足的问题,特别是在大规模农业中,早期识别常因延迟或错误而影响农业生产与全球粮食安全。其解决方案的关键在于构建一个基于卷积神经网络(CNN)的图像分类系统,通过端到端的深度学习流程实现对八种常见作物疾病的自动化检测与分类,系统包含图像预处理、模型训练及病害治疗建议模块,并部署于开源移动平台以支持偏远地区农民进行实时诊断。
链接: https://arxiv.org/abs/2507.09375
作者: Sourish Suri(University of California, San Diego),Yifei Shao(University of Pennsylvania, Philadelphia)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 10 figures, 1 table. Code available at: this https URL
Abstract:Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases using leaf imagery. The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras’ Sequential API. The CNN architecture comprises three convolutional layers with increasing filter sizes and ReLU activations, followed by max pooling, flattening, and fully connected layers, concluding with a softmax output for multi-class classification. The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. Notably, the model integrates a treatment recommendation module, providing actionable guidance by mapping each detected disease to suitable pesticide or fungicide interventions. Furthermore, the solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas. This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. By merging deep learning with practical agronomic support, this work underscores the potential of CNNs to transform crop health monitoring and enhance food production resilience on a global scale.
zh
[CV-118] Simplifying Traffic Anomaly Detection with Video Foundation Models ICCV
【速读】:该论文旨在解决自中心交通异常检测(ego-centric Traffic Anomaly Detection, TAD)中传统方法依赖复杂多阶段或多表示融合架构的问题,探索是否可以通过简化架构实现高效且有效的TAD。其解决方案的关键在于采用简单的仅编码器结构的Video Vision Transformers (Video ViTs),并通过强大的预训练策略提升模型性能。研究发现,强预训练能够使简单编码器模型达到甚至超越现有先进方法的性能,同时具备更高的效率;其中,自监督的Masked Video Modeling (MVM)在TAD任务中表现最优,而无需异常样本的域适应预训练(Domain-Adaptive Pre-Training, DAPT)进一步提升了下游任务性能。
链接: https://arxiv.org/abs/2507.09338
作者: Svetlana Orlova,Tommie Kerssies,Brunó B. Englert,Gijs Dubbelman
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCVW 2025 accepted. Code: this https URL
Abstract:Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: this https URL.
zh
[CV-119] Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding ACM-MM2025
【速读】:该论文旨在解决3D多模态大语言模型(3D Multi-modal Large Language Models, MLLMs)在实际部署中面临的计算效率低下问题,其核心挑战在于处理过多的以物体为中心的视觉标记(object-centric visual tokens)所带来的计算负担。论文提出的关键解决方案是Fast3D,其核心技术创新包括:(1)全局注意力预测(Global Attention Prediction, GAP),通过轻量级神经网络学习目标模型的全局注意力分布,从而实现高效的标记重要性估计;(2)样本自适应视觉标记剪枝(Sample-Adaptive visual token Pruning, SAP),通过基于注意力的复杂度评估引入动态标记预算,根据输入特性自动调整各层的剪枝比例。这两项技术均无需修改目标模型的参数。
链接: https://arxiv.org/abs/2507.09334
作者: Wencan Huang,Daizong Liu,Wei Hu
机构: Wangxuan Institute of Computer Technology, Peking University(王选计算机技术研究所,北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM 2025
Abstract:While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at this https URL
zh
[CV-120] Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition
【速读】:该论文试图解决音频-视频预训练中对相似类别区分能力不足的问题,现有方法仅关注整体模态对齐,而未考虑通过认知诱导和对比学习增强易混淆类别的区分能力。其解决方案的关键是提出动态跨类别混淆感知编码器(DICCAE),该编码器通过动态调整跨类别混淆程度的混淆损失,实现细粒度、类别级别的音频-视频表示对齐,从而提升模型对相似活动的区分能力。
链接: https://arxiv.org/abs/2507.09323
作者: Kaixuan Cong,Yifan Wang,Rongkun Xue,Yuyang Jiang,Yiming Feng,Jing Yang
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms only focus on the alignment of the overall audio-video modalities, without considering the reinforcement of distinguishing easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained, category-level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model’s ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion. To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.
zh
[CV-121] ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
【速读】:该论文试图解决多模态对话系统中主动交互能力的评估问题,特别是在视频播放过程中实时自主决定多轮响应时机的需求。解决方案的关键在于引入ProactiveBench,这是首个全面评估系统主动交互能力的基准,并提出PAUC(Proactive Area Under the Curve),这是首个考虑模型响应时间动态性的评估指标,从而更准确地衡量主动交互场景下的用户体验。
链接: https://arxiv.org/abs/2507.09313
作者: Yueqian Wang,Xiaojun Meng,Yifan Wang,Huishuai Zhang,Dongyan Zhao
机构: Peking University (北京大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Science and Technology Beijing (北京科技大学); National Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to be more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveBench, the first comprehensive benchmark to evaluate a system’s ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveBench and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: this https URL
zh
[CV-122] AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning
【速读】:该论文试图解决透明或分层内容(RGBA图像)生成在大规模基准数据集缺失背景下的研究不足问题。其关键解决方案是提出ALPHA,首个针对四通道图像的综合性RGBA基准,通过在标准RGB指标中引入alpha blending来适应 RGBA 图像;同时引入ALPHAVAE,一种统一的端到端 RGBA 变分自编码器(VAE),通过扩展预训练的RGB VAE并加入专用的alpha通道,结合alpha-blended像素重建、块级保真度、感知一致性以及双KL散度约束,确保RGB和alpha表示之间的潜在保真度。该模型仅使用8K图像进行训练,相较于之前方法使用的1M图像,在PSNR和SSIM指标上分别提升了4.9 dB和3.2%。
链接: https://arxiv.org/abs/2507.09308
作者: Zile Wang,Hao Yu,Jiabo Zhan,Chun Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA image) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha representations. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction. It also enables superior transparent image generation when fine-tuned within a latent diffusion framework. Our code, data, and models are released on this https URL for reproducibility.
zh
[CV-123] DAA*: Deep Angular A Star for Image-based Path Planning ICCV
【速读】:该论文旨在解决路径模仿学习中路径平滑性常被忽视的问题,通过提升预测路径与参考路径之间的相似性来改善学习效果。其解决方案的关键在于引入了路径角度自由度(PAF),将这一概念整合到A算法中,以实现自适应的路径平滑性优化。通过联合优化路径缩短和路径平滑,DAA算法在保持路径最优性的同时提高了路径相似性,从而在多个数据集上表现出显著优于现有方法的性能。
链接: https://arxiv.org/abs/2507.09305
作者: Zhiwei Xu
机构: The University of Melbourne (墨尔本大学); The Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: International Conference on Computer Vision (ICCV), 2025
Abstract:Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Throughout comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.7% SPR, 6.5% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable.
zh
[CV-124] ViT-ProtoNet for Few-Shot Image Classification: A Multi-Benchmark Evaluation
【速读】:该论文旨在解决Vision Transformers (ViTs)在少样本图像分类任务中表征能力未被充分利用的问题。其解决方案的关键在于将ViT-Small作为主干网络集成到原型网络(Prototypical Network)框架中,通过平均少量支持样本的类别条件token嵌入来构建鲁棒的原型,从而在5-shot设置下实现对新类别的良好泛化能力。
链接: https://arxiv.org/abs/2507.09299
作者: Abdulvahap Mutlu,Şengül Doğan,Türker Tuncer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: All codes are available at this https URL
Abstract:The remarkable representational power of Vision Transformers (ViTs) remains underutilized in few-shot image classification. In this work, we introduce ViT-ProtoNet, which integrates a ViT-Small backbone into the Prototypical Network framework. By averaging class conditional token embeddings from a handful of support examples, ViT-ProtoNet constructs robust prototypes that generalize to novel categories under 5-shot settings. We conduct an extensive empirical evaluation on four standard benchmarks: Mini-ImageNet, FC100, CUB-200, and CIFAR-FS, including overlapped support variants to assess robustness. Across all splits, ViT-ProtoNet consistently outperforms CNN-based prototypical counterparts, achieving up to a 3.2% improvement in 5-shot accuracy and demonstrating superior feature separability in latent space. Furthermore, it outperforms or is competitive with transformer-based competitors using a more lightweight backbone. Comprehensive ablations examine the impact of transformer depth, patch size, and fine-tuning strategy. To foster reproducibility, we release code and pretrained weights. Our results establish ViT-ProtoNet as a powerful, flexible approach for few-shot classification and set a new baseline for transformer-based meta-learners.
zh
[CV-125] Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection
【速读】:该论文旨在解决微创手术中内镜黏膜下剥离术(Endoscopic Submucosal Dissection, ESD)的手术阶段识别问题,该问题因不同阶段之间视觉相似性高且RGB图像缺乏结构线索而具有挑战性。解决方案的关键在于引入深度信息以提供几何线索,从而补充外观特征,并通过Geo-RepNet框架融合RGB图像与深度信息。该框架基于可重参数化的RepVGG主干网络,包含Depth-Guided Geometric Prior Generation (DGPG)模块和Geometry-Enhanced Multi-scale Attention (GEMA)模块,分别用于从原始深度图中提取几何先验和通过几何感知的跨注意力机制及高效的多尺度聚合注入空间引导。
链接: https://arxiv.org/abs/2507.09294
作者: Rui Tang,Haochen Yin,Guankun Wang,Long Bai,An Wang,Huxin Gao,Jiazheng Wang,Hongliang Ren
机构: The Chinese University of Hong Kong (中国香港中文大学); Huawei Technologies Co. Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: IEEE ICIA 2025
Abstract:Surgical phase recognition plays a critical role in developing intelligent assistance systems for minimally invasive procedures such as Endoscopic Submucosal Dissection (ESD). However, the high visual similarity across different phases and the lack of structural cues in RGB images pose significant challenges. Depth information offers valuable geometric cues that can complement appearance features by providing insights into spatial relationships and anatomical structures. In this paper, we pioneer the use of depth information for surgical phase recognition and propose Geo-RepNet, a geometry-aware convolutional framework that integrates RGB image and depth information to enhance recognition performance in complex surgical scenes. Built upon a re-parameterizable RepVGG backbone, Geo-RepNet incorporates the Depth-Guided Geometric Prior Generation (DGPG) module that extracts geometry priors from raw depth maps, and the Geometry-Enhanced Multi-scale Attention (GEMA) to inject spatial guidance through geometry-aware cross-attention and efficient multi-scale aggregation. To evaluate the effectiveness of our approach, we construct a nine-phase ESD dataset with dense frame-level annotations from real-world ESD videos. Extensive experiments on the proposed dataset demonstrate that Geo-RepNet achieves state-of-the-art performance while maintaining robustness and high computational efficiency under complex and low-texture surgical environments.
zh
[CV-126] Supercharging Floorplan Localization with Semantic Rays ICCV2025
【速读】:该论文试图解决传统楼层平面定位技术仅依赖深度结构线索而忽略楼层平面中丰富语义信息的问题。其解决方案的关键在于提出一种语义感知的定位框架,该框架联合估计深度和语义光线,并通过融合两者来预测结构-语义概率体积。该概率体积采用自粗到精的构建方式,先通过少量光线采样获得初始低分辨率概率体积,再在高概率区域进行更密集采样以细化概率,最终用于预测2D位置和方向角。
链接: https://arxiv.org/abs/2507.09291
作者: Yuval Grader,Hadar Averbuch-Elor
机构: Tel Aviv University (特拉维夫大学); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICCV 2025
Abstract:Floorplans provide a compact representation of the building’s structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we show that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency.
zh
[CV-127] Generative Latent Kernel Modeling for Blind Motion Deblurring
【速读】:该论文试图解决盲运动模糊(Blind Motion Deblurring, BMD)中由于优化过程的高非凸性导致对初始模糊核极度敏感的问题。解决方案的关键在于提出一种新颖框架,利用深度生成模型编码模糊核先验,并诱导出更优的初始模糊核。具体而言,通过预训练一个基于生成对抗网络(GAN)的核生成器来准确表征核的先验分布,以及一个核初始化器为核估计提供高质量的起始点,从而将BMD解约束在紧凑的潜在核流形内,降低对初始核的敏感性。
链接: https://arxiv.org/abs/2507.09285
作者: Chenhao Ding,Jiangtao Zhang,Zongsheng Yue,Hui Wang,Qian Zhao,Deyu Meng
机构: Xi’an Jiaotong University(西安交通大学); Pengcheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep prior-based approaches have demonstrated remarkable success in blind motion deblurring (BMD) recently. These methods, however, are often limited by the high non-convexity of the underlying optimization process in BMD, which leads to extreme sensitivity to the initial blur kernel. To address this issue, we propose a novel framework for BMD that leverages a deep generative model to encode the kernel prior and induce a better initialization for the blur kernel. Specifically, we pre-train a kernel generator based on a generative adversarial network (GAN) to aptly characterize the kernel’s prior distribution, as well as a kernel initializer to provide a well-informed and high-quality starting point for kernel estimation. By combining these two components, we constrain the BMD solution within a compact latent kernel manifold, thus alleviating the aforementioned sensitivity for kernel initialization. Notably, the kernel generator and initializer are designed to be easily integrated with existing BMD methods in a plug-and-play manner, enhancing their overall performance. Furthermore, we extend our approach to tackle blind non-uniform motion deblurring without the need for additional priors, achieving state-of-the-art performance on challenging benchmark datasets. The source code is available at this https URL.
zh
[CV-128] Cross Knowledge Distillation between Artificial and Spiking Neural Networks ICME2025
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在事件驱动数据格式DVS数据上的性能不足问题,这主要是由于标注事件数据集有限和SNN架构不成熟所致。其解决方案的关键在于提出跨知识蒸馏(Cross Knowledge Distillation, CKD),该方法通过利用语义相似性和滑动替换来缓解跨模态挑战,并采用间接分阶段知识蒸馏来应对跨架构挑战。
链接: https://arxiv.org/abs/2507.09269
作者: Shuhan Ye,Yuanbin Qian,Chong Wang,Sunqi Lin,Jiazhen Xu,Jiangbo Qian,Yuqi Li
机构: Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University, Ningbo, China; Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ICME2025
Abstract:Recently, Spiking Neural Networks (SNNs) have demonstrated rich potential in computer vision domain due to their high biological plausibility, event-driven characteristic and energy-saving efficiency. Still, limited annotated event-based datasets and immature SNN architectures result in their performance inferior to that of Artificial Neural Networks (ANNs). To enhance the performance of SNNs on their optimal data format, DVS data, we explore using RGB data and well-performing ANNs to implement knowledge distillation. In this case, solving cross-modality and cross-architecture challenges is necessary. In this paper, we propose cross knowledge distillation (CKD), which not only leverages semantic similarity and sliding replacement to mitigate the cross-modality challenge, but also uses an indirect phased knowledge distillation to mitigate the cross-architecture challenge. We validated our method on main-stream neuromorphic datasets, including N-Caltech101 and CEP-DVS. The experimental results show that our method outperforms current State-of-the-Art methods. The code will be available at this https URL
zh
[CV-129] SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation ICCV
【速读】:该论文旨在解决无词干符号(gloss-free)手语翻译(SLT)中模型复杂度高、计算需求大导致的可扩展性问题。其关键解决方案是提出一种基于段落感知的视觉标记化框架,通过手语分割将连续视频转换为离散的、受手语信息指导的视觉标记,从而将输入序列长度减少多达50%,显著降低内存使用并提升在大规模数据集上的可扩展性。此外,通过引入token-to-token对比对齐目标和双级监督机制,实现了视觉与语言模态之间的细粒度跨模态对齐,无需依赖词干级监督。
链接: https://arxiv.org/abs/2507.09266
作者: JianHe Low,Ozge Mercanoglu Sincan,Richard Bowden
机构: CVSSP, University of Surrey, United Kingdom
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in International Conference on Computer Vision (ICCV) Workshops
Abstract:Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performances without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence-lengths, validating the potential of our tokenization and alignment strategies.
zh
[CV-130] Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching
【速读】:该论文试图解决图像-文本匹配任务中由于高阶关联性和语义模糊性导致的匹配不确定性问题,特别是软正样本(语义相似但标签错误)与软负样本(局部匹配但全局不一致)之间的区分困难。解决方案的关键在于提出了一种基于模糊感知和高阶关系学习的框架(AAHR),通过动态聚类原型对比学习构建统一表征空间,引入全局与局部特征提取机制及自适应聚合网络以增强细粒度语义理解能力,并利用图神经网络(GNN)和动量对比学习来加强实例间的语义交互与负样本扩展,从而显著提升模型的特征区分能力。
链接: https://arxiv.org/abs/2507.09256
作者: Junyu Chen,Yihua Gao,Mingyuan Ge,Mingyong Li
机构: College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China; School of Big Data and Software Engineering, Chongqing University, Chongqing 401331, China
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted by the Knowledge-Based Systems(KBS), 2025
Abstract:Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences between soft positive samples (semantically similar but incorrectly labeled) and soft negative samples (locally matched but globally inconsistent), creating matching uncertainties. Furthermore, current methods fail to fully utilize the neighborhood relationships among semantically similar instances within training batches, limiting the model’s ability to learn high-order shared knowledge. This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. AAHR constructs a unified representation space through dynamic clustering prototype contrastive learning, effectively mitigating the soft positive sample problem. The framework introduces global and local feature extraction mechanisms and an adaptive aggregation network, significantly enhancing full-grained semantic understanding capabilities. Additionally, AAHR employs intra-modal and inter-modal correlation matrices to investigate neighborhood relationships among sample instances thoroughly. It incorporates GNN to enhance semantic interactions between instances. Furthermore, AAHR integrates momentum contrastive learning to expand the negative sample set. These combined strategies significantly improve the model’s ability to discriminate between features. Experimental results demonstrate that AAHR outperforms existing state-of-the-art methods on Flickr30K, MSCOCO, and ECCV Caption datasets, considerably improving the accuracy and efficiency of image-text matching. The code and model checkpoints for this research are available at this https URL .
zh
[CV-131] AGCD-Net: Attention Guided Context Debiasing Network for Emotion Recognition
【速读】:该论文旨在解决上下文感知情绪识别(CAER)中因上下文偏差导致的虚假相关性问题,即背景上下文与情绪标签之间的非因果关联(例如将“花园”与“快乐”联系起来)。其解决方案的关键在于提出AGCD-Net模型,该模型引入了Hybrid ConvNeXt卷积编码器,通过集成空间变换网络和挤压-激励模块以增强特征重新校准。AGCD-Net的核心是注意力引导的因果干预模块(AG-CIM),该模块应用因果理论对上下文特征进行扰动,隔离虚假相关性,并通过面部特征引导的注意力机制进行修正,从而减轻上下文偏差。
链接: https://arxiv.org/abs/2507.09248
作者: Varsha Devi,Amine Bohi,Pardeep Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 Pages, 4 figures, 2 tables ICIAP 2025
Abstract:Context-aware emotion recognition (CAER) enhances affective computing in real-world scenarios, but traditional methods often suffer from context bias-spurious correlation between background context and emotion labels (e.g. associating garden'' with
happy’'). In this paper, we propose \textbfAGCD-Net, an Attention Guided Context Debiasing model that introduces \textitHybrid ConvNeXt, a novel convolutional encoder that extends the ConvNeXt backbone by integrating Spatial Transformer Network and Squeeze-and-Excitation layers for enhanced feature recalibration. At the core of AGCD-Net is the Attention Guided - Causal Intervention Module (AG-CIM), which applies causal theory, perturbs context features, isolates spurious correlations, and performs an attention-driven correction guided by face features to mitigate context bias. Experimental results on the CAER-S dataset demonstrate the effectiveness of AGCD-Net, achieving state-of-the-art performance and highlighting the importance of causal debiasing for robust emotion recognition in complex settings.
zh
[CV-132] PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process
【速读】:该论文试图解决现有艺术图像评估方法仅关注静态最终图像,而忽视了绘画过程的动态性和多阶段特性的问题。其解决方案的关键在于提出了一种人类对齐的绘画过程评估框架,包括首个大规模的绘画过程评估数据集(PPAD)以及基于Transformer的PPJudge模型,该模型通过时间感知的位置编码和异质专家混合架构实现对绘画过程的有效评估。
链接: https://arxiv.org/abs/2507.09242
作者: Shiqi Jiang,Xinpeng Li,Xi Mao,Changbo Wang,Chenhui Li
机构: East China Normal University(华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM International Conference on Multimedia 2025
Abstract:Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.
zh
[CV-133] EgoAnimate: Generating Human Animations from Egocentric top-down Views
【速读】:该论文旨在解决从第一人称视角(egocentric view)图像中重建可动画化虚拟形象(avatar)的问题,以实现更真实和通用的数字远程存在体验。现有方法在利用第一人称视角时面临遮挡和身体比例失真的挑战,且多数依赖多视角数据集进行训练。该研究的关键在于引入基于生成先验(generative prior)的方法,利用Stable Diffusion框架降低训练负担并提升泛化能力,同时通过ControlNet生成真实正面视图,从而将单张顶部俯视的第一人称图像转换为可用于动作生成的正面表示。
链接: https://arxiv.org/abs/2507.09230
作者: G. Kutay Türkoglu,Julian Tanke,Iheb Belgacem,Lev Markhasin
机构: Sony Semiconductor Solutions Europe(索尼半导体解决方案欧洲); Sony AI(索尼人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
Abstract:An ideal digital telepresence experience requires accurate replication of a person’s body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which enables the use of a portable and cost-effective device without front-view cameras. However, this viewpoint introduces challenges such as occlusions and distorted body proportions. There are few works reconstructing human appearance from egocentric views, and none use a generative prior-based approach. Some methods create avatars from a single egocentric image during inference, but still rely on multi-view datasets during training. To our knowledge, this is the first study using a generative backbone to reconstruct animatable avatars from egocentric inputs. Based on Stable Diffusion, our method reduces training burden and improves generalizability. Inspired by methods such as SiTH and MagicMan, which perform 360-degree reconstruction from a frontal image, we introduce a pipeline that generates realistic frontal views from occluded top-down images using ControlNet and a Stable Diffusion backbone. Our goal is to convert a single top-down egocentric image into a realistic frontal representation and feed it into an image-to-motion model. This enables generation of avatar motions from minimal input, paving the way for more accessible and generalizable telepresence systems. Comments: 10 pages, 5 figures Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.09230 [cs.CV] (or arXiv:2507.09230v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.09230 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gürbüz Kutay Türkoglu [view email] [v1] Sat, 12 Jul 2025 09:59:31 UTC (4,236 KB)
zh
[CV-134] Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift
【速读】:该论文旨在解决基础模型(如CLIP和SAM)在部署过程中面临的两个关键问题:训练数据与测试数据之间的分布偏移(distribution shift)以及置信度不匹配(confidence misalignment),后者会导致模型产生过于自信的错误预测。解决方案的关键在于提出一种统一框架StaRFM,其核心包括两个部分:一是通过Fisher信息惩罚(FIP)来减少嵌入空间中的协变量偏移,该方法通过基于块的正则化扩展至3D医学数据;二是引入置信度不匹配惩罚(CMP),针对体素级预测进行重新设计,以校准分割任务中的不确定性。理论分析表明,FIP通过Fisher-Rao范数控制泛化能力,而CMP通过Brier分数优化最小化校准误差。
链接: https://arxiv.org/abs/2507.09222
作者: Behraj Khan,Tahir Syed
机构: School of Mathematics and Computer Science, Institute of Business Administration Karachi, Pakistan; National University of Computer and Emerging Sciences, Karachi, Pakistan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Foundation models like CLIP and SAM have transformed computer vision and medical imaging via low-shot transfer learning. However, deployment of these models hindered by two key challenges: \textitdistribution shift between training and test data, and \textitconfidence misalignment that leads to overconfident incorrect predictions. These issues manifest differently in vision-language classification and medical segmentation tasks, yet existing solutions remain domain-specific. We propose \textitStaRFM, a unified framework addressing both challenges. It introduces a Fisher information penalty (FIP), extended to 3D medical data via patch-wise regularization, to reduce covariate shift in CLIP and SAM embeddings. Additionally, a confidence misalignment penalty (CMP), reformulated for voxel-level predictions, calibrates uncertainty in segmentation tasks. We theoretically derive PAC-Bayes bounds showing FIP controls generalization via the Fisher-Rao norm, while CMP minimizes calibration error through Brier score optimization. StaRFM shows consistent performance like \texttt+3.5% accuracy and 28% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), 84.7% DSC and 4.8mm HD95 in medical segmentation (e.g., BraTS, ATLAS), and 40% lower cross-domain performance gap compared to prior benchmarking methods. The framework is plug-and-play, requiring minimal architectural changes for seamless integration with foundation models. Code and models will be released at this https URL
zh
[CV-135] Online Long-term Point Tracking in the Foundation Model Era
【速读】:该论文试图解决在在线设置下进行长期点跟踪的问题,即模型必须仅使用当前和过去帧进行因果预测,而无法访问未来帧或滑动窗口。解决方案的关键在于引入Track-On,这是一种基于Transformer的模型,通过将每个跟踪点作为查询,并逐帧处理视频帧,从而在因果条件下维持时间上的连贯性,利用记忆机制传播外观和上下文信息。该方法在七个公开基准上达到了新的最先进水平,证明了在无未来信息访问情况下实现长期跟踪的可行性。
链接: https://arxiv.org/abs/2507.09217
作者: Görkay Aydemir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2501.18487
Abstract:Point tracking aims to identify the same physical point across video frames and serves as a geometry-aware representation of motion. This representation supports a wide range of applications, from robotics to augmented reality, by enabling accurate modeling of dynamic environments. Most existing long-term tracking approaches operate in an offline setting, where future frames are available to refine predictions and recover from occlusions. However, real-world scenarios often demand online predictions: the model must operate causally, using only current and past frames. This constraint is critical in streaming video and embodied AI, where decisions must be made immediately based on past observations. Under such constraints, viewpoint invariance becomes essential. Visual foundation models, trained on diverse large-scale datasets, offer the potential for robust geometric representations. While they lack temporal reasoning on their own, they can be integrated into tracking pipelines to enrich spatial features. In this thesis, we address the problem of long-term point tracking in an online setting, where frames are processed sequentially without access to future information or sliding windows. We begin by evaluating the suitability of visual foundation models for this task and find that they can serve as useful initializations and be integrated into tracking pipelines. However, to enable long-term tracking in an online setting, a dedicated design is still required. In particular, maintaining coherence over time in this causal regime requires memory to propagate appearance and context across frames. To address this, we introduce Track-On, a transformer-based model that treats each tracked point as a query and processes video frames one at a time. Track-On sets a new state of the art across seven public benchmarks, demonstrating the feasibility of long-term tracking without future access.
zh
[CV-136] 360-Degree Full-view Image Segmentation by Spherical Convolution compatible with Large-scale Planar Pre-trained Models ICME
【速读】:该论文试图解决全景图像任务中由于缺乏大规模数据集而依赖二维预训练图像基准模型所带来的性能下降问题,特别是这些模型无法有效识别全景图像中的畸变和不连续性。解决方案的关键在于提出一种新颖的球面采样方法,该方法基于预训练模型的权重进行球面离散采样,从而有效减轻畸变并获得良好的初始训练值,使得现有二维预训练模型能够直接用于全景图像处理任务。
链接: https://arxiv.org/abs/2507.09216
作者: Jingguo Liu,Han Yu,Shigang Li,Jianfeng Li
机构: Southwest University (西南大学); Hiroshima City University (广岛市立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accecpted by ICMEW 2025
Abstract:Due to the current lack of large-scale datasets at the million-scale level, tasks involving panoramic images predominantly rely on existing two-dimensional pre-trained image benchmark models as backbone networks. However, these networks are not equipped to recognize the distortions and discontinuities inherent in panoramic images, which adversely affects their performance in such tasks. In this paper, we introduce a novel spherical sampling method for panoramic images that enables the direct utilization of existing pre-trained models developed for two-dimensional images. Our method employs spherical discrete sampling based on the weights of the pre-trained models, effectively mitigating distortions while achieving favorable initial training values. Additionally, we apply the proposed sampling method to panoramic image segmentation, utilizing features obtained from the spherical model as masks for specific channel attentions, which yields commendable results on commonly used indoor datasets, Stanford2D3D.
zh
[CV-137] Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline
【速读】:该论文试图解决3D检测模型在面对道路中罕见异常类别时出现的误检或漏检问题,以及模型泛化能力不足的问题。解决方案的关键在于提出一种基于立体视觉的3D异常目标检测算法(S3AD),通过解耦2D与3D的训练策略,释放模型对任意3D前景的检测泛化能力,并引入基于前景置信度预测的异常评分算法,实现目标级别的异常评分。此外,通过3D渲染方法构建了增强现实双目立体3D检测数据集KITTI-AR,以进一步验证和提升异常检测的泛化能力。
链接: https://arxiv.org/abs/2507.09214
作者: Shiyi Mu,Zichong Gu,Hanqi Lyu,Yilin Gao,Shugong Xu
机构: Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review
Abstract:3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability of 3D detection models for targets of arbitrary shapes and to possess the capability to filter out anomalies. The generalization of 3D detection is limited by two factors: the coupled training of 2D and 3D, and the insufficient diversity in the scale distribution of training samples. This paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to release the generalization ability for arbitrary 3D foreground detection, and proposes an anomaly scoring algorithm based on foreground confidence prediction, achieving target-level anomaly scoring. In order to further verify and enhance the generalization of anomaly detection, we use a 3D rendering method to synthesize two augmented reality binocular stereo 3D detection datasets which named KITTI-AR. KITTI-AR extends upon KITTI by adding 97 new categories, totaling 6k pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories as extra training data to address the sparse sample distribution issue. Additionally, 58 rare categories form the KITTI-AR-OoD subset, which are not used in training to simulate zero-shot scenarios in real-world settings, solely for evaluating 3D anomaly detection. Finally, the performance of the algorithm and the dataset is verified in the experiments. (Code and dataset can be obtained at this https URL).
zh
[CV-138] Warm Starts Accelerate Generative Modelling
【速读】:该论文试图解决迭代生成模型(如扩散模型和流匹配模型)在生成过程中计算效率低的问题,这些模型通常需要数百次函数评估才能生成高质量样本。解决方案的关键在于引入“热启动模型”,该模型通过提供一个更优的初始分布来加速条件生成过程,具体而言,它预测一个基于输入上下文条件化的高斯先验分布N(μ, σ),而非传统的无信息先验N(0, I),从而显著减少生成过程所需的迭代次数。
链接: https://arxiv.org/abs/2507.09212
作者: Jonas Scholz,Richard E. Turner
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 10 pages, 6 figures
Abstract:Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This “warm start” substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at this https URL.
zh
[CV-139] Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models
【速读】:该论文旨在解决当前医学视觉语言模型(Medical Vision Language Model, MedVLM)在医疗应用中存在固有概率不确定性的问题,即模型可能生成错误或未经验证的响应。现有方法通过调整模型结构、使用高质量数据进行微调或偏好微调来提升性能,但这些依赖训练的策略成本高且与临床专业知识对齐不足。论文提出的解决方案关键在于一种无需额外训练的专家反馈框架——专家控制的无分类器指导(Expert-Controlled Classifier-Free Guidance, Expert-CFG),其核心是引入不确定性估计策略以识别不可靠输出,并通过无分类器指导优化token嵌入,使模型输出更准确并符合专家标注的关键术语。
链接: https://arxiv.org/abs/2507.09209
作者: Xiao Liang,Di Wang,Zhicheng Jiao,Ronghan Li,Pengfei Yang,Quan Wang,Tat-Seng Chua
机构: Xidian University (西安电子科技大学); Brown University (布朗大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancements in Vision Language Models (VLMs) have prompted the development of multi-modal medical assistant systems. Despite this progress, current models still have inherent probabilistic uncertainties, often producing erroneous or unverified responses-an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. This framework introduces an uncertainty estimation strategy to identify unreliable outputs. It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of MedVLM, ensuring that the adjusted outputs are correct and align with expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. The results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use.
zh
[CV-140] Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves ICCV2025
【速读】:该论文试图解决从材料表面波传播视频中推断其内部结构厚度和刚度的问题。解决方案的关键在于从视频中提取色散关系,并通过基于物理的优化问题求解最佳匹配的厚度和刚度参数。
链接: https://arxiv.org/abs/2507.09207
作者: Alexander C. Ogren,Berthy T. Feng,Jihoon Ahn,Katherine L. Bouman,Chiara Daraio
机构: California Institute of Technology(加州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Wave propagation on the surface of a material contains information about physical properties beneath its surface. We propose a method for inferring the thickness and stiffness of a structure from just a video of waves on its surface. Our method works by extracting a dispersion relation from the video and then solving a physics-based optimization problem to find the best-fitting thickness and stiffness parameters. We validate our method on both simulated and real data, in both cases showing strong agreement with ground-truth measurements. Our technique provides a proof-of-concept for at-home health monitoring of medically-informative tissue properties, and it is further applicable to fields such as human-computer interaction.
zh
[CV-141] HYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage
【速读】:该论文旨在解决动态场景理解中因现有方法存在碎片化表示而无法同时捕捉细粒度空间细节和长程时间依赖性的问题。其解决方案的关键在于提出一种名为THYME(Temporal Hierarchical Cyclic Scene Graph)的方法,该方法通过层次化特征聚合与循环时间优化的协同整合,有效建模多尺度空间上下文并确保帧间的时间一致性,从而生成更准确且连贯的场景图。
链接: https://arxiv.org/abs/2507.09200
作者: Trong-Thuan Nguyen,Pha Nguyen,Jackson Cothren,Alper Yilmaz,Minh-Triet Tran,Khoa Luu
机构: University of Arkansas (阿肯色大学); University of Science, VNU-HCM (胡志明市自然科学大学); Vietnam National University, Ho Chi Minh, Viet Nam (越南国家大学胡志明市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid proliferation of video in applications such as autonomous driving, surveillance, and sports analytics necessitates robust methods for dynamic scene understanding. Despite advances in static scene graph generation and early attempts at video scene graph generation, previous methods often suffer from fragmented representations, failing to capture fine-grained spatial details and long-range temporal dependencies simultaneously. To address these limitations, we introduce the Temporal Hierarchical Cyclic Scene Graph (THYME) approach, which synergistically integrates hierarchical feature aggregation with cyclic temporal refinement to address these limitations. In particular, THYME effectively models multi-scale spatial context and enforces temporal consistency across frames, yielding more accurate and coherent scene graphs. In addition, we present AeroEye-v1.0, a novel aerial video dataset enriched with five types of interactivity that overcome the constraints of existing datasets and provide a comprehensive benchmark for dynamic scene graph generation. Empirically, extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that the proposed THYME approach outperforms state-of-the-art methods, offering improved scene understanding in ground-view and aerial scenarios.
zh
[CV-142] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models ACM-MM2025
【速读】:该论文旨在解决大型多模态语言模型(Large Vision Language Models, LVLMs)中因多模态特征对齐不足而导致的幻觉问题。研究指出,旋转位置编码(Rotary Position Encoding, RoPE)的长期衰减是导致这一问题的关键因素,因为它使得指令令牌在二维空间中对图像令牌的感知存在偏差,尤其倾向于底部右区域的图像令牌。解决方案的关键在于提出MCA-LLaVA,该方法基于曼哈顿距离,将长期衰减扩展为二维多方向的空间衰减,结合图像令牌的一维序列顺序和二维空间位置进行位置建模,从而缓解图像对齐偏差,减少幻觉现象。
链接: https://arxiv.org/abs/2507.09184
作者: Qiyan Zhao,Xiaofeng Zhang,Yiheng Li,Yun Xing,Xiaosong Yuan,Feilong Tang,Sinan Fan,Xuhang Chen,Xuyao Zhang,Dahan Wang
机构: FKLPRIU, Xiamen University of Technology(福建理工研究院,厦门理工学院); Shanghai Jiao Tong University(上海交通大学); Nanyang Technological University(南洋理工大学); Jilin University(吉林大学); Monash University(莫纳什大学); Zhejiang University(浙江大学); Huizhou University(惠州学院); Chinese Academy of Sciences(中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ACM MM 2025
Abstract:Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction’s perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in this https URL.
zh
[CV-143] Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning ICCV2025
【速读】:该论文试图解决少样本类增量学习(Few-Shot Class-Incremental Learning, FSCIL)中数据稀缺与增量学习的双重挑战。研究发现,现有的基于池的提示方法在FSCIL任务中会出现性能退化现象,其根源在于令牌维度饱和:有限的数据导致过多提示竞争任务相关信息,进而引发模型过拟合。解决方案的关键在于提出LGSP-Prompt(Local-Global Spatial Prompting),该方法创新性地将基于池的提示学习从令牌维度转移到空间维度,通过结合局部空间特征与全局频域表示生成空间提示,从而突出输入图像中的关键模式,并构建两个空间提示池以实现动态提示选择,有效保持已有知识并学习新类别。
链接: https://arxiv.org/abs/2507.09183
作者: Yongwei Jiang,Yixiong Zou,Yuhua Li,Ruixuan Li
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025, 11 pages
Abstract:Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our implementation is available at this https URL.
zh
[CV-144] Learning and Transferring Better with Depth Information in Visual Reinforcement Learning
【速读】:该论文旨在解决多模态感知中的泛化能力不足问题,特别是如何有效融合RGB与深度信息以提升模型在不同场景下的鲁棒性。其解决方案的关键在于采用基于视觉Transformer的视觉主干网络,通过独立的卷积神经网络(CNN)分支处理不同模态的数据,并将融合后的特征输入可扩展的视觉Transformer以获取更具表现力的视觉表征。此外,还设计了基于掩码与非掩码标记的对比无监督学习方案,以提高强化学习过程中的样本效率,同时引入灵活的课程学习策略以增强模拟到现实(sim2real)迁移的效果。
链接: https://arxiv.org/abs/2507.09180
作者: Zichun Xu,Yuntao Li,Zhaomin Wang,Lei Zhuang,Guocai Yang,Jingdong Zhao
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Depth information is robust to scene appearance variations and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme is designed with masked and unmasked tokens to accelerate the sample efficiency during the reinforcement learning progress. For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over training processes.
zh
[CV-145] Stable Score Distillation
【速读】:该论文试图解决文本引导的图像和3D编辑中存在稳定性差、空间控制不足以及编辑强度有限的问题。这些问题主要源于现有方法依赖复杂的辅助结构,导致优化信号冲突并限制了精确的局部编辑。解决方案的关键在于提出Stable Score Distillation (SSD),通过将单一分类器锚定到源提示,利用Classifier-Free Guidance (CFG) 方程实现跨提示对齐,并引入一个常量项空文本分支以稳定优化过程,从而提升编辑过程的稳定性和一致性。
链接: https://arxiv.org/abs/2507.09168
作者: Haiming Zhu,Yangyang Xu,Chenshu Xu,Tingrui Shen,Wenxi Liu,Yong Du,Jun Yu,Shengfeng He
机构: Singapore Management University; Harbin Institute of Technology (Shenzhen); South China University of Technology; Fuzhou University; Ocean University of China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-guided image and 3D editing have advanced with diffusion-based models, yet methods like Delta Denoising Score often struggle with stability, spatial control, and editing strength. These limitations stem from reliance on complex auxiliary structures, which introduce conflicting optimization signals and restrict precise, localized edits. We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. Specifically, SSD utilizes Classifier-Free Guidance (CFG) equation to achieves cross-prompt alignment, and introduces a constant term null-text branch to stabilize the optimization process. This approach preserves the original content’s structure and ensures that editing trajectories are closely aligned with the source prompt, enabling smooth, prompt-specific modifications while maintaining coherence in surrounding regions. Additionally, SSD incorporates a prompt enhancement branch to boost editing strength, particularly for style transformations. Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity, providing a robust and efficient solution for text-guided editing.
zh
[CV-146] I2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
【速读】:该论文试图解决在自动驾驶系统中通过基于占用的世界模型预测3D场景的演变并生成未见过的场景时,高效对复杂3D场景进行分词(tokenization)这一关键挑战。其解决方案的关键在于提出I^2-World框架,该框架将场景分词解耦为场景内(intra-scene)和场景间(inter-scene)分词器,其中场景内分词器采用多尺度残差量化策略以层次化压缩3D场景并保留空间细节,而场景间分词器则通过残差聚合跨时间步的时序依赖关系,从而在保持3D分词紧凑性的同时保留4D分词器的动态表达能力。
链接: https://arxiv.org/abs/2507.09144
作者: Zhimin Liao,Ping Wei,Ruijie Zhang,Shuaijia Chen,Haoxuan Wang,Ziyang Ren
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose I^2 -World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, I^2 -World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that I^2 -World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on this https URL.
zh
[CV-147] PoseLLM : Enhancing Language-Guided Human Pose Estimation with MLP Alignment
【速读】:该论文旨在解决传统人体姿态估计方法依赖关键点先验信息导致的泛化能力受限问题,以及现有语言引导方法如LocLLM在复杂空间-文本交互建模方面的不足。其解决方案的关键在于提出PoseLLM,这是首个基于大型语言模型(Large Language Model, LLM)的人体姿态估计框架,通过将LocLLM中的线性投影器替换为非线性多层感知机(MLP)视觉-语言连接器,实现了更高效的跨模态特征融合与转换,从而提升了定位精度并保持了良好的零样本泛化能力。
链接: https://arxiv.org/abs/2507.09139
作者: Dewen Zhang,Tahir Hussain,Wangpeng An,Hayaru Shouno
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM’s linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at this https URL.
zh
[CV-148] SnapMoGen: Human Motion Generation from Expressive Texts
【速读】:该论文试图解决当前文本到动作生成方法在处理长文本提示时的局限性,主要由于数据集的约束导致细粒度控制能力和对未见提示的泛化能力不足。解决方案的关键在于引入了SnapMoGen数据集,该数据集包含高质量的动作捕捉数据与准确且富有表现力的文本注释,同时保留了原始时间连续性,以支持长期动作生成和混合研究。此外,论文提出MoMask++模型,通过将动作转化为多尺度标记序列并利用单一生成式掩码Transformer学习生成所有标记,从而提升模型性能。
链接: https://arxiv.org/abs/2507.09122
作者: Chuan Guo,Inwoo Hwang,Jian Wang,Bing Zhou
机构: Snap Inc. (Snap Inc.); Seoul National University (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Webpage: this https URL
Abstract:Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: this https URL
zh
[CV-149] Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning ICCV2025
【速读】:该论文旨在解决在持续学习场景下,模型在学习新任务时容易遗忘先前任务知识的问题,尤其是在利用生成式视觉-语言预训练模型(如CLIP)进行分类增量学习时所面临的模态差距(modality gap)问题。其解决方案的关键在于通过模态差距的保持(modality gap preservation)来缓解遗忘,并通过模态差距补偿(modality gap compensation)来增强对新数据的适应能力,从而提出一种基于模态差距的新视角来提升持续学习性能。
链接: https://arxiv.org/abs/2507.09118
作者: Linlan Huang,Xusheng Cao,Haori Lu,Yifan Meng,Fei Yang,Xialei Liu
机构: Nankai University (南开大学); NKIARI (NKIARI)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICCV 2025
Abstract:Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP’s performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data. Our code is available at this https URL.
zh
[CV-150] RoHOI: Robustness Benchmark for Human-Object Interaction Detection KR
【速读】:该论文旨在解决Human-Object Interaction (HOI)检测模型在真实世界条件下因未预见的噪声、遮挡和环境变化等因素导致性能下降的问题。其解决方案的关键在于提出一种基于语义感知掩码的渐进式学习策略(SAMPL),通过引导模型利用整体和局部线索进行优化,动态调整模型的优化过程,从而增强鲁棒性特征学习能力。
链接: https://arxiv.org/abs/2507.09111
作者: Di Wen,Kunyu Peng,Kailun Yang,Yufan Chen,Ruiping Liu,Junwei Zheng,Alina Roitberg,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Hunan University (湖南大学); University of Stuttgart (斯图加特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Benchmarks, datasets, and code will be made publicly available at this https URL
Abstract:Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusion, and noise. Our benchmark, RoHOI, includes 20 corruption types based on HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the related field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, dynamically adjusting the model’s optimization to enhance robust feature learning. Extensive experiments show our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at this https URL.
zh
[CV-151] Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production
【速读】:该论文旨在解决早期手语生成(Sign Language Production, SLP)模型在推理阶段因误差累积导致性能下降的问题,以及基于扩散模型的生成方法在实时任务中因迭代性和全序列去噪限制而适用性不足的问题。其解决方案的关键在于首次将自回归模型与扩散模型相结合,利用自回归模型在序列依赖建模上的优势和扩散模型在输出优化上的能力,同时引入多尺度姿态表示模块和基于置信度的因果注意力机制,以提升生成质量与实时流式处理效率。
链接: https://arxiv.org/abs/2507.09105
作者: Maoxiao Ye,Xinfeng Ye,Mano Manoharan
机构: University of Auckland(奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Earlier Sign Language Production (SLP) models typically relied on autoregressive methods that generate output tokens one by one, which inherently provide temporal alignment. Although techniques like Teacher Forcing can prevent model collapse during training, they still cannot solve the problem of error accumulation during inference, since ground truth is unavailable at that stage. In contrast, more recent approaches based on diffusion models leverage step-by-step denoising to enable high-quality generation. However, the iterative nature of these models and the requirement to denoise entire sequences limit their applicability in real-time tasks like SLP. To address it, we apply a hybrid approach combining autoregressive and diffusion models to SLP for the first time, leveraging the strengths of both models in sequential dependency modeling and output refinement. To capture fine-grained body movements, we design a Multi-Scale Pose Representation module that separately extracts detailed features from distinct articulators and integrates them via a Multi-Scale Fusion module. Furthermore, we introduce a Confidence-Aware Causal Attention mechanism that utilizes joint-level confidence scores to dynamically guide the pose generation process, improving accuracy and robustness. Extensive experiments on the PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method in both generation quality and real-time streaming efficiency.
zh
[CV-152] Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning ICCV2025
【速读】:该论文试图解决3D点云自监督学习中由于3D扩散模型训练数据量有限而导致性能受限的问题。其解决方案的关键在于利用文本到图像扩散模型(如Stable Diffusion, SD)的强大能力,通过将SD模型的文本编码器替换为3D编码器,构建一个点到图像的扩散模型,使点云能够引导渲染噪声图像的去噪过程,从而提升点云的表征学习效果。
链接: https://arxiv.org/abs/2507.09102
作者: Yiyang Chen,Shanshan Zhao,Lunhao Duan,Changxing Ding,Dacheng Tao
机构: South China University of Technology (华南理工大学); Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); Pazhou Lab (琶洲实验室); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model’s text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at this https URL.
zh
[CV-153] RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze
【速读】:该论文试图解决在胸部X光片(CXR)分析中,如何有效利用放射科医生的视觉注意力信息以提升大型视觉-语言模型(LVLMs)性能的问题。现有方法通常仅依赖热图或文本提示,而忽略了眼动序列的时序信息,这可能影响模型对感兴趣区域及其检查顺序的理解。解决方案的关键在于提出一种名为RadEyeVideo的新方法,该方法将放射科医生的眼动数据作为视频序列进行整合,从而捕捉其注视的时空动态特性,进而提升模型在CXR报告生成和疾病诊断任务中的表现。
链接: https://arxiv.org/abs/2507.09097
作者: Yunsoo Kim,Jinge Wu,Honghan Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists’ eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists’ eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and on average 15.2% for both tasks using scaled evaluation metrics. Notably, RadEyeVideo enhanced an open-domain LVLM model, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, trained on large Chest X-ray data. This work highlights that domain expert’s knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance general-domain models’ capabilities in clinical tasks. RadEyeVideo is a step toward a scalable human-centered approach of utilizing LVLMs in medical image analytics.
zh
[CV-154] MI CAM: Mutual Information Weighted Activation Mapping for Causal Visual Explanations of Convolutional Neural Networks
【速读】:该论文试图解决卷积神经网络(Convolutional Neural Networks, CNN)在关键日常应用中缺乏可解释性的问题,即如何揭示网络为何给出特定推理结果。解决方案的关键在于提出一种基于激活映射的新型后处理可视化解释方法——MI CAM,该方法通过输入图像与最终结果之间的互信息对每个特征图进行加权,并通过加权和的方式生成显著性可视化结果,从而实现因果性解释。
链接: https://arxiv.org/abs/2507.09092
作者: Ram S Iyer,Narayan S Iyer,Rugmini Ammal P
机构: Rajiv Gandhi Institute of Petroleum Technology (拉吉夫·甘地石油技术研究所); National Institute of Technology Rourkela (国家技术学院鲁尔克拉); ZGC Calicut (齐克卡尔库特中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 10 figures
Abstract:With the intervention of machine vision in our crucial day to day necessities including healthcare and automated power plants, attention has been drawn to the internal mechanisms of convolutional neural networks, and the reason why the network provides specific inferences. This paper proposes a novel post-hoc visual explanation method called MI CAM based on activation mapping. Differing from previous class activation mapping based approaches, MI CAM produces saliency visualizations by weighing each feature map through its mutual information with the input image and the final result is generated by a linear combination of weights and activation maps. It also adheres to producing causal interpretations as validated with the help of counterfactual analysis. We aim to exhibit the visual performance and unbiased justifications for the model inferencing procedure achieved by MI CAM. Our approach works at par with all state-of-the-art methods but particularly outperforms some in terms of qualitative and quantitative measures. The implementation of proposed method can be found on this https URL
zh
[CV-155] aming generative video models for zero-shot optical flow extraction
【速读】:该论文试图解决从视频中提取光流(optical flow)的问题,这一任务在计算机视觉中具有核心地位。传统方法通常依赖于监督学习或基于图像的损失函数,但这些方法在光流领域面临标签稀缺和模拟到现实的差距问题。论文提出的解决方案的关键在于利用冻结的自监督视频模型,通过反事实提示(counterfactual prompting)技术,无需微调即可输出光流。其核心思想是通过在模型中注入局部扰动并跟踪其传播,结合生成式视频模型的特定属性,如分布预测、因子化潜在表示和随机访问解码,实现零样本光流提取。
链接: https://arxiv.org/abs/2507.09082
作者: Seungwoo Kim,Khai Loong Aw,Klemen Kotar,Cristobal Eyzaguirre,Wanhee Lee,Yunong Liu,Jared Watrous,Stefan Stojanov,Juan Carlos Niebles,Jiajun Wu,Daniel L. K. Yamins
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL
Abstract:Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and synthetic TAP-Vid Kubric (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
zh
[CV-156] From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion
【速读】:该论文旨在解决定量遥感反演中如何从卫星观测数据中准确估计连续地表变量(如生物量、植被指数和蒸散发)的问题,以支持生态系统监测、碳核算和土地管理等应用。其解决方案的关键在于方法论的演进,从传统的物理模型(如PROSPECT、SCOPE、DART)到机器学习方法(如深度学习、多模态融合),再到当前基于基础模型(Foundation Model, FM)的方法(如SatMAE、GFM、mmEarth)。论文重点分析了不同范式下的建模假设、应用场景及局限性,并强调了近期基础模型在自监督预训练、多模态整合和跨任务适应方面的进展。
链接: https://arxiv.org/abs/2507.09081
作者: Zhenyu Yu,Mohd Yamani Idna Idris,Hua Wang,Pei Wang,Junyi Chen,Kun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Quantitative remote sensing inversion aims to estimate continuous surface variables-such as biomass, vegetation indices, and evapotranspiration-from satellite observations, supporting applications in ecosystem monitoring, carbon accounting, and land management. With the evolution of remote sensing systems and artificial intelligence, traditional physics-based paradigms are giving way to data-driven and foundation model (FM)-based approaches. This paper systematically reviews the methodological evolution of inversion techniques, from physical models (e.g., PROSPECT, SCOPE, DART) to machine learning methods (e.g., deep learning, multimodal fusion), and further to foundation models (e.g., SatMAE, GFM, mmEarth). We compare the modeling assumptions, application scenarios, and limitations of each paradigm, with emphasis on recent FM advances in self-supervised pretraining, multi-modal integration, and cross-task adaptation. We also highlight persistent challenges in physical interpretability, domain generalization, limited supervision, and uncertainty quantification. Finally, we envision the development of next-generation foundation models for remote sensing inversion, emphasizing unified modeling capacity, cross-domain generalization, and physical interpretability.
zh
[CV-157] BlindSight: Harnessing Sparsity for Efficient VLMs
【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, VLMs)在处理多图像输入时因视觉数据引入导致的提示长度增加和注意力计算复杂度上升的问题,从而造成预填充时间变长的瓶颈。解决方案的关键在于利用注意力计算中的固有稀疏性,通过分析VLM中注意力模式,识别出跨图像注意力较少的层,并基于此提出BlindSight方法,该方法采用输入模板感知的注意力稀疏性掩码,在不进行额外训练的情况下优化VLM推理,实现显著的计算量减少(平均FLOPs降低32%-41%),同时保持模型精度变化在-2%至+2%范围内。
链接: https://arxiv.org/abs/2507.09071
作者: Tharun Adithya Srikrishnan,Deval Shah,Steven K. Reinhardt
机构: Advanced Micro Devices, Inc. (Advanced Micro Devices, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length. Along with the quadratic complexity of the attention computation, this results in a longer prefill duration. An approach to mitigate this bottleneck is to leverage the inherent sparsity in the attention computation. In our analysis of attention patterns in VLMs, we observe that a substantial portion of layers exhibit minimal cross-image attention, except through attention-sink tokens per image. These sparse attention patterns fall into distinct categories: sink-only, document mask and a hybrid document-sink mask. Based on this, we propose BlindSight: a training-free approach to optimize VLM inference using a input template-aware attention sparsity mask. We utilize samples from a dataset to derive a prompt-agnostic sparsity categorization for every attention head. We evaluate the proposed technique using VLMs such as Qwen2-VL, Qwen2.5-VL and Gemma-3. BlindSight results in a 32%-41% reduction in FLOPs on average with -2%-+2% accuracy compared to the original model in most evaluated multi-image understanding benchmarks.
zh
[CV-158] Infinite Video Understanding
【速读】:该论文试图解决视频理解中处理超长时视频内容的挑战,即模型在面对持续时间超过分钟或小时的视频时,面临计算和内存限制、时间连贯性保持、复杂事件跟踪以及细粒度细节保留等问题。其解决方案的关键在于提出“无限视频理解”(Infinite Video Understanding)这一前瞻性研究目标,旨在推动流式架构、持久化记忆机制、分层自适应表示、以事件为中心的推理等领域的创新,从而实现对任意长度视频数据的持续处理与理解。
链接: https://arxiv.org/abs/2507.09068
作者: Dell Zhang,Xiangyu Chen,Jixiang Luo,Mengxi Jia,Changzhi Sun,Ruilong Ren,Jingren Liu,Hao Sun,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom; Peking University; Tianjin University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:The rapid advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have ushered in remarkable progress in video understanding. However, a fundamental challenge persists: effectively processing and comprehending video content that extends beyond minutes or hours. While recent efforts like Video-XL-2 have demonstrated novel architectural solutions for extreme efficiency, and advancements in positional encoding such as HoPE and VideoRoPE++ aim to improve spatio-temporal understanding over extensive contexts, current state-of-the-art models still encounter significant computational and memory constraints when faced with the sheer volume of visual tokens from lengthy sequences. Furthermore, maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods remain formidable hurdles, despite progress in agentic reasoning systems like Deep Video Discovery. This position paper posits that a logical, albeit ambitious, next frontier for multimedia research is Infinite Video Understanding – the capability for models to continuously process, understand, and reason about video data of arbitrary, potentially never-ending duration. We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia, and the wider AI, research communities, driving innovation in areas such as streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. Drawing inspiration from recent work on long/ultra-long video understanding and several closely related fields, we outline the core challenges and key research directions towards achieving this transformative capability.
zh
[CV-159] Can Contrastive Learning Improve Class-Imbalanced Diffusion Model?
【速读】:该论文试图解决类别条件图像生成中训练数据长尾分布导致的尾部类别图像多样性不足问题,该问题会导致模式崩溃并降低尾部类别的生成多样性。其解决方案的关键在于引入两种看似简单但效果显著的对比损失函数:首先,采用无监督的InfoNCE损失利用负样本增加合成图像之间的距离/差异性,尤其针对尾部类别;其次,引入MSE损失对比条件生成与无条件生成在大时间步的差异,使去噪过程在初始步骤对类别条件不敏感,从而通过头部类别的知识共享丰富尾部类别。
链接: https://arxiv.org/abs/2507.09052
作者: Fang Chen,Alex Villa,Gongbo Liang,Xiaoyi Lu,Meng Tang
机构: University of California Merced (加州大学默塞德分校); Texas A&M University-San Antonio (得克萨斯农工大学圣安东尼奥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 11 figures
Abstract:Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity of tail class images without compromising the fidelity and diversity of head class images. We achieve this by introducing two deceptively simple but highly effective contrastive loss functions. Firstly, we employ an unsupervised InfoNCE loss utilizing negative samples to increase the distance/dissimilarity among synthetic images, particularly for tail classes. To further enhance the diversity of tail classes, our second loss is an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. Conditional-unconditional alignment has been shown to enhance the performance of long-tailed GAN. We are the first to adapt such alignment to diffusion models. We successfully leveraged contrastive learning for class-imbalanced diffusion models. Our contrastive learning framework is easy to implement and outperforms standard DDPM and alternative methods for class-imbalanced diffusion models across various datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT.
zh
[CV-160] BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis
【速读】:该论文试图解决脑部病变图像分析流程构建的复杂性问题,旨在为临床和科研实践提供一种模块化、高效且易用的工具链。解决方案的关键在于BrainLesion Suite的核心可适应性预处理模块,该模块能够对任意多模态输入图像进行配准、图谱配准以及可选的去骨处理和去脸处理,同时结合BraTS挑战中的算法实现缺失模态合成、病灶填补及病理特异性肿瘤分割,从而支持复杂的图像分析工作流构建与分割模型性能评估。
链接: https://arxiv.org/abs/2507.09036
作者: Florian Kofler,Marcel Rosier,Mehdi Astaraki,Hendrik Möller,Ilhem Isra Mekki,Josef A. Buchner,Anton Schmick,Arianna Pfiffer,Eva Oswald,Lucas Zimmer,Ezequiel de la Rosa,Sarthak Pati,Julian Canisius,Arianna Piffer,Ujjwal Baid,Mahyar Valizadeh,Akis Linardos,Jan C. Peeken,Surprosanna Shit,Felix Steinbauer,Daniel Rueckert,Rolf Heckemann,Spyridon Bakas,Jan Kirschke,Constantin von See,Ivan Ezhov,Marie Piraud,Benedikt Wiestler,Bjoern Menze
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 16p, 3f
Abstract:BrainLesion Suite is a versatile toolkit for building modular brain lesion image analysis pipelines in Python. Following Pythonic principles, BrainLesion Suite is designed to provide a ‘brainless’ development experience, minimizing cognitive effort and streamlining the creation of complex workflows for clinical and scientific practice. At its core is an adaptable preprocessing module that performs co-registration, atlas registration, and optional skull-stripping and defacing on arbitrary multi-modal input images. BrainLesion Suite leverages algorithms from the BraTS challenge to synthesize missing modalities, inpaint lesions, and generate pathology-specific tumor segmentations. BrainLesion Suite also enables quantifying segmentation model performance, with tools such as panoptica to compute lesion-wise metrics. Although BrainLesion Suite was originally developed for image analysis pipelines of brain lesions such as glioma, metastasis, and multiple sclerosis, it can be adapted for other biomedical image analysis applications. The individual BrainLesion Suite packages and tutorials are accessible on GitHub.
zh
[CV-161] Confounder-Free Continual Learning via Recursive Feature Normalization
【速读】:该论文试图解决在持续学习(continual learning)场景下,由于混杂变量(confounders)引起的特征表示不不变性问题,从而导致预测偏差和灾难性遗忘(catastrophic forgetting)。其解决方案的关键在于引入递归元数据归一化(Recursive MDN, R-MDN)层,该层通过递归最小二乘算法进行统计回归,以持续更新模型内部状态,从而消除混杂变量对中间特征表示的影响。
链接: https://arxiv.org/abs/2507.09031
作者: Yash Shah,Camila Gonzalez,Mohammad H. Abbasi,Qingyu Zhao,Kilian M. Pohl,Ehsan Adeli
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Confounders are extraneous variables that affect both the input and the target, resulting in spurious correlations and biased predictions. There are recent advances in dealing with or removing confounders in traditional models, such as metadata normalization (MDN), where the distribution of the learned features is adjusted based on the study confounders. However, in the context of continual learning, where a model learns continuously from new data over time without forgetting, learning feature representations that are invariant to confounders remains a significant challenge. To remove their influence from intermediate feature representations, we introduce the Recursive MDN (R-MDN) layer, which can be integrated into any deep learning architecture, including vision transformers, and at any model stage. R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.
zh
[CV-162] VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels
【速读】:该论文试图解决多模态基础模型(Multi-modal Foundation Models, FMs)生成标签的质量问题,尤其是在缺乏真实标签的情况下难以有效验证大规模数据的问题。现有方法过于关注数据量而忽视了数据质量,导致无法全面识别和纠正FM生成标签中的潜在问题。论文提出的解决方案关键在于引入VISTA,这是一个结合多阶段数据验证策略与人类专家知识的可视化分析框架,使人类能够识别、理解和修正FM生成标签中的隐藏问题,从而提升多模态模型的性能。
链接: https://arxiv.org/abs/2507.09008
作者: Xiwei Xuan,Xiaoqi Wang,Wenbin He,Jorge Piazentin Ono,Liang Gou,Kwan-Liu Ma,Liu Ren
机构: Bosch Center for Artificial Intelligence (BCAI); Splunk Technology (Splunk Technology)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Visualization and Computer Graphics (2025)
Abstract:The advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied as existing approaches focus more on data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small data fraction, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA’s effectiveness from both quantitative and qualitative perspectives.
zh
[CV-163] From images to properties: a NeRF-driven framework for granular material parameter inversion
【速读】:该论文试图解决从视觉观测中推断颗粒材料属性的问题,特别是摩擦角的估计问题。其解决方案的关键在于将神经辐射场(NeRF)与材料点方法(MPM)模拟相结合,通过生成合成实验数据并利用贝叶斯优化最小化图像损失,从而实现对未知材料参数的逆向分析。
链接: https://arxiv.org/abs/2507.09005
作者: Cheng-Hsi Hsiao,Krishna Kumar
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
备注:
Abstract:We introduce a novel framework that integrates Neural Radiance Fields (NeRF) with Material Point Method (MPM) simulation to infer granular material properties from visual observations. Our approach begins by generating synthetic experimental data, simulating an plow interacting with sand. The experiment is rendered into realistic images as the photographic observations. These observations include multi-view images of the experiment’s initial state and time-sequenced images from two fixed cameras. Using NeRF, we reconstruct the 3D geometry from the initial multi-view images, leveraging its capability to synthesize novel viewpoints and capture intricate surface details. The reconstructed geometry is then used to initialize material point positions for the MPM simulation, where the friction angle remains unknown. We render images of the simulation under the same camera setup and compare them to the observed images. By employing Bayesian optimization, we minimize the image loss to estimate the best-fitting friction angle. Our results demonstrate that friction angle can be estimated with an error within 2 degrees, highlighting the effectiveness of inverse analysis through purely visual observations. This approach offers a promising solution for characterizing granular materials in real-world scenarios where direct measurement is impractical or impossible.
zh
[CV-164] Video Inference for Human Mesh Recovery with Vision Transformer
【速读】:该论文试图解决从单张图像中进行人体网格恢复(Human Mesh Recovery, HMR)的挑战性问题,该任务由于其固有的模糊性而难以准确完成。现有方法通常仅利用时间信息或运动学关系来提高精度,但缺乏同时结合两者的方法。该论文提出的解决方案是“基于视觉Transformer的人体网格恢复视频推理方法(HMR-ViT)”,其关键在于构建一个融合时间-运动学特征的特征图像,通过通道重排矩阵(Channel Rearranging Matrix, CRM)将相似的运动学特征在空间上邻近排列,随后使用视觉Transformer进行编码,并通过回归网络推断SMPL姿态和形状参数。
链接: https://arxiv.org/abs/2507.08981
作者: Hanbyel Cho,Jaesung Ahn,Yooshin Cho,Junmo Kim
机构: KAIST(韩国科学技术院); Kim Jaechul Graduate School of AI(金载雄人工智能研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE FG 2023
Abstract:Human Mesh Recovery (HMR) from an image is a challenging problem because of the inherent ambiguity of the task. Existing HMR methods utilized either temporal information or kinematic relationships to achieve higher accuracy, but there is no method using both. Hence, we propose “Video Inference for Human Mesh Recovery with Vision Transformer (HMR-ViT)” that can take into account both temporal and kinematic information. In HMR-ViT, a Temporal-kinematic Feature Image is constructed using feature vectors obtained from video frames by an image encoder. When generating the feature image, we use a Channel Rearranging Matrix (CRM) so that similar kinematic features could be located spatially close together. The feature image is then further encoded using Vision Transformer, and the SMPL pose and shape parameters are finally inferred using a regression network. Extensive evaluation on the 3DPW and Human3.6M datasets indicates that our method achieves a competitive performance in HMR.
zh
[CV-165] Learning Diffusion Models with Flexible Representation Guidance
【速读】:该论文试图解决扩散模型在生成过程中缺乏有效表示对齐的问题,从而影响生成质量。其解决方案的关键在于引入一种系统框架,将辅助表示引导整合到扩散模型中,通过不同的去噪模型分解及其相应的训练准则,确定辅助表示的引入时机和方式。同时,基于理论洞察,提出了两种增强表示对齐的新策略:一是将示例与来自自身或不同合成模态的目标表示进行配对,并学习多模态对的联合模型;二是设计一个平衡表示学习与数据生成的最优训练课程。
链接: https://arxiv.org/abs/2507.08980
作者: Chenyu Wang,Cai Zhou,Sharut Gupta,Zongyu Lin,Stefanie Jegelka,Stephen Bates,Tommi Jaakkola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet 256\times 256 benchmark, our guidance results in 23.3 times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at this https URL.
zh
[CV-166] PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM -Guided Embedding Projection ICCV2025
【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在训练数据中继承并放大偏差的问题,从而导致预测结果偏斜。其解决方案的关键在于提出一种无需依赖预定义偏差类别或外部数据的全新数据无关且任务无关的偏差缓解方法——基于投影的隐式虚假偏差减少方法(Projection-based Reduction of Implicit Spurious bias in vision-language Models, PRISM)。PRISM通过两个阶段实现:首先利用大语言模型(LLM)生成包含虚假相关性的场景描述,随后采用一种新颖的对比风格去偏差损失函数,学习将嵌入映射到一个最小化虚假相关性同时保持图像与文本对齐的潜在空间。
链接: https://arxiv.org/abs/2507.08979
作者: Mahdiyar Molahasani,Azadeh Motamedi,Michael Greenspan,Il-Min Kim,Ali Etemad
机构: Queen’s University (皇后大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICCV 2025
Abstract:We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text this http URL experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets We make our code public at: this https URL.
zh
[CV-167] Detecting Deepfake Talking Heads from Facial Biometric Anomalies
【速读】:该论文试图解决深度伪造视频(deepfake video) impersonations带来的安全威胁,特别是针对语音克隆、面部交换或唇形同步等技术生成的高真实感伪造视频的检测问题。解决方案的关键在于提出一种新颖的取证机器学习方法,该方法利用面部生物特征中的非自然模式来识别深度伪造视频。
链接: https://arxiv.org/abs/2507.08917
作者: Justin D. Norman,Hany Farid
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 3 tables
Abstract:The combination of highly realistic voice cloning, along with visually compelling avatar, face-swap, or lip-sync deepfake video generation, makes it relatively easy to create a video of anyone saying anything. Today, such deepfake impersonations are often used to power frauds, scams, and political disinformation. We propose a novel forensic machine learning technique for the detection of deepfake video impersonations that leverages unnatural patterns in facial biometrics. We evaluate this technique across a large dataset of deepfake techniques and impersonations, as well as assess its reliability to video laundering and its generalization to previously unseen video deepfake generators.
zh
[CV-168] Multimodal HD Mapping for Intersections by Intelligent Roadside Units ITSC’25
【速读】:该论文旨在解决复杂交叉口的高精度语义地图构建问题,传统基于车辆的方法由于遮挡和视角限制面临显著挑战。其解决方案的关键在于提出一种基于相机与LiDAR融合的框架,利用 elevated 智能路侧单元(IRU)的数据,通过两阶段过程实现模态特异性特征提取与跨模态语义融合,充分利用相机的高分辨率纹理信息和LiDAR的精确几何数据。
链接: https://arxiv.org/abs/2507.08903
作者: Zhongzhang Chen,Miao Fan,Shengtong Xu,Mengmeng Yang,Kun Jiang,Xiangzeng Liu,Haoyi Xiong
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ITSC’25
Abstract:High-definition (HD) semantic mapping of complex intersections poses significant challenges for traditional vehicle-based approaches due to occlusions and limited perspectives. This paper introduces a novel camera-LiDAR fusion framework that leverages elevated intelligent roadside units (IRUs). Additionally, we present RS-seq, a comprehensive dataset developed through the systematic enhancement and annotation of the V2X-Seq dataset. RS-seq includes precisely labelled camera imagery and LiDAR point clouds collected from roadside installations, along with vectorized maps for seven intersections annotated with detailed features such as lane dividers, pedestrian crossings, and stop lines. This dataset facilitates the systematic investigation of cross-modal complementarity for HD map generation using IRU data. The proposed fusion framework employs a two-stage process that integrates modality-specific feature extraction and cross-modal semantic integration, capitalizing on camera high-resolution texture and precise geometric data from LiDAR. Quantitative evaluations using the RS-seq dataset demonstrate that our multimodal approach consistently surpasses unimodal methods. Specifically, compared to unimodal baselines evaluated on the RS-seq dataset, the multimodal approach improves the mean Intersection-over-Union (mIoU) for semantic segmentation by 4% over the image-only results and 18% over the point cloud-only results. This study establishes a baseline methodology for IRU-based HD semantic mapping and provides a valuable dataset for future research in infrastructure-assisted autonomous driving systems.
zh
[CV-169] Zero-Shot Neural Architecture Search with Weighted Response Correlation
【速读】:该论文试图解决神经网络架构搜索(Neural Architecture Search, NAS)中架构估计计算成本高、耗时长的问题。现有零样本NAS方法虽然使用了无需训练的代理来加速架构估计,但其有效性、稳定性和泛化能力仍有不足。论文提出的解决方案的关键是引入一种新的无需训练的估计代理——加权响应相关性(Weighted Response Correlation, WRCor),该方法通过计算不同输入样本响应的相关系数矩阵来评估架构的表达能力和泛化能力,从而实现更高效和稳定的架构估计。实验结果表明,WRCor及其投票代理在代理评估和架构搜索中均优于现有方法。
链接: https://arxiv.org/abs/2507.08841
作者: Kun Jing,Luoyu Chen,Jungang Xu,Jianwei Tai,Yiyu Wang,Shuaimin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural architecture search (NAS) is a promising approach for automatically designing neural network architectures. However, the architecture estimation of NAS is computationally expensive and time-consuming because of training multiple architectures from scratch. Although existing zero-shot NAS methods use training-free proxies to accelerate the architecture estimation, their effectiveness, stability, and generality are still lacking. We present a novel training-free estimation proxy called weighted response correlation (WRCor). WRCor utilizes correlation coefficient matrices of responses across different input samples to calculate the proxy scores of estimated architectures, which can measure their expressivity and generalizability. Experimental results on proxy evaluation demonstrate that WRCor and its voting proxies are more efficient estimation strategies than existing proxies. We also apply them with different search strategies in architecture search. Experimental results on architecture search show that our zero-shot NAS algorithm outperforms most existing NAS algorithms in different search spaces. Our NAS algorithm can discover an architecture with a 22.1% test error on the ImageNet-1k dataset within 4 GPU hours. All codes are publicly available at this https URL.
zh
[CV-170] View Invariant Learning for Vision-Language Navigation in Continuous Environments
【速读】:该论文试图解决视觉-语言导航在连续环境(VLNCE)中对视角变化敏感的问题,即由于相机高度和视角变化导致的观测差异影响导航策略的性能。解决方案的关键在于提出一种视图不变学习(VIL)的后训练策略,通过对比学习框架学习稀疏且视图不变的特征,并引入教师-学生框架对路径点预测模块进行知识蒸馏,从而提升现有导航策略对视角变化的鲁棒性。
链接: https://arxiv.org/abs/2507.08831
作者: Josh Qixuan Sun,Xiaoying Xing,Huaiyuan Weng,Chul Min Yeum,Mark Crowley
机构: University of Waterloo (滑铁卢大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Under review
Abstract:Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent’s observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
zh
[CV-171] Lightweight Cloud Masking Models for On-Board Inference in Hyperspectral Imaging
【速读】:该论文试图解决高光谱卫星成像中的云和云影掩膜问题,这是提取高质量、可分析数据的关键预处理步骤。研究评估了多种机器学习方法,包括梯度提升方法(如XGBoost和LightGBM)以及卷积神经网络(CNN)。其中,采用特征缩减的CNN模型表现出最高的效率,实现了高精度、低存储需求和在CPU与GPU上的快速推理时间。该模型通过仅最多597个可训练参数,在部署可行性、准确性和计算效率之间取得了最佳平衡,展示了轻量级人工智能模型在实时高光谱图像处理中的潜力,支持了星载卫星AI系统的开发。
链接: https://arxiv.org/abs/2507.08052
作者: Mazen Ali,António Pereira,Fabio Gentile,Aser Cortines,Sam Mugel,Román Orús,Stelios P. Neophytides,Michalis Mavrovouniotis
机构: Multiverse Computing(多宇宙计算); ERATOSTHENES Centre of Excellence(埃拉托色尼卓越中心); Cyprus University of Technology(塞浦路斯技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Cloud and cloud shadow masking is a crucial preprocessing step in hyperspectral satellite imaging, enabling the extraction of high-quality, analysis-ready data. This study evaluates various machine learning approaches, including gradient boosting methods such as XGBoost and LightGBM as well as convolutional neural networks (CNNs). All boosting and CNN models achieved accuracies exceeding 93%. Among the investigated models, the CNN with feature reduction emerged as the most efficient, offering a balance of high accuracy, low storage requirements, and rapid inference times on both CPUs and GPUs. Variations of this version, with only up to 597 trainable parameters, demonstrated the best trade-off in terms of deployment feasibility, accuracy, and computational efficiency. These results demonstrate the potential of lightweight artificial intelligence (AI) models for real-time hyperspectral image processing, supporting the development of on-board satellite AI systems for space-based applications.
zh
[CV-172] An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds
【速读】:该论文旨在解决多光谱点云(MPC)在户外数据集上分类时面临的稀疏标注目标、地物尺度差异以及长尾分布等问题。其解决方案的关键在于提出一种基于自适应多尺度融合的增强分类方法,包括在训练集生成阶段采用网格平衡采样策略以可靠生成训练样本,在特征学习阶段引入多尺度特征融合模块以融合不同尺度的地物浅层特征,在分类阶段设计自适应混合损失模块以利用具有自适应权重的多分类头来平衡不同类别的学习能力,从而提升因地物尺度多样性和长尾分布而导致的小类别分类性能。
链接: https://arxiv.org/abs/2412.11407
作者: TianZhu Liu,BangYan Hu,YanFeng Gu,Xian Li,Aleksandra Pižurica
机构: Harbin Institute of Technology (哈尔滨工业大学); Ghent University (根特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 16 pages, 9 figures, 5 tables
Abstract:Multispectral point cloud (MPC) captures 3D spatial-spectral information from the observed scene, which can be used for scene understanding and has a wide range of applications. However, most of the existing classification methods were extensively tested on indoor datasets, and when applied to outdoor datasets they still face problems including sparse labeled targets, differences in land-covers scales, and long-tailed distributions. To address the above issues, an enhanced classification method based on adaptive multi-scale fusion for MPCs with long-tailed distributions is proposed. In the training set generation stage, a grid-balanced sampling strategy is designed to reliably generate training samples from sparse labeled datasets. In the feature learning stage, a multi-scale feature fusion module is proposed to fuse shallow features of land-covers at different scales, addressing the issue of losing fine features due to scale variations in land-covers. In the classification stage, an adaptive hybrid loss module is devised to utilize multi-classification heads with adaptive weights to balance the learning ability of different classes, improving the classification performance of small classes due to various-scales and long-tailed distributions in land-covers. Experimental results on three MPC datasets demonstrate the effectiveness of the proposed method compared with the state-of-the-art methods.
zh
[CV-173] DepViT-CAD: Deployable Vision Transformer-Based Cancer Diagnosis in Histopathology
【速读】:该论文旨在解决准确且及时的癌症诊断问题,通过分析组织病理学切片实现有效的临床决策。其解决方案的关键在于提出DepViT-CAD系统,该系统基于MAViT(Multi-Attention Vision Transformer)架构,能够捕捉多种肿瘤类型的细粒度形态学特征。MAViT在1008张全切片图像的专家标注区域上进行训练,覆盖11个诊断类别,包括10种主要癌症和非肿瘤组织。该系统在两个独立队列中进行了验证,展示了其在实际临床环境中的高诊断灵敏度。
链接: https://arxiv.org/abs/2507.10250
作者: Ashkan Shakarami,Lorenzo Nicole,Rocco Cappellesso,Angelo Paolo Dei Tos,Stefano Ghidoni
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 25 pages, 15 figures
Abstract:Accurate and timely cancer diagnosis from histopathological slides is vital for effective clinical decision-making. This paper introduces DepViT-CAD, a deployable AI system for multi-class cancer diagnosis in histopathology. At its core is MAViT, a novel Multi-Attention Vision Transformer designed to capture fine-grained morphological patterns across diverse tumor types. MAViT was trained on expert-annotated patches from 1008 whole-slide images, covering 11 diagnostic categories, including 10 major cancers and non-tumor tissue. DepViT-CAD was validated on two independent cohorts: 275 WSIs from The Cancer Genome Atlas and 50 routine clinical cases from pathology labs, achieving diagnostic sensitivities of 94.11% and 92%, respectively. By combining state-of-the-art transformer architecture with large-scale real-world validation, DepViT-CAD offers a robust and scalable approach for AI-assisted cancer diagnostics. To support transparency and reproducibility, software and code will be made publicly available at GitHub.
zh
[CV-174] Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS) in Edge Iterative MRI Lesion Localization System (EdgeIMLocSys)
【速读】:该论文旨在解决不同MRI扫描仪间成像质量差异导致的模型泛化能力不足问题,从而提升脑肿瘤分割的准确性与鲁棒性。其解决方案的关键在于提出了一种基于图的多模态交互轻量网络(GMLN-BTS),该网络通过模态感知自适应编码器(M2AE)高效提取多尺度语义特征,并利用图-based多模态协同交互模块(G2MCIM)建模跨模态的互补关系,同时引入体素细化上采样模块(VRUM)以提高分割边界精度。
链接: https://arxiv.org/abs/2507.09995
作者: Guohao Huo,Ruiting Dai,Hao Tang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Brain tumor segmentation plays a critical role in clinical diagnosis and treatment planning, yet the variability in imaging quality across different MRI scanners presents significant challenges to model generalization. To address this, we propose the Edge Iterative MRI Lesion Localization System (EdgeIMLocSys), which integrates Continuous Learning from Human Feedback to adaptively fine-tune segmentation models based on clinician feedback, thereby enhancing robustness to scanner-specific imaging characteristics. Central to this system is the Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS), which employs a Modality-Aware Adaptive Encoder (M2AE) to extract multi-scale semantic features efficiently, and a Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) to model complementary cross-modal relationships via graph structures. Additionally, we introduce a novel Voxel Refinement UpSampling Module (VRUM) that synergistically combines linear interpolation and multi-scale transposed convolutions to suppress artifacts while preserving high-frequency details, improving segmentation boundary accuracy. Our proposed GMLN-BTS model achieves a Dice score of 85.1% on the BraTS2017 dataset with only 4.58 million parameters, representing a 98% reduction compared to mainstream 3D Transformer models, and significantly outperforms existing lightweight approaches. This work demonstrates a synergistic breakthrough in achieving high-accuracy, resource-efficient brain tumor segmentation suitable for deployment in resource-constrained clinical environments.
zh
[CV-175] A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion
【速读】:该论文旨在解决脑肿瘤在磁共振成像(MRI)中的精确分割问题,这一问题对于神经肿瘤学的诊断和治疗计划至关重要。尽管深度学习方法取得了进展,但由于肿瘤形态的异质性和复杂的三维空间关系,自动分割仍然具有挑战性。该研究提出的解决方案的关键在于构建一个多层次融合架构,整合像素级、特征级和语义级信息,从而实现从低层次数据到高层次概念的全面处理。其中,语义级融合路径通过三种机制——3D-2D语义桥接、跨模态语义引导和基于语义的注意力机制——将对比语言-图像预训练(CLIP)模型的语义理解能力与3D U-Net的空间特征提取优势相结合,显著提升了分割性能。
链接: https://arxiv.org/abs/2507.09966
作者: Mingda Zhang
机构: Yunnan University (云南大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages,6 figures
Abstract:Precise segmentation of brain tumors from magnetic resonance imaging (MRI) is essential for neuro-oncology diagnosis and treatment planning. Despite advances in deep learning methods, automatic segmentation remains challenging due to tumor morphological heterogeneity and complex three-dimensional spatial relationships. Current techniques primarily rely on visual features extracted from MRI sequences while underutilizing semantic knowledge embedded in medical reports. This research presents a multi-level fusion architecture that integrates pixel-level, feature-level, and semantic-level information, facilitating comprehensive processing from low-level data to high-level concepts. The semantic-level fusion pathway combines the semantic understanding capabilities of Contrastive Language-Image Pre-training (CLIP) models with the spatial feature extraction advantages of 3D U-Net through three mechanisms: 3D-2D semantic bridging, cross-modal semantic guidance, and semantic-based attention mechanisms. Experimental validation on the BraTS 2020 dataset demonstrates that the proposed model achieves an overall Dice coefficient of 0.8567, representing a 4.8% improvement compared to traditional 3D U-Net, with a 7.3% Dice coefficient increase in the clinically important enhancing tumor (ET) region.
zh
[CV-176] IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution ICCV2025
【速读】:该论文试图解决任意尺度图像超分辨率(ASISR)问题,传统基于查找表(LUT)的方法通常仅适用于固定缩放因子,而现有ASISR技术多依赖隐式神经表示,计算成本和内存需求较高。解决方案的关键在于提出Interpolation Mixing LUT (IM-LUT),通过学习融合多种插值函数来最大化其表征能力,其中IM-Net负责根据局部图像特征和目标缩放因子预测融合权重,并将其转换为IM-LUT以利用LUT替代计算密集型操作,从而在保持重建质量的同时实现轻量级和快速的CPU推理。
链接: https://arxiv.org/abs/2507.09923
作者: Sejin Park,Sangmin Lee,Kyong Hwan Jin,Seung-Won Jung
机构: Korea University (高丽大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Super-resolution (SR) has been a pivotal task in image processing, aimed at enhancing image resolution across various applications. Recently, look-up table (LUT)-based approaches have attracted interest due to their efficiency and performance. However, these methods are typically designed for fixed scale factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing ASISR techniques often employ implicit neural representations, which come with considerable computational cost and memory demands. To address these limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework that operates ASISR by learning to blend multiple interpolation functions to maximize their representational capacity. Specifically, we introduce IM-Net, a network trained to predict mixing weights for interpolation functions based on local image patterns and the target scale factor. To enhance efficiency of interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are employed to replace computationally expensive operations, enabling lightweight and fast inference on CPUs while preserving reconstruction quality. Experimental results on several benchmark datasets demonstrate that IM-LUT consistently achieves a superior balance between image quality and efficiency compared to existing methods, highlighting its potential as a promising solution for resource-constrained applications.
zh
[CV-177] Advanced U-Net Architectures with CNN Backbones for Automated Lung Cancer Detection and Segmentation in Chest CT Images
【速读】:该论文旨在解决胸部CT图像中肺癌的自动检测与分割问题,以满足临床环境中对精准诊断工具的迫切需求。其解决方案的关键在于将U-Net架构与多种卷积神经网络(CNN)骨干网络(如ResNet50、VGG16和Xception)相结合,并通过集成基于CNN的分类器及融合传统机器学习方法的混合模型进行性能优化。实验结果表明,该框架在分割和分类任务中均表现出色,显著优于现有方法。
链接: https://arxiv.org/abs/2507.09898
作者: Alireza Golkarieha,Kiana Kiashemshakib,Sajjad Rezvani Boroujenic,Nasibeh Asadi Isakand
机构: Oakland University(奥克兰大学); Bowling Green State University(鲍灵格林州立大学); University of Kentucky(肯塔基大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This manuscript has 20 pages and 10 figures. It is submitted to the Journal ‘Scientific Reports’
Abstract:This study investigates the effectiveness of U-Net architectures integrated with various convolutional neural network (CNN) backbones for automated lung cancer detection and segmentation in chest CT images, addressing the critical need for accurate diagnostic tools in clinical settings. A balanced dataset of 832 chest CT images (416 cancerous and 416 non-cancerous) was preprocessed using Contrast Limited Adaptive Histogram Equalization (CLAHE) and resized to 128x128 pixels. U-Net models were developed with three CNN backbones: ResNet50, VGG16, and Xception, to segment lung regions. After segmentation, CNN-based classifiers and hybrid models combining CNN feature extraction with traditional machine learning classifiers (Support Vector Machine, Random Forest, and Gradient Boosting) were evaluated using 5-fold cross-validation. Metrics included accuracy, precision, recall, F1-score, Dice coefficient, and ROC-AUC. U-Net with ResNet50 achieved the best performance for cancerous lungs (Dice: 0.9495, Accuracy: 0.9735), while U-Net with VGG16 performed best for non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). For classification, the CNN model using U-Net with Xception achieved 99.1 percent accuracy, 99.74 percent recall, and 99.42 percent F1-score. The hybrid CNN-SVM-Xception model achieved 96.7 percent accuracy and 97.88 percent F1-score. Compared to prior methods, our framework consistently outperformed existing models. In conclusion, combining U-Net with advanced CNN backbones provides a powerful method for both segmentation and classification of lung cancer in CT scans, supporting early diagnosis and clinical decision-making.
zh
[CV-178] Resolution Revolution: A Physics-Guided Deep Learning Framework for Spatiotemporal Temperature Reconstruction ICCV2025
【速读】:该论文试图解决地球观测中空间分辨率与时间分辨率之间的权衡问题,特别是在温度数据获取方面,现有技术在高时空分辨率数据的获取上存在显著限制。解决方案的关键在于提出一种物理引导的深度学习框架,该框架整合了高时间分辨率但低空间分辨率的地球系统模型数据与高空间分辨率但低时间分辨率的卫星观测数据。该框架采用包含年度温度周期信息的卷积神经网络,并引入线性项将粗粒度的地球系统模型输出放大为细粒度的卫星观测温度值,从而实现有效温度数据重建。
链接: https://arxiv.org/abs/2507.09872
作者: Shengjie Liu,Lu Zhang,Siqin Wang
机构: University of Southern California (南加州大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 Workshop SEA – International Conference on Computer Vision 2025 Workshop on Sustainability with Earth Observation and AI
Abstract:Central to Earth observation is the trade-off between spatial and temporal resolution. For temperature, this is especially critical because real-world applications require high spatiotemporal resolution data. Current technology allows for hourly temperature observations at 2 km, but only every 16 days at 100 m, a gap further exacerbated by cloud cover. Earth system models offer continuous hourly temperature data, but at a much coarser spatial resolution (9-31 km). Here, we present a physics-guided deep learning framework for temperature data reconstruction that integrates these two data sources. The proposed framework uses a convolutional neural network that incorporates the annual temperature cycle and includes a linear term to amplify the coarse Earth system model output into fine-scale temperature values observed from satellites. We evaluated this framework using data from two satellites, GOES-16 (2 km, hourly) and Landsat (100 m, every 16 days), and demonstrated effective temperature reconstruction with hold-out and in situ data across four datasets. This physics-guided deep learning framework opens new possibilities for generating high-resolution temperature data across spatial and temporal scales, under all weather conditions and globally.
zh
[CV-179] Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction ICML2025
【速读】:该论文试图解决音频生成中基于离散标记的模型在连续性建模上的局限性,以及如何在保持高效参数量的同时提升生成质量。其解决方案的关键在于采用因果语言模型(causal language model)直接建模连续值的下一个音频片段,并引入基于逐标记扩散的方法来捕捉连续分布;此外,还提出了一种新的掩码下一标记预测任务,以增强模型对音频序列的理解与生成能力。
链接: https://arxiv.org/abs/2507.09834
作者: Shu-wen Yang,Byeonggeun Kim,Kuan-Po Huang,Qingming Tang,Huy Phan,Bo-Ru Lu,Harsha Sundar,Shalini Ghosh,Hung-yi Lee,Chieh-Chi Kao,Chao Wang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: Accepted by ICML 2025. Project website: this https URL
Abstract:Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters – 193M for our Base and 462M for our Large models.
zh
[CV-180] AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs)
【速读】:该论文旨在解决儿童肺炎的准确诊断问题,特别是在资源有限的临床环境中提高诊断效率和准确性。其关键解决方案是基于卷积神经网络(CNN)构建一个儿科胸片肺炎分类系统,并通过数据增强技术和生成对抗网络(GANs)生成合成图像来应对数据量有限和类别不平衡的问题,从而提升模型性能。
链接: https://arxiv.org/abs/2507.09759
作者: Abdul Manaf,Nimra Mughal
机构: Sukkur IBA University (苏克尔IBA大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pneumonia is a leading cause of mortality in children under five, requiring accurate chest X-ray diagnosis. This study presents a machine learning-based Pediatric Chest Pneumonia Classification System to assist healthcare professionals in diagnosing pneumonia from chest X-ray images. The CNN-based model was trained on 5,863 labeled chest X-ray images from children aged 0-5 years from the Guangzhou Women and Children’s Medical Center. To address limited data, we applied augmentation techniques (rotation, zooming, shear, horizontal flipping) and employed GANs to generate synthetic images, addressing class imbalance. The system achieved optimal performance using combined original, augmented, and GAN-generated data, evaluated through accuracy and F1 score metrics. The final model was deployed via a Flask web application, enabling real-time classification with probability estimates. Results demonstrate the potential of deep learning and GANs in improving diagnostic accuracy and efficiency for pediatric pneumonia classification, particularly valuable in resource-limited clinical settings this https URL
zh
[CV-181] Pre-trained Under Noise: A Framework for Robust Bone Fracture Detection in Medical Imaging
【速读】:该论文试图解决医疗影像技术在全球范围内存在的健康差异问题,特别是通过深度学习模型在X-ray图像中对骨骨折分类的鲁棒性进行研究。其解决方案的关键在于建立一种方法论框架,利用迁移学习和受控噪声增强来评估人工智能模型在不同设备质量条件下的性能退化情况。通过在不同噪声水平下测试ResNet50、VGG16和EfficientNetv2等预训练模型,该研究旨在模拟现实世界中医疗影像技术人员所面临的挑战,并提供关于不同预训练深度学习驱动的计算机视觉模型在不同情境下稳健性和泛化能力的实际见解。
链接: https://arxiv.org/abs/2507.09731
作者: Robby Hoover,Nelly Elsayed,Zag ElSayed,Chengcheng Li
机构: University of Cincinnati (辛辛那提大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, under review
Abstract:Medical Imagings are considered one of the crucial diagnostic tools for different bones-related diseases, especially bones fractures. This paper investigates the robustness of pre-trained deep learning models for classifying bone fractures in X-ray images and seeks to address global healthcare disparity through the lens of technology. Three deep learning models have been tested under varying simulated equipment quality conditions. ResNet50, VGG16 and EfficientNetv2 are the three pre-trained architectures which are compared. These models were used to perform bone fracture classification as images were progressively degraded using noise. This paper specifically empirically studies how the noise can affect the bone fractures detection and how the pre-trained models performance can be changes due to the noise that affect the quality of the X-ray images. This paper aims to help replicate real world challenges experienced by medical imaging technicians across the world. Thus, this paper establishes a methodological framework for assessing AI model degradation using transfer learning and controlled noise augmentation. The findings provide practical insight into how robust and generalizable different pre-trained deep learning powered computer vision models can be when used in different contexts.
zh
[CV-182] I2I-PR: Deep Iterative Refinement for Phase Retrieval using Image-to-Image Diffusion Models
【速读】:该论文试图解决相位恢复问题,即从仅包含强度信息的测量中恢复信号,这一问题在成像、全息、光学计算、晶体学和显微等领域具有重要应用。传统相位恢复算法如经典迭代求解器在重建性能上对初始化和测量噪声敏感。本文提出的解决方案的关键在于引入一种基于图像到图像扩散框架的新型相位恢复方法,称为Inversion by Direct Iteration。该方法首先通过增强初始化阶段,结合混合输入输出与误差缩减方法并引入新颖的加速机制,获得稳健的粗略估计;随后利用学习到的图像到图像管道迭代优化该初始估计,从而显著提升了训练效率和重建质量。
链接: https://arxiv.org/abs/2507.09609
作者: Mehmet Onurcan Kaya,Figen S. Oktem
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Phase retrieval involves recovering a signal from intensity-only measurements, crucial in many fields such as imaging, holography, optical computing, crystallography, and microscopy. Although there are several well-known phase retrieval algorithms, including classical iterative solvers, the reconstruction performance often remains sensitive to initialization and measurement noise. Recently, image-to-image diffusion models have gained traction in various image reconstruction tasks, yielding significant theoretical insights and practical breakthroughs. In this work, we introduce a novel phase retrieval approach based on an image-to-image diffusion framework called Inversion by Direct Iteration. Our method begins with an enhanced initialization stage that leverages a hybrid iterative technique, combining the Hybrid Input-Output and Error Reduction methods and incorporating a novel acceleration mechanism to obtain a robust crude estimate. Then, it iteratively refines this initial crude estimate using the learned image-to-image pipeline. Our method achieves substantial improvements in both training efficiency and reconstruction quality. Furthermore, our approach utilizes aggregation techniques to refine quality metrics and demonstrates superior results compared to both classical and contemporary techniques. This highlights its potential for effective and efficient phase retrieval across various applications.
zh
[CV-183] prNet: Data-Driven Phase Retrieval via Stochastic Refinement
【速读】:该论文旨在解决相位恢复(phase retrieval)问题,其核心挑战在于从强度测量中重建高质量的图像,同时平衡失真与感知质量。传统方法通常侧重于像素级精度,而本文提出的解决方案关键在于利用Langevin动力学进行高效后验采样,通过结合随机采样、学习到的去噪机制和基于模型的更新,实现感知与失真之间的合理权衡。该框架包含多个复杂度递增的变体,融合了理论支撑的Langevin推断、自适应噪声调度学习、并行重建采样以及经典求解器的热启动初始化。
链接: https://arxiv.org/abs/2507.09608
作者: Mehmet Onurcan Kaya,Figen S. Oktem
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a novel framework for phase retrieval that leverages Langevin dynamics to enable efficient posterior sampling, yielding reconstructions that explicitly balance distortion and perceptual quality. Unlike conventional approaches that prioritize pixel-wise accuracy, our method navigates the perception-distortion tradeoff through a principled combination of stochastic sampling, learned denoising, and model-based updates. The framework comprises three variants of increasing complexity, integrating theoretically grounded Langevin inference, adaptive noise schedule learning, parallel reconstruction sampling, and warm-start initialization from classical solvers. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple benchmarks, both in terms of fidelity and perceptual quality.
zh
[CV-184] Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding
【速读】:该论文旨在解决神经行为分析中因缺乏大量标注数据而导致的模型性能受限问题。其解决方案的关键在于提出BEAST(BEhavioral Analysis via Self-supervised pretraining of Transformers),该框架通过结合掩码自编码与时间对比学习,利用未标注视频数据进行预训练,从而有效提升多种神经行为任务的性能,包括与神经活动相关的行为特征提取、单动物和多动物场景下的姿态估计与动作分割。
链接: https://arxiv.org/abs/2507.09513
作者: Yanchen Wang,Han Yu,Ari Blau,Yizi Zhang, TheInternational Brain Laboratory,Liam Paninski,Cole Hurwitz,Matt Whiteway
机构: Columbia University (哥伦比亚大学); The International Brain Laboratory (国际脑实验室)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The brain can only be fully understood through the lens of the behavior it generates – a guiding principle in modern neuroscience research that nevertheless presents significant technical challenges. Many studies capture behavior with cameras, but video analysis approaches typically rely on specialized models requiring extensive labeled data. We address this limitation with BEAST (BEhavioral Analysis via Self-supervised pretraining of Transformers), a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavior analyses. BEAST combines masked autoencoding with temporal contrastive learning to effectively leverage unlabeled video data. Through comprehensive evaluation across multiple species, we demonstrate improved performance in three critical neuro-behavioral tasks: extracting behavioral features that correlate with neural activity, and pose estimation and action segmentation in both the single- and multi-animal settings. Our method establishes a powerful and versatile backbone model that accelerates behavioral analysis in scenarios where labeled data remains scarce.
zh
[CV-185] PanoDiff-SR: Synthesizing Dental Panoramic Radiographs using Diffusion and Super-resolution
【速读】:该论文旨在解决医学图像数据集稀缺的问题,通过生成高质量、逼真的合成牙科全景放射图像(PRs)来补充公开数据集,并用于教育目的。其解决方案的关键在于结合基于扩散的生成方法(PanoDiff)与超分辨率(SR)技术,首先生成低分辨率(LR)的PR种子图像,再通过改进的Transformer架构实现超分辨率重建,从而得到高分辨率(HR)的PR图像,该方法在保持细节和纹理方面表现出色。
链接: https://arxiv.org/abs/2507.09227
作者: Sanyam Jain,Bruna Neves de Freitas,Andreas Basse-OConnor,Alexandros Iosifidis,Ruben Pauwels
机构: Aarhus University(奥胡斯大学); Tampere University(坦佩雷大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:There has been increasing interest in the generation of high-quality, realistic synthetic medical images in recent years. Such synthetic datasets can mitigate the scarcity of public datasets for artificial intelligence research, and can also be used for educational purposes. In this paper, we propose a combination of diffusion-based generation (PanoDiff) and Super-Resolution (SR) for generating synthetic dental panoramic radiographs (PRs). The former generates a low-resolution (LR) seed of a PR (256 X 128) which is then processed by the SR model to yield a high-resolution (HR) PR of size 1024 X 512. For SR, we propose a state-of-the-art transformer that learns local-global relationships, resulting in sharper edges and textures. Experimental results demonstrate a Frechet inception distance score of 40.69 between 7243 real and synthetic images (in HR). Inception scores were 2.55, 2.30, 2.90 and 2.98 for real HR, synthetic HR, real LR and synthetic LR images, respectively. Among a diverse group of six clinical experts, all evaluating a mixture of 100 synthetic and 100 real PRs in a time-limited observation, the average accuracy in distinguishing real from synthetic images was 68.5% (with 50% corresponding to random guessing).
zh
[CV-186] Automatic Contouring of Spinal Vertebrae on X-Ray using a Novel Sandwich U-Net Architecture
【速读】:该论文旨在解决脊柱椎体活动性疾病中椎体精确提取与轮廓划分的问题,传统方法依赖放射科医生或外科医生手动操作,存在劳动强度大、耗时长及易出错的缺点。论文提出的解决方案关键在于设计了一种改进的U-Net结构,采用“三明治”式U-Net架构并结合双激活函数,显著提升了胸椎在正位X光图像中的分割准确率,其Dice分数相比基线U-Net模型提高了4.1%。
链接: https://arxiv.org/abs/2507.09158
作者: Sunil Munthumoduku Krishna Murthy,Kumar Rajamani,Srividya Tirunellai Rajamani,Yupei Li,Qiyang Sun,Bjoern W. Schuller
机构: Marwadi University, India; University of Augsburg, Germany; GLAM – Group on Language, Audio, & Music, Imperial College London, UK; Munich Center for Machine Learning, Germany; MDSI – Munich Data Science Institute, Germany
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In spinal vertebral mobility disease, accurately extracting and contouring vertebrae is essential for assessing mobility impairments and monitoring variations during flexion-extension movements. Precise vertebral contouring plays a crucial role in surgical planning; however, this process is traditionally performed manually by radiologists or surgeons, making it labour-intensive, time-consuming, and prone to human error. In particular, mobility disease analysis requires the individual contouring of each vertebra, which is both tedious and susceptible to inconsistencies. Automated methods provide a more efficient alternative, enabling vertebra identification, segmentation, and contouring with greater accuracy and reduced time consumption. In this study, we propose a novel U-Net variation designed to accurately segment thoracic vertebrae from anteroposterior view on X-Ray images. Our proposed approach, incorporating a ``sandwich" U-Net structure with dual activation functions, achieves a 4.1% improvement in Dice score compared to the baseline U-Net model, enhancing segmentation accuracy while ensuring reliable vertebral contour extraction.
zh
[CV-187] CNeuroMod-THINGS a densely-sampled fMRI dataset for visual neuroscience
【速读】:该论文旨在解决神经网络人工智能(Neuro-AI)建模中对大规模、高质量神经影像数据的依赖问题,通过构建一个密集采样的大尺度功能性磁共振成像(fMRI)数据集来满足这一需求。其解决方案的关键在于整合两个现有项目——THINGS倡议与Courtois神经建模项目(CNeuroMod),利用THINGS提供的广泛标注图像和CNeuroMod获取的长时间段内受试者在控制与自然任务中的神经响应数据,从而构建出涵盖720个语义类别的高密度神经表征数据集。
链接: https://arxiv.org/abs/2507.09024
作者: Marie St-Laurent,Basile Pinsard,Oliver Contier,Elizabeth DuPre,Katja Seeliger,Valentina Borghesani,Julie A. Boyle,Lune Bellec,Martin N. Hebart
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages manuscript, 5 figures, 12 pages supplementary material
Abstract:Data-hungry neuro-AI modelling requires ever larger neuroimaging datasets. CNeuroMod-THINGS meets this need by capturing neural representations for a wide set of semantic concepts using well-characterized stimuli in a new densely-sampled, large-scale fMRI dataset. Importantly, CNeuroMod-THINGS exploits synergies between two existing projects: the THINGS initiative (THINGS) and the Courtois Project on Neural Modelling (CNeuroMod). THINGS has developed a common set of thoroughly annotated images broadly sampling natural and man-made objects which is used to acquire a growing collection of large-scale multimodal neural responses. Meanwhile, CNeuroMod is acquiring hundreds of hours of fMRI data from a core set of participants during controlled and naturalistic tasks, including visual tasks like movie watching and videogame playing. For CNeuroMod-THINGS, four CNeuroMod participants each completed 33-36 sessions of a continuous recognition paradigm using approximately 4000 images from the THINGS stimulus set spanning 720 categories. We report behavioural and neuroimaging metrics that showcase the quality of the data. By bridging together large existing resources, CNeuroMod-THINGS expands our capacity to model broad slices of the human visual experience.
zh
[CV-188] VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models
【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在处理图像时可能无意中暴露或处理私密视觉信息所带来的用户隐私问题。其解决方案的关键在于将隐私保护问题建模为一种对抗攻击问题,并提出一种新颖的攻击策略,该策略能够选择性地隐藏图像中指定的兴趣区域(Region Of Interests, ROIs)内的信息,从而防止VLMs访问敏感内容,同时保持图像其余部分的语义完整性。与传统对抗攻击通常破坏整个图像不同,该方法在未遮挡区域保持了高度的一致性。
链接: https://arxiv.org/abs/2507.08982
作者: Hanene F. Z. Brachemi Meftah,Wassim Hamidouche,Sid Ahmed Fezza,Olivier Déforges
机构: Région Bretagne (Brittany region), CREACH Labs, Direction Générale de l’Armement (DGA), Univ. Rennes, INSA Rennes, CNRS, IETR - UMR 6164, Technology Innovation Institute, National Higher School of Telecommunications and ICT
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Recent years have witnessed remarkable progress in developing Vision-Language Models (VLMs) capable of processing both textual and visual inputs. These models have demonstrated impressive performance, leading to their widespread adoption in various applications. However, this widespread raises serious concerns regarding user privacy, particularly when models inadvertently process or expose private visual information. In this work, we frame the preservation of privacy in VLMs as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Region Of Interests (ROIs) in an image, effectively preventing VLMs from accessing sensitive content while preserving the semantic integrity of the remaining image. Unlike conventional adversarial attacks that often disrupt the entire image, our method maintains high coherence in unmasked areas. Experimental results across three state-of-the-art VLMs namely LLaVA, Instruct-BLIP, and BLIP2-T5 demonstrate up to 98% reduction in detecting targeted ROIs, while maintaining global image semantics intact, as confirmed by high similarity scores between clean and adversarial outputs. We believe that this work contributes to a more privacy conscious use of multimodal models and offers a practical tool for further research, with the source code publicly available at: this https URL.
zh
[CV-189] Interpretable Artificial Intelligence for Detecting Acute Heart Failure on Acute Chest CT Scans
【速读】:该论文试图解决在急性心力衰竭(Acute Heart Failure, AHF)诊断中,由于放射科医生短缺导致的胸片CT影像解读延迟问题。其解决方案的关键是开发一个可解释的人工智能(Artificial Intelligence, AI)模型,通过分析胸部CT扫描中分割的心脏和肺部结构的测量值,实现对AHF的准确检测,且模型的性能与胸腔放射科医生相当。该模型采用Boosted Trees算法,并利用Shapley Additive explanations(SHAP)方法对预测结果进行解释,以提高诊断决策的透明度和可信度。
链接: https://arxiv.org/abs/2507.08952
作者: Silas Nyboe Ørting,Kristina Miger,Anne Sophie Overgaard Olesen,Mikael Ploug Boesen,Michael Brun Andersen,Jens Petersen,Olav W. Nielsen,Marleen de Bruijne
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 11 figures, Submitted to “Radiology AI”
Abstract:Introduction: Chest CT scans are increasingly used in dyspneic patients where acute heart failure (AHF) is a key differential diagnosis. Interpretation remains challenging and radiology reports are frequently delayed due to a radiologist shortage, although flagging such information for emergency physicians would have therapeutic implication. Artificial intelligence (AI) can be a complementary tool to enhance the diagnostic precision. We aim to develop an explainable AI model to detect radiological signs of AHF in chest CT with an accuracy comparable to thoracic radiologists. Methods: A single-center, retrospective study during 2016-2021 at Copenhagen University Hospital - Bispebjerg and Frederiksberg, Denmark. A Boosted Trees model was trained to predict AHF based on measurements of segmented cardiac and pulmonary structures from acute thoracic CT scans. Diagnostic labels for training and testing were extracted from radiology reports. Structures were segmented with TotalSegmentator. Shapley Additive explanations values were used to explain the impact of each measurement on the final prediction. Results: Of the 4,672 subjects, 49% were female. The final model incorporated twelve key features of AHF and achieved an area under the ROC of 0.87 on the independent test set. Expert radiologist review of model misclassifications found that 24 out of 64 (38%) false positives and 24 out of 61 (39%) false negatives were actually correct model predictions, with the errors originating from inaccuracies in the initial radiology reports. Conclusion: We developed an explainable AI model with strong discriminatory performance, comparable to thoracic radiologists. The AI model’s stepwise, transparent predictions may support decision-making. Comments: 34 pages, 11 figures, Submitted to “Radiology AI” Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.08952 [eess.IV] (or arXiv:2507.08952v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2507.08952 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Silas Ørting [view email] [v1] Fri, 11 Jul 2025 18:25:34 UTC (1,774 KB)
zh
[CV-190] Multi-omic Prognosis of Alzheimers Disease with Asymmetric Cross-Modal Cross-Attention Network
【速读】:该论文旨在解决多模态数据在阿尔茨海默病(Alzheimer’s Disease, AD)深度学习辅助诊断中难以有效融合的问题。传统卷积神经网络和简单特征拼接方法无法充分利用多模态数据之间的互补信息,且容易在模态融合过程中丢失关键信息。该论文提出的解决方案的关键在于引入一种非对称跨模态交叉注意力机制,该机制能够有效捕捉不同数据模态特征之间的交互关键信息,从而提升AD、轻度认知障碍(Mild Cognitive Impairment, MCI)和认知正常(Cognitively Normal, CN)的检测准确性,实验结果显示该算法在测试集上达到了94.88%的准确率。
链接: https://arxiv.org/abs/2507.08855
作者: Yang Ming,Jiang Shi Zhong,Zhou Su Juan
机构: Guangdong Pharmaceutical University (广东药科大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Alzheimer’s Disease (AD) is an irreversible neurodegenerative disease characterized by progressive cognitive decline as its main symptom. In the research field of deep learning-assisted diagnosis of AD, traditional convolutional neural networks and simple feature concatenation methods fail to effectively utilize the complementary information between multimodal data, and the simple feature concatenation approach is prone to cause the loss of key information during the process of modal fusion. In recent years, the development of deep learning technology has brought new possibilities for solving the problem of how to effectively fuse multimodal features. This paper proposes a novel deep learning algorithm framework to assist medical professionals in AD diagnosis. By fusing medical multi-view information such as brain fluorodeoxyglucose positron emission tomography (PET), magnetic resonance imaging (MRI), genetic data, and clinical data, it can accurately detect the presence of AD, Mild Cognitive Impairment (MCI), and Cognitively Normal (CN). The innovation of the algorithm lies in the use of an asymmetric cross-modal cross-attention mechanism, which can effectively capture the key information features of the interactions between different data modal features. This paper compares the asymmetric cross-modal cross-attention mechanism with the traditional algorithm frameworks of unimodal and multimodal deep learning models for AD diagnosis, and evaluates the importance of the asymmetric cross-modal cross-attention mechanism. The algorithm model achieves an accuracy of 94.88% on the test set.
zh
人工智能
[AI-0] Disentangling Neural Disjunctive Normal Form Models
【速读】:该论文试图解决神经析取范式(Neural Disjunctive Normal Form, DNF)模型在后训练符号翻译过程中因阈值处理导致的性能退化问题。其关键解决方案是提出一种新的解耦方法,通过将编码嵌套规则的节点拆分为更小的独立节点,从而更好地保留模型的性能。
链接: https://arxiv.org/abs/2507.10546
作者: Kexin Gu Baugh,Vincent Perreault,Matthew Baugh,Luke Dickens,Katsumi Inoue,Alessandra Russo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at NeSy 2025
Abstract:Neural Disjunctive Normal Form (DNF) based models are powerful and interpretable approaches to neuro-symbolic learning and have shown promising results in classification and reinforcement learning settings without prior knowledge of the tasks. However, their performance is degraded by the thresholding of the post-training symbolic translation process. We show here that part of the performance degradation during translation is due to its failure to disentangle the learned knowledge represented in the form of the networks’ weights. We address this issue by proposing a new disentanglement method; by splitting nodes that encode nested rules into smaller independent nodes, we are able to better preserve the models’ performance. Through experiments on binary, multiclass, and multilabel classification tasks (including those requiring predicate invention), we demonstrate that our disentanglement method provides compact and interpretable logical representations for the neural DNF-based models, with performance closer to that of their pre-translation counterparts. Our code is available at this https URL.
zh
[AI-1] WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling
【速读】:该论文试图解决AI在专业数字信号处理(DSP)工作流建模中的挑战,尤其是难以准确复制专业音频效果图中复杂的信号流向和参数交互问题。其解决方案的关键在于提出WildFX,这是一个基于Docker容器化的数据集生成管道,采用专业的数字音频工作站(DAW)后端,支持跨平台商业插件或任意插件的无缝集成,并能够实现结构复杂性(如侧链、分频器)和高效的并行处理。
链接: https://arxiv.org/abs/2507.10534
作者: Qihui Yang,Taylor Berg-Kirkpatrick,Julian McAuley,Zachary Novack
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite rapid progress in end-to-end AI music generation, AI-driven modeling of professional Digital Signal Processing (DSP) workflows remains challenging. In particular, while there is growing interest in neural black-box modeling of audio effect graphs (e.g. reverb, compression, equalization), AI-based approaches struggle to replicate the nuanced signal flow and parameter interactions used in professional workflows. Existing differentiable plugin approaches often diverge from real-world tools, exhibiting inferior performance relative to simplified neural controllers under equivalent computational constraints. We introduce WildFX, a pipeline containerized with Docker for generating multi-track audio mixing datasets with rich effect graphs, powered by a professional Digital Audio Workstation (DAW) backend. WildFX supports seamless integration of cross-platform commercial plugins or any plugins in the wild, in VST/VST3/LV2/CLAP formats, enabling structural complexity (e.g., sidechains, crossovers) and achieving efficient parallelized processing. A minimalist metadata interface simplifies project/plugin configuration. Experiments demonstrate the pipeline’s validity through blind estimation of mixing graphs, plugin/gain parameters, and its ability to bridge AI research with practical DSP demands. The code is available on: this https URL.
zh
[AI-2] Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
【速读】:该论文试图解决AI Video Chat中由于Multimodal Large Language Model (MLLM)推理延迟导致的实时通信瓶颈问题,尤其是在网络不稳定情况下传输延迟对AI交互自然性的影响。其解决方案的关键在于提出Artic框架,通过Context-Aware Video Streaming实现视频区域的重要性识别与带宽优化分配,以及通过Loss-Resilient Adaptive Frame Rate提升丢包场景下的帧率适应能力,从而在降低码率的同时保持MLLM的准确性。
链接: https://arxiv.org/abs/2507.10510
作者: Jiangkai Wu,Zhiyuan Ren,Liming Liu,Xinggong Zhang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注:
Abstract:AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty and instability, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we propose Artic, an AI-oriented Real-time Communication framework, exploring the network requirement shift from “humans watching video” to “AI understanding video”. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive Frame Rate that leverages previous frames to substitute for lost/delayed frames while avoiding bitrate waste. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat.
zh
[AI-3] Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop
【速读】:该论文试图解决生物领域中人工智能(Artificial Intelligence, AI)模型缺乏标准化、跨领域基准的问题,这限制了构建稳健、可信模型的能力。解决方案的关键在于识别数据异质性与噪声、可重复性挑战、偏见以及公开资源生态系统碎片化等主要技术与系统性瓶颈,并提出一套构建基准框架的建议,以高效比较不同任务和数据模态下的机器学习(Machine Learning, ML)模型。通过促进高质量数据整理、标准化工具、全面评估指标以及开放协作平台,旨在加速AI驱动虚拟细胞(Virtual Cells)的稳健基准发展。
链接: https://arxiv.org/abs/2507.10502
作者: Elizabeth Fahsbender,Alma Andersson,Jeremy Ash,Polina Binder,Daniel Burkhardt,Benjamin Chang,Georg K. Gerber,Anthony Gitter,Patrick Godau,Ankit Gupta,Genevieve Haliburton,Siyu He,Trey Ideker,Ivana Jelic,Aly Khan,Yang-Joon Kim,Aditi Krishnapriyan,Jon M. Laurent,Tianyu Liu 28,Emma Lundberg,Shalin B. Mehta,Rob Moccia,Angela Oliveira Pisco,Katherine S. Pollard,Suresh Ramani,Julio Saez-Rodriguez,Yasin Senbabaoglu,Elana Simon,Srinivasan Sivanandan,Gustavo Stolovitzky,Marc Valer,Bo Wang,Xikun Zhang,James Zou,Katrina Kalantar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence holds immense promise for transforming biology, yet a lack of standardized, cross domain, benchmarks undermines our ability to build robust, trustworthy models. Here, we present insights from a recent workshop that convened machine learning and computational biology experts across imaging, transcriptomics, proteomics, and genomics to tackle this gap. We identify major technical and systemic bottlenecks such as data heterogeneity and noise, reproducibility challenges, biases, and the fragmented ecosystem of publicly available resources and propose a set of recommendations for building benchmarking frameworks that can efficiently compare ML models of biological systems across tasks and data modalities. By promoting high quality data curation, standardized tooling, comprehensive evaluation metrics, and open, collaborative platforms, we aim to accelerate the development of robust benchmarks for AI driven Virtual Cells. These benchmarks are crucial for ensuring rigor, reproducibility, and biological relevance, and will ultimately advance the field toward integrated models that drive new discoveries, therapeutic insights, and a deeper understanding of cellular systems.
zh
[AI-4] An Empirical Evaluation of AI-Powered Non-Player Characters Perceived Realism and Performance in Virtual Reality Environments
【速读】:该论文试图解决虚拟现实(VR)中非玩家角色(NPCs)的 realism 和交互性不足的问题,旨在通过生成式 AI 提升 NPC 的表现力与用户沉浸感。解决方案的关键在于利用 GPT-4 Turbo 作为核心模型,驱动两个 AI 驱动的 NPC——嫌疑人和搭档,在 VR 审讯模拟器中与用户进行互动,从而增强场景的真实性和用户的参与度。
链接: https://arxiv.org/abs/2507.10469
作者: Mikko Korkiakoski,Saeid Sheikhi,Jesper Nyman,Jussi Saariniemi,Kalle Tapio,Panos Kostakos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Advancements in artificial intelligence (AI) have significantly enhanced the realism and interactivity of non-player characters (NPCs) in virtual reality (VR), creating more engaging and believable user experiences. This paper evaluates AI-driven NPCs within a VR interrogation simulator, focusing on their perceived realism, usability, and system performance. The simulator features two AI-powered NPCs, a suspect, and a partner, using GPT-4 Turbo to engage participants in a scenario to determine the suspect’s guilt or innocence. A user study with 18 participants assessed the system using the System Usability Scale (SUS), Game Experience Questionnaire (GEQ), and a Virtual Agent Believability Questionnaire, alongside latency measurements for speech-to-text (STT), text-to-speech (TTS), OpenAI GPT-4 Turbo, and overall (cycle) latency. Results showed an average cycle latency of 7 seconds, influenced by the increasing conversational context. Believability scored 6.67 out of 10, with high ratings in behavior, social relationships, and intelligence but moderate scores in emotion and personality. The system achieved a SUS score of 79.44, indicating good usability. These findings demonstrate the potential of large language models to improve NPC realism and interaction in VR while highlighting challenges in reducing system latency and enhancing emotional depth. This research contributes to the development of more sophisticated AI-driven NPCs, revealing the need for performance optimization to achieve increasingly immersive virtual experiences.
zh
[AI-5] AudioMAE: learning better masked audio representations with SwiGLU FFNs
【速读】:该论文旨在解决自监督音频表示学习中模型性能提升的问题,特别是在音频掩码自编码器(MAE)的架构设计上。其解决方案的关键在于引入了改进的Transformer结构,即带有门控线性单元的macaron风格Transformer块,从而提升了模型的表达能力和效率。通过在AudioSet数据集上预训练,所提出的AudioMAE++模型在多个下游任务中表现出色,展示了其在音频分类和语音相关基准测试中的优越性能。
链接: https://arxiv.org/abs/2507.10464
作者: Sarthak Yadav,Sergios Theodoridis,Zheng-Hua Tan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: TO APPEAR AT IEEE MLSP 2025
Abstract:Masked Autoencoders (MAEs) trained on audio spectrogram patches have emerged as a prominent approach for learning self-supervised audio representations. While several recent papers have evaluated key aspects of training MAEs on audio data, the majority of these approaches still leverage vanilla transformer building blocks, whereas the transformer community has seen steady integration of newer architectural advancements. In this work, we propose AudioMAE++, a revamped audio masked autoencoder with two such enhancements, namely macaron-style transformer blocks with gated linear units. When pretrained on the AudioSet dataset, the proposed AudioMAE++ models outperform existing MAE based approaches on 10 diverse downstream tasks, demonstrating excellent performance on audio classification and speech-based benchmarks. The proposed AudioMAE++ models also demonstrate excellent scaling characteristics, outperforming directly comparable standard MAE baselines with up to 4x more parameters.
zh
[AI-6] Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agent ic Systems
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)集成到企业系统中所带来的隐蔽安全漏洞问题,特别是在逻辑执行层和持久化内存上下文中存在的威胁。论文提出的解决方案的关键在于引入逻辑层提示控制注入(Logic-Layer Prompt Control Injection, LPCI),通过在内存、向量存储或工具输出中嵌入编码的、延迟的和条件触发的负载,绕过传统的输入过滤机制,并在不同会话中触发未经授权的行为。
链接: https://arxiv.org/abs/2507.10457
作者: Hammad Atta,Ken Huang,Manish Bhatt,Kamal Ahmed,Muhammad Aziz Ul Haq,Yasir Mehmood
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The integration of large language models (LLMs) into enterprise systems has created a new class of covert security vulnerabilities, particularly within logic-execution layers and persistent-memory contexts. In this paper, we introduce Logic-Layer Prompt Control Injection (LPCI), a novel attack category in which encoded, delayed, and conditionally triggered payloads are embedded in memory, vector stores, or tool outputs. These payloads can bypass conventional input filters and trigger unauthorised behaviour across sessions.
zh
[AI-7] Evaluating Fake Music Detection Performance Under Audio Augmentations
【速读】:该论文试图解决生成式音频模型使得人类创作与生成音乐难以区分的问题,特别是针对虚假音乐检测系统的鲁棒性进行研究。解决方案的关键在于构建包含真实音乐和多种系统生成的合成音乐的数据集,并通过应用各种音频增强技术来评估模型的泛化能力和检测性能。
链接: https://arxiv.org/abs/2507.10447
作者: Tomasz Sroka,Tomasz Wężowicz,Dominik Sidorczuk,Mateusz Modrzejewski
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: ISMIR 2025 LBD, 2 pages + bibliography, 1 figure
Abstract:With the rapid advancement of generative audio models, distinguishing between human-composed and generated music is becoming increasingly challenging. As a response, models for detecting fake music have been proposed. In this work, we explore the robustness of such systems under audio augmentations. To evaluate model generalization, we constructed a dataset consisting of both real and synthetic music generated using several systems. We then apply a range of audio transformations and analyze how they affect classification accuracy. We test the performance of a recent state-of-the-art musical deepfake detection model in the presence of audio augmentations. The performance of the model decreases significantly even with the introduction of light augmentations.
zh
[AI-8] Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures
【速读】:该论文试图解决在数据稀缺的领域中,如计算化学、计算免疫学和医学影像,难以训练大型预训练模型或基础模型的问题。其解决方案的关键在于设计高效的架构,以在缺乏大量数据的情况下获取先验知识。具体而言,研究展示了如何利用神经记忆在仅有少量样本的情况下适应非平稳分布,并通过超网络设计(一种生成其他网络的网络)在Model Agnostic Meta-Learning(MAML)训练下获得更具泛化能力的先验知识。此外,该方法被应用于3D场景生成和分割,以及分子生成任务,以提高在有限数据下的性能。
链接: https://arxiv.org/abs/2507.10446
作者: Sudarshan Babu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2310.17075
Abstract:The ability to transfer knowledge from prior experiences to novel tasks stands as a pivotal capability of intelligent agents, including both humans and computational models. This principle forms the basis of transfer learning, where large pre-trained neural networks are fine-tuned to adapt to downstream tasks. Transfer learning has demonstrated tremendous success, both in terms of task adaptation speed and performance. However there are several domains where, due to lack of data, training such large pre-trained models or foundational models is not a possibility - computational chemistry, computational immunology, and medical imaging are examples. To address these challenges, our work focuses on designing architectures to enable efficient acquisition of priors when large amounts of data are unavailable. In particular, we demonstrate that we can use neural memory to enable adaptation on non-stationary distributions with only a few samples. Then we demonstrate that our hypernetwork designs (a network that generates another network) can acquire more generalizable priors than standard networks when trained with Model Agnostic Meta-Learning (MAML). Subsequently, we apply hypernetworks to 3D scene generation, demonstrating that they can acquire priors efficiently on just a handful of training scenes, thereby leading to faster text-to-3D generation. We then extend our hypernetwork framework to perform 3D segmentation on novel scenes with limited data by efficiently transferring priors from earlier viewed scenes. Finally, we repurpose an existing molecular generative method as a pre-training framework that facilitates improved molecular property prediction, addressing critical challenges in computational immunology
zh
[AI-9] Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities ACL2025
【速读】:该论文试图解决当前最先进视觉-语言模型(Vision-Language Models, VLMs)在基础视觉任务中的局限性问题,旨在深入理解其在设计组件上的不足。解决方案的关键在于构建一系列测试,不仅评估VLM的最终性能,还通过将其与直接基于视觉编码器特征、中间视觉-语言投影以及大语言模型解码器输出训练的探测器进行对比,揭示VLM在视觉理解能力、鲁棒性及视觉信息处理机制方面的缺陷。
链接: https://arxiv.org/abs/2507.10442
作者: Shivam Chandhok,Wan-Cyuan Fan,Vered Shwartz,Vineeth N Balasubramanian,Leonid Sigal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025 (Main Conference)
Abstract:Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLM response, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder, intermediate vision-language projection and LLM-decoder output. In doing so, we uncover shortcomings in VLMs and make a number of important observations about their capabilities, robustness and how they process visual information. We hope our insights will guide progress in further improving VLMs.
zh
[AI-10] Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout KDD
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中由于边缘设备数据分布不均匀(non-IID)导致的模型精度下降以及由于边缘设备计算和通信能力有限引发的收敛速度慢的问题。其解决方案的关键在于提出FedDHAD框架,该框架包含两个创新方法:动态异构模型聚合(FedDH)和自适应丢弃(FedAD)。FedDH通过根据数据异构程度动态调整局部模型权重来应对统计异构性,而FedAD则通过神经元自适应操作提升精度并保持高效性。这两项技术的结合使FedDHAD在精度、效率和计算成本方面均优于现有方法。
链接: https://arxiv.org/abs/2507.10430
作者: Ji Liu,Beichen Ma,Yang Zhou,Jingbo Zhou,Ruoming Jin,Dejing Dou,Huaiyu Dai,Haixun Wang,Patrick Valduriez
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, to appear in ACM Transactions on Knowledge Discovery from Data (TKDD)
Abstract:Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. The data distributed among the edge devices is highly heterogeneous. Thus, FL faces the challenge of data distribution and heterogeneity, where non-Independent and Identically Distributed (non-IID) data across edge devices may yield in significant accuracy drop. Furthermore, the limited computation and communication capabilities of edge devices increase the likelihood of stragglers, thus leading to slow model convergence. In this paper, we propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). FedDH dynamically adjusts the weights of each local model within the model aggregation process based on the non-IID degree of heterogeneous data to deal with the statistical data heterogeneity. FedAD performs neuron-adaptive operations in response to heterogeneous devices to improve accuracy while achieving superb efficiency. The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and computation cost (up to 15.0% smaller).
zh
[AI-11] SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning
【速读】:该论文试图解决远程学习中的学生辍学问题,旨在通过早期检测来实现有效的干预和提升学生的坚持性。解决方案的关键在于整合多源数据,包括社会人口统计信息、行为数据以及使用BERT模型进行情感分析的学生评论,并将其与通过XGBoost分析的特征相结合,以提高辍学预测的准确性。
链接: https://arxiv.org/abs/2507.10421
作者: Meriem Zerkouk,Miloud Mihoubi,Belkacem Chikhaoui
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: International Conference on Education and New Learning Technologies (2025)
Abstract:School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance. Predicting student dropout using available educational data is a widely researched topic in learning analytics. Our partner’s distance learning platform highlights the importance of integrating diverse data sources, including socio-demographic data, behavioral data, and sentiment analysis, to accurately predict dropout risks. In this paper, we introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model with socio-demographic and behavioral data analyzed through Extreme Gradient Boosting (XGBoost). We fine-tuned BERT on student comments to capture nuanced sentiments, which were then merged with key features selected using feature importance techniques in XGBoost. Our model was tested on unseen data from the next academic year, achieving an accuracy of 84%, compared to 82% for the baseline model. Additionally, the model demonstrated superior performance in other metrics, such as precision and F1-score. The proposed method could be a vital tool in developing personalized strategies to reduce dropout rates and encourage student perseverance
zh
[AI-12] Energy Efficiency in AI for 5G and Beyond: A DeepRx Case Study
【速读】:该论文试图解决在人工智能/机器学习(AI/ML)模型中平衡能效与性能的问题。其解决方案的关键在于提出一种基于全卷积ResNet架构的深度学习接收器DeepRX,并通过知识蒸馏(Knowledge Distillation, KD)训练一个紧凑的DeepRX学生模型,以模仿教师模型的性能,同时降低能耗。研究验证了估计与实际能耗的一致性,并通过对比不同学生模型规模、教师模型规模及KD超参数,展示了知识蒸馏在实现节能型AI解决方案中的有效性。
链接: https://arxiv.org/abs/2507.10409
作者: Amine Lbath,Ibtissam Labriji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This study addresses the challenge of balancing energy efficiency with performance in AI/ML models, focusing on DeepRX, a deep learning receiver based on a fully convolutional ResNet architecture. We evaluate the energy consumption of DeepRX, considering factors including FLOPs/Watt and FLOPs/clock, and find consistency between estimated and actual energy usage, influenced by memory access patterns. The research extends to comparing energy dynamics during training and inference phases. A key contribution is the application of knowledge distillation (KD) to train a compact DeepRX \textitstudent model that emulates the performance of the \textitteacher model but with reduced energy consumption. We experiment with different student model sizes, optimal teacher sizes, and KD hyperparameters. Performance is measured by comparing the Bit Error Rate (BER) performance versus Signal-to-Interference \ Noise Ratio (SINR) values of the distilled model and a model trained from scratch. The distilled models demonstrate a lower error floor across SINR levels, highlighting the effectiveness of KD in achieving energy-efficient AI solutions.
zh
[AI-13] Instance space analysis of the capacitated vehicle routing problem
【速读】:该论文试图解决如何理解实例特征与元启发式算法(Metaheuristic, MH)性能之间复杂关系的问题。其解决方案的关键在于引入实例空间分析(Instance Space Analysis, ISA),通过结合DIMACS 12th Implementation Challenge的车辆路径问题数据集,识别出23个相关的实例特征,并利用PRELIM、SIFTED和PILOT阶段进行降维和机器学习方法处理,从而构建实例空间的二维投影,以揭示实例结构对MH行为的影响。此外,该研究提供了一个投影矩阵,便于将新实例纳入分析,并为CVRP领域提供了新的实例分析方法。
链接: https://arxiv.org/abs/2507.10397
作者: Alessandra M. M. M. Gouvêa,Nuno Paulos,Eduardo Uchoa e Mariá C. V. Nascimento
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper seeks to advance CVRP research by addressing the challenge of understanding the nuanced relationships between instance characteristics and metaheuristic (MH) performance. We present Instance Space Analysis (ISA) as a valuable tool that allows for a new perspective on the field. By combining the ISA methodology with a dataset from the DIMACS 12th Implementation Challenge on Vehicle Routing, our research enabled the identification of 23 relevant instance characteristics. Our use of the PRELIM, SIFTED, and PILOT stages, which employ dimensionality reduction and machine learning methods, allowed us to create a two-dimensional projection of the instance space to understand how the structure of instances affect the behavior of MHs. A key contribution of our work is that we provide a projection matrix, which makes it straightforward to incorporate new instances into this analysis and allows for a new method for instance analysis in the CVRP field.
zh
[AI-14] AT: Temporal-Aligned Transformer for Multi-Horizon Peak Demand Forecasting KDD2025
【速读】:该论文旨在解决多时间跨度的时间序列预测问题,特别是在高风险销售事件中准确预测需求峰值的挑战。此类预测对于电商和实体零售商的供应链管理至关重要,但传统方法在处理需求峰值时表现不佳。论文提出的解决方案是Temporal-Aligned Transformer (TAT),其关键在于引入了Temporal Alignment Attention (TAA)机制,通过利用已知的上下文变量(如节假日和促销活动信息)来学习与上下文相关的对齐关系,从而提升峰值需求预测的准确性。
链接: https://arxiv.org/abs/2507.10349
作者: Zhiyuan Zhao,Sitan Yang,Kin G. Olivares,Boris N. Oreshkin,Stan Vitebsky,Michael W. Mahoney,B. Aditya Prakash,Dmitry Efimov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 7 tables, published at KDD 2025 workshop on AI for Supply Chain: Today and Future
Abstract:Multi-horizon time series forecasting has many practical applications such as demand forecasting. Accurate demand prediction is critical to help make buying and inventory decisions for supply chain management of e-commerce and physical retailers, and such predictions are typically required for future horizons extending tens of weeks. This is especially challenging during high-stake sales events when demand peaks are particularly difficult to predict accurately. However, these events are important not only for managing supply chain operations but also for ensuring a seamless shopping experience for customers. To address this challenge, we propose Temporal-Aligned Transformer (TAT), a multi-horizon forecaster leveraging apriori-known context variables such as holiday and promotion events information for improving predictive performance. Our model consists of an encoder and decoder, both embedded with a novel Temporal Alignment Attention (TAA), designed to learn context-dependent alignment for peak demand forecasting. We conduct extensive empirical analysis on two large-scale proprietary datasets from a large e-commerce retailer. We demonstrate that TAT brings up to 30% accuracy improvement on peak demand forecasting while maintaining competitive overall performance compared to other state-of-the-art methods.
zh
[AI-15] Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
【速读】:该论文试图解决模型异构联邦学习(Hetero-FL)中由于客户端模型结构异构导致的知识偏差问题,以及简单结合Hetero-FL与集成蒸馏技术时可能出现的训练不稳定问题。其解决方案的关键在于提出一种基于特征蒸馏的稳定高效方法——FedFD,通过正交投影对齐不同客户端模型的特征表示,从而更好地整合异构模型的知识,并缓解因模型结构差异带来的知识偏差。
链接: https://arxiv.org/abs/2507.10348
作者: Yichen Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
zh
[AI-16] oolsuite for Implementing Multiagent Systems Based on Communication Protocols
【速读】:该论文旨在解决多智能体系统(Multiagent Systems)开发中的复杂交互建模与实现问题,其核心在于通过交互导向编程(Interaction-Oriented Programming, IOP)方法,以灵活的交互协议建模角色间的交互,并通过代理实现这些交互。解决方案的关键在于提供一套软件工具,支持对协议进行高效验证(如活性和安全性等属性),以及简化代理的实现过程,从而提升多智能体系统开发的效率与可靠性。
链接: https://arxiv.org/abs/2507.10324
作者: Amit K. Chopra,Samuel H. Christie V,Munindar P. Singh
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:
Abstract:Interaction-Oriented Programming (IOP) is an approach to building a multiagent system by modeling the interactions between its roles via a flexible interaction protocol and implementing agents to realize the interactions of the roles they play in the protocol. In recent years, we have developed an extensive suite of software that enables multiagent system developers to apply IOP. These include tools for efficiently verifying protocols for properties such as liveness and safety and middleware that simplifies the implementation of agents. This paper presents some of that software suite. Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE) ACMclasses: I.2.11; I.2.4; I.2.5 Cite as: arXiv:2507.10324 [cs.MA] (or arXiv:2507.10324v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2507.10324 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-17] Recognizing Dementia from Neuropsychological Tests with State Space Models
【速读】:该论文试图解决早期检测痴呆症的问题,以实现及时的医疗干预和改善患者预后。传统神经心理学测试依赖人工评分,而自动痴呆分类(ADC)系统旨在通过语音记录直接推断认知衰退。该研究提出的解决方案是Demenba,一种基于状态空间模型的新型ADC框架,其关键在于能够线性扩展内存和计算资源与序列长度,从而在使用更少参数的情况下,在细粒度痴呆分类任务中比先前方法提升了21%的性能。
链接: https://arxiv.org/abs/2507.10311
作者: Liming Wang,Saurabhchand Bhati,Cody Karjadi,Rhoda Au,James Glass
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Early detection of dementia is critical for timely medical intervention and improved patient outcomes. Neuropsychological tests are widely used for cognitive assessment but have traditionally relied on manual scoring. Automatic dementia classification (ADC) systems aim to infer cognitive decline directly from speech recordings of such tests. We propose Demenba, a novel ADC framework based on state space models, which scale linearly in memory and computation with sequence length. Trained on over 1,000 hours of cognitive assessments administered to Framingham Heart Study participants, some of whom were diagnosed with dementia through adjudicated review, our method outperforms prior approaches in fine-grained dementia classification by 21%, while using fewer parameters. We further analyze its scaling behavior and demonstrate that our model gains additional improvement when fused with large language models, paving the way for more transparent and scalable dementia assessment tools. Code: this https URL
zh
[AI-18] oward Real-World Table Agents : Capabilities Workflows and Design Principles for LLM -based Table Intelligence
【速读】:该论文试图解决现实世界中表格任务所面临的噪声、结构异质性和语义复杂性问题,这些问题在现有研究中尚未得到充分探索,而现有研究主要针对干净的学术数据集。解决方案的关键在于构建基于大语言模型(LLM)的Table Agents,通过整合预处理、推理和领域适应能力,实现以表格为中心的工作流自动化。论文定义了五个核心能力——C1: 表格结构理解,C2: 表格与查询语义理解,C3: 表格检索与压缩,C4: 可执行推理与可追溯性,C5: 跨领域泛化——用以分析和比较当前方法,并提出改进LLM-based Table Agents在实际应用中的鲁棒性、泛化能力和效率的可行建议。
链接: https://arxiv.org/abs/2507.10281
作者: Jiaming Tian,Liyao Li,Wentao Ye,Haobo Wang,Lingxin Wang,Lihua Yu,Zujie Ren,Gang Chen,Junbo Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Tables are fundamental in domains such as finance, healthcare, and public administration, yet real-world table tasks often involve noise, structural heterogeneity, and semantic complexity–issues underexplored in existing research that primarily targets clean academic datasets. This survey focuses on LLM-based Table Agents, which aim to automate table-centric workflows by integrating preprocessing, reasoning, and domain adaptation. We define five core competencies–C1: Table Structure Understanding, C2: Table and Query Semantic Understanding, C3: Table Retrieval and Compression, C4: Executable Reasoning with Traceability, and C5: Cross-Domain Generalization–to analyze and compare current approaches. In addition, a detailed examination of the Text-to-SQL Agent reveals a performance gap between academic benchmarks and real-world scenarios, especially for open-source models. Finally, we provide actionable insights to improve the robustness, generalization, and efficiency of LLM-based Table Agents in practical settings.
zh
[AI-19] Visual Analytics for Explainable and Trustworthy Artificial Intelligence
【速读】:该论文试图解决人工智能(AI)系统在医疗等关键领域应用时因缺乏透明性而导致的信任问题,即AI系统作为“黑箱”难以被专家理解和信赖。解决方案的关键在于利用视觉分析(Visual Analytics, VA)技术,通过将AI模型与交互式可视化相结合,使用户能够结合领域知识对模型进行优化和改进,从而增强对AI系统的信任并促进其有效应用。
链接: https://arxiv.org/abs/2507.10240
作者: Angelos Chatzimparmpas
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Our society increasingly depends on intelligent systems to solve complex problems, ranging from recommender systems suggesting the next movie to watch to AI models assisting in medical diagnoses for hospitalized patients. With the iterative improvement of diagnostic accuracy and efficiency, AI holds significant potential to mitigate medical misdiagnoses by preventing numerous deaths and reducing an economic burden of approximately 450 EUR billion annually. However, a key obstacle to AI adoption lies in the lack of transparency: many automated systems function as “black boxes,” providing predictions without revealing the underlying processes. This opacity can hinder experts’ ability to trust and rely on AI systems. Visual analytics (VA) provides a compelling solution by combining AI models with interactive visualizations. These specialized charts and graphs empower users to incorporate their domain expertise to refine and improve the models, bridging the gap between AI and human understanding. In this work, we define, categorize, and explore how VA solutions can foster trust across the stages of a typical AI pipeline. We propose a design space for innovative visualizations and present an overview of our previously developed VA dashboards, which support critical tasks within the various pipeline stages, including data processing, feature engineering, hyperparameter tuning, understanding, debugging, refining, and comparing models.
zh
[AI-20] Survey for Categorising Explainable AI Studies Using Data Analysis Task Frameworks
【速读】:该论文试图解决可解释人工智能(Explainable Artificial Intelligence, XAI)在数据分析任务中的研究存在大量矛盾且缺乏具体设计建议的问题,这些问题源于对需要AI辅助的任务理解不足。其解决方案的关键在于通过跨学科视角(包括可视化分析、认知科学和仪表板设计)提出一个基于“什么(what)、为什么(why)和谁(who)”三个维度的XAI研究分类与比较方法,以识别研究中的主要问题,如任务描述不充分、上下文无关的研究以及目标用户测试不足,并提出针对XAI任务设计与报告的指导原则,以提升XAI领域研究的可解析性和实用性。
链接: https://arxiv.org/abs/2507.10208
作者: Hamzah Ziadeh,Hendrik Knoche
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Research into explainable artificial intelligence (XAI) for data analysis tasks suffer from a large number of contradictions and lack of concrete design recommendations stemming from gaps in understanding the tasks that require AI assistance. In this paper, we drew on multiple fields such as visual analytics, cognition, and dashboard design to propose a method for categorising and comparing XAI studies under three dimensions: what, why, and who. We identified the main problems as: inadequate descriptions of tasks, context-free studies, and insufficient testing with target users. We propose that studies should specifically report on their users’ domain, AI, and data analysis expertise to illustrate the generalisability of their findings. We also propose study guidelines for designing and reporting XAI tasks to improve the XAI community’s ability to parse the rapidly growing field. We hope that our contribution can help researchers and designers better identify which studies are most relevant to their work, what gaps exist in the research, and how to handle contradictory results regarding XAI design.
zh
[AI-21] Breaking the Myth: Can Small Models Infer Postconditions Too?
【速读】:该论文试图解决手动编写形式化规格说明耗时且容易出错的问题,以及是否需要使用大型语言模型(Large Language Models, LLMs)来生成这些规格说明。论文的核心解决方案是通过在特定数据集上微调一个小型的7B参数代码模型,实现高质量的后置条件生成,从而在保持性能的同时显著降低计算成本。关键在于构建包含提示、推理日志和后置条件的专用数据集,并针对实际代码库依赖关系和前状态信息进行优化,使模型能够生成表达性强且准确的规格说明。
链接: https://arxiv.org/abs/2507.10182
作者: Gehao Zhang,Zhenting Wang,Juan Zhai
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Formal specifications are essential for ensuring software correctness, yet manually writing them is tedious and error-prone. Large Language Models (LLMs) have shown promise in generating such specifications from natural language intents, but the giant model size and high computational demands raise a fundamental question: Do we really need large models for this task? In this paper, we show that a small, fine-tuned language model can achieve high-quality postcondition generation with much lower computational costs. We construct a specialized dataset of prompts, reasoning logs, and postconditions, then supervise the fine-tuning of a 7 B-parameter code model. Our approach tackles real-world repository dependencies and preserves pre-state information, allowing for expressive and accurate specifications. We evaluate the model on a benchmark of real-world Java bugs (Defects4J) and compare against both proprietary giants (e.g., GPT-4o) and open-source large models. Empirical results demonstrate that our compact model matches or outperforms significantly larger counterparts in syntax correctness, semantic correctness, and bug-distinguishing capability. These findings highlight that targeted fine-tuning on a modest dataset can enable small models to achieve results formerly seen only in massive, resource-heavy LLMs, offering a practical and efficient path for the real-world adoption of automated specification generation.
zh
[AI-22] Should We Ever Prefer Decision Transformer for Offline Reinforcement Learning?
【速读】:该论文试图解决生成式 AI (Generative AI) 在离线强化学习(Offline Reinforcement Learning, Offline RL)中是否在所有场景下都优于传统方法的问题,特别是针对稀疏奖励环境下的性能表现。其解决方案的关键在于提出一种基于多层感知机(MLP)的过滤行为克隆(Filtered Behavior Cloning, FBC)方法,该方法通过从数据集中过滤掉低性能轨迹,然后对过滤后的数据进行常规行为克隆,从而在稀疏奖励环境中实现了与决策变压器(Decision Transformer, DT)相当或更优的性能,同时具备更低的数据需求和计算效率。
链接: https://arxiv.org/abs/2507.10174
作者: Yumi Omori,Zixuan Dong,Keith Ross
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025)
Abstract:In recent years, extensive work has explored the application of the Transformer architecture to reinforcement learning problems. Among these, Decision Transformer (DT) has gained particular attention in the context of offline reinforcement learning due to its ability to frame return-conditioned policy learning as a sequence modeling task. Most recently, Bhargava et al. (2024) provided a systematic comparison of DT with more conventional MLP-based offline RL algorithms, including Behavior Cloning (BC) and Conservative Q-Learning (CQL), and claimed that DT exhibits superior performance in sparse-reward and low-quality data settings. In this paper, through experimentation on robotic manipulation tasks (Robomimic) and locomotion benchmarks (D4RL), we show that MLP-based Filtered Behavior Cloning (FBC) achieves competitive or superior performance compared to DT in sparse-reward environments. FBC simply filters out low-performing trajectories from the dataset and then performs ordinary behavior cloning on the filtered dataset. FBC is not only very straightforward, but it also requires less training data and is computationally more efficient. The results therefore suggest that DT is not preferable for sparse-reward environments. From prior work, arguably, DT is also not preferable for dense-reward environments. Thus, we pose the question: Is DT ever preferable? Comments: Accepted by RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025) Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2507.10174 [cs.AI] (or arXiv:2507.10174v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.10174 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-23] Play Style Identification Using Low-Level Representations of Play Traces in MicroRTS
【速读】:该论文试图解决游戏玩法风格识别的问题,以提供有价值的游戏设计见解并实现自适应游戏体验,从而提升游戏AI代理的表现。传统方法依赖领域知识构建玩法轨迹表示,而近期方法虽然考虑了玩法轨迹的序列结构,但仍需一定程度的领域抽象。该研究的关键在于使用无监督的卷积神经网络-长短期记忆(CNN-LSTM)自动编码器模型,直接从MicroRTS中的低层次玩法轨迹数据中获得潜在表示,从而在潜在空间中实现不同游戏AI代理的有效区分,并减少对领域专业知识及其相关偏见的依赖。
链接: https://arxiv.org/abs/2507.10172
作者: Ruizhe Yu Xia,Jeremy Gow,Simon Lucas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as Short Paper for IEEE CoG
Abstract:Play style identification can provide valuable game design insights and enable adaptive experiences, with the potential to improve game playing agents. Previous work relies on domain knowledge to construct play trace representations using handcrafted features. More recent approaches incorporate the sequential structure of play traces but still require some level of domain abstraction. In this study, we explore the use of unsupervised CNN-LSTM autoencoder models to obtain latent representations directly from low-level play trace data in MicroRTS. We demonstrate that this approach yields a meaningful separation of different game playing agents in the latent space, reducing reliance on domain expertise and its associated biases. This latent space is then used to guide the exploration of diverse play styles within studied AI players.
zh
[AI-24] Introducing the Swiss Food Knowledge Graph: AI for Context-Aware Nutrition Recommendation
【速读】:该论文试图解决现有自动饮食评估系统在处理非视觉因素(如食谱特定的食材替代)和个体化饮食需求(如过敏、限制、文化实践和个人偏好)方面的不足,以及瑞士境内营养相关信息碎片化、缺乏集中整合的问题。解决方案的关键在于构建首个基于知识图谱的瑞士食品知识图谱(Swiss Food Knowledge Graph, SwissFKG),通过大语言模型(LLM)驱动的增强流程来填充该图谱,整合食谱、食材、替代品、营养数据、饮食限制、过敏原信息及国家营养指南,并通过Graph-RAG应用展示其在回答个性化营养查询中的潜力。
链接: https://arxiv.org/abs/2507.10156
作者: Lubnaa Abdur Rahman,Ioannis Papathanail,Stavroula Mougiakakou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 2 Figures, 7 tables
Abstract:AI has driven significant progress in the nutrition field, especially through multimedia-based automatic dietary assessment. However, existing automatic dietary assessment systems often overlook critical non-visual factors, such as recipe-specific ingredient substitutions that can significantly alter nutritional content, and rarely account for individual dietary needs, including allergies, restrictions, cultural practices, and personal preferences. In Switzerland, while food-related information is available, it remains fragmented, and no centralized repository currently integrates all relevant nutrition-related aspects within a Swiss context. To bridge this divide, we introduce the Swiss Food Knowledge Graph (SwissFKG), the first resource, to our best knowledge, to unite recipes, ingredients, and their substitutions with nutrient data, dietary restrictions, allergen information, and national nutrition guidelines under one graph. We establish a LLM-powered enrichment pipeline for populating the graph, whereby we further present the first benchmark of four off-the-shelf (70 B parameter) LLMs for food knowledge augmentation. Our results demonstrate that LLMs can effectively enrich the graph with relevant nutritional information. Our SwissFKG goes beyond recipe recommendations by offering ingredient-level information such as allergen and dietary restriction information, and guidance aligned with nutritional guidelines. Moreover, we implement a Graph-RAG application to showcase how the SwissFKG’s rich natural-language data structure can help LLM answer user-specific nutrition queries, and we evaluate LLM-embedding pairings by comparing user-query responses against predefined expected answers. As such, our work lays the foundation for the next generation of dietary assessment tools that blend visual, contextual, and cultural dimensions of eating.
zh
[AI-25] Adaptability in Multi-Agent Reinforcement Learning: A Framework and Unified Review
【速读】:该论文试图解决多智能体强化学习(MARL)在真实世界多智能体系统(MAS)中部署受限的问题,主要原因是现实环境的复杂性和动态性。解决方案的关键在于引入“适应性”(adaptability)这一统一且实用的评估视角,通过学习适应性、策略适应性和场景驱动适应性三个核心维度,对MARL算法在动态环境中的可靠性进行系统评估,从而支持更严谨的MARL性能分析,推动其在动态现实场景中的应用。
链接: https://arxiv.org/abs/2507.10142
作者: Siyi Hu,Mohamad A Hady,Jianglin Qiao,Jimmy Cao,Mahardhika Pratama,Ryszard Kowalczyk
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-Agent Reinforcement Learning (MARL) has shown clear effectiveness in coordinating multiple agents across simulated benchmarks and constrained scenarios. However, its deployment in real-world multi-agent systems (MAS) remains limited, primarily due to the complex and dynamic nature of such environments. These challenges arise from multiple interacting sources of variability, including fluctuating agent populations, evolving task goals, and inconsistent execution conditions. Together, these factors demand that MARL algorithms remain effective under continuously changing system configurations and operational demands. To better capture and assess this capacity for adjustment, we introduce the concept of \textitadaptability as a unified and practically grounded lens through which to evaluate the reliability of MARL algorithms under shifting conditions, broadly referring to any changes in the environment dynamics that may occur during learning or execution. Centred on the notion of adaptability, we propose a structured framework comprising three key dimensions: learning adaptability, policy adaptability, and scenario-driven adaptability. By adopting this adaptability perspective, we aim to support more principled assessments of MARL performance beyond narrowly defined benchmarks. Ultimately, this survey contributes to the development of algorithms that are better suited for deployment in dynamic, real-world multi-agent systems.
zh
[AI-26] FRSICL: LLM -Enabled In-Context Learning Flight Resource Allocation for Fresh Data Collection in UAV-Assisted Wildfire Monitoring
【速读】:该论文旨在解决无人机辅助野火监测系统中传感器传输调度与飞行速度联合优化的问题,以最小化由过时传感器数据引起的信息年龄(Age of Information, AoI)。传统方法如深度强化学习(Deep Reinforcement Learning, DRL)在采样效率、仿真到现实的差距以及复杂训练方面存在局限,难以满足时间敏感的应用需求。论文提出的解决方案是基于大语言模型(Large Language Model, LLM)的上下文学习的在线飞行资源分配方案(Flight Resource Allocation scheme based on LLM-Enabled In-Context Learning, FRSICL),其关键在于利用自然语言任务描述和环境反馈生成数据采集计划并控制飞行速度,从而实现实时动态决策而无需大量重新训练。
链接: https://arxiv.org/abs/2507.10134
作者: Yousef Emami,Hao Zhou,Miguel Gutierrez Gaitan,Kai Li,Luis Almeida
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures
Abstract:Unmanned Aerial Vehicles (UAVs) are vital for public safety, particularly in wildfire monitoring, where early detection minimizes environmental impact. In UAV-Assisted Wildfire Monitoring (UAWM) systems, joint optimization of sensor transmission scheduling and velocity is critical for minimizing Age of Information (AoI) from stale sensor data. Deep Reinforcement Learning (DRL) has been used for such optimization; however, its limitations such as low sampling efficiency, simulation-to-reality gaps, and complex training render it unsuitable for time-critical applications like wildfire monitoring. This paper introduces a new online Flight Resource Allocation scheme based on LLM-Enabled In-Context Learning (FRSICL) to jointly optimize the UAV’s flight control and data collection schedule along the trajectory in real time, thereby asymptotically minimizing the average AoI across ground sensors. In contrast to DRL, FRSICL generates data collection schedules and controls velocity using natural language task descriptions and feedback from the environment, enabling dynamic decision-making without extensive retraining. Simulation results confirm the effectiveness of the proposed FRSICL compared to Proximal Policy Optimization (PPO) and Nearest-Neighbor baselines.
zh
[AI-27] Extending Defeasibility for Propositional Standpoint Logics
【速读】:该论文试图解决在命题立场逻辑中引入可废止性(defeasibility)的问题,以更灵活地表达默认推理和非单调逻辑。解决方案的关键在于整合Kraus等人的可废止条件句、Britz和Varzinczak的可废止必要性和独特可能性概念,以及Leisegang等人的可废止性方法,从而扩展Gómez Álvarez和Rudolph的立场逻辑框架。这一综合方法使得能够在蕴含、立场模态算子和立场精炼陈述层面表达可废止性,并通过优先语义和表列演算实现形式化验证。
链接: https://arxiv.org/abs/2507.10133
作者: Nicholas Leisegang,Thomas Meyer,Ivan Varzinczak
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we introduce a new defeasible version of propositional standpoint logic by integrating Kraus et al.'s defeasible conditionals, Britz and Varzinczak’s notions of defeasible necessity and distinct possibility, along with Leisegang et al.'s approach to defeasibility into the standpoint logics of Gómez Álvarez and Rudolph. The resulting logical framework allows for the expression of defeasibility on the level of implications, standpoint modal operators, and standpoint-sharpening statements. We provide a preferential semantics for this extended language and propose a tableaux calculus, which is shown to be sound and complete with respect to preferential entailment. We also establish the computational complexity of the tableaux procedure to be in PSpace.
zh
[AI-28] Wavelet-Enhanced Neural ODE and Graph Attention for Interpretable Energy Forecasting
【速读】:该论文旨在解决能源需求与供应预测中的挑战,特别是由于可再生能源的波动性和消费模式的动态性所带来的问题。其解决方案的关键在于提出一种集成连续时间神经微分方程(Neural ODEs)、图注意力机制、多分辨率小波变换和自适应频率学习的神经框架,以有效捕捉时间序列中的多尺度时序动态特性。该模型通过鲁棒的ODE求解器、基于图的注意力机制和残差连接,增强了对结构和时间模式的理解,并通过小波特征提取和自适应频率调制提升了预测性能。
链接: https://arxiv.org/abs/2507.10132
作者: Usman Gani Joy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Accurate forecasting of energy demand and supply is critical for optimizing sustainable energy systems, yet it is challenged by the variability of renewable sources and dynamic consumption patterns. This paper introduces a neural framework that integrates continuous-time Neural Ordinary Differential Equations (Neural ODEs), graph attention, multi-resolution wavelet transformations, and adaptive learning of frequencies to address the issues of time series prediction. The model employs a robust ODE solver, using the Runge-Kutta method, paired with graph-based attention and residual connections to better understand both structural and temporal patterns. Through wavelet-based feature extraction and adaptive frequency modulation, it adeptly captures and models diverse, multi-scale temporal dynamics. When evaluated across seven diverse datasets: ETTh1, ETTh2, ETTm1, ETTm2 (electricity transformer temperature), and Waste, Solar, and Hydro (renewable energy), this architecture consistently outperforms state-of-the-art baselines in various forecasting metrics, proving its robustness in capturing complex temporal dependencies. Furthermore, the model enhances interpretability through SHAP analysis, making it suitable for sustainable energy applications.
zh
[AI-29] Could you be wrong: Debiasing LLM s using a metacognitive prompt for improving human decision making
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)中存在偏见的问题,特别是如何通过有效策略减少这些偏见。其解决方案的关键在于借鉴人类决策过程中的元认知提示(metacognitive prompts),尤其是“could you be wrong?”这一提示,该提示能够促使LLMs在生成回答后主动反思自身的回答,揭示潜在的错误、偏见、矛盾证据及替代观点,从而实现更全面和客观的回应。这种方法利用了人类心理学中已验证的有效提示机制,为LLM的提示工程提供了新的方向。
链接: https://arxiv.org/abs/2507.10124
作者: Thomas T. Hills
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
Abstract:Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current models. Strategies developed for debiasing human decision making offer one promising approach as they incorporate an LLM-style prompt intervention designed to bring latent knowledge into awareness during decision making. LLMs trained on vast amounts of information contain information about potential biases, counter-arguments, and contradictory evidence, but that information may only be brought to bear if prompted. Metacognitive prompts developed in the human decision making literature are designed to achieve this, and as I demonstrate here, they show promise with LLMs. The prompt I focus on here is “could you be wrong?” Following an LLM response, this prompt leads LLMs to produce additional information, including why they answered as they did, errors, biases, contradictory evidence, and alternatives, none of which were apparent in their initial response. Indeed, this metaknowledge often reveals that how LLMs and users interpret prompts are not aligned. Here I demonstrate this prompt using a set of questions taken from recent articles about LLM biases, including implicit discriminatory biases and failures of metacognition. “Could you be wrong” prompts the LLM to identify its own biases and produce cogent metacognitive reflection. I also present another example involving convincing but incomplete information, which is readily corrected by the metacognitive prompt. In sum, this work argues that human psychology offers a new avenue for prompt engineering, leveraging a long history of effective prompt-based improvements to human decision making.
zh
[AI-30] A Variance-Reduced Cubic-Regularized Newton for Policy Optimization
【速读】:该论文旨在解决强化学习中策略优化的次序方法所面临的样本复杂度不足或依赖不现实的重要采样假设的问题。其解决方案的关键在于提出一种方差缩减的二阶策略牛顿算法(VR-CR-PN),该算法首次将Hessian辅助的方差缩减与二阶策略优化相结合,有效缓解了分布偏移问题,并在一般非凸条件下实现了最优已知的样本复杂度,而无需依赖重要采样。此外,该算法引入了一种新的期望回报函数的Hessian估计器,其具有与时间步长无关的统一上界,从而实现了与时间步长无关的样本复杂度。
链接: https://arxiv.org/abs/2507.10120
作者: Cheng Sun,Zhen Zhang,Shaofu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure
Abstract:In this paper, we study a second-order approach to policy optimization in reinforcement learning. Existing second-order methods often suffer from suboptimal sample complexity or rely on unrealistic assumptions about importance sampling. To overcome these limitations, we propose VR-CR-PN, a variance-reduced cubic-regularized policy Newton algorithm. To the best of our knowledge, this is the first algorithm that integrates Hessian-aided variance reduction with second-order policy optimization, effectively addressing the distribution shift problem and achieving best-known sample complexity under general nonconvex conditions but without the need for importance sampling. We theoretically establish that VR-CR-PN achieves a sample complexity of \tilde\mathcalO(\epsilon^-3) to reach an \epsilon -second-order stationary point, significantly improving upon the previous best result of \tilde\mathcalO(\epsilon^-3.5) under comparable assumptions. As an additional contribution, we introduce a novel Hessian estimator for the expected return function, which admits a uniform upper bound independent of the horizon length H , allowing the algorithm to achieve horizon-independent sample complexity.
zh
[AI-31] Analysis of AI Techniques for Orchestrating Edge-Cloud Application Migration
【速读】:该论文试图解决在边缘-云系统中实现高服务质量(QoS)和成本效益的服务交付问题,特别是针对可以建模为汉诺塔(Towers of Hanoi, ToH)问题的边缘-云应用迁移问题。解决方案的关键在于通过马尔可夫决策过程(Markov Decision Process, MDP)分析、比较并评估当前最先进的人工智能(Artificial Intelligence, AI)规划与强化学习(Reinforcement Learning, RL)方法,以理解其在新兴计算连续体环境中的应用迁移编排能力。
链接: https://arxiv.org/abs/2507.10119
作者: Sadig Gojayev,Ahmad Anaqreh,Carolina Fortuna
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Application migration in edge-cloud system enables high QoS and cost effective service delivery. However, automatically orchestrating such migration is typically solved with heuristic approaches. Starting from the Markov Decision Process (MDP), in this paper, we identify, analyze and compare selected state-of-the-art Artificial Intelligence (AI) planning and Reinforcement Learning (RL) approaches for solving the class of edge-cloud application migration problems that can be modeled as Towers of Hanoi (ToH) problems. We introduce a new classification based on state space definition and analyze the compared models also through this lense. The aim is to understand available techniques capable of orchestrating such application migration in emerging computing continuum environments.
zh
[AI-32] BlueGlass: A Framework for Composite AI Safety ICML2025
【速读】:该论文试图解决当前AI系统安全性工具分散、无法提供全面保障的问题,旨在通过集成和组合多种安全工具来提升AI系统的可靠性。解决方案的关键在于提出BlueGlass框架,该框架提供统一的基础设施,支持跨模型内部和输出的多样化安全工具的集成与组合,从而实现更全面的AI安全分析与保障。
链接: https://arxiv.org/abs/2507.10106
作者: Harshal Nandigramwar,Syed Qutub,Kay-Ulrich Scholl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2025 [Actionable Interpretability Workshop]
Abstract:As AI systems become increasingly capable and ubiquitous, ensuring the safety of these systems is critical. However, existing safety tools often target different aspects of model safety and cannot provide full assurance in isolation, highlighting a need for integrated and composite methodologies. This paper introduces BlueGlass, a framework designed to facilitate composite AI safety workflows by providing a unified infrastructure enabling the integration and composition of diverse safety tools that operate across model internals and outputs. Furthermore, to demonstrate the utility of this framework, we present three safety-oriented analyses on vision-language models for the task of object detection: (1) distributional evaluation, revealing performance trade-offs and potential failure modes across distributions; (2) probe-based analysis of layer dynamics highlighting shared hierarchical learning via phase transition; and (3) sparse autoencoders identifying interpretable concepts. More broadly, this work contributes foundational infrastructure and findings for building more robust and reliable AI systems.
zh
[AI-33] On Gradual Semantics for Assumption-Based Argumentation
【速读】:该论文试图解决在假设型论证框架(Assumption-Based Argumentation, ABA)中缺乏渐进语义(gradual semantics)的问题,尽管ABA作为一种结构化论证形式已被广泛应用,且渐进语义在其他类型的论证框架中已有研究。解决方案的关键在于利用基于双极集合的论证框架对ABA框架进行抽象,并将现有针对定量双极论证框架(Quantitative Bipolar Argumentation Frameworks, QBAFs)的模块化渐进语义进行泛化,从而为ABA中的假设赋予辩证强度。
链接: https://arxiv.org/abs/2507.10076
作者: Anna Rapberger,Fabrizio Russo,Antonio Rago,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In computational argumentation, gradual semantics are fine-grained alternatives to extension-based and labelling-based semantics . They ascribe a dialectical strength to (components of) arguments sanctioning their degree of acceptability. Several gradual semantics have been studied for abstract, bipolar and quantitative bipolar argumentation frameworks (QBAFs), as well as, to a lesser extent, for some forms of structured argumentation. However, this has not been the case for assumption-based argumentation (ABA), despite it being a popular form of structured argumentation with several applications where gradual semantics could be useful. In this paper, we fill this gap and propose a family of novel gradual semantics for equipping assumptions, which are the core components in ABA frameworks, with dialectical strengths. To do so, we use bipolar set-based argumentation frameworks as an abstraction of (potentially non-flat) ABA frameworks and generalise state-of-the-art modular gradual semantics for QBAFs. We show that our gradual ABA semantics satisfy suitable adaptations of desirable properties of gradual QBAF semantics, such as balance and monotonicity. We also explore an argument-based approach that leverages established QBAF modular semantics directly, and use it as baseline. Finally, we conduct experiments with synthetic ABA frameworks to compare our gradual ABA semantics with its argument-based counterpart and assess convergence.
zh
[AI-34] GLD: A Trust-Aware Game-Theoretic Lane-Changing Decision Framework for Automated Vehicles in Heterogeneous Traffic ITSC
【速读】:该论文试图解决自动驾驶车辆(AVs)在混合交通环境中与人类驾驶车辆(HVs)有效协作的问题,尤其是现有变道框架忽视了HVs动态信任水平,导致无法准确预测人类驾驶员行为。解决方案的关键在于提出一种基于信任的博弈论变道决策(TGLD)框架,通过构建多车辆联盟博弈模型,结合AVs的完全合作与HVs的部分合作行为,并引入在线信任评估方法以动态估计HVs的信任水平,从而指导AVs选择合适的协作操作。此外,通过最小化对周围车辆的干扰并提高AV行为的可预测性,实现人机友好的变道策略。
链接: https://arxiv.org/abs/2507.10075
作者: Jie Pan,Tianyi Wang,Yangyang Wang,Junfeng Jiao,Christian Claudel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 6 pages, 7 figures, accepted for IEEE International Conference on Intelligent Transportation Systems (ITSC) 2025
Abstract:Automated vehicles (AVs) face a critical need to adopt socially compatible behaviors and cooperate effectively with human-driven vehicles (HVs) in heterogeneous traffic environment. However, most existing lane-changing frameworks overlook HVs’ dynamic trust levels, limiting their ability to accurately predict human driver behaviors. To address this gap, this study proposes a trust-aware game-theoretic lane-changing decision (TGLD) framework. First, we formulate a multi-vehicle coalition game, incorporating fully cooperative interactions among AVs and partially cooperative behaviors from HVs informed by real-time trust evaluations. Second, we develop an online trust evaluation method to dynamically estimate HVs’ trust levels during lane-changing interactions, guiding AVs to select context-appropriate cooperative maneuvers. Lastly, social compatibility objectives are considered by minimizing disruption to surrounding vehicles and enhancing the predictability of AV behaviors, thereby ensuring human-friendly and context-adaptive lane-changing strategies. A human-in-the-loop experiment conducted in a highway on-ramp merging scenario validates our TGLD approach. Results show that AVs can effectively adjust strategies according to different HVs’ trust levels and driving styles. Moreover, incorporating a trust mechanism significantly improves lane-changing efficiency, maintains safety, and contributes to transparent and adaptive AV-HV interactions.
zh
[AI-35] Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning
【速读】:该论文试图解决Chain of Thought (CoT)推理在中间步骤中误差累积导致可靠性下降的问题。解决方案的关键在于利用模型内在的真实性编码,通过特定注意力头激活来反映CoT推理步骤的真实性,并训练一个置信度预测器来评估每一步推理的正确性,进而通过束搜索动态选择最可能的推理路径。
链接: https://arxiv.org/abs/2507.10007
作者: Zijun Chen,Wenbo Hu,Richang Hong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities in both large language models (LLMs) and multimodal large language models (MLLMs). However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces an novel approach to calibrate the CoT reasoning accuracy by leveraging the model’s intrinsic veracity encoding. We discover that specific attention head activations reliably reflect the truthfulness of reasoning steps in CoT. Based on this insight, we train a confidence predictor to evaluate the correctness of each reasoning step using these truthfulness-sensitive activations, dynamically selecting the most plausible reasoning path via beam search. Experimental results demonstrate that our method significantly outperforms the state-of-the-art baselines (e.g., Few-Shot CoT, Self-Consistency, and Self-Evaluation Guided Beam Search) across the mathematical, symbolic, and commonsense reasoning tasks, exhibiting superior accuracy and reliability in both unimodal and multimodal settings. We further validate the approach on large reasoning models, confirming its applicability to specialized reasoning models. Additionally, we explore the role of the model’s self-correction ability in CoT reasoning. This work provides a novel reliability improvement path for CoT reasoning with broad application potential.
zh
[AI-36] Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix NEURIPS2025
【速读】:该论文试图解决在联邦学习场景下,对大型语言模型(Large Language Models, LLMs)进行低秩适配(LoRA)时面临的隐私泄露问题。传统方法在传输本地适配器时存在严重的隐私风险,而引入差分隐私(Differential Privacy, DP)则会导致模型噪声增加或微调可学习性下降。解决方案的关键在于提出FedASK框架,其核心思想是基于随机SVD的双阶段压缩管道,通过精心设计的隐私保护本地更新聚合与全局矩阵重建,实现对两个低秩适配器的有效更新,同时保证强大的差分隐私保障和精确的聚合特性。
链接: https://arxiv.org/abs/2507.09990
作者: Ming Wen,Jiaqi Zhu,Yuedong Xu,Yipeng Zhou,Dingding Han
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 23 pages, NeurIPS 2025 under review
Abstract:Large language models (LLMs) typically require fine-tuning for domain-specific tasks, and LoRA offers a computationally efficient approach by training low-rank adapters. LoRA is also communication-efficient for federated LLMs when multiple users collaboratively fine-tune a global LLM model without sharing their proprietary raw data. However, even the transmission of local adapters between a server and clients risks serious privacy leakage. Applying differential privacy (DP) to federated LoRA encounters a dilemma: adding noise to both adapters amplifies synthetic noise on the model, while fixing one adapter impairs the learnability of fine-tuning. In this paper, we propose FedASK (Differentially Private Federated Low Rank Adaptation with Double Sketching) , a novel federated LoRA framework to enable effective updating of both low-rank adapters with robust differential privacy. Inspired by randomized SVD, our key idea is a two-stage sketching pipeline. This pipeline first aggregates carefully sketched, privacy-preserving local updates, and then reconstructs the global matrices on the server to facilitate effective updating of both adapters. We theoretically prove FedASK’s differential privacy guarantee and its exact aggregation property. Comprehensive experiments demonstrate that FedASK consistently outperforms baseline methods across a variety of privacy settings and data distributions.
zh
[AI-37] Improving monotonic optimization in heterogeneous multi-agent reinforcement learning with optimal marginal deterministic policy gradient
【速读】:该论文旨在解决异构多智能体强化学习(Heterogeneous Multi-Agent Reinforcement Learning, MARL)中单调改进与部分参数共享(Partial Parameter-sharing, ParPS)之间的冲突问题。其关键解决方案是提出最优边际确定性策略梯度(Optimal Marginal Deterministic Policy Gradient, OMDPG)算法,通过引入最优边际Q函数(Optimal Marginal Q, OMQ)替代顺序计算的Q值,以维持单调改进并消除冲突;同时采用广义Q评论器(Generalized Q Critic, GQC)优化不同Q值估计,并设计集中式评论器分组执行者(Centralized Critic Grouped Actor, CCGA)架构,实现局部策略网络中的ParPS与全局Q函数计算的协同。
链接: https://arxiv.org/abs/2507.09989
作者: Xiaoyang Yu,Youfang Lin,Shuo Wang,Sheng Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:In heterogeneous multi-agent reinforcement learning (MARL), achieving monotonic improvement plays a pivotal role in enhancing performance. The HAPPO algorithm proposes a feasible solution by introducing a sequential update scheme, which requires independent learning with No Parameter-sharing (NoPS). However, heterogeneous MARL generally requires Partial Parameter-sharing (ParPS) based on agent grouping to achieve high cooperative performance. Our experiments prove that directly combining ParPS with the sequential update scheme leads to the policy updating baseline drift problem, thereby failing to achieve improvement. To solve the conflict between monotonic improvement and ParPS, we propose the Optimal Marginal Deterministic Policy Gradient (OMDPG) algorithm. First, we replace the sequentially computed Q_\psi^s(s,a_1:i) with the Optimal Marginal Q (OMQ) function \phi_\psi^*(s,a_1:i) derived from Q-functions. This maintains MAAD’s monotonic improvement while eliminating the conflict through optimal joint action sequences instead of sequential policy ratio calculations. Second, we introduce the Generalized Q Critic (GQC) as the critic function, employing pessimistic uncertainty-constrained loss to optimize different Q-value estimations. This provides the required Q-values for OMQ computation and stable baselines for actor updates. Finally, we implement a Centralized Critic Grouped Actor (CCGA) architecture that simultaneously achieves ParPS in local policy networks and accurate global Q-function computation. Experimental results in SMAC and MAMuJoCo environments demonstrate that OMDPG outperforms various state-of-the-art MARL baselines.
zh
[AI-38] Demonstrating the Octopi-1.5 Visual-Tactile-Language Model
【速读】:该论文旨在解决机器人在触觉感知与理解方面的挑战,特别是在复杂操作任务中如何有效利用触觉信息进行推理和决策。其解决方案的关键在于提出Octopi-1.5,一个视觉-触觉-语言模型(Visual-Tactile-Language Model, VTLM),该模型能够处理多部位触觉信号,并引入了简单的检索增强生成(Retrieval-Augmented Generation, RAG)模块,以提升任务性能并实现对新物体的实时学习。
链接: https://arxiv.org/abs/2507.09985
作者: Samson Yu,Kelvin Lin,Harold Soh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Published at R:SS 2025
Abstract:Touch is recognized as a vital sense for humans and an equally important modality for robots, especially for dexterous manipulation, material identification, and scenarios involving visual occlusion. Building upon very recent work in touch foundation models, this demonstration will feature Octopi-1.5, our latest visual-tactile-language model. Compared to its predecessor, Octopi-1.5 introduces the ability to process tactile signals from multiple object parts and employs a simple retrieval-augmented generation (RAG) module to improve performance on tasks and potentially learn new objects on-the-fly. The system can be experienced live through a new handheld tactile-enabled interface, the TMI, equipped with GelSight and TAC-02 tactile sensors. This convenient and accessible setup allows users to interact with Octopi-1.5 without requiring a robot. During the demonstration, we will showcase Octopi-1.5 solving tactile inference tasks by leveraging tactile inputs and commonsense knowledge. For example, in a Guessing Game, Octopi-1.5 will identify objects being grasped and respond to follow-up queries about how to handle it (e.g., recommending careful handling for soft fruits). We also plan to demonstrate Octopi-1.5’s RAG capabilities by teaching it new items. With live interactions, this demonstration aims to highlight both the progress and limitations of VTLMs such as Octopi-1.5 and to foster further interest in this exciting field. Code for Octopi-1.5 and design files for the TMI gripper are available at this https URL.
zh
[AI-39] DeepSeek : Paradigm Shifts and Technical Evolution in Large AI Models
【速读】:该论文旨在探讨深度学习模型的发展路径及其在人工智能领域的应用,特别是针对大规模语言模型(Large Language Model, LLM)的技术演进与创新。其解决方案的关键在于分析DeepSeek公司推出的V3和R1系列模型所采用的新型算法与工程优化技术,包括多头潜在注意力(Multi-head Latent Attention, MLA)、专家混合(Mixture-of-Experts, MoE)、多标记预测(Multi-Token Prediction, MTP)以及组相对策略优化(Group Relative Policy Optimization, GRPO),并通过这些技术提升模型性能、降低训练成本,并推动系统级架构的优化。
链接: https://arxiv.org/abs/2507.09955
作者: Luolin Xiong,Haofen Wang,Xi Chen,Lu Sheng,Yun Xiong,Jingping Liu,Yanghua Xiao,Huajun Chen,Qing-Long Han,Yang Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:DeepSeek, a Chinese Artificial Intelligence (AI) startup, has released their V3 and R1 series models, which attracted global attention due to their low cost, high performance, and open-source advantages. This paper begins by reviewing the evolution of large AI models focusing on paradigm shifts, the mainstream Large Language Model (LLM) paradigm, and the DeepSeek paradigm. Subsequently, the paper highlights novel algorithms introduced by DeepSeek, including Multi-head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), and Group Relative Policy Optimization (GRPO). The paper then explores DeepSeek engineering breakthroughs in LLM scaling, training, inference, and system-level optimization architecture. Moreover, the impact of DeepSeek models on the competitive AI landscape is analyzed, comparing them to mainstream LLMs across various fields. Finally, the paper reflects on the insights gained from DeepSeek innovations and discusses future trends in the technical and engineering development of large AI models, particularly in data, training, and reasoning.
zh
[AI-40] Memorization Sinks: Isolating Memorization during LLM Training
【速读】:该论文试图解决大语言模型在训练过程中可能记住重复序列的问题,这会引发隐私和版权方面的担忧。现有方法通过事后移除特定神经元中的记忆信息来缓解这一问题,但效果有限。论文提出的关键解决方案是MemSinks,其核心在于通过设计一种序列标识符,使得每个序列在重复时激活一组独特的记忆神经元,从而实现记忆内容的隔离。这种方法有助于在不损害通用语言能力的前提下,更有效地移除记忆信息。
链接: https://arxiv.org/abs/2507.09937
作者: Gaurav R. Ghosal,Pratyush Maini,Aditi Raghunathan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 2025 International Conference of Machine Learning
Abstract:Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of natural sequences (those that resemble linguistically plausible text) become mechanistically entangled with general language abilities, thereby becoming challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier that activates a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at this http URL.
zh
[AI-41] Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
【速读】:该论文试图解决在安全关键领域(如核工程)中部署大型语言模型(Large Language Models, LLMs)时,由于模型的黑箱特性而导致的可解释性和验证难题。其解决方案的关键在于提出一种新颖的方法,通过参数高效微调技术(如低秩适应,Low-Rank Adaptation)将通用语言模型适配到核领域,并结合神经元激活模式分析与神经元抑制技术,识别并验证对任务性能具有显著影响的特定神经元群体,从而增强模型的透明度并实现可追溯的领域知识编码。
链接: https://arxiv.org/abs/2507.09931
作者: Yoon Pyo Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to Nuclear Technology. 22 pages, 2 tables, 4 figures
Abstract:The integration of Large Language Models (LLMs) into safety-critical domains, such as nuclear engineering, necessitates a deep understanding of their internal reasoning processes. This paper presents a novel methodology for interpreting how an LLM encodes and utilizes domain-specific knowledge, using a Boiling Water Reactor system as a case study. We adapted a general-purpose LLM (Gemma-3-1b-it) to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered during the adaptation process. To probe the causal role of these specialized neurons, we employed a neuron silencing technique. Our results demonstrate that while silencing most of these specialized neurons individually did not produce a statistically significant effect, deactivating the entire group collectively led to a statistically significant degradation in task performance. Qualitative analysis further revealed that silencing these neurons impaired the model’s ability to generate detailed, contextually accurate technical information. This paper provides a concrete methodology for enhancing the transparency of an opaque black-box model, allowing domain expertise to be traced to verifiable neural circuits. This offers a pathway towards achieving nuclear-grade artificial intelligence (AI) assurance, addressing the verification and validation challenges mandated by nuclear regulatory frameworks (e.g., 10 CFR 50 Appendix B), which have limited AI deployment in safety-critical nuclear operations.
zh
[AI-42] Large Population Models
【速读】:该论文试图解决社会中复杂系统问题,如疫情应对、供应链中断和气候适应等,这些问题源于数百万自主代理在时间维度上的集体行为。解决方案的关键在于大型人口模型(Large Population Models, LPMs),其通过三种核心创新实现对复杂系统的理解:能够同时模拟数百万代理的计算方法、从多样化现实数据流中学习的数学框架,以及保护隐私的通信协议,这些技术共同实现了对代理行为如何汇聚为系统级结果的观察,并在实际部署前测试干预措施。
链接: https://arxiv.org/abs/2507.09901
作者: Ayush Chopra
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Aggregation of Several Papers from MIT PhD Research. this http URL
Abstract:Many of society’s most pressing challenges, from pandemic response to supply chain disruptions to climate adaptation, emerge from the collective behavior of millions of autonomous agents making decisions over time. Large Population Models (LPMs) offer an approach to understand these complex systems by simulating entire populations with realistic behaviors and interactions at unprecedented scale. LPMs extend traditional modeling approaches through three key innovations: computational methods that efficiently simulate millions of agents simultaneously, mathematical frameworks that learn from diverse real-world data streams, and privacy-preserving communication protocols that bridge virtual and physical environments. This allows researchers to observe how agent behavior aggregates into system-level outcomes and test interventions before real-world implementation. While current AI advances primarily focus on creating “digital humans” with sophisticated individual capabilities, LPMs develop “digital societies” where the richness of interactions reveals emergent phenomena. By bridging individual agent behavior and population-scale dynamics, LPMs offer a complementary path in AI research illuminating collective intelligence and providing testing grounds for policies and social innovations before real-world deployment. We discuss the technical foundations and some open problems here. LPMs are implemented by the AgentTorch framework (this http URL)
zh
[AI-43] Soft Graph Clustering for single-cell RNA Sequencing Data
【速读】:该论文旨在解决单细胞RNA测序(scRNA-seq)数据聚类分析中因硬图结构导致的信息丢失与聚类偏差问题。传统基于图神经网络(GNN)的方法依赖于通过阈值化相似性矩阵构建的二元边权重图,这限制了对细胞间连续相似性的捕捉,并可能导致跨簇连接干扰聚类结果。其解决方案的关键在于引入scSGC,一种基于软图的聚类方法,通过非二元边权重更准确地表征细胞间的连续相似性,从而克服刚性数据结构的局限性。
链接: https://arxiv.org/abs/2507.09890
作者: Ping Xu,Pengfei Wang,Zhiyuan Ning,Meng Xiao,Min Wu,Yuanchun Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:
Abstract:Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have significantly improved in tackling the challenges of high-dimension, high-sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, their reliance on hard graph constructions derived from thresholded similarity matrices presents challenges:(i) The simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss.(ii) The presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes. To tackle these challenges, we introduce scSGC, a Soft Graph Clustering for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder; (ii) a dual-channel cut-informed soft graph embedding module; and (iii) an optimal transport-based clustering optimization module. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.
zh
[AI-44] NeuTSFlow: Modeling Continuous Functions Behind Time Series Forecasting
【速读】:该论文试图解决传统时间序列预测方法将数据视为离散序列而忽略其作为连续过程噪声采样的本质问题。解决方案的关键在于提出NeuTSFlow框架,该框架利用神经算子进行流匹配,以学习历史函数族到未来函数族之间的转换路径,通过在无限维函数空间中参数化流的速场,直接建模函数级特征,从而超越了仅关注离散点依赖的传统方法。
链接: https://arxiv.org/abs/2507.09888
作者: Huibo Xu,Likang Wu,Xianquan Wang,Haoning Dang,Chun-Wun Cheng,Angelica I Aviles-Rivero,Qi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Time series forecasting is a fundamental task with broad applications, yet conventional methods often treat data as discrete sequences, overlooking their origin as noisy samples of continuous processes. Crucially, discrete noisy observations cannot uniquely determine a continuous function; instead, they correspond to a family of plausible functions. Mathematically, time series can be viewed as noisy observations of a continuous function family governed by a shared probability measure. Thus, the forecasting task can be framed as learning the transition from the historical function family to the future function family. This reframing introduces two key challenges: (1) How can we leverage discrete historical and future observations to learn the relationships between their underlying continuous functions? (2) How can we model the transition path in function space from the historical function family to the future function family? To address these challenges, we propose NeuTSFlow, a novel framework that leverages Neural Operators to facilitate flow matching for learning path of measure between historical and future function families. By parameterizing the velocity field of the flow in infinite-dimensional function spaces, NeuTSFlow moves beyond traditional methods that focus on dependencies at discrete points, directly modeling function-level features instead. Experiments on diverse forecasting tasks demonstrate NeuTSFlow’s superior accuracy and robustness, validating the effectiveness of the function-family perspective.
zh
[AI-45] olerantECG: A Foundation Model for Imperfect Electrocardiogram
【速读】:该论文试图解决心电图(ECG)在噪声干扰或标准12导联记录中存在缺失时导致诊断错误或不确定性的问题。其解决方案的关键在于提出TolerantECG,这是一个针对ECG信号的基础模型,能够容忍噪声并处理任意子集的12导联ECG数据。TolerantECG的训练结合了对比学习和自监督学习框架,共同学习ECG信号表示及其对应的基于知识检索的文本报告描述以及受损或导联缺失的信号。
链接: https://arxiv.org/abs/2507.09887
作者: Huynh Nguyen Dang,Thang Pham,Ngan Le,Van Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 10 pages, 6 figures. Accepted to ACM Multimedia 2025
Abstract:The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
zh
[AI-46] VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
【速读】:该论文试图解决生成式 AI (Generative AI) 在强化学习(Reinforcement Learning, RL)中对模型生成回答与参考答案之间一致性验证的挑战,特别是在面对长文本、多样化和复杂语义时,传统规则基验证器和通用大语言模型(LLM)在准确性和一致性上的不足。解决方案的关键在于提出 VerifyBench——一个跨领域的综合性基准,用于系统评估不同类型的验证器性能,通过构建涵盖数学、物理、化学和生物学的4,000个专家级问题及其参考答案与多样化生成回答,并设计四维实验框架比较专业验证器与通用 LLM 的表现边界,从而揭示验证器在准确性、召回率、输入结构敏感性及跨领域泛化能力方面的核心瓶颈。
链接: https://arxiv.org/abs/2507.09884
作者: Xuzhao Li,Xuchen Li,Shiyu Hu,Yongzhen Guo,Wentao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint, Under review
Abstract:Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers’ performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench–a cross-domain comprehensive benchmark for systematically evaluating verifiers. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses. The reliability of the evaluation is ensured through a rigorous annotation process conducted by a multidisciplinary expert team. We design a four-dimensional experimental framework to comprehensively compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers vs. complete responses, and short vs. long outputs. Our evaluation uncovers fundamental trade-offs in verifiers: while specialized verifiers achieve leading accuracy, they exhibit deficiencies in recall; general models show stronger inclusivity but unstable precision. More importantly, we discover verifiers’ high sensitivity to input structure and inherent limitations in cross-domain generalization, providing critical insights into the bottlenecks of current verifier technology.
zh
[AI-47] Covering a Few Submodular Constraints and Applications
【速读】:该论文试图解决在给定多个子模约束条件下,寻找一个最小成本子集以满足所有约束的问题(即多子模覆盖问题)。其关键解决方案是针对固定数量的子模函数,设计了一种随机化的双准则近似算法,能够在保证每个子模函数值接近所需要求的同时,控制总成本。此外,当子模函数为删除闭包集合系统中的加权覆盖函数时,还提出了一个基于自然线性规划的近似算法,进一步优化了近似比。这些方法表明,在固定约束数量的情况下,可以达到与单个子模约束情况相似的近似效果。
链接: https://arxiv.org/abs/2507.09879
作者: Tanvi Bajpai,Chandra Chekuri,Pooja Kulkarni
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 34 pages. Accepted to APPROX 2025
Abstract:We consider the problem of covering multiple submodular constraints. Given a finite ground set N , a cost function c: N \rightarrow \mathbbR_+ , r monotone submodular functions f_1,f_2,\ldots,f_r over N and requirements b_1,b_2,\ldots,b_r the goal is to find a minimum cost subset S \subseteq N such that f_i(S) \ge b_i for 1 \le i \le r . When r=1 this is the well-known Submodular Set Cover problem. Previous work \citechekuri2022covering considered the setting when r is large and developed bi-criteria approximation algorithms, and approximation algorithms for the important special case when each f_i is a weighted coverage function. These are fairly general models and capture several concrete and interesting problems as special cases. The approximation ratios for these problem are at least \Omega(\log r) which is unavoidable when r is part of the input. In this paper, motivated by some recent applications, we consider the problem when r is a \emphfixed constant and obtain two main results. For covering multiple submodular constraints we obtain a randomized bi-criteria approximation algorithm that for any given integer \alpha \ge 1 outputs a set S such that f_i(S) \ge (1-1/e^\alpha -\epsilon)b_i for each i \in [r] and \mathbbE[c(S)] \le (1+\epsilon)\alpha \cdot \sfOPT . Second, when the f_i are weighted coverage functions from a deletion-closed set system we obtain a (1+\epsilon) (\fracee-1) (1+\beta) -approximation where \beta is the approximation ratio for the underlying set cover instances via the natural LP. These results show that one can obtain nearly as good an approximation for any fixed r as what one would achieve for r=1 . We mention some applications that follow easily from these general results and anticipate more in the future.
zh
[AI-48] ask Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks
【速读】:该论文试图解决当前人工智能(AI)研究中评估方法过于依赖固定下游任务集所带来的局限性,这种固定集无法全面反映模型在所有可能任务上的性能。论文提出的解决方案的关键在于定义一个基于任务分布和任务先验(Task Priors)的下游任务概率空间,从而能够在所有可能的任务上评估模型的表现,而不仅仅局限于人工选择的基准任务。这一框架首次提供了对模型在所有可能任务上的平均性能和性能方差等关键问题的量化分析。
链接: https://arxiv.org/abs/2507.09871
作者: Niket Patel,Randall Balestriero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The grand goal of AI research, and particularly Self Supervised Learning (SSL), is to produce systems that can successfully solve any possible task. In contrast, current evaluation methods available to AI researchers typically rely on a fixed collection of hand-picked downstream benchmarks. Hence, a large amount of effort is put into designing and searching for large collection of evaluation tasks that can serve as a proxy of our grand goal. We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. To remedy that, we define a probabilistic space of downstream tasks obtained by adopting a distribution of tasks and by defining Task Priors. Under this view, one can evaluate a model’s performance over the set of all possible downstream tasks. Our framework is the first to provide answers to key questions such as (i) what is the average performance of my model over all possible downstream tasks weighted by the probability to encounter each task? or (ii) what is the variance of my model’s performance across all downstream tasks under the defined Task Priors? Beyond establishing a new standard for evaluation, we believe that Task Priors will accelerate the pace of research in SSL - where downstream task evaluation is the sole qualitative signal that researchers have access to.
zh
[AI-49] urning the Tide: Repository-based Code Reflection
【速读】:该论文试图解决在代码仓库(repository)上下文中进行代码理解和生成的评估问题,尤其是针对代码修改场景的不足。其解决方案的关键在于引入LiveRepoReflection基准测试,该基准包含1,888个经过严格筛选的测试用例,覆盖6种编程语言,以确保多样性、正确性和高难度;同时构建了RepoReflection-Instruct指令微调数据集,用于通过两轮对话过程(包括代码生成和错误驱动修复)训练RepoReflectionCoder模型,从而提升模型在仓库级代码反思任务中的性能。
链接: https://arxiv.org/abs/2507.09866
作者: Wei Zhang,Jian Yang,Jiaxi Yang,Ya Wang,Zhoujun Li,Zeyu Cui,Binyuan Hui,Junyang Lin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and accessibility. While benchmarks (e.g. HumanEval/LiveCodeBench) evaluate code generation and real-world relevance, previous works ignore the scenario of modifying code in repositories. Considering challenges remaining in improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect the model performance of repository-based code reflection.
zh
[AI-50] Intersection of Reinforcement Learning and Bayesian Optimization for Intelligent Control of Industrial Processes: A Safe MPC-based DPG using Multi-Objective BO
【速读】:该论文旨在解决基于模型预测控制(MPC)的强化学习(MPC-RL)方法在收敛速度慢、策略学习次优以及在线适应过程中存在安全问题等挑战。其解决方案的关键在于将MPC-RL与多目标贝叶斯优化(MOBO)相结合,利用噪声的强化学习阶段成本及其梯度估计,并通过兼容确定性策略梯度(CDPG)方法将其融入基于期望超体积改进(EHVI)的MOBO算法中,从而实现MPC参数的高效且安全调优,提升闭环控制性能。
链接: https://arxiv.org/abs/2507.09864
作者: Hossein Nejatbakhsh Esfahani,Javad Mohammadpour Velni
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:Model Predictive Control (MPC)-based Reinforcement Learning (RL) offers a structured and interpretable alternative to Deep Neural Network (DNN)-based RL methods, with lower computational complexity and greater transparency. However, standard MPC-RL approaches often suffer from slow convergence, suboptimal policy learning due to limited parameterization, and safety issues during online adaptation. To address these challenges, we propose a novel framework that integrates MPC-RL with Multi-Objective Bayesian Optimization (MOBO). The proposed MPC-RL-MOBO utilizes noisy evaluations of the RL stage cost and its gradient, estimated via a Compatible Deterministic Policy Gradient (CDPG) approach, and incorporates them into a MOBO algorithm using the Expected Hypervolume Improvement (EHVI) acquisition function. This fusion enables efficient and safe tuning of the MPC parameters to achieve improved closed-loop performance, even under model imperfections. A numerical example demonstrates the effectiveness of the proposed approach in achieving sample-efficient, stable, and high-performance learning for control systems.
zh
[AI-51] Secure and Efficient UAV-Based Face Detection via Homomorphic Encryption and Edge Computing
【速读】:该论文试图解决无人机(UAV)在进行人脸识别时面临的隐私保护问题。传统方法在动态环境中虽能通过高分辨率图像和复杂神经网络实现准确识别,但无人机的广泛监控能力引发了隐私担忧。论文提出的解决方案的关键在于将同态加密(HE)与先进神经网络相结合,确保面部数据在整个推理过程中保持加密状态,同时最小化对检测精度的影响。其核心技术创新包括利用CKKS方案对加密数据进行直接计算,以及设计一种针对原始面部数据的SIMD编码方法,从而实现无需解密即可在密文上进行安全推理。
链接: https://arxiv.org/abs/2507.09860
作者: Nguyen Van Duc,Bui Duc Manh,Quang-Trung Luu,Dinh Thai Hoang,Van-Linh Nguyen,Diep N. Nguyen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper aims to propose a novel machine learning (ML) approach incorporating Homomorphic Encryption (HE) to address privacy limitations in Unmanned Aerial Vehicles (UAV)-based face detection. Due to challenges related to distance, altitude, and face orientation, high-resolution imagery and sophisticated neural networks enable accurate face recognition in dynamic environments. However, privacy concerns arise from the extensive surveillance capabilities of UAVs. To resolve this issue, we propose a novel framework that integrates HE with advanced neural networks to secure facial data throughout the inference phase. This method ensures that facial data remains secure with minimal impact on detection accuracy. Specifically, the proposed system leverages the Cheon-Kim-Kim-Song (CKKS) scheme to perform computations directly on encrypted data, optimizing computational efficiency and security. Furthermore, we develop an effective data encoding method specifically designed to preprocess the raw facial data into CKKS form in a Single-Instruction-Multiple-Data (SIMD) manner. Building on this, we design a secure inference algorithm to compute on ciphertext without needing decryption. This approach not only protects data privacy during the processing of facial data but also enhances the efficiency of UAV-based face detection systems. Experimental results demonstrate that our method effectively balances privacy protection and detection performance, making it a viable solution for UAV-based secure face detection. Significantly, our approach (while maintaining data confidentially with HE encryption) can still achieve an accuracy of less than 1% compared to the benchmark without using encryption.
zh
[AI-52] Model-Grounded Symbolic Artificial Intelligence Systems Learning and Reasoning with Model-Grounded Symbolic Artificial Intelligence Systems
【速读】:该论文试图解决如何有效结合神经网络的泛化学习能力与符号AI的可验证推理能力,以构建更高效和可靠的智能系统。其解决方案的关键在于将指令微调的大语言模型重新诠释为基于模型内部表示空间的符号AI系统,其中自然语言作为符号层,而模型的内部表征则实现语义接地,从而在保持传统学习与推理范式结构相似性的同时,提升学习效率和推理可靠性。
链接: https://arxiv.org/abs/2507.09854
作者: Aniruddha Chattopadhyay,Raj Dandekar,Kaushik Roy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as paper in 19th International Conference on Neurosymbolic Learning and Reasoning,NeSy 2025
Abstract:Neurosymbolic artificial intelligence (AI) systems combine neural network and classical symbolic AI mechanisms to exploit the complementary strengths of large scale, generalizable learning and robust, verifiable reasoning. Numerous classifications of neurosymbolic AI illustrate how these two components can be integrated in distinctly different ways. In this work, we propose reinterpreting instruction tuned large language models as model grounded symbolic AI systems where natural language serves as the symbolic layer and grounding is achieved through the models internal representation space. Within this framework, we investigate and develop novel learning and reasoning approaches that preserve structural similarities to traditional learning and reasoning paradigms. Preliminary evaluations across axiomatic deductive reasoning procedures of varying complexity provide insights into the effectiveness of our approach in improving learning efficiency and reasoning reliability.
zh
[AI-53] Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLM s Without RL or Distillation ICML2025
【速读】:该论文试图解决如何在不进行大规模微调的情况下,通过提示(prompting)或最小调优,使基础模型具备生成长且显式的思维链(Chain-of-Thought, CoT)能力的问题。其解决方案的关键在于利用少量高质量的CoT示例对基础模型进行轻量级微调,实验表明仅使用20个来自推理模型QwQ-32B-Preview的长CoT示例即可显著提升模型的推理性能,甚至超越更大规模的专门数学模型Qwen2.5-Math-72B-Instruct。
链接: https://arxiv.org/abs/2507.09850
作者: Wei Du,Branislav Kisacanin,George Armstrong,Shubham Toshniwal,Ivan Moshkov,Alexan Ayrapetyan,Sadegh Mahdavi,Dan Zhao,Shizhe Diao,Dragan Masulovic,Marius Stanean,Advaith Avadhanam,Max Wang,Ashmit Dutta,Shitij Govil,Sri Yanamandara,Mihir Tandon,Sriram Ananthakrishnan,Vedant Rathi,David Zhang,Joonseok Kang,Leon Luo,Titu Andreescu,Boris Ginsburg,Igor Gitman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the Second AI for Math Workshop at the 42nd International Conference on Machine Learning (ICML 2025)
Abstract:Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model \textttQwQ-32B-Preview, we lightly fine-tune the base model \textttQwen2.5-32B. The resulting model outperforms the much larger \textttQwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.
zh
[AI-54] hrough the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
【速读】:该论文试图解决大规模训练中传统预训练策略(如固定计算预算下的余弦学习率调度)日益不足的问题,特别是在模型和数据集规模迅速扩大的背景下。其解决方案的关键在于引入一种更为原理性且可扩展的方法——无调度(Schedule-Free, SF)方法,该方法无需显式衰减阶段或辅助平均机制,即可有效导航损失景观的“河流”结构。通过理论与实证分析,研究发现SF隐式实现了权重平均而无需额外内存开销,并在此基础上提出改进版本,提升了对动量的鲁棒性和大批次尺寸下的性能,从而解决了原始方法的关键局限。
链接: https://arxiv.org/abs/2507.09846
作者: Minhak Song,Beomhan Baek,Kwangjun Ahn,Chulhee Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Comments would be appreciated!
Abstract:As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the “river” structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
zh
[AI-55] A Pre-training Framework for Relational Data with Information-theoretic Principles
【速读】:该论文试图解决在关系型数据库上构建可泛化的预训练策略所面临的挑战,这一挑战源于任务异质性,即存在无限多可能的下游任务,这些任务由关系模式图、时间依赖性和SQL定义的标签逻辑所定义。解决方案的关键在于引入任务向量估计(Task Vector Estimation, TVE),该框架通过基于集合的聚合操作在模式遍历图上构建预测监督信号,并显式建模下一窗口的关系动态,从而获得任务感知的表示。TVE通过信息论视角进行形式化,证明了任务知情表示比无任务先验的表示保留了更多相关信号。
链接: https://arxiv.org/abs/2507.09837
作者: Quang Truong,Zhikai Chen,Mingxuan Ju,Tong Zhao,Neil Shah,Jiliang Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Relational databases underpin critical infrastructure across a wide range of domains, yet the design of generalizable pre-training strategies for learning from relational databases remains an open challenge due to task heterogeneity. Specifically, there exist infinitely many possible downstream tasks, as tasks are defined based on relational schema graphs, temporal dependencies, and SQL-defined label logics. An effective pre-training framework is desired to take these factors into account in order to obtain task-aware representations. By incorporating knowledge of the underlying distribution that drives label generation, downstream tasks can benefit from relevant side-channel information. To bridge this gap, we introduce Task Vector Estimation (TVE), a novel pre-training framework that constructs predictive supervisory signals via set-based aggregation over schema traversal graphs, explicitly modeling next-window relational dynamics. We formalize our approach through an information-theoretic lens, demonstrating that task-informed representations retain more relevant signals than those obtained without task priors. Extensive experiments on the RelBench benchmark show that TVE consistently outperforms traditional pre-training baselines. Our findings advocate for pre-training objectives that encode task heterogeneity and temporal structure as design principles for predictive modeling on relational databases.
zh
[AI-56] Multi-residual Mixture of Experts Learning for Cooperative Control in Multi-vehicle Systems
【速读】:该论文试图解决在复杂多变的真实交通环境中,设计能够泛化且鲁棒的拉格朗日交通控制策略(Lagrangian traffic control policies)对于自动驾驶车辆(AVs)所面临的重大挑战。其解决方案的关键在于提出一种名为多残差专家混合学习(Multi-Residual Mixture of Expert Learning, MRMEL)的新框架,该框架通过在给定次优基准策略的基础上学习一个残差修正项,并结合基于交通场景的专家混合模型动态选择最合适的基准策略,从而有效提升控制性能。
链接: https://arxiv.org/abs/2507.09836
作者: Vindula Jayawardana,Sirui Li,Yashar Farid,Cathy Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:
Abstract:Autonomous vehicles (AVs) are becoming increasingly popular, with their applications now extending beyond just a mode of transportation to serving as mobile actuators of a traffic flow to control flow dynamics. This contrasts with traditional fixed-location actuators, such as traffic signals, and is referred to as Lagrangian traffic control. However, designing effective Lagrangian traffic control policies for AVs that generalize across traffic scenarios introduces a major challenge. Real-world traffic environments are highly diverse, and developing policies that perform robustly across such diverse traffic scenarios is challenging. It is further compounded by the joint complexity of the multi-agent nature of traffic systems, mixed motives among participants, and conflicting optimization objectives subject to strict physical and external constraints. To address these challenges, we introduce Multi-Residual Mixture of Expert Learning (MRMEL), a novel framework for Lagrangian traffic control that augments a given suboptimal nominal policy with a learned residual while explicitly accounting for the structure of the traffic scenario space. In particular, taking inspiration from residual reinforcement learning, MRMEL augments a suboptimal nominal AV control policy by learning a residual correction, but at the same time dynamically selects the most suitable nominal policy from a pool of nominal policies conditioned on the traffic scenarios and modeled as a mixture of experts. We validate MRMEL using a case study in cooperative eco-driving at signalized intersections in Atlanta, Dallas Fort Worth, and Salt Lake City, with real-world data-driven traffic scenarios. The results show that MRMEL consistently yields superior performance-achieving an additional 4%-9% reduction in aggregate vehicle emissions relative to the strongest baseline in each setting.
zh
[AI-57] Generative Cognitive Diagnosis
【速读】:该论文旨在解决传统认知诊断(Cognitive Diagnosis, CD)模型在面对新学习者时无法实现即时诊断的问题,以及因需要计算昂贵的参数重优化而导致的诊断结果可靠性不足的问题。其解决方案的关键在于引入一种生成式诊断范式,将CD从预测建模转变为生成建模,从而实现无需参数重新优化的归纳推理。该方法通过设计合理的生成过程,将认知状态推断与响应预测解耦,并结合可识别性和单调性条件,提升了诊断的准确性和效率。
链接: https://arxiv.org/abs/2507.09831
作者: Jiatong Li,Qi Liu,Mengxiao Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Preprint; 15 pages, 12 figures
Abstract:Cognitive diagnosis (CD) models latent cognitive states of human learners by analyzing their response patterns on diagnostic tests, serving as a crucial machine learning technique for educational assessment and evaluation. Traditional cognitive diagnosis models typically follow a transductive prediction paradigm that optimizes parameters to fit response scores and extract learner abilities. These approaches face significant limitations as they cannot perform instant diagnosis for new learners without computationally expensive retraining and produce diagnostic outputs with limited reliability. In this study, we introduces a novel generative diagnosis paradigm that fundamentally shifts CD from predictive to generative modeling, enabling inductive inference of cognitive states without parameter re-optimization. We propose two simple yet effective instantiations of this paradigm: Generative Item Response Theory (G-IRT) and Generative Neural Cognitive Diagnosis Model (G-NCDM), which achieve excellent performance improvements over traditional methods. The generative approach disentangles cognitive state inference from response prediction through a well-designed generation process that incorporates identifiability and monotonicity conditions. Extensive experiments on real-world datasets demonstrate the effectiveness of our methodology in addressing scalability and reliability challenges, especially \times 100 speedup for the diagnosis of new learners. Our framework opens new avenues for cognitive diagnosis applications in artificial intelligence, particularly for intelligent model evaluation and intelligent education systems. The code is available at this https URL.
zh
[AI-58] Bridging Neural Networks and Dynamic Time Warping for Adaptive Time Series Classification
【速读】:该论文试图解决时间序列分类中神经网络对大量标注数据的依赖性问题以及动态时间规整(DTW)方法在资源丰富场景下的性能不足问题。其解决方案的关键在于提出一种动态长度缩短算法,将时间序列转换为保留关键结构模式的原型,从而将DTW的递归关系重新表述为等效的循环神经网络,构建了一个既可训练又保持DTW固有可解释性的模型。
链接: https://arxiv.org/abs/2507.09826
作者: Jintao Qu,Zichong Wang,Chenhao Wu,Wenbin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural networks have achieved remarkable success in time series classification, but their reliance on large amounts of labeled data for training limits their applicability in cold-start scenarios. Moreover, they lack interpretability, reducing transparency in decision-making. In contrast, dynamic time warping (DTW) combined with a nearest neighbor classifier is widely used for its effectiveness in limited-data settings and its inherent interpretability. However, as a non-parametric method, it is not trainable and cannot leverage large amounts of labeled data, making it less effective than neural networks in rich-resource scenarios. In this work, we aim to develop a versatile model that adapts to cold-start conditions and becomes trainable with labeled data, while maintaining interpretability. We propose a dynamic length-shortening algorithm that transforms time series into prototypes while preserving key structural patterns, thereby enabling the reformulation of the DTW recurrence relation into an equivalent recurrent neural network. Based on this, we construct a trainable model that mimics DTW’s alignment behavior. As a neural network, it becomes trainable when sufficient labeled data is available, while still retaining DTW’s inherent interpretability. We apply the model to several benchmark time series classification tasks and observe that it significantly outperforms previous approaches in low-resource settings and remains competitive in rich-resource settings.
zh
[AI-59] Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem
【速读】:该论文试图解决神经网络在计算层面实现超表示(superposition)的问题,即在有限的隐藏维度下高效地执行布尔运算,如计算所有(2m)对m个稀疏输入的AND操作。其解决方案的关键在于通过限制隐藏维度来迫使模型找到一种计算效率高的电路结构,称为压缩计算(compressed computation)。研究发现,训练过程找到了一个简单且完全稠密的解决方案,其中每个神经元都对每个输出做出贡献,该方案在低稀疏性下比理论构造更高效,并且具有良好的可扩展性和鲁棒性。
链接: https://arxiv.org/abs/2507.09816
作者: Adam Newgas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 9 figures
Abstract:Neural networks are capable of superposition – representing more features than there are dimensions. Recent work considers the analogous concept for computation instead of storage, proposing theoretical constructions. But there has been little investigation into whether these circuits can be learned in practice. In this work, we investigate a toy model for the Universal-AND problem which computes the AND of all m\choose 2 pairs of m sparse inputs. The hidden dimension that determines the number of non-linear activations is restricted to pressure the model to find a compute-efficient circuit, called compressed computation. We find that the training process finds a simple solution that does not correspond to theoretical constructions. It is fully dense – every neuron contributes to every output. The solution circuit naturally scales with dimension, trading off error rates for neuron efficiency. It is similarly robust to changes in sparsity and other key parameters, and extends naturally to other boolean operations and boolean circuits. We explain the found solution in detail and compute why it is more efficient than the theoretical constructions at low sparsity. Our findings shed light on the types of circuits that models like to form and the flexibility of the superposition representation. This contributes to a broader understanding of network circuitry and interpretability.
zh
[AI-60] Federated Learning with Graph-Based Aggregation for Traffic Forecasting KDD2025
【速读】:该论文旨在解决交通预测中联邦学习(Federated Learning, FL)方法在处理空间相关性时的局限性。传统联邦学习方法如FedAvg假设客户端独立,未能有效利用交通网络中各区域或路段之间的空间依赖关系,从而限制了模型性能。论文提出的解决方案的关键在于结合联邦学习与图学习的思想,采用轻量级的图感知策略,在参数更新过程中基于图连通性对客户端模型进行加权,从而在保持计算效率的同时捕捉空间关系。
链接: https://arxiv.org/abs/2507.09805
作者: Audri Banik,Glaucio Haroldo Silva de Carvalho,Renata Dividino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at FedKDD 2025: International Joint Workshop on Federated Learning for Data Mining and Graph Analytics. 6 pages, 1 figure
Abstract:In traffic prediction, the goal is to estimate traffic speed or flow in specific regions or road segments using historical data collected by devices deployed in each area. Each region or road segment can be viewed as an individual client that measures local traffic flow, making Federated Learning (FL) a suitable approach for collaboratively training models without sharing raw data. In centralized FL, a central server collects and aggregates model updates from multiple clients to build a shared model while preserving each client’s data privacy. Standard FL methods, such as Federated Averaging (FedAvg), assume that clients are independent, which can limit performance in traffic prediction tasks where spatial relationships between clients are important. Federated Graph Learning methods can capture these dependencies during server-side aggregation, but they often introduce significant computational overhead. In this paper, we propose a lightweight graph-aware FL approach that blends the simplicity of FedAvg with key ideas from graph learning. Rather than training full models, our method applies basic neighbourhood aggregation principles to guide parameter updates, weighting client models based on graph connectivity. This approach captures spatial relationships effectively while remaining computationally efficient. We evaluate our method on two benchmark traffic datasets, METR-LA and PEMS-BAY, and show that it achieves competitive performance compared to standard baselines and recent graph-based federated learning techniques.
zh
[AI-61] chnical Requirements for Halting Dangerous AI Activities
【速读】:该论文试图解决人工智能系统快速发展所带来的前所未有的风险,包括失控、滥用、地缘政治不稳定和权力集中等问题。其解决方案的关键在于提出一系列关键技术干预措施,这些措施能够支持对危险AI活动的协调暂停,从而为潜在的AI治理计划提供技术基础。
链接: https://arxiv.org/abs/2507.09801
作者: Peter Barnett,Aaron Scher,David Abecassis
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The rapid development of AI systems poses unprecedented risks, including loss of control, misuse, geopolitical instability, and concentration of power. To navigate these risks and avoid worst-case outcomes, governments may proactively establish the capability for a coordinated halt on dangerous AI development and deployment. In this paper, we outline key technical interventions that could allow for a coordinated halt on dangerous AI activities. We discuss how these interventions may contribute to restricting various dangerous AI activities, and show how these interventions can form the technical foundation for potential AI governance plans.
zh
[AI-62] Prompting for Performance: Exploring LLM s for Configuring Software
【速读】:该论文试图解决软件系统中性能导向的配置优化问题,即如何在大量配置选项中找到影响执行时间、内存使用、二进制大小或码率等性能指标的有效组合。其解决方案的关键在于探索大型语言模型(Large Language Models, LLMs)是否能够通过提示(prompts)辅助完成相关任务,如识别关键配置选项、对配置进行排序以及推荐高性能配置。研究结果显示,LLMs在某些任务和系统中能够与专家知识相匹配,但在其他情况下可能存在幻觉或表面推理的问题。
链接: https://arxiv.org/abs/2507.09790
作者: Helge Spieker,Théo Matricon,Nassim Belmecheri,Jørn Eirik Betten,Gauthier Le Bartz Lyan,Heraldo Borges,Quentin Mazouni,Dennis Gross,Arnaud Gotlieb,Mathieu Acher
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Software systems usually provide numerous configuration options that can affect performance metrics such as execution time, memory usage, binary size, or bitrate. On the one hand, making informed decisions is challenging and requires domain expertise in options and their combinations. On the other hand, machine learning techniques can search vast configuration spaces, but with a high computational cost, since concrete executions of numerous configurations are required. In this exploratory study, we investigate whether large language models (LLMs) can assist in performance-oriented software configuration through prompts. We evaluate several LLMs on tasks including identifying relevant options, ranking configurations, and recommending performant configurations across various configurable systems, such as compilers, video encoders, and SAT solvers. Our preliminary results reveal both positive abilities and notable limitations: depending on the task and systems, LLMs can well align with expert knowledge, whereas hallucinations or superficial reasoning can emerge in other cases. These findings represent a first step toward systematic evaluations and the design of LLM-based solutions to assist with software configuration.
zh
[AI-63] BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs
【速读】:该论文旨在解决量化深度神经网络(Quantized DNN)中位级稀疏性在优化乘加(MAC)操作时面临的两个关键问题:一是传统位串行方法无法同时利用两个因子的稀疏性,导致其中一个因子的稀疏性被完全浪费,而现有针对双因子稀疏性的方法面临部分积爆炸的问题;二是位级稀疏性的波动导致MAC操作的周期数变化,现有同步调度方案在处理双因子稀疏性时灵活性差,造成MAC单元利用率低。解决方案的关键在于提出一种基于粒子化方法的MAC单元,通过简单的控制逻辑解决部分积爆炸问题,实现更高效的空间和能耗性能,并引入准同步方案以提升MAC阵列的周期弹性,减少流水线阻塞,从而提高MAC单元的利用率。
链接: https://arxiv.org/abs/2507.09780
作者: Feilong Qiaoyuan,Jihe Wang,Zhiyu Sun,Linying Wu,Yuanhua Xiao,Danghui Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 9 pages, 13 figures, 3 Tables
Abstract:Bit-level sparsity in quantized deep neural networks (DNNs) offers significant potential for optimizing Multiply-Accumulate (MAC) operations. However, two key challenges still limit its practical exploitation. First, conventional bit-serial approaches cannot simultaneously leverage the sparsity of both factors, leading to a complete waste of one factor’ s sparsity. Methods designed to exploit dual-factor sparsity are still in the early stages of exploration, facing the challenge of partial product explosion. Second, the fluctuation of bit-level sparsity leads to variable cycle counts for MAC operations. Existing synchronous scheduling schemes that are suitable for dual-factor sparsity exhibit poor flexibility and still result in significant underutilization of MAC units. To address the first challenge, this study proposes a MAC unit that leverages dual-factor sparsity through the emerging particlization-based approach. The proposed design addresses the issue of partial product explosion through simple control logic, resulting in a more area- and energy-efficient MAC unit. In addition, by discarding less significant intermediate results, the design allows for further hardware simplification at the cost of minor accuracy loss. To address the second challenge, a quasi-synchronous scheme is introduced that adds cycle-level elasticity to the MAC array, reducing pipeline stalls and thereby improving MAC unit utilization. Evaluation results show that the exact version of the proposed MAC array architecture achieves a 29.2% improvement in area efficiency compared to the state-of-the-art bit-sparsity-driven architecture, while maintaining comparable energy efficiency. The approximate variant further improves energy efficiency by 7.5%, compared to the exact version. Index-Terms: DNN acceleration, Bit-level sparsity, MAC unit
zh
[AI-64] oward accurate RUL and SOH estimation using reinforced graph-based PINNs enhanced with dynamic weights
【速读】:该论文旨在解决在工业应用中准确估计剩余使用寿命(Remaining Useful Life, RUL)和健康状态(State of Health, SOH)的问题,这是预测性维护与健康管理(Prognostics and Health Management, PHM)的核心挑战。其解决方案的关键在于提出一种新型框架——增强动态权重的强化图基物理信息神经网络(Reinforced Graph-Based Physics-Informed Neural Networks Enhanced with Dynamic Weights, RGPD),该框架结合了物理约束监督与先进的时空学习方法。通过图卷积循环网络(GCRN)捕捉节点表示随时间的变化,利用图注意力卷积(GATConv)实现动态空间聚合,并引入软演员-评论家(SAC)模块优化时空学习过程,同时采用Q-learning代理动态分配物理约束权重,从而提升模型的泛化能力和预测精度。
链接: https://arxiv.org/abs/2507.09766
作者: Mohamadreza Akbari Pour,Ali Ghasemzadeh,MohamadAli Bijarchi,Mohammad Behshad Shafii
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate estimation of Remaining Useful Life (RUL) and State of Health (SOH) is essential for Prognostics and Health Management (PHM) across a wide range of industrial applications. We propose a novel framework – Reinforced Graph-Based Physics-Informed Neural Networks Enhanced with Dynamic Weights (RGPD) – that combines physics-based supervision with advanced spatio-temporal learning. Graph Convolutional Recurrent Networks (GCRNs) embed graph-convolutional filters within recurrent units to capture how node representations evolve over time. Graph Attention Convolution (GATConv) leverages a self-attention mechanism to compute learnable, edge-wise attention coefficients, dynamically weighting neighbor contributions for adaptive spatial aggregation. A Soft Actor-Critic (SAC) module is positioned between the Temporal Attention Unit (TAU) and GCRN to further improve the spatio-temporal learning. This module improves attention and prediction accuracy by dynamically scaling hidden representations to minimize noise and highlight informative features. To identify the most relevant physical constraints in each area, Q-learning agents dynamically assign weights to physics-informed loss terms, improving generalization across real-time industrial systems and reducing the need for manual tuning. In both RUL and SOH estimation tasks, the proposed method consistently outperforms state-of-the-art models, demonstrating strong robustness and predictive accuracy across varied degradation patterns across three diverse industrial benchmark datasets.
zh
[AI-65] Causality-informed Anomaly Detection in Partially Observable Sensor Networks: Moving beyond Correlations
【速读】:该论文试图解决在AI驱动制造环境中,由于数据流体量不断增大而传感器资源有限,如何实现最优传感器部署以在保证系统部分可观测性的同时快速检测异常的问题。解决方案的关键在于引入一种基于因果信息的深度Q网络(Causal DQ),通过在Q网络训练的每个阶段整合因果信息,从而实现更快的收敛速度和更紧的理论误差边界,进而显著降低不同场景下的异常检测时间。
链接: https://arxiv.org/abs/2507.09742
作者: Xiaofeng Xiao,Bo Shen,Xubo Yue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Nowadays, as AI-driven manufacturing becomes increasingly popular, the volume of data streams requiring real-time monitoring continues to grow. However, due to limited resources, it is impractical to place sensors at every location to detect unexpected shifts. Therefore, it is necessary to develop an optimal sensor placement strategy that enables partial observability of the system while detecting anomalies as quickly as possible. Numerous approaches have been proposed to address this challenge; however, most existing methods consider only variable correlations and neglect a crucial factor: Causality. Moreover, although a few techniques incorporate causal analysis, they rely on interventions-artificially creating anomalies-to identify causal effects, which is impractical and might lead to catastrophic losses. In this paper, we introduce a causality-informed deep Q-network (Causal DQ) approach for partially observable sensor placement in anomaly detection. By integrating causal information at each stage of Q-network training, our method achieves faster convergence and tighter theoretical error bounds. Furthermore, the trained causal-informed Q-network significantly reduces the detection time for anomalies under various settings, demonstrating its effectiveness for sensor placement in large-scale, real-world data streams. Beyond the current implementation, our technique’s fundamental insights can be applied to various reinforcement learning problems, opening up new possibilities for real-world causality-informed machine learning methods in engineering applications.
zh
[AI-66] EPT-2 Technical Report
【速读】:该论文旨在提升地球系统预测的准确性,特别是在能量相关变量(如10米和100米风速、2米温度及地表太阳辐射)的预测上。其解决方案的关键在于提出EPT-2模型,这是Earth Physics Transformer(EPT)系列的最新版本,相较于前代EPT-1.5实现了显著性能提升,并在0-240小时的全预报范围内超越了现有的先进AI天气模型和数值预报系统,如Microsoft Aurora和ECMWF的IFS HRES。此外,论文还引入了基于扰动的EPT-2e集合模型,用于概率预测,该模型在计算成本远低于ECMWF ENS的情况下,仍能显著优于其均值,成为中长期预报的更优选择。
链接: https://arxiv.org/abs/2507.09703
作者: Roberto Molinaro,Niall Siegenheim,Niels Poulsen,Jordan Dane Daubinet,Henry Martin,Mark Frey,Kevin Thiart,Alexander Jakob Dautel,Andreas Schlueter,Alex Grigoryev,Bogdan Danciu,Nikoo Ekhtiari,Bas Steunebrink,Leonie Wagner,Marvin Vincent Gabler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We present EPT-2, the latest iteration in our Earth Physics Transformer (EPT) family of foundation AI models for Earth system forecasting. EPT-2 delivers substantial improvements over its predecessor, EPT-1.5, and sets a new state of the art in predicting energy-relevant variables-including 10m and 100m wind speed, 2m temperature, and surface solar radiation-across the full 0-240h forecast horizon. It consistently outperforms leading AI weather models such as Microsoft Aurora, as well as the operational numerical forecast system IFS HRES from the European Centre for Medium-Range Weather Forecasts (ECMWF). In parallel, we introduce a perturbation-based ensemble model of EPT-2 for probabilistic forecasting, called EPT-2e. Remarkably, EPT-2e significantly surpasses the ECMWF ENS mean-long considered the gold standard for medium- to longrange forecasting-while operating at a fraction of the computational cost. EPT models, as well as third-party forecasts, are accessible via the this http URL platform.
zh
[AI-67] Frequency-aware Surrogate Modeling With SMT Kernels For Advanced Data Forecasting
【速读】:该论文旨在解决复杂机械系统中基于核的方法建模的局限性,特别是在捕捉频率相关动态和非线性行为方面的不足。其解决方案的关键在于引入一种可扩展的核函数框架,该框架不仅包括传统的指数型核,还扩展至如指数平方正弦核和有理二次核等更广泛的核类型,并支持其一阶和二阶导数的计算,从而增强了模型对时间-频率动态特性的表征能力。此外,该框架通过组合不同核的优势,实现了针对特定问题的复合模型构建,提升了代理建模的灵活性与适用性。
链接: https://arxiv.org/abs/2507.09694
作者: Nicolas Gonel,Paul Saves,Joseph Morlier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: AeroBest 2025, Instituto Superior Tecnico of the University of Lisbon, Portugal
Abstract:This paper introduces a comprehensive open-source framework for developing correlation kernels, with a particular focus on user-defined and composition of kernels for surrogate modeling. By advancing kernel-based modeling techniques, we incorporate frequency-aware elements that effectively capture complex mechanical behaviors and timefrequency dynamics intrinsic to aircraft systems. Traditional kernel functions, often limited to exponential-based methods, are extended to include a wider range of kernels such as exponential squared sine and rational quadratic kernels, along with their respective firstand second-order derivatives. The proposed methodologies are first validated on a sinus cardinal test case and then applied to forecasting Mauna-Loa Carbon Dioxide (CO 2 ) concentrations and airline passenger traffic. All these advancements are integrated into the open-source Surrogate Modeling Toolbox (SMT 2.0), providing a versatile platform for both standard and customizable kernel configurations. Furthermore, the framework enables the combination of various kernels to leverage their unique strengths into composite models tailored to specific problems. The resulting framework offers a flexible toolset for engineers and researchers, paving the way for numerous future applications in metamodeling for complex, frequency-sensitive domains.
zh
[AI-68] Post-Training Quantization of Generative and Discriminative LSTM Text Classifiers: A Study of Calibration Class Balance and Robustness
【速读】:该论文旨在解决在边缘计算环境中部署文本分类模型时面临的计算和内存约束问题,同时确保模型在低延迟和高精度下的鲁棒性。其关键解决方案是通过后训练量化(Post Training Quantization, PTQ)技术,对基于长短期记忆网络(LSTM)的生成式与判别式文本分类模型进行量化评估,以探索其在不同位宽下的性能表现及对校准数据和输入噪声的敏感性。研究重点在于分析校准数据分布对量化后模型性能的影响,特别是生成式LSTM分类器在低比特宽度下因类别不平衡导致的权重适应不足问题。
链接: https://arxiv.org/abs/2507.09687
作者: Md Mushfiqur Rahaman,Elliot Chang,Tasmiah Haque,Srinjoy Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Text classification plays a pivotal role in edge computing applications like industrial monitoring, health diagnostics, and smart assistants, where low latency and high accuracy are both key requirements. Generative classifiers, in particular, have been shown to exhibit robustness to out-of-distribution and noisy data, which is an extremely critical consideration for deployment in such real-time edge environments. However, deploying such models on edge devices faces computational and memory constraints. Post Training Quantization (PTQ) reduces model size and compute costs without retraining, making it ideal for edge deployment. In this work, we present a comprehensive comparative study of generative and discriminative Long Short Term Memory (LSTM)-based text classification models with PTQ using the Brevitas quantization library. We evaluate both types of classifier models across multiple bitwidths and assess their robustness under regular and noisy input conditions. We find that while discriminative classifiers remain robust, generative ones are more sensitive to bitwidth, calibration data used during PTQ, and input noise during quantized inference. We study the influence of class imbalance in calibration data for both types of classifiers, comparing scenarios with evenly and unevenly distributed class samples including their effect on weight adjustments and activation profiles during PTQ. Using test statistics derived from nonparametric hypothesis testing, we identify that using class imbalanced data during calibration introduces insufficient weight adaptation at lower bitwidths for generative LSTM classifiers, thereby leading to degraded performance. This study underscores the role of calibration data in PTQ and when generative classifiers succeed or fail under noise, aiding deployment in edge environments.
zh
[AI-69] OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization
【速读】:该论文试图解决在噪声中等规模量子(NISQ)时代中量子电路优化的问题,以提升量子计算任务的执行效率和可靠性。解决方案的关键在于提出了一种基于深度强化学习(DRL)的模块化框架OrQstrator,其核心是通过一个协调引擎智能调度三种互补的电路优化器:基于DRL的电路重写器、领域专用优化器以及参数化电路实例化器,从而在考虑电路结构、硬件约束和后端性能特征的基础上,实现对量子电路的深度与门数优化、局部门重构及模板电路优化。
链接: https://arxiv.org/abs/2507.09682
作者: Laura Baird,Armin Moin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: IEEE International Conference on Quantum Computing and Engineering (QCE) 2025 - Extended Abstract
Abstract:We propose a novel approach, OrQstrator, which is a modular framework for conducting quantum circuit optimization in the Noisy Intermediate-Scale Quantum (NISQ) era. Our framework is powered by Deep Reinforcement Learning (DRL). Our orchestration engine intelligently selects among three complementary circuit optimizers: A DRL-based circuit rewriter trained to reduce depth and gate count via learned rewrite sequences; a domain-specific optimizer that performs efficient local gate resynthesis and numeric optimization; a parameterized circuit instantiator that improves compilation by optimizing template circuits during gate set translation. These modules are coordinated by a central orchestration engine that learns coordination policies based on circuit structure, hardware constraints, and backend-aware performance features such as gate count, depth, and expected fidelity. The system outputs an optimized circuit for hardware-aware transpilation and execution, leveraging techniques from an existing state-of-the-art approach, called the NISQ Analyzer, to adapt to backend constraints.
zh
[AI-70] Conformal Prediction for Privacy-Preserving Machine Learning
【速读】:该论文试图解决在隐私保护机器学习中如何实现严格的不确定性量化问题,特别是在确定性加密数据上的监督学习场景。解决方案的关键在于将共形预测(Conformal Prediction, CP)方法应用于加密域,利用固定密钥加密下数据的可交换性,使得CP方法在加密数据上仍能保持有效性。通过实验验证,模型在加密数据上仍能提取有意义的结构,并在预测集覆盖性和准确性方面表现出色。
链接: https://arxiv.org/abs/2507.09678
作者: Alexander David Balinsky,Dominik Krzeminski,Alexander Balinsky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:
Abstract:We investigate the integration of Conformal Prediction (CP) with supervised learning on deterministically encrypted data, aiming to bridge the gap between rigorous uncertainty quantification and privacy-preserving machine learning. Using AES-encrypted variants of the MNIST dataset, we demonstrate that CP methods remain effective even when applied directly in the encrypted domain, owing to the preservation of data exchangeability under fixed-key encryption. We test traditional p -value-based against e -value-based conformal predictors. Our empirical evaluation reveals that models trained on deterministically encrypted data retain the ability to extract meaningful structure, achieving 36.88% test accuracy – significantly above random guessing (9.56%) observed with per-instance encryption. Moreover, e -value-based CP achieves predictive set coverage of over 60% with 4.3 loss-threshold calibration, correctly capturing the true label in 4888 out of 5000 test cases. In contrast, the p -value-based CP yields smaller predictive sets but with reduced coverage accuracy. These findings highlight both the promise and limitations of CP in encrypted data settings and underscore critical trade-offs between prediction set compactness and reliability. %Our work sets a foundation for principled uncertainty quantification in secure, privacy-aware learning systems.
zh
[AI-71] SimStep: Chain-of-Abstractions for Incremental Specification and Debugging of AI-Generated Interactive Simulations
【速读】:该论文试图解决生成式AI在编程中因绕过直接代码编写而丧失的编程核心功能,如可追溯性、逐步细化和行为测试等问题。其解决方案的关键在于提出Chain-of-Abstractions (CoA)框架,通过将合成过程分解为一系列认知上有意义且任务对齐的表示形式,以恢复这些核心功能,同时保持自然语言的表达灵活性。
链接: https://arxiv.org/abs/2507.09664
作者: Zoe Kaputa,Anika Rajaram,Vryan Almanon Feliciano,Zhuoyue Lyu,Maneesh Agrawala,Hari Subramonyam
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Programming-by-prompting with generative AI offers a new paradigm for end-user programming, shifting the focus from syntactic fluency to semantic intent. This shift holds particular promise for non-programmers such as educators, who can describe instructional goals in natural language to generate interactive learning content. Yet in bypassing direct code authoring, many of programming’s core affordances - such as traceability, stepwise refinement, and behavioral testing - are lost. We propose the Chain-of-Abstractions (CoA) framework as a way to recover these affordances while preserving the expressive flexibility of natural language. CoA decomposes the synthesis process into a sequence of cognitively meaningful, task-aligned representations that function as checkpoints for specification, inspection, and refinement. We instantiate this approach in SimStep, an authoring environment for teachers that scaffolds simulation creation through four intermediate abstractions: Concept Graph, Scenario Graph, Learning Goal Graph, and UI Interaction Graph. To address ambiguities and misalignments, SimStep includes an inverse correction process that surfaces in-filled model assumptions and enables targeted revision without requiring users to manipulate code. Evaluations with educators show that CoA enables greater authoring control and interpretability in programming-by-prompting workflows.
zh
[AI-72] KEN: Knowledge Augmentation and Emotion Guidance Network for Multimodal Fake News Detection ACM-MM2025
【速读】:该论文旨在解决社交媒体上多模态虚假新闻的准确检测问题,特别是针对图像语义理解不足以及仅依赖有限文本信息导致的新闻真实性判断困难。此外,现有方法对不同情感类型的新闻采用统一处理方式,未能考虑其类别间差异,从而影响了性能。论文提出的解决方案关键在于构建一种知识增强与情感引导网络(KEN),通过利用大视觉语言模型(LVLM)强大的语义理解和广泛的世界知识,为图像生成描述以全面理解图像内容,为文本检索证据以打破信息孤岛;同时通过平衡学习考虑不同情感类型新闻的类间差异,实现情感类型与真实性之间细粒度关系的建模。
链接: https://arxiv.org/abs/2507.09647
作者: Peican Zhu,Yubo Jing,Le Cheng,Keke Tang,Yangming Guo
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: Accepted by ACM MM 2025
Abstract:In recent years, the rampant spread of misinformation on social media has made accurate detection of multimodal fake news a critical research focus. However, previous research has not adequately understood the semantics of images, and models struggle to discern news authenticity with limited textual information. Meanwhile, treating all emotional types of news uniformly without tailored approaches further leads to performance degradation. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). On the one hand, we effectively leverage LVLM’s powerful semantic understanding and extensive world knowledge. For images, the generated captions provide a comprehensive understanding of image content and scenes, while for text, the retrieved evidence helps break the information silos caused by the closed and limited text and context. On the other hand, we consider inter-class differences between different emotional types of news through balanced learning, achieving fine-grained modeling of the relationship between emotional types and authenticity. Extensive experiments on two real-world datasets demonstrate the superiority of our KEN.
zh
[AI-73] humancompatible.interconnect: Testing Properties of Repeated Uses of Interconnections of AI Systems
【速读】:该论文试图解决多智能体系统中人工智能(Artificial Intelligence, AI)系统的公平性和鲁棒性(robustness)的先验保证问题。解决方案的关键在于使用基于PyTorch的开源工具包,通过随机控制技术对AI系统的相互连接及其重复使用特性进行建模,从而在闭环(closed-loop)框架下实现对公平性和鲁棒性的建模,并提供相应的先验保证。
链接: https://arxiv.org/abs/2507.09626
作者: Rodion Nazarov,Anthony Quinn,Robert Shorten,Jakub Marecek
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Artificial intelligence (AI) systems often interact with multiple agents. The regulation of such AI systems often requires that \em a priori/ guarantees of fairness and robustness be satisfied. With stochastic models of agents’ responses to the outputs of AI systems, such \em a priori/ guarantees require non-trivial reasoning about the corresponding stochastic systems. Here, we present an open-source PyTorch-based toolkit for the use of stochastic control techniques in modelling interconnections of AI systems and properties of their repeated uses. It models robustness and fairness desiderata in a closed-loop fashion, and provides \em a priori/ guarantees for these interconnections. The PyTorch-based toolkit removes much of the complexity associated with the provision of fairness guarantees for closed-loop models of multi-agent systems.
zh
[AI-74] Bridging Bots: from Perception to Action via Multimodal-LMs and Knowledge Graphs
【速读】:该论文试图解决当前个人服务机器人在家庭环境中部署时面临的系统孤立问题,即现有系统依赖于特定软硬件的专有硬编码解决方案,导致难以跨平台适应和扩展。其解决方案的关键在于提出一种神经符号框架,将多模态语言模型的感知能力与知识图谱(Knowledge Graph, KG)和本体的结构化表示相结合,以支持机器人应用中的互操作性。该框架生成符合本体的KG,能够在平台无关的情况下指导机器人行为,并通过整合机器人感知数据、本体和多模态模型进行评估,验证了其一致性与有效性。
链接: https://arxiv.org/abs/2507.09617
作者: Margherita Martorana,Francesca Urgese,Mark Adamik,Ilaria Tiddi
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Personal service robots are deployed to support daily living in domestic environments, particularly for elderly and individuals requiring assistance. These robots must perceive complex and dynamic surroundings, understand tasks, and execute context-appropriate actions. However, current systems rely on proprietary, hard-coded solutions tied to specific hardware and software, resulting in siloed implementations that are difficult to adapt and scale across platforms. Ontologies and Knowledge Graphs (KGs) offer a solution to enable interoperability across systems, through structured and standardized representations of knowledge and reasoning. However, symbolic systems such as KGs and ontologies struggle with raw and noisy sensory input. In contrast, multimodal language models are well suited for interpreting input such as images and natural language, but often lack transparency, consistency, and knowledge grounding. In this work, we propose a neurosymbolic framework that combines the perceptual strengths of multimodal language models with the structured representations provided by KGs and ontologies, with the aim of supporting interoperability in robotic applications. Our approach generates ontology-compliant KGs that can inform robot behavior in a platform-independent manner. We evaluated this framework by integrating robot perception data, ontologies, and five multimodal models (three LLaMA and two GPT models), using different modes of neural-symbolic interaction. We assess the consistency and effectiveness of the generated KGs across multiple runs and configurations, and perform statistical analyzes to evaluate performance. Results show that GPT-o1 and LLaMA 4 Maverick consistently outperform other models. However, our findings also indicate that newer models do not guarantee better results, highlighting the critical role of the integration strategy in generating ontology-compliant KGs.
zh
[AI-75] he Hidden Costs of AI: A Review of Energy E-Waste and Inequality in Model Development
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)快速发展所带来的环境和伦理问题,包括能源消耗、电子废弃物(e-waste)、计算资源获取的不平等以及网络安全系统的隐性能源负担。其解决方案的关键在于识别研究空白,并倡导可持续、透明和公平的AI发展实践,以确保AI的进步与伦理责任和环境管理相协调。
链接: https://arxiv.org/abs/2507.09611
作者: Jenis Winsta
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 5 pages, 3 figures
Abstract:Artificial intelligence (AI) has made remarkable progress in recent years, yet its rapid expansion brings overlooked environmental and ethical challenges. This review explores four critical areas where AI’s impact extends beyond performance: energy consumption, electronic waste (e-waste), inequality in compute access, and the hidden energy burden of cybersecurity systems. Drawing from recent studies and institutional reports, the paper highlights systemic issues such as high emissions from model training, rising hardware turnover, global infrastructure disparities, and the energy demands of securing AI. By connecting these concerns, the review contributes to Responsible AI discourse by identifying key research gaps and advocating for sustainable, transparent, and equitable development practices. Ultimately, it argues that AI’s progress must align with ethical responsibility and environmental stewardship to ensure a more inclusive and sustainable technological future.
zh
[AI-76] DRAG D: A Federated Unlearning Data Reconstruction Attack Based on Gradient Differences
【速读】:该论文试图解决联邦学习中联邦遗忘(federated unlearning)引入的隐私泄露问题,即在客户端从全局模型中删除数据的过程中,梯度交换可能泄露被删除数据的敏感信息。解决方案的关键在于提出DRAGD攻击,通过分析遗忘前后梯度差异来重建被遗忘的数据,并进一步提出DRAGDP,利用公开的先验数据提升复杂数据集如人脸图像的重建准确性。
链接: https://arxiv.org/abs/2507.09602
作者: Bocheng Ju,Junchao Fan,Jiaqi Liu,Xiaolin Chang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning enables collaborative machine learning while preserving data privacy. However, the rise of federated unlearning, designed to allow clients to erase their data from the global model, introduces new privacy concerns. Specifically, the gradient exchanges during the unlearning process can leak sensitive information about deleted data. In this paper, we introduce DRAGD, a novel attack that exploits gradient discrepancies before and after unlearning to reconstruct forgotten data. We also present DRAGDP, an enhanced version of DRAGD that leverages publicly available prior data to improve reconstruction accuracy, particularly for complex datasets like facial images. Extensive experiments across multiple datasets demonstrate that DRAGD and DRAGDP significantly outperform existing methods in data this http URL work highlights a critical privacy vulnerability in federated unlearning and offers a practical solution, advancing the security of federated unlearning systems in real-world applications.
zh
[AI-77] HOR: Transformer Heuristics for On-Demand Retrieval
【速读】:该论文试图解决非技术用户在企业数据库中进行实时数据访问时面临的复杂性和安全性问题。解决方案的关键在于引入THOR(Transformer Heuristics for On-Demand Retrieval)模块,该模块通过自然语言到验证后的只读SQL查询的转换,实现了零SQL操作与企业级安全性的结合。其核心架构包括监督代理、动态模式检索、SQL生成代理以及自我校正评分循环,确保了查询的准确性、容错执行和合规性。
链接: https://arxiv.org/abs/2507.09592
作者: Isaac Shi,Zeyuan Li,Fan Liu,Wenli Wang,Lewei He,Yang Yang,Tianyu Shi
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce the THOR (Transformer Heuristics for On-Demand Retrieval) Module, designed and implemented by eSapiens, a secure, scalable engine that transforms natural-language questions into verified, read-only SQL analytics for enterprise databases. The Text-to-SQL module follows a decoupled orchestration/execution architecture: a Supervisor Agent routes queries, Schema Retrieval dynamically injects table and column metadata, and a SQL Generation Agent emits single-statement SELECT queries protected by a read-only guardrail. An integrated Self-Correction Rating loop captures empty results, execution errors, or low-quality outputs and triggers up to five LLM-driven regeneration attempts. Finally, a Result Interpretation Agent produces concise, human-readable insights and hands raw rows to the Insight Intelligence engine for visualization or forecasting. Smoke tests across finance, sales, and operations scenarios demonstrate reliable ad-hoc querying and automated periodic reporting. By embedding schema awareness, fault-tolerant execution, and compliance guardrails, the THOR Module empowers non-technical users to access live data with zero-SQL simplicity and enterprise-grade safety. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.09592 [cs.DB] (or arXiv:2507.09592v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2507.09592 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-78] Sapiens: A Platform for Secure and Auditable Retrieval-Augmented Generation
【速读】:该论文旨在解决企业在高风险领域(如法律和金融)中实现可信、可审计的AI工作流的问题,同时确保数据安全与知识保留。其解决方案的关键在于构建eSapiens平台,该平台围绕商业导向的三元组——专有数据、操作流程以及任何主流大语言模型(LLM)进行设计,提供全栈控制以保障AI资产的安全性,并通过AI代理(Sapiens)实现任务自动化与洞察力增强。此外,平台集成结构化文档摄入、混合向量检索及无代码编排功能,支持多种顶级LLM,并引入THOR Agent处理结构化SQL风格查询,从而提升企业数据库的可操作性与洞察深度。
链接: https://arxiv.org/abs/2507.09588
作者: Isaac Shi,Zeyuan Li,Fan Liu,Wenli Wang,Lewei He,Yang Yang,Tianyu Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present eSapiens, an AI-as-a-Service (AIaaS) platform engineered around a business-oriented trifecta: proprietary data, operational workflows, and any major agnostic Large Language Model (LLM). eSapiens gives businesses full control over their AI assets, keeping everything in-house for AI knowledge retention and data security. eSapiens AI Agents (Sapiens) empower your team by providing valuable insights and automating repetitive tasks, enabling them to focus on high-impact work and drive better business outcomes. The system integrates structured document ingestion, hybrid vector retrieval, and no-code orchestration via LangChain, and supports top LLMs including OpenAI, Claude, Gemini, and DeepSeek. A key component is the THOR Agent, which handles structured SQL-style queries and generates actionable insights over enterprise databases. To evaluate the system, we conduct two experiments. First, a retrieval benchmark on legal corpora reveals that a chunk size of 512 tokens yields the highest retrieval precision (Top-3 accuracy: 91.3%). Second, a generation quality test using TRACe metrics across five LLMs shows that eSapiens delivers more context-consistent outputs with up to a 23% improvement in factual alignment. These results demonstrate the effectiveness of eSapiens in enabling trustworthy, auditable AI workflows for high-stakes domains like legal and finance. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.09588 [cs.AI] (or arXiv:2507.09588v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.09588 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Fan Liu [view email] [v1] Sun, 13 Jul 2025 11:41:44 UTC (1,016 KB)
zh
[AI-79] A Serverless Architecture for Real-Time Stock Analysis using Large Language Models : An Iterative Development and Debugging Case Study STOC
【速读】:该论文试图解决如何利用强大的、可访问的大型语言模型(Large Language Models, LLMs)来 democratize(民主化)金融数据分析的问题,具体表现为构建一个无需服务器的实时股票分析系统。解决方案的关键在于设计并实现一个事件驱动的、低开销的架构,该架构通过Google的Gemini API进行定性评估,利用GitHub Actions自动化数据摄取与处理,并通过解耦的静态前端展示结果,从而实现了高效、低成本的金融数据分析系统。
链接: https://arxiv.org/abs/2507.09583
作者: Taniv Ashraf
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 6 pages. The live application can be viewed at this https URL and the source code is available at this https URL
Abstract:The advent of powerful, accessible Large Language Models (LLMs) like Google’s Gemini presents new opportunities for democratizing financial data analysis. This paper documents the design, implementation, and iterative debugging of a novel, serverless system for real-time stock analysis. The system leverages the Gemini API for qualitative assessment, automates data ingestion and processing via GitHub Actions, and presents the findings through a decoupled, static frontend. We detail the architectural evolution of the system, from initial concepts to a robust, event-driven pipeline, highlighting the practical challenges encountered during deployment. A significant portion of this paper is dedicated to a case study on the debugging process, covering common software errors, platform-specific permission issues, and rare, environment-level platform bugs. The final architecture operates at a near-zero cost, demonstrating a viable model for individuals to build sophisticated AI-powered financial tools. The operational application is publicly accessible, and the complete source code is available for review. We conclude by discussing the role of LLMs in financial analysis, the importance of robust debugging methodologies, and the emerging paradigm of human-AI collaboration in software development.
zh
[AI-80] Identifying Offline Metrics that Predict Online Impact: A Prag matic Strategy for Real-World Recommender Systems RECSYS2025
【速读】:该论文试图解决推荐系统中建立离线与在线指标之间可靠关系的问题,以预测实际性能。其解决方案的关键在于引入一种实用策略,用于识别与在线影响对齐的离线指标,该方法能够同时服务于多个具有不同离线性能指标的测试组,并由单一模型控制,具备模型无关性,适用于基于神经网络的系统,从而实现跨架构和领域的广泛适用性。
链接: https://arxiv.org/abs/2507.09566
作者: Timo Wilm,Philipp Normann
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work was accepted for publication in the 19th ACM Conference on Recommender Systems (RecSys 2025). The final published version will be available at the ACM Digital Library
Abstract:A critical challenge in recommender systems is to establish reliable relationships between offline and online metrics that predict real-world performance. Motivated by recent advances in Pareto front approximation, we introduce a pragmatic strategy for identifying offline metrics that align with online impact. A key advantage of this approach is its ability to simultaneously serve multiple test groups, each with distinct offline performance metrics, in an online experiment controlled by a single model. The method is model-agnostic for systems with a neural network backbone, enabling broad applicability across architectures and domains. We validate the strategy through a large-scale online experiment in the field of session-based recommender systems on the OTTO e-commerce platform. The online experiment identifies significant alignments between offline metrics and real-word click-through rate, post-click conversion rate and units sold. Our strategy provides industry practitioners with a valuable tool for understanding offline-to-online metric relationships and making informed, data-driven decisions.
zh
[AI-81] Learning to Control Dynamical Agents via Spiking Neural Networks and Metropolis-Hastings Sampling
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在强化学习(Reinforcement Learning, RL)任务中训练困难的问题,尤其是由于基于脉冲的通信具有非可微性而难以应用梯度下降方法。其解决方案的关键在于引入马尔可夫链蒙特卡洛(Metropolis-Hastings, MH)采样,这是一种贝叶斯推断技术,用于在不依赖梯度的方法下训练SNN,通过累积奖励信号迭代地提出并概率性地接受网络参数更新,从而绕过反向传播的限制,并实现在类脑计算平台上的直接优化。
链接: https://arxiv.org/abs/2507.09540
作者: Ali Safa,Farida Mohsen,Ali Al-Zawqari
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Spiking Neural Networks (SNNs) offer biologically inspired, energy-efficient alternatives to traditional Deep Neural Networks (DNNs) for real-time control systems. However, their training presents several challenges, particularly for reinforcement learning (RL) tasks, due to the non-differentiable nature of spike-based communication. In this work, we introduce what is, to our knowledge, the first framework that employs Metropolis-Hastings (MH) sampling, a Bayesian inference technique, to train SNNs for dynamical agent control in RL environments without relying on gradient-based methods. Our approach iteratively proposes and probabilistically accepts network parameter updates based on accumulated reward signals, effectively circumventing the limitations of backpropagation while enabling direct optimization on neuromorphic platforms. We evaluated this framework on two standard control benchmarks: AcroBot and CartPole. The results demonstrate that our MH-based approach outperforms conventional Deep Q-Learning (DQL) baselines and prior SNN-based RL approaches in terms of maximizing the accumulated reward while minimizing network resources and training episodes.
zh
[AI-82] On the Importance of Neural Membrane Potential Leakage for LIDAR-based Robot Obstacle Avoidance using Spiking Neural Networks
【速读】:该论文试图解决在资源受限的自主机器人应用中,如何利用生成式神经网络(Spiking Neural Networks, SNNs)实现从激光雷达(LIDAR)数据中直接进行机器人导航和避障的问题。解决方案的关键在于通过精确调整脉冲漏电积分-放电(Leaky Integrate-and-Fire, LIF)神经元的膜电位漏电常数,以提升SNN在处理LIDAR数据时的精度,从而达到与非脉冲卷积神经网络(Convolutional Neural Network, CNN)相当的机器人控制精度。
链接: https://arxiv.org/abs/2507.09538
作者: Zainab Ali,Lujayn Al-Amir,Ali Safa
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Using neuromorphic computing for robotics applications has gained much attention in recent year due to the remarkable ability of Spiking Neural Networks (SNNs) for high-precision yet low memory and compute complexity inference when implemented in neuromorphic hardware. This ability makes SNNs well-suited for autonomous robot applications (such as in drones and rovers) where battery resources and payload are typically limited. Within this context, this paper studies the use of SNNs for performing direct robot navigation and obstacle avoidance from LIDAR data. A custom robot platform equipped with a LIDAR is set up for collecting a labeled dataset of LIDAR sensing data together with the human-operated robot control commands used for obstacle avoidance. Crucially, this paper provides what is, to the best of our knowledge, a first focused study about the importance of neuron membrane leakage on the SNN precision when processing LIDAR data for obstacle avoidance. It is shown that by carefully tuning the membrane potential leakage constant of the spiking Leaky Integrate-and-Fire (LIF) neurons used within our SNN, it is possible to achieve on-par robot control precision compared to the use of a non-spiking Convolutional Neural Network (CNN). Finally, the LIDAR dataset collected during this work is released as open-source with the hope of benefiting future research.
zh
[AI-83] Consistency Trajectory Planning : High-Quality and Efficient Trajectory Optimization for Offline Model-Based Reinforcement Learning
【速读】:该论文试图解决传统基于扩散模型的轨迹规划方法在计算成本高、迭代采样过程耗时的问题,特别是在长周期目标条件任务中效率不足的问题。解决方案的关键在于提出了一种新的离线模型基础强化学习方法——一致性轨迹规划(Consistency Trajectory Planning, CTP),该方法利用最近提出的共识轨迹模型(Consistency Trajectory Model, CTM)实现高效的轨迹优化,能够在不显著降低策略质量的情况下,支持快速的单步轨迹生成。
链接: https://arxiv.org/abs/2507.09534
作者: Guanquan Wang,Takuya Hiraoka,Yoshimasa Tsuruoka
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:This paper introduces Consistency Trajectory Planning (CTP), a novel offline model-based reinforcement learning method that leverages the recently proposed Consistency Trajectory Model (CTM) for efficient trajectory optimization. While prior work applying diffusion models to planning has demonstrated strong performance, it often suffers from high computational costs due to iterative sampling procedures. CTP supports fast, single-step trajectory generation without significant degradation in policy quality. We evaluate CTP on the D4RL benchmark and show that it consistently outperforms existing diffusion-based planning methods in long-horizon, goal-conditioned tasks. Notably, CTP achieves higher normalized returns while using significantly fewer denoising steps. In particular, CTP achieves comparable performance with over 120\times speedup in inference time, demonstrating its practicality and effectiveness for high-performance, low-latency offline planning.
zh
[AI-84] An Analysis of Action-Value Temporal-Difference Methods That Learn State Values
【速读】:该论文试图解决在强化学习中,通过使用单个动作价值函数(如Q-learning和Sarsa)进行策略学习的局限性,以及探讨是否通过引入两个非对称价值函数(即状态价值和动作价值)进行bootstrapping能够带来理论和性能上的优势。其解决方案的关键在于分析基于两个价值函数的算法家族(QV-learning和AV-learning)在收敛性和样本效率方面的表现,并提出一种改进的AV-learning算法——正则化对弈Q-learning(RDQ),以在控制任务中实现优于传统方法的性能。
链接: https://arxiv.org/abs/2507.09523
作者: Brett Daley,Prabhat Nagarajan,Martha White,Marlos C. Machado
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at RLC/RLJ 2025
Abstract:The hallmark feature of temporal-difference (TD) learning is bootstrapping: using value predictions to generate new value predictions. The vast majority of TD methods for control learn a policy by bootstrapping from a single action-value function (e.g., Q-learning and Sarsa). Significantly less attention has been given to methods that bootstrap from two asymmetric value functions: i.e., methods that learn state values as an intermediate step in learning action values. Existing algorithms in this vein can be categorized as either QV-learning or AV-learning. Though these algorithms have been investigated to some degree in prior work, it remains unclear if and when it is advantageous to learn two value functions instead of just one – and whether such approaches are theoretically sound in general. In this paper, we analyze these algorithmic families in terms of convergence and sample efficiency. We find that while both families are more efficient than Expected Sarsa in the prediction setting, only AV-learning methods offer any major benefit over Q-learning in the control setting. Finally, we introduce a new AV-learning algorithm called Regularized Dueling Q-learning (RDQ), which significantly outperforms Dueling DQN in the MinAtar benchmark.
zh
[AI-85] A Mixture of Linear Corrections Generates Secure Code
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在代码生成过程中无法可靠检测或避免代码漏洞的问题。其关键解决方案是通过表示工程(representation engineering)技术,验证LLMs内部是否编码了识别代码漏洞所需的概念,并利用这些漏洞敏感的表示开发了一种推理时引导技术,即通过修正混合(Mixture of Corrections, MoC)微调模型的词元生成概率,从而有效引导LLMs生成更安全的代码,同时保持功能完整性。
链接: https://arxiv.org/abs/2507.09508
作者: Weichen Yu,Ravi Mangal,Terry Zhuo,Matt Fredrikson,Corina S. Pasareanu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have become proficient at sophisticated code-generation tasks, yet remain ineffective at reliably detecting or avoiding code vulnerabilities. Does this deficiency stem from insufficient learning about code vulnerabilities, or is it merely a result of ineffective prompting? Using representation engineering techniques, we investigate whether LLMs internally encode the concepts necessary to identify code vulnerabilities. We find that current LLMs encode precise internal representations that distinguish vulnerable from secure code–achieving greater accuracy than standard prompting approaches. Leveraging these vulnerability-sensitive representations, we develop an inference-time steering technique that subtly modulates the model’s token-generation probabilities through a mixture of corrections (MoC). Our method effectively guides LLMs to produce less vulnerable code without compromising functionality, demonstrating a practical approach to controlled vulnerability management in generated code. Notably, MoC enhances the security ratio of Qwen2.5-Coder-7B by 8.9%, while simultaneously improving functionality on HumanEval pass@1 by 2.1%.
zh
[AI-86] GenAI-based Multi-Agent Reinforcement Learning towards Distributed Agent Agent Intelligence: A Generative-RL Agent Perspective
【速读】:该论文试图解决多智能体强化学习(Multi-agent Reinforcement Learning, MARL)中面临的根本性挑战,包括联合动作空间的指数级增长、非平稳环境导致的同时学习产生移动目标,以及部分可观测性对协调的限制。现有方法仍为被动响应型,依赖刺激-反应机制,在面对新场景时表现不佳。论文提出的解决方案关键在于通过生成式AI(Generative AI)驱动的强化学习实现从被动响应到主动智能的范式转变,将智能体重新概念化为能够合成复杂多智能体动态并基于对未来交互的预测性理解做出前瞻性决策的高级生成模型,从而实现环境演化建模、其他智能体行为预测、协调行动序列生成及考虑长期动态的战略推理。
链接: https://arxiv.org/abs/2507.09495
作者: Hang Wang,Junshan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Position paper
Abstract:Multi-agent reinforcement learning faces fundamental challenges that conventional approaches have failed to overcome: exponentially growing joint action spaces, non-stationary environments where simultaneous learning creates moving targets, and partial observability that constrains coordination. Current methods remain reactive, employing stimulus-response mechanisms that fail when facing novel scenarios. We argue for a transformative paradigm shift from reactive to proactive multi-agent intelligence through generative AI-based reinforcement learning. This position advocates reconceptualizing agents not as isolated policy optimizers, but as sophisticated generative models capable of synthesizing complex multi-agent dynamics and making anticipatory decisions based on predictive understanding of future interactions. Rather than responding to immediate observations, generative-RL agents can model environment evolution, predict other agents’ behaviors, generate coordinated action sequences, and engage in strategic reasoning accounting for long-term dynamics. This approach leverages pattern recognition and generation capabilities of generative AI to enable proactive decision-making, seamless coordination through enhanced communication, and dynamic adaptation to evolving scenarios. We envision this paradigm shift will unlock unprecedented possibilities for distributed intelligence, moving beyond individual optimization toward emergent collective behaviors representing genuine collaborative intelligence. The implications extend across autonomous systems, robotics, and human-AI collaboration, promising solutions to coordination challenges intractable under traditional reactive frameworks.
zh
[AI-87] Enhancing ALS Progression Tracking with Semi-Supervised ALSFRS-R Scores Estimated from Ambient Home Health Monitoring
【速读】:该论文试图解决肌萎缩侧索硬化症(ALS)患者功能衰退的临床监测问题,传统方法依赖于定期评估,可能无法捕捉两次就诊之间的关键变化。解决方案的关键在于利用半监督回归模型,通过连续的家庭传感器监测数据估计ALSFRS-R量表轨迹的变化率,结合不同的模型范式(如个体批量学习、队列级批量学习与增量微调迁移学习)以及自注意力伪标签插值方法,以提高对ALS功能衰退的预测准确性。研究发现,迁移学习在28/32个对比中降低了ALSFRS-R子量表的预测误差,而自注意力插值在子量表级别模型中表现最佳,能够捕捉复杂的非线性进展模式。
链接: https://arxiv.org/abs/2507.09460
作者: Noah Marchal,William E. Janes,Mihail Popescu,Xing Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 8 Figures
Abstract:Clinical monitoring of functional decline in ALS relies on periodic assessments that may miss critical changes occurring between visits. To address this gap, semi-supervised regression models were developed to estimate rates of decline in a case series cohort by targeting ALSFRS- R scale trajectories with continuous in-home sensor monitoring data. Our analysis compared three model paradigms (individual batch learning and cohort-level batch versus incremental fine-tuned transfer learning) across linear slope, cubic polynomial, and ensembled self-attention pseudo-label interpolations. Results revealed cohort homogeneity across functional domains responding to learning methods, with transfer learning improving prediction error for ALSFRS-R subscales in 28 of 32 contrasts (mean RMSE=0.20(0.04)), and individual batch learning for predicting the composite scale (mean RMSE=3.15(1.25)) in 2 of 3. Self-attention interpolation achieved the lowest prediction error for subscale-level models (mean RMSE=0.19(0.06)), capturing complex nonlinear progression patterns, outperforming linear and cubic interpolations in 20 of 32 contrasts, though linear interpolation proved more stable in all ALSFRS-R composite scale models (mean RMSE=0.23(0.10)). We identified distinct homogeneity-heterogeneity profiles across functional domains with respiratory and speech exhibiting patient-specific patterns benefiting from personalized incremental adaptation, while swallowing and dressing functions followed cohort-level trajectories suitable for transfer models. These findings suggest that matching learning and pseudo-labeling techniques to functional domain-specific homogeneity-heterogeneity profiles enhances predictive accuracy in ALS progression tracking. Integrating adaptive model selection within sensor monitoring platforms could enable timely interventions and scalable deployment in future multi-center studies.
zh
[AI-88] Fourier Basis Mapping: A Time-Frequency Learning Framework for Time Series Forecasting
【速读】:该论文试图解决现有基于傅里叶变换的时序预测方法在起始周期不一致和序列长度不一致方面的问题,以及它们对频率成分解释不准确和忽略时间信息的缺陷。解决方案的关键在于提出一种新的傅里叶基映射(Fourier Basis Mapping, FBM)方法,通过傅里叶基展开和在时频空间中的映射来整合时频特征,从而提取显式的频率特征并保留时间特性。
链接: https://arxiv.org/abs/2507.09445
作者: Runze Yang,Longbing Cao,Xin You,Kun Fang,Jianxun Li,Jie Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 18 pages, 6 figures
Abstract:The integration of Fourier transform and deep learning opens new avenues for time series forecasting. We reconsider the Fourier transform from a basis functions perspective. Specifically, the real and imaginary parts of the frequency components can be regarded as the coefficients of cosine and sine basis functions at tiered frequency levels, respectively. We find that existing Fourier-based methods face inconsistent starting cycles and inconsistent series length issues. They fail to interpret frequency components precisely and overlook temporal information. Accordingly, the novel Fourier Basis Mapping (FBM) method addresses these issues by integrating time-frequency features through Fourier basis expansion and mapping in the time-frequency space. Our approach extracts explicit frequency features while preserving temporal characteristics. FBM supports plug-and-play integration with various types of neural networks by only adjusting the first initial projection layer for better performance. First, we propose FBM-L, FBM-NL, and FBM-NP to enhance linear, MLP-based, and Transformer-based models, respectively, demonstrating the effectiveness of time-frequency features. Next, we propose a synergetic model architecture, termed FBM-S, which decomposes the seasonal, trend, and interaction effects into three separate blocks, each designed to model time-frequency features in a specialized manner. Finally, we introduce several techniques tailored for time-frequency features, including interaction masking, centralization, patching, rolling window projection, and multi-scale down-sampling. The results are validated on diverse real-world datasets for both long-term and short-term forecasting tasks with SOTA performance.
zh
[AI-89] ransformers Dont In-Context Learn Least Squares Regression ICML2025
【速读】:该论文试图解决生成式 AI (Generative AI) 中的上下文学习(In-context learning, ICL)机制问题,即理解大规模预训练变换器模型如何在推理过程中通过示例输入输出对解决新任务。其解决方案的关键在于通过合成线性回归任务分析变换器在推理时的学习行为,并利用分布外泛化实验揭示变换器在提示分布变化后的泛化能力不足,从而质疑其是否真正实现了类似普通最小二乘法(OLS)的算法。此外,研究还通过残差流中的表征谱分析,揭示了预训练语料库对ICL行为的影响。
链接: https://arxiv.org/abs/2507.09440
作者: Joshua Hill,Benjamin Eyre,Elliot Creager
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 16 figures, ICML 2025 Workshop on Reliable and Responsible Foundation Models
Abstract:In-context learning (ICL) has emerged as a powerful capability of large pretrained transformers, enabling them to solve new tasks implicit in example input-output pairs without any gradient updates. Despite its practical success, the mechanisms underlying ICL remain largely mysterious. In this work we study synthetic linear regression to probe how transformers implement learning at inference time. Previous works have demonstrated that transformers match the performance of learning rules such as Ordinary Least Squares (OLS) regression or gradient descent and have suggested ICL is facilitated in transformers through the learned implementation of one of these techniques. In this work, we demonstrate through a suite of out-of-distribution generalization experiments that transformers trained for ICL fail to generalize after shifts in the prompt distribution, a behaviour that is inconsistent with the notion of transformers implementing algorithms such as OLS. Finally, we highlight the role of the pretraining corpus in shaping ICL behaviour through a spectral analysis of the learned representations in the residual stream. Inputs from the same distribution as the training data produce representations with a unique spectral signature: inputs from this distribution tend to have the same top two singular vectors. This spectral signature is not shared by out-of-distribution inputs, and a metric characterizing the presence of this signature is highly correlated with low loss.
zh
[AI-90] Dynamic Sparse Causal-Attention Temporal Networks for Interpretable Causality Discovery in Multivariate Time Series
【速读】:该论文旨在解决多变量时间序列(Multivariate Time Series, MTS)中因果关系识别的问题,特别是在金融和营销领域,由于复杂的依赖关系和滞后效应,传统分析方法面临挑战。其解决方案的关键在于提出一种新型架构——动态稀疏因果注意力时序网络(DyCAST-Net),通过集成扩张时间卷积和动态稀疏注意力机制来增强因果发现能力。该方法利用扩张卷积捕捉多尺度时间依赖性,并通过自适应阈值策略在注意力机制中消除虚假连接,从而确保结果的准确性和可解释性。
链接: https://arxiv.org/abs/2507.09439
作者: Meriem Zerkouk,Miloud Mihoubi,Belkacem Chikhaoui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Understanding causal relationships in multivariate time series (MTS) is essential for effective decision-making in fields such as finance and marketing, where complex dependencies and lagged effects challenge conventional analytical approaches. We introduce Dynamic Sparse Causal-Attention Temporal Networks for Interpretable Causality Discovery in MTS (DyCAST-Net), a novel architecture designed to enhance causal discovery by integrating dilated temporal convolutions and dynamic sparse attention mechanisms. DyCAST-Net effectively captures multiscale temporal dependencies through dilated convolutions while leveraging an adaptive thresholding strategy in its attention mechanism to eliminate spurious connections, ensuring both accuracy and interpretability. A statistical shuffle test validation further strengthens robustness by filtering false positives and improving causal inference reliability. Extensive evaluations on financial and marketing datasets demonstrate that DyCAST-Net consistently outperforms existing models such as TCDF, GCFormer, and CausalFormer. The model provides a more precise estimation of causal delays and significantly reduces false discoveries, particularly in noisy environments. Moreover, attention heatmaps offer interpretable insights, uncovering hidden causal patterns such as the mediated effects of advertising on consumer behavior and the influence of macroeconomic indicators on financial markets. Case studies illustrate DyCAST-Net’s ability to detect latent mediators and lagged causal factors, making it particularly effective in high-dimensional, dynamic settings. The model’s architecture enhanced by RMSNorm stabilization and causal masking ensures scalability and adaptability across diverse application domains
zh
[AI-91] LLM -Stackelberg Games: Conjectural Reasoning Equilibria and Their Applications to Spearphishing
【速读】:该论文试图解决在战略互动中如何有效整合大型语言模型(LLM)以建模具有有限理性、信息不对称和元认知适应性的决策过程的问题。其解决方案的关键在于提出LLM-Stackelberg博弈框架,该框架通过结构化提示让代理进行推理、利用LLM生成概率行为,并通过内部认知和信念更新调整策略,从而定义了推理与行为均衡及推测性推理均衡等概念,以捕捉复杂的人机交互场景中的动态特性。
链接: https://arxiv.org/abs/2507.09407
作者: Quanyan Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
备注:
Abstract:We introduce the framework of LLM-Stackelberg games, a class of sequential decision-making models that integrate large language models (LLMs) into strategic interactions between a leader and a follower. Departing from classical Stackelberg assumptions of complete information and rational agents, our formulation allows each agent to reason through structured prompts, generate probabilistic behaviors via LLMs, and adapt their strategies through internal cognition and belief updates. We define two equilibrium concepts: reasoning and behavioral equilibrium, which aligns an agent’s internal prompt-based reasoning with observable behavior, and conjectural reasoning equilibrium, which accounts for epistemic uncertainty through parameterized models over an opponent’s response. These layered constructs capture bounded rationality, asymmetric information, and meta-cognitive adaptation. We illustrate the framework through a spearphishing case study, where a sender and a recipient engage in a deception game using structured reasoning prompts. This example highlights the cognitive richness and adversarial potential of LLM-mediated interactions. Our results show that LLM-Stackelberg games provide a powerful paradigm for modeling decision-making in domains such as cybersecurity, misinformation, and recommendation systems.
zh
[AI-92] Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers
【速读】:该论文试图解决通过安全对齐技术(如人类反馈强化学习,RLHF)训练的大型语言模型(LLMs)中出现的隐性欺骗行为问题,即模型输出看似合规但可能微妙误导或遗漏关键信息。解决方案的关键在于提出一种名为对抗性激活补丁(adversarial activation patching)的机制可解释性框架,该框架利用激活补丁作为一种对抗工具,用于诱导、检测和缓解Transformer模型中的欺骗行为。通过从“欺骗性”提示中获取激活并将其植入特定层的安全前向传递中,该方法模拟了模型的脆弱性并量化了欺骗率。
链接: https://arxiv.org/abs/2507.09406
作者: Santhosh Kumar Ravindran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) aligned for safety through techniques like reinforcement learning from human feedback (RLHF) often exhibit emergent deceptive behaviors, where outputs appear compliant but subtly mislead or omit critical information. This paper introduces adversarial activation patching, a novel mechanistic interpretability framework that leverages activation patching as an adversarial tool to induce, detect, and mitigate such deception in transformer-based models. By sourcing activations from “deceptive” prompts and patching them into safe forward passes at specific layers, we simulate vulnerabilities and quantify deception rates. Through toy neural network simulations across multiple scenarios (e.g., 1000 trials per setup), we demonstrate that adversarial patching increases deceptive outputs to 23.9% from a 0% baseline, with layer-specific variations supporting our hypotheses. We propose six hypotheses, including transferability across models, exacerbation in multimodal settings, and scaling effects. An expanded literature review synthesizes over 20 key works in interpretability, deception, and adversarial attacks. Mitigation strategies, such as activation anomaly detection and robust fine-tuning, are detailed, alongside ethical considerations and future research directions. This work advances AI safety by highlighting patching’s dual-use potential and provides a roadmap for empirical studies on large-scale models.
zh
[AI-93] Knowledge Conceptualization Impacts RAG Efficacy
【速读】:该论文试图解决如何将可解释性与可迁移性融合到神经符号人工智能系统中的问题,特别是在面对自然语言提示时,如何有效查询三元组存储(triplestore)。其解决方案的关键在于探索知识的不同概念化和表示方式,尤其是知识的结构和复杂性对AI代理(如大型语言模型)查询能力的影响。
链接: https://arxiv.org/abs/2507.09389
作者: Chris Davis Jaldi,Anmol Saini,Elham Ghiasi,O. Divine Eziolise,Cogan Shimizu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:
Abstract:Explainability and interpretability are cornerstones of frontier and next-generation artificial intelligence (AI) systems. This is especially true in recent systems, such as large language models (LLMs), and more broadly, generative AI. On the other hand, adaptability to new domains, contexts, or scenarios is also an important aspect for a successful system. As such, we are particularly interested in how we can merge these two efforts, that is, investigating the design of transferable and interpretable neurosymbolic AI systems. Specifically, we focus on a class of systems referred to as ‘‘Agentic Retrieval-Augmented Generation’’ systems, which actively select, interpret, and query knowledge sources in response to natural language prompts. In this paper, we systematically evaluate how different conceptualizations and representations of knowledge, particularly the structure and complexity, impact an AI agent (in this case, an LLM) in effectively querying a triplestore. We report our results, which show that there are impacts from both approaches, and we discuss their impact and implications.
zh
[AI-94] Fair CCA for Fair Representation Learning: An ADNI Study
【速读】:该论文试图解决机器学习中公平性不足的问题,特别是在多模态数据表示学习中,现有方法往往忽视对下游分类任务的影响,从而限制了其应用效果。解决方案的关键在于提出一种新颖的公平CCA(Canonical Correlation Analysis)方法,确保投影后的特征与敏感属性独立,从而在不牺牲准确性的前提下提升分类任务的公平性。
链接: https://arxiv.org/abs/2507.09382
作者: Bojian Hou,Zhanliang Wang,Zhuoping Zhou,Boning Tong,Zexuan Wang,Jingxuan Bao,Duy Duong-Tran,Qi Long,Li Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Canonical correlation analysis (CCA) is a technique for finding correlations between different data modalities and learning low-dimensional representations. As fairness becomes crucial in machine learning, fair CCA has gained attention. However, previous approaches often overlook the impact on downstream classification tasks, limiting applicability. We propose a novel fair CCA method for fair representation learning, ensuring the projected features are independent of sensitive attributes, thus enhancing fairness without compromising accuracy. We validate our method on synthetic data and real-world data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), demonstrating its ability to maintain high correlation analysis performance while improving fairness in classification tasks. Our work enables fair machine learning in neuroimaging studies where unbiased analysis is essential.
zh
[AI-95] EduFlow: Advancing MLLM s Problem-Solving Proficiency through Multi-Stage Multi-Perspective Critique
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在科学任务中表现不佳的问题,特别是在需要多步骤和可解释推理的任务中。其关键问题包括缺乏科学推理模式、多步骤推理中的全局连贯性不足以及缺乏自我修正机制。解决方案的核心是提出EduFlow框架,其中包含EduPRM过程感知奖励模型,该模型通过标签和论证对推理步骤进行批判性评估,并通过课程学习在三种互补监督源上进行训练,从而实现多阶段问题求解的动态适应和迭代优化。此外,还提出了EduMCTS领域适配的搜索框架,引入了专门针对教育推理的自举动作,如自我反思机制,以促进错误修正,并利用EduPRM的细粒度反馈引导搜索至更高质量的推理轨迹。
链接: https://arxiv.org/abs/2507.09374
作者: Chenglin Zhu,Tao Zhang,Chong Li,Mingan Lin,Zenan Zhou,Jian Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages,4 figures
Abstract:Multimodal large language models (MLLMs) still perform poorly on scientific tasks, particularly those requiring multi-step and interpretable reasoning. Their limitations include insufficient scientific reasoning patterns, lack of global coherence in multi-step inference, and the absence of reflective self-correction, making them unreliable in structured scientific contexts. We introduce EduFlow, the first end-to-end framework that covers the full pipeline of educational scientific reasoning, including data selection, MCTS-based trajectory construction, model training, and output optimization. At its core is EduPRM, a process-aware reward model that critiques reasoning steps with tags and justifications. EduPRM is trained via curriculum learning on three complementary supervision sources: MCTS-guided trajectories, error-injected critiques, and teacher-student dialogues, enabling dynamic adaptation to multi-stage problem solving and iterative refinement during inference. We further propose EduMCTS, a domain-adapted search framework that introduces bootstrapping actions specifically designed for educational reasoning, such as a self-reflection mechanism that promotes reflective error correction. It further leverages EduPRM’s fine-grained feedback to guide the search toward higher-quality reasoning trajectories. By applying self-consistency and rejection sampling, we constructed EduMCTS-160K, a large-scale dataset of educational reasoning trajectories. Extensive experiments demonstrate that EduFlow enhances reasoning consistency and coherence. Code, data, and models will be released.
zh
[AI-96] A Taxonomy of Omnicidal Futures Involving Artificial Intelligence
【速读】:该论文试图解决潜在的由人工智能引发的灭绝性事件(omnicidal events)问题,即可能导致全人类或几乎全部人类死亡的情景。其解决方案的关键在于通过公开讨论这些可能性,以获得公众支持,从而推动预防性措施,降低人工智能带来的灾难性风险。
链接: https://arxiv.org/abs/2507.09369
作者: Andrew Critch,Jacob Tsimerman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This report presents a taxonomy and examples of potential omnicidal events resulting from AI: scenarios where all or almost all humans are killed. These events are not presented as inevitable, but as possibilities that we can work to avoid. Insofar as large institutions require a degree of public support in order to take certain actions, we hope that by presenting these possibilities in public, we can help to support preventive measures against catastrophic risks from AI.
zh
[AI-97] Impute With Confidence: A Framework for Uncertainty Aware Multivariate Time Series Imputation
【速读】:该论文试图解决时间序列数据中存在缺失值的问题,特别是在医疗健康领域,由于传感器长时间断开连接导致的特殊挑战。现有方法往往忽视模型不确定性或缺乏估计不确定性的机制。解决方案的关键在于引入一个通用框架,该框架能够量化并利用不确定性进行选择性填补,通过聚焦于模型最自信的值来避免高不可靠的填补结果,从而减少填补误差并提升下游任务性能。
链接: https://arxiv.org/abs/2507.09353
作者: Addison Weatherhead,Anna Goldenberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Time series data with missing values is common across many domains. Healthcare presents special challenges due to prolonged periods of sensor disconnection. In such cases, having a confidence measure for imputed values is critical. Most existing methods either overlook model uncertainty or lack mechanisms to estimate it. To address this gap, we introduce a general framework that quantifies and leverages uncertainty for selective imputation. By focusing on values the model is most confident in, highly unreliable imputations are avoided. Our experiments on multiple EHR datasets, covering diverse types of missingness, demonstrate that selectively imputing less-uncertain values not only reduces imputation errors but also improves downstream tasks. Specifically, we show performance gains in a 24-hour mortality prediction task, underscoring the practical benefit of incorporating uncertainty into time series imputation.
zh
[AI-98] When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents
【速读】:该论文试图解决基于大语言模型(Large Language Model, LLM)的代码代理在软件开发中可能引入的安全问题,特别是其无意中导致不安全实践的风险。解决方案的关键在于首次对自主代码代理进行了系统的安全评估,并开发了一个高精度的检测系统,能够识别四种主要的漏洞类别,其中信息泄露(CWE-200)最为常见。此外,研究还评估了多种缓解策略的有效性,其中GPT-4.1表现出卓越的安全意识,具有96.8%的缓解成功率。该工作为评估代码代理的安全性提供了首个全面框架,并强调了下一代LLM-based代码代理需要具备安全意识的设计。
链接: https://arxiv.org/abs/2507.09329
作者: Matous Kozak,Roshanak Zilouchian Moghaddam,Siva Sivaraman
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 15 pages
Abstract:LLM-based coding agents are rapidly being deployed in software development, yet their security implications remain poorly understood. These agents, while capable of accelerating software development, may inadvertently introduce insecure practices. We conducted the first systematic security evaluation of autonomous coding agents, analyzing over 12,000 actions across five state-of-the-art models (GPT-4o, GPT-4.1, Claude variants) on 93 real-world software setup tasks. Our findings reveal significant security concerns: 21% of agent trajectories contained insecure actions, with models showing substantial variation in security behavior. We developed a high-precision detection system that identified four major vulnerability categories, with information exposure (CWE-200) being the most prevalent one. We also evaluated mitigation strategies including feedback mechanisms and security reminders with various effectiveness between models. GPT-4.1 demonstrated exceptional security awareness with 96.8% mitigation success. Our work provides the first comprehensive framework for evaluating coding agent security and highlights the need for security-aware design of next generation LLM-based coding agents.
zh
[AI-99] Enhancing Interpretability in Software Change Management with Chain-of-Thought Reasoning
【速读】:该论文试图解决现代在线服务中频繁软件变更带来的重大风险问题,其解决方案的关键是提出SCELM(Software Change Evaluation and Lifecycle Management),这是一个端到端的自动化软件变更管理框架,旨在高效且精准地管理软件变更,从而显著降低服务故障和经济损失。
链接: https://arxiv.org/abs/2507.09315
作者: Yongqian Sun,Weihua Kuang,Chao Shen,Xidao Wen,Tinghua Zheng,Heng Liu,Shenglin Zhang,Bo Wu,Dan Pei
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 22 pages, 19 figures
Abstract:In modern online services, frequent software changes introduce significant risks. To tackle this challenge, we propose SCELM (Software Change Evaluation and Lifecycle Management), an end-to-end automated framework for software change management. SCELM aims to manage software changes efficiently and precisely, significantly reducing service failures and economic losses.
zh
[AI-100] Controllable Patching for Compute-Adaptive Surrogate Modeling of Partial Differential Equations
【速读】:该论文试图解决基于补丁的Transformer代理模型在生产环境中因固定补丁大小而导致的计算资源消耗过大的问题。其解决方案的关键在于引入两种轻量级、架构无关的模块——卷积核调制器(Convolutional Kernel Modulator, CKM)和卷积步长调制器(Convolutional Stride Modulator, CSM),这些模块能够在推理阶段动态调整补丁大小,而无需重新训练或损失精度。结合循环补丁大小滚动策略,该方法有效缓解了补丁伪影并提升了视频类预测任务的长期稳定性。
链接: https://arxiv.org/abs/2507.09264
作者: Payel Mukhopadhyay,Michael McCabe,Ruben Ohana,Miles Cranmer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Patch-based transformer surrogates have become increasingly effective for modeling spatiotemporal dynamics, but the fixed patch size is a major limitation for budget-conscience deployment in production. We introduce two lightweight, architecture-agnostic modules-the Convolutional Kernel Modulator (CKM) and Convolutional Stride Modulator (CSM)-that enable dynamic patch size control at inference in patch based models, without retraining or accuracy loss. Combined with a cyclic patch-size rollout, our method mitigates patch artifacts and improves long-term stability for video-like prediction tasks. Applied to a range of challenging 2D and 3D PDE benchmarks, our approach improves rollout fidelity and runtime efficiency. To our knowledge, this is the first framework to enable inference-time patch-size tunability in patch-based PDE surrogates. Its plug-and-play design makes it broadly applicable across architectures-establishing a general foundation for compute-adaptive modeling in PDE surrogate tasks.
zh
[AI-101] XiChen: An observation-scalable fully AI-driven global weather forecasting system with 4D variational knowledge
【速读】:该论文试图解决传统数值天气预报(Numerical Weather Prediction, NWP)系统在初始条件准备过程中依赖超级计算机计算资源导致耗时过长的问题。其解决方案的关键在于提出了一种名为XiChen的全AI驱动全球天气预报系统,该系统能够通过预训练的基础模型进行数据同化(Data Assimilation, DA)和中尺度天气预报,并在仅17秒内完成整个流程,同时通过四维变分知识的整合实现了与操作性NWP系统相当的预报精度,从而实现了对NWP系统的独立性。
链接: https://arxiv.org/abs/2507.09202
作者: Wuxin Wang,Weicheng Ni,Lilan Huang,Tao Hao,Ben Fei,Shuo Ma,Taikang Yuan,Yanlai Zhao,Kefeng Deng,Xiaoyong Li,Boheng Duan,Lei Bai,Kaijun Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Recent advancements in Artificial Intelligence (AI) demonstrate significant potential to revolutionize weather forecasting. However, most AI-driven models rely on Numerical Weather Prediction (NWP) systems for initial condition preparation, which often consumes hours on supercomputers. Here we introduce XiChen, the first observation-scalable fully AI-driven global weather forecasting system, whose entire pipeline, from Data Assimilation (DA) to medium-range forecasting, can be accomplished within only 17 seconds. XiChen is built upon a foundation model that is pre-trained for weather forecasting. Meanwhile, this model is subsequently fine-tuned to serve as both observation operators and DA models, thereby scalably assimilating conventional and raw satellite observations. Furthermore, the integration of four-dimensional variational knowledge ensures that XiChen’s DA and medium-range forecasting accuracy rivals that of operational NWP systems, amazingly achieving a skillful forecasting lead time exceeding 8.25 days. These findings demonstrate that XiChen holds strong potential toward fully AI-driven weather forecasting independent of NWP systems.
zh
[AI-102] Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System
【速读】:该论文试图解决去中心化金融(Decentralized Finance, DeFi)中由于缺乏集中监管而导致的市场操纵问题,如洗票活动和拉高出货策略。其解决方案的关键在于提出一种多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架,将操纵者与检测者的互动建模为动态对抗游戏,并引入三种创新:(1)群体相对策略优化(Group Relative Policy Optimization, GRPO)以提高稀疏奖励和部分可观测环境下的学习稳定性;(2)基于理性预期和信息不对称理论的奖励函数,以区分价格发现与操纵噪声;(3)融合大语言模型(LLM)语义特征、社交图谱信号和链上市场数据的多模态智能体流水线。该框架集成在Symphony系统中,支持去中心化的多智能体架构与信任感知学习,实现了无需中心化预言机的鲁棒操纵检测。
链接: https://arxiv.org/abs/2507.09179
作者: Ronghua Shi,Yiou Liu,Xinyu Ying,Yang Tan,Yuchun Feng,Lynn Ai,Bill Shi,Xuhui Wang,Zhuang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulation. Without centralized oversight, malicious actors coordinate shilling campaigns and pump-and-dump schemes across various platforms. We propose a Multi-Agent Reinforcement Learning (MARL) framework for decentralized manipulation detection, modeling the interaction between manipulators and detectors as a dynamic adversarial game. This framework identifies suspicious patterns using delayed token price reactions as financial this http URL method introduces three innovations: (1) Group Relative Policy Optimization (GRPO) to enhance learning stability in sparse-reward and partially observable settings; (2) a theory-based reward function inspired by rational expectations and information asymmetry, differentiating price discovery from manipulation noise; and (3) a multi-modal agent pipeline that integrates LLM-based semantic features, social graph signals, and on-chain market data for informed this http URL framework is integrated within the Symphony system, a decentralized multi-agent architecture enabling peer-to-peer agent execution and trust-aware learning through distributed logs, supporting chain-verifiable evaluation. Symphony promotes adversarial co-evolution among strategic actors and maintains robust manipulation detection without centralized oracles, enabling real-time surveillance across global DeFi this http URL on 100,000 real-world discourse episodes and validated in adversarial simulations, Hide-and-Shill achieves top performance in detection accuracy and causal attribution. This work bridges multi-agent systems with financial surveillance, advancing a new paradigm for decentralized market intelligence. All resources are available at the Hide-and-Shill GitHub repository to promote open research and reproducibility.
zh
[AI-103] Continual Reinforcement Learning by Planning with Online World Models ICML2025
【速读】:该论文试图解决持续强化学习(Continual Reinforcement Learning, CRL)中的灾难性遗忘问题,即智能体在学习新任务时可能会忘记之前学到的任务。解决方案的关键在于通过在线世界模型进行规划,具体而言是学习一个在线的Follow-The-Leader浅层模型来捕捉世界动态,并利用模型预测控制来解决由任意奖励函数指定的一组任务。该在线世界模型通过构造具有O(K2Dlog(T))的理论后悔界,从而免疫于遗忘问题。
链接: https://arxiv.org/abs/2507.09177
作者: Zichen Liu,Guoji Fu,Chao Du,Wee Sun Lee,Min Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICML 2025 Spotlight
Abstract:Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of \mathcalO(\sqrtK^2D\log(T)) under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
zh
[AI-104] owards Interpretable Drug-Drug Interaction Prediction: A Graph-Based Approach with Molecular and Network-Level Explanations
【速读】:该论文试图解决药物-药物相互作用(Drug-drug interactions, DDIs)预测中模型独立处理药物对、忽视药物对独特上下文依赖性交互以及难以整合生物相互作用网络和分子结构以提供机制性见解的问题。解决方案的关键在于提出MolecBioNet框架,该框架通过将药物对建模为统一实体,结合分子和生物医学知识,捕获宏观层面的生物相互作用和微观层面的分子影响,同时引入两种领域特定的池化策略(context-aware subgraph pooling和attention-guided influence pooling)以及互信息最小化正则化,以提升模型的准确性与可解释性。
链接: https://arxiv.org/abs/2507.09173
作者: Mengjie Chen,Ming Zhang,Cunquan Qu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
备注:
Abstract:Drug-drug interactions (DDIs) represent a critical challenge in pharmacology, often leading to adverse drug reactions with significant implications for patient safety and healthcare outcomes. While graph-based methods have achieved strong predictive performance, most approaches treat drug pairs independently, overlooking the complex, context-dependent interactions unique to drug pairs. Additionally, these models struggle to integrate biological interaction networks and molecular-level structures to provide meaningful mechanistic insights. In this study, we propose MolecBioNet, a novel graph-based framework that integrates molecular and biomedical knowledge for robust and interpretable DDI prediction. By modeling drug pairs as unified entities, MolecBioNet captures both macro-level biological interactions and micro-level molecular influences, offering a comprehensive perspective on DDIs. The framework extracts local subgraphs from biomedical knowledge graphs and constructs hierarchical interaction graphs from molecular representations, leveraging classical graph neural network methods to learn multi-scale representations of drug pairs. To enhance accuracy and interpretability, MolecBioNet introduces two domain-specific pooling strategies: context-aware subgraph pooling (CASPool), which emphasizes biologically relevant entities, and attention-guided influence pooling (AGIPool), which prioritizes influential molecular substructures. The framework further employs mutual information minimization regularization to enhance information diversity during embedding fusion. Experimental results demonstrate that MolecBioNet outperforms state-of-the-art methods in DDI prediction, while ablation studies and embedding visualizations further validate the advantages of unified drug pair modeling and multi-scale knowledge integration.
zh
[AI-105] Advanced Health Misinformation Detection Through Hybrid CNN-LSTM Models Informed by the Elaboration Likelihood Model (ELM)
【速读】:该论文试图解决新冠疫情期间社交媒体上健康谣言(health misinformation)对公共卫生工作的严重挑战。其解决方案的关键在于应用详尽可能性模型(Elaboration Likelihood Model, ELM),通过融合卷积神经网络(CNN)与长短期记忆网络(LSTM)的混合模型,提升谣言分类的准确性和可靠性。该模型整合了ELM相关的特征,如文本可读性、情感极性及启发式线索(如标点符号频率),从而显著提高了检测性能,最终实现了高精度、高召回率和高F1分数的检测效果。
链接: https://arxiv.org/abs/2507.09149
作者: Mkululi Sikosana,Sean Maudsley-Barton,Oluwaseun Ajao
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 11 Pages, 2 Figures, 3 Tables conference paper to appear in proceedings of International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA’25)
Abstract:Health misinformation during the COVID-19 pandemic has significantly challenged public health efforts globally. This study applies the Elaboration Likelihood Model (ELM) to enhance misinformation detection on social media using a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) model. The model aims to enhance the detection accuracy and reliability of misinformation classification by integrating ELM-based features such as text readability, sentiment polarity, and heuristic cues (e.g., punctuation frequency). The enhanced model achieved an accuracy of 97.37%, precision of 96.88%, recall of 98.50%, F1-score of 97.41%, and ROC-AUC of 99.50%. A combined model incorporating feature engineering further improved performance, achieving a precision of 98.88%, recall of 99.80%, F1-score of 99.41%, and ROC-AUC of 99.80%. These findings highlight the value of ELM features in improving detection performance, offering valuable contextual information. This study demonstrates the practical application of psychological theories in developing advanced machine learning algorithms to address health misinformation effectively.
zh
[AI-106] POIFormer: A Transformer-Based Framework for Accurate and Scalable Point-of-Interest Attribution
【速读】:该论文试图解决在移动性分析中准确将用户访问归因于特定兴趣点(POI)的问题,这一问题由于GPS定位误差(通常在2至20米之间)以及城市环境中POI的高空间密度而变得复杂。论文提出的解决方案的关键在于引入\textsfPOIFormer,这是一个基于Transformer的框架,通过联合建模多种信号(包括空间邻近性、访问时间和持续时间、POI语义的上下文特征以及用户移动性和聚合人群行为模式),利用Transformer的自注意力机制来捕捉这些维度之间的复杂交互,从而实现对POI的准确和高效归因。
链接: https://arxiv.org/abs/2507.09137
作者: Nripsuta Ani Saxena,Shang-Ling Hsu,Mehul Shetty,Omar Alkhadra,Cyrus Shahabi,Abigail L. Horn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurately attributing user visits to specific Points of Interest (POIs) is a foundational task for mobility analytics, personalized services, marketing and urban planning. However, POI attribution remains challenging due to GPS inaccuracies, typically ranging from 2 to 20 meters in real-world settings, and the high spatial density of POIs in urban environments, where multiple venues can coexist within a small radius (e.g., over 50 POIs within a 100-meter radius in dense city centers). Relying on proximity is therefore often insufficient for determining which POI was actually visited. We introduce \textsfPOIFormer, a novel Transformer-based framework for accurate and efficient POI attribution. Unlike prior approaches that rely on limited spatiotemporal, contextual, or behavioral features, \textsfPOIFormer jointly models a rich set of signals, including spatial proximity, visit timing and duration, contextual features from POI semantics, and behavioral features from user mobility and aggregated crowd behavior patterns–using the Transformer’s self-attention mechanism to jointly model complex interactions across these dimensions. By leveraging the Transformer to model a user’s past and future visits (with the current visit masked) and incorporating crowd-level behavioral patterns through pre-computed KDEs, \textsfPOIFormer enables accurate, efficient attribution in large, noisy mobility datasets. Its architecture supports generalization across diverse data sources and geographic contexts while avoiding reliance on hard-to-access or unavailable data layers, making it practical for real-world deployment. Extensive experiments on real-world mobility datasets demonstrate significant improvements over existing baselines, particularly in challenging real-world settings characterized by spatial noise and dense POI clustering.
zh
[AI-107] Heterogeneous Graph Prompt Learning via Adaptive Weight Pruning
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在训练和推理时间长、难以捕捉复杂关系以及特征提取不足等问题。其解决方案的关键在于提出一种结合图提示(graph prompts)与权重剪枝(weight pruning)的新型框架,称为GPAWP,通过评估图提示的重要性并进行分层剪枝,去除负向提示标签,从而实现更参数高效且性能优越的图提示方法。
链接: https://arxiv.org/abs/2507.09132
作者: Chu-Yuan Wei,Shun-Yao Liu,Sheng-Da Zhuo,Chang-Dong Wang,Shu-Qiang Huang,Mohsen Guizani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) have achieved remarkable success in various graph-based tasks (e.g., node classification or link prediction). Despite their triumphs, GNNs still face challenges such as long training and inference times, difficulty in capturing complex relationships, and insufficient feature extraction. To tackle these issues, graph pre-training and graph prompt methods have garnered increasing attention for their ability to leverage large-scale datasets for initial learning and task-specific adaptation, offering potential improvements in GNN performance. However, previous research has overlooked the potential of graph prompts in optimizing models, as well as the impact of both positive and negative graph prompts on model stability and efficiency. To bridge this gap, we propose a novel framework combining graph prompts with weight pruning, called GPAWP, which aims to enhance the performance and efficiency of graph prompts by using fewer of them. We evaluate the importance of graph prompts using an importance assessment function to determine positive and negative weights at different granularities. Through hierarchically structured pruning, we eliminate negative prompt labels, resulting in more parameter-efficient and competitively performing prompts. Extensive experiments on three benchmark datasets demonstrate the superiority of GPAWP, leading to a significant reduction in parameters in node classification tasks.
zh
[AI-108] owards Human-level Dexterity via Robot Learning
【速读】:该论文旨在解决如何实现类人级别的灵巧操作能力,即通过机器人手完成复杂多指交互的问题,这是迈向通用具身智能的关键里程碑。其核心挑战在于克服计算传感器运动学习中的基本局限性,尤其是随机探索在强化学习中的效率问题。论文提出的解决方案关键在于直接从根源上解决这些限制,通过结构化探索和基于采样的规划方法,构建了一个有效的强化学习框架,以提升多指灵巧操作技能的学习效率与效果。
链接: https://arxiv.org/abs/2507.09117
作者: Gagan Khandate
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: PhD thesis
Abstract:Dexterous intelligence – the ability to perform complex interactions with multi-fingered hands – is a pinnacle of human physical intelligence and emergent higher-order cognitive skills. However, contrary to Moravec’s paradox, dexterous intelligence in humans appears simple only superficially. Many million years were spent co-evolving the human brain and hands including rich tactile sensing. Achieving human-level dexterity with robotic hands has long been a fundamental goal in robotics and represents a critical milestone toward general embodied intelligence. In this pursuit, computational sensorimotor learning has made significant progress, enabling feats such as arbitrary in-hand object reorientation. However, we observe that achieving higher levels of dexterity requires overcoming very fundamental limitations of computational sensorimotor learning. I develop robot learning methods for highly dexterous multi-fingered manipulation by directly addressing these limitations at their root cause. Chiefly, through key studies, this disseration progressively builds an effective framework for reinforcement learning of dexterous multi-fingered manipulation skills. These methods adopt structured exploration, effectively overcoming the limitations of random exploration in reinforcement learning. The insights gained culminate in a highly effective reinforcement learning that incorporates sampling-based planning for direct exploration. Additionally, this thesis explores a new paradigm of using visuo-tactile human demonstrations for dexterity, introducing corresponding imitation learning techniques. Comments: PhD thesis Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.09117 [cs.RO] (or arXiv:2507.09117v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2507.09117 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-109] SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity Test Coverag e and Effort Estimation
【速读】:该论文旨在解决软件工程(Software Engineering, SE)领域中高质量标注数据集创建成本高、耗时的问题。其关键解决方案是提出SPICE,一个可扩展的自动化流水线,通过结合上下文感知的代码导航、基于理由的提示策略以及多轮共识机制,生成与专家标注高度一致的标签,从而显著降低标注成本,实现高效的大规模数据集构建。
链接: https://arxiv.org/abs/2507.09108
作者: Aaditya Bhatia,Gustavo A. Oliva,Gopi Krishnan Rajbahadur,Haoxiang Zhang,Yihao Chen,Zhilong Chen,Arthur Leung,Dayi Lin,Boyuan Chen,Ahmed E. Hassan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:High-quality labeled datasets are crucial for training and evaluating foundation models in software engineering, but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE’s design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around 100,000 (manual annotation) to just 5.10. These results demonstrate SPICE’s potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).
zh
[AI-110] Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
【速读】:该论文试图解决在真实软件开发环境中,前沿AI工具对经验丰富的开源开发者生产力的影响问题。其解决方案的关键在于开展一项随机对照试验(RCT),通过控制AI工具的使用条件,评估AI工具对任务完成时间的实际影响,从而揭示AI工具在实际应用中的效能表现及其与预期的差异。
链接: https://arxiv.org/abs/2507.09089
作者: Joel Becker,Nate Rush,Elizabeth Barnes,David Rein
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 50 pages, 8 tables, 22 figures
Abstract:Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%–AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 20 properties of our setting that a priori could contribute to the observed slowdown effect–for example, the size and quality standards of projects, or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.
zh
[AI-111] Deep Reinforcement Learning with Gradient Eligibility Traces
【速读】:该论文试图解决深度强化学习中快速且稳定的离策略学习问题,现有方法多依赖于半梯度时序差分(TD)方法,虽简单高效但易发散;而更符合原理的梯度时序差分(GTD)方法虽有强收敛保证,却很少用于深度强化学习。论文的关键在于将广义投影贝尔曼误差(\GPBE)目标扩展至基于λ-回报的多步信用分配,并推导出三种优化该目标的梯度方法,从而提升了算法效率与性能。
链接: https://arxiv.org/abs/2507.09087
作者: Esraa Elelimy,Brett Daley,Andrew Patterson,Marlos C. Machado,Adam White,Martha White
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Achieving fast and stable off-policy learning in deep reinforcement learning (RL) is challenging. Most existing methods rely on semi-gradient temporal-difference (TD) methods for their simplicity and efficiency, but are consequently susceptible to divergence. While more principled approaches like Gradient TD (GTD) methods have strong convergence guarantees, they have rarely been used in deep RL. Recent work introduced the Generalized Projected Bellman Error ( \GPBE ), enabling GTD methods to work efficiently with nonlinear function approximation. However, this work is only limited to one-step methods, which are slow at credit assignment and require a large number of samples. In this paper, we extend the \GPBE objective to support multistep credit assignment based on the \lambda -return and derive three gradient-based methods that optimize this new objective. We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms. Finally, we evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments, respectively. Code available at this https URL_algos
zh
[AI-112] Queue up for takeoff: a transferable deep learning framework for flight delay prediction
【速读】:该论文试图解决航空行业中航班延误预测的精准性和跨网络泛化能力问题,以提升乘客体验并减少经济损失。解决方案的关键在于结合排队论(Queue-Theory)与简单的注意力机制,提出了一种名为Queue-Theory SimAM (QT-SimAM) 的新方法,该方法在多个数据集上均表现出优异的预测性能。
链接: https://arxiv.org/abs/2507.09084
作者: Nnamdi Daniel Aghanya,Ta Duong Vu,Amaëlle Diop,Charlotte Deville,Nour Imane Kerroumi,Irene Moulitsas,Jun Li,Desmond Bisandu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 3 figures, 20 pages references and appendix included,
Abstract:Flight delays are a significant challenge in the aviation industry, causing major financial and operational disruptions. To improve passenger experience and reduce revenue loss, flight delay prediction models must be both precise and generalizable across different networks. This paper introduces a novel approach that combines Queue-Theory with a simple attention model, referred to as the Queue-Theory SimAM (QT-SimAM). To validate our model, we used data from the US Bureau of Transportation Statistics, where our proposed QT-SimAM (Bidirectional) model outperformed existing methods with an accuracy of 0.927 and an F1 score of 0.932. To assess transferability, we tested the model on the EUROCONTROL dataset. The results demonstrated strong performance, achieving an accuracy of 0.826 and an F1 score of 0.791. Ultimately, this paper outlines an effective, end-to-end methodology for predicting flight delays. The proposed model’s ability to forecast delays with high accuracy across different networks can help reduce passenger anxiety and improve operational decision-making
zh
[AI-113] Learning from Synthetic Labs: Language Models as Auction Participants
【速读】:该论文试图解决如何利用生成式AI(Generative AI)代理进行拍卖实验以降低研究成本并验证拍卖机制有效性的问题,其解决方案的关键在于引入一种新颖的合成数据生成过程,并利用具有链式思维推理能力的大型语言模型(LLMs)作为拍卖参与者。通过这一方法,研究人员能够在低成本下进行大规模拍卖实验,并验证LLMs在不同拍卖机制中的行为是否符合理论预测和人类行为特征。
链接: https://arxiv.org/abs/2507.09083
作者: Anand Shah,Kehang Zhu,Yanchen Jiang,Jeffrey G. Wang,Arif K. Dayi,John J. Horton,David C. Parkes
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper investigates the behavior of simulated AI agents (large language models, or LLMs) in auctions, introducing a novel synthetic data-generating process to help facilitate the study and design of auctions. We find that LLMs – when endowed with chain of thought reasoning capacity – agree with the experimental literature in auctions across a variety of classic auction formats. In particular, we find that LLM bidders produce results consistent with risk-averse human bidders; that they perform closer to theoretical predictions in obviously strategy-proof auctions; and, that they succumb to the winner’s curse in common value settings. On prompting, we find that LLMs are not very sensitive to naive changes in prompts (e.g., language, currency) but can improve dramatically towards theoretical predictions with the right mental model (i.e., the language of Nash deviations). We run 1,000 + auctions for less than \ 400 with GPT-4 models (three orders of magnitude cheaper than modern auction experiments) and develop a framework flexible enough to run auction experiments with any LLM model and a wide range of auction design specifications, facilitating further experimental study by decreasing costs and serving as a proof-of-concept for the use of LLM proxies.
zh
[AI-114] BioAnalyst: A Foundation Model for Biodiversity
【速读】:该论文旨在解决生物多样性持续丧失所带来的生态研究与保护策略挑战,特别是针对数据稀缺环境下的生态预测问题。解决方案的关键在于提出BioAnalyst,这是首个专为生物多样性分析和保护规划设计的生成式AI基础模型,其基于Transformer架构,并在多模态数据集上进行预训练,具备良好的可迁移性与适应性,能够有效支持物种分布建模、栖息地适宜性评估、入侵物种检测及种群趋势预测等下游任务。
链接: https://arxiv.org/abs/2507.09080
作者: Athanasios Trantas,Martino Mensio,Stylianos Stasinos,Sebastian Gribincea,Taimur Khan,Damian Podareanu,Aliene van der Veen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The accelerating loss of biodiversity presents critical challenges for ecological research and conservation strategies. The preservation of biodiversity is paramount for maintaining ecological balance and ensuring the sustainability of ecosystems. However, biodiversity faces numerous threats, including habitat loss, climate change, and the proliferation of invasive species. Addressing these and other ecology-related challenges, both at local and global scales, requires comprehensive monitoring, predictive and conservation planning capabilities. Artificial Intelligence (AI) Foundation Models (FMs) have gained significant momentum in numerous scientific domains by leveraging vast datasets to learn general-purpose representations adaptable to various downstream tasks. This paradigm holds immense promise for biodiversity conservation. In response, we introduce BioAnalyst, the first Foundation Model tailored for biodiversity analysis and conservation planning. BioAnalyst employs a transformer-based architecture, pre-trained on extensive multi-modal datasets encompassing species occurrence records, remote sensing indicators, climate and environmental variables. BioAnalyst is designed for adaptability, allowing for fine-tuning of a range of downstream tasks, such as species distribution modelling, habitat suitability assessments, invasive species detection, and population trend forecasting. We evaluate the model’s performance on two downstream use cases, demonstrating its generalisability compared to existing methods, particularly in data-scarce scenarios for two distinct use-cases, establishing a new accuracy baseline for ecological forecasting. By openly releasing BioAnalyst and its fine-tuning workflows to the scientific community, we aim to foster collaborative efforts in biodiversity modelling and advance AI-driven solutions to pressing ecological challenges.
zh
[AI-115] SetupBench: Assessing Software Engineering Agents Ability to Bootstrap Development Environments
【速读】:该论文试图解决当前大型语言模型(Large Language Model, LLM)代理在真实软件任务中环境引导能力不足的问题,即现有基准测试主要在预配置环境中进行,未能评估代理从零开始构建运行环境的能力。解决方案的关键是提出SetupBench,这是一个包含93个实例的基准测试,旨在隔离并评估代理在无预装依赖的Linux沙箱环境中安装包、解决依赖冲突、初始化数据库和配置后台服务等能力。
链接: https://arxiv.org/abs/2507.09063
作者: Avi Arora,Jinu Jang,Roshanak Zilouchian Moghaddam
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Modern Large Language Model (LLM) agents promise end to end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill this gap, we introduce SetupBench, a 93 instance benchmark that isolates the environment-bootstrap skill: starting from a bare Linux sandbox, an agent must install packages, resolve dependency conflicts, initialize databases, and configure background services. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios, each accompanies by a natural language problem statement and a deterministic success command. Through evaluation of OpenHands, a state-of-the-art coding agent, we find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%). Our analysis reveals systematic failure modes including incomplete development tooling installation, hallucinated task constraints, and non-persistent environment modifications that break agent-human collaboration workflows. We identify substantial inefficiencies in agent exploration strategies, with 38-89% of actions being unnecessary compared to optimal human behavior. These findings highlight gaps in current agents’ practical environment-bootstrap capabilities. By targeting this critical yet under-evaluated capability, SetupBench provides a rigorous yard-stick for the next generation of software developer agents aiming to solve end to end real-wold tasks.
zh
[AI-116] Analysing Health Misinformation with Advanced Centrality Metrics in Online Social Networks ALT
【速读】:该论文试图解决全球性危机期间在线社交网络(OSN)中健康错误信息快速传播所带来的公共健康、社会稳定和机构信任挑战。其解决方案的关键在于引入并比较三种新的中心性度量:动态影响力中心性(DIC)、健康错误信息易感中心性(MVC)和传播中心性(PC),这些度量结合了时间动态性、易感性和多层网络交互,以更准确地识别关键节点、传播路径和错误信息传播者,从而提升干预效果。
链接: https://arxiv.org/abs/2507.09055
作者: Mkululi Sikosana,Sean Maudsley-Barton,Oluwaseun Ajao
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 Pages, 2 figures, 3 tables, journal article in PLOS Digital Health (2025)
Abstract:The rapid spread of health misinformation on online social networks (OSNs) during global crises such as the COVID-19 pandemic poses challenges to public health, social stability, and institutional trust. Centrality metrics have long been pivotal in understanding the dynamics of information flow, particularly in the context of health misinformation. However, the increasing complexity and dynamism of online networks, especially during crises, highlight the limitations of these traditional approaches. This study introduces and compares three novel centrality metrics: dynamic influence centrality (DIC), health misinformation vulnerability centrality (MVC), and propagation centrality (PC). These metrics incorporate temporal dynamics, susceptibility, and multilayered network interactions. Using the FibVID dataset, we compared traditional and novel metrics to identify influential nodes, propagation pathways, and misinformation influencers. Traditional metrics identified 29 influential nodes, while the new metrics uncovered 24 unique nodes, resulting in 42 combined nodes, an increase of 44.83%. Baseline interventions reduced health misinformation by 50%, while incorporating the new metrics increased this to 62.5%, an improvement of 25%. To evaluate the broader applicability of the proposed metrics, we validated our framework on a second dataset, Monant Medical Misinformation, which covers a diverse range of health misinformation discussions beyond COVID-19. The results confirmed that the advanced metrics generalised successfully, identifying distinct influential actors not captured by traditional methods. In general, the findings suggest that a combination of traditional and novel centrality measures offers a more robust and generalisable framework for understanding and mitigating the spread of health misinformation in different online network contexts.
zh
[AI-117] Model Parallelism With Subnetwork Data Parallelism
【速读】:该论文旨在解决大规模分布式预训练中节点内存需求高和节点内通信成本大的问题。其解决方案的关键在于通过在不同工作节点上训练模型的小型结构化子网络,从而降低内存消耗。与流水线方法不同,该方法避免了节点间的激活通信,并保持了与标准数据并行通信方案相当或更低的带宽需求。
链接: https://arxiv.org/abs/2507.09029
作者: Vaibhav Singh,Zafir Khalid,Edouard Oyallon,Eugene Belilovsky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure
Abstract:Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node communication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a 20-40% reduction in memory usage without any loss in performance.
zh
[AI-118] Accelerating Drug Discovery Through Agent ic AI: A Multi-Agent Approach to Laboratory Automation in the DMTA Cycle
【速读】:该论文试图解决制药行业在药物发现过程中面临的挑战,传统方法难以满足现代治疗开发的需求。其解决方案的关键在于引入一种名为Tippy的新型人工智能框架,该框架通过在设计-制作-测试-分析(DMTA)循环中运行的专用AI代理实现实验室自动化。Tippy采用五个专门代理——监督者、分子、实验室、分析和报告,并在安全防护机制下运作,每个代理专注于药物发现流程中的特定阶段,从而展示出AI在加速DMTA周期的同时保持科学严谨性的潜力。
链接: https://arxiv.org/abs/2507.09023
作者: Yao Fehlis,Charles Crain,Aidan Jensen,Michael Watson,James Juhasz,Paul Mandel,Betty Liu,Shawn Mahon,Daren Wilson,Nick Lynch-Jonely,Ben Leedom,David Fuller
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:The pharmaceutical industry faces unprecedented challenges in drug discovery, with traditional approaches struggling to meet modern therapeutic development demands. This paper introduces a novel AI framework, Tippy, that transforms laboratory automation through specialized AI agents operating within the Design-Make-Test-Analyze (DMTA) cycle. Our multi-agent system employs five specialized agents - Supervisor, Molecule, Lab, Analysis, and Report, with Safety Guardrail oversight - each designed to excel in specific phases of the drug discovery pipeline. Tippy represents the first production-ready implementation of specialized AI agents for automating the DMTA cycle, providing a concrete example of how AI can transform laboratory workflows. By leveraging autonomous AI agents that reason, plan, and collaborate, we demonstrate how Tippy accelerates DMTA cycles while maintaining scientific rigor essential for pharmaceutical research. The system shows significant improvements in workflow efficiency, decision-making speed, and cross-disciplinary coordination, offering a new paradigm for AI-assisted drug discovery.
zh
[AI-119] On Evaluating Performance of LLM Inference Serving Systems
【速读】:该论文试图解决当前大型语言模型(Large Language Model, LLM)推理系统评估方法中存在的根本性缺陷,这些缺陷表现为常见的评估反模式(anti-patterns),导致无法准确反映系统的真实性能并阻碍科学研究进展。其解决方案的关键在于识别并纠正三个核心维度中的反模式:基线公平性、评估设置和度量设计,并提出一个全面的检查清单,以建立稳健的LLM推理评估框架。通过这一框架,可以避免评估过程中的误导性结论,确保评估结果的可重复性和与实际应用场景的一致性。
链接: https://arxiv.org/abs/2507.09019
作者: Amey Agrawal,Nitin Kedia,Anmol Agarwal,Jayashree Mohan,Nipun Kwatra,Souvik Kundu,Ramachandran Ramjee,Alexey Tumanov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns – such as inadequate baseline comparisons that conflate engineering effort with algorithmic novelty, workload selections that fail to represent production scenarios, and metric normalizations that hide substantial performance variability like generation stalls-lead to misleading conclusions. To address these challenges, we provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns in favor of robust LLM inference evaluation. To demonstrate the practical application of our framework, we present a case study analyzing speculative decoding, a technique whose bursty, non-uniform token generation is easily misinterpreted when evaluated using approaches characteristic of these anti-patterns. Our work establishes a rigorous foundation for evaluation methodology, enabling meaningful comparisons, ensuring reproducible results, and ultimately accelerating genuine progress in LLM inference systems by moving beyond common anti-patterns to align evaluation with real-world requirements.
zh
[AI-120] Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference
【速读】:该论文旨在解决在边缘设备上进行大规模语言模型(Large Language Model, LLM)推理时面临的高内存访问开销和计算密集型阶段能效不足的问题。其解决方案的关键在于提出一种基于混合脉动阵列(Hybrid Systolic Array, HSA)架构的边缘LLM推理加速器,通过MXINT4权重量化和针对HSA优化的数据流设计,显著减少外部内存访问(EMA),同时保持硬件利用率并最小化精度损失;此外,还集成了优化的根均方归一化(RMSNorm)和旋转位置嵌入(RoPE)单元,以降低非线性运算的延迟、面积和内存访问开销,从而实现高效的端到端推理。
链接: https://arxiv.org/abs/2507.09010
作者: Chun-Ting Chen,HanGyeol Mun,Jian Meng,Mohamed S. Abdelfattah,Jae-sun Seo
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Accepted as a conference paper at the 2025 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Abstract:Edge inference for large language models (LLM) offers secure, low-latency, and cost-effective inference solutions. We emphasize that an edge accelerator should achieve high area efficiency and minimize external memory access (EMA) during the memory-bound decode stage, while maintaining high energy efficiency during the compute intensive prefill stage. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array (HSA) architecture that optimizes inference efficiency in both stages. To further reduce EMA, we adopt MXINT4 weight quantization and propose an optimized dataflow tailored for HSA, ensuring negligible dequantization overhead and achieving 100% hardware utilization with minimal accuracy loss under edge DRAM bandwidth constraints. For non-linear operations, we incorporate optimized root mean square normalization (RMSNorm) and rotary position embedding (RoPE) units, reducing their latency, area, and memory access overhead while enabling end-to-end inference on our accelerator. Our solution achieves 247/117 (token/s/mm2) while running a 1.3B LLM on long-input/long-output scenarios, providing 2.45x/13.5x improvement over existing approaches, while maintaining superior energy efficiency in token generation.
zh
[AI-121] Multimodal Cardiovascular Risk Profiling Using Self-Supervised Learning of Polysomnography
【速读】:该论文旨在解决心血管疾病(Cardiovascular Disease, CVD)风险评估的精准性问题,通过多模态信号(如脑电图(Electroencephalography, EEG)、心电图(Electrocardiography, ECG)和呼吸信号)提取个体化CVD风险评分。其解决方案的关键在于开发一种自监督深度学习模型,该模型能够从多模态数据中提取具有临床意义的模式,并通过对比有无CVD结局个体的嵌入向量生成投影得分,从而提升CVD预测的准确性。
链接: https://arxiv.org/abs/2507.09009
作者: Zhengxiao He,Huayu Li,Geng Yuan,William D.S. Killgore,Stuart F. Quan,Chen X. Chen,Ao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Methods: We developed a self-supervised deep learning model that extracts meaningful patterns from multi-modal signals (Electroencephalography (EEG), Electrocardiography (ECG), and respiratory signals). The model was trained on data from 4,398 participants. Projection scores were derived by contrasting embeddings from individuals with and without CVD outcomes. External validation was conducted in an independent cohort with 1,093 participants. The source code is available on this https URL. Results: The projection scores revealed distinct and clinically meaningful patterns across modalities. ECG-derived features were predictive of both prevalent and incident cardiac conditions, particularly CVD mortality. EEG-derived features were predictive of incident hypertension and CVD mortality. Respiratory signals added complementary predictive value. Combining these projection scores with the Framingham Risk Score consistently improved predictive performance, achieving area under the curve values ranging from 0.607 to 0.965 across different outcomes. Findings were robustly replicated and validated in the external testing cohort. Conclusion: Our findings demonstrate that the proposed framework can generate individualized CVD risk scores directly from PSG data. The resulting projection scores have the potential to be integrated into clinical practice, enhancing risk assessment and supporting personalized care.
zh
[AI-122] Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery
【速读】:该论文试图解决科学建模中的核心问题:机制模型虽然具有可解释性,但在面对现实世界的复杂性时表现不佳;而机器学习模型虽然灵活,但需要大量标注数据,无法推断不可观测量,并且作为黑箱运行。解决方案的关键在于引入基于仿真的神经网络(Simulation-Grounded Neural Networks, SGNNs),该框架利用机制仿真作为神经网络的训练数据,通过在多样化的模型结构、参数范围、随机性和观测伪影上进行预训练,使SGNNs能够在不同科学领域和建模任务中实现最先进的性能,同时提供可解释的机制推断能力。
链接: https://arxiv.org/abs/2507.08977
作者: Carson Dudley,Reiden Magdaleno,Christopher Harding,Marisa Eisenberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Scientific modeling faces a core limitation: mechanistic models offer interpretability but collapse under real-world complexity, while machine learning models are flexible but require large labeled datasets, cannot infer unobservable quantities, and operate as black boxes. We introduce Simulation-Grounded Neural Networks (SGNNs), a general framework that uses mechanistic simulations as training data for neural networks. SGNNs are pretrained on synthetic corpora spanning diverse model structures, parameter regimes, stochasticity, and observational artifacts. We evaluated SGNNs across scientific disciplines and modeling tasks, and found that SGNNs achieved state-of-the-art results across settings: for prediction tasks, they nearly tripled COVID-19 forecasting skill versus CDC baselines, reduced chemical yield prediction error by one third, and maintained accuracy in ecological forecasting where task specific models failed. For inference tasks, SGNNs also accurately classified the source of information spread in simulated social networks and enabled supervised learning for unobservable targets, such as estimating COVID-19 transmissibility more accurately than traditional methods even in early outbreaks. Finally, SGNNs enable back-to-simulation attribution, a new form of mechanistic interpretability. Given real world input, SGNNs retrieve simulations based on what the model has learned to see as most similar, revealing which underlying dynamics the model believes are active. This provides process-level insight – what the model thinks is happening – not just which features mattered. SGNNs unify scientific theory with deep learning flexibility and unlock a new modeling paradigm – transforming simulations from rigid, post hoc tools into flexible sources of supervision, enabling robust, interpretable inference even when ground truth is missing.
zh
[AI-123] Simulating Three-dimensional Turbulence with Physics-informed Neural Networks
【速读】:该论文试图解决高雷诺数下湍流模拟的计算资源消耗过大的问题,传统方法依赖于计算网格和大量训练数据,而本文提出了一种基于物理信息神经网络(Physics-informed Neural Networks, PINNs)的新方法。其解决方案的关键在于通过直接从物理方程训练神经网络,实现无需传统计算网格和训练数据的连续、无网格湍流模拟,结合自适应网络架构、因果训练和先进优化方法,有效应对混沌动力学的学习挑战。
链接: https://arxiv.org/abs/2507.08972
作者: Sifan Wang,Shyam Sankaran,Panos Stinis,Paris Perdikaris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
备注: 25 pages, 13 figures, 3 tables
Abstract:Turbulent fluid flows are among the most computationally demanding problems in science, requiring enormous computational resources that become prohibitive at high flow speeds. Physics-informed neural networks (PINNs) represent a radically different approach that trains neural networks directly from physical equations rather than data, offering the potential for continuous, mesh-free solutions. Here we show that appropriately designed PINNs can successfully simulate fully turbulent flows in both two and three dimensions, directly learning solutions to the fundamental fluid equations without traditional computational grids or training data. Our approach combines several algorithmic innovations including adaptive network architectures, causal training, and advanced optimization methods to overcome the inherent challenges of learning chaotic dynamics. Through rigorous validation on challenging turbulence problems, we demonstrate that PINNs accurately reproduce key flow statistics including energy spectra, kinetic energy, enstrophy, and Reynolds stresses. Our results demonstrate that neural equation solvers can handle complex chaotic systems, opening new possibilities for continuous turbulence modeling that transcends traditional computational limitations.
zh
[AI-124] oxBench: A Binding Affinity Prediction Benchmark with AB-FEP-Calculated Labels for Human Estrogen Receptor Alpha ICML2025
【速读】:该论文旨在解决蛋白质-配体结合亲和力预测中的数据稀缺与计算成本高的问题。传统机器学习方法受限于可靠数据的不足,而基于物理的绝对结合自由能扰动(AB-FEP)方法虽然精度高,但计算成本昂贵,难以用于高通量应用。论文提出的解决方案关键在于构建ToxBench数据集,这是首个针对人类雌激素受体α(ER α)这一药学关键靶点的大规模AB-FEP数据集,并采用双损失框架的DualBind模型,以在显著降低计算成本的前提下有效学习结合能量函数,从而提升机器学习在该领域的性能与适用性。
链接: https://arxiv.org/abs/2507.08966
作者: Meng Liu,Karl Leswing,Simon K. S. Chu,Farhad Ramezanghorbani,Griffin Young,Gabriel Marques,Prerna Das,Anjali Panikar,Esther Jamir,Mohammed Sulaiman Shamsudeen,K. Shawn Watts,Ananya Sen,Hari Priya Devannagari,Edward B. Miller,Muyun Lihan,Howook Hwang,Janet Paulsen,Xin Yu,Kyle Gion,Timur Rvachov,Emine Kucukbenli,Saee Gopal Paliwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
备注: Workshop on Generative AI for Biology at ICML 2025
Abstract:Protein-ligand binding affinity prediction is essential for drug discovery and toxicity assessment. While machine learning (ML) promises fast and accurate predictions, its progress is constrained by the availability of reliable data. In contrast, physics-based methods such as absolute binding free energy perturbation (AB-FEP) deliver high accuracy but are computationally prohibitive for high-throughput applications. To bridge this gap, we introduce ToxBench, the first large-scale AB-FEP dataset designed for ML development and focused on a single pharmaceutically critical target, Human Estrogen Receptor Alpha (ER \alpha ). ToxBench contains 8,770 ER \alpha -ligand complex structures with binding free energies computed via AB-FEP with a subset validated against experimental affinities at 1.75 kcal/mol RMSE, along with non-overlapping ligand splits to assess model generalizability. Using ToxBench, we further benchmark state-of-the-art ML methods, and notably, our proposed DualBind model, which employs a dual-loss framework to effectively learn the binding energy function. The benchmark results demonstrate the superior performance of DualBind and the potential of ML to approximate AB-FEP at a fraction of the computational cost.
zh
[AI-125] heory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models
【速读】:该论文旨在解决在离散扩散模型中,基于无分类器指导(Classifier-Free Guidance, CFG)的生成质量问题,特别是针对掩码离散扩散和均匀离散扩散场景下引导策略(guidance schedules)对生成效果的影响。研究发现,在采样早期高引导强度会损害生成质量,而晚期引导影响更大,同时现有CFG实现可能导致过渡不平衡,如早期解掩码过快,从而降低样本质量。解决方案的关键在于提出一种新的无分类器指导机制,通过平滑数据分布与初始(掩码/均匀)分布之间的传输过程,提升样本质量,且该方法仅需一行代码即可实现。
链接: https://arxiv.org/abs/2507.08965
作者: Kevin Rojas,Ye He,Chieh-Hsin Lai,Yuta Takida,Yuki Mitsufuji,Molei Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete diffusion, focusing on the role of guidance schedules. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance has a larger effect. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism empirically applicable to any discrete diffusion. Intuitively, our method smoothens the transport between the data distribution and the initial (masked/uniform) distribution, which results in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. The efficacy of our method is empirically demonstrated with experiments on ImageNet (masked discrete diffusion) and QM9 (uniform discrete diffusion).
zh
[AI-126] How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLM s
【速读】:该论文试图解决多模型协作中计算成本高且性能提升有限的问题,特别是在多智能体框架下如何有效利用多个大语言模型(LLMs)的互补优势。其解决方案的关键在于提出一种分层多智能体框架,通过仅训练一个领导者模型(leader LLM)来协调一组未经过训练的同行代理(peer agents),并采用多智能体引导的领导者策略优化(MLPO)方法,使领导者能够在不依赖辅助价值网络或显式代理反馈的情况下评估和综合代理响应,从而实现高效且高性能的协作推理。
链接: https://arxiv.org/abs/2507.08960
作者: Andrew Estornell,Jean-Francois Ton,Muhammad Faaiz Taufiq,Hang Li
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have achieved strong performance on a wide range of complex reasoning tasks, yet further gains are often possible by leveraging the complementary strengths of multiple models. While multi-agent frameworks can improve solution quality by leveraging multiple LLMs, existing methods are often computationally expensive, both at training and inference time. In this work, we introduce a hierarchical multi-agent framework that addresses these challenges by training only a single leader LLM to coordinate a team of untrained peer agents. To this end, we propose Multi-agent guided Leader Policy \textbfOptimization (MLPO), a novel approach which trains the leader to evaluate and synthesize agent responses without auxiliary value networks or explicit agent feedback. Leaders trained with MLPO exhibit improved performance not only when interacting with the agent team at inference time, but also enjoy improved performance when deployed in single-agent settings without the team. Empirical results on Big-Bench Hard (BBH), MATH, and MMLU demonstrate that our framework achieves substantial performance improvements over both single-agent and multi-agent baselines. Our results highlight the effectiveness and efficiency of training a single, flexible leader for collaborative reasoning in multi-agent LLM systems.
zh
[AI-127] GraphRunner: A Multi-Stage Framework for Efficient and Accurate Graph-Based Retrieval
【速读】:该论文试图解决传统基于图的检索方法在处理结构化、互连数据集(如知识图谱)时存在的问题,特别是在多跳推理过程中容易受到大型语言模型(LLM)推理错误和幻觉的影响,导致检索相关信息的效果受限。其解决方案的关键在于提出GraphRunner框架,该框架通过三个阶段——规划、验证和执行——实现高阶遍历操作,支持单步多跳探索,并生成全面的遍历计划,从而减少推理错误并检测幻觉,提升检索的准确性和效率。
链接: https://arxiv.org/abs/2507.08945
作者: Savini Kashmira,Jayanaka L. Dantanarayana,Krisztián Flautner,Lingjia Tang,Jason Mars
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Conventional Retrieval Augmented Generation (RAG) approaches are common in text-based applications. However, they struggle with structured, interconnected datasets like knowledge graphs, where understanding underlying relationships is crucial for accurate retrieval. A common direction in graph-based retrieval employs iterative, rule-based traversal guided by Large Language Models (LLMs). Such existing iterative methods typically combine reasoning with single hop traversal at each step, making them vulnerable to LLM reasoning errors and hallucinations that ultimately hinder the retrieval of relevant information. To address these limitations, we propose GraphRunner, a novel graph-based retrieval framework that operates in three distinct stages: planning, verification, and execution. This introduces high-level traversal actions that enable multi-hop exploration in a single step. It also generates a holistic traversal plan, which is verified against the graph structure and pre-defined traversal actions, reducing reasoning errors and detecting hallucinations before execution. GraphRunner significantly reduces LLM reasoning errors and detects hallucinations through validation. Our evaluation using the GRBench dataset shows that GraphRunner consistently outperforms existing approaches, achieving 10-50% performance improvements over the strongest baseline while reducing inference cost by 3.0-12.9x and response generation time by 2.5-7.1x, making it significantly more robust and efficient for graph-based retrieval tasks. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.08945 [cs.IR] (or arXiv:2507.08945v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2507.08945 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-128] Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents ICML2025
【速读】:该论文试图解决基于大型语言模型(Large Language Model, LLM)的多智能体系统在处理复杂任务时因多次迭代推理步骤而导致的高延迟问题。解决方案的关键在于提出M1-Parallel框架,该框架通过并行运行多个多智能体团队来探索不同的解决方案路径,并利用事件驱动的通信模型与异步消息传递机制,高效利用有效计划的内在多样性,从而减少端到端延迟或提高任务完成率。
链接: https://arxiv.org/abs/2507.08944
作者: Enhao Zhang,Erkang Zhu,Gagan Bansal,Adam Fourney,Hussein Mozannar,Jack Gerrits
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: ICML 2025 Workshop on MAS
Abstract:Large language model (LLM)-based multi-agent systems have demonstrated remarkable promise for tackling complex tasks by breaking them down into subtasks that are iteratively planned, executed, observed, and refined. Despite their effectiveness, these systems often incur high latency because real-world problems frequently demand multiple iterative cycles of reasoning steps. To address this challenge, we propose M1-Parallel, a framework that concurrently runs multiple multi-agent teams in parallel to uncover distinct solution paths. By leveraging an event-driven communication model with asynchronous messaging, M1-Parallel efficiently capitalizes on the inherent diversity of valid plans to either reduce end-to-end latency or boost task completion rates. Our experiments on complex tasks show that M1-Parallel with early termination achieves up to 2.2\times speedup while preserving accuracy, and that M1-Parallel with aggregation yields higher task completion rates. We further investigate strategies aimed at encouraging diverse execution plans but observe no additional performance gains over repeated sampling. Overall, these findings underscore the potential of parallel plan execution for optimizing multi-agent systems for real-world, high-complexity reasoning tasks.
zh
[AI-129] Fair-FLIP: Fair Deepfake Detection with Fairness-Oriented Final Layer Input Prioritising
【速读】:该论文试图解决深度伪造(deepfake)检测中的公平性问题,即现有方法在不同人口统计属性(如种族和性别)上表现出偏差。其解决方案的关键在于提出一种新颖的后处理方法,称为公平导向最终层输入优先化(Fair-FLIP),通过重新加权训练模型的最终层输入,以减少子群体间的差异,优先考虑低变异性样本,降低高变异性样本的权重,从而提升检测的公平性。
链接: https://arxiv.org/abs/2507.08912
作者: Tomasz Szandala,Fatima Ezzeddine,Natalia Rusin,Silvia Giordano,Omran Ayoub
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Intelligence-generated content has become increasingly popular, yet its malicious use, particularly the deepfakes, poses a serious threat to public trust and discourse. While deepfake detection methods achieve high predictive performance, they often exhibit biases across demographic attributes such as ethnicity and gender. In this work, we tackle the challenge of fair deepfake detection, aiming to mitigate these biases while maintaining robust detection capabilities. To this end, we propose a novel post-processing approach, referred to as Fairness-Oriented Final Layer Input Prioritising (Fair-FLIP), that reweights a trained model’s final-layer inputs to reduce subgroup disparities, prioritising those with low variability while demoting highly variable ones. Experimental results comparing Fair-FLIP to both the baseline (without fairness-oriented de-biasing) and state-of-the-art approaches show that Fair-FLIP can enhance fairness metrics by up to 30% while maintaining baseline accuracy, with only a negligible reduction of 0.25%. Code is available on Github: this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.08912 [cs.LG] (or arXiv:2507.08912v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.08912 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: 2025 12th IEEE Swiss Conference on Data Science (SDS)
zh
[AI-130] Last Layer Hamiltonian Monte Carlo
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在大规模数据集和复杂架构中进行不确定性估计的计算成本过高的问题。其解决方案的关键在于提出一种基于哈密顿蒙特卡洛(Hamiltonian Monte Carlo, HMC)的最后层概率方法(Last Layer HMC, LL-HMC),通过仅对DNN的最后层进行HMC采样,从而显著降低计算需求,使其适用于计算资源有限的数据密集型场景。
链接: https://arxiv.org/abs/2507.08905
作者: Koen Vellenga,H. Joe Steinhauer,Göran Falkman,Jonas Andersson,Anders Sjögren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 15 figures, 6 tables, currently under submission
Abstract:We explore the use of Hamiltonian Monte Carlo (HMC) sampling as a probabilistic last layer approach for deep neural networks (DNNs). While HMC is widely regarded as a gold standard for uncertainty estimation, the computational demands limit its application to large-scale datasets and large DNN architectures. Although the predictions from the sampled DNN parameters can be parallelized, the computational cost still scales linearly with the number of samples (similar to an ensemble). Last layer HMC (LL–HMC) reduces the required computations by restricting the HMC sampling to the final layer of a DNN, making it applicable to more data-intensive scenarios with limited computational resources. In this paper, we compare LL-HMC against five last layer probabilistic deep learning (LL-PDL) methods across three real-world video datasets for driver action and intention. We evaluate the in-distribution classification performance, calibration, and out-of-distribution (OOD) detection. Due to the stochastic nature of the probabilistic evaluations, we performed five grid searches for different random seeds to avoid being reliant on a single initialization for the hyperparameter configurations. The results show that LL–HMC achieves competitive in-distribution classification and OOD detection performance. Additional sampled last layer parameters do not improve the classification performance, but can improve the OOD detection. Multiple chains or starting positions did not yield consistent improvements.
zh
[AI-131] Multi-Actor Generative Artificial Intelligence as a Game Engine
【速读】:该论文试图解决在多参与者环境中灵活定义和配置场景的问题,以支持包括社会科学研究建模、交互式叙事和人工智能评估在内的多种应用场景。解决方案的关键在于借鉴桌游角色扮演游戏(TTRPG)中游戏主持者(GM)的机制,并采用实体-组件架构模式,使GM本身成为一个可配置的实体,由多个组件构成,从而实现工程实现细节与设计配置之间的职责分离,提高系统的可迭代性、模块化和可扩展性。
链接: https://arxiv.org/abs/2507.08892
作者: Alexander Sasha Vezhnevets,Jayd Matyas,Logan Cross,Davide Paglieri,Minsuk Chang,William A. Cunningham,Simon Osindero,William S. Isaac,Joel Z. Leibo
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 13 pages
Abstract:Generative AI can be used in multi-actor environments with purposes ranging from social science modeling to interactive narrative and AI evaluation. Supporting this diversity of use cases – which we classify as Simulationist, Dramatist, and Evaluationist – demands a flexible scenario definition framework. We argue here that a good approach is to take inspiration from tabletop role-playing games (TTRPGs), where a Game Master (GM) is responsible for the environment and generates all parts of the story not directly determined by the voluntary actions of player characters. We argue that the Entity-Component architectural pattern is useful here. In such a system, the GM is not a hardcoded computer game but is itself a configurable entity, composed of components just like any other actor. By design, the approach allows for a separation between the underlying implementation details handled by an engineer, the creation of reusable components, and their composition and configuration managed by a designer who constructs entities from the components. This separation of concerns is instrumental for achieving rapid iteration, maintaining modularity, and ultimately to ensure scalability. We describe the ongoing evolution of the Concordia library in terms of this philosophy, demonstrating how it allows users to effectively configure scenarios that align with their specific goals.
zh
[AI-132] AirScape: An Aerial Generative World Model with Motion Controllability
【速读】:该论文试图解决如何使机器人在三维空间中预测自身运动意图的结果这一基础性问题。其解决方案的关键在于提出AirScape,这是首个为六自由度空中代理设计的世界模型,能够基于当前视觉输入和运动意图预测未来的观察序列。该模型通过构建包含11k视频-意图对的数据集进行训练与测试,并采用两阶段训练流程,将一个初始缺乏具身空间知识的基础模型转化为可由运动意图控制并遵循物理时空约束的世界模型。
链接: https://arxiv.org/abs/2507.08885
作者: Baining Zhao,Rongze Tang,Mingyuan Jia,Ziyou Wang,Fanghang Man,Xin Zhang,Yu Shang,Weichen Zhang,Chen Gao,Wei Wu,Xin Wang,Xinlei Chen,Yong Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:How to enable robots to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore more general spatial imagination capabilities, here we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct an dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase training schedule to train a foundation model – initially devoid of embodied spatial knowledge – into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints.
zh
[AI-133] he Consistency-Acceptability Divergence of LLM s in Judicial Decision-Making: Task and Stakeholder Dimensions
【速读】:该论文试图解决生成式 AI(Generative AI)在司法系统中的“一致性-可接受性偏离”问题,即技术一致性与社会接受度之间的差距。研究指出,尽管生成式 AI 在技术层面具有高一致性,但其应用效果存在正负两面性。解决方案的关键在于提出双轨 deliberative 多角色生成式 AI 司法治理框架(DTDMR-LJGF),该框架通过智能任务分类和多方利益相关者之间的有意义互动,实现技术效率与社会合法性的平衡。
链接: https://arxiv.org/abs/2507.08881
作者: Zhang MingDa,Xu Qing
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 12 pages,2 figures
Abstract:The integration of large language model (LLM) technology into judicial systems is fundamentally transforming legal practice worldwide. However, this global transformation has revealed an urgent paradox requiring immediate attention. This study introduces the concept of ``consistency-acceptability divergence’’ for the first time, referring to the gap between technical consistency and social acceptance. While LLMs achieve high consistency at the technical level, this consistency demonstrates both positive and negative effects. Through comprehensive analysis of recent data on LLM judicial applications from 2023–2025, this study finds that addressing this challenge requires understanding both task and stakeholder dimensions. This study proposes the Dual-Track Deliberative Multi-Role LLM Judicial Governance Framework (DTDMR-LJGF), which enables intelligent task classification and meaningful interaction among diverse stakeholders. This framework offers both theoretical insights and practical guidance for building an LLM judicial ecosystem that balances technical efficiency with social legitimacy.
zh
[AI-134] A Multi-Level Strategy for Deepfake Content Moderation under EU Regulation
【速读】:该论文试图解决深度伪造(deepfake)技术在民主社会中带来的风险,特别是在在线平台上的政治传播中的影响。解决方案的关键在于提出一种多层次策略,该策略结合现有标记、检测和标签方法的优势,并通过简单的评分机制实现可扩展性和实用性,同时对不同类型的深度伪造技术具有通用性,并允许根据具体情境进行风险权重调整。
链接: https://arxiv.org/abs/2507.08879
作者: Max-Paul Förster,Luca Deck,Raimund Weidlich,Niklas Kühl
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The growing availability and use of deepfake technologies increases risks for democratic societies, e.g., for political communication on online platforms. The EU has responded with transparency obligations for providers and deployers of Artificial Intelligence (AI) systems and online platforms. This includes marking deepfakes during generation and labeling deepfakes when they are shared. However, the lack of industry and enforcement standards poses an ongoing challenge. Through a multivocal literature review, we summarize methods for marking, detecting, and labeling deepfakes and assess their effectiveness under EU regulation. Our results indicate that individual methods fail to meet regulatory and practical requirements. Therefore, we propose a multi-level strategy combining the strengths of existing methods. To account for the masses of content on online platforms, our multi-level strategy provides scalability and practicality via a simple scoring mechanism. At the same time, it is agnostic to types of deepfake technology and allows for context-specific risk weighting.
zh
[AI-135] owards Privacy-Preserving and Personalized Smart Homes via Tailored Small Language Models
【速读】:该论文试图解决智能家庭中基于大型语言模型(Large Language Models, LLMs)的助手在提供个性化服务时可能引发的隐私泄露问题。解决方案的关键在于开发HomeLLaMA,这是一个基于本地的小型语言模型(Small Language Model, SLM)的隐私保护智能家庭助手,它通过在设备端学习云LLMs的知识来提供满意的响应,并通过持续更新本地SLMs和用户配置实现主动交互。此外,还提出了PrivShield,为用户提供可选的隐私保护的LLM-based服务,以在保护隐私的同时提升用户体验。
链接: https://arxiv.org/abs/2507.08878
作者: Xinyu Huang,Leming Shen,Zijing Ma,Yuanqing Zheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have showcased remarkable generalizability in language comprehension and hold significant potential to revolutionize human-computer interaction in smart homes. Existing LLM-based smart home assistants typically transmit user commands, along with user profiles and home configurations, to remote servers to obtain personalized services. However, users are increasingly concerned about the potential privacy leaks to the remote servers. To address this issue, we develop HomeLLaMA, an on-device assistant for privacy-preserving and personalized smart home serving with a tailored small language model (SLM). HomeLLaMA learns from cloud LLMs to deliver satisfactory responses and enable user-friendly interactions. Once deployed, HomeLLaMA facilitates proactive interactions by continuously updating local SLMs and user profiles. To further enhance user experience while protecting their privacy, we develop PrivShield to offer an optional privacy-preserving LLM-based smart home serving for those users, who are unsatisfied with local responses and willing to send less-sensitive queries to remote servers. For evaluation, we build a comprehensive benchmark DevFinder to assess the service quality. Extensive experiments and user studies (M=100) demonstrate that HomeLLaMA can provide personalized services while significantly enhancing user privacy.
zh
[AI-136] ODIA: Oriented Distillation for Inline Acceleration of LLM -based Function Calling
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在调用函数时存在的高延迟问题,这一问题严重影响了用户体验。论文提出的解决方案关键在于一种名为面向内联加速的定向蒸馏(Oriented Distillation for Inline Acceleration, ODIA)的方法,该方法通过从在线用户交互数据中自动识别“简单查询”,并将知识从大模型蒸馏到小模型,从而显著降低响应延迟,同时保持模型的准确性。
链接: https://arxiv.org/abs/2507.08877
作者: Hanlong Zhang,Jingsheng Yang,Hao Li,Yuhao He,Franck Gong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Function Calling is a crucial technique that enables Large Language Models (LLMs) to interact with external systems through APIs. However, the high latency associated with LLM-based Function Calling significantly impacts user experience. This paper presents a novel approach called Oriented Distillation for Inline Acceleration (ODIA) that leverages online user interaction data to accelerate Function Calling. By automatically identifying “simple queries” from production traffic and distilling knowledge from larger models to smaller ones, our method reduces response latency by 45% (expected) and 78% (median) while maintaining accuracy. We demonstrate the effectiveness of our approach through real-world deployment in a music application, where the smaller model successfully handles 60% of traffic with negligible accuracy loss. Our method requires minimal human intervention and continuously improves through automated data collection and model updating, making it a practical solution for production environments.
zh
[AI-137] A New Approach for Multicriteria Assessment in the Ranking of Alternatives Using Cardinal and Ordinal Data
【速读】:该论文试图解决多准则评估(Multi-Criteria Assessment, MCA)中由于方法依赖假设和主观判断而带来的复杂评价挑战,以及在实际应用中如何有效整合定量与定性标准的问题。解决方案的关键在于提出一种结合两种虚拟差距分析(Virtual Gap Analysis, VGA)模型的新型MCA方法,该方法基于线性规划框架,旨在提高评估的效率和公平性,确保评价结果的全面性和可靠性。
链接: https://arxiv.org/abs/2507.08875
作者: Fuh-Hwa Franklin Liu,Su-Chuan Shih
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 38 pages, 6 figures, 5 table. A practice applicable method for multi-criteria assessments using cardinal and ordinal data
Abstract:Modern methods for multi-criteria assessment (MCA), such as Data Envelopment Analysis (DEA), Stochastic Frontier Analysis (SFA), and Multiple Criteria Decision-Making (MCDM), are utilized to appraise a collection of Decision-Making Units (DMUs), also known as alternatives, based on several criteria. These methodologies inherently rely on assumptions and can be influenced by subjective judgment to effectively tackle the complex evaluation challenges in various fields. In real-world scenarios, it is essential to incorporate both quantitative and qualitative criteria as they consist of cardinal and ordinal data. Despite the inherent variability in the criterion values of different alternatives, the homogeneity assumption is often employed, significantly affecting evaluations. To tackle these challenges and determine the most appropriate alternative, we propose a novel MCA approach that combines two Virtual Gap Analysis (VGA) models. The VGA framework, rooted in linear programming, is pivotal in the MCA methodology. This approach improves efficiency and fairness, ensuring that evaluations are both comprehensive and dependable, thus offering a strong and adaptive solution. Two comprehensive numerical examples demonstrate the accuracy and transparency of our proposed method. The goal is to encourage continued advancement and stimulate progress in automated decision systems and decision support systems.
zh
[AI-138] Contrastive Language-Image Pre-Training Model based Semantic Communication Performance Optimization
【速读】:该论文试图解决在噪声无线网络中实现高效语义通信的问题,特别是在语义信息易受无线噪声影响且语义信息传输频谱资源有限的情况下,如何优化CLIP模型架构与频谱资源块(RB)分配以最大化语义通信性能。解决方案的关键在于采用基于近端策略优化(PPO)的强化学习算法,通过学习无线噪声对语义通信性能的影响,为每个用户找到最优的CLIP模型和RB配置。
链接: https://arxiv.org/abs/2507.08873
作者: Shaoran Yang,Dongyu Wei,Hanzhi Yu,Zhaohui Yang,Yuchen Liu,Mingzhe Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE GLOBECOM 2025
Abstract:In this paper, a novel contrastive language-image pre-training (CLIP) model based semantic communication framework is designed. Compared to standard neural network (e.g.,convolutional neural network) based semantic encoders and decoders that require joint training over a common dataset, our CLIP model based method does not require any training procedures thus enabling a transmitter to extract data meanings of the original data without neural network model training, and the receiver to train a neural network for follow-up task implementation without the communications with the transmitter. Next, we investigate the deployment of the CLIP model based semantic framework over a noisy wireless network. Since the semantic information generated by the CLIP model is susceptible to wireless noise and the spectrum used for semantic information transmission is limited, it is necessary to jointly optimize CLIP model architecture and spectrum resource block (RB) allocation to maximize semantic communication performance while considering wireless noise, the delay and energy used for semantic communication. To achieve this goal, we use a proximal policy optimization (PPO) based reinforcement learning (RL) algorithm to learn how wireless noise affect the semantic communication performance thus finding optimal CLIP model and RB for each user. Simulation results show that our proposed method improves the convergence rate by up to 40%, and the accumulated reward by 4x compared to soft actor-critic.
zh
[AI-139] Next-Generation Travel Demand Modeling with a Generative Framework for Household Activity Coordination
【速读】:该论文旨在解决传统活动基础模型(Activity-Based Models, ABMs)在开发成本高、适应性差以及依赖简化规则和假设的问题。其解决方案的关键在于提出一种基于学习的出行需求建模框架,该框架能够根据家庭的人口社会经济特征合成协调的日常活动模式,并将人口合成、活动生成、地点分配和大规模微观交通仿真整合为一个统一系统,实现了生成式 AI (Generative AI) 驱动、数据驱动、可扩展且可迁移至其他区域的建模方法。
链接: https://arxiv.org/abs/2507.08871
作者: Xishun Liao,Haoxuan Ma,Yifan Liu,Yuxiang Wei,Brian Yueshuai He,Chris Stanford,Jiaqi Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures
Abstract:Travel demand models are critical tools for planning, policy, and mobility system design. Traditional activity-based models (ABMs), although grounded in behavioral theories, often rely on simplified rules and assumptions, and are costly to develop and difficult to adapt across different regions. This paper presents a learning-based travel demand modeling framework that synthesizes household-coordinated daily activity patterns based on a household’s socio-demographic profiles. The whole framework integrates population synthesis, coordinated activity generation, location assignment, and large-scale microscopic traffic simulation into a unified system. It is fully generative, data-driven, scalable, and transferable to other regions. A full-pipeline implementation is conducted in Los Angeles with a 10 million population. Comprehensive validation shows that the model closely replicates real-world mobility patterns and matches the performance of legacy ABMs with significantly reduced modeling cost and greater scalability. With respect to the SCAG ABM benchmark, the origin-destination matrix achieves a cosine similarity of 0.97, and the daily vehicle miles traveled (VMT) in the network yields a 0.006 Jensen-Shannon Divergence (JSD) and a 9.8% mean absolute percentage error (MAPE). When compared to real-world observations from Caltrans PeMS, the evaluation on corridor-level traffic speed and volume reaches a 0.001 JSD and a 6.11% MAPE.
zh
[AI-140] Privacy-Utility-Fairness: A Balanced Approach to Vehicular-Traffic Management System
【速读】:该论文旨在解决基于位置的车辆交通管理中敏感地理数据保护与交通管理效用及区域公平性之间的平衡问题。现有解决方案在抵御链接攻击和人口统计偏差方面存在不足,导致隐私泄露和数据分析中的不平等现象。论文提出了一种新算法,其关键在于结合基于查询的数据访问、迭代洗牌和校准噪声注入的差分隐私技术,通过实施拉普拉斯机制确保满足ε-差分隐私标准,从而在保护敏感地理数据的同时保持数据效用和区域公平性。
链接: https://arxiv.org/abs/2507.08864
作者: Poushali Sengupta,Sabita Maharjan,frank Eliassen,Yan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: accepted in VTC 2025 Spring, Oslo, Norway
Abstract:Location-based vehicular traffic management faces significant challenges in protecting sensitive geographical data while maintaining utility for traffic management and fairness across regions. Existing state-of-the-art solutions often fail to meet the required level of protection against linkage attacks and demographic biases, leading to privacy leakage and inequity in data analysis. In this paper, we propose a novel algorithm designed to address the challenges regarding the balance of privacy, utility, and fairness in location-based vehicular traffic management systems. In this context, utility means providing reliable and meaningful traffic information, while fairness ensures that all regions and individuals are treated equitably in data use and decision-making. Employing differential privacy techniques, we enhance data security by integrating query-based data access with iterative shuffling and calibrated noise injection, ensuring that sensitive geographical data remains protected. We ensure adherence to epsilon-differential privacy standards by implementing the Laplace mechanism. We implemented our algorithm on vehicular location-based data from Norway, demonstrating its ability to maintain data utility for traffic management and urban planning while ensuring fair representation of all geographical areas without being overrepresented or underrepresented. Additionally, we have created a heatmap of Norway based on our model, illustrating the privatized and fair representation of the traffic conditions across various cities. Our algorithm provides privacy in vehicular traffic
zh
[AI-141] Foundation models for time series forecasting: Application in conformal prediction
【速读】:该论文试图解决时间序列预测中置信区间可靠性不足的问题,特别是在数据量有限的情况下。其解决方案的关键在于利用基础模型(Foundation Models, FMs)的零样本能力,通过更高效的预测性能和更稳定的校准过程来提升符合预测(Conformal Prediction)的可靠性。相比于传统统计模型和梯度提升方法,TSFMs在数据稀缺时表现出更优的预测精度和校准稳定性,从而显著改善了预测区间的可靠性。
链接: https://arxiv.org/abs/2507.08858
作者: Sami Achour,Yassine Bouher,Duong Nguyen,Nicolas Chesneau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:The zero-shot capabilities of foundation models (FMs) for time series forecasting offer promising potentials in conformal prediction, as most of the available data can be allocated to calibration. This study compares the performance of Time Series Foundation Models (TSFMs) with traditional methods, including statistical models and gradient boosting, within a conformal prediction setting. Our findings highlight two key advantages of TSFMs. First, when the volume of data is limited, TSFMs provide more reliable conformalized prediction intervals than classic models, thanks to their superior predictive accuracy. Second, the calibration process is more stable because more data are used for calibration. Morever, the fewer data available, the more pronounced these benefits become, as classic models require a substantial amount of data for effective training. These results underscore the potential of foundation models in improving conformal prediction reliability in time series applications, particularly in data-constrained cases. All the code to reproduce the experiments is available.
zh
[AI-142] Clio-X: AWeb3 Solution for Privacy-Preserving AI Access to Digital Archives
【速读】:该论文试图解决在数字档案管理中,随着人工智能技术的广泛应用,隐私风险对数据主权和伦理责任带来的挑战。解决方案的关键在于提出Clio-X,这是一个基于Web3架构的去中心化、以隐私为核心的数字方案,旨在将隐私增强技术(PETs)嵌入档案工作流程,并支持人工智能驱动的参考与访问服务。通过集成技术保障与社区监督,Clio-X提供了一种在文化遗产领域伦理部署人工智能的新模型。
链接: https://arxiv.org/abs/2507.08853
作者: Victoria L. Lemieux,Rosa Gil,Faith Molosiwa,Qihong Zhou,Binming Li,Roberto Garcia,Luis De La Torre Cubillo,Zehua Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL)
备注: 28 pages, 8 figures
Abstract:As archives turn to artificial intelligence to manage growing volumes of digital records, privacy risks inherent in current AI data practices raise critical concerns about data sovereignty and ethical accountability. This paper explores how privacy-enhancing technologies (PETs) and Web3 architectures can support archives to preserve control over sensitive content while still being able to make it available for access by researchers. We present Clio-X, a decentralized, privacy-first Web3 digital solution designed to embed PETs into archival workflows and support AI-enabled reference and access. Drawing on a user evaluation of a medium-fidelity prototype, the study reveals both interest in the potential of the solution and significant barriers to adoption related to trust, system opacity, economic concerns, and governance. Using Rogers’ Diffusion of Innovation theory, we analyze the sociotechnical dimensions of these barriers and propose a path forward centered on participatory design and decentralized governance through a Clio-X Decentralized Autonomous Organization. By integrating technical safeguards with community-based oversight, Clio-X offers a novel model to ethically deploy AI in cultural heritage contexts.
zh
[AI-143] Assuring the Safety of Reinforcement Learning Components: AMLAS-RL
【速读】:该论文试图解决在安全关键型系统中集成强化学习(Reinforcement Learning, RL)所带来的安全性和保证性挑战。现有方法在RL生命周期中缺乏系统性的安全保障,而传统的监督学习安全保证方法如AMLAS(Assuring Machine Learning Safety)无法直接适用于RL的独特问题。解决方案的关键是将AMLAS方法适配为AMLAS-RL框架,通过迭代过程生成针对RL使能系统的保证论据,从而提供结构化的安全保障。
链接: https://arxiv.org/abs/2507.08848
作者: Calum Corrie Imrie,Ioannis Stefanakos,Sepeedeh Shahbeigi,Richard Hawkins,Simon Burton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
备注:
Abstract:The rapid advancement of machine learning (ML) has led to its increasing integration into cyber-physical systems (CPS) across diverse domains. While CPS offer powerful capabilities, incorporating ML components introduces significant safety and assurance challenges. Among ML techniques, reinforcement learning (RL) is particularly suited for CPS due to its capacity to handle complex, dynamic environments where explicit models of interaction between system and environment are unavailable or difficult to construct. However, in safety-critical applications, this learning process must not only be effective but demonstrably safe. Safe-RL methods aim to address this by incorporating safety constraints during learning, yet they fall short in providing systematic assurance across the RL lifecycle. The AMLAS methodology offers structured guidance for assuring the safety of supervised learning components, but it does not directly apply to the unique challenges posed by RL. In this paper, we adapt AMLAS to provide a framework for generating assurance arguments for an RL-enabled system through an iterative process; AMLAS-RL. We demonstrate AMLAS-RL using a running example of a wheeled vehicle tasked with reaching a target goal without collision.
zh
[AI-144] DAFOS: Dynamic Adaptive Fanout Optimization Sampler
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理图结构数据时,由于统一的邻居采样和静态扇出(fanout)设置导致的可扩展性和效率受限问题。其解决方案的关键在于提出动态自适应扇出优化采样器(Dynamic Adaptive Fanout Optimization Sampler, DAFOS),该方法根据模型性能动态调整扇出,并在训练过程中优先考虑重要节点,通过基于节点度的节点评分机制将计算资源集中在结构重要的节点上,同时引入早期停止机制以提升训练效率。
链接: https://arxiv.org/abs/2507.08845
作者: Irfan Ullah,Young-Koo Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) are becoming an essential tool for learning from graph-structured data, however uniform neighbor sampling and static fanout settings frequently limit GNNs’ scalability and efficiency. In this paper, we propose the Dynamic Adaptive Fanout Optimization Sampler (DAFOS), a novel approach that dynamically adjusts the fanout based on model performance and prioritizes important nodes during training. Our approach leverages node scoring based on node degree to focus computational resources on structurally important nodes, incrementing the fanout as the model training progresses. DAFOS also integrates an early stopping mechanism to halt training when performance gains diminish. Experiments conducted on three benchmark datasets, ogbnarxiv, Reddit, and ogbn-products, demonstrate that our approach significantly improves training speed and accuracy compared to a state-of-the-art approach. DAFOS achieves a 3.57x speedup on the ogbn-arxiv dataset and a 12.6x speedup on the Reddit dataset while improving the F1 score from 68.5% to 71.21% on ogbn-arxiv and from 73.78% to 76.88% on the ogbn-products dataset, respectively. These results highlight the potential of DAFOS as an efficient and scalable solution for large-scale GNN training.
zh
[AI-145] Can We Predict Your Next Move Without Breaking Your Privacy?
【速读】:该论文试图解决移动性建模中的位置预测问题,即在保护用户隐私的前提下实现高精度的下一位置预测(Next-Location Prediction, NxLP)。解决方案的关键在于提出FLLL3M——一种结合联邦学习(Federated Learning)与大语言模型(Large Language Models, LLMs)的隐私保护框架,通过在本地保留用户数据并利用高效的外积机制调用LLMs,从而在保证高预测精度的同时显著降低资源消耗。
链接: https://arxiv.org/abs/2507.08843
作者: Arpita Soni,Sahil Tripathi,Gautam Siddharth Kashyap,Manaswi Kulahara,Mohammad Anas Azeez,Zohaib Hasan Siddiqui,Nipun Joshi,Jiechao Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in the 17th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2025), scheduled for 25 - 28 August 2025 in Ontario, Canada
Abstract:We propose FLLL3M–Federated Learning with Large Language Models for Mobility Modeling–a privacy-preserving framework for Next-Location Prediction (NxLP). By retaining user data locally and leveraging LLMs through an efficient outer product mechanism, FLLL3M ensures high accuracy with low resource demands. It achieves SOT results on Gowalla (Acc@1: 12.55, MRR: 0.1422), WeePlace (10.71, 0.1285), Brightkite (10.42, 0.1169), and FourSquare (8.71, 0.1023), while reducing parameters by up to 45.6% and memory usage by 52.7%.
zh
[AI-146] Gradients as an Action: Towards Communication-Efficient Federated Recommender Systems via Adaptive Action Sharing KDD2025
【速读】:该论文旨在解决联邦推荐系统(FedRecs)中两个主要问题:一是由于推荐系统中涉及大量物品嵌入而导致的极高通信开销,二是由于异构网络环境和客户端设备的纠缠导致的训练效率低下。其解决方案的关键在于提出一种名为FedRAS的通信高效联邦推荐框架,该框架采用动作共享策略,将物品嵌入的梯度聚类为一定数量的模型更新动作进行通信,而非直接压缩物品嵌入。这种方法通过限制梯度方向(即动作空间)来引入更小的误差,从而在降低通信负载的同时保持推荐性能。此外,FedRAS还集成了自适应聚类机制,以适应不同的设备和网络环境。
链接: https://arxiv.org/abs/2507.08842
作者: Zhufeng Lu,Chentao Jia,Ming Hu,Xiaofei Xie,Mingsong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ACM SIGKDD 2025
Abstract:As a promising privacy-aware collaborative model training paradigm, Federated Learning (FL) is becoming popular in the design of distributed recommender systems. However, Federated Recommender Systems (FedRecs) greatly suffer from two major problems: i) extremely high communication overhead due to massive item embeddings involved in recommendation systems, and ii) intolerably low training efficiency caused by the entanglement of both heterogeneous network environments and client devices. Although existing methods attempt to employ various compression techniques to reduce communication overhead, due to the parameter errors introduced by model compression, they inevitably suffer from model performance degradation. To simultaneously address the above problems, this paper presents a communication-efficient FedRec framework named FedRAS, which adopts an action-sharing strategy to cluster the gradients of item embedding into a specific number of model updating actions for communication rather than directly compressing the item embeddings. In this way, the cloud server can use the limited actions from clients to update all the items. Since gradient values are significantly smaller than item embeddings, constraining the directions of gradients (i.e., the action space) introduces smaller errors compared to compressing the entire item embedding matrix into a reduced space. To accommodate heterogeneous devices and network environments, FedRAS incorporates an adaptive clustering mechanism that dynamically adjusts the number of actions. Comprehensive experiments on well-known datasets demonstrate that FedRAS can reduce the size of communication payloads by up to 96.88%, while not sacrificing recommendation performance within various heterogeneous scenarios. We have open-sourced FedRAS at this https URL.
zh
[AI-147] Domain-Adaptive Diagnosis of Lewy Body Disease with Transferability Aware Transformer MICCAI2025
【速读】:该论文试图解决路易体病(Lewy Body Disease, LBD)在诊断中因数据稀缺而导致的深度学习效果受限问题,以及在跨领域数据(如阿尔茨海默病,Alzheimer’s Disease, AD)迁移时所面临的领域偏移(domain shift)问题。解决方案的关键在于提出一种感知迁移性的Transformer模型(Transferability Aware Transformer, TAT),该模型通过结构连接性(structural connectivity, SC)特征进行训练,并利用注意力机制自适应地赋予可转移特征更高的权重,同时抑制领域特定特征,从而减轻领域偏移并提升在有限LBD数据下的诊断准确性。
链接: https://arxiv.org/abs/2507.08839
作者: Xiaowei Yu,Jing Zhang,Tong Chen,Yan Zhuang,Minheng Chen,Chao Cao,Yanjun Lyu,Lu Zhang,Li Su,Tianming Liu,Dajiang Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: MICCAI 2025
Abstract:Lewy Body Disease (LBD) is a common yet understudied form of dementia that imposes a significant burden on public health. It shares clinical similarities with Alzheimer’s disease (AD), as both progress through stages of normal cognition, mild cognitive impairment, and dementia. A major obstacle in LBD diagnosis is data scarcity, which limits the effectiveness of deep learning. In contrast, AD datasets are more abundant, offering potential for knowledge transfer. However, LBD and AD data are typically collected from different sites using different machines and protocols, resulting in a distinct domain shift. To effectively leverage AD data while mitigating domain shift, we propose a Transferability Aware Transformer (TAT) that adapts knowledge from AD to enhance LBD diagnosis. Our method utilizes structural connectivity (SC) derived from structural MRI as training data. Built on the attention mechanism, TAT adaptively assigns greater weights to disease-transferable features while suppressing domain-specific ones, thereby reducing domain shift and improving diagnostic accuracy with limited LBD data. The experimental results demonstrate the effectiveness of TAT. To the best of our knowledge, this is the first study to explore domain adaptation from AD to LBD under conditions of data scarcity and domain shift, providing a promising framework for domain-adaptive diagnosis of rare diseases.
zh
[AI-148] wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
【速读】:该论文试图解决通过强化学习(Reinforcement Learning, RL)提升基于扩散的大型语言模型(diffusion-based large language models, dLLMs)推理能力的问题,其中主要挑战在于dLLMs似然函数的不可处理性,导致在每一步策略优化中需要对当前策略、旧策略和参考策略的似然进行近似,从而引入额外的计算开销和潜在的大偏差。解决方案的关键在于提出一种名为wd1的新策略优化方法,该方法将目标重新表述为加权似然形式,仅需对当前参数化策略的似然进行一次近似,从而有效降低了计算复杂度并提高了性能。实验表明,wd1在无需监督微调(SFT)或任何监督数据的情况下,显著优于现有的dLLMs强化学习方法。
链接: https://arxiv.org/abs/2507.08838
作者: Xiaohang Tang,Rares Dolga,Sangwoong Yoon,Ilija Bogunovic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Preprint
Abstract:Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of dLLMs likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and lead to potentially large bias – particularly when approximation errors occur in the denominator of policy ratios used for importance sampling. To mitigate these issues, we introduce \mathttwd1 , a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation for the current parametrized policy likelihood. Experiments on widely used reasoning benchmarks demonstrate that \mathttwd1 , without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs, achieving up to 16% higher accuracy. \mathttwd1 delivers additional computational gains, including reduced training time and fewer function evaluations (NFEs) per gradient step. These findings, combined with the simplicity of method’s implementation and R1-Zero-like training (no SFT), position \mathttwd1 as a more effective and efficient method for applying RL to dLLMs reasoning.
zh
[AI-149] Representation learning with a transformer by contrastive learning for money laundering detection
【速读】:该论文试图解决洗钱检测问题,其解决方案的关键在于引入一种利用变压器(Transformer)神经网络处理定性和定量数据的结构化时间序列的新流程。该流程首先通过对比学习(无任何标签)学习时间序列的表示,随后利用这些表示为所有观测生成洗钱评分,并采用双阈值方法结合Benjamini-Hochberg(BH)过程以控制假阳性率。实验表明,该方法能够在极少领域专家监督的情况下有效捕捉洗钱模式,并展现出优于基于规则或LSTM架构的方法在检测非欺诈者与欺诈者方面的性能。
链接: https://arxiv.org/abs/2507.08835
作者: Harold Guéneau(SAMM),Alain Celisse(LPP, MODAL),Pascal Delange
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Risk Management (q-fin.RM); Statistical Finance (q-fin.ST)
备注:
Abstract:The present work tackles the money laundering detection problem. A new procedure is introduced which exploits structured time series of both qualitative and quantitative data by means of a transformer neural network. The first step of this procedure aims at learning representations of time series through contrastive learning (without any labels). The second step leverages these representations to generate a money laundering scoring of all observations. A two-thresholds approach is then introduced, which ensures a controlled false-positive rate by means of the Benjamini-Hochberg (BH) procedure. Experiments confirm that the transformer is able to produce general representations that succeed in exploiting money laundering patterns with minimal supervision from domain experts. It also illustrates the higher ability of the new procedure for detecting nonfraudsters as well as fraudsters, while keeping the false positive rate under control. This greatly contrasts with rule-based procedures or the ones based on LSTM architectures.
zh
[AI-150] Efficient Triple Modular Redundancy for Reliability Enhancement of DNNs Using Explainable AI
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在安全关键领域中应对位翻转故障(bit-flip faults)时的可靠性问题,其核心挑战在于如何高效地应用三模冗余(Triple Modular Redundancy, TMR)以降低计算和资源开销。解决方案的关键在于利用可解释人工智能(Explainable Artificial Intelligence, XAI)方法,特别是基于梯度的层相关性传播(Layer-wise Relevance Propagation, LRP)技术,对DNN参数的重要性进行评估,并据此选择性地对关键权重进行TMR保护,从而在保持较低 overhead 的同时显著提升模型的可靠性。
链接: https://arxiv.org/abs/2507.08829
作者: Kimia Soroush,Nastaran Shirazi,Mohsen Raji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Neural Networks (DNNs) are widely employed in safety-critical domains, where ensuring their reliability is essential. Triple Modular Redundancy (TMR) is an effective technique to enhance the reliability of DNNs in the presence of bit-flip faults. In order to handle the significant overhead of TMR, it is applied selectively on the parameters and components with the highest contribution at the model output. Hence, the accuracy of the selection criterion plays the key role on the efficiency of TMR. This paper presents an efficient TMR approach to enhance the reliability of DNNs against bit-flip faults using an Explainable Artificial Intelligence (XAI) method. Since XAI can provide valuable insights about the importance of individual neurons and weights in the performance of the network, they can be applied as the selection metric in TMR techniques. The proposed method utilizes a low-cost, gradient-based XAI technique known as Layer-wise Relevance Propagation (LRP) to calculate importance scores for DNN parameters. These scores are then used to enhance the reliability of the model, with the most critical weights being protected by TMR. The proposed approach is evaluated on two DNN models, VGG16 and AlexNet, using datasets such as MNIST and CIFAR-10. The results demonstrate that the method can protect the AlexNet model at a bit error rate of 10-4, achieving over 60% reliability improvement while maintaining the same overhead as state-of-the-art methods.
zh
[AI-151] Accurate generation of chemical reaction transition states by conditional flow matching
【速读】:该论文试图解决过渡态(Transition State, TS)结构难以通过实验直接观测,且传统密度泛函理论(Density Functional Theory, DFT)计算成本高、耗时长的问题。其解决方案的关键在于提出TS-GEN,一个基于条件流匹配的生成模型,能够通过单次确定性过程将简单的高斯先验样本直接映射到过渡态鞍点几何结构,通过嵌入反应物和产物构象作为条件信息,学习通过最优传输路径将潜在噪声转化为真实的TS结构,从而替代传统的非平衡弹性带或字符串方法中的迭代优化过程。
链接: https://arxiv.org/abs/2507.10530
作者: Ping Tuo,Jiale Chen,Ju Li
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Transition state (TS) structures define the critical geometries and energy barriers underlying chemical reactivity, yet their fleeting nature renders them experimentally elusive and drives the reliance on costly, high-throughput density functional theory (DFT) calculations. Here, we introduce TS-GEN, a conditional flow-matching generative model that maps samples from a simple Gaussian prior directly to transition-state saddle-point geometries in a single, deterministic pass. By embedding both reactant and product conformations as conditioning information, TS-GEN learns to transport latent noise to true TS structures via an optimal-transport path, effectively replacing the iterative optimization common in nudged-elastic band or string-method algorithms. TS-GEN delivers unprecedented accuracy, achieving a root-mean-square deviation of 0.004\ \rm\mathringA (vs. 0.103\ \rm\mathringA for prior state-of-the-art) and a mean barrier-height error of 1.019\ \rm kcal/mol (vs. 2.864\ \rm kcal/mol ), while requiring only 0.06\ \rm s GPU time per inference. Over 87% of generated TSs meet chemical-accuracy criteria ( 1.58\ \rm kcal/mol error), substantially outpacing existing methods. TS-GEN also exhibits strong transferability to out-of-distribution reactions from a larger database. By uniting sub-angstrom precision, sub-second speed, and broad applicability, TS-GEN will be highly useful for high-throughput exploration of complex reaction networks, paving the way to the exploration of novel chemical reaction mechanisms.
zh
[AI-152] he Second Machine Turn: From Checking Proofs to Creating Concepts
【速读】:该论文试图解决如何让人工智能(Artificial Intelligence, AI)从仅验证数学证明的阶段进一步发展为能够自动创建数学概念的问题。其解决方案的关键在于探索将数学概念的生成过程进行形式化和数学化,从而使得AI能够在数学发现过程中扮演更主动的角色。
链接: https://arxiv.org/abs/2507.10179
作者: Asvin G
机构: 未知
类目: History and Overview (math.HO); Artificial Intelligence (cs.AI)
备注:
Abstract:We identify a second machine turn in the process of mathematical discovery: after automating proof-checking, AI is now poised to automate the creation of mathematical concepts themselves. We discuss the current state of the art, obstacles and potential solutions as well as a preliminary attempt at mathematizing the creation of concepts itself. The paper ends with an assessment of how these capabilities could reshape mathematics and human-machine collaboration, and a few different futures we might find ourselves in.
zh
[AI-153] A PBN-RL-XAI Framework for Discovering a "Hit-and-Run Therapeutic Strategy in Melanoma
【速读】:该论文试图解决转移性黑色素瘤患者对抗PD-1免疫治疗先天耐药的问题,其核心在于揭示调控治疗反应的分子网络机制。解决方案的关键是构建了一个基于患者肿瘤活检转录组数据的动态概率布尔网络模型,结合强化学习代理系统以发现多步骤最优治疗干预策略,并利用可解释的人工智能技术对代理的控制策略进行机制解析,最终识别出一种精确时间控制的四步暂时性抑制赖氨酸氧化酶样2蛋白(LOXL2)的“击打与离开”干预策略,该策略能够消除驱动耐药的分子特征并促使网络自我修复。
链接: https://arxiv.org/abs/2507.10136
作者: Zhonglin Liu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures. Submitted to the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2025. Code is available at this https URL
Abstract:Innate resistance to anti-PD-1 immunotherapy remains a major clinical challenge in metastatic melanoma, with the underlying molecular networks being poorly understood. To address this, we constructed a dynamic Probabilistic Boolean Network model using transcriptomic data from patient tumor biopsies to elucidate the regulatory logic governing therapy response. We then employed a reinforcement learning agent to systematically discover optimal, multi-step therapeutic interventions and used explainable artificial intelligence to mechanistically interpret the agent’s control policy. The analysis revealed that a precisely timed, 4-step temporary inhibition of the lysyl oxidase like 2 protein (LOXL2) was the most effective strategy. Our explainable analysis showed that this ``hit-and-run" intervention is sufficient to erase the molecular signature driving resistance, allowing the network to self-correct without requiring sustained intervention. This study presents a novel, time-dependent therapeutic hypothesis for overcoming immunotherapy resistance and provides a powerful computational framework for identifying non-obvious intervention protocols in complex biological systems.
zh
[AI-154] Evolution of Fear and Social Rewards in Prey-Predator Relationship
【速读】:该论文试图解决恐惧情绪与环境条件、其他奖励机制(如食物奖励和社会奖励)之间的进化关系问题。其解决方案的关键在于开发了一个分布式进化模拟系统,其中捕食者和被捕食者代理共同进化它们的先天奖励函数,包括可能具有恐惧特征的观察捕食者的负面奖励,并通过强化学习学习行为。该模拟揭示了社会奖励在被捕食者生存中的重要性,以及恐惧-like奖励在获得社会奖励之后才演化出现的因果关系。
链接: https://arxiv.org/abs/2507.09992
作者: Yuji Kanagawa,Kenji Doya
机构: 未知
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Preprint. Under review
Abstract:Fear is a critical brain function for detecting danger and learning to avoid specific stimuli that can lead to danger. While fear is believed to have evolved under pressure from predators, experimentally reproducing the evolution is challenging. To investigate the relationship between environmental conditions, the evolution of fear, and the evolution of other rewards, such as food reward and social reward, we developed a distributed evolutionary simulation. In our simulation, prey and predator agents co-evolve their innate reward functions, including a possibly fear-like term for observing predators, and learn behaviors via reinforcement learning. Surprisingly, our simulation revealed that social reward for observing the same species is more important for prey to survive, and fear-like negative reward for observing predators evolves only after acquiring social reward. We also found that the predator with increased hunting ability (larger mouth) amplified fear emergence, but also that fear evolution is more stable with non-evolving predators that are bad at chasing prey. Additionally, unlike for predators, we found that positive rewards evolve in opposition to fear for stationary threats, as areas with abundant leftover food develop around them. These findings suggest that fear and social reward have had a complex interplay with each other through evolution, along with the nature of predators and threats.
zh
[AI-155] Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization
【速读】:该论文试图解决语音增强(Speech Enhancement, SE)中感知质量与模型优化目标不一致的问题。传统基于语言模型(Language Models, LMs)的SE方法通常聚焦于最大化干净语音标记的概率,这可能导致模型输出与人类感知存在偏差,从而影响语音质量。论文提出的解决方案关键在于引入直接偏好优化(Direct Preference Optimization, DPO),利用UTMOS(一种神经MOS预测模型)作为人类评分的代理,引导优化过程向更符合人类感知的输出方向进行,从而提升增强语音的感知质量。
链接: https://arxiv.org/abs/2507.09929
作者: Haoyang Li,Nana Hou,Yuchen Hu,Jixun Yao,Sabato Marco Siniscalchi,Eng Siong Chng
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This work investigates speech enhancement (SE) from the perspective of language models (LMs). We propose a novel method that leverages Direct Preference Optimization (DPO) to improve the perceptual quality of enhanced speech. Using UTMOS, a neural MOS prediction model, as a proxy for human ratings, our approach guides optimization toward perceptually preferred outputs. This differs from existing LM-based SE methods that focus on maximizing the likelihood of clean speech tokens, which may misalign with human perception and degrade quality despite low prediction error. Experiments on the 2020 Deep Noise Suppression Challenge test sets demonstrate that applying DPO to a pretrained LM-based SE model yields consistent improvements across various speech quality metrics, with relative gains of up to 56%. To our knowledge, this is the first application of DPO to SE and the first to incorporate proxy perceptual feedback into LM-based SE training, pointing to a promising direction for perceptually aligned SE.
zh
[AI-156] Sequence-Model-Guided Measurement Selection for Quantum State Learning
【速读】:该论文试图解决从实验数据中表征量子系统的问题,特别是如何选择最优的测量方式来获取数据。随着量子系统规模的增大,传统的优化方法变得难以实施。论文提出的解决方案关键在于引入一种具有序列模型架构的深度神经网络,该网络能够以数据驱动且自适应的方式搜索高效的测量选择。这种方法在多种任务中均表现出优于均匀随机选择的性能,尤其在拓扑量子系统中,模型倾向于推荐边界处的测量,暗示其可能独立发现了边界与体性质之间的联系。
链接: https://arxiv.org/abs/2507.09891
作者: Jiaxin Huang,Yan Zhu,Giulio Chiribella,Ya-Dong Wu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Characterization of quantum systems from experimental data is a central problem in quantum science and technology. But which measurements should be used to gather data in the first place? While optimal measurement choices can be worked out for small quantum systems, the optimization becomes intractable as the system size grows large. To address this problem, we introduce a deep neural network with a sequence model architecture that searches for efficient measurement choices in a data-driven, adaptive manner. The model can be applied to a variety of tasks, including the prediction of linear and nonlinear properties of quantum states, as well as state clustering and state tomography tasks. In all these tasks, we find that the measurement choices identified by our neural network consistently outperform the uniformly random choice. Intriguingly, for topological quantum systems, our model tends to recommend measurements at the system’s boundaries, even when the task is to predict bulk properties. This behavior suggests that the neural network may have independently discovered a connection between boundaries and bulk, without having been provided any built-in knowledge of quantum physics.
zh
[AI-157] Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis
【速读】:该论文试图解决在核苷酸序列分析中,传统自回归Transformer模型由于固定长度上下文窗口导致的长程依赖关系捕捉困难问题,以及标准自注意力机制在处理长序列时计算效率低下和缺乏全局转移一致性的问题。解决方案的关键在于引入CARMANIA(Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis),该框架通过在下一个token(NT)预测任务中引入转移矩阵(TM)损失,使模型预测的token转移与输入序列中经验得到的n-gram统计对齐,从而增强模型对高阶依赖关系的捕捉能力。
链接: https://arxiv.org/abs/2507.09378
作者: Mohammadsaleh Refahi,Mahdi Abavisani,Bahrad A. Sokhansanj,James R. Brown,Gail Rosen
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformers have revolutionized nucleotide sequence analysis, yet capturing long-range dependencies remains challenging. Recent studies show that autoregressive transformers often exhibit Markovian behavior by relying on fixed-length context windows for next-token prediction. However, standard self-attention mechanisms are computationally inefficient for long sequences due to their quadratic complexity and do not explicitly enforce global transition consistency. We introduce CARMANIA (Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis), a self-supervised pretraining framework that augments next-token (NT) prediction with a transition-matrix ™ loss. The TM loss aligns predicted token transitions with empirically derived n-gram statistics from each input sequence, encouraging the model to capture higher-order dependencies beyond local context. This integration enables CARMANIA to learn organism-specific sequence structures that reflect both evolutionary constraints and functional organization. We evaluate CARMANIA across diverse genomic tasks, including regulatory element prediction, functional gene classification, taxonomic inference, antimicrobial resistance detection, and biosynthetic gene cluster classification. CARMANIA outperforms the previous best long-context model by at least 7 percent, matches state-of-the-art on shorter sequences (exceeding prior results on 20 out of 40 tasks while running approximately 2.5 times faster), and shows particularly strong improvements on enhancer and housekeeping gene classification tasks, including up to a 34 percent absolute gain in Matthews correlation coefficient (MCC) for enhancer prediction. The TM loss boosts accuracy in 33 of 40 tasks, especially where local motifs or regulatory patterns drive prediction. Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.09378 [q-bio.GN] (or arXiv:2507.09378v1 [q-bio.GN] for this version) https://doi.org/10.48550/arXiv.2507.09378 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-158] A Framework for Predictive Directional Trading Based on Volatility and Causal Inference
【速读】:该论文试图解决金融市场上预测性领先-滞后关系的识别与利用问题,旨在通过结合先进统计方法与机器学习模型来提升股票间预测关系的识别和交易策略的有效性。解决方案的关键在于采用高斯混合模型(GMM)对股票进行基于历史波动率的聚类,并构建多阶段因果推断流程,包括格兰杰因果检验(GCT)、定制化的彼得-克拉克瞬时条件独立性检验(PCMCI)和有效转移熵(ETE),以识别稳健的预测关联,随后利用动态时间规整(DTW)和K近邻(KNN)分类器确定最优交易时间滞后。
链接: https://arxiv.org/abs/2507.09347
作者: Ivan Letteri
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Purpose: This study introduces a novel framework for identifying and exploiting predictive lead-lag relationships in financial markets. We propose an integrated approach that combines advanced statistical methodologies with machine learning models to enhance the identification and exploitation of predictive relationships between equities. Methods: We employed a Gaussian Mixture Model (GMM) to cluster nine prominent stocks based on their mid-range historical volatility profiles over a three-year period. From the resulting clusters, we constructed a multi-stage causal inference pipeline, incorporating the Granger Causality Test (GCT), a customised Peter-Clark Momentary Conditional Independence (PCMCI) test, and Effective Transfer Entropy (ETE) to identify robust, predictive linkages. Subsequently, Dynamic Time Warping (DTW) and a K-Nearest Neighbours (KNN) classifier were utilised to determine the optimal time lag for trade execution. The resulting strategy was rigorously backtested. Results: The proposed volatility-based trading strategy, tested from 8 June 2023 to 12 August 2023, demonstrated substantial efficacy. The portfolio yielded a total return of 15.38%, significantly outperforming the 10.39% return of a comparative Buy-and-Hold strategy. Key performance metrics, including a Sharpe Ratio up to 2.17 and a win rate up to 100% for certain pairs, confirmed the strategy’s viability. Conclusion: This research contributes a systematic and robust methodology for identifying profitable trading opportunities derived from volatility-based causal relationships. The findings have significant implications for both academic research in financial modelling and the practical application of algorithmic trading, offering a structured approach to developing resilient, data-driven strategies.
zh
[AI-159] From Classical Machine Learning to Emerging Foundation Models: Review on Multimodal Data Integration for Cancer Research
【速读】:该论文试图解决如何从多模态数据(包括基因组学、蛋白质组学、影像学和临床因素等)中提取可操作的见解这一关键问题,以推动癌症研究中的计算方法发展。其解决方案的关键在于系统性地回顾和分析当前广泛采用的多模态数据整合策略,并探讨机器学习与深度学习技术在癌症亚型分类、生物标志物发现、治疗指导和预后预测中的应用。此外,论文强调了从传统机器学习向基础模型(Foundation Models, FMs)的转变,认为当前最先进的整合方法为开发大规模预训练模型奠定了基础,这些模型有望进一步革新癌症研究领域。
链接: https://arxiv.org/abs/2507.09028
作者: Amgad Muneer,Muhammad Waqas,Maliazurina B Saad,Eman Showkatian,Rukhmini Bandyopadhyay,Hui Xu,Wentao Li,Joe Y Chang,Zhongxing Liao,Cara Haymaker,Luisa Solis Soto,Carol C Wu,Natalie I Vokes,Xiuning Le,Lauren A Byers,Don L Gibbons,John V Heymach,Jianjun Zhang,Jia Wu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 6 figures, 3 tables
Abstract:Cancer research is increasingly driven by the integration of diverse data modalities, spanning from genomics and proteomics to imaging and clinical factors. However, extracting actionable insights from these vast and heterogeneous datasets remains a key challenge. The rise of foundation models (FMs) – large deep-learning models pretrained on extensive amounts of data serving as a backbone for a wide range of downstream tasks – offers new avenues for discovering biomarkers, improving diagnosis, and personalizing treatment. This paper presents a comprehensive review of widely adopted integration strategies of multimodal data to assist advance the computational approaches for data-driven discoveries in oncology. We examine emerging trends in machine learning (ML) and deep learning (DL), including methodological frameworks, validation protocols, and open-source resources targeting cancer subtype classification, biomarker discovery, treatment guidance, and outcome prediction. This study also comprehensively covers the shift from traditional ML to FMs for multimodal integration. We present a holistic view of recent FMs advancements and challenges faced during the integration of multi-omics with advanced imaging data. We identify the state-of-the-art FMs, publicly available multi-modal repositories, and advanced tools and methods for data integration. We argue that current state-of-the-art integrative methods provide the essential groundwork for developing the next generation of large-scale, pre-trained models poised to further revolutionize oncology. To the best of our knowledge, this is the first review to systematically map the transition from conventional ML to advanced FM for multimodal data integration in oncology, while also framing these developments as foundational for the forthcoming era of large-scale AI models in cancer research.
zh
[AI-160] Bridging Literature and the Universe Via A Multi-Agent Large Language Model System
【速读】:该论文试图解决物理学研究中因宇宙学模拟及其相关软件复杂性增加而导致的参数配置效率低下问题,具体表现为研究人员需要从大量文献和用户手册中提取模拟参数,并将其转化为可执行脚本,这一过程耗时且容易出错。解决方案的关键在于引入SimAgents,这是一个多智能体系统,利用专门的大型语言模型(Large Language Model, LLM)代理进行物理推理、模拟软件验证和工具执行,通过结构化通信协作,确保提取的参数在物理上合理、内部一致且符合软件规范。
链接: https://arxiv.org/abs/2507.08958
作者: Xiaowen Zhang,Zhenyu Bi,Xuan Wang,Tiziana Di Matteo,Rupert A.C. Croft
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 6 pages, 4 figures
Abstract:As cosmological simulations and their associated software become increasingly complex, physicists face the challenge of searching through vast amounts of literature and user manuals to extract simulation parameters from dense academic papers, each using different models and formats. Translating these parameters into executable scripts remains a time-consuming and error-prone process. To improve efficiency in physics research and accelerate the cosmological simulation process, we introduce SimAgents, a multi-agent system designed to automate both parameter configuration from the literature and preliminary analysis for cosmology research. SimAgents is powered by specialized LLM agents capable of physics reasoning, simulation software validation, and tool execution. These agents collaborate through structured communication, ensuring that extracted parameters are physically meaningful, internally consistent, and software-compliant. We also construct a cosmological parameter extraction evaluation dataset by collecting over 40 simulations in published papers from Arxiv and leading journals that cover diverse simulation types. Experiments on the dataset demonstrate a strong performance of SimAgents, highlighting its effectiveness and potential to accelerate scientific research for physicists. Our demonstration video is available at: this https URL. The complete system and dataset are publicly available at this https URL.
zh
[AI-161] AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model
【速读】:该论文旨在解决蛋白质设计与工程中的高效性与功能性提升问题,特别是如何通过深度学习模型捕捉蛋白质的进化信号并生成结构和功能一致的蛋白质。其解决方案的关键在于构建AMix-1模型,该模型基于贝叶斯流网络(Bayesian Flow Networks)并采用系统化的训练方法,包括预训练扩展定律、涌现能力分析、上下文学习机制以及测试时扩展算法,从而实现了对蛋白质结构理解的逐步增强,并通过多序列比对(MSA)驱动的上下文学习策略,统一了蛋白质设计的框架,显著提升了蛋白质变体的功能性能。
链接: https://arxiv.org/abs/2507.08920
作者: Changze Lv,Jiang Zhou,Siyu Long,Lihao Wang,Jiangtao Feng,Dongyu Xue,Yu Pei,Hao Wang,Zherui Zhang,Yuchen Cai,Zhiqiang Gao,Ziyuan Ma,Jiakai Hu,Chaochen Gao,Jingjing Gong,Yuxuan Song,Shuyi Zhang,Xiaoqing Zheng,Deyi Xiong,Lei Bai,Ya-Qin Zhang,Wei-Ying Ma,Bowen Zhou,Hao Zhou
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in-context learning mechanism, and test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding via loss perspective, culminating in a strong 1.7-billion model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to 50\times activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.
zh
[AI-162] Generation of structure-guided pMHC-I libraries using Diffusion Models ICML
【速读】:该论文试图解决当前个性化疫苗和T细胞免疫疗法中因质谱和结合实验数据集的固有偏差而导致的新型肽配体发现受限的问题。其解决方案的关键在于引入一种基于结构引导的pMHC-I肽基准,该基准利用扩散模型根据晶体结构相互作用距离进行设计,从而实现对已知肽段的独立性评估,并展现出典型的锚定残基偏好,表明其具有结构泛化能力且不受实验数据集偏差的影响。
链接: https://arxiv.org/abs/2507.08902
作者: Sergio Mares,Ariel Espinoza Weinberger,Nilah M. Ioannidis
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: Accepted to the The 2nd Workshop on Generative AI and Biology ICML Workshop 2025
Abstract:Personalized vaccines and T-cell immunotherapies depend critically on identifying peptide-MHC class I (pMHC-I) interactions capable of eliciting potent immune responses. However, current benchmarks and models inherit biases present in mass-spectrometry and binding-assay datasets, limiting discovery of novel peptide ligands. To address this issue, we introduce a structure-guided benchmark of pMHC-I peptides designed using diffusion models conditioned on crystal structure interaction distances. Spanning twenty high-priority HLA alleles, this benchmark is independent of previously characterized peptides yet reproduces canonical anchor residue preferences, indicating structural generalization without experimental dataset bias. Using this resource, we demonstrate that state-of-the-art sequence-based predictors perform poorly at recognizing the binding potential of these structurally stable designs, indicating allele-specific limitations invisible in conventional evaluations. Our geometry-aware design pipeline yields peptides with high predicted structural integrity and higher residue diversity than existing datasets, representing a key resource for unbiased model training and evaluation. Our code, and data are available at: this https URL.
zh
[AI-163] Advancing network resilience theories with symbolized reinforcement learning
【速读】:该论文试图解决复杂网络在面对外部扰动、内部故障和环境变化时的鲁棒性衡量问题,旨在防止从物种灭绝到金融危机等系统性崩溃。现有理论仅从拓扑结构单一角度出发,忽略了系统动力学的关键作用,这是由于拓扑与动力学之间的耦合内在复杂性超出了人类分析方法的能力。该研究的关键在于提出了一种自动化的鲁棒性理论发现方法,该方法通过学习AI如何解决复杂的网络拆解问题,并将网络攻击策略符号化为理论公式,从而首次发现了同时考虑拓扑与动力学的鲁棒性理论。
链接: https://arxiv.org/abs/2507.08827
作者: Yu Zheng,Jingtao Ding,Depeng Jin,Jianxi Gao,Yong Li
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Many complex networks display remarkable resilience under external perturbations, internal failures and environmental changes, yet they can swiftly deteriorate into dysfunction upon the removal of a few keystone nodes. Discovering theories that measure network resilience offers the potential to prevent catastrophic collapses–from species extinctions to financial crise–with profound implications for real-world systems. Current resilience theories address the problem from a single perspective of topology, neglecting the crucial role of system dynamics, due to the intrinsic complexity of the coupling between topology and dynamics which exceeds the capabilities of human analytical methods. Here, we report an automatic method for resilience theory discovery, which learns from how AI solves a complicated network dismantling problem and symbolizes its network attack strategies into theoretical formulas. This proposed self-inductive approach discovers the first resilience theory that accounts for both topology and dynamics, highlighting how the correlation between node degree and state shapes overall network resilience, and offering insights for designing early warning signals of systematic collapses. Additionally, our approach discovers formulas that refine existing well-established resilience theories with over 37.5% improvement in accuracy, significantly advancing human understanding of complex networks with AI.
zh
机器学习
[LG-0] Fusing LLM Capabilities with Routing Data
链接: https://arxiv.org/abs/2507.10540
作者: Tao Feng,Haozhen Zhang,Zijie Lei,Pengrui Han,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Jiaxuan You
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid advancement of large language models (LLMs) has created a vibrant ecosystem of diverse architectures, each with unique strengths due to differences in design, training data, and objectives. However, most applications still rely on a single backend model, limiting coverage of capabilities and leading to inefficiencies in performance and token cost when tackling complex tasks. We highlight an underexploited opportunity: LLM routing data, produced when hosting platforms route diverse queries to different models, which can reveal comparative strengths across tasks. To address this, we propose FusionBench, a comprehensive routing benchmark covering 14 tasks across five domains with 20 open-source LLMs (8B to 671B parameters), capturing 103M tokens and summarizing reusable thought templates from top models. Building on this, we introduce FusionFactory, a systematic fusion framework with three levels: (1) query-level fusion, tailoring routers for each query using both direct responses and reasoning-augmented outputs; (2) thought-level fusion, leveraging abstract templates derived from top-performing LLMs’ answers to similar queries; and (3) model-level fusion, transferring capabilities between models via distillation, using top responses or highest judge scores as training data. Experiments show FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with optimal fusion configurations varying by benchmark, demonstrating the value of systematic LLM fusion in harnessing complementary strengths and improving overall performance.
[LG-1] Graph World Model
链接: https://arxiv.org/abs/2507.10539
作者: Tao Feng,Yexin Wu,Guanyu Lin,Jiaxuan You
类目: Machine Learning (cs.LG)
*备注:
Abstract:World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines’ performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at this https URL.
[LG-2] On the Performance of Differentially Private Optimization with Heavy-Tail Class Imbalance
链接: https://arxiv.org/abs/2507.10536
作者: Qiaoyue Tang,Alain Zhiyanov,Mathias Lécuyer
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we analyze the optimization behaviour of common private learning optimization algorithms under heavy-tail class imbalanced distribution. We show that, in a stylized model, optimizing with Gradient Descent with differential privacy (DP-GD) suffers when learning low-frequency classes, whereas optimization algorithms that estimate second-order information do not. In particular, DP-AdamBC that removes the DP bias from estimating loss curvature is a crucial component to avoid the ill-condition caused by heavy-tail class imbalance, and empirically fits the data better with \approx8% and \approx5% increase in training accuracy when learning the least frequent classes on both controlled experiments and real data respectively.
[LG-3] Split Happens: Combating Advanced Threats with Split Learning and Function Secret Sharing
链接: https://arxiv.org/abs/2507.10494
作者: Tanveer Khan,Mindaugas Budzys,Antonis Michalas
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Split Learning (SL) – splits a model into two distinct parts to help protect client data while enhancing Machine Learning (ML) processes. Though promising, SL has proven vulnerable to different attacks, thus raising concerns about how effective it may be in terms of data privacy. Recent works have shown promising results for securing SL through the use of a novel paradigm, named Function Secret Sharing (FSS), in which servers obtain shares of a function they compute and operate on a public input hidden with a random mask. However, these works fall short in addressing the rising number of attacks which exist on SL. In SplitHappens, we expand the combination of FSS and SL to U-shaped SL. Similarly to other works, we are able to make use of the benefits of SL by reducing the communication and computational costs of FSS. However, a U-shaped SL provides a higher security guarantee than previous works, allowing a client to keep the labels of the training data secret, without having to share them with the server. Through this, we are able to generalize the security analysis of previous works and expand it to different attack vectors, such as modern model inversion attacks as well as label inference attacks. We tested our approach for two different convolutional neural networks on different datasets. These experiments show the effectiveness of our approach in reducing the training time as well as the communication costs when compared to simply using FSS while matching prior accuracy.
[LG-4] Overcoming catastrophic forgetting in neural networks
链接: https://arxiv.org/abs/2507.10485
作者: Brandon Shuen Yi Loke,Filippo Quadri,Gabriel Vivanco,Maximilian Casagrande,Saúl Fenollosa
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 7 pages, 5 figures, EE-411 Fundamentals of inference and learning course project
Abstract:Catastrophic forgetting is the primary challenge that hinders continual learning, which refers to a neural network ability to sequentially learn multiple tasks while retaining previously acquired knowledge. Elastic Weight Consolidation, a regularization-based approach inspired by synaptic consolidation in biological neural systems, has been used to overcome this problem. In this study prior research is replicated and extended by evaluating EWC in supervised learning settings using the PermutedMNIST and RotatedMNIST benchmarks. Through systematic comparisons with L2 regularization and stochastic gradient descent (SGD) without regularization, we analyze how different approaches balance knowledge retention and adaptability. Our results confirm what was shown in previous research, showing that EWC significantly reduces forgetting compared to naive training while slightly compromising learning efficiency on new tasks. Moreover, we investigate the impact of dropout regularization and varying hyperparameters, offering insights into the generalization of EWC across diverse learning scenarios. These results underscore EWC’s potential as a viable solution for lifelong learning in neural networks.
[LG-5] he Target Polish: A New Approach to Outlier-Resistant Non-Negative Matrix and Tensor Factorization
链接: https://arxiv.org/abs/2507.10484
作者: Paul Fogel(1),Christophe Geissler(1),George Luta(2) ((1) Data Services, ForvisMazars, Courbevoie, France, (2) Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC, USA)
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, International Conference on Robust Statistics 2025, Stresa, Italy
Abstract:This paper introduces the “Target Polish,” a robust and computationally efficient framework for nonnegative matrix and tensor factorization. Although conventional weighted NMF approaches are resistant to outliers, they converge slowly due to the use of multiplicative updates to minimize the objective criterion. In contrast, the Target Polish approach remains compatible with the Fast-HALS algorithm, which is renowned for its speed, by adaptively smoothing the data with a weighted median-based transformation. This innovation provides outlier resistance while maintaining the highly efficient additive update structure of Fast-HALS. Empirical evaluations using image datasets corrupted with structured (block) and unstructured (salt) noise demonstrate that the Target Polish approach matches or exceeds the accuracy of state-of-the-art robust NMF methods and reduces computational time by an order of magnitude in the studied scenarios.
[LG-6] Some remarks on gradient dominance and LQR policy optimization
链接: https://arxiv.org/abs/2507.10452
作者: Eduardo D. Sontag
类目: Machine Learning (cs.LG)
*备注: This is a short paper summarizing the first part of the slides presented at my keynote at the 2025 L4DC (Learning for Dynamics Control Conference) in Ann Arbor, Michigan, 05 June 2025. A partial bibliography has been added. A second part on neural net feedback controllers is to be added
Abstract:Solutions of optimization problems, including policy optimization in reinforcement learning, typically rely upon some variant of gradient descent. There has been much recent work in the machine learning, control, and optimization communities applying the Polyak-Łojasiewicz Inequality (PLI) to such problems in order to establish an exponential rate of convergence (a.k.a. linear convergence'' in the local-iteration language of numerical analysis) of loss functions to their minima under the gradient flow. Often, as is the case of policy iteration for the continuous-time LQR problem, this rate vanishes for large initial conditions, resulting in a mixed globally linear / locally exponential behavior. This is in sharp contrast with the discrete-time LQR problem, where there is global exponential convergence. That gap between CT and DT behaviors motivates the search for various generalized PLI-like conditions, and this talk will address that topic. Moreover, these generalizations are key to understanding the transient and asymptotic effects of errors in the estimation of the gradient, errors which might arise from adversarial attacks, wrong evaluation by an oracle, early stopping of a simulation, inaccurate and very approximate digital twins, stochastic computations (algorithm
reproducibility’‘), or learning by sampling from limited data. We describe an input to state stability'' (ISS) analysis of this issue. The lecture also discussed convergence and PLI-like properties of
linear feedforward neural networks’’ in feedback control, but this arXiv skips that part (to be updated). Much of the work described here was done in collaboration with Arthur Castello B. de Oliveira, Leilei Cui, Zhong-Ping Jiang, and Milad Siami.
[LG-7] FinTeam: A Multi-Agent Collaborative Intelligence System for Comprehensive Financial Scenarios NLPCC2025
链接: https://arxiv.org/abs/2507.10448
作者: Yingqian Wu,Qiushi Wang,Zefei Long,Rong Ye,Zhongtian Lu,Xianyin Zhang,Bingxuan Li,Wei Chen,Liwen Zhang,Zhongyu Wei
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: NLPCC 2025 Oral
Abstract:Financial report generation tasks range from macro- to micro-economics analysis, also requiring extensive data analysis. Existing LLM models are usually fine-tuned on simple QA tasks and cannot comprehensively analyze real financial scenarios. Given the complexity, financial companies often distribute tasks among departments. Inspired by this, we propose FinTeam, a financial multi-agent collaborative system, with a workflow with four LLM agents: document analyzer, analyst, accountant, and consultant. We train these agents with specific financial expertise using constructed datasets. We evaluate FinTeam on comprehensive financial tasks constructed from real online investment forums, including macroeconomic, industry, and company analysis. The human evaluation shows that by combining agents, the financial reports generate from FinTeam achieved a 62.00% acceptance rate, outperforming baseline models like GPT-4o and Xuanyuan. Additionally, FinTeam’s agents demonstrate a 7.43% average improvement on FinCUGE and a 2.06% accuracy boost on FinEval. Project is available at this https URL.
[LG-8] Non-exchangeable Conformal Prediction with Optimal Transport: Tackling Distribution Shifts with Unlabeled Data
链接: https://arxiv.org/abs/2507.10425
作者: Alvaro H.C. Correia,Christos Louizos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Conformal prediction is a distribution-free uncertainty quantification method that has gained popularity in the machine learning community due to its finite-sample guarantees and ease of use. Its most common variant, dubbed split conformal prediction, is also computationally efficient as it boils down to collecting statistics of the model predictions on some calibration data not yet seen by the model. Nonetheless, these guarantees only hold if the calibration and test data are exchangeable, a condition that is difficult to verify and often violated in practice due to so-called distribution shifts. The literature is rife with methods to mitigate the loss in coverage in this non-exchangeable setting, but these methods require some prior information on the type of distribution shift to be expected at test time. In this work, we study this problem via a new perspective, through the lens of optimal transport, and show that it is possible to estimate the loss in coverage and mitigate it in case of distribution shift.
[LG-9] Stochastic Operator Network: A Stochastic Maximum Principle Based Approach to Operator Learning
链接: https://arxiv.org/abs/2507.10401
作者: Ryan Bausback,Jingqiao Tang,Lu Lu,Feng Bao,Toan Huynh
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:We develop a novel framework for uncertainty quantification in operator learning, the Stochastic Operator Network (SON). SON combines the stochastic optimal control concepts of the Stochastic Neural Network (SNN) with the DeepONet. By formulating the branch net as an SDE and backpropagating through the adjoint BSDE, we replace the gradient of the loss function with the gradient of the Hamiltonian from Stohastic Maximum Principle in the SGD update. This allows SON to learn the uncertainty present in operators through its diffusion parameters. We then demonstrate the effectiveness of SON when replicating several noisy operators in 2D and 3D.
[LG-10] Anticipating the Selectivity of Cyclization Reaction Pathways with Neural Network Potentials
链接: https://arxiv.org/abs/2507.10400
作者: Nicholas Casetti,Dylan Anstine,Olexandr Isayev,Connor W. Coley
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 32 pages, 5 figures
Abstract:Reaction mechanism search tools have demonstrated the ability to provide insights into likely products and rate-limiting steps of reacting systems. However, reactions involving several concerted bond changes - as can be found in many key steps of natural product synthesis - can complicate the search process. To mitigate these complications, we present a mechanism search strategy particularly suited to help expedite exploration of an exemplary family of such complex reactions, cyclizations. We provide a cost-effective strategy for identifying relevant elementary reaction steps by combining graph-based enumeration schemes and machine learning techniques for intermediate filtering. Key to this approach is our use of a neural network potential (NNP), AIMNet2-rxn, for computational evaluation of each candidate reaction pathway. In this article, we evaluate the NNP’s ability to estimate activation energies, demonstrate the correct anticipation of stereoselectivity, and recapitulate complex enabling steps in natural product synthesis.
[LG-11] Extracting Important Tokens in E-Commerce Queries with a Tag Interaction-Aware Transformer Model
链接: https://arxiv.org/abs/2507.10385
作者: Md. Ahsanul Kabir,Mohammad Al Hasan,Aritra Mandal,Liyang Hao,Ishita Khan,Daniel Tunkelang,Zhe Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The major task of any e-commerce search engine is to retrieve the most relevant inventory items, which best match the user intent reflected in a query. This task is non-trivial due to many reasons, including ambiguous queries, misaligned vocabulary between buyers, and sellers, over- or under-constrained queries by the presence of too many or too few tokens. To address these challenges, query reformulation is used, which modifies a user query through token dropping, replacement or expansion, with the objective to bridge semantic gap between query tokens and users’ search intent. Early methods of query reformulation mostly used statistical measures derived from token co-occurrence frequencies from selective user sessions having clicks or purchases. In recent years, supervised deep learning approaches, specifically transformer-based neural language models, or sequence-to-sequence models are being used for query reformulation task. However, these models do not utilize the semantic tags of a query token, which are significant for capturing user intent of an e-commerce query. In this work, we pose query reformulation as a token classification task, and solve this task by designing a dependency-aware transformer-based language model, TagBERT, which makes use of semantic tags of a token for learning superior query phrase embedding. Experiments on large, real-life e-commerce datasets show that TagBERT exhibits superior performance than plethora of competing models, including BERT, eBERT, and Sequence-to-Sequence transformer model for important token classification task.
[LG-12] Leverag ing RAG -LLM s for Urban Mobility Simulation and Analysis
链接: https://arxiv.org/abs/2507.10382
作者: Yue Ding,Conor McCarthy,Kevin O’Shea,Mingming Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:With the rise of smart mobility and shared e-mobility services, numerous advanced technologies have been applied to this field. Cloud-based traffic simulation solutions have flourished, offering increasingly realistic representations of the evolving mobility landscape. LLMs have emerged as pioneering tools, providing robust support for various applications, including intelligent decision-making, user interaction, and real-time traffic analysis. As user demand for e-mobility continues to grow, delivering comprehensive end-to-end solutions has become crucial. In this paper, we present a cloud-based, LLM-powered shared e-mobility platform, integrated with a mobile application for personalized route recommendations. The optimization module is evaluated based on travel time and cost across different traffic scenarios. Additionally, the LLM-powered RAG framework is evaluated at the schema level for different users, using various evaluation methods. Schema-level RAG with XiYanSQL achieves an average execution accuracy of 0.81 on system operator queries and 0.98 on user queries.
[LG-13] Enhanced DeepONet for 1-D consolidation operator learning: an architectural investigation
链接: https://arxiv.org/abs/2507.10368
作者: Yongjin Choi,Chenying Liu,Jorge Macedo
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
Abstract:Deep Operator Networks (DeepONets) have emerged as a powerful surrogate modeling framework for learning solution operators in PDE-governed systems. While their use is expanding across engineering disciplines, applications in geotechnical engineering remain limited. This study systematically evaluates several DeepONet architectures for the one-dimensional consolidation problem. We initially consider three architectures: a standard DeepONet with the coefficient of consolidation embedded in the branch net (Models 1 and 2), and a physics-inspired architecture with the coefficient embedded in the trunk net (Model 3). Results show that Model 3 outperforms the standard configurations (Models 1 and 2) but still has limitations when the target solution (excess pore pressures) exhibits significant variation. To overcome this limitation, we propose a Trunknet Fourier feature-enhanced DeepONet (Model 4) that addresses the identified limitations by capturing rapidly varying functions. All proposed architectures achieve speedups ranging from 1.5 to 100 times over traditional explicit and implicit solvers, with Model 4 being the most efficient. Larger computational savings are expected for more complex systems than the explored 1D case, which is promising. Overall, the study highlights the potential of DeepONets to enable efficient, generalizable surrogate modeling in geotechnical applications, advancing the integration of scientific machine learning in geotechnics, which is at an early stage.
[LG-14] Parallel Sampling of Diffusion Models on SO(3)
链接: https://arxiv.org/abs/2507.10347
作者: Yan-Ting Chen,Hao-Wei Chen,Tsu-Ching Hsiao,Chun-Yi Lee
类目: Machine Learning (cs.LG)
*备注: MVA2025
Abstract:In this paper, we design an algorithm to accelerate the diffusion process on the SO(3) manifold. The inherently sequential nature of diffusion models necessitates substantial time for denoising perturbed data. To overcome this limitation, we proposed to adapt the numerical Picard iteration for the SO(3) space. We demonstrate our algorithm on an existing method that employs diffusion models to address the pose ambiguity problem. Moreover, we show that this acceleration advantage occurs without any measurable degradation in task reward. The experiments reveal that our algorithm achieves a speed-up of up to 4.9 \times , significantly reducing the latency for generating a single sample.
[LG-15] Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions
链接: https://arxiv.org/abs/2507.10345
作者: Yuwen Li,Guozhi Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper examines the L_p and W^1_p norm approximation errors of ReLU neural networks for Korobov functions. In terms of network width and depth, we derive nearly optimal super-approximation error bounds of order 2m in the L_p norm and order 2m-2 in the W^1_p norm, for target functions with L_p mixed derivative of order m in each direction. The analysis leverages sparse grid finite elements and the bit extraction technique. Our results improve upon classical lowest order L_\infty and H^1 norm error bounds and demonstrate that the expressivity of neural networks is largely unaffected by the curse of dimensionality.
[LG-16] MoCap-Impute: A Comprehensive Benchmark and Comparative Analysis of Imputation Methods for IMU-based Motion Capture Data
链接: https://arxiv.org/abs/2507.10334
作者: Mahmoud Bekhit,Ahmad Salah,Ahmed Salim Alrawahi,Tarek Attia,Ahmed Ali,Esraa Eldesokey,Ahmed Fathalla
类目: Machine Learning (cs.LG)
*备注: 22 pages, 7 figures, 3 algorithms, 2 tables
Abstract:Motion capture (MoCap) data from wearable Inertial Measurement Units (IMUs) is vital for applications in sports science, but its utility is often compromised by missing data. Despite numerous imputation techniques, a systematic performance evaluation for IMU-derived MoCap time-series data is lacking. We address this gap by conducting a comprehensive comparative analysis of statistical, machine learning, and deep learning imputation methods. Our evaluation considers three distinct contexts: univariate time-series, multivariate across subjects, and multivariate across kinematic angles. To facilitate this benchmark, we introduce the first publicly available MoCap dataset designed specifically for imputation, featuring data from 53 karate practitioners. We simulate three controlled missingness mechanisms: missing completely at random (MCAR), block missingness, and a novel value-dependent pattern at signal transition points. Our experiments, conducted on 39 kinematic variables across all subjects, reveal that multivariate imputation frameworks consistently outperform univariate approaches, particularly for complex missingness. For instance, multivariate methods achieve up to a 50% mean absolute error reduction (MAE from 10.8 to 5.8) compared to univariate techniques for transition point missingness. Advanced models like Generative Adversarial Imputation Networks (GAIN) and Iterative Imputers demonstrate the highest accuracy in these challenging scenarios. This work provides a critical baseline for future research and offers practical recommendations for improving the integrity and robustness of Mo-Cap data analysis.
[LG-17] Convergence of Agnostic Federated Averag ing
链接: https://arxiv.org/abs/2507.10325
作者: Herlock(SeyedAbolfazl)Rahimi,Dionysis Kalogerias
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
*备注: 5 pages, 2 figurres, CAMSAP conference
Abstract:Federated learning (FL) enables decentralized model training without centralizing raw data. However, practical FL deployments often face a key realistic challenge: Clients participate intermittently in server aggregation and with unknown, possibly biased participation probabilities. Most existing convergence results either assume full-device participation, or rely on knowledge of (in fact uniform) client availability distributions – assumptions that rarely hold in practice. In this work, we characterize the optimization problem that consistently adheres to the stochastic dynamics of the well-known \emphagnostic Federated Averaging (FedAvg) algorithm under random (and variably-sized) client availability, and rigorously establish its convergence for convex, possibly nonsmooth losses, achieving a standard rate of order \mathcalO(1/\sqrtT) , where T denotes the aggregation horizon. Our analysis provides the first convergence guarantees for agnostic FedAvg under general, non-uniform, stochastic client participation, without knowledge of the participation distribution. We also empirically demonstrate that agnostic FedAvg in fact outperforms common (and suboptimal) weighted aggregation FedAvg variants, even with server-side knowledge of participation weights.
[LG-18] Averag e Sensitivity of Hierarchical k-Median Clustering
链接: https://arxiv.org/abs/2507.10296
作者: Shijie Li,Weiqiang He,Ruobing Bai,Pan Peng
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Hierarchical clustering is a widely used method for unsupervised learning with numerous applications. However, in the application of modern algorithms, the datasets studied are usually large and dynamic. If the hierarchical clustering is sensitive to small perturbations of the dataset, the usability of the algorithm will be greatly reduced. In this paper, we focus on the hierarchical k -median clustering problem, which bridges hierarchical and centroid-based clustering while offering theoretical appeal, practical utility, and improved interpretability. We analyze the average sensitivity of algorithms for this problem by measuring the expected change in the output when a random data point is deleted. We propose an efficient algorithm for hierarchical k -median clustering and theoretically prove its low average sensitivity and high clustering quality. Additionally, we show that single linkage clustering and a deterministic variant of the CLNSS algorithm exhibit high average sensitivity, making them less stable. Finally, we validate the robustness and effectiveness of our algorithm through experiments.
[LG-19] Conditional Chemical Language Models are Versatile Tools in Drug Discovery
链接: https://arxiv.org/abs/2507.10273
作者: Lu Zhu,Emmanuel Noutahi
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 12 pages, extra 13 pages of appendix
Abstract:Generative chemical language models (CLMs) have demonstrated strong capabilities in molecular design, yet their impact in drug discovery remains limited by the absence of reliable reward signals and the lack of interpretability in their outputs. We present SAFE-T, a generalist chemical modeling framework that conditions on biological context – such as protein targets or mechanisms of action – to prioritize and design molecules without relying on structural information or engineered scoring functions. SAFE-T models the conditional likelihood of fragment-based molecular sequences given a biological prompt, enabling principled scoring of molecules across tasks such as virtual screening, drug-target interaction prediction, and activity cliff detection. Moreover, it supports goal-directed generation by sampling from this learned distribution, aligning molecular design with biological objectives. In comprehensive zero-shot evaluations across predictive (LIT-PCBA, DAVIS, KIBA, ACNet) and generative (DRUG, PMO) benchmarks, SAFE-T consistently achieves performance comparable to or better than existing approaches while being significantly faster. Fragment-level attribution further reveals that SAFE-T captures known structure-activity relationships, supporting interpretable and biologically grounded design. Together with its computational efficiency, these results demonstrate that conditional generative CLMs can unify scoring and generation to accelerate early-stage drug discovery.
[LG-20] DNS Tunneling: Threat Landscape and Improved Detection Solutions
链接: https://arxiv.org/abs/2507.10267
作者: Novruz Amirov,Baran Isik,Bilal Ihsan Tuncer,Serif Bahtiyar
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Detecting Domain Name System (DNS) tunneling is a significant challenge in security due to its capacity to hide harmful actions within DNS traffic that appears to be normal and legitimate. Traditional detection methods are based on rule-based approaches or signature matching methods that are often insufficient to accurately identify such covert communication channels. This research is about effectively detecting DNS tunneling. We propose a novel approach to detect DNS tunneling with machine learning algorithms. We combine machine learning algorithms to analyze the traffic by using features extracted from DNS traffic. Analyses results show that the proposed approach is a good candidate to detect DNS tunneling accurately.
[LG-21] Kernel-Adaptive PI-ELMs for Forward and Inverse Problems in PDEs with Sharp Gradients
链接: https://arxiv.org/abs/2507.10241
作者: Vikas Dwivedi,Balaji Srinivasan,Monica Sigovan,Bruno Sixou
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces the Kernel Adaptive Physics-Informed Extreme Learning Machine (KAPI-ELM), an adaptive Radial Basis Function (RBF)-based extension of PI-ELM designed to solve both forward and inverse Partial Differential Equation (PDE) problems involving localized sharp gradients. While PI-ELMs outperform the traditional Physics-Informed Neural Networks (PINNs) in speed due to their single-shot, least square optimization, this advantage comes at a cost: their fixed, randomly initialized input layer limits their ability to capture sharp gradients. To overcome this limitation, we introduce a lightweight Bayesian Optimization (BO) framework that, instead of adjusting each input layer parameter individually as in traditional backpropagation, learns a small set of hyperparameters defining the statistical distribution from which the input weights are drawn. This novel distributional optimization strategy – combining BO for input layer distributional parameters with least-squares optimization for output layer network parameters – enables KAPI-ELM to preserve PI-ELM’s speed while matching or exceeding the expressiveness of PINNs. We validate the proposed methodology on several challenging forward and inverse PDE benchmarks, including a 1D singularly perturbed convection-diffusion equation, a 2D Poisson equation with sharp localized sources, and a time-dependent advection equation. Notably, KAPI-ELM achieves state-of-the-art accuracy in both forward and inverse settings. In stiff PDE regimes, it matches or even outperforms advanced methods such as the Extended Theory of Functional Connections (XTFC), while requiring nearly an order of magnitude fewer tunable parameters. These results establish the potential of KAPI-ELM as a scalable, interpretable, and generalizable physics-informed learning framework, especially in stiff PDE regimes.
[LG-22] A Graph Sufficiency Perspective for Neural Networks
链接: https://arxiv.org/abs/2507.10215
作者: Cencheng Shen,Yuexiao Dong
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 23 pages
Abstract:This paper analyzes neural networks through graph variables and statistical sufficiency. We interpret neural network layers as graph-based transformations, where neurons act as pairwise functions between inputs and learned anchor points. Within this formulation, we establish conditions under which layer outputs are sufficient for the layer inputs, that is, each layer preserves the conditional distribution of the target variable given the input variable. Under dense anchor point assumptions, we prove that asymptotic sufficiency holds in the infinite-width limit and is preserved throughout training. To align more closely with practical architectures, we further show that sufficiency can be achieved with finite-width networks by assuming region-separated input distributions and constructing appropriate anchor points. Our framework covers fully connected layers, general pairwise functions, ReLU and sigmoid activations, and convolutional neural networks. This work bridges statistical sufficiency, graph-theoretic representations, and deep learning, providing a new statistical understanding of neural networks.
[LG-23] -GRAB: A Synthetic Diagnostic Benchmark for Learning on Temporal Graphs KDD2025
链接: https://arxiv.org/abs/2507.10183
作者: Alireza Dizaji,Benedict Aaron Tjandra,Mehrab Hamidi,Shenyang Huang,Guillaume Rabusseau
类目: Machine Learning (cs.LG)
*备注: Accepted to MLoG-GenAI Workshop @ KDD 2025 (Oral)
Abstract:Dynamic graph learning methods have recently emerged as powerful tools for modelling relational data evolving through time. However, despite extensive benchmarking efforts, it remains unclear whether current Temporal Graph Neural Networks (TGNNs) effectively capture core temporal patterns such as periodicity, cause-and-effect, and long-range dependencies. In this work, we introduce the Temporal Graph Reasoning Benchmark (T-GRAB), a comprehensive set of synthetic tasks designed to systematically probe the capabilities of TGNNs to reason across time. T-GRAB provides controlled, interpretable tasks that isolate key temporal skills: counting/memorizing periodic repetitions, inferring delayed causal effects, and capturing long-range dependencies over both spatial and temporal dimensions. We evaluate 11 temporal graph learning methods on these tasks, revealing fundamental shortcomings in their ability to generalize temporal patterns. Our findings offer actionable insights into the limitations of current models, highlight challenges hidden by traditional real-world benchmarks, and motivate the development of architectures with stronger temporal reasoning abilities. The code for T-GRAB can be found at: this https URL.
[LG-24] Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
链接: https://arxiv.org/abs/2507.10178
作者: Wonung Kim,Yubin Lee,Yoonsung Kim,Jinwoo Hwang,Seongryong Oh,Jiyong Jung,Aziz Huseynov,Woong Gyu Park,Chang Hyun Park,Divya Mahajan,Jongse Park
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:Transformers are the driving force behind today’s Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inferencing. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and state updates in post-transformers. Further analyses suggest two additional insights: (1) state update operations, unlike attention, incur high hardware cost, making per-bank PIM acceleration inefficient, and (2) different low-precision arithmetic methods offer varying accuracy-area tradeoffs, while we identify Microsoft’s MX as the Pareto-optimal choice. Building on these insights, we design Pimba as an array of State-update Processing Units (SPUs), each shared between two banks to enable interleaved access to PIM. Each SPU includes a State-update Processing Engine (SPE) that comprises element-wise multipliers and adders using MX-based quantized arithmetic, enabling efficient execution of state update and attention operations. Our evaluation shows that, compared to LLM-optimized GPU and GPU+PIM systems, Pimba achieves up to 3.2x and 2.1x higher token generation throughput, respectively.
[LG-25] Understanding the Rank of Tensor Networks via an Intuitive Example-Driven Approach
链接: https://arxiv.org/abs/2507.10170
作者: Wuyang Zhou,Giorgos Iacovides,Kriton Konstantinidis,Ilya Kisil,Danilo Mandic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tensor Network (TN) decompositions have emerged as an indispensable tool in Big Data analytics owing to their ability to provide compact low-rank representations, thus alleviating the ``Curse of Dimensionality’’ inherent in handling higher-order data. At the heart of their success lies the concept of TN ranks, which governs the efficiency and expressivity of TN decompositions. However, unlike matrix ranks, TN ranks often lack a universal meaning and an intuitive interpretation, with their properties varying significantly across different TN structures. Consequently, TN ranks are frequently treated as empirically tuned hyperparameters, rather than as key design parameters inferred from domain knowledge. The aim of this Lecture Note is therefore to demystify the foundational yet frequently misunderstood concept of TN ranks through real-life examples and intuitive visualizations. We begin by illustrating how domain knowledge can guide the selection of TN ranks in widely-used models such as the Canonical Polyadic (CP) and Tucker decompositions. For more complex TN structures, we employ a self-explanatory graphical approach that generalizes to tensors of arbitrary order. Such a perspective naturally reveals the relationship between TN ranks and the corresponding ranks of tensor unfoldings (matrices), thereby circumventing cumbersome multi-index tensor algebra while facilitating domain-informed TN design. It is our hope that this Lecture Note will equip readers with a clear and unified understanding of the concept of TN rank, along with the necessary physical insight and intuition to support the selection, explainability, and deployment of tensor methods in both practical applications and educational contexts.
[LG-26] Domain Borders Are There to Be Crossed With Federated Few-Shot Adaptation
链接: https://arxiv.org/abs/2507.10160
作者: Manuel Röder,Christoph Raab,Frank-Michael Schleif
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Extension of this http URL
Abstract:Federated Learning has emerged as a leading paradigm for decentralized, privacy-preserving learning, particularly relevant in the era of interconnected edge devices equipped with sensors. However, the practical implementation of Federated Learning faces three primary challenges: the need for human involvement in costly data labelling processes for target adaptation, covariate shift in client device data collection due to environmental factors affecting sensors, leading to discrepancies between source and target samples, and the impracticality of continuous or regular model updates in resource-constrained environments due to limited data transmission capabilities and technical constraints on channel availability and energy efficiency. To tackle these issues, we expand upon an efficient and scalable Federated Learning framework tailored for real-world client adaptation in industrial settings. This framework leverages a pre-trained source model comprising a deep backbone, an adaptation module, and a classifier running on a powerful server. By freezing the backbone and classifier during client adaptation on resource-constrained devices, we allow the domain adaptive linear layer to handle target domain adaptation, thus minimizing overall computational overhead. Furthermore, this setup, designated as FedAcross+, is extended to encompass the processing of streaming data, thereby rendering the solution suitable for non-stationary environments. Extensive experimental results demonstrate the effectiveness of FedAcross+ in achieving competitive adaptation on low-end client devices with limited target samples, successfully addressing the challenge of domain shift. Moreover, our framework accommodates sporadic model updates within resource-constrained environments, ensuring practical and seamless deployment.
[LG-27] MTF-Grasp: A Multi-tier Federated Learning Approach for Robotic Grasping
链接: https://arxiv.org/abs/2507.10158
作者: Obaidullah Zaland,Erik Elmroth,Monowar Bhuyan
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: The work is accepted for presentation at IEEE SMC 2025
Abstract:Federated Learning (FL) is a promising machine learning paradigm that enables participating devices to train privacy-preserved and collaborative models. FL has proven its benefits for robotic manipulation tasks. However, grasping tasks lack exploration in such settings where robots train a global model without moving data and ensuring data privacy. The main challenge is that each robot learns from data that is nonindependent and identically distributed (non-IID) and of low quantity. This exhibits performance degradation, particularly in robotic grasping. Thus, in this work, we propose MTF-Grasp, a multi-tier FL approach for robotic grasping, acknowledging the unique challenges posed by the non-IID data distribution across robots, including quantitative skewness. MTF-Grasp harnesses data quality and quantity across robots to select a set of “top-level” robots with better data distribution and higher sample count. It then utilizes top-level robots to train initial seed models and distribute them to the remaining “low-level” robots, reducing the risk of model performance degradation in low-level robots. Our approach outperforms the conventional FL setup by up to 8% on the quantity-skewed Cornell and Jacquard grasping datasets.
[LG-28] Large-Scale Graph Building in Dynamic Environments: Low Latency and High Quality
链接: https://arxiv.org/abs/2507.10139
作者: Filipe Miguel Gonçalves de Almeida,CJ Carey,Hendrik Fichtenberger,Jonathan Halcrow,Silvio Lattanzi,André Linhares,Tao Meng,Ashkan Norouzi-Fard,Nikos Parotsidis,Bryan Perozzi,David Simcha
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Learning and constructing large-scale graphs has attracted attention in recent decades, resulting in a rich literature that introduced various systems, tools, and algorithms. Grale is one of such tools that is designed for offline environments and is deployed in more than 50 different industrial settings at Google. Grale is widely applicable because of its ability to efficiently learn and construct a graph on datasets with multiple types of features. However, it is often the case that applications require the underlying data to evolve continuously and rapidly and the updated graph needs to be available with low latency. Such setting make the use of Grale prohibitive. While there are Approximate Nearest Neighbor (ANN) systems that handle dynamic updates with low latency, they are mostly limited to similarities over a single embedding. In this work, we introduce a system that inherits the advantages and the quality of Grale, and maintains a graph construction in a dynamic setting with tens of milliseconds of latency per request. We call the system Dynamic Grale Using ScaNN (Dynamic GUS). Our system has a wide range of applications with over 10 deployments at Google. One of the applications is in Android Security and Privacy, where Dynamic Grale Using ScaNN enables capturing harmful applications 4 times faster, before they can reach users. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2507.10139 [cs.DC] (or arXiv:2507.10139v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2507.10139 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-29] owards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering KDD2025
链接: https://arxiv.org/abs/2507.10088
作者: Tung Sum Thomas Kwok,Zeyong Zhang,Chi-Hua Wang,Guang Cheng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by Agentic GenAI Evaluation KDD2025
Abstract:Tabular data synthesis for supervised learning (‘SL’) model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform compared to those trained with original data. This low SL utility of synthetic data stems from class imbalance exaggeration and SL data relationship overlooked by tabular generator. To address these challenges, we draw inspirations from techniques in emerging data-centric artificial intelligence and elucidate Pruning and ReOrdering (‘PRRO’), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with high signal-to-noise ratio, ensuring that the class distribution of synthetic data closely matches that of the original data. Besides, PRRO employs a column reordering algorithm to align the data modeling structure of generators with that of SL models. These two modules enable PRRO to optimize SL utility of synthetic data. Empirical experiments on 22 public datasets show that synthetic data generated using PRRO enhances predictive performance compared to data generated without PRRO. Specifically, synthetic replacement of original data yields an average improvement of 26.74% and up to 871.46% improvement using PRRO, while synthetic appendant to original data results with PRRO-generated data results in an average improvement of 6.13% and up to 200.32%. Furthermore, experiments on six highly imbalanced datasets show that PRRO enables the generator to produce synthetic data with a class distribution that resembles the original data more closely, achieving a similarity improvement of 43%. Through PRRO, we foster a seamless integration of data synthesis to subsequent SL prediction, promoting quality and accessible data analysis.
[LG-30] Compression Method for Deep Diagonal State Space Model Based on H2 Optimal Reduction
链接: https://arxiv.org/abs/2507.10078
作者: Hiroki Sakamoto,Kazuhiro Sato
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted to IEEE Control Systems Letters
Abstract:Deep learning models incorporating linear SSMs have gained attention for capturing long-range dependencies in sequential data. However, their large parameter sizes pose challenges for deployment on resource-constrained devices. In this study, we propose an efficient parameter reduction method for these models by applying H^2 model order reduction techniques from control theory to their linear SSM components. In experiments, the LRA benchmark results show that the model compression based on our proposed method outperforms an existing method using the Balanced Truncation, while successfully reducing the number of parameters in the SSMs to 1/32 without sacrificing the performance of the original models.
[LG-31] ElasticMM: Efficient Multimodal LLM s Serving with Elastic Multimodal Parallelism
链接: https://arxiv.org/abs/2507.10069
作者: Zedong Liu,Shenggan Cheng,Guangming Tan,Yang You,Dingwen Tao
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components – combined with complex inference pipelines and heterogeneous workloads – introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we propose Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).
[LG-32] On the Efficiency of Training Robust Decision Trees
链接: https://arxiv.org/abs/2507.10048
作者: Benedict Gerlach,Marie Anastacio,Holger H. Hoos
类目: Machine Learning (cs.LG)
*备注: Presented as a poster at SAIV 2025
Abstract:As machine learning gets adopted into the industry quickly, trustworthiness is increasingly in focus. Yet, efficiency and sustainability of robust training pipelines still have to be established. In this work, we consider a simple pipeline for training adversarially robust decision trees and investigate the efficiency of each step. Our pipeline consists of three stages. Firstly, we choose the perturbation size automatically for each dataset. For that, we introduce a simple algorithm, instead of relying on intuition or prior work. Moreover, we show that the perturbation size can be estimated from smaller models than the one intended for full training, and thus significant gains in efficiency can be achieved. Secondly, we train state-of-the-art adversarial training methods and evaluate them regarding both their training time and adversarial accuracy. Thirdly, we certify the robustness of each of the models thus obtained and investigate the time required for this. We find that verification time, which is critical to the efficiency of the full pipeline, is not correlated with training time.
[LG-33] owards Applying Large Language Models to Complement Single-Cell Foundation Models
链接: https://arxiv.org/abs/2507.10039
作者: Steven Palayew,Bo Wang,Gary Bader
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:Single-cell foundation models such as scGPT represent a significant advancement in single-cell omics, with an ability to achieve state-of-the-art performance on various downstream biological tasks. However, these models are inherently limited in that a vast amount of information in biology exists as text, which they are unable to leverage. There have therefore been several recent works that propose the use of LLMs as an alternative to single-cell foundation models, achieving competitive results. However, there is little understanding of what factors drive this performance, along with a strong focus on using LLMs as an alternative, rather than complementary approach to single-cell foundation models. In this study, we therefore investigate what biological insights contribute toward the performance of LLMs when applied to single-cell data, and introduce scMPT; a model which leverages synergies between scGPT, and single-cell representations from LLMs that capture these insights. scMPT demonstrates stronger, more consistent performance than either of its component models, which frequently have large performance gaps between each other across datasets. We also experiment with alternate fusion methods, demonstrating the potential of combining specialized reasoning models with scGPT to improve performance. This study ultimately showcases the potential for LLMs to complement single-cell foundation models and drive improvements in single-cell analysis.
[LG-34] Forecasting Coccidioidomycosis (Valley Fever) in Arizona: A Graph Neural Network Approach
链接: https://arxiv.org/abs/2507.10014
作者: Ali Sarabi,Arash Sarabi,Hao Yan,Beckett Sterner,Petar Jevtić
类目: Machine Learning (cs.LG)
*备注:
Abstract:Coccidioidomycosis, commonly known as Valley Fever, remains a significant public health concern in endemic regions of the southwestern United States. This study develops the first graph neural network (GNN) model for forecasting Valley Fever incidence in Arizona. The model integrates surveillance case data with environmental predictors using graph structures, including soil conditions, atmospheric variables, agricultural indicators, and air quality metrics. Our approach explores correlation-based relationships among variables influencing disease transmission. The model captures critical delays in disease progression through lagged effects, enhancing its capacity to reflect complex temporal dependencies in disease ecology. Results demonstrate that the GNN architecture effectively models Valley Fever trends and provides insights into key environmental drivers of disease incidence. These findings can inform early warning systems and guide resource allocation for disease prevention efforts in high-risk areas.
[LG-35] Effects of structural properties of neural networks on machine learning performance
链接: https://arxiv.org/abs/2507.10005
作者: Yash Arya,Sang Hoon Lee
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
*备注: 9 pages, 6 figures
Abstract:In recent years, graph-based machine learning techniques, such as reinforcement learning and graph neural networks, have garnered significant attention. While some recent studies have started to explore the relationship between the graph structure of neural networks and their predictive performance, they often limit themselves to a narrow range of model networks, particularly lacking mesoscale structures such as communities. Our work advances this area by conducting a more comprehensive investigation, incorporating realistic network structures characterized by heterogeneous degree distributions and community structures, which are typical characteristics of many real networks. These community structures offer a nuanced perspective on network architecture. Our analysis employs model networks such as random and scale-free networks, alongside a comparison with a biological neural network and its subsets for more detailed analysis. We examine the impact of these structural attributes on the performance of image classification tasks. Our findings reveal that structural properties do affect performance to some extent. Specifically, networks featuring coherent, densely interconnected communities demonstrate enhanced learning capabilities. The comparison with the biological neural network emphasizes the relevance of our findings to real-world structures, suggesting an intriguing connection worth further exploration. This study contributes meaningfully to network science and machine learning, providing insights that could inspire the design of more biologically informed neural networks.
[LG-36] Compliance Minimization via Physics-Informed Gaussian Processes
链接: https://arxiv.org/abs/2507.09968
作者: Xiangyu Sun,Amin Yousefpour,Shirin Hosseinmardi,Ramin Bostanabad
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) techniques have recently gained significant attention for solving compliance minimization (CM) problems. However, these methods typically provide poor feature boundaries, are very expensive, and lack a systematic mechanism to control the design complexity. Herein, we address these limitations by proposing a mesh-free and simultaneous framework based on physics-informed Gaussian processes (GPs). In our approach, we parameterize the design and state variables with GP priors which have independent kernels but share a multi-output neural network (NN) as their mean function. The architecture of this NN is based on Parametric Grid Convolutional Attention Networks (PGCANs) which not only mitigate spectral bias issues, but also provide an interpretable mechanism to control design complexity. We estimate all the parameters of our GP-based representations by simultaneously minimizing the compliance, total potential energy, and residual of volume fraction constraint. Importantly, our loss function exclude all data-based residuals as GPs automatically satisfy them. We also develop computational schemes based on curriculum training and numerical integration to increase the efficiency and robustness of our approach which is shown to (1) produce super-resolution topologies with fast convergence, (2) achieve smaller compliance and less gray area fraction compared to traditional numerical methods, (3) provide control over fine-scale features, and (4) outperform competing ML-based methods.
[LG-37] xt-Driven Causal Representation Learning for Source-Free Domain Generalization
链接: https://arxiv.org/abs/2507.09961
作者: Lihua Zhou,Mao Ye,Nianxin Li,Shuaifeng Li,Jinlin Wu,Xiatian Zhu,Lei Deng,Hongbin Liu,Jiebo Luo,Zhen Lei
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Deep learning often struggles when training and test data distributions differ. Traditional domain generalization (DG) tackles this by including data from multiple source domains, which is impractical due to expensive data collection and annotation. Recent vision-language models like CLIP enable source-free domain generalization (SFDG) by using text prompts to simulate visual representations, reducing data demands. However, existing SFDG methods struggle with domain-specific confounders, limiting their generalization capabilities. To address this issue, we propose TDCRL (\textbfText-\textbfDriven \textbfCausal \textbfRepresentation \textbfLearning), the first method to integrate causal inference into the SFDG setting. TDCRL operates in two steps: first, it employs data augmentation to generate style word vectors, combining them with class information to generate text embeddings to simulate visual representations; second, it trains a causal intervention network with a confounder dictionary to extract domain-invariant features. Grounded in causal learning, our approach offers a clear and effective mechanism to achieve robust, domain-invariant features, ensuring robust generalization. Extensive experiments on PACS, VLCS, OfficeHome, and DomainNet show state-of-the-art performance, proving TDCRL effectiveness in SFDG.
[LG-38] Rethinking Inductive Bias in Geographically Neural Network Weighted Regression
链接: https://arxiv.org/abs/2507.09958
作者: Zhenyuan Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Inductive bias is a key factor in spatial regression models, determining how well a model can learn from limited data and capture spatial patterns. This work revisits the inductive biases in Geographically Neural Network Weighted Regression (GNNWR) and identifies limitations in current approaches for modeling spatial non-stationarity. While GNNWR extends traditional Geographically Weighted Regression by using neural networks to learn spatial weighting functions, existing implementations are often restricted by fixed distance-based schemes and limited inductive bias. We propose to generalize GNNWR by incorporating concepts from convolutional neural networks, recurrent neural networks, and transformers, introducing local receptive fields, sequential context, and self-attention into spatial regression. Through extensive benchmarking on synthetic spatial datasets with varying heterogeneity, noise, and sample sizes, we show that GNNWR outperforms classic methods in capturing nonlinear and complex spatial relationships. Our results also reveal that model performance depends strongly on data characteristics, with local models excelling in highly heterogeneous or small-sample scenarios, and global models performing better with larger, more homogeneous data. These findings highlight the importance of inductive bias in spatial modeling and suggest future directions, including learnable spatial weighting functions, hybrid neural architectures, and improved interpretability for models handling non-stationary spatial data.
[LG-39] Radial Neighborhood Smoothing Recommender System NEURIPS2025
链接: https://arxiv.org/abs/2507.09952
作者: Zerui Zhang,Yumou Qiu
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 34 pages, 2 figures. Submitted to NeurIPS 2025
Abstract:Recommender systems inherently exhibit a low-rank structure in latent space. A key challenge is to define meaningful and measurable distances in the latent space to capture user-user, item-item, user-item relationships effectively. In this work, we establish that distances in the latent space can be systematically approximated using row-wise and column-wise distances in the observed matrix, providing a novel perspective on distance estimation. To refine the distance estimation, we introduce the correction based on empirical variance estimator to account for noise-induced non-centrality. The novel distance estimation enables a more structured approach to constructing neighborhoods, leading to the Radial Neighborhood Estimator (RNE), which constructs neighborhoods by including both overlapped and partially overlapped user-item pairs and employs neighborhood smoothing via localized kernel regression to improve imputation accuracy. We provide the theoretical asymptotic analysis for the proposed estimator. We perform evaluations on both simulated and real-world datasets, demonstrating that RNE achieves superior performance compared to existing collaborative filtering and matrix factorization methods. While our primary focus is on distance estimation in latent space, we find that RNE also mitigates the ``cold-start’’ problem.
[LG-40] Hierarchical Job Classification with Similarity Graph Integration
链接: https://arxiv.org/abs/2507.09949
作者: Md Ahsanul Kabir,Kareem Abdelfatah,Mohammed Korayem,Mohammad Al Hasan
类目: Machine Learning (cs.LG)
*备注:
Abstract:In the dynamic realm of online recruitment, accurate job classification is paramount for optimizing job recommendation systems, search rankings, and labor market analyses. As job markets evolve, the increasing complexity of job titles and descriptions necessitates sophisticated models that can effectively leverage intricate relationships within job data. Traditional text classification methods often fall short, particularly due to their inability to fully utilize the hierarchical nature of industry categories. To address these limitations, we propose a novel representation learning and classification model that embeds jobs and hierarchical industry categories into a latent embedding space. Our model integrates the Standard Occupational Classification (SOC) system and an in-house hierarchical taxonomy, Carotene, to capture both graph and hierarchical relationships, thereby improving classification accuracy. By embedding hierarchical industry categories into a shared latent space, we tackle cold start issues and enhance the dynamic matching of candidates to job opportunities. Extensive experimentation on a large-scale dataset of job postings demonstrates the model’s superior ability to leverage hierarchical structures and rich semantic features, significantly outperforming existing methods. This research provides a robust framework for improving job classification accuracy, supporting more informed decision-making in the recruitment industry.
[LG-41] Iceberg: Enhancing HLS Modeling with Synthetic Data
链接: https://arxiv.org/abs/2507.09948
作者: Zijian Ding,Tung Nguyen,Weikai Li,Aditya Grover,Yizhou Sun,Jason Cong
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 9 pages. accepted to ICLAD’25
Abstract:Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by 86.4% when adapt to six real-world applications with few-shot examples and achieves a 2.47\times and a 1.12\times better offline DSE performance when adapting to two different test datasets. Our open-sourced code is here: \hrefthis https URLthis https URL
[LG-42] Long-Tailed Data Classification by Increasing and Decreasing Neurons During Training
链接: https://arxiv.org/abs/2507.09940
作者: Taigo Sakai,Kazuhiro Hotta
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:In conventional deep learning, the number of neurons typically remains fixed during training. However, insights from biology suggest that the human hippocampus undergoes continuous neuron generation and pruning of neurons over the course of learning, implying that a flexible allocation of capacity can contribute to enhance performance. Real-world datasets often exhibit class imbalance situations where certain classes have far fewer samples than others, leading to significantly reduce recognition accuracy for minority classes when relying on fixed size this http URL address the challenge, we propose a method that periodically adds and removes neurons during training, thereby boosting representational power for minority classes. By retaining critical features learned from majority classes while selectively increasing neurons for underrepresented classes, our approach dynamically adjusts capacity during training. Importantly, while the number of neurons changes throughout training, the final network size and structure remain unchanged, ensuring efficiency and compatibility with this http URL, by experiments on three different datasets and five representative models, we demonstrate that the proposed method outperforms fixed size networks and shows even greater accuracy when combined with other imbalance-handling techniques. Our results underscore the effectiveness of dynamic, biologically inspired network designs in improving performance on class-imbalanced data.
[LG-43] Extracting Cause-Effect Pairs from a Sentence with a Dependency-Aware Transformer Model
链接: https://arxiv.org/abs/2507.09925
作者: Md Ahsanul Kabir,Abrar Jahin,Mohammad Al Hasan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Extracting cause and effect phrases from a sentence is an important NLP task, with numerous applications in various domains, including legal, medical, education, and scientific research. There are many unsupervised and supervised methods proposed for solving this task. Among these, unsupervised methods utilize various linguistic tools, including syntactic patterns, dependency tree, dependency relations, etc. among different sentential units for extracting the cause and effect phrases. On the other hand, the contemporary supervised methods use various deep learning based mask language models equipped with a token classification layer for extracting cause and effect phrases. Linguistic tools, specifically, dependency tree, which organizes a sentence into different semantic units have been shown to be very effective for extracting semantic pairs from a sentence, but existing supervised methods do not have any provision for utilizing such tools within their model framework. In this work, we propose DepBERT, which extends a transformer-based model by incorporating dependency tree of a sentence within the model framework. Extensive experiments over three datasets show that DepBERT is better than various state-of-the art supervised causality extraction methods.
[LG-44] Algorithm Development in Neural Networks: Insights from the Streaming Parity Task
链接: https://arxiv.org/abs/2507.09897
作者: Loek van Rossem,Andrew M. Saxe
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 28 pages, 20 figures
Abstract:Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks (RNNs) trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
[LG-45] AdaBrain-Bench: Benchmarking Brain Foundation Models for Brain-Computer Interface Applications
链接: https://arxiv.org/abs/2507.09882
作者: Jiamin Wu,Zichen Ren,Junyu Wang,Pengyu Zhu,Yonghao Song,Mianxin Liu,Qihao Zheng,Lei Bai,Wanli Ouyang,Chunfeng Song
类目: Machine Learning (cs.LG)
*备注:
Abstract:Non-invasive Brain-Computer Interfaces (BCI) offer a safe and accessible means of connecting the human brain to external devices, with broad applications in home and clinical settings to enhance human capabilities. However, the high noise level and limited task-specific data in non-invasive signals constrain decoding capabilities. Recently, the adoption of self-supervised pre-training is transforming the landscape of non-invasive BCI research, enabling the development of brain foundation models to capture generic neural representations from large-scale unlabeled electroencephalography (EEG) signals with substantial noises. However, despite these advances, the field currently lacks comprehensive, practical and extensible benchmarks to assess the utility of the public foundation models across diverse BCI tasks, hindering their widespread adoption. To address this challenge, we present AdaBrain-Bench, a large-scale standardized benchmark to systematically evaluate brain foundation models in widespread non-invasive BCI tasks. AdaBrain-Bench encompasses a diverse collection of representative BCI decoding datasets spanning 7 key applications. It introduces a streamlined task adaptation pipeline integrated with multi-dimensional evaluation metrics and a set of adaptation tools. The benchmark delivers an inclusive framework for assessing generalizability of brain foundation models across key transfer settings, including cross-subject, multi-subject, and few-shot scenarios. We leverage AdaBrain-Bench to evaluate a suite of publicly available brain foundation models and offer insights into practices for selecting appropriate models in various scenarios. We make our benchmark pipeline available to enable reproducible research and external use, offering a continuously evolving platform to foster progress toward robust and generalized neural decoding solutions.
[LG-46] Rethinking Prompt Optimization: Reinforcement Diversification and Migration in Blackbox LLM s
链接: https://arxiv.org/abs/2507.09839
作者: MohammadReza Davari,Utkarsh Garg,Weixin Cai,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注:
Abstract:An increasing number of NLP applications interact with large language models (LLMs) through black-box APIs, making prompt engineering critical for controlling model outputs. While recent Automatic Prompt Optimization (APO) methods iteratively refine prompts using model-generated feedback, textual gradients, they primarily focus on error correction and neglect valuable insights from correct predictions. This limits both their effectiveness and efficiency. In this paper, we propose a novel APO framework centered on enhancing the feedback mechanism. We reinterpret the textual gradient as a form of negative reinforcement and introduce the complementary positive reinforcement to explicitly preserve beneficial prompt components identified through successful predictions. To mitigate the noise inherent in LLM-generated feedback, we introduce a technique called feedback diversification, which aggregates multiple feedback signals, emphasizing consistent, actionable advice while filtering out outliers. Motivated by the rapid evolution and diversity of available LLMs, we also formalize Continual Prompt Optimization (CPO), addressing the practical challenge of efficiently migrating optimized prompts between different model versions or API providers. Our experiments reveal that naive prompt migration often degrades performance due to loss of critical instructions. In contrast, our approach consistently outperforms strong baselines, achieving significant accuracy improvements, faster convergence, and lower computational costs in both standard and migration scenarios.
[LG-47] A Scalable and Efficient Signal Integration System for Job Matching KDD2025
链接: https://arxiv.org/abs/2507.09797
作者: Ping Liu,Rajat Arora,Xiao Shi,Benjamin Le,Qianqi Shen,Jianqiang Shen,Chengming Jiang,Nikita Zhiltsov,Priya Bannur,Yidan Zhu,Liming Dong,Haichao Wei,Qi Guo,Luke Simon,Liangjie Hong,Wenjing Zhang
类目: Machine Learning (cs.LG)
*备注: KDD2025
Abstract:LinkedIn, one of the world’s largest platforms for professional networking and job seeking, encounters various modeling challenges in building recommendation systems for its job matching product, including cold-start, filter bubbles, and biases affecting candidate-job matching. To address these, we developed the STAR (Signal Integration for Talent And Recruiters) system, leveraging the combined strengths of Large Language Models (LLMs) and Graph Neural Networks (GNNs). LLMs excel at understanding textual data, such as member profiles and job postings, while GNNs capture intricate relationships and mitigate cold-start issues through network effects. STAR integrates diverse signals by uniting LLM and GNN capabilities with industrial-scale paradigms including adaptive sampling and version management. It provides an end-to-end solution for developing and deploying embeddings in large-scale recommender systems. Our key contributions include a robust methodology for building embeddings in industrial applications, a scalable GNN-LLM integration for high-performing recommendations, and practical insights for real-world model deployment.
[LG-48] Leverag ing Distribution Matching to Make Approximate Machine Unlearning Faster
链接: https://arxiv.org/abs/2507.09786
作者: Junaid Iqbal Khan
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, 4 tables
Abstract:Approximate machine unlearning (AMU) enables models to `forget’ specific training data through specialized fine-tuning on a retained dataset subset. However, processing this retained subset still dominates computational runtime, while reductions of epochs also remain a challenge. We propose two complementary methods to accelerate classification-oriented AMU. First, \textbfBlend, a novel distribution-matching dataset condensation (DC), merges visually similar images with shared blend-weights to significantly reduce the retained set size. It operates with minimal pre-processing overhead and is orders of magnitude faster than state-of-the-art DC methods. Second, our loss-centric method, \textbfAccelerated-AMU (A-AMU), augments the unlearning objective to quicken convergence. A-AMU achieves this by combining a steepened primary loss to expedite forgetting with a novel, differentiable regularizer that matches the loss distributions of forgotten and in-distribution unseen data. Our extensive experiments demonstrate that this dual approach of data and loss-centric optimization dramatically reduces end-to-end unlearning latency across both single and multi-round scenarios, all while preserving model utility and privacy. To our knowledge, this is the first work to systematically tackle unlearning efficiency by jointly designing a specialized dataset condensation technique with a dedicated accelerated loss function. Code is available at this https URL.
[LG-49] Efficient Molecular Conformer Generation with SO(3)-Averag ed Flow Matching and Reflow ICML2025
链接: https://arxiv.org/abs/2507.09785
作者: Zhonglin Cao,Mario Geiger,Allan dos Santos Costa,Danny Reidenbach,Karsten Kreis,Tomas Geffner,Franco Pellegrini,Guoqing Zhou,Emine Kucukbenli
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: ICML 2025 poster
Abstract:Fast and accurate generation of molecular conformers is desired for downstream computational chemistry and drug discovery tasks. Currently, training and sampling state-of-the-art diffusion or flow-based models for conformer generation require significant computational resources. In this work, we build upon flow-matching and propose two mechanisms for accelerating training and inference of generative models for 3D molecular conformer generation. For fast training, we introduce the SO(3)-Averaged Flow training objective, which leads to faster convergence to better generation quality compared to conditional optimal transport flow or Kabsch-aligned flow. We demonstrate that models trained using SO(3)-Averaged Flow can reach state-of-the-art conformer generation quality. For fast inference, we show that the reflow and distillation methods of flow-based models enable few-steps or even one-step molecular conformer generation with high quality. The training techniques proposed in this work show a path towards highly efficient molecular conformer generation with flow-based models.
[LG-50] Physics-informed neural networks for high-dimensional solutions and snaking bifurcations in nonlinear lattices
链接: https://arxiv.org/abs/2507.09782
作者: Muhammad Luthfi Shahab,Fidya Almira Suheri,Rudy Kusdiantara,Hadi Susanto
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*备注: Accepted for publication in Physica D: Nonlinear Phenomena
Abstract:This paper introduces a framework based on physics-informed neural networks (PINNs) for addressing key challenges in nonlinear lattices, including solution approximation, bifurcation diagram construction, and linear stability analysis. We first employ PINNs to approximate solutions of nonlinear systems arising from lattice models, using the Levenberg-Marquardt algorithm to optimize network weights for greater accuracy. To enhance computational efficiency in high-dimensional settings, we integrate a stochastic sampling strategy. We then extend the method by coupling PINNs with a continuation approach to compute snaking bifurcation diagrams, incorporating an auxiliary equation to effectively track successive solution branches. For linear stability analysis, we adapt PINNs to compute eigenvectors, introducing output constraints to enforce positivity, in line with Sturm-Liouville theory. Numerical experiments are conducted on the discrete Allen-Cahn equation with cubic and quintic nonlinearities in one to five spatial dimensions. The results demonstrate that the proposed approach achieves accuracy comparable to, or better than, traditional numerical methods, especially in high-dimensional regimes where computational resources are a limiting factor. These findings highlight the potential of neural networks as scalable and efficient tools for the study of complex nonlinear lattice systems.
[LG-51] Knowing When to Quit: Probabilistic Early Exits for Speech Separation
链接: https://arxiv.org/abs/2507.09768
作者: Kenny Falkær Olsen. Mads Østergaard,Karl Ulbæk,Søren Føns Nielsen,Rasmus Malik Høegh Lindrup,Bjørn Sand Jensen,Morten Mørup
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:
Abstract:In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget, and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use-cases we design a neural network architecture for speech separation capable of early-exit, and we propose an uncertainty-aware probabilistic framework to jointly model the clean speech signal and error variance which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks, and we show that a single early-exit model can be competitive with state-of-the-art models trained at many compute and parameter budgets. Our framework enables fine-grained dynamic compute-scaling of speech separation networks while achieving state-of-the-art performance and interpretable exit conditions.
[LG-52] Energy Dissipation Rate Guided Adaptive Sampling for Physics-Informed Neural Networks: Resolving Surface-Bulk Dynamics in Allen-Cahn Systems
链接: https://arxiv.org/abs/2507.09757
作者: Chunyan Li,Wenkai Yu,Qi Wang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 32 pages, 22 figures
Abstract:We introduce the Energy Dissipation Rate guided Adaptive Sampling (EDRAS) strategy, a novel method that substantially enhances the performance of Physics-Informed Neural Networks (PINNs) in solving thermodynamically consistent partial differential equations (PDEs) over arbitrary domains. EDRAS leverages the local energy dissipation rate density as a guiding metric to identify and adaptively re-sample critical collocation points from both the interior and boundary of the computational domain. This dynamical sampling approach improves the accuracy of residual-based PINNs by aligning the training process with the underlying physical structure of the system. In this study, we demonstrate the effectiveness of EDRAS using the Allen-Cahn phase field model in irregular geometries, achieving up to a sixfold reduction in the relative mean square error compared to traditional residual-based adaptive refinement (RAR) methods. Moreover, we compare EDRAS with other residual-based adaptive sampling approaches and show that EDRAS is not only computationally more efficient but also more likely to identify high-impact collocation points. Through numerical solutions of the Allen-Cahn equation with both static (Neumann) and dynamic boundary conditions in 2D disk- and ellipse-shaped domains solved using PINN coupled with EDRAS, we gain significant insights into how dynamic boundary conditions influence bulk phase evolution and thermodynamic behavior. The proposed approach offers an effective, physically informed enhancement to PINN frameworks for solving thermodynamically consistent models, making PINN a robust and versatile computational tool for investigating complex thermodynamic processes in arbitrary geometries.
[LG-53] Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts
链接: https://arxiv.org/abs/2507.09754
作者: Aakash Tripathi,Ian E. Nielsen,Muhammad Umer,Ravi P. Ramachandran,Ghulam Rasool
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:Transcription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.
[LG-54] Do we need equivariant models for molecule generation?
链接: https://arxiv.org/abs/2507.09753
作者: Ewa M. Nowara,Joshua Rackers,Patricia Suriana,Pan Kessel,Max Shen,Andrew Martin Watkins,Michael Maser
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Deep generative models are increasingly used for molecular discovery, with most recent approaches relying on equivariant graph neural networks (GNNs) under the assumption that explicit equivariance is essential for generating high-quality 3D molecules. However, these models are complex, difficult to train, and scale poorly. We investigate whether non-equivariant convolutional neural networks (CNNs) trained with rotation augmentations can learn equivariance and match the performance of equivariant models. We derive a loss decomposition that separates prediction error from equivariance error, and evaluate how model size, dataset size, and training duration affect performance across denoising, molecule generation, and property prediction. To our knowledge, this is the first study to analyze learned equivariance in generative tasks. Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM) Cite as: arXiv:2507.09753 [cs.LG] (or arXiv:2507.09753v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.09753 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-55] MB-RIRs: a Synthetic Room Impulse Response Dataset with Frequency-Dependent Absorption Coefficients
链接: https://arxiv.org/abs/2507.09750
作者: Enric Gusó,Joanna Luberadzka,Umut Sayin,Xavier Serra
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to WASPAA25
Abstract:We investigate the effects of four strategies for improving the ecological validity of synthetic room impulse response (RIR) datasets for monoaural Speech Enhancement (SE). We implement three features on top of the traditional image source method-based (ISM) shoebox RIRs: multiband absorption coefficients, source directivity and receiver directivity. We additionally consider mesh-based RIRs from the SoundSpaces dataset. We then train a DeepFilternet3 model for each RIR dataset and evaluate the performance on a test set of real RIRs both objectively and subjectively. We find that RIRs which use frequency-dependent acoustic absorption coefficients (MB-RIRs) can obtain +0.51dB of SDR and a +8.9 MUSHRA score when evaluated on real RIRs. The MB-RIRs dataset is publicly available for free download.
[LG-56] Continental scale habitat modelling with artificial intelligence and multimodal earth observation
链接: https://arxiv.org/abs/2507.09732
作者: Sara Si-Moussi,Stephan Hennekens,Sander Mucher,Stan Los,Wilfried Thuiller
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Applications (stat.AP)
*备注:
Abstract:Habitats integrate the abiotic conditions and biophysical structures that support biodiversity and sustain nature’s contributions to people. As these ecosystems face mounting pressure from human activities, accurate, high-resolution habitat maps are essential for effective conservation and restoration. Yet current maps often fall short in thematic or spatial resolution because they must (1) model several mutually exclusive habitat types that co-occur across landscapes and (2) cope with severe class imbalance that complicate multi-class training. Here, we evaluated how high-resolution remote sensing (RS) data and Artificial Intelligence (AI) tools can improve habitat classification over large geographic extents at fine thematic resolution. Using vegetation plots from the European Vegetation Archive, we modelled Level 3 EUNIS habitats across Europe and assessed multiple modelling strategies against independent validation datasets. Strategies that exploited the hierarchical nature of habitat nomenclatures resolved classification ambiguities, especially in fragmented landscapes. Integrating multi-spectral (MSI) and synthetic aperture radar (SAR) imagery, particularly through Earth Observation Foundation models, enhanced within-formation discrimination and overall performance. Finally, ensemble machine learning that corrects class imbalance boosted accuracy further. Our methodological framework is transferable beyond Europe and adaptable to other classification systems. Future research should advance temporal modelling of dynamic habitats, extend to habitat segmentation and quality assessment, and exploit next-generation EO data paired with higher-quality in-situ observations.
[LG-57] Phase transition of the Sinkhorn-Knopp algorithm
链接: https://arxiv.org/abs/2507.09711
作者: Kun He
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 44 pages, 2 figures
Abstract:The matrix scaling problem, particularly the Sinkhorn-Knopp algorithm, has been studied for over 60 years. In practice, the algorithm often yields high-quality approximations within just a few iterations. Theoretically, however, the best-known upper bound places it in the class of pseudopolynomial-time approximation algorithms. Meanwhile, the lower-bound landscape remains largely unexplored. Two fundamental questions persist: what accounts for the algorithm’s strong empirical performance, and can a tight bound on its iteration count be established? For an n\times n matrix, its normalized version is obtained by dividing each entry by its largest entry. We say that a normalized matrix has a density \gamma if there exists a constant \rho 0 such that one row or column has exactly \lceil \gamma n \rceil entries with values at least \rho , and every other row and column has at least \lceil \gamma n \rceil such entries. For the upper bound, we show that the Sinkhorn-Knopp algorithm produces a nearly doubly stochastic matrix in O(\log n - \log \varepsilon) iterations and \widetildeO(n^2) time for all nonnegative square matrices whose normalized version has a density \gamma 1/2 . Such matrices cover both the algorithm’s principal practical inputs and its typical theoretical regime, and the \widetildeO(n^2) runtime is optimal. For the lower bound, we establish a tight bound of \widetilde\Omega\left(n^1/2/\varepsilon\right) iterations for positive matrices under the \ell_2 -norm error measure. Moreover, for every \gamma 1/2 , there exists a matrix with density \gamma for which the algorithm requires \Omega\left(n^1/2/\varepsilon\right) iterations. In summary, our results reveal a sharp phase transition in the Sinkhorn-Knopp algorithm at the density threshold \gamma = 1/2 . Comments: 44 pages, 2 figures Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2507.09711 [cs.DS] (or arXiv:2507.09711v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2507.09711 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kun He [view email] [v1] Sun, 13 Jul 2025 17:07:51 UTC (43 KB) Full-text links: Access Paper: View a PDF of the paper titled Phase transition of the Sinkhorn-Knopp algorithm, by Kun HeView PDFTeX SourceOther Formats view license Current browse context: cs.DS prev | next new | recent | 2025-07 Change to browse by: cs cs.LG stat stat.ML References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
[LG-58] Symptom-Driven Personalized Proton Pump Inhibitors Therapy Using Bayesian Neural Networks and Model Predictive Control
链接: https://arxiv.org/abs/2507.09685
作者: Yutong Li,Ilya Kolmanovsky
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures
Abstract:Proton Pump Inhibitors (PPIs) are the standard of care for gastric acid disorders but carry significant risks when administered chronically at high doses. Precise long-term control of gastric acidity is challenged by the impracticality of invasive gastric acid monitoring beyond 72 hours and wide inter-patient variability. We propose a noninvasive, symptom-based framework that tailors PPI dosing solely on patient-reported reflux and digestive symptom patterns. A Bayesian Neural Network prediction model learns to predict patient symptoms and quantifies its uncertainty from historical symptom scores, meal, and PPIs intake data. These probabilistic forecasts feed a chance-constrained Model Predictive Control (MPC) algorithm that dynamically computes future PPI doses to minimize drug usage while enforcing acid suppression with high confidence - without any direct acid measurement. In silico studies over diverse dietary schedules and virtual patient profiles demonstrate that our learning-augmented MPC reduces total PPI consumption by 65 percent compared to standard fixed regimens, while maintaining acid suppression with at least 95 percent probability. The proposed approach offers a practical path to personalized PPI therapy, minimizing treatment burden and overdose risk without invasive sensors.
[LG-59] Networked Information Aggregation via Machine Learning
链接: https://arxiv.org/abs/2507.09683
作者: Michael Kearns,Aaron Roth,Emily Ryu
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
*备注:
Abstract:We study a distributed learning problem in which learning agents are embedded in a directed acyclic graph (DAG). There is a fixed and arbitrary distribution over feature/label pairs, and each agent or vertex in the graph is able to directly observe only a subset of the features – potentially a different subset for every agent. The agents learn sequentially in some order consistent with a topological sort of the DAG, committing to a model mapping observations to predictions of the real-valued label. Each agent observes the predictions of their parents in the DAG, and trains their model using both the features of the instance that they directly observe, and the predictions of their parents as additional features. We ask when this process is sufficient to achieve \emphinformation aggregation, in the sense that some agent in the DAG is able to learn a model whose error is competitive with the best model that could have been learned (in some hypothesis class) with direct access to \emphall features, despite the fact that no single agent in the network has such access. We give upper and lower bounds for this problem for both linear and general hypothesis classes. Our results identify the \emphdepth of the DAG as the key parameter: information aggregation can occur over sufficiently long paths in the DAG, assuming that all of the relevant features are well represented along the path, and there are distributions over which information aggregation cannot occur even in the linear case, and even in arbitrarily large DAGs that do not have sufficient depth (such as a hub-and-spokes topology in which the spoke vertices collectively see all the features). We complement our theoretical results with a comprehensive set of experiments.
[LG-60] Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset
链接: https://arxiv.org/abs/2507.09650
作者: Lily Hong Zhang,Smitha Milli,Karen Jusko,Jonathan Smith,Brandon Amos,Wassim(Wes)Bouaziz,Manon Revel,Jack Kussman,Lisa Titus,Bhaktipriya Radharapu,Jane Yu,Vidya Sarma,Kris Rose,Maximilian Nickel
类目: Machine Learning (cs.LG)
*备注:
Abstract:How can large language models (LLMs) serve users with varying preferences that may conflict across cultural, political, or other dimensions? To advance this challenge, this paper establishes four key results. First, we demonstrate, through a large-scale multilingual human study with representative samples from five countries (N=15,000), that humans exhibit significantly more variation in preferences than the responses of 21 state-of-the-art LLMs. Second, we show that existing methods for preference dataset collection are insufficient for learning the diversity of human preferences even along two of the most salient dimensions of variability in global values, due to the underlying homogeneity of candidate responses. Third, we argue that this motivates the need for negatively-correlated sampling when generating candidate sets, and we show that simple prompt-based techniques for doing so significantly enhance the performance of alignment methods in learning heterogeneous preferences. Fourth, based on this novel candidate sampling approach, we collect and open-source Community Alignment, the largest and most representative multilingual and multi-turn preference dataset to date, featuring almost 200,000 comparisons from annotators spanning five countries. We hope that the Community Alignment dataset will be a valuable resource for improving the effectiveness of LLMs for a diverse global population.
[LG-61] CAN-Trace Attack: Exploit CAN Messages to Uncover Driving Trajectories
链接: https://arxiv.org/abs/2507.09624
作者: Xiaojie Lin,Baihe Ma,Xu Wang,Guangsheng Yu,Ying He,Wei Ni,Ren Ping Liu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Driving trajectory data remains vulnerable to privacy breaches despite existing mitigation measures. Traditional methods for detecting driving trajectories typically rely on map-matching the path using Global Positioning System (GPS) data, which is susceptible to GPS data outage. This paper introduces CAN-Trace, a novel privacy attack mechanism that leverages Controller Area Network (CAN) messages to uncover driving trajectories, posing a significant risk to drivers’ long-term privacy. A new trajectory reconstruction algorithm is proposed to transform the CAN messages, specifically vehicle speed and accelerator pedal position, into weighted graphs accommodating various driving statuses. CAN-Trace identifies driving trajectories using graph-matching algorithms applied to the created graphs in comparison to road networks. We also design a new metric to evaluate matched candidates, which allows for potential data gaps and matching inaccuracies. Empirical validation under various real-world conditions, encompassing different vehicles and driving regions, demonstrates the efficacy of CAN-Trace: it achieves an attack success rate of up to 90.59% in the urban region, and 99.41% in the suburban region.
[LG-62] Holistix: A Dataset for Holistic Wellness Dimensions Analysis in Mental Health Narratives
链接: https://arxiv.org/abs/2507.09565
作者: Heeba Shakeel,Tanvir Ahmad,Chandni Saxena
类目: Machine Learning (cs.LG)
*备注: 7 Pages
Abstract:We introduce a dataset for classifying wellness dimensions in social media user posts, covering six key aspects: physical, emotional, social, intellectual, spiritual, and vocational. The dataset is designed to capture these dimensions in user-generated content, with a comprehensive annotation framework developed under the guidance of domain experts. This framework allows for the classification of text spans into the appropriate wellness categories. We evaluate both traditional machine learning models and advanced transformer-based models for this multi-class classification task, with performance assessed using precision, recall, and F1-score, averaged over 10-fold cross-validation. Post-hoc explanations are applied to ensure the transparency and interpretability of model decisions. The proposed dataset contributes to region-specific wellness assessments in social media and paves the way for personalized well-being evaluations and early intervention strategies in mental health. We adhere to ethical considerations for constructing and releasing our experiments and dataset publicly on Github.
[LG-63] Lightweight Federated Learning over Wireless Edge Networks
链接: https://arxiv.org/abs/2507.09546
作者: Xiangwang Hou,Jingjing Wang,Jun Du,Chunxiao Jiang,Yong Ren,Dusit Niyato
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:With the exponential growth of smart devices connected to wireless networks, data production is increasing rapidly, requiring machine learning (ML) techniques to unlock its value. However, the centralized ML paradigm raises concerns over communication overhead and privacy. Federated learning (FL) offers an alternative at the network edge, but practical deployment in wireless networks remains challenging. This paper proposes a lightweight FL (LTFL) framework integrating wireless transmission power control, model pruning, and gradient quantization. We derive a closed-form expression of the FL convergence gap, considering transmission error, model pruning error, and gradient quantization error. Based on these insights, we formulate an optimization problem to minimize the convergence gap while meeting delay and energy constraints. To solve the non-convex problem efficiently, we derive closed-form solutions for the optimal model pruning ratio and gradient quantization level, and employ Bayesian optimization for transmission power control. Extensive experiments on real-world datasets show that LTFL outperforms state-of-the-art schemes.
[LG-64] Assessing reliability of explanations in unbalanced datasets: a use-case on the occurrence of frost events
链接: https://arxiv.org/abs/2507.09545
作者: Ilaria Vascotto,Valentina Blasone,Alex Rodriguez,Alessandro Bonaita,Luca Bortolussi
类目: Machine Learning (cs.LG)
*备注: Late Breaking Work presented at the 3rd World Conference on eXplainable Artificial Intelligence (XAI2025)
Abstract:The usage of eXplainable Artificial Intelligence (XAI) methods has become essential in practical applications, given the increasing deployment of Artificial Intelligence (AI) models and the legislative requirements put forward in the latest years. A fundamental but often underestimated aspect of the explanations is their robustness, a key property that should be satisfied in order to trust the explanations. In this study, we provide some preliminary insights on evaluating the reliability of explanations in the specific case of unbalanced datasets, which are very frequent in high-risk use-cases, but at the same time considerably challenging for both AI models and XAI methods. We propose a simple evaluation focused on the minority class (i.e. the less frequent one) that leverages on-manifold generation of neighbours, explanation aggregation and a metric to test explanation consistency. We present a use-case based on a tabular dataset with numerical features focusing on the occurrence of frost events.
[LG-65] Neural Two-Stage Stochastic Optimization for Solving Unit Commitment Problem
链接: https://arxiv.org/abs/2507.09503
作者: Zhentong Shao,Jingtao Qin,Nanpeng Yu
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Power Systems
Abstract:This paper proposes a neural stochastic optimization method for efficiently solving the two-stage stochastic unit commitment (2S-SUC) problem under high-dimensional uncertainty scenarios. The proposed method approximates the second-stage recourse problem using a deep neural network trained to map commitment decisions and uncertainty features to recourse costs. The trained network is subsequently embedded into the first-stage UC problem as a mixed-integer linear program (MILP), allowing for explicit enforcement of operational constraints while preserving the key uncertainty characteristics. A scenario-embedding network is employed to enable dimensionality reduction and feature aggregation across arbitrary scenario sets, serving as a data-driven scenario reduction mechanism. Numerical experiments on IEEE 5-bus, 30-bus, and 118-bus systems demonstrate that the proposed neural two-stage stochastic optimization method achieves solutions with an optimality gap of less than 1%, while enabling orders-of-magnitude speedup compared to conventional MILP solvers and decomposition-based methods. Moreover, the model’s size remains constant regardless of the number of scenarios, offering significant scalability for large-scale stochastic unit commitment problems.
[LG-66] Discrete Differential Principle for Continuous Smooth Function Representation
链接: https://arxiv.org/abs/2507.09480
作者: Guoyou Wang,Yihua Tan,Shiqi Liu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Taylor’s formula holds significant importance in function representation, such as solving differential difference equations, ordinary differential equations, partial differential equations, and further promotes applications in visual perception, complex control, fluid mechanics, weather forecasting and thermodynamics. However, the Taylor’s formula suffers from the curse of dimensionality and error propagation during derivative computation in discrete situations. In this paper, we propose a new discrete differential operator to estimate derivatives and to represent continuous smooth function locally using the Vandermonde coefficient matrix derived from truncated Taylor series. Our method simultaneously computes all derivatives of orders less than the number of sample points, inherently mitigating error propagation. Utilizing equidistant uniform sampling, it achieves high-order accuracy while alleviating the curse of dimensionality. We mathematically establish rigorous error bounds for both derivative estimation and function representation, demonstrating tighter bounds for lower-order derivatives. We extend our method to the two-dimensional case, enabling its use for multivariate derivative calculations. Experiments demonstrate the effectiveness and superiority of the proposed method compared to the finite forward difference method for derivative estimation and cubic spline and linear interpolation for function representation. Consequently, our technique offers broad applicability across domains such as vision representation, feature extraction, fluid mechanics, and cross-media imaging.
[LG-67] Incentive-Aware Dynamic Resource Allocation under Long-Term Cost Constraints
链接: https://arxiv.org/abs/2507.09473
作者: Yan Dai,Negin Golrezaei,Patrick Jaillet
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Motivated by applications such as cloud platforms allocating GPUs to users or governments deploying mobile health units across competing regions, we study the dynamic allocation of a reusable resource to strategic agents with private valuations. Our objective is to simultaneously (i) maximize social welfare, (ii) satisfy multi-dimensional long-term cost constraints, and (iii) incentivize truthful reporting. We begin by numerically evaluating primal-dual methods widely used in constrained online optimization and find them to be highly fragile in strategic settings – agents can easily manipulate their reports to distort future dual updates for future gain. To address this vulnerability, we develop an incentive-aware framework that makes primal-dual methods robust to strategic behavior. Our design combines epoch-based lazy updates – where dual variables remain fixed within each epoch – with randomized exploration rounds that extract approximately truthful signals for learning. Leveraging carefully designed online learning subroutines that can be of independent interest for dual updates, our mechanism achieves \tilde\mathcalO(\sqrtT) social welfare regret, satisfies all cost constraints, and ensures incentive alignment. This matches the performance of non-strategic allocation approaches while being robust to strategic agents. Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2507.09473 [cs.GT] (or arXiv:2507.09473v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2507.09473 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-68] La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching
链接: https://arxiv.org/abs/2507.09466
作者: Tomas Geffner,Kieran Didi,Zhonglin Cao,Danny Reidenbach,Zuobai Zhang,Christian Dallago,Emine Kucukbenli,Karsten Kreis,Arash Vahdat
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Recently, many generative models for de novo protein structure design have emerged. Yet, only few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina’s scalability and robustness.
[LG-69] oward Developing Machine-Learning-Aided Tools for the Thermomechanical Monitoring of Nuclear Reactor Components
链接: https://arxiv.org/abs/2507.09443
作者: Luiz Aldeia Machado,Victor Coppo Leite,Elia Merzari,Arthur Motta,Roberto Ponciroli,Lander Ibarra,Lise Charlot
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: Preprint - Nureth 21 paper
Abstract:Proactive maintenance strategies, such as Predictive Maintenance (PdM), play an important role in the operation of Nuclear Power Plants (NPPs), particularly due to their capacity to reduce offline time by preventing unexpected shutdowns caused by component failures. In this work, we explore the use of a Convolutional Neural Network (CNN) architecture combined with a computational thermomechanical model to calculate the temperature, stress, and strain of a Pressurized Water Reactor (PWR) fuel rod during operation. This estimation relies on a limited number of temperature measurements from the cladding’s outer surface. This methodology can potentially aid in developing PdM tools for nuclear reactors by enabling real-time monitoring of such systems. The training, validation, and testing datasets were generated through coupled simulations involving BISON, a finite element-based nuclear fuel performance code, and the MOOSE Thermal-Hydraulics Module (MOOSE-THM). We conducted eleven simulations, varying the peak linear heat generation rates. Of these, eight were used for training, two for validation, and one for testing. The CNN was trained for over 1,000 epochs without signs of overfitting, achieving highly accurate temperature distribution predictions. These were then used in a thermomechanical model to determine the stress and strain distribution within the fuel rod. Comments: Preprint - Nureth 21 paper Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE) Cite as: arXiv:2507.09443 [cs.LG] (or arXiv:2507.09443v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.09443 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Luiz Aldeia Machado [view email] [v1] Sun, 13 Jul 2025 01:32:46 UTC (1,107 KB)
[LG-70] On Information Geometry and Iterative Optimization in Model Compression: Operator Factorization
链接: https://arxiv.org/abs/2507.09428
作者: Zakhar Shumaylov,Vasileios Tsiaras,Yannis Stylianou
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Optimization and Control (math.OC)
*备注:
Abstract:The ever-increasing parameter counts of deep learning models necessitate effective compression techniques for deployment on resource-constrained devices. This paper explores the application of information geometry, the study of density-induced metrics on parameter spaces, to analyze existing methods within the space of model compression, primarily focusing on operator factorization. Adopting this perspective highlights the core challenge: defining an optimal low-compute submanifold (or subset) and projecting onto it. We argue that many successful model compression approaches can be understood as implicitly approximating information divergences for this projection. We highlight that when compressing a pre-trained model, using information divergences is paramount for achieving improved zero-shot accuracy, yet this may no longer be the case when the model is fine-tuned. In such scenarios, trainability of bottlenecked models turns out to be far more important for achieving high compression ratios with minimal performance degradation, necessitating adoption of iterative methods. In this context, we prove convergence of iterative singular value thresholding for training neural networks subject to a soft rank constraint. To further illustrate the utility of this perspective, we showcase how simple modifications to existing methods through softer rank reduction result in improved performance under fixed compression rates.
[LG-71] Scaling Laws for Optimal Data Mixtures
链接: https://arxiv.org/abs/2507.09404
作者: Mustafa Shukor,Louis Bethune,Dan Busbridge,David Grangier,Enrico Fini,Alaaeldin El-Nouby,Pierre Ablin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large foundation models are typically trained on data from multiple domains, with the data mixture–the proportion of each domain used–playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h . We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision models (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow to derive the optimal domain weights for any target domain under a given training budget ( N , D ), providing a principled alternative to costly trial-and-error methods.
[LG-72] A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention ICML2025
链接: https://arxiv.org/abs/2507.09394
作者: Nandan Kumar Jha,Brandon Reagen
类目: Machine Learning (cs.LG)
*备注: ICML 2025 Workshop on High-dimensional Learning Dynamics (HiLD)
Abstract:In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer’s internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the W_QW_K^\top gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals \textbfthree key findings: \textbf i) capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; \textbf ii) these spikes coincide with rank collapse, concentrating the model’s expressivity into narrow subspaces; \textbf iii) only the decoupled variant prevents this cascade, maintaining broad spectral support and suppressing outlier formation across layers. These results underscore that \emphhow rotary embeddings are applied is just as critical as \emphwhere compression occurs. Sharing rotary components across heads mitigates spectral fragmentation and preserves representational capacity.
[LG-73] Geometric Generative Modeling with Noise-Conditioned Graph Networks ICML2025
链接: https://arxiv.org/abs/2507.09391
作者: Peter Pao-Huang,Mitchell Black,Xiaojie Qiu
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:Generative modeling of graphs with spatial structure is essential across many applications from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce \textitNoise-Conditioned Graph Networks (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise-independent architectures on a variety of domains including 3 D point clouds, spatiotemporal transcriptomics, and images. Code is available at this https URL.
[LG-74] Credit Card Fraud Detection Using RoFormer Model With Relative Distance Rotating Encoding
链接: https://arxiv.org/abs/2507.09385
作者: Kevin Reyes,Vasco Cortez
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 2025 IEEE Conference on Artificial Intelligence (CAI)
Abstract:Fraud detection is one of the most important challenges that financial systems must address. Detecting fraudulent transactions is critical for payment gateway companies like Flow Payment, which process millions of transactions monthly and require robust security measures to mitigate financial risks. Increasing transaction authorization rates while reducing fraud is essential for providing a good user experience and building a sustainable business. For this reason, discovering novel and improved methods to detect fraud requires continuous research and investment for any company that wants to succeed in this industry. In this work, we introduced a novel method for detecting transactional fraud by incorporating the Relative Distance Rotating Encoding (ReDRE) in the RoFormer model. The incorporation of angle rotation using ReDRE enhances the characterization of time series data within a Transformer, leading to improved fraud detection by better capturing temporal dependencies and event relationships.
[LG-75] Real-Time Adaptive Motion Planning via Point Cloud-Guided Energy-Based Diffusion and Potential Fields
链接: https://arxiv.org/abs/2507.09383
作者: Wondmgezahu Teshome,Kian Behzad,Octavia Camps,Michael Everett,Milad Siami,Mario Sznaier
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to IEEE RA-L 2025
Abstract:Motivated by the problem of pursuit-evasion, we present a motion planning framework that combines energy-based diffusion models with artificial potential fields for robust real time trajectory generation in complex environments. Our approach processes obstacle information directly from point clouds, enabling efficient planning without requiring complete geometric representations. The framework employs classifier-free guidance training and integrates local potential fields during sampling to enhance obstacle avoidance. In dynamic scenarios, the system generates initial trajectories using the diffusion model and continuously refines them through potential field-based adaptation, demonstrating effective performance in pursuit-evasion scenarios with partial pursuer observability.
[LG-76] Meta-autoencoders: An approach to discovery and representation of relationships between dynamically evolving classes
链接: https://arxiv.org/abs/2507.09362
作者: Assaf Marron,Smadar Szekely,Irun Cohen,David Harel
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注:
Abstract:An autoencoder (AE) is a neural network that, using self-supervised training, learns a succinct parameterized representation, and a corresponding encoding and decoding process, for all instances in a given class. Here, we introduce the concept of a meta-autoencoder (MAE): an AE for a collection of autoencoders. Given a family of classes that differ from each other by the values of some parameters, and a trained AE for each class, an MAE for the family is a neural net that has learned a compact representation and associated encoder and decoder for the class-specific AEs. One application of this general concept is in research and modeling of natural evolution – capturing the defining and the distinguishing properties across multiple species that are dynamically evolving from each other and from common ancestors. In this interim report we provide a constructive definition of MAEs, initial examples, and the motivating research directions in machine learning and biology.
[LG-77] Unified Linear Parametric Map Modeling and Perception-aware Trajectory Planning for Mobile Robotics
链接: https://arxiv.org/abs/2507.09340
作者: Hongyu Nie,Xingyu Li,Xu Liu,Zhaotong Tan,Sen Mei,Wenbo Su
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Robotics (TRO) in July 2025
Abstract:Autonomous navigation in mobile robots, reliant on perception and planning, faces major hurdles in large-scale, complex environments. These include heavy computational burdens for mapping, sensor occlusion failures for UAVs, and traversal challenges on irregular terrain for UGVs, all compounded by a lack of perception-aware strategies. To address these challenges, we introduce Random Mapping and Random Projection (RMRP). This method constructs a lightweight linear parametric map by first mapping data to a high-dimensional space, followed by a sparse random projection for dimensionality reduction. Our novel Residual Energy Preservation Theorem provides theoretical guarantees for this process, ensuring critical geometric properties are preserved. Based on this map, we propose the RPATR (Robust Perception-Aware Trajectory Planner) framework. For UAVs, our method unifies grid and Euclidean Signed Distance Field (ESDF) maps. The front-end uses an analytical occupancy gradient to refine initial paths for safety and smoothness, while the back-end uses a closed-form ESDF for trajectory optimization. Leveraging the trained RMRP model’s generalization, the planner predicts unobserved areas for proactive navigation. For UGVs, the model characterizes terrain and provides closed-form gradients, enabling online planning to circumvent large holes. Validated in diverse scenarios, our framework demonstrates superior mapping performance in time, memory, and accuracy, and enables computationally efficient, safe navigation for high-speed UAVs and UGVs. The code will be released to foster community collaboration.
[LG-78] PP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding
链接: https://arxiv.org/abs/2507.09252
作者: Shukai Gong,Yiyang Fu,Fengyuan Ran,Feng Zhou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We propose TPP-SD, a novel approach that accelerates Transformer temporal point process (TPP) sampling by adapting speculative decoding (SD) techniques from language models. By identifying the structural similarities between thinning algorithms for TPPs and speculative decoding for language models, we develop an efficient sampling framework that leverages a smaller draft model to generate multiple candidate events, which are then verified by the larger target model in parallel. TPP-SD maintains the same output distribution as autoregressive sampling while achieving significant acceleration. Experiments on both synthetic and real datasets demonstrate that our approach produces samples from identical distributions as standard methods, but with 2-6 \times speedup. Our ablation studies analyze the impact of hyperparameters such as draft length and draft model size on sampling efficiency. TPP-SD bridges the gap between powerful Transformer TPP models and the practical need for rapid sequence sampling.
[LG-79] Optimizing Basis Function Selection in Constructive Wavelet Neural Networks and Its Applications
链接: https://arxiv.org/abs/2507.09213
作者: Dunsheng Huang,Dong Shen,Lei Lu,Ying Tan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17pages
Abstract:Wavelet neural network (WNN), which learns an unknown nonlinear mapping from the data, has been widely used in signal processing, and time-series analysis. However, challenges in constructing accurate wavelet bases and high computational costs limit their application. This study introduces a constructive WNN that selects initial bases and trains functions by introducing new bases for predefined accuracy while reducing computational costs. For the first time, we analyze the frequency of unknown nonlinear functions and select appropriate initial wavelets based on their primary frequency components by estimating the energy of the spatial frequency component. This leads to a novel constructive framework consisting of a frequency estimator and a wavelet-basis increase mechanism to prioritize high-energy bases, significantly improving computational efficiency. The theoretical foundation defines the necessary time-frequency range for high-dimensional wavelets at a given accuracy. The framework’s versatility is demonstrated through four examples: estimating unknown static mappings from offline data, combining two offline datasets, identifying time-varying mappings from time-series data, and capturing nonlinear dependencies in real time-series data. These examples showcase the framework’s broad applicability and practicality. All the code will be released at this https URL.
[LG-80] Capturing Unseen Spatial Extremes Through Knowledge-Informed Generative Modeling
链接: https://arxiv.org/abs/2507.09211
作者: Xinyue Liu,Xiao Peng,Shuyue Yan,Yuntian Chen,Dongxiao Zhang,Zhixiao Niu,Hui-Min Wang,Xiaogang He
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Data Analysis, Statistics and Probability (physics.data-an); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
*备注:
Abstract:Observed records of climate extremes provide an incomplete picture of risk, missing “unseen” extremes that exceed historical bounds. In parallel, neglecting spatial dependence undervalues the risk of synchronized hazards that amplify impacts. To address these challenges, we develop DeepX-GAN (Dependence-Enhanced Embedding for Physical eXtremes - Generative Adversarial Network), a knowledge-informed deep generative model designed to better capture the spatial structure of rare extremes. The zero-shot generalizability of DeepX-GAN enables simulation of unseen extremes that fall outside historical experience yet remain statistically plausible. We define two types of unseen extremes: “checkmate” extremes that directly hit targets, and “stalemate” extremes that narrowly miss. These unrealized scenarios expose latent risks in fragile systems and may reinforce a false sense of resilience if overlooked. Near misses, in particular, can prompt either proactive adaptation or dangerous complacency, depending on how they are interpreted. Applying DeepX-GAN to the Middle East and North Africa (MENA), we find that these unseen extremes disproportionately affect regions with high vulnerability and low socioeconomic readiness, but differ in urgency and interpretation. Future warming could expand and redistribute these unseen extremes, with emerging exposure hotspots in Indo-Pakistan and Central Africa. This distributional shift highlights critical blind spots in conventional hazard planning and underscores the need to develop spatially adaptive policies that anticipate emergent risk hotspots rather than simply extrapolating from historical patterns.
[LG-81] actile-VLA: Unlocking Vision-Language-Action Models Physical Knowledge for Tactile Generalization
链接: https://arxiv.org/abs/2507.09160
作者: Jialei Huang,Shuo Wang,Fanqi Lin,Yihang Hu,Chuan Wen,Yang Gao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Vision-Language-Action (VLA) models have shown remarkable achievements, driven by the rich implicit knowledge of their vision-language components. However, achieving generalist robotic agents demands precise grounding into physical interactions, especially in contact-rich scenarios where fine-grained force control is essential. We advance VLAs’ implicit knowledge beyond identifying what to do, towards guiding how to physically interact with real world. This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. This framework incorporates a hybrid position-force controller to translate the model’s intentions into precise physical actions and a reasoning module that allows the robot to adapt its strategy based on tactile feedback. Experiments demonstrate Tactile-VLA’s effectiveness and generalizability in three key aspects: (1) enabling tactile-aware instruction following, (2) utilizing tactile-relevant commonsense, and (3) facilitating adaptive tactile-involved reasoning. A key finding is that the VLM’s prior knowledge already contains semantic understanding of physical interaction; by connecting it to the robot’s tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks.
[LG-82] HedraRAG : Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving SOSP2025
链接: https://arxiv.org/abs/2507.09138
作者: Zhengding Hu,Vibha Murthy,Zaifeng Pan,Wanlu Li,Xiaoyi Fang,Yufei Ding,Yuke Wang
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Accepted by SOSP 2025
Abstract:This paper addresses emerging system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving, where complex multi-stage workflows and diverse request patterns complicate efficient execution. We present HedraRAG, a runtime system built on a graph-based abstraction that exposes optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness. These opportunities are realized through dynamic graph transformations, such as node splitting, reordering, edge addition, and dependency rewiring, applied to wavefronts of subgraphs spanning concurrent requests. The resulting execution plans are mapped onto hybrid CPU-GPU pipelines to improve resource utilization and reduce latency. Evaluations across a wide range of RAG workflows demonstrate speedups exceeding 1.5x and reaching up to 5x over existing frameworks, showcasing the effectiveness of coordinated generation and retrieval in serving environments.
[LG-83] A Study of Value-Aware Eigenoptions
链接: https://arxiv.org/abs/2507.09127
作者: Harshil Kotamreddy,Marlos C. Machado
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Presented at the RLC Workshop on Inductive Biases in Reinforcement Learning 2025
Abstract:Options, which impose an inductive bias toward temporal and hierarchical structure, offer a powerful framework for reinforcement learning (RL). While effective in sequential decision-making, they are often handcrafted rather than learned. Among approaches for discovering options, eigenoptions have shown strong performance in exploration, but their role in credit assignment remains underexplored. In this paper, we investigate whether eigenoptions can accelerate credit assignment in model-free RL, evaluating them in tabular and pixel-based gridworlds. We find that pre-specified eigenoptions aid not only exploration but also credit assignment, whereas online discovery can bias the agent’s experience too strongly and hinder learning. In the context of deep RL, we also propose a method for learning option-values under non-linear function approximation, highlighting the impact of termination conditions on performance. Our findings reveal both the promise and complexity of using eigenoptions, and options more broadly, to simultaneously support credit assignment and exploration in reinforcement learning.
[LG-84] S2SRec2: Set-to-Set Recommendation for Basket Completion with Recipe
链接: https://arxiv.org/abs/2507.09101
作者: Yanan Cao,Omid Memarrast,Shiqin Cai,Sinduja Subramaniam,Evren Korpeoglu,Kannan Achan
类目: Machine Learning (cs.LG)
*备注:
Abstract:In grocery e-commerce, customers often build ingredient baskets guided by dietary preferences but lack the expertise to create complete meals. Leveraging recipe knowledge to recommend complementary ingredients based on a partial basket is essential for improving the culinary experience. Traditional recipe completion methods typically predict a single missing ingredient using a leave-one-out strategy. However, they fall short in two key aspects: (i) they do not reflect real-world scenarios where multiple ingredients are often needed, and (ii) they overlook relationships among the missing ingredients themselves. To address these limitations, we reformulate basket completion as a set-to-set (S2S) recommendation problem, where an incomplete basket is input into a system that predicts a set of complementary ingredients. We introduce S2SRec2, a set-to-set ingredient recommendation framework based on a Set Transformer and trained in a multitask learning paradigm. S2SRec2 jointly learns to (i) retrieve missing ingredients from the representation of existing ones and (ii) assess basket completeness after prediction. These tasks are optimized together, enforcing accurate retrieval and coherent basket completion. Experiments on large-scale recipe datasets and qualitative analyses show that S2SRec2 significantly outperforms single-target baselines, offering a promising approach to enhance grocery shopping and inspire culinary creativity.
[LG-85] On the Frag ility of Multimodal Perception to Temporal Misalignment in Autonomous Driving
链接: https://arxiv.org/abs/2507.09095
作者: Md Hasan Shahriar,Md Mohaimin Al Barat,Harshavardhan Sundar,Naren Ramakrishnan,Y. Thomas Hou,Wenjing Lou
类目: Machine Learning (cs.LG)
*备注: 16 pages
Abstract:Multimodal fusion (MMF) plays a critical role in the perception of autonomous driving, which primarily fuses camera and LiDAR streams for a comprehensive and efficient scene understanding. However, its strict reliance on precise temporal synchronization exposes it to new vulnerabilities. In this paper, we introduce DejaVu, a novel attack that exploits network-induced delays to create subtle temporal misalignments across sensor streams, severely degrading downstream MMF-based perception tasks. Our comprehensive attack analysis across different models and datasets reveals these sensors’ task-specific imbalanced sensitivities: object detection is overly dependent on LiDAR inputs while object tracking is highly reliant on the camera inputs. Consequently, with a single-frame LiDAR delay, an attacker can reduce the car detection mAP by up to 88.5%, while with a three-frame camera delay, multiple object tracking accuracy (MOTA) for car drops by 73%. To detect such attacks, we propose AION, a defense patch that can work alongside the existing perception model to monitor temporal alignment through cross-modal temporal consistency. AION leverages multimodal shared representation learning and dynamic time warping to determine the path of temporal alignment and calculate anomaly scores based on the alignment. Our thorough evaluation of AION shows it achieves AUROC scores of 0.92-0.98 with low false positives across datasets and model architectures, demonstrating it as a robust and generalized defense against the temporal misalignment attacks.
[LG-86] Continuous-Time Signal Decomposition: An Implicit Neural Generalization of PCA and ICA
链接: https://arxiv.org/abs/2507.09091
作者: Shayan K. Azmoodeh,Krishna Subramani,Paris Smaragdis
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: 6 pages, 3 figures, 1 table. MLSP 2025
Abstract:We generalize the low-rank decomposition problem, such as principal and independent component analysis (PCA, ICA) for continuous-time vector-valued signals and provide a model-agnostic implicit neural signal representation framework to learn numerical approximations to solve the problem. Modeling signals as continuous-time stochastic processes, we unify the approaches to both the PCA and ICA problems in the continuous setting through a contrast function term in the network loss, enforcing the desired statistical properties of the source signals (decorrelation, independence) learned in the decomposition. This extension to a continuous domain allows the application of such decompositions to point clouds and irregularly sampled signals where standard techniques are not applicable.
[LG-87] Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction
链接: https://arxiv.org/abs/2507.09061
作者: Thomas T. Zhang,Daniel Pfrommer,Nikolai Matni,Max Simchowitz
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
Abstract:We study the problem of imitating an expert demonstrator in a continuous state-and-action dynamical system. While imitation learning in discrete settings such as autoregressive language modeling has seen immense success and popularity in recent years, imitation in physical settings such as autonomous driving and robot learning has proven comparably more complex due to the compounding errors problem, often requiring elaborate set-ups to perform stably. Recent work has demonstrated that even in benign settings, exponential compounding errors are unavoidable when learning solely from expert-controlled trajectories, suggesting the need for more advanced policy parameterizations or data augmentation. To this end, we present minimal interventions that provably mitigate compounding errors in continuous state-and-action imitation learning. When the system is open-loop stable, we prescribe “action chunking,” i.e., predicting and playing sequences of actions in open-loop; when the system is possibly unstable, we prescribe “noise injection,” i.e., adding noise during expert demonstrations. These interventions align with popular choices in modern robot learning, though the benefits we derive are distinct from the effects they were designed to target. Our results draw insights and tools from both control theory and reinforcement learning; however, our analysis reveals novel considerations that do not naturally arise when either literature is considered in isolation.
[LG-88] Shortening the Trajectories: Identity-Aware Gaussian Approximation for Efficient 3D Molecular Generation
链接: https://arxiv.org/abs/2507.09043
作者: Jingxiang Qu,Wenhan Gao,Yi Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Gaussian-based Probabilistic Generative Models (GPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. While these models have achieved state-of-the-art performance across diverse domains, their practical deployment remains constrained by the high computational cost of long generative trajectories, which often involve hundreds to thousands of steps during training and sampling. In this work, we introduce a theoretically grounded and empirically validated framework that improves generation efficiency without sacrificing training granularity or inference fidelity. Our key insight is that for certain data modalities, the noising process causes data to rapidly lose its identity and converge toward a Gaussian distribution. We analytically identify a characteristic step at which the data has acquired sufficient Gaussianity, and then replace the remaining generation trajectory with a closed-form Gaussian approximation. Unlike existing acceleration techniques that coarsening the trajectories by skipping steps, our method preserves the full resolution of learning dynamics while avoiding redundant stochastic perturbations between `Gaussian-like’ distributions. Empirical results across multiple data modalities demonstrate substantial improvements in both sample quality and computational efficiency.
[LG-89] Behavioral Exploration: Learning to Explore via In-Context Adaptation
链接: https://arxiv.org/abs/2507.09041
作者: Andrew Wagenmaker,Zhiyuan Zhou,Sergey Levine
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:
Abstract:Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with such capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose behavioral exploration: training agents to internalize what it means to explore and adapt in-context over the space of expert'' behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how
exploratory’’ the expert’s behaviors are relative to this context. This enables the model to not only mimic the behavior of an expert, but also, by feeding its past history of interactions into its context, to select different expert behaviors than what have been previously selected, thereby allowing for fast online adaptation and targeted, ``expert-like’’ exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.
[LG-90] Enhancing RLHF with Human Gaze Modeling
链接: https://arxiv.org/abs/2507.09016
作者: Karim Galliamov,Ivan Titov,Ilya Pershin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences but is computationally expensive. We explore two approaches that leverage human gaze modeling to enhance RLHF: (1) gaze-aware reward models and (2) gaze-based distribution of sparse rewards at token level. Our experiments demonstate that gaze-informed RLHF achieves faster convergence while maintaining or slightly improving performance, thus, reducing computational costs during policy optimization. These results show that human gaze provides a valuable and underused signal for policy optimization, pointing to a promising direction for improving RLHF efficiency.
[LG-91] Exploiting Leaderboards for Large-Scale Distribution of Malicious Models
链接: https://arxiv.org/abs/2507.08983
作者: Anshuman Suri,Harsh Chaudhari,Yuefeng Peng,Ali Naseh,Amir Houmansadr,Alina Oprea
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:While poisoning attacks on machine learning models have been extensively studied, the mechanisms by which adversaries can distribute poisoned models at scale remain largely unexplored. In this paper, we shed light on how model leaderboards – ranked platforms for model discovery and evaluation – can serve as a powerful channel for adversaries for stealthy large-scale distribution of poisoned models. We present TrojanClimb, a general framework that enables injection of malicious behaviors while maintaining competitive leaderboard performance. We demonstrate its effectiveness across four diverse modalities: text-embedding, text-generation, text-to-speech and text-to-image, showing that adversaries can successfully achieve high leaderboard rankings while embedding arbitrary harmful functionalities, from backdoors to bias injection. Our findings reveal a significant vulnerability in the machine learning ecosystem, highlighting the urgent need to redesign leaderboard evaluation mechanisms to detect and filter malicious (e.g., poisoned) models, while exposing broader security implications for the machine learning community regarding the risks of adopting models from unverified sources.
[LG-92] Graph Neural Network Enhanced Sequential Recommendation Method for Cross-Platform Ad Campaign
链接: https://arxiv.org/abs/2507.08959
作者: Xiang Li,Xinyu Wang,Yifan Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:In order to improve the accuracy of cross-platform advertisement recommendation, a graph neural network (GNN)- based advertisement recommendation method is analyzed. Through multi-dimensional modeling, user behavior data (e.g., click frequency, active duration) reveal temporal patterns of interest evolution, ad content (e.g., type, tag, duration) influences semantic preferences, and platform features (e.g., device type, usage context) shape the environment where interest transitions occur. These factors jointly enable the GNN to capture the latent pathways of user interest migration across platforms. The experimental results are based on the datasets of three platforms, and Platform B reaches 0.937 in AUC value, which is the best performance. Platform A and Platform C showed a slight decrease in precision and recall with uneven distribution of ad labels. By adjusting the hyperparameters such as learning rate, batch size and embedding dimension, the adaptability and robustness of the model in heterogeneous data are further improved.
[LG-93] Beyond Scores: Proximal Diffusion Models
链接: https://arxiv.org/abs/2507.08956
作者: Zhenghan Fang,Mateo Díaz,Sam Buchanan,Jeremias Sulam
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:Diffusion models have quickly become some of the most popular and powerful generative models for high-dimensional data. The key insight that enabled their development was the realization that access to the score – the gradient of the log-density at different noise levels – allows for sampling from data distributions by solving a reverse-time stochastic differential equation (SDE) via forward discretization, and that popular denoisers allow for unbiased estimators of this score. In this paper, we demonstrate that an alternative, backward discretization of these SDEs, using proximal maps in place of the score, leads to theoretical and practical benefits. We leverage recent results in proximal matching to learn proximal operators of the log-density and, with them, develop Proximal Diffusion Models (ProxDM). Theoretically, we prove that \widetildeO(d/\sqrt\varepsilon) steps suffice for the resulting discretization to generate an \varepsilon -accurate distribution w.r.t. the KL divergence. Empirically, we show that two variants of ProxDM achieve significantly faster convergence within just a few sampling steps compared to conventional score-matching methods.
[LG-94] Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness
链接: https://arxiv.org/abs/2507.08913
作者: Qi He,Peiran Yu,Ziyi Chen,Heng Huang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also match the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.
[LG-95] he Engineers Dilemma: A Review of Establishing a Legal Framework for Integrating Machine Learning in Construction by Navigating Precedents and Industry Expectations
链接: https://arxiv.org/abs/2507.08908
作者: M.Z. Naser
类目: Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Despite the widespread interest in machine learning (ML), the engineering industry has not yet fully adopted ML-based methods, which has left engineers and stakeholders uncertain about the legal and regulatory frameworks that govern their decisions. This gap remains unaddressed as an engineer’s decision-making process, typically governed by professional ethics and practical guidelines, now intersects with complex algorithmic outputs. To bridge this gap, this paper explores how engineers can navigate legal principles and legislative justifications that support and/or contest the deployment of ML technologies. Drawing on recent precedents and experiences gained from other fields, this paper argues that analogical reasoning can provide a basis for embedding ML within existing engineering codes while maintaining professional accountability and meeting safety requirements. In exploring these issues, the discussion focuses on established liability doctrines, such as negligence and product liability, and highlights how courts have evaluated the use of predictive models. We further analyze how legislative bodies and standard-setting organizations can furnish explicit guidance equivalent to prior endorsements of emergent technologies. This exploration stresses the vitality of understanding the interplay between technical justifications and legal precedents for shaping an informed stance on ML’s legitimacy in engineering practice. Finally, our analysis catalyzes a legal framework for integrating ML through which stakeholders can critically assess the responsibilities, liabilities, and benefits inherent in ML-driven engineering solutions.
[LG-96] An Automated Classifier of Harmful Brain Activities for Clinical Usage Based on a Vision-Inspired Pre-trained Framework
链接: https://arxiv.org/abs/2507.08874
作者: Yulin Sun,Xiaopeng Si,Runnan He,Xiao Hu,Peter Smielewski,Wenlong Wang,Xiaoguang Tong,Wei Yue,Meijun Pang,Kuo Zhang,Xizi Song,Dong Ming,Xiuyun Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Timely identification of harmful brain activities via electroencephalography (EEG) is critical for brain disease diagnosis and treatment, which remains limited application due to inter-rater variability, resource constraints, and poor generalizability of existing artificial intelligence (AI) models. In this study, a convolutional neural network model, VIPEEGNet, was developed and validated using EEGs recorded from Massachusetts General Hospital/Harvard Medical School. The VIPEEGNet was developed and validated using two independent datasets, collected between 2006 and 2020. The development cohort included EEG recordings from 1950 patients, with 106,800 EEG segments annotated by at least one experts (ranging from 1 to 28). The online testing cohort consisted of EEG segments from a subset of an additional 1,532 patients, each annotated by at least 10 experts. For the development cohort (n=1950), the VIPEEGNet achieved high accuracy, with an AUROC for binary classification of seizure, LPD, GPD, LRDA, GRDA, and “other” categories at 0.972 (95% CI, 0.957-0.988), 0.962 (95% CI, 0.954-0.970), 0.972 (95% CI, 0.960-0.984), 0.938 (95% CI, 0.917-0.959), 0.949 (95% CI, 0.941-0.957), and 0.930 (95% CI, 0.926-0.935). For multi classification, the sensitivity of VIPEEGNET for the six categories ranges from 36.8% to 88.2% and the precision ranges from 55.6% to 80.4%, and performance similar to human experts. Notably, the external validation showed Kullback-Leibler Divergence (KLD)of 0.223 and 0.273, ranking top 2 among the existing 2,767 competing algorithms, while we only used 2.8% of the parameters of the first-ranked algorithm.
[LG-97] GUIDE: Towards Scalable Advising for Research Ideas
链接: https://arxiv.org/abs/2507.08870
作者: Yaowenqi Liu,BingXu Meng,Rui Pan,Jerry Huang,Tong Zhang
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at this https URL.
[LG-98] Underrepresentation Label Bias and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond
链接: https://arxiv.org/abs/2507.08866
作者: Marina Ceccon,Giandomenico Cornacchia,Davide Dalle Pezze,Alessandro Fabris,Gian Antonio Susto
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: Accepted in Expert Systems with Applications
Abstract:Undesirable biases encoded in the data are key drivers of algorithmic discrimination. Their importance is widely recognized in the algorithmic fairness literature, as well as legislation and standards on anti-discrimination in AI. Despite this recognition, data biases remain understudied, hindering the development of computational best practices for their detection and mitigation. In this work, we present three common data biases and study their individual and joint effect on algorithmic discrimination across a variety of datasets, models, and fairness measures. We find that underrepresentation of vulnerable populations in training sets is less conducive to discrimination than conventionally affirmed, while combinations of proxies and label bias can be far more critical. Consequently, we develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary construct we refer to as the Data Bias Profile (DBP). This initial formulation serves as a proof of concept for how different bias signals can be systematically documented. Through a case study with popular fairness datasets, we demonstrate the effectiveness of the DBP in predicting the risk of discriminatory outcomes and the utility of fairness-enhancing interventions. Overall, this article bridges algorithmic fairness research and anti-discrimination policy through a data-centric lens.
[LG-99] On the under-reaching phenomenon in message-passing neural PDE solvers: revisiting the CFL condition
链接: https://arxiv.org/abs/2507.08861
作者: Lucas Tesan,Mikel M. Iparraguirre,David Gonzalez,Pedro Martins,Elias Cueto
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This paper proposes sharp lower bounds for the number of message passing iterations required in graph neural networks (GNNs) when solving partial differential equations (PDE). This significantly reduces the need for exhaustive hyperparameter tuning. Bounds are derived for the three fundamental classes of PDEs (hyperbolic, parabolic and elliptic) by relating the physical characteristics of the problem in question to the message-passing requirement of GNNs. In particular, we investigate the relationship between the physical constants of the equations governing the problem, the spatial and temporal discretisation and the message passing mechanisms in GNNs. When the number of message passing iterations is below these proposed limits, information does not propagate efficiently through the network, resulting in poor solutions, even for deep GNN architectures. In contrast, when the suggested lower bound is satisfied, the GNN parameterisation allows the model to accurately capture the underlying phenomenology, resulting in solvers of adequate accuracy. Examples are provided for four different examples of equations that show the sharpness of the proposed lower bounds. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2507.08861 [cs.LG] (or arXiv:2507.08861v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.08861 Focus to learn more arXiv-issued DOI via DataCite
[LG-100] -Profits: A Business-Aligned Evaluation Metric for Profit-Sensitive Customer Churn Prediction
链接: https://arxiv.org/abs/2507.08860
作者: Awais Manzoor,M. Atif Qureshi,Etain Kidney,Luca Longo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Retention campaigns in customer relationship management often rely on churn prediction models evaluated using traditional metrics such as AUC and F1-score. However, these metrics fail to reflect financial outcomes and may mislead strategic decisions. We introduce e-Profits, a novel business-aligned evaluation metric that quantifies model performance based on customer-specific value, retention probability, and intervention costs. Unlike existing profit-based metrics such as Expected Maximum Profit, which assume fixed population-level parameters, e-Profits uses Kaplan-Meier survival analysis to estimate personalised retention rates and supports granular, per customer evaluation. We benchmark six classifiers across two telecom datasets (IBM Telco and Maven Telecom) and demonstrate that e-Profits reshapes model rankings compared to traditional metrics, revealing financial advantages in models previously overlooked by AUC or F1-score. The metric also enables segment-level insight into which models maximise return on investment for high-value customers. e-Profits is designed as an understandable, post hoc tool to support model evaluation in business contexts, particularly for marketing and analytics teams prioritising profit-driven decisions. All source code is available at: this https URL.
[LG-101] Counterfactual optimization for fault prevention in complex wind energy systems
链接: https://arxiv.org/abs/2507.08849
作者: Emilio Carrizosa,Martina Fischetti,Roshell Haaker,Juan Miguel Morales
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Machine Learning models are increasingly used in businesses to detect faults and anomalies in complex systems. In this work, we take this approach a step further: beyond merely detecting anomalies, we aim to identify the optimal control strategy that restores the system to a safe state with minimal disruption. We frame this challenge as a counterfactual problem: given a Machine Learning model that classifies system states as either good or anomalous, our goal is to determine the minimal adjustment to the system’s control variables (i.e., its current status) that is necessary to return it to the good state. To achieve this, we leverage a mathematical model that finds the optimal counterfactual solution while respecting system specific constraints. Notably, most counterfactual analysis in the literature focuses on individual cases where a person seeks to alter their status relative to a decision made by a classifier, such as for loan approval or medical diagnosis. Our work addresses a fundamentally different challenge: optimizing counterfactuals for a complex energy system, specifically an offshore wind turbine oil type transformer. This application not only advances counterfactual optimization in a new domain but also opens avenues for broader research in this area. Our tests on real world data provided by our industrial partner show that our methodology easily adapts to user preferences and brings savings in the order of 3 million euros per year in a typical farm.
[LG-102] Accuracy and Consumption analysis from a compressed model by CompactifAI from Multiverse Computing
链接: https://arxiv.org/abs/2507.08836
作者: Damien Fovet,Shashank Chamoli,Sarah Oury,Srishti Singhal
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:This study evaluates the performance of a compression method, called CompactifAI, developed by Multiverse Computing, applied to the large language model Llama 3.1 8B\citellama. The evaluation focused on model efficiency (in terms of energy consumption) and accuracy using respectively the frameworks Codecarbon\citecodecarbon and Ragas\citeragas. A comparison was performed between the model compressed with CompactifAI\citecompactifai\citecompactifai2 and its full-size version. Our findings reveal that the compressed model using CompactifAI not only significantly reduced the computational resources but also maintained the model accuracy, making the model more efficient, scalable and cost-effective.
[LG-103] Physical Informed Neural Networks for modeling ocean pollutant
链接: https://arxiv.org/abs/2507.08834
作者: Karishma Battina,Prathamesh Dinesh Joshi,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
类目: Machine Learning (cs.LG)
*备注: 13 pages, 9 figures, 3 tables
Abstract:Traditional numerical methods often struggle with the complexity and scale of modeling pollutant transport across vast and dynamic oceanic domains. This paper introduces a Physics-Informed Neural Network (PINN) framework to simulate the dispersion of pollutants governed by the 2D advection-diffusion equation. The model achieves physically consistent predictions by embedding physical laws and fitting to noisy synthetic data, generated via a finite difference method (FDM), directly into the neural network training process. This approach addresses challenges such as non-linear dynamics and the enforcement of boundary and initial conditions. Synthetic data sets, augmented with varying noise levels, are used to capture real-world variability. The training incorporates a hybrid loss function including PDE residuals, boundary/initial condition conformity, and a weighted data fit term. The approach takes advantage of the Julia language scientific computing ecosystem for high-performance simulations, offering a scalable and flexible alternative to traditional solvers
[LG-104] A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting
链接: https://arxiv.org/abs/2507.08832
作者: Niranjan Mallikarjun Sindhur,Pavithra C,Nivya Muchikel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Farmers in developing regions like Karnataka, India, face a dual challenge: navigating extreme market and climate volatility while being excluded from the digital revolution due to literacy barriers. This paper presents a novel decision support system that addresses both challenges through a unique synthesis of machine learning and human-computer interaction. We propose a hybrid recommendation engine that integrates two predictive models: a Random Forest classifier to assess agronomic suitability based on soil, climate, and real-time weather data, and a Long Short-Term Memory (LSTM) network to forecast market prices for agronomically viable crops. This integrated approach shifts the paradigm from “what can grow?” to “what is most profitable to grow?”, providing a significant advantage in mitigating economic risk. The system is delivered through an end-to-end, voice-based interface in the local Kannada language, leveraging fine-tuned speech recognition and high-fidelity speech synthesis models to ensure accessibility for low-literacy users. Our results show that the Random Forest model achieves 98.5% accuracy in suitability prediction, while the LSTM model forecasts harvest-time prices with a low margin of error. By providing data-driven, economically optimized recommendations through an inclusive interface, this work offers a scalable and impactful solution to enhance the financial resilience of marginalized farming communities.
[LG-105] Recurrent Expansion: A Pathway Toward the Next Generation of Deep Learning
链接: https://arxiv.org/abs/2507.08828
作者: Tarek Berghout
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This paper introduces Recurrent Expansion (RE) as a new learning paradigm that advances beyond conventional Machine Learning (ML) and Deep Learning (DL). While DL focuses on learning from static data representations, RE proposes an additional dimension: learning from the evolving behavior of models themselves. RE emphasizes multiple mappings of data through identical deep architectures and analyzes their internal representations (i.e., feature maps) in conjunction with observed performance signals such as loss. By incorporating these behavioral traces, RE enables iterative self-improvement, allowing each model version to gain insight from its predecessors. The framework is extended through Multiverse RE (MVRE), which aggregates signals from parallel model instances, and further through Heterogeneous MVRE (HMVRE), where models of varying architectures contribute diverse perspectives. A scalable and adaptive variant, Sc-HMVRE, introduces selective mechanisms and scale diversity for real-world deployment. Altogether, RE presents a shift in DL: from purely representational learning to behavior-aware, self-evolving systems. It lays the groundwork for a new class of intelligent models capable of reasoning over their own learning dynamics, offering a path toward scalable, introspective, and adaptive artificial intelligence. A simple code example to support beginners in running their own experiments is provided in Code Availability Section of this paper.
[LG-106] Information Must Flow: Recursive Bootstrapping for Information Bottleneck in Optimal Transport
链接: https://arxiv.org/abs/2507.10443
作者: Xin Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We present the Context-Content Uncertainty Principle (CCUP), a unified framework that models cognition as the directed flow of information between high-entropy context and low-entropy content. Inference emerges as a cycle of bidirectional interactions, bottom-up contextual disambiguation paired with top-down content reconstruction, which resolves the Information Bottleneck in Optimal Transport (iBOT). Implemented via Rao-Blackwellized variational entropy minimization, CCUP steers representations toward minimal joint uncertainty while preserving inferential directionality. Local cycle completion underpins temporal bootstrapping, chaining simulations to refine memory, and spatial bootstrapping, enabling compositional hierarchical inference. We prove a Delta Convergence Theorem showing that recursive entropy minimization yields delta-like attractors in latent space, stabilizing perceptual schemas and motor plans. Temporal bootstrapping through perception-action loops and sleep-wake consolidation further transforms episodic traces into semantic knowledge. Extending CCUP, each hierarchical level performs delta-seeded inference: low-entropy content seeds diffuse outward along goal-constrained paths shaped by top-down priors and external context, confining inference to task-relevant manifolds and circumventing the curse of dimensionality. Building on this, we propose that language emerges as a symbolic transport system, externalizing latent content to synchronize inference cycles across individuals. Together, these results establish iBOT as a foundational principle of information flow in both individual cognition and collective intelligence, positioning recursive inference as the structured conduit through which minds adapt, align, and extend.
[LG-107] Dynamical stability for dense patterns in discrete attractor neural networks
链接: https://arxiv.org/abs/2507.10383
作者: Uri Cohen,Máté Lengyel
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Neural networks storing multiple discrete attractors are canonical models of biological memory. Previously, the dynamical stability of such networks could only be guaranteed under highly restrictive conditions. Here, we derive a theory of the local stability of discrete fixed points in a broad class of networks with graded neural activities and in the presence of noise. By directly analyzing the bulk and outliers of the Jacobian spectrum, we show that all fixed points are stable below a critical load that is distinct from the classical \textitcritical capacity and depends on the statistics of neural activities in the fixed points as well as the single-neuron activation function. Our analysis highlights the computational benefits of threshold-linear activation and sparse-like patterns.
[LG-108] MF-GLaM: A multifidelity stochastic emulator using generalized lambda models
链接: https://arxiv.org/abs/2507.10303
作者: K. Giannoukou,X. Zhu,S. Marelli,B. Sudret
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:
Abstract:Stochastic simulators exhibit intrinsic stochasticity due to unobservable, uncontrollable, or unmodeled input variables, resulting in random outputs even at fixed input conditions. Such simulators are common across various scientific disciplines; however, emulating their entire conditional probability distribution is challenging, as it is a task traditional deterministic surrogate modeling techniques are not designed for. Additionally, accurately characterizing the response distribution can require prohibitively large datasets, especially for computationally expensive high-fidelity (HF) simulators. When lower-fidelity (LF) stochastic simulators are available, they can enhance limited HF information within a multifidelity surrogate modeling (MFSM) framework. While MFSM techniques are well-established for deterministic settings, constructing multifidelity emulators to predict the full conditional response distribution of stochastic simulators remains a challenge. In this paper, we propose multifidelity generalized lambda models (MF-GLaMs) to efficiently emulate the conditional response distribution of HF stochastic simulators by exploiting data from LF stochastic simulators. Our approach builds upon the generalized lambda model (GLaM), which represents the conditional distribution at each input by a flexible, four-parameter generalized lambda distribution. MF-GLaMs are non-intrusive, requiring no access to the internal stochasticity of the simulators nor multiple replications of the same input values. We demonstrate the efficacy of MF-GLaM through synthetic examples of increasing complexity and a realistic earthquake application. Results show that MF-GLaMs can achieve improved accuracy at the same cost as single-fidelity GLaMs, or comparable performance at significantly reduced cost.
[LG-109] History Matching under Uncertainty of Geological Scenarios with Implicit Geological Realism Control with Generative Deep Learning and Graph Convolutions
链接: https://arxiv.org/abs/2507.10201
作者: Gleb Shishaev,Vasily Demyanov,Daniel Arnold
类目: Applications (stat.AP); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Part of the completed PhD thesis this https URL
Abstract:The graph-based variational autoencoder represents an architecture that can handle the uncertainty of different geological scenarios, such as depositional or structural, through the concept of a lowerdimensional latent space. The main difference from recent studies is utilisation of a graph-based approach in reservoir modelling instead of the more traditional lattice-based deep learning methods. We provide a solution to implicitly control the geological realism through the latent variables of a generative model and Geodesic metrics. Our experiments of AHM with synthetic dataset that consists of 3D realisations of channelised geological representations with two distinct scenarios with one and two channels shows the viability of the approach. We offer in-depth analysis of the latent space using tools such as PCA, t-SNE, and TDA to illustrate its structure.
[LG-110] Simulating Biases for Interpretable Fairness in Offline and Online Classifiers ECML KDD2025
链接: https://arxiv.org/abs/2507.10154
作者: Ricardo Inácio,Zafeiris Kokkinogenis,Vitor Cerqueira,Carlos Soares
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, 1 equation, 3 tables: 1 in main body and 2 in the appendix. Submitted to the SynDAiTE: Synthetic Data for AI Trustworthiness and Evolution workshop from ECMLPKDD 2025, anonymized
Abstract:Predictive models often reinforce biases which were originally embedded in their training data, through skewed decisions. In such cases, mitigation methods are critical to ensure that, regardless of the prevailing disparities, model outcomes are adjusted to be fair. To assess this, datasets could be systematically generated with specific biases, to train machine learning classifiers. Then, predictive outcomes could aid in the understanding of this bias embedding process. Hence, an agent-based model (ABM), depicting a loan application process that represents various systemic biases across two demographic groups, was developed to produce synthetic datasets. Then, by applying classifiers trained on them to predict loan outcomes, we can assess how biased data leads to unfairness. This highlights a main contribution of this work: a framework for synthetic dataset generation with controllable bias injection. We also contribute with a novel explainability technique, which shows how mitigations affect the way classifiers leverage data features, via second-order Shapley values. In experiments, both offline and online learning approaches are employed. Mitigations are applied at different stages of the modelling pipeline, such as during pre-processing and in-processing.
[LG-111] Regret Analysis of Posterior Sampling-Based Expected Improvement for Bayesian Optimization
链接: https://arxiv.org/abs/2507.09828
作者: Shion Takeno,Yu Inatsu,Masayuki Karasuyama,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35pages, 5 figures
Abstract:Bayesian optimization is a powerful tool for optimizing an expensive-to-evaluate black-box function. In particular, the effectiveness of expected improvement (EI) has been demonstrated in a wide range of applications. However, theoretical analyses of EI are limited compared with other theoretically established algorithms. This paper analyzes a randomized variant of EI, which evaluates the EI from the maximum of the posterior sample path. We show that this posterior sampling-based random EI achieves the sublinear Bayesian cumulative regret bounds under the assumption that the black-box function follows a Gaussian process. Finally, we demonstrate the effectiveness of the proposed method through numerical experiments.
[LG-112] Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization
链接: https://arxiv.org/abs/2507.09823
作者: Ekaterina Borodich,Dmitry Kovalev
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function \min_x f(x) . Recently, several adaptive gradient methods, including GRAAL (Malitsky, 2020), have been developed. These methods estimate the local curvature of the objective function to compute stepsizes, attain the standard convergence rate \mathcalO(1/k) of fixed-stepsize gradient descent for Lipschitz-smooth functions, and do not require any line search procedures or hyperparameter tuning. However, a natural question arises: is it possible to accelerate the convergence of these algorithms to match the optimal rate \mathcalO(1/k^2) of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made (Li and Lan, 2023), the capabilities of the existing accelerated algorithms to adapt to the curvature of the objective function are highly limited. Consequently, we provide a positive answer to this question and develop GRAAL with Nesterov acceleration. We prove that our algorithm achieves the desired optimal convergence rate for Lipschitz smooth functions. Moreover, in contrast to existing methods, it does so with an arbitrary, even excessively small, initial stepsize at the cost of a logarithmic additive term in the iteration complexity.
[LG-113] Discovering Governing Equations in the Presence of Uncertainty
链接: https://arxiv.org/abs/2507.09740
作者: Ridwan Olabiyi,Han Hu,Ashif Iquebal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 24 pages, 5 figures
Abstract:In the study of complex dynamical systems, understanding and accurately modeling the underlying physical processes is crucial for predicting system behavior and designing effective interventions. Yet real-world systems exhibit pronounced input (or system) variability and are observed through noisy, limited data conditions that confound traditional discovery methods that assume fixed-coefficient deterministic models. In this work, we theorize that accounting for system variability together with measurement noise is the key to consistently discover the governing equations underlying dynamical systems. As such, we introduce a stochastic inverse physics-discovery (SIP) framework that treats the unknown coefficients as random variables and infers their posterior distribution by minimizing the Kullback-Leibler divergence between the push-forward of the posterior samples and the empirical data distribution. Benchmarks on four canonical problems – the Lotka-Volterra predator-prey system (multi- and single-trajectory), the historical Hudson Bay lynx-hare data, the chaotic Lorenz attractor, and fluid infiltration in porous media using low- and high-viscosity liquids – show that SIP consistently identifies the correct equations and lowers coefficient root-mean-square error by an average of 82% relative to the Sparse Identification of Nonlinear Dynamics (SINDy) approach and its Bayesian variant. The resulting posterior distributions yield 95% credible intervals that closely track the observed trajectories, providing interpretable models with quantified uncertainty. SIP thus provides a robust, data-efficient approach for consistent physics discovery in noisy, variable, and data-limited settings.
[LG-114] Signed Graph Learning: Algorithms and Theory
链接: https://arxiv.org/abs/2507.09717
作者: Abdullah Karaaslanli,Bisakh Banerjee,Tapabrata Maiti,Selin Aviyente
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Real-world data is often represented through the relationships between data samples, forming a graph structure. In many applications, it is necessary to learn this graph structure from the observed data. Current graph learning research has primarily focused on unsigned graphs, which consist only of positive edges. However, many biological and social systems are better described by signed graphs that account for both positive and negative interactions, capturing similarity and dissimilarity between samples. In this paper, we develop a method for learning signed graphs from a set of smooth signed graph signals. Specifically, we employ the net Laplacian as a graph shift operator (GSO) to define smooth signed graph signals as the outputs of a low-pass signed graph filter defined by the net Laplacian. The signed graph is then learned by formulating a non-convex optimization problem where the total variation of the observed signals is minimized with respect to the net Laplacian. The proposed problem is solved using alternating direction method of multipliers (ADMM) and a fast algorithm reducing the per-ADMM iteration complexity from quadratic to linear in the number of nodes is introduced. Furthermore, theoretical proofs of convergence for the algorithm and a bound on the estimation error of the learned net Laplacian as a function of sample size, number of nodes, and graph topology are provided. Finally, the proposed method is evaluated on simulated data and gene regulatory network inference problem and compared to existing signed graph learning methods.
[LG-115] Machine-Precision Prediction of Low-Dimensional Chaotic Systems
链接: https://arxiv.org/abs/2507.09652
作者: Christof Schötz,Niklas Boers
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:Low-dimensional chaotic systems such as the Lorenz-63 model are commonly used to benchmark system-agnostic methods for learning dynamics from data. Here we show that learning from noise-free observations in such systems can be achieved up to machine precision: using ordinary least squares regression on high-degree polynomial features with 512-bit arithmetic, our method exceeds the accuracy of standard 64-bit numerical ODE solvers of the true underlying dynamical systems. Depending on the configuration, we obtain valid prediction times of 32 to 105 Lyapunov times for the Lorenz-63 system, dramatically outperforming prior work that reaches 13 Lyapunov times at most. We further validate our results on Thomas’ Cyclically Symmetric Attractor, a non-polynomial chaotic system that is considerably more complex than the Lorenz-63 model, and show that similar results extend also to higher dimensions using the spatiotemporally chaotic Lorenz-96 model. Our findings suggest that learning low-dimensional chaotic systems from noise-free data is a solved problem.
[LG-116] An Algorithm for Identifying Interpretable Subgroups With Elevated Treatment Effects
链接: https://arxiv.org/abs/2507.09494
作者: Albert Chiu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
*备注:
Abstract:We introduce an algorithm for identifying interpretable subgroups with elevated treatment effects, given an estimate of individual or conditional average treatment effects (CATE). Subgroups are characterized by rule sets'' -- easy-to-understand statements of the form (Condition A AND Condition B) OR (Condition C) -- which can capture high-order interactions while retaining interpretability. Our method complements existing approaches for estimating the CATE, which often produce high dimensional and uninterpretable results, by summarizing and extracting critical information from fitted models to aid decision making, policy implementation, and scientific understanding. We propose an objective function that trades-off subgroup size and effect size, and varying the hyperparameter that controls this trade-off results in a
frontier’’ of Pareto optimal rule sets, none of which dominates the others across all criteria. Valid inference is achievable through sample splitting. We demonstrate the utility and limitations of our method using simulated and empirical examples.
[LG-117] Sensitivity Analysis of Transport and Radiation in NeuralPlasmaODE for ITER Burning Plasmas
链接: https://arxiv.org/abs/2507.09432
作者: Zefang Liu,Weston M. Stacey
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
*备注:
Abstract:Understanding how key physical parameters influence burning plasma behavior is critical for the reliable operation of ITER. In this work, we extend NeuralPlasmaODE, a multi-region, multi-timescale model based on neural ordinary differential equations, to perform a sensitivity analysis of transport and radiation mechanisms in ITER plasmas. Normalized sensitivities of core and edge temperatures and densities are computed with respect to transport diffusivities, electron cyclotron radiation (ECR) parameters, impurity fractions, and ion orbit loss (IOL) timescales. The analysis focuses on perturbations around a trained nominal model for the ITER inductive scenario. Results highlight the dominant influence of magnetic field strength, safety factor, and impurity content on energy confinement, while also revealing how temperature-dependent transport contributes to self-regulating behavior. These findings demonstrate the utility of NeuralPlasmaODE for predictive modeling and scenario optimization in burning plasma environments.
[LG-118] Optimizing External Sources for Controlled Burning Plasma in Tokamaks with Neural Ordinary Differential Equations
链接: https://arxiv.org/abs/2507.09431
作者: Zefang Liu,Weston M. Stacey
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
*备注:
Abstract:Achieving controlled burning plasma in tokamaks requires precise regulation of external particle and energy sources to reach and maintain target core densities and temperatures. This work presents an inverse modeling approach using a multinodal plasma dynamics model based on neural ordinary differential equations (Neural ODEs). Given a desired time evolution of nodal quantities such as deuteron density or electron temperature, we compute the external source profiles, such as neutral beam injection (NBI) power, that drive the plasma toward the specified behavior. The approach is implemented within the NeuralPlasmaODE framework, which models multi-region, multi-timescale transport and incorporates physical mechanisms including radiation, auxiliary heating, and internodal energy exchange. By formulating the control task as an optimization problem, we use automatic differentiation through the Neural ODE solver to minimize the discrepancy between simulated and target trajectories. This framework transforms the forward simulation tool into a control-oriented model and provides a practical method for computing external source profiles in both current and future fusion devices.
[LG-119] WellPINN: Accurate Well Representation for Transient Fluid Pressure Diffusion in Subsurface Reservoirs with Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2507.09330
作者: Linus Walter(1 and 2),Qingkai Kong(3),Sara Hanson-Hedgecock(1),Víctor Vilarrasa(1) ((1) Global Change Research Group (GCRG), IMEDEA, CSIC-UIB, Spain, (2) Department of Civil and Environmental Engineering (DECA), Universitat Politècnica de Catalunya - BarcelonaTech (UPC), Barcelona, Spain, (3) Lawrence Livermore National Laboratory, Livermore, USA)
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Accurate representation of wells is essential for reliable reservoir characterization and simulation of operational scenarios in subsurface flow models. Physics-informed neural networks (PINNs) have recently emerged as a promising method for reservoir modeling, offering seamless integration of monitoring data and governing physical equations. However, existing PINN-based studies face major challenges in capturing fluid pressure near wells, particularly during the early stage after injection begins. To address this, we propose WellPINN, a modeling workflow that combines the outputs of multiple sequentially trained PINN models to accurately represent wells. This workflow iteratively approximates the radius of the equivalent well to match the actual well dimensions by decomposing the domain into stepwise shrinking subdomains with a simultaneously reducing equivalent well radius. Our results demonstrate that sequential training of superimposing networks around the pumping well is the first workflow that focuses on accurate inference of fluid pressure from pumping rates throughout the entire injection period, significantly advancing the potential of PINNs for inverse modeling and operational scenario simulations. All data and code for this paper will be made openly available at this https URL.
[LG-120] Uncovering symmetric and asymmetric species associations from community and environmental data
链接: https://arxiv.org/abs/2507.09317
作者: Sara Si-Moussi,Esther Galbrun,Mickael Hedde,Giovanni Poggiato,Matthias Rohr,Wilfried Thuiller
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注:
Abstract:There is no much doubt that biotic interactions shape community assembly and ultimately the spatial co-variations between species. There is a hope that the signal of these biotic interactions can be observed and retrieved by investigating the spatial associations between species while accounting for the direct effects of the environment. By definition, biotic interactions can be both symmetric and asymmetric. Yet, most models that attempt to retrieve species associations from co-occurrence or co-abundance data internally assume symmetric relationships between species. Here, we propose and validate a machine-learning framework able to retrieve bidirectional associations by analyzing species community and environmental data. Our framework (1) models pairwise species associations as directed influences from a source to a target species, parameterized with two species-specific latent embeddings: the effect of the source species on the community, and the response of the target species to the community; and (2) jointly fits these associations within a multi-species conditional generative model with different modes of interactions between environmental drivers and biotic associations. Using both simulated and empirical data, we demonstrate the ability of our framework to recover known asymmetric and symmetric associations and highlight the properties of the learned association networks. By comparing our approach to other existing models such as joint species distribution models and probabilistic graphical models, we show its superior capacity at retrieving symmetric and asymmetric interactions. The framework is intuitive, modular and broadly applicable across various taxonomic groups. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE) MSC classes: 68T07, 62H22, 92D40 ACMclasses: I.2.3; I.2.6; I.5.1 Cite as: arXiv:2507.09317 [stat.ML] (or arXiv:2507.09317v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2507.09317 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-121] Investigating the Robustness of Extreme Precipitation Super-Resolution Across Climates
链接: https://arxiv.org/abs/2507.09166
作者: Louise Largeau,Erwan Koch,David Leutwyler,Gregoire Mariethoz,Valerie Chavez-Demoulin,Tom Beucler
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 31 pages, 9 figures, 1 table, submitted to AGU JAMES
Abstract:The coarse spatial resolution of gridded climate models, such as general circulation models, limits their direct use in projecting socially relevant variables like extreme precipitation. Most downscaling methods estimate the conditional distributions of extremes by generating large ensembles, complicating the assessment of robustness under distributional shifts, such as those induced by climate change. To better understand and potentially improve robustness, we propose super-resolving the parameters of the target variable’s probability distribution directly using analytically tractable mappings. Within a perfect-model framework over Switzerland, we demonstrate that vector generalized linear and additive models can super-resolve the generalized extreme value distribution of summer hourly precipitation extremes from coarse precipitation fields and topography. We introduce the notion of a “robustness gap”, defined as the difference in predictive error between present-trained and future-trained models, and use it to diagnose how model structure affects the generalization of each quantile to a pseudo-global warming scenario. By evaluating multiple model configurations, we also identify an upper limit on the super-resolution factor based on the spatial auto- and cross-correlation of precipitation and elevation, beyond which coarse precipitation loses predictive value. Our framework is broadly applicable to variables governed by parametric distributions and offers a model-agnostic diagnostic for understanding when and why empirical downscaling generalizes to climate change and extremes.
[LG-122] A Randomized Algorithm for Sparse PCA based on the Basic SDP Relaxation
链接: https://arxiv.org/abs/2507.09148
作者: Alberto Del Pia,Dekun Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 29 pages, 2 figures
Abstract:Sparse Principal Component Analysis (SPCA) is a fundamental technique for dimensionality reduction, and is NP-hard. In this paper, we introduce a randomized approximation algorithm for SPCA, which is based on the basic SDP relaxation. Our algorithm has an approximation ratio of at most the sparsity constant with high probability, if called enough times. Under a technical assumption, which is consistently satisfied in our numerical tests, the average approximation ratio is also bounded by \mathcalO(\logd) , where d is the number of features. We show that this technical assumption is satisfied if the SDP solution is low-rank, or has exponentially decaying eigenvalues. We then present a broad class of instances for which this technical assumption holds. We also demonstrate that in a covariance model, which generalizes the spiked Wishart model, our proposed algorithm achieves a near-optimal approximation ratio. We demonstrate the efficacy of our algorithm through numerical results on real-world datasets.
[LG-123] A Generalization Theory for Zero-Shot Prediction ICML’25
链接: https://arxiv.org/abs/2507.09128
作者: Ronak Mehta,Zaid Harchaoui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at ICML '25 (Oral)
Abstract:A modern paradigm for generalization in machine learning and AI consists of pre-training a task-agnostic foundation model, generally obtained using self-supervised and multimodal contrastive learning. The resulting representations can be used for prediction on a downstream task for which no labeled data is available. We present a theoretical framework to better understand this approach, called zero-shot prediction. We identify the target quantities that zero-shot prediction aims to learn, or learns in passing, and the key conditional independence relationships that enable its generalization ability.
[LG-124] CoVAE: Consistency Training of Variational Autoencoders
链接: https://arxiv.org/abs/2507.09103
作者: Gianluigi Silvestri,Luca Ambrogioni
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Current state-of-the-art generative approaches frequently rely on a two-stage training procedure, where an autoencoder (often a VAE) first performs dimensionality reduction, followed by training a generative model on the learned latent space. While effective, this introduces computational overhead and increased sampling times. We challenge this paradigm by proposing Consistency Training of Variational AutoEncoders (CoVAE), a novel single-stage generative autoencoding framework that adopts techniques from consistency models to train a VAE architecture. The CoVAE encoder learns a progressive series of latent representations with increasing encoding noise levels, mirroring the forward processes of diffusion and flow matching models. This sequence of representations is regulated by a time dependent \beta parameter that scales the KL loss. The decoder is trained using a consistency loss with variational regularization, which reduces to a conventional VAE loss at the earliest latent time. We show that CoVAE can generate high-quality samples in one or few steps without the use of a learned prior, significantly outperforming equivalent VAEs and other single-stage VAEs methods. Our approach provides a unified framework for autoencoding and diffusion-style generative modeling and provides a viable route for one-step generative high-performance autoencoding. Our code is publicly available at this https URL.
[LG-125] Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization
链接: https://arxiv.org/abs/2507.09093
作者: Aleksandar Armacki,Dragana Bajovic,Dusan Jakovetic,Soummya Kar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 38 pages, 1 figure
Abstract:We study convergence in high-probability of SGD-type methods in non-convex optimization and the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate \widetilde\mathcalO(t^-1/2) , for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators, based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we get N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and PDF satisfying a mild technical condition, with N-MSGE additionally requiring bounded noise moment of order p \in (1,2] . Compared to works assuming noise with bounded p -th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when p 2 , while the complexity of N-MSGE is close to existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
[LG-126] Conformation-Aware Structure Prediction of Antigen-Recognizing Immune Proteins
链接: https://arxiv.org/abs/2507.09054
作者: Frédéric A. Dreyer,Jan Ludwiczak,Karolis Martinkus,Brennan Abanades,Robert G. Alberstein,Pan Kessel,Pranav Rao,Jae Hyeon Lee,Richard Bonneau,Andrew M. Watkins,Franziska Seeger
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 17 pages, 12 figures, 2 tables, code at this https URL , model weights at this https URL
Abstract:We introduce Ibex, a pan-immunoglobulin structure prediction model that achieves state-of-the-art accuracy in modeling the variable domains of antibodies, nanobodies, and T-cell receptors. Unlike previous approaches, Ibex explicitly distinguishes between bound and unbound protein conformations by training on labeled apo and holo structural pairs, enabling accurate prediction of both states at inference time. Using a comprehensive private dataset of high-resolution antibody structures, we demonstrate superior out-of-distribution performance compared to existing specialized and general protein structure prediction tools. Ibex combines the accuracy of cutting-edge models with significantly reduced computational requirements, providing a robust foundation for accelerating large molecule design and therapeutic development.
[LG-127] A Method for Learning to Solve Parametric Bilevel Optimization with Coupling Constraints
链接: https://arxiv.org/abs/2507.09050
作者: James Kotary,Himanshu Sharma,Ethan King,Draguna Vrabie,Ferdinando Fioretto,Jan Drgona
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Learning to Optimize (L2O) is a subfield of machine learning (ML) in which ML models are trained to solve parametric optimization problems. The general goal is to learn a fast approximator of solutions to constrained optimization problems, as a function of their defining parameters. Prior L2O methods focus almost entirely on single-level programs, in contrast to the bilevel programs, whose constraints are themselves expressed in terms of optimization subproblems. Bilevel programs have numerous important use cases but are notoriously difficult to solve, particularly under stringent time demands. This paper proposes a framework for learning to solve a broad class of challenging bilevel optimization problems, by leveraging modern techniques for differentiation through optimization problems. The framework is illustrated on an array of synthetic bilevel programs, as well as challenging control system co-design problems, showing how neural networks can be trained as efficient approximators of parametric bilevel optimization.
[LG-128] On the Gradient Domination of the LQG Problem
链接: https://arxiv.org/abs/2507.09026
作者: Kasra Fallah,Leonardo F. Toso,James Anderson
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:We consider solutions to the linear quadratic Gaussian (LQG) regulator problem via policy gradient (PG) methods. Although PG methods have demonstrated strong theoretical guarantees in solving the linear quadratic regulator (LQR) problem, despite its nonconvex landscape, their theoretical understanding in the LQG setting remains limited. Notably, the LQG problem lacks gradient dominance in the classical parameterization, i.e., with a dynamic controller, which hinders global convergence guarantees. In this work, we study PG for the LQG problem by adopting an alternative parameterization of the set of stabilizing controllers and employing a lifting argument. We refer to this parameterization as a history representation of the control input as it is parameterized by past input and output data from the previous p time-steps. This representation enables us to establish gradient dominance and approximate smoothness for the LQG cost. We prove global convergence and per-iteration stability guarantees for policy gradient LQG in model-based and model-free settings. Numerical experiments on an open-loop unstable system are provided to support the global convergence guarantees and to illustrate convergence under different history lengths of the history representation.
[LG-129] Surprisingly High Redundancy in Electronic Structure Data
链接: https://arxiv.org/abs/2507.09001
作者: Sazzad Hossain,Ponkrshnan Thiagarajan,Shashank Pathrudkar,Stephanie Taylor,Abhijeet S. Gangan,Amartya S. Banerjee,Susanta Ghosh
类目: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注:
Abstract:Machine Learning (ML) models for electronic structure rely on large datasets generated through expensive Kohn-Sham Density Functional Theory simulations. This study reveals a surprisingly high level of redundancy in such datasets across various material systems, including molecules, simple metals, and complex alloys. Our findings challenge the prevailing assumption that large, exhaustive datasets are necessary for accurate ML predictions of electronic structure. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy, while a state-of-the-art coverage-based pruning strategy retains chemical accuracy and model generalizability using up to 100-fold less data and reducing training time by threefold or more. By contrast, widely used importance-based pruning methods, which eliminate seemingly redundant data, can catastrophically fail at higher pruning factors, possibly due to the significant reduction in data coverage. This heretofore unexplored high degree of redundancy in electronic structure data holds the potential to identify a minimal, essential dataset representative of each material class.
[LG-130] Fixed-Confidence Multiple Change Point Identification under Bandit Feedback ICML2025
链接: https://arxiv.org/abs/2507.08994
作者: Joseph Lazzaro,Ciara Pike-Burke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:Piecewise constant functions describe a variety of real-world phenomena in domains ranging from chemistry to manufacturing. In practice, it is often required to confidently identify the locations of the abrupt changes in these functions as quickly as possible. For this, we introduce a fixed-confidence piecewise constant bandit problem. Here, we sequentially query points in the domain and receive noisy evaluations of the function under bandit feedback. We provide instance-dependent lower bounds for the complexity of change point identification in this problem. These lower bounds illustrate that an optimal method should focus its sampling efforts adjacent to each of the change points, and the number of samples around each change point should be inversely proportional to the magnitude of the change. Building on this, we devise a simple and computationally efficient variant of Track-and-Stop and prove that it is asymptotically optimal in many regimes. We support our theoretical findings with experimental results in synthetic environments demonstrating the efficiency of our method.
[LG-131] Physics-Based Machine Learning Closures and Wall Models for Hypersonic Transition-Continuum Boundary Layer Predictions
链接: https://arxiv.org/abs/2507.08986
作者: Ashish S. Nair,Narendra Singh,Marco Panesi,Justin Sirignano,Jonathan F. MacArt
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Modeling rarefied hypersonic flows remains a fundamental challenge due to the breakdown of classical continuum assumptions in the transition-continuum regime, where the Knudsen number ranges from approximately 0.1 to 10. Conventional Navier-Stokes-Fourier (NSF) models with empirical slip-wall boundary conditions fail to accurately predict nonequilibrium effects such as velocity slip, temperature jump, and shock structure deviations. We develop a physics-constrained machine learning framework that augments transport models and boundary conditions to extend the applicability of continuum solvers in nonequilibrium hypersonic regimes. We employ deep learning PDE models (DPMs) for the viscous stress and heat flux embedded in the governing PDEs and trained via adjoint-based optimization. We evaluate these for two-dimensional supersonic flat-plate flows across a range of Mach and Knudsen numbers. Additionally, we introduce a wall model based on a mixture of skewed Gaussian approximations of the particle velocity distribution function. This wall model replaces empirical slip conditions with physically informed, data-driven boundary conditions for the streamwise velocity and wall temperature. Our results show that a trace-free anisotropic viscosity model, paired with the skewed-Gaussian distribution function wall model, achieves significantly improved accuracy, particularly at high-Mach and high-Knudsen number regimes. Strategies such as parallel training across multiple Knudsen numbers and inclusion of high-Mach data during training are shown to enhance model generalization. Increasing model complexity yields diminishing returns for out-of-sample cases, underscoring the need to balance degrees of freedom and overfitting. This work establishes data-driven, physics-consistent strategies for improving hypersonic flow modeling for regimes in which conventional continuum approaches are invalid.
[LG-132] Stochastic Approximation with Block Coordinate Optimal Stepsizes
链接: https://arxiv.org/abs/2507.08963
作者: Tao Jiang,Lin Xiao
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We consider stochastic approximation with block-coordinate stepsizes and propose adaptive stepsize rules that aim to minimize the expected distance from the next iterate to an optimal point. These stepsize rules employ online estimates of the second moment of the search direction along each block coordinate. The popular Adam algorithm can be interpreted as a particular heuristic for such estimation. By leveraging a simple conditional estimator, we derive a new method that obtains comparable performance as Adam but requires less memory and fewer hyper-parameters. We prove that this family of methods converges almost surely to a small neighborhood of the optimal point, and the radius of the neighborhood depends on the bias and variance of the second-moment estimator. Our analysis relies on a simple aiming condition that assumes neither convexity nor smoothness, thus has broad applicability.
[LG-133] he Bayesian Approach to Continual Learning: An Overview
链接: https://arxiv.org/abs/2507.08922
作者: Tameem Adel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Continual learning is an online paradigm where a learner continually accumulates knowledge from different tasks encountered over sequential time steps. Importantly, the learner is required to extend and update its knowledge without forgetting about the learning experience acquired from the past, and while avoiding the need to retrain from scratch. Given its sequential nature and its resemblance to the way humans think, continual learning offers an opportunity to address several challenges which currently stand in the way of widening the range of applicability of deep models to further real-world problems. The continual need to update the learner with data arriving sequentially strikes inherent congruence between continual learning and Bayesian inference which provides a principal platform to keep updating the prior beliefs of a model given new data, without completely forgetting the knowledge acquired from the old data. This survey inspects different settings of Bayesian continual learning, namely task-incremental learning and class-incremental learning. We begin by discussing definitions of continual learning along with its Bayesian setting, as well as the links with related fields, such as domain adaptation, transfer learning and meta-learning. Afterwards, we introduce a taxonomy offering a comprehensive categorization of algorithms belonging to the Bayesian continual learning paradigm. Meanwhile, we analyze the state-of-the-art while zooming in on some of the most prominent Bayesian continual learning algorithms to date. Furthermore, we shed some light on links between continual learning and developmental psychology, and correspondingly introduce analogies between both fields. We follow that with a discussion of current challenges, and finally conclude with potential areas for future research on Bayesian continual learning.
[LG-134] Physics-informed machine learning: A mathematical framework with applications to time series forecasting
链接: https://arxiv.org/abs/2507.08906
作者: Nathan Doumèche
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: Doctoral thesis, Sorbonne University. 286 pages
Abstract:Physics-informed machine learning (PIML) is an emerging framework that integrates physical knowledge into machine learning models. This physical prior often takes the form of a partial differential equation (PDE) system that the regression function must satisfy. In the first part of this dissertation, we analyze the statistical properties of PIML methods. In particular, we study the properties of physics-informed neural networks (PINNs) in terms of approximation, consistency, overfitting, and convergence. We then show how PIML problems can be framed as kernel methods, making it possible to apply the tools of kernel ridge regression to better understand their behavior. In addition, we use this kernel formulation to develop novel physics-informed algorithms and implement them efficiently on GPUs. The second part explores industrial applications in forecasting energy signals during atypical periods. We present results from the Smarter Mobility challenge on electric vehicle charging occupancy and examine the impact of mobility on electricity demand. Finally, we introduce a physics-constrained framework for designing and enforcing constraints in time series, applying it to load forecasting and tourism forecasting in various countries.
[LG-135] Predictive Causal Inference via Spatio-Temporal Modeling and Penalized Empirical Likelihood
链接: https://arxiv.org/abs/2507.08896
作者: Byunghee Lee,Hye Yeon Sin,Joonsung Kang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This study introduces an integrated framework for predictive causal inference designed to overcome limitations inherent in conventional single model approaches. Specifically, we combine a Hidden Markov Model (HMM) for spatial health state estimation with a Multi Task and Multi Graph Convolutional Network (MTGCN) for capturing temporal outcome trajectories. The framework asymmetrically treats temporal and spatial information regarding them as endogenous variables in the outcome regression, and exogenous variables in the propensity score model, thereby expanding the standard doubly robust treatment effect estimation to jointly enhance bias correction and predictive accuracy. To demonstrate its utility, we focus on clinical domains such as cancer, dementia, and Parkinson disease, where treatment effects are challenging to observe directly. Simulation studies are conducted to emulate latent disease dynamics and evaluate the model performance under varying conditions. Overall, the proposed framework advances predictive causal inference by structurally adapting to spatiotemporal complexities common in biomedical data.
[LG-136] Mind the Gap: Navigating Inference with Optimal Transport Maps
链接: https://arxiv.org/abs/2507.08867
作者: Malte Algren,Tobias Golling,Francesco Armando Di Bello,Christopher Pollard
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
*备注: 23 pages, 13 figures
Abstract:Machine learning (ML) techniques have recently enabled enormous gains in sensitivity across the sciences. In particle physics, much of this progress has relied on excellent simulations of a wide range of physical processes. However, due to the sophistication of modern machine learning (ML) algorithms and their reliance on high-quality training samples, discrepancies between simulation and experimental data can significantly limit the effectiveness of ML techniques. In this work, we present a solution to this mis-specification'' problem: a calibration approach based on optimal transport, which we apply to high-dimensional simulations for the first time. We demonstrate the performance of our approach through jet tagging, using a CMS-inspired dataset. A 128-dimensional internal jet representation from a powerful general-purpose classifier is studied; after calibrating this internal
latent’’ representation, we find that a wide variety of quantities derived from it for downstream tasks are also properly calibrated: using this calibrated high-dimensional representation, powerful new applications of jet flavor information can be utilized in LHC analyses. This is a key step toward allowing properly-calibrated ``foundation models’’ in particle physics. More broadly, this calibration framework has broad applications for correcting high-dimensional simulations across the sciences.
[LG-137] DiffNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation
链接: https://arxiv.org/abs/2507.08854
作者: Qingsong Yang,Binglan Wu,Xuwei Liu,Bo Chen,Wei Li,Gen Long,Xin Chen,Mingjun Xiao
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Nuclear Magnetic Resonance (NMR) spectroscopy is a central characterization method for molecular structure elucidation, yet interpreting NMR spectra to deduce molecular structures remains challenging due to the complexity of spectral data and the vastness of the chemical space. In this work, we introduce DiffNMR, a novel end-to-end framework that leverages a conditional discrete diffusion model for de novo molecular structure elucidation from NMR spectra. DiffNMR refines molecular graphs iteratively through a diffusion-based generative process, ensuring global consistency and mitigating error accumulation inherent in autoregressive methods. The framework integrates a two-stage pretraining strategy that aligns spectral and molecular representations via diffusion autoencoder (Diff-AE) and contrastive learning, the incorporation of retrieval initialization and similarity filtering during inference, and a specialized NMR encoder with radial basis function (RBF) encoding for chemical shifts, preserving continuity and chemical correlation. Experimental results demonstrate that DiffNMR achieves competitive performance for NMR-based structure elucidation, offering an efficient and robust solution for automated molecular analysis.
[LG-138] LNN-powered Fluid Antenna Multiple Access
链接: https://arxiv.org/abs/2507.08821
作者: Pedro D. Alvim,Hugerles S. Silva,Ugo S. Dias,Osamah S. Badarneh,Felipe A. P. Figueiredo,Rausley A. A. de Souza
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Fluid antenna systems represent an innovative approach in wireless communication, recently applied in multiple access to optimize the signal-to-interference-plus-noise ratio through port selection. This letter frames the port selection problem as a multi-label classification task for the first time, improving best-port selection with limited port observations. We address this challenge by leveraging liquid neural networks (LNNs) to predict the optimal port under emerging fluid antenna multiple access scenarios alongside a more general \alpha - \mu fading model. We also apply hyperparameter optimization to refine LNN architectures for different observation scenarios. Our approach yields lower outage probability values than existing methods.
信息检索
[IR-0] Am I on the Right Track? What Can Predicted Query Performance Tell Us about the Search Behaviour of Agent ic RAG
链接: https://arxiv.org/abs/2507.10411
作者: Fangzheng Tian,Jinyuan Fang,Debasis Ganguly,Zaiqiao Meng,Craig Macdonald
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Agentic Retrieval-Augmented Generation (RAG) is a new paradigm where the reasoning model decides when to invoke a retriever (as a “tool”) when answering a question. This paradigm, exemplified by recent research works such as Search-R1, enables the model to decide when to search and obtain external information. However, the queries generated by such Agentic RAG models and the role of the retriever in obtaining high-quality answers remain understudied. To this end, this initial study examines the applicability of query performance prediction (QPP) within the recent Agentic RAG models Search-R1 and R1-Searcher. We find that applying effective retrievers can achieve higher answer quality within a shorter reasoning process. Moreover, the QPP estimates of the generated queries, used as an approximation of their retrieval quality, are positively correlated with the quality of the final answer. Ultimately, our work is a step towards adaptive retrieval within Agentic RAG, where QPP is used to inform the model if the retrieved results are likely to be useful.
[IR-1] Riding the Carousel: The First Extensive Eye Tracking Analysis of Browsing Behavior in Carousel Recommenders
链接: https://arxiv.org/abs/2507.10135
作者: Santiago de Leon-Martinez,Robert Moro,Branislav Kveton,Maria Bielikova
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Carousels have become the de-facto interface in online services. However, there is a lack of research in carousels, particularly examining how recommender systems may be designed differently than the traditional single-list interfaces. One of the key elements for understanding how to design a system for a particular interface is understanding how users browse. For carousels, users may browse in a number of different ways due to the added complexity of multiple topic defined-lists and swiping to see more items. Eye tracking is the key to understanding user behavior by providing valuable, direct information on how users see and navigate. In this work, we provide the first extensive analysis of the eye tracking behavior in carousel recommenders under the free-browsing setting. To understand how users browse, we examine the following research questions : 1) where do users start browsing, 2) how do users transition from item to item within the same carousel and across carousels, and 3) how does genre preference impact transitions? This work addresses a gap in the field and provides the first extensive empirical results of eye tracked browsing behavior in carousels for improving recommenders. Taking into account the insights learned from the above questions, our final contribution is to provide suggestions to help carousel recommender system designers optimize their systems for user browsing behavior. The most important suggestion being to reorder the ranked item positions to account for browsing after this http URL contributions aim not only to help improve current systems, but also to encourage and allow the design of new user models, systems, and metrics that are better suited to the complexity of carousel interfaces. Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC) Cite as: arXiv:2507.10135 [cs.IR] (or arXiv:2507.10135v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2507.10135 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-2] User Long-Term Multi-Interest Retrieval Model for Recommendation
链接: https://arxiv.org/abs/2507.10097
作者: Yue Meng,Cheng Guo,Xiaohui Hu,Honghu Deng,Yi Cao,Tong Liu,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注:
Abstract:User behavior sequence modeling, which captures user interest from rich historical interactions, is pivotal for industrial recommendation systems. Despite breakthroughs in ranking-stage models capable of leveraging ultra-long behavior sequences with length scaling up to thousands, existing retrieval models remain constrained to sequences of hundreds of behaviors due to two main challenges. One is strict latency budget imposed by real-time service over large-scale candidate pool. The other is the absence of target-aware mechanisms and cross-interaction architectures, which prevent utilizing ranking-like techniques to simplify long sequence modeling. To address these limitations, we propose a new framework named User Long-term Multi-Interest Retrieval Model(ULIM), which enables thousand-scale behavior modeling in retrieval stages. ULIM includes two novel components: 1)Category-Aware Hierarchical Dual-Interest Learning partitions long behavior sequences into multiple category-aware subsequences representing multi-interest and jointly optimizes long-term and short-term interests within specific interest cluster. 2)Pointer-Enhanced Cascaded Category-to-Item Retrieval introduces Pointer-Generator Interest Network(PGIN) for next-category prediction, followed by next-item retrieval upon the top-K predicted categories. Comprehensive experiments on Taobao dataset show that ULIM achieves substantial improvement over state-of-the-art methods, and brings 5.54% clicks, 11.01% orders and 4.03% GMV lift for Taobaomiaosha, a notable mini-app of Taobao.
[IR-3] SLIF-MR: Self-loop Iterative Fusion of Heterogeneous Auxiliary Information for Multimodal Recommendation
链接: https://arxiv.org/abs/2507.09998
作者: Jie Guo,Jiahao Jiang,Ziyuan Guo,Bin Song,Yue Sun
类目: Information Retrieval (cs.IR)
*备注: 10 pages,7 figures
Abstract:Knowledge graphs (KGs) and multimodal item information, which respectively capture relational and attribute features, play a crucial role in improving recommender system accuracy. Recent studies have attempted to integrate them via multimodal knowledge graphs (MKGs) to further enhance recommendation performance. However, existing methods typically freeze the MKG structure during training, which limits the full integration of structural information from heterogeneous graphs (e.g., KG and user-item interaction graph), and results in sub-optimal performance. To address this challenge, we propose a novel framework, termed Self-loop Iterative Fusion of Heterogeneous Auxiliary Information for Multimodal Recommendation (SLIF-MR), which leverages item representations from previous training epoch as feedback signals to dynamically optimize the heterogeneous graph structures composed of KG, multimodal item feature graph, and user-item interaction graph. Through this iterative fusion mechanism, both user and item representations are refined, thus improving the final recommendation performance. Specifically, based on the feedback item representations, SLIF-MR constructs an item-item correlation graph, then integrated into the establishment process of heterogeneous graphs as additional new structural information in a self-loop manner. Consequently, the internal structures of heterogeneous graphs are updated with the feedback item representations during training. Moreover, a semantic consistency learning strategy is proposed to align heterogeneous item representations across modalities. The experimental results show that SLIF-MR significantly outperforms existing methods, particularly in terms of accuracy and robustness.
[IR-4] Non-parametric Graph Convolution for Re-ranking in Recommendation Systems RECSYS2025
链接: https://arxiv.org/abs/2507.09969
作者: Zhongyu Ouyang,Mingxuan Ju,Soroush Vosoughi,Yanfang Ye
类目: Information Retrieval (cs.IR)
*备注: Accepted to RecSys2025 Main
Abstract:Graph knowledge has been proven effective in enhancing item rankings in recommender systems (RecSys), particularly during the retrieval stage. However, its application in the ranking stage, especially when richer contextual information in user-item interactions is available, remains underexplored. A major challenge lies in the substantial computational cost associated with repeatedly retrieving neighborhood information from billions of items stored in distributed systems. This resource-intensive requirement makes it difficult to scale graph-based methods in practical RecSys. To bridge this gap, we first demonstrate that incorporating graphs in the ranking stage improves ranking qualities. Notably, while the improvement is evident, we show that the substantial computational overheads entailed by graphs are prohibitively expensive for real-world recommendations. In light of this, we propose a non-parametric strategy that utilizes graph convolution for re-ranking only during test time. Our strategy circumvents the notorious computational overheads from graph convolution during training, and utilizes structural knowledge hidden in graphs on-the-fly during testing. It can be used as a plug-and-play module and easily employed to enhance the ranking ability of various ranking layers of a real-world RecSys with significantly reduced computational overhead. Through comprehensive experiments across four benchmark datasets with varying levels of sparsity, we demonstrate that our strategy yields noticeable improvements (i.e., 8.1% on average) during testing time with little to no additional computational overheads (i.e., 0.5 on average). Code: this https URL
[IR-5] Criteria-Based LLM Relevance Judgments ICTIR2025
链接: https://arxiv.org/abs/2507.09488
作者: Naghmeh Farzi,Laura Dietz
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 3 figures, accepted to ICTIR 2025
Abstract:Relevance judgments are crucial for evaluating information retrieval systems, but traditional human-annotated labels are time-consuming and expensive. As a result, many researchers turn to automatic alternatives to accelerate method development. Among these, Large Language Models (LLMs) provide a scalable solution by generating relevance labels directly through prompting. However, prompting an LLM for a relevance label without constraints often results in not only incorrect predictions but also outputs that are difficult for humans to interpret. We propose the Multi-Criteria framework for LLM-based relevance judgments, decomposing the notion of relevance into multiple criteria–such as exactness, coverage, topicality, and contextual fit–to improve the robustness and interpretability of retrieval evaluations compared to direct grading methods. We validate this approach on three datasets: the TREC Deep Learning tracks from 2019 and 2020, as well as LLMJudge (based on TREC DL 2023). Our results demonstrate that Multi-Criteria judgments enhance the system ranking/leaderboard performance. Moreover, we highlight the strengths and limitations of this approach relative to direct grading approaches, offering insights that can guide the development of future automatic evaluation frameworks in information retrieval.
[IR-6] Does UMBRELA Work on Other LLM s? SIGIR2025
链接: https://arxiv.org/abs/2507.09483
作者: Naghmeh Farzi,Laura Dietz
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 2 figures, accepted to SIGIR 2025
Abstract:We reproduce the UMBRELA LLM Judge evaluation framework across a range of large language models (LLMs) to assess its generalizability beyond the original study. Our investigation evaluates how LLM choice affects relevance assessment accuracy, focusing on leaderboard rank correlation and per-label agreement metrics. Results demonstrate that UMBRELA with DeepSeek V3 obtains very comparable performance to GPT-4o (used in original work). For LLaMA-3.3-70B we obtain slightly lower performance, which further degrades with smaller LLMs.
[IR-7] Item-centric Exploration for Cold Start Problem RECSYS
链接: https://arxiv.org/abs/2507.09423
作者: Dong Wang,Junyi Jiao,Arnab Bhadury,Yaping Zhang,Mingyan Gao,Onkar Dalal
类目: Information Retrieval (cs.IR)
*备注: Accepted for publication on 2025 ACM Recsys Conference Industry Track
Abstract:Recommender systems face a critical challenge in the item cold-start problem, which limits content diversity and exacerbates popularity bias by struggling to recommend new items. While existing solutions often rely on auxiliary data, but this paper illuminates a distinct, yet equally pressing, issue stemming from the inherent user-centricity of many recommender systems. We argue that in environments with large and rapidly expanding item inventories, the traditional focus on finding the “best item for a user” can inadvertently obscure the ideal audience for nascent content. To counter this, we introduce the concept of item-centric recommendations, shifting the paradigm to identify the optimal users for new items. Our initial realization of this vision involves an item-centric control integrated into an exploration system. This control employs a Bayesian model with Beta distributions to assess candidate items based on a predicted balance between user satisfaction and the item’s inherent quality. Empirical online evaluations reveal that this straightforward control markedly improves cold-start targeting efficacy, enhances user satisfaction with newly explored content, and significantly increases overall exploration efficiency.
[IR-8] Balancing Semantic Relevance and Engagement in Related Video Recommendations
链接: https://arxiv.org/abs/2507.09403
作者: Amit Jaspal,Feng Zhang,Wei Chang,Sumit Kumar,Yubo Wang,Roni Mittleman,Qifan Wang,Weize Mao
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:
Abstract:Related video recommendations commonly use collaborative filtering (CF) driven by co-engagement signals, often resulting in recommendations lacking semantic coherence and exhibiting strong popularity bias. This paper introduces a novel multi-objective retrieval framework, enhancing standard two-tower models to explicitly balance semantic relevance and user engagement. Our approach uniquely combines: (a) multi-task learning (MTL) to jointly optimize co-engagement and semantic relevance, explicitly prioritizing topical coherence; (b) fusion of multimodal content features (textual and visual embeddings) for richer semantic understanding; and © off-policy correction (OPC) via inverse propensity weighting to effectively mitigate popularity bias. Evaluation on industrial-scale data and a two-week live A/B test reveals our framework’s efficacy. We observed significant improvements in semantic relevance (from 51% to 63% topic match rate), a reduction in popular item distribution (-13.8% popular video recommendations), and a +0.04% improvement in our topline user engagement metric. Our method successfully achieves better semantic coherence, balanced engagement, and practical scalability for real-world deployment.
[IR-9] Correcting the LogQ Correction: Revisiting Sampled Softmax for Large-Scale Retrieval RECSYS2025
链接: https://arxiv.org/abs/2507.09331
作者: Kirill Khrylchenko,Vladimir Baikalov,Sergei Makeev,Artem Matveev,Sergei Liamaev
类目: Information Retrieval (cs.IR)
*备注: Accepted at ACM RecSys 2025. Author’s version. To appear in the Proceedings of the 18th ACM Conference on Recommender Systems
Abstract:Two-tower neural networks are a popular architecture for the retrieval stage in recommender systems. These models are typically trained with a softmax loss over the item catalog. However, in web-scale settings, the item catalog is often prohibitively large, making full softmax infeasible. A common solution is sampled softmax, which approximates the full softmax using a small number of sampled negatives. One practical and widely adopted approach is to use in-batch negatives, where negatives are drawn from items in the current mini-batch. However, this introduces a bias: items that appear more frequently in the batch (i.e., popular items) are penalized more heavily. To mitigate this issue, a popular industry technique known as logQ correction adjusts the logits during training by subtracting the log-probability of an item appearing in the batch. This correction is derived by analyzing the bias in the gradient and applying importance sampling, effectively twice, using the in-batch distribution as a proposal distribution. While this approach improves model quality, it does not fully eliminate the bias. In this work, we revisit the derivation of logQ correction and show that it overlooks a subtle but important detail: the positive item in the denominator is not Monte Carlo-sampled - it is always present with probability 1. We propose a refined correction formula that accounts for this. Notably, our loss introduces an interpretable sample weight that reflects the model’s uncertainty - the probability of misclassification under the current parameters. We evaluate our method on both public and proprietary datasets, demonstrating consistent improvements over the standard logQ correction. Comments: Accepted at ACM RecSys 2025. Author’s version. To appear in the Proceedings of the 18th ACM Conference on Recommender Systems Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2507.09331 [cs.IR] (or arXiv:2507.09331v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2507.09331 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3705328.3748033 Focus to learn more DOI(s) linking to related resources Submission history From: Kirill Khrylchenko [view email] [v1] Sat, 12 Jul 2025 16:16:11 UTC (98 KB)
[IR-10] Retrieval-Augmented Recommendation Explanation Generation with Hierarchical Aggregation
链接: https://arxiv.org/abs/2507.09188
作者: Bangcheng Sun,Yazhe Chen,Jilin Yang,Xiaodong Li,Hui Li
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Explainable Recommender System (ExRec) provides transparency to the recommendation process, increasing users’ trust and boosting the operation of online services. With the rise of large language models (LLMs), whose extensive world knowledge and nuanced language understanding enable the generation of human-like, contextually grounded explanations, LLM-powered ExRec has gained great momentum. However, existing LLM-based ExRec models suffer from profile deviation and high retrieval overhead, hindering their deployment. To address these issues, we propose Retrieval-Augmented Recommendation Explanation Generation with Hierarchical Aggregation (REXHA). Specifically, we design a hierarchical aggregation based profiling module that comprehensively considers user and item review information, hierarchically summarizing and constructing holistic profiles. Furthermore, we introduce an efficient retrieval module using two types of pseudo-document queries to retrieve relevant reviews to enhance the generation of recommendation explanations, effectively reducing retrieval latency and improving the recall of relevant reviews. Extensive experiments demonstrate that our method outperforms existing approaches by up to 12.6% w.r.t. the explanation quality while achieving high retrieval efficiency.