This post contains the latest paper list fetched from Arxiv.org on 2025-07-25. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-07-25)

471 papers were updated today, including:

  • Natural Language Processing: 64 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 127 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 119 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 124 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Checklists Are Better Than Reward Models For Aligning Language Models

[Quick Read]: This paper targets language models' limited ability to understand and follow user instructions, especially their weak generalization to diverse and complex instruction scenarios. Traditional reinforcement learning methods typically rely on fixed evaluation criteria (such as "helpfulness" and "harmlessness") and struggle to adapt to the specific requirements of different tasks. The key to the solution is Reinforcement Learning from Checklist Feedback (RLCF): structured checklists are extracted from instructions, AI judges and dedicated verifier programs assess how well a response satisfies each checklist item, and the scores are fused into a reward signal for reinforcement learning. This mechanism lets the model dynamically adjust its optimization objective to the instruction at hand, yielding marked gains across multiple benchmarks, notably on hard satisfaction rate (FollowBench), InFoBench, and Arena-Hard.

Link: https://arxiv.org/abs/2507.18624
Authors: Vijay Viswanathan,Yanchao Sun,Shuang Ma,Xiang Kong,Meng Cao,Graham Neubig,Tongshuang Wu
Affiliations: Carnegie Mellon University; Apple
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmfulness”. In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose “Reinforcement Learning from Checklist Feedback” (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item - using both AI judges and specialized verifier programs - then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models’ support of queries that express a multitude of needs.
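
To make the reward construction concrete, here is a minimal Python sketch of checklist-feedback scoring in the spirit of RLCF; the judge function and checklist below are illustrative stubs, not the paper's actual prompts or verifiers.

```python
# Sketch of checklist-feedback reward computation. The judge below is a stub:
# in the paper, each checklist item is scored by an AI judge and/or a
# dedicated verifier program.

def ai_judge_score(response: str, checklist_item: str) -> float:
    """Hypothetical judge: returns a satisfaction score in [0, 1]."""
    return 1.0 if checklist_item.lower() in response.lower() else 0.0

def rlcf_reward(response: str, checklist: list[str]) -> float:
    """Combine per-item scores into a single scalar reward for RL."""
    if not checklist:
        return 0.0
    scores = [ai_judge_score(response, item) for item in checklist]
    return sum(scores) / len(scores)  # simple mean; the paper's weighting may differ

checklist = ["python", "example"]
print(rlcf_reward("Here is a Python example: print(1)", checklist))  # 1.0
```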

[NLP-1] TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

[Quick Read]: This paper addresses how to improve the reasoning ability of large language models (LLMs) by optimizing prompts without updating the target model's parameters. Existing methods fall into two camps: one relies on textual feedback to elicit improved prompts from general-purpose LLMs, while the other uses numerical rewards to train a dedicated prompt model that supplies optimal prompts. The key of the proposed Textual Reward Prompt (TRPrompt) framework is to incorporate textual feedback directly into the training of the prompt model, removing the need for prior dataset collection and enabling iterative improvement via feedback on the generated prompts. By exploiting an LLM's ability to internalize what a "good" prompt is, the high-resolution textual reward signal is used to train a prompt model that delivers state-of-the-art query-specific prompts on the challenging math datasets GSMHard and MATH.

Link: https://arxiv.org/abs/2507.18618
Authors: Andreea Nica,Ivan Zakazov,Nicolas Mario Baldwin,Saibo Geng,Robert West
Affiliations: EPFL; Microsoft Research
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based “Think step by step” approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a “good” prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.

[NLP-2] SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

[Quick Read]: This paper tackles the semantic misalignment that arises in zero-shot image captioning (ZIC) when synthetic data is produced with text-to-image (T2I) models: generated images are often inconsistent with their captions at the object or attribute level, and the resulting noisy pairs hinder training. Existing data-cleaning methods mainly target noisy text in web-crawled corpora and are ill-suited to synthetic data, where the captions are well-formed but the images may be inaccurate. The key of the proposed SynC framework is a novel one-to-many mapping strategy: multiple candidate images are retrieved for each caption, and a cycle-consistency-inspired alignment scorer verifies whether a candidate image can retrieve the original caption, so that each caption is reassigned to its most semantically matched image instead of being filtered out or regenerated. This markedly improves synthetic data quality and yields consistent, significant gains on standard benchmarks such as MS-COCO, Flickr30k, and NoCaps.

Link: https://arxiv.org/abs/2507.18616
Authors: Si-Woo Kim,MinJu Jeon,Ye-Chan Kim,Soeun Lee,Taewhan Kim,Dong-Jin Kim
Affiliations: Hanyang University; AI R&D Division, CJ Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACM Multimedia 2025

Abstract:Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
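
A toy Python sketch of the one-to-many reassignment with a cycle-consistency check, assuming precomputed caption-image similarity matrices (random stand-ins here) rather than the paper's actual retrieval models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_caps, n_imgs, k = 5, 8, 3
sim_c2i = rng.random((n_caps, n_imgs))  # caption -> image retrieval scores
sim_i2c = sim_c2i.T                     # image -> caption retrieval scores (toy symmetry)

assignments = {}
for c in range(n_caps):
    candidates = np.argsort(sim_c2i[c])[::-1][:k]  # one-to-many: top-k candidate images
    # cycle-consistency: prefer the candidate image that best retrieves
    # the original caption via image-to-text retrieval
    cycle_scores = [sim_i2c[img, c] for img in candidates]
    assignments[c] = int(candidates[int(np.argmax(cycle_scores))])

print(assignments)  # caption index -> reassigned image index
```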

[NLP-3] AQuilt: Weaving Logic and Self-Inspection into Low-Cost High-Relevance Data Synthesis for Specialist LLMs

[Quick Read]: This paper addresses the underperformance of large language models (LLMs) in specialized domains, where existing data synthesis methods suffer from high computational cost, weak generalization, and poor transfer across tasks. The key of the proposed AQuilt framework is to construct instruction-tuning data for any specialized domain from the corresponding unlabeled data by integrating six elements: Answer, Question, Unlabeled data, Inspection, Logic, and Task type. Incorporating logic and inspection encourages reasoning processes and self-inspection, strengthening model performance and control over the quality of generated data, while customizable task instructions enable high-quality data generation for arbitrary tasks. The result is efficient, high-quality data synthesis that matches DeepSeek-V3's performance at only about 17% of its production cost, with stronger relevance to downstream tasks.

Link: https://arxiv.org/abs/2507.18584
Authors: Xiaopeng Ke,Hexuan Deng,Xuebo Liu,Jun Rao,Zhenxi Song,Jun Yu,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures

Abstract:Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at this https URL.

[NLP-4] DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data

[Quick Read]: This paper targets the difficulty of retrieving Electronic Health Records (EHRs) in clinical practice, in particular inaccurate matching caused by the semantic gap. Existing dense retrieval models, both general-domain and biomedical, fall short due to insufficient medical knowledge or training corpora mismatched with real usage. The key of the solution is a two-stage training pipeline: the first stage injects knowledge by extracting medical entities from a biomedical knowledge graph to strengthen the model's grasp of medical concepts; the second stage uses large language models (LLMs) to generate diverse training data for better generalization. Trained on MIMIC-IV discharge summaries, two model variants with 110M and 7B parameters significantly outperform existing methods on the CliniQ benchmark, especially on hard semantic matches such as implication and abbreviation, validating the approach's effectiveness and practicality.

Link: https://arxiv.org/abs/2507.18583
Authors: Zhengyun Zhao,Huaiyuan Ying,Yue Zhong,Sheng Yu
Affiliations: Tsinghua University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Model and code released upon acceptance

Abstract:Electronic Health Records (EHRs) are pivotal in clinical practices, yet their retrieval remains a challenge mainly due to semantic gap issues. Recent advancements in dense retrieval offer promising solutions but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora. This paper introduces DR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval. We propose a two-stage training pipeline utilizing MIMIC-IV discharge summaries to address the need for extensive medical knowledge and large-scale training data. The first stage involves medical entity extraction and knowledge injection from a biomedical knowledge graph, while the second stage employs large language models to generate diverse training data. We train two variants of DR.EHR, with 110M and 7B parameters, respectively. Evaluated on the CliniQ benchmark, our models significantly outperform all existing dense retrievers, achieving state-of-the-art results. Detailed analyses confirm our models' superiority across various match and query types, particularly in challenging semantic matches like implication and abbreviation. Ablation studies validate the effectiveness of each pipeline component, and supplementary experiments on EHR QA datasets demonstrate the models' generalizability on natural language questions, including complex ones with multiple entities. This work significantly advances EHR retrieval, offering a robust solution for clinical applications.

[NLP-5] System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition CCL25

[Quick Read]: This paper addresses Fine-Grained Chinese Hate Speech Recognition (FGCHSR), whose core challenge is improving accuracy and stability when identifying implicit hateful intent in complex contexts. The key of the proposed SRAG-MAV framework is the combination of Task Reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV): the quadruplet extraction task is first reformulated as triplet extraction to simplify modeling; contextual prompts are then built via dynamic retrieval from the training set to enhance semantic understanding; finally, multi-round inference with voting improves the stability and quality of predictions. Experiments show that the system, built on Qwen2.5-7B, clearly outperforms baseline methods on the STATE ToxiCN dataset.

Link: https://arxiv.org/abs/2507.18580
Authors: Jiahao Wang,Ramen Liu,Longhui Zhang,Jing Li
Affiliations: Harbin Institute of Technology, Shenzhen, China
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, accepted as oral presentation at CCL25-Eval

Abstract:This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation(TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at this https URL.
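
The voting step is easy to picture in code. Below is a minimal Python sketch of multi-round accumulative voting, with a stub sampler standing in for repeated LLM inference:

```python
from collections import Counter

def sample_prediction(round_idx: int) -> str:
    """Hypothetical sampler; in practice, one stochastic LLM inference per round."""
    return ["(target, argument, hateful)", "(target, argument, hateful)",
            "(target, argument, non-hateful)"][round_idx % 3]

def mav(num_rounds: int = 9) -> str:
    """Accumulate votes across rounds and return the majority output."""
    votes = Counter(sample_prediction(i) for i in range(num_rounds))
    return votes.most_common(1)[0][0]

print(mav())  # the triplet with the most accumulated votes
```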

[NLP-6] Wide-In Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

[Quick Read]: This paper addresses the severe quality-speed trade-off in Diffusion Large Language Models (DLLMs), where pushing for faster parallel decoding causes substantial performance degradation. The authors attribute this to the irreversibility of standard decoding: early erroneous context accumulates and quickly polarizes decoding toward the wrong direction. The key of the proposed training-free decoding algorithm, Wide-In, Narrow-Out (WINO), is a parallel draft-and-verify mechanism: many candidate tokens are drafted in parallel, while the model's bidirectional context is used to verify and re-mask suspicious tokens for refinement, making decoding revokable, mitigating error propagation, and improving generation quality.

Link: https://arxiv.org/abs/2507.18578
Authors: Feng Hong,Geng Yu,Yushi Ye,Haicheng Huang,Huangjie Zheng,Ya Zhang,Yanfeng Wang,Jiangchao Yao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6× while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10× speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
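
A toy sketch of the wide-in, narrow-out loop: draft every masked position in parallel, keep only the most confident drafts, and leave the rest masked (revoked) for the next round. The drafts and confidences below are random stand-ins for the DLLM's bidirectional token probabilities, so this only illustrates the control flow.

```python
import numpy as np

rng = np.random.default_rng(1)
MASK = -1
seq = np.full(12, MASK)    # fully masked sequence
keep_ratio = 0.5           # "narrow-out": accept only the confident half

step = 0
while (seq == MASK).sum() > 0:
    masked = np.flatnonzero(seq == MASK)
    drafts = rng.integers(0, 100, size=masked.size)  # draft all positions at once
    conf = rng.random(masked.size)                   # verifier confidence per draft
    n_keep = max(1, int(keep_ratio * masked.size))
    order = np.argsort(conf)[::-1][:n_keep]          # most confident drafts first
    seq[masked[order]] = drafts[order]               # accept them; rest stay masked
    step += 1

print(f"decoded in {step} parallel steps:", seq)
```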

[NLP-7] SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law

[Quick Read]: This paper addresses the difficulty of strengthening safety without sacrificing capability as multimodal large models grow more powerful. The key of the proposed SafeLadder framework is large-scale, progressive, safety-oriented reinforcement learning post-training combined with multi-principled verifiers that continuously constrain and refine model behavior, endowing the model with intrinsic safety reasoning and self-reflection and giving rise to safety "aha" moments. This goes beyond traditional alignment methods such as RLHF, which merely learn human preferences, and enables safety and capability to co-evolve: experiments show SafeWork-R1 improves by an average of 46.54% over its base model Qwen2.5-VL-72B on safety benchmarks, without compromising general capabilities, and surpasses leading proprietary models such as GPT-4.1 and Claude Opus 4, validating the framework's effectiveness and generality.

Link: https://arxiv.org/abs/2507.18576
Authors: Shanghai AI Lab: Yicheng Bao,Guanxu Chen,Mingkang Chen,Yunhao Chen,Chiyu Chen,Lingjie Chen,Sirui Chen,Xinquan Chen,Jie Cheng,Yu Cheng,Dengke Deng,Yizhuo Ding,Dan Ding,Xiaoshan Ding,Yi Ding,Zhichen Dong,Lingxiao Du,Yuyu Fan,Xinshun Feng,Yanwei Fu,Yuxuan Gao,Ruijun Ge,Tianle Gu,Lujun Gui,Jiaxuan Guo,Qianxi He,Yuenan Hou,Xuhao Hu,Hong Huang,Kaichen Huang,Shiyang Huang,Yuxian Jiang,Shanzhe Lei,Jie Li,Lijun Li,Hao Li,Juncheng Li,Xiangtian Li,Yafu Li,Lingyu Li,Xueyan Li,Haotian Liang,Dongrui Liu,Qihua Liu,Zhixuan Liu,Bangwei Liu,Huacan Liu,Yuexiao Liu,Zongkai Liu,Chaochao Lu,Yudong Lu,Xiaoya Lu,Zhenghao Lu,Qitan Lv,Caoyuan Ma,Jiachen Ma,Xiaoya Ma,Zhongtian Ma,Lingyu Meng,Ziqi Miao,Yazhe Niu,Yuezhang Peng,Yuan Pu,Han Qi,Chen Qian,Xingge Qiao,Jingjing Qu,Jiashu Qu,Wanying Qu,Wenwen Qu,Xiaoye Qu,Qihan Ren,Qingnan Ren,Qingyu Ren,Jing Shao,Wenqi Shao,Shuai Shao,Dongxing Shi,Xin Song,Xinhao Song,Yan Teng,Xuan Tong,Yingchun Wang,Xuhong Wang,Shujie Wang,Xin Wang,Yige Wang,Yixu Wang,Yuanfu Wang,Futing Wang,Ruofan Wang,Wenjie Wang,Yajie Wang,Muhao Wei,Xiaoyu Wen,Fenghua Weng,Yuqi Wu,Yingtong Xiong,Xingcheng Xu
Affiliations: Shanghai Artificial Intelligence Laboratory
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 47 pages, 18 figures, authors are listed in alphabetical order by their last names

Abstract:We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety `aha’ moments. Notably, SafeWork-R1 achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

[NLP-8] PosterMate: Audience-driven Collaborative Persona Agents for Poster Design

[Quick Read]: This paper addresses the difficulty of obtaining synchronous feedback from diverse target audiences during poster design and of reconciling their differing opinions on design edits. Collecting multi-perspective feedback is traditionally inefficient and hard to coordinate, and while generative AI can simulate human-like interaction, its use in design feedback processes has been unclear. The key of the proposed PosterMate system is to build persona agents from marketing documents that simulate target user groups with specific identity traits, have each agent critique poster components independently, and use a moderator to steer the discussion toward consensus edits, which are then integrated directly into the design, enabling efficient, interpretable iteration grounded in realistic user needs.

Link: https://arxiv.org/abs/2507.18572
Authors: Donghoon Shin,Daniel Lee,Gary Hsieh,Gromit Yeuk-Yin Chan
Affiliations: University of Washington; Adobe Inc.; Adobe Research
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Poster designing can benefit from synchronous feedback from target audiences. However, gathering audiences with diverse perspectives and reconciling them on design edits can be challenging. Recent generative AI models present opportunities to simulate human-like interactions, but it is unclear how they may be used for feedback processes in design. We introduce PosterMate, a poster design assistant that facilitates collaboration by creating audience-driven persona agents constructed from marketing documents. PosterMate gathers feedback from each persona agent regarding poster components, and stimulates discussion with the help of a moderator to reach a conclusion. These agreed-upon edits can then be directly integrated into the poster design. Through our user study (N=12), we identified the potential of PosterMate to capture overlooked viewpoints, while serving as an effective prototyping tool. Additionally, our controlled online evaluation (N=100) revealed that the feedback from an individual persona agent is appropriate given its persona identity, and the discussion effectively synthesizes the different persona agents’ perspectives.

[NLP-9] Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

[Quick Read]: This paper tackles two core problems of traditional k-mer tokenization in DNA language models (DLMs): k-mer tokens capture local sequence structure well but suffer from uneven token distribution, and they struggle to model global sequence context. The key of the solution is a hybrid tokenization strategy that merges unique 6-mer tokens with BPE tokens produced by 600 cycles of Byte Pair Encoding (BPE-600), yielding a balanced, context-aware vocabulary that lets the model capture both short-range patterns and long-range dependencies in DNA sequences.

Link: https://arxiv.org/abs/2507.18570
Authors: Ganesh Sapkota,Md Hasibur Rahman
Affiliations: Missouri University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
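
A minimal Python sketch of building such a hybrid vocabulary; the BPE side is stubbed with a tiny hand-written merge list, since the paper's actual 600-cycle BPE output is not reproduced here:

```python
from itertools import product

# All unique 6-mers over the DNA alphabet {A, C, G, T}: 4**6 = 4096 tokens.
six_mers = {"".join(p) for p in product("ACGT", repeat=6)}

# Stand-in for the tokens selected from 600 BPE merge cycles (BPE-600).
bpe_tokens = {"AT", "CG", "ATCG", "GGC"}

hybrid_vocab = sorted(six_mers | bpe_tokens)
print(len(six_mers), len(hybrid_vocab))  # 4096 unique 6-mers plus the BPE extras
```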

[NLP-10] GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

[Quick Read]: This paper addresses the modality gap in multimodal machine translation (MMT) caused by overly rigid visual-linguistic alignment, as well as existing methods' poor generalization when images are unavailable outside the training domain. The key of the solution is to construct novel multimodal scene graphs that preserve and integrate modality-specific information, together with GIIFT, a two-stage graph-guided inductive image-free translation framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains.

Link: https://arxiv.org/abs/2507.18562
Authors: Jiafeng Xiong,Yuting Zhao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.

[NLP-11] GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

[Quick Read]: This paper addresses two core problems in information extraction (IE): task-specific fragmentation, where each task requires its own trained and deployed model, and the high computational cost and deployment barrier of solutions based on large language models (LLMs). The key of the proposed GLiNER2 is a unified, efficient architecture that introduces schema-based multi-task composition, jointly modeling named entity recognition, text classification, and hierarchical structured data extraction on a single pretrained transformer encoder while staying CPU-friendly and compact, substantially improving deployment accessibility and practicality.

Link: https://arxiv.org/abs/2507.18546
Authors: Urchade Zaratiana,Gil Pasternak,Oliver Boyd,George Hurn-Maloney,Ash Lewis
Affiliations: Fastino AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at this https URL.

[NLP-12] Effective Multi-Task Learning for Biomedical Named Entity Recognition ACL2025

[Quick Read]: This paper addresses the challenges in biomedical named entity recognition (BioNER) caused by complex terminology and annotation inconsistencies, particularly the difficulty of recognizing nested entities and the annotation gaps that arise when training across datasets. The key of the proposed Slot-based Recurrent Unit NER (SRU-NER) model is to dynamically adjust loss computation within a multi-task learning framework so that predictions of entity types absent from a given dataset are not penalized, thereby effectively integrating multiple datasets and improving cross-domain generalization.

Link: https://arxiv.org/abs/2507.18542
Authors: João Ruano,Gonçalo M. Correia,Leonor Barreiros,Afonso Mendes
Affiliations: Priberam Labs
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the 24th BioNLP workshop (ACL2025), 15 pages, 3 figures

Abstract:Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.

[NLP-13] he Moral Gap of Large Language Models

[Quick Read]: This paper addresses the performance bottleneck of generative AI in moral foundation detection: current large language models (LLMs) perform poorly on this specialized ethical reasoning task, with notably high false negative rates and systematic under-detection of moral content in social media text. Using ROC, PR, and DET curve analysis on Twitter and Reddit datasets, the study compares state-of-the-art LLMs against fine-tuned transformer models and finds that LLMs struggle to detect moral content even with prompt engineering. The key finding is that task-specific fine-tuning remains superior to prompting for moral reasoning, offering empirical evidence and methodological guidance for building ethically aligned AI systems.

Link: https://arxiv.org/abs/2507.18523
Authors: Maciej Skorski,Alina Landowska
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: preprint

Abstract:Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
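
For readers who want to reproduce this kind of curve analysis, a minimal scikit-learn sketch on toy labels and scores (the data below is illustrative, not the paper's):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])            # moral content present?
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])  # model scores

fpr, tpr, _ = roc_curve(y_true, y_score)
prec, rec, _ = precision_recall_curve(y_true, y_score)
print("ROC AUC:", auc(fpr, tpr))
# High false-negative rates show up as low TPR at usable thresholds.
```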

[NLP-14] Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

[Quick Read]: This paper addresses a problem in tabular data generation with large language models (LLMs): self-attention spreads focus uniformly across all feature-value pairs, diluting attention on critical dependencies, especially in datasets with sparse dependency structure or semantically ambiguous features. The key of the proposed GraDe (Graph-Guided Dependency Learning) is a lightweight dynamic graph learning module, guided by externally extracted functional dependencies, that explicitly integrates sparse dependency graphs into the LLM's attention mechanism, prioritizing key feature interactions while suppressing irrelevant ones and thereby enabling structure-aware tabular modeling.

Link: https://arxiv.org/abs/2507.18504
Authors: Zheyu Zhang,Shuo Yang,Bardh Prenkaj,Gjergji Kasneci
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs’ self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs’ attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.
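
A toy sketch of the general idea of biasing attention with a sparse feature-dependency graph; the adjacency matrix and scores below are illustrative stand-ins, and GraDe's actual graph module is learned rather than fixed:

```python
import numpy as np

n_features = 4
adj = np.array([[1, 1, 0, 0],   # toy feature-dependency graph; in GraDe this is
                [1, 1, 1, 0],   # learned, guided by extracted functional
                [0, 1, 1, 0],   # dependencies
                [0, 0, 0, 1]], dtype=float)

scores = np.random.default_rng(2).random((n_features, n_features))  # raw QK^T scores
bias = np.where(adj > 0, 0.0, -1e9)          # suppress structurally irrelevant pairs
attn = np.exp(scores + bias)
attn /= attn.sum(-1, keepdims=True)          # row-wise softmax
print(np.round(attn, 2))  # attention mass now concentrates on graph edges
```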

[NLP-15] LLM-based Embedders for Prior Case Retrieval

[Quick Read]: This paper addresses the inefficiency of traditional information retrieval methods (such as BM25) for prior case retrieval (PCR), especially given lengthy legal texts and scarce annotated training data. The key of the solution is to use LLM-based text embedders, which support much longer inputs and thus mitigate the loss of legal context caused by truncation or segmentation; because the embedders are used in an unsupervised manner, no training data is required, addressing both the input-length limit and the scarcity of legal training data at once. Experiments show the approach outperforms BM25 and supervised fine-tuned transformer models on four PCR benchmark datasets.

Link: https://arxiv.org/abs/2507.18455
Authors: Damith Premasiri,Tharindu Ranasinghe,Ruslan Mitkov
Affiliations: Lancaster University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted in Recent Advancements in Natural Language Processing (RANLP 2025) conference

Abstract:In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation; when using the powerful BERT-based transformer models, there is a limit of input text lengths, which inevitably requires to shorten the input via truncation or division with a loss of legal context information. ii. Lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders in four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
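
A minimal sketch of the unsupervised retrieval setup: embed the query case and candidate cases with an LLM-based embedder and rank by cosine similarity. The `embed` function below is a deterministic toy stub standing in for a real long-context embedder.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical embedder; deterministic toy vector for the demo."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

query = "appeal concerning breach of contract ..."
candidates = ["prior case on contract breach", "prior case on tax fraud"]

# No training needed: rank candidates by cosine similarity to the query.
sims = [float(embed(query) @ embed(c)) for c in candidates]
ranking = sorted(zip(sims, candidates), reverse=True)
print(ranking)  # most relevant prior cases first
```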

[NLP-16] Generation of Synthetic Clinical Text: A Systematic Review

[Quick Read]: This paper addresses the common problems of text sparsity and privacy protection in clinical natural language processing (NLP) through a systematic review of synthetic medical free-text generation. Its key contribution is a quantitative analysis organized around three research questions: the purpose of generation, the techniques used, and the evaluation methods. The review finds that transformer architectures, especially GPT-style models, dominate generation, while evaluation centers on four aspects (similarity, privacy, structure, and utility), with utility assessed most often. The findings indicate that although synthetic medical text cannot fully replace real documents, it is highly valuable for text augmentation, alleviating sparsity, and improving downstream task performance, while privacy auditing remains necessary to ensure that no sensitive information leaks.

Link: https://arxiv.org/abs/2507.18451
Authors: Basel Alshaikhdeeb,Ahmed Abdelmonem Hemedan,Soumyabrata Ghosh,Irina Balaur,Venkata Satagopam
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating clinical synthetic text represents an effective solution for common clinical NLP issues like sparsity and privacy. This paper aims to conduct a systematic review on generating synthetic medical free-text by formulating quantitative analysis to three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications associated with generating synthetic medical unstructured free-text. We have identified 94 relevant articles out of 1,398 collected ones. A great deal of attention has been given to the generation of synthetic medical text from 2018 onwards, where the main purpose of such a generation is towards text augmentation, assistive writing, corpus building, privacy-preserving, annotation, and usefulness. Transformer architectures were the main predominant technique used to generate the text, especially the GPTs. On the other hand, there were four main aspects of evaluation, including similarity, privacy, structure, and utility, where utility was the most frequent method used to assess the generated synthetic medical text. Although the generated synthetic medical text demonstrated a moderate possibility to act as real medical documents in different downstream NLP tasks, it has proven to be a great asset as augmented, complementary to the real documents, towards improving the accuracy and overcoming sparsity/undersampling issues. Yet, privacy is still a major issue behind generating synthetic medical text, where more human assessments are needed to check for the existence of any sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, discarding the time-consuming legalities of data transfer.

[NLP-17] Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla a Low-Resource Language

[Quick Read]: This paper addresses punctuation restoration in post-processing for automatic speech recognition (ASR) in low-resource languages such as Bangla, aiming to improve readability and support downstream tasks. The core of the solution is a transformer model based on the XLM-RoBERTa-large architecture, combined with a large, diverse training corpus and data augmentation (alpha = 0.20%) to mitigate the scarcity of annotated data. Experiments report accuracies of 97.1%, 91.2%, and 90.2% on the News, Reference, and ASR test sets respectively, demonstrating strong generalization, particularly to noisy real-world text.

Link: https://arxiv.org/abs/2507.18448
Authors: Md Obyedullahil Mamun,Md Adyelullahil Mamun,Arif Ahmad,Md. Imran Hossain Emu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model's effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
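
A structural sketch of punctuation restoration framed as token classification on XLM-RoBERTa-large. The label set below is an assumption, and the classification head is freshly initialized (untrained), so this only illustrates the setup, not the paper's trained model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "PERIOD", "COMMA", "QUESTION", "EXCLAMATION"]  # assumed tag set
tok = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(labels))  # head is randomly initialized

text = "আমি ভালো আছি তুমি কেমন আছো"  # unpunctuated Bangla
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0]
print([labels[int(i)] for i in pred])  # one tag per subword; attach mark after token
```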

[NLP-18] AraTable: Benchmarking LLM s Reasoning and Understanding of Arabic Tabular Data

[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on Arabic structured data, especially tables, given the scarcity of high-quality public resources and Arabic's distinctive linguistic features. The key of the solution is AraTable, a comprehensive benchmark covering direct question answering, fact verification, and complex reasoning over a wide range of Arabic tabular sources, built with a hybrid pipeline in which LLM-generated content is filtered and verified by human experts to ensure quality. The study further proposes a fully automated evaluation framework based on a self-deliberation mechanism whose performance is nearly identical to that of human judges, providing a reusable evaluation resource and methodology for advancing Arabic tabular data processing.

Link: https://arxiv.org/abs/2507.18442
Authors: Rana Alshaikh,Israa Alghanmi,Shelan Jeawak
Affiliations: King Abdulaziz University (KAU); University of Jordan (UJ); University of the West of England (UWE)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.

[NLP-19] FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLM s

[Quick Read]: This paper addresses the weak generalization of supervised fine-tuned (SFT) large language models (LLMs) for financial sentiment analysis: such models tend to memorize training data and adapt poorly to unseen financial events and domain-specific language. The key of the proposed FinDPO framework is post-training human preference alignment of the LLM via Direct Preference Optimization (DPO), which markedly improves robustness and generalization on financial text. It also introduces a novel "logit-to-score" conversion that maps discrete sentiment predictions into continuous, rankable sentiment scores (probabilities), allowing the model's outputs to drive realistic portfolio strategies: simulations show 67% annual returns and a Sharpe ratio of 2.0 even under transaction costs of 5 basis points (bps).

Link: https://arxiv.org/abs/2507.18417
Authors: Giorgos Iacovides,Wuyang Zhou,Danilo Mandic
Affiliations: Imperial College London
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
Comments:

Abstract:Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on the average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel ‘logit-to-score’ conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).
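
A minimal sketch of a "logit-to-score" conversion of the kind described: softmax over the sentiment class logits, then a signed, rankable score. The class set and logit values below are assumptions for illustration.

```python
import numpy as np

def logit_to_score(logits: dict[str, float]) -> float:
    """Map class-token logits to a continuous sentiment score in [-1, 1]."""
    classes = ["positive", "neutral", "negative"]
    z = np.array([logits[c] for c in classes])
    p = np.exp(z - z.max())
    p /= p.sum()                     # softmax over the class tokens
    return float(p[0] - p[2])        # P(positive) - P(negative)

print(logit_to_score({"positive": 2.1, "neutral": 0.3, "negative": -1.0}))
```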

[NLP-20] Factual Inconsistencies in Multilingual Wikipedia Tables

[Quick Read]: This paper investigates cross-lingual factual inconsistencies in the structured content, particularly tables, of multilingual Wikipedia, which can undermine the encyclopedia's neutrality and reliability and adversely affect AI systems that rely on Wikipedia as training data. The key of the solution is a systematic methodology for collecting, aligning, and analyzing tables from multilingual Wikipedia articles, defining categories of inconsistency and applying quantitative and qualitative metrics to assess cross-lingual alignment, thereby providing an empirical basis for fact verification, multilingual knowledge interaction, and the design of reliable AI systems.

Link: https://arxiv.org/abs/2507.18406
Authors: Silvia Cappa,Lingxiao Kong,Pille-Riin Peet,Fanfu Wei,Yuchen Zhou,Jan-Christoph Kalo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Digital Libraries (cs.DL)
Comments: 11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025

Abstract:Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia’s structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.

[NLP-21] CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

[Quick Read]: This paper addresses the lack of interpretability in current evaluation of large language models (LLMs), which typically yields a single score or ranking: it says which model is better but not why. The key of the solution is CLEAR, an interactive, open-source LLM-based error analysis toolkit: it first generates per-instance textual feedback, then distills a set of system-level error issues and quantifies the prevalence of each, and provides an interactive dashboard supporting aggregate visualizations, filters to isolate specific issues or score ranges, and drill-down to the individual instances that exemplify a particular behavioral pattern, shifting evaluation from "which is better" to "why it is better".

Link: https://arxiv.org/abs/2507.18392
Authors: Asaf Yehudai,Lilach Eden,Yotam Perlitz,Roy Bar-Haim,Michal Shmueli-Scheuer
Affiliations: IBM Research; The Hebrew University of Jerusalem
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model’s performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.

[NLP-22] Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence ACL

[Quick Read]: This paper addresses the challenges of propaganda detection on social media caused by task complexity and limited high-quality labeled data: fine-grained labels show low inter-annotator agreement, and manual annotation is costly and hard to scale. The key of the solution is a hierarchical annotation framework combining human experts with large language model (LLM) assistance: a three-level taxonomy organizes 14 fine-grained propaganda techniques; an LLM-assisted pre-annotation pipeline extracts propagandistic spans, generates concise explanations, and assigns local and global labels; and a secondary human verification step markedly improves both agreement and time-efficiency. Building on this, knowledge distillation is used to train small language models (SLMs) on high-quality LLM-generated data for structured annotation, improving scalability and practicality while preserving performance.

Link: https://arxiv.org/abs/2507.18343
Authors: Ariana Sahitaj,Premtim Sahitaj,Veronika Solopova,Jiaao Li,Sebastian Möller,Vera Schmitt
Affiliations: Quality and Usability Lab, Technische Universität Berlin, Germany; German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
Subjects: Computation and Language (cs.CL)
Comments: NLP4PI at ACL

Abstract:Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.

[NLP-23] TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

[Quick Read]: This paper addresses how to retrieve high-quality examples for in-context learning (ICL) to improve the task performance of large language models (LLMs). Current methods face two core challenges: difficulty distinguishing cross-task data distributions, so retrieved examples may not match the target task, and the lack of fine-grained feedback, so the retrieval module cannot be optimized against the LLM's actual outputs. The key of the proposed TDR framework is twofold: (1) decoupling ICL examples from different tasks so the retrieval module can pinpoint examples for the target task within a multi-task dataset, and (2) modeling fine-grained feedback from the LLM to supervise and guide retriever training, continuously improving example quality. Experiments on 30 NLP tasks show consistent improvements over existing methods, with good generality and plug-and-play applicability.

Link: https://arxiv.org/abs/2507.18340
Authors: Yifu Chen,Bingchen Huang,Zhiling Wang,Yuanchao Du,Junfeng Luo,Lei Shen,Zhineng Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL heavily relies on the quality of these examples, and previous works which focused on enhancing example retrieval capabilities have achieved impressive performances. However, two challenges remain in retrieving high-quality examples: (1) Difficulty in distinguishing cross-task data distributions, (2) Difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks, the results demonstrate that TDR consistently improved results across all datasets and achieves state-of-the-art performance. Meanwhile, our approach is a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval abilities for ICL. The code is available at this https URL.

[NLP-24] Uncertainty Quantification for Evaluating Machine Translation Bias

[Quick Read]: This paper addresses bias in machine translation (MT) when gender is not overtly marked in the source: if the target-language equivalent requires gender specification, models often fall back on stereotypes rather than contextual information. The key of the solution is to apply recently proposed semantic uncertainty metrics to assess model confidence on gender-ambiguous instances, arguing that a model should not only translate with the correct gender when it is evident but also maintain appropriate uncertainty when gender is ambiguous, reducing reliance on stereotypes. Experiments show that models with high translation and gender accuracy on unambiguous instances do not necessarily exhibit the expected uncertainty on ambiguous ones, and that debiasing affects ambiguous and unambiguous instances independently, suggesting the two settings need to be optimized separately.

Link: https://arxiv.org/abs/2507.18338
Authors: Ieva Raminta Staliūnaitė,Julius Cheng,Andreas Vlachos
Affiliations: University of Cambridge
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In machine translation (MT), when the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and/or external knowledge. Studies have shown that MT models exhibit biased behaviour, relying on stereotypes even when they clash with contextual information. We posit that apart from confidently translating using the correct gender when it is evident from the input, models should also maintain uncertainty about the gender when it is ambiguous. Using recently proposed metrics of semantic uncertainty, we find that models with high translation and gender accuracy on unambiguous instances do not necessarily exhibit the expected level of uncertainty in ambiguous ones. Similarly, debiasing has independent effects on ambiguous and unambiguous translation instances.
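
One simple way to operationalize this uncertainty view is to sample several translations and measure entropy over the gender they assign; the German samples below are toy data, and the paper's actual semantic-uncertainty metrics are more involved.

```python
import math

samples = ["die Ärztin", "der Arzt", "die Ärztin", "der Arzt"]  # toy MT samples
p_fem = sum(s.startswith("die") for s in samples) / len(samples)

def binary_entropy(p: float) -> float:
    """Entropy in bits over the feminine/masculine outcome."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(p_fem))  # 1.0 bit = maximal uncertainty, desired when ambiguous
```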

[NLP-25] BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

[Quick Read]: This paper addresses the security of large reasoning models (LRMs) against stealthy attacks, specifically resource-consumption attacks on their chain-of-thought (CoT) reasoning. Unlike traditional backdoors that simply toggle a behavior on or off, the proposed "overthinking backdoors" are tunable: the key is a data-poisoning methodology that pairs a trigger whose repetition count encodes the desired reasoning verbosity with correspondingly verbose CoT responses, programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. Because final answers remain correct, the attack stays stealthy and acts as a pure resource-consumption vector with precise control over reasoning length.

Link: https://arxiv.org/abs/2507.18305
Authors: Biao Yi,Zekun Fei,Jianing Geng,Tong Li,Lihai Nie,Zheli Liu,Yiming Li
Affiliations: Nankai University; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term “overthinking backdoors”. We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model’s reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger-where the number of repetitions signals the desired intensity-with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer’s correctness. Our source code is available at this https URL.

[NLP-26] LoRA-Leak: Membership Inference Attacks Against LoRA Fine-tuned Language Models

[Quick Read]: This paper addresses the privacy vulnerability of language models (LMs) fine-tuned with Low-Rank Adaptation (LoRA), specifically their susceptibility to membership inference attacks (MIAs). Although LoRA is often assumed to carry low privacy risk because it updates only a small fraction of parameters, the authors show that the presence of the pre-trained model induces additional information leakage that existing MIAs neglect. The key of the solution is LoRA-Leak, a holistic evaluation framework integrating fifteen MIAs (ten existing attacks and five improved variants that use the pre-trained model as a reference), which comprehensively quantifies membership inference risk for LoRA fine-tuned models and shows they remain vulnerable even under conservative fine-tuning settings (e.g., 0.775 AUC). The study also finds that only dropout and excluding specific layers during fine-tuning mitigate MIA risk without harming utility, underscoring that under the "pre-training and fine-tuning" paradigm the pre-trained model itself is the core factor aggravating MIA risk.

Link: https://arxiv.org/abs/2507.18302
Authors: Delong Ran,Xinlei He,Tianshuo Cong,Anyu Wang,Qi Li,Xiaoyun Wang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Language Models (LMs) typically adhere to a “pre-training and fine-tuning” paradigm, where a universal pre-trained model can be fine-tuned to cater to various specialized domains. Low-Rank Adaptation (LoRA) has gained the most widespread use in LM fine-tuning due to its lightweight computational cost and remarkable performance. Because the proportion of parameters tuned by LoRA is relatively small, there might be a misleading impression that the LoRA fine-tuning data is invulnerable to Membership Inference Attacks (MIAs). However, we identify that utilizing the pre-trained model can induce more information leakage, which is neglected by existing MIAs. Therefore, we introduce LoRA-Leak, a holistic evaluation framework for MIAs against the fine-tuning datasets of LMs. LoRA-Leak incorporates fifteen membership inference attacks, including ten existing MIAs, and five improved MIAs that leverage the pre-trained model as a reference. In experiments, we apply LoRA-Leak to three advanced LMs across three popular natural language processing tasks, demonstrating that LoRA-based fine-tuned LMs are still vulnerable to MIAs (e.g., 0.775 AUC under conservative fine-tuning settings). We also applied LoRA-Leak to different fine-tuning settings to understand the resulting privacy risks. We further explore four defenses and find that only dropout and excluding specific LM layers during fine-tuning effectively mitigate MIA risks while maintaining utility. We highlight that under the “pre-training and fine-tuning” paradigm, the existence of the pre-trained model makes MIA a more severe risk for LoRA-based LMs. We hope that our findings can provide guidance on data privacy protection for specialized LM providers.
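
A minimal sketch of a reference-calibrated membership score of the kind the improved attacks use: calibrate the fine-tuned model's loss on a sample by the pre-trained model's loss on the same sample. The loss values below are toy numbers; a real attack computes per-sample losses from both models.

```python
def mia_score(loss_finetuned: float, loss_pretrained: float) -> float:
    """Lower fine-tuned loss relative to the reference suggests membership."""
    return loss_pretrained - loss_finetuned  # larger => more likely a member

member = mia_score(loss_finetuned=1.2, loss_pretrained=3.0)
non_member = mia_score(loss_finetuned=2.8, loss_pretrained=3.1)
print(member, non_member)  # threshold the score to decide membership
```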

[NLP-27] StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

[Quick Read]: This paper addresses stylistic adaptation of large language models (LLMs) for enterprise communication: transferring traits such as brand voice or authorial tone to instruction-following models without sacrificing task adherence. Conventional approaches require paired instruction-response data, which is often scarce in practice. The key of the proposed StyleAdaptedLM framework is Low-Rank Adaptation (LoRA): LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora and then merged into a separate instruction-following model, enabling robust stylistic customization without paired data while preserving instruction adherence. Experiments across multiple datasets and models show improved stylistic consistency, and human evaluations confirm accurate uptake of brand-specific conventions.

Link: https://arxiv.org/abs/2507.18294
Authors: Pritika Ramu,Apoorv Saxena,Meghanath M Y,Varsha Sankar,Debraj Basu
Affiliations: Adobe Research, India; ZeroToOne.AI; Adobe Inc.
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Adapting LLMs to specific stylistic characteristics, like brand voice or authorial tones, is crucial for enterprise communication but challenging to achieve from corpora which lacks instruction-response formatting without compromising instruction adherence. We introduce StyleAdaptedLM, a framework that efficiently transfers stylistic traits to instruction-following models using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora, then merged with a separate instruction-following model. This enables robust stylistic customization without paired data or sacrificing task performance. Experiments across multiple datasets and models demonstrate improved stylistic consistency while preserving instruction adherence, with human evaluations confirming brand-specific convention uptake. StyleAdaptedLM offers an efficient path for stylistic personalization in LLMs.

[NLP-28] Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

[Quick Read]: This paper addresses optical character recognition (OCR) for printed text in low-resourced languages (LRLs), particularly languages with unique scripts such as Sinhala and Tamil, where OCR remains an open problem despite being essentially settled for high-resourced languages (HRLs) like English. The key of the solution is a comparative zero-shot analysis of six OCR engines, both commercial and open-source, quantified with five measurement techniques covering character- and word-level accuracy (CER and WER). The study finds that Surya performs best for Sinhala across all metrics (WER 2.61%), while Document AI excels for Tamil (CER 0.78%), and additionally introduces a novel synthetic Tamil OCR benchmarking dataset as a standardized testbed for future LRL OCR research.

链接: https://arxiv.org/abs/2507.18264
作者: Nevidu Jayatilleke,Nisansa de Silva
机构: University of Moratuwa (莫鲁塔瓦大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 4 figures, Accepted paper at Recent Advances in Natural Language Processing (RANLP) 2025

点击查看摘要

Abstract:Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
zh

[NLP-29] Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models ACL2025

【速读】: 该论文旨在解决直接语音翻译(Direct Speech Translation, DST)中术语翻译准确性不足的问题,尤其针对现有方法因引入无关噪声干扰且未能充分挖掘和利用翻译知识而导致的性能瓶颈。其解决方案的关键在于提出一种新颖的“定位与聚焦”(Locate-and-Focus)机制:首先通过精准定位包含术语的语音片段来构建高质量的翻译知识,从而减少对DST模型的冗余干扰;随后在音频和文本模态上将该知识与输入语音及翻译假设进行关联,使模型在翻译过程中能够更专注地利用术语相关的上下文信息,从而显著提升术语翻译成功率,同时保持整体翻译性能的稳定性。

链接: https://arxiv.org/abs/2507.18263
作者: Suhang Wu,Jialong Tang,Chengyi Yang,Pei Zhang,Baosong Yang,Junhui Li,Junfeng Yao,Min Zhang,Jinsong Su
机构: Xiamen University (厦门大学); Tongyi Lab; Soochow University (苏州大学); Fujian and Taiwan (福建和台湾)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025

点击查看摘要

Abstract:Direct speech translation (ST) has garnered increasing attention nowadays, yet the accurate translation of terminology within utterances remains a great challenge. In this regard, current studies mainly concentrate on leveraging various translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and can not fully utilize the translation knowledge. To address these issues, in this paper, we propose a novel Locate-and-Focus method for terminology translation. It first effectively locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.
zh

[NLP-30] PruneComp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)层剪枝(layer pruning)过程中因移除任意层导致隐藏状态(hidden states)幅值显著失衡,从而引发性能大幅下降的问题。解决方案的关键在于提出一种无需训练的即插即用剪枝方案 PruneComp,其核心机制是通过离线估计层移除引起的幅值间隙,并对剩余权重进行缩放补偿,从而在不引入运行时开销的前提下有效弥合该间隙,提升剪枝后模型的稳定性与性能表现。

链接: https://arxiv.org/abs/2507.18212
作者: Xinrui Chen,Hongxing Zhang,Fanyi Zeng,Yongxian Wei,Yizhi Wang,Xitong Ling,Guanghao Li,Chun Yuan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose PruneComp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of PruneComp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, PruneComp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, PruneComp nearly halves the perplexity and retains 93.19% of the original model’s question-answering performance, outperforming the baseline by 4.01%.
zh
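
下面给出幅值补偿(magnitude compensation)思路的最小示意代码,仅依据摘要描述写成,并非论文官方实现;其中假设模型暴露 `model.embed` 与 `model.blocks` 接口、被剪枝层为连续区段、且每个 block 以线性层 `in_proj` 开头,均为演示用假设。

```python
import torch

@torch.no_grad()
def magnitude_compensate(model, calib_batch, prune_ids):
    # 在一小批校准数据上,逐层记录隐藏状态的平均范数
    hidden = model.embed(calib_batch)                  # [B, T, D]
    norms = [hidden.norm(dim=-1).mean().item()]
    for block in model.blocks:
        hidden = block(hidden)
        norms.append(hidden.norm(dim=-1).mean().item())

    # 估计被剪枝区段造成的幅值差距:离开区段处与进入区段处的范数之比
    first, last = min(prune_ids), max(prune_ids)
    gap = norms[last + 1] / norms[first]

    # 离线移除这些层,并把断点后第一个保留层的输入投影权重放大 gap 倍,
    # 等价于恢复其输入隐藏状态的期望尺度,推理时零额外开销
    model.blocks = torch.nn.ModuleList(
        b for i, b in enumerate(model.blocks) if i not in set(prune_ids))
    model.blocks[first].in_proj.weight.mul_(gap)
    return model
```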

[NLP-31] Exploring the Impact of Instruction-Tuning on LLMs' Susceptibility to Misinformation ACL2025

【速读】: 该论文试图解决的问题是:指令微调(instruction-tuning)对大型语言模型(Large Language Models, LLMs)接受错误信息(misinformation)的倾向性影响,即指令微调是否会导致模型更依赖用户输入,从而增加其生成幻觉或传播虚假信息的风险。解决方案的关键在于揭示了指令微调显著增强了模型对用户提供的信息的依赖性,使模型从原本基于自身参数知识的判断转向更易接受用户输入内容,甚至在存在明显矛盾时也倾向于采纳用户提供的错误信息,这一现象被称为“角色转移”——从助理角色向用户角色的脆弱性迁移。研究进一步识别出提示结构中用户角色、错误信息长度及系统提示中的警告等关键因素对模型敏感性的影响,强调需通过系统性方法缓解指令微调带来的副作用,以提升LLMs在现实场景下的可靠性与安全性。

链接: https://arxiv.org/abs/2507.18203
作者: Kyubeen Han,Junseo Jang,Hongjin Kim,Geunyeong Jeong,Harksoo Kim
机构: Konkuk University (韩国建国大学); ETRI (韩国电子通信研究院)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Main Accepted

点击查看摘要

Abstract:Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model's dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradicts their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLM's susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.
zh

[NLP-32] Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection ACL

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统因依赖外部知识库而面临的对抗性攻击问题,即攻击者通过向知识库中注入恶意文档(poisoned documents)来诱导大语言模型(Large Language Models, LLMs)生成有害或误导性内容。解决方案的关键在于提出一种基于梯度的掩码token概率检测方法(Gradient-based Masked Token Probability, GMTP):首先利用检索器相似度函数的梯度识别出对生成结果影响最大的关键token,随后对这些token进行掩码处理,并借助掩码语言模型(Masked Language Model, MLM)评估其概率分布;由于恶意注入的token通常表现出显著偏低的掩码概率,GMTP能够高精度地识别并过滤掉此类恶意文档,同时保留正常相关文档,从而在多种数据集和对抗场景下保持可靠的检索与生成性能。

链接: https://arxiv.org/abs/2507.18202
作者: San Kim,Jonghwi Kim,Yejin Jeon,Gary Geunbae Lee
机构: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea (韩国浦项科技大学人工智能研究生院); Department of Computer Science and Engineering, POSTECH, Republic of Korea (韩国浦项科技大学计算机科学与工程系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, accepted to ACL Findings 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk, attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
zh
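
下面是 GMTP 三个步骤(梯度定位关键 token → 掩码 → 用 MLM 检查掩码概率)的最小示意,为简化起见用同一个 bert-base 同时充当双塔检索器与掩码语言模型;top-k 大小与判定阈值均为演示用假设,并非论文调优值。

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
retriever = AutoModel.from_pretrained("bert-base-uncased")       # 充当检索器
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # 充当 MLM

def gmtp_score(query: str, doc: str, k: int = 5) -> float:
    q = tok(query, return_tensors="pt")
    d = tok(doc, return_tensors="pt", truncation=True)

    # 第一步:对“查询-文档相似度”关于文档 token 嵌入求梯度,定位高影响 token
    with torch.no_grad():
        q_vec = retriever(**q).last_hidden_state[:, 0]            # [1, D]
    emb = retriever.embeddings.word_embeddings(d["input_ids"]).detach()
    emb.requires_grad_(True)
    d_vec = retriever(inputs_embeds=emb,
                      attention_mask=d["attention_mask"]).last_hidden_state[:, 0]
    (q_vec * d_vec).sum().backward()
    top = emb.grad.norm(dim=-1).squeeze(0).topk(k).indices        # 高影响 token

    # 第二步:掩码这些 token,读出 MLM 赋予原 token 的概率
    masked = d["input_ids"].clone()
    masked[0, top] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=masked,
                     attention_mask=d["attention_mask"]).logits
    probs = logits.softmax(-1)[0, top]
    orig = d["input_ids"][0, top]

    # 注入的恶意 token 掩码概率通常明显偏低:平均分过低即判为可疑文档
    return probs[torch.arange(k), orig].mean().item()
```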

[NLP-33] Integrating an ISO30401-compliant Knowledge management system with existing business processes of an organization

【速读】: 该论文旨在解决组织在实施ISO30401标准时,如何将知识管理(Knowledge Management, KM)活动——包括知识的开发、转化与传递——有效整合到现有运营流程中的问题。解决方案的关键在于利用SECI模型(Socialization, Externalization, Combination, Internalization)的机制,通过PDCA(Plan-Do-Check-Act)循环的步骤,将知识管理嵌入集成管理体系(Integrated Management System, IMS)的各个过程之中,从而实现知识流与业务流程的协同优化。

链接: https://arxiv.org/abs/2507.18197
作者: Aline Belloni,Patrick Prieur
机构: 未知
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: in French language. AGeCSO2025 : 18ème Colloque International de l'Association pour la Gestion des Connaissances dans la Société et les Organisations, Association pour la Gestion des Connaissances dans la Société et les Organisations (AGECSO), Jun 2025, TROYES, France

点击查看摘要

Abstract:Business process modeling is used by most organizations as an essential framework for ensuring efficiency and effectiveness of the work and workflow performed by its employees and for ensuring the alignment of such work with its strategic goals. For organizations that are compliant or near-compliant with ISO 9001, this approach involves the detailed mapping of processes, sub-processes, activities, and tasks. ISO30401 is a Management System Standard, introduced in 2018, establishing universal requirements for the set up of a Knowledge Management System in an organization. As “ISO30401 implementers” we regularly face the challenge of explaining to our clients how the knowledge development, transformation and conveyance activities depicted in ISO30401 integrate with existing operational processes. This article recaps process modelling principles in the context of ISO9001 and explores, based on our experience, how an ISO30401-compliant Knowledge Management System (KMS) entwines with all other processes of an Integrated Management System and in particular how it can be implemented by deploying the mechanisms of the SECI model through the steps of PDCA cycles.
zh

[NLP-34] TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

【速读】: 该论文旨在解决电信网络中根因分析(Root Cause Analysis, RCA)的难题,该问题因复杂的图结构推理需求和缺乏真实场景基准数据而难以通过人工智能(Artificial Intelligence, AI)有效处理。解决方案的关键在于构建一个能够模拟真实电信网络故障场景的基准平台,并设计基于图神经网络(Graph Neural Network, GNN)的智能推理模型,以实现对故障传播路径的精准定位与根因识别。

链接: https://arxiv.org/abs/2507.18190
作者: Keyu Wu,Qianjin Yu,Manlin Mei,Ruiting Liu,Jun Wang,Kailai Zhang,Yelun Bao
机构: 1: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 2: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.
zh

[NLP-35] SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多项选择题任务中因选项位置或标签固有偏差而获得虚高分数的问题,即模型可能通过利用统计偏倚而非真实理解来得分。其解决方案的关键在于提出SCOPE评估框架:首先通过重复调用无语义内容的空提示(null prompt)估计模型特有的位置偏倚分布;随后根据逆偏倚分布重新分配答案槽位,从而均衡“幸运率”(即随机选中正确答案的概率);同时避免语义相似的干扰项与正确答案相邻,以消除基于表面邻近性的误判。该方法实现了数据无关的偏倚测量与校正,显著提升了LLM评估的公平性与可靠性。

链接: https://arxiv.org/abs/2507.18182
作者: Wonjun Jeong,Dongseok Kim,Taegkeun Whangbo
机构: Gachon University (嘉泉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
zh
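
下面用伪接口演示 SCOPE 的两步:先用无语义内容的空提示估计模型的位置偏倚分布,再按逆偏倚分布重新放置正确答案槽位;其中 `choose(prompt, options)` 为假设的模型调用接口(返回被选中的选项下标),空提示形式与采样次数均为演示设定。

```python
import random
from collections import Counter

def estimate_position_bias(choose, n_options=4, trials=200):
    # 反复用无语义内容的空提示提问:残留的任何偏好即模型的位置偏倚
    null_q = "Question: ?\nOptions: " + " ".join(
        f"({chr(65 + i)}) ." for i in range(n_options))
    counts = Counter(choose(null_q, ["."] * n_options) for _ in range(trials))
    return [counts[i] / trials for i in range(n_options)]

def place_answer(bias):
    # 按逆偏倚分布放置正确答案槽位,使各位置的“幸运率”被拉平
    inv = [1.0 / max(b, 1e-6) for b in bias]
    z = sum(inv)
    return random.choices(range(len(bias)), weights=[w / z for w in inv])[0]
```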

[NLP-36] Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models ACL2025

【速读】: 该论文旨在解决Transformer-based文本嵌入模型中存在的一种异常现象——“粘性标记”(sticky tokens)对嵌入可靠性造成的威胁。这些标记在重复插入句子时会将语义相似度拉向特定值,破坏嵌入距离的正常分布并显著降低下游任务性能。其解决方案的关键在于提出一种高效的检测方法——粘性标记检测器(Sticky Token Detector, STD),该方法基于句子和标记过滤策略,能够系统识别出异常标记;通过在14个模型家族共40个检查点上的应用,发现868个粘性标记,并揭示其主要来源于词汇表中的特殊或未使用条目及多语言语料中的碎片化子词,从而为改进分词策略和模型设计提供了实证依据与方向。

链接: https://arxiv.org/abs/2507.18171
作者: Kexin Chen,Dongxia Wang,Yi Liu,Haonan Zhang,Wenhai Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 main

点击查看摘要

Abstract:Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising ‘sticky tokens’ can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
zh
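
下面给出“粘性标记”现象的一个最小复现示意(基于 sentence-transformers;编码器、探针句子与候选 token 均为演示用假设,论文中的 STD 还包含句子与 token 过滤等步骤):

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_sim(sentences):
    embs = model.encode(sentences, convert_to_tensor=True)
    pairs = list(combinations(range(len(sentences)), 2))
    return sum(float(cos_sim(embs[i], embs[j])) for i, j in pairs) / len(pairs)

probes = ["The cat sat on the mat.",
          "Quarterly revenue grew by nine percent.",
          "Photosynthesis converts light into chemical energy."]
token = "lucrarea"  # 假设的可疑 token,仅作演示

before = mean_pairwise_sim(probes)
after = mean_pairwise_sim([f"{token} {token} {s}" for s in probes])
# 粘性标记会把句间相似度“拉向”某个特定值:插入后均值明显漂移、方差塌缩
print(f"插入前={before:.3f} 插入后={after:.3f}")
```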

[NLP-37] HIVMedQA: Benchmarking large language models for HIV medical decision support

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在人类免疫缺陷病毒(HIV)管理中的应用潜力与实际性能之间存在的差距问题,特别是在临床决策支持场景下,如何评估LLMs在知识准确性、推理能力、偏见控制及潜在危害等方面的表现。其解决方案的关键在于构建了一个名为HIVMedQA的专门基准测试集,该数据集由感染病专科医生参与设计,涵盖临床相关且具有挑战性的开放式医学问答问题,并采用提示工程(prompt engineering)提升模型表现;同时引入基于“大模型作为裁判”(LLM-as-a-judge)的评估框架,结合词汇相似性指标,更全面地衡量模型在关键维度上的性能,从而为安全、有效整合LLMs至HIV临床实践中提供实证依据和改进方向。

链接: https://arxiv.org/abs/2507.18143
作者: Gonzalo Cardenal Antolin,Jacques Fellay,Bashkim Jaha,Roger Kouyos,Niko Beerenwinkel,Diane Duroux
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.
zh

[NLP-38] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

【速读】: 该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在基于代码的视觉操作能力评估方面存在的空白问题,即现有评测主要聚焦于文本推理输出,而忽略了模型通过代码执行精确视觉操作的能力。其解决方案的关键在于提出一个系统性的评估框架,专注于两个核心维度:(1)多模态代码生成(Multi-modal Code Generation, MCG),衡量模型从零开始准确理解并构建可视化内容的能力;(2)多模态代码编辑(Multi-modal Code Editing, MCE),评估模型在细粒度操作上的表现,包括删除(Deletion)、修改(Modification)和注释(Annotation)三类任务。该框架基于涵盖五类常见数学图形的数据集进行实验验证,揭示了当前主流MLLMs在精细视觉操作上仍显著落后于人类水平。

链接: https://arxiv.org/abs/2507.18140
作者: Xiaoyuan Li,Moxin Li,Wenjie Wang,Rui Men,Yichang Zhang,Fuli Feng,Dayiheng Liu,Junyang Lin
机构: University of Science and Technology of China(中国科学技术大学); Alibaba Group(阿里巴巴集团); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM's code-based capabilities in multi-modal mathematical reasoning. Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
zh

[NLP-39] GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

【速读】: 该论文旨在解决当前端到端语音语言模型(Spoken Language Models, SLMs)仅关注语言语义内容、忽视语音中蕴含的副语言特征(paralinguistic cues)和说话人特性(speaker characteristics)的问题,如方言、年龄、情绪及非语音发声等。其解决方案的关键在于提出GOAT-SLM,一种具备副语言与说话人意识的新型语音语言模型,采用双模态头架构(dual-modality head architecture),将语言建模与声学实现解耦,从而在保持强大语言理解能力的同时支持富有表现力和适应性的语音生成;此外,通过基于大规模语音-文本语料库的模块化分阶段训练策略,逐步对齐语言、副语言和说话人特征信息,显著提升了模型在多维评估基准TELEVAL上的综合性能,尤其在情绪识别、方言差异处理和年龄敏感交互等非语义任务上优于现有开源模型。

链接: https://arxiv.org/abs/2507.18119
作者: Hongjie Chen,Zehan Li,Yaodong Song,Wenming Deng,Yitong Yao,Yuxin Zhang,Hang Lv,Xuechao Zhu,Jian Kang,Jie Lian,Jie Li,Chao Wang,Shuangyong Song,Yongxiang Li,Zhongjiang He
机构: Institute of Artificial Intelligence (TeleAI), China Telecom, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.
zh

[NLP-40] Agentic AI framework for End-to-End Medical Data Inference

【速读】: 该论文旨在解决医疗领域机器学习(Machine Learning, ML)解决方案构建与部署过程中存在的高成本、高人力投入问题,主要源于预处理流程碎片化、模型兼容性差以及严格的数据隐私约束。其核心解决方案是提出一种基于代理的智能框架(Agentic AI framework),通过一系列模块化、任务特定的智能体(agents)实现从数据摄取到推理的全流程自动化。关键在于:各智能体协同完成数据类型识别、匿名化、特征提取(结构化数据采用嵌入式方法,非结构化图像数据采用多阶段MedGemma方法)、模型匹配、定制化预处理及可解释性输出生成(如SHAP、LIME和DETR注意力图),从而显著减少人工干预,提升AI在临床场景中的可扩展性和效率。

链接: https://arxiv.org/abs/2507.18115
作者: Soorya Ram Shimgekar,Shayan Vassef,Abhay Goyal,Navin Kumar,Koustuv Saha
机构: University of Illinois - Urbana Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Illinois - Chicago (伊利诺伊大学芝加哥分校); Missouri S&T (密苏里科技大学); Nimblemind.ai (Nimblemind.ai)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 10 pages, 5 figures, 2 tables, BIBM conference

点击查看摘要

Abstract:Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent ensuring privacy compliance, where we first identify the data type and then anonymize it. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the "Model Inference Agent" runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.
zh

[NLP-41] A New Pair of GloVes

【速读】: 该论文旨在解决2014年发布的GloVe词向量模型因语言演变和文化变迁而逐渐失去时效性的问题,同时弥补原始模型在训练数据版本与预处理流程上缺乏清晰文档记录的缺陷。解决方案的关键在于构建并公开发布新的2024年英语GloVe模型,其基于Wikipedia、Gigaword及Dolma子集等最新语料进行训练,并详细记录了数据来源与预处理步骤;评估结果表明,新模型不仅保留了原有结构任务(如类比和相似度)的性能,还在新兴的、具有时间依赖性的命名实体识别(Named Entity Recognition, NER)任务中,特别是在非西方新闻数据上表现出显著提升。

链接: https://arxiv.org/abs/2507.18103
作者: Riley Carlson,John Bauer,Christopher D. Manning
机构: Stanford NLP Group (斯坦福自然语言处理组); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.
zh

[NLP-42] Hybrid and Unitary Fine-Tuning of Large Language Models : Methods and Benchmarking under Resource Constraints

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)微调过程中因模型规模庞大而导致的计算瓶颈与内存消耗过高的问题。其核心解决方案是提出一种新型混合参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)策略,关键在于动态融合BOFT(Block Orthogonal Fine-Tuning)的正交稳定性与LoRA-GA(Low-Rank Adaptation with Gradient Alignment)的梯度对齐快速收敛特性;通过基于梯度范数计算每层自适应更新,实现了在多种任务上的高效收敛与良好泛化能力,同时显著降低训练时间和内存占用,接近全量微调性能但资源消耗减少最多达2.1倍和50%。

链接: https://arxiv.org/abs/2507.18076
作者: Haomin Qi,Zihan Dai,Chengbo Huang
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures and 1 table

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT’s orthogonal stability with LoRA-GA’s gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Empirical evaluations on four benchmarks – GLUE, GSM8K, MT-Bench, and HumanEval – using models ranging from 7B to 405B parameters demonstrate that our hybrid method consistently outperforms individual PEFT baselines, approaching full fine-tuning accuracy while reducing resource consumption by up to 2.1 times in training time and 50 percent in memory usage. These findings establish the hybrid approach as a practical and scalable fine-tuning solution for real-world deployment of LLMs under resource constraints.
zh
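
下面是“按层梯度范数在 BOFT 与 LoRA-GA 两个分支间自适应分配更新”这一思路的示意片段;`layer.lora_scale`、`layer.boft_scale` 属性为演示假设(假定每层已同时挂载两种适配器),真实的调度规则以论文为准。

```python
def mix_adapter_updates(layers):
    # 每层梯度范数:越大说明该层当前更新需求越强
    norms = [float(sum(p.grad.norm() for p in l.parameters()
                       if p.grad is not None)) for l in layers]
    hi = max(norms) + 1e-8
    for layer, g in zip(layers, norms):
        alpha = g / hi
        layer.lora_scale = alpha        # 梯度大 -> 偏向 LoRA-GA 的快速收敛分支
        layer.boft_scale = 1.0 - alpha  # 梯度小 -> 偏向 BOFT 的正交稳定分支
```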

[NLP-43] Group Sequence Policy Optimization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练过程中存在的稳定性差、效率低以及基础设施复杂等问题,尤其针对Mixture-of-Experts(MoE)架构下RL训练的不稳定性。其解决方案的关键在于提出Group Sequence Policy Optimization(GSPO)算法,该算法摒弃了传统基于token-level重要性比率的方法,转而采用基于序列似然(sequence likelihood)的重要性比率,并实施序列级别的裁剪(clipping)、奖励(rewarding)与优化策略,从而显著提升了训练稳定性和效率,同时简化了RL系统的设计复杂度。

链接: https://arxiv.org/abs/2507.18071
作者: Chujie Zheng,Shixuan Liu,Mingze Li,Xiong-Hui Chen,Bowen Yu,Chang Gao,Kai Dang,Yuqiong Liu,Rui Men,An Yang,Jingren Zhou,Junyang Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
zh
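
下面依据摘要中“基于序列似然定义重要性比率并做序列级裁剪”的描述,给出 GSPO 目标函数的一个最小示意(长度归一化方式与裁剪范围为演示假设,细节可能与正式算法不同):

```python
import torch

def gspo_loss(logp_new, logp_old, mask, advantages, eps=0.2):
    """logp_new/logp_old: [B, T] 逐 token 对数概率;mask: [B, T] 响应 token 为 1;
    advantages: [B] 组内归一化后的序列级优势。"""
    lengths = mask.sum(-1).clamp(min=1)
    # 序列似然比,按长度做几何平均归一化,避免长短序列尺度不一致
    log_ratio = ((logp_new - logp_old) * mask).sum(-1) / lengths
    ratio = log_ratio.exp()                               # [B],序列级比率
    # 序列级裁剪:与 PPO 同形,但作用在整条响应而非单个 token 上
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```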

[NLP-44] TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

【速读】: 该论文旨在解决现有评估基准主要关注模型在复杂任务上的表现,而忽视了用户在真实对话场景中自然交互需求的问题。其解决方案的关键在于提出一个名为TELEVAL的动态评估基准,该基准专为中文语境下的对话型语音语言模型(Speech Language Models, SLMs)设计,从显性语义、副语言特征与隐性语义以及系统能力三个维度进行评价,并采用贴近实际使用的对话格式,分别评估文本和音频输出。特别强调模型对用户话语中隐含线索的提取与无需额外指令即可恰当响应的能力,从而更真实地反映用户体验并推动更具对话能力的SLMs发展。

链接: https://arxiv.org/abs/2507.18061
作者: Zehan Li,Hongjie Chen,Yuxin Zhang,Jing Zhou,Xuening Wang,Hang Lv,Mengjie Du,Yaodong Song,Jie Lian,Jian Kang,Jie Li,Yongxiang Li,Zhongjiang He,Xuelong Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs’ effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model’s ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.
zh

[NLP-45] Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLM s

【速读】: 该论文旨在解决生成式 AI(Generative AI)在文本合成数据中面临的多样性不足与隐私风险问题。当前,大型语言模型(Large Language Models, LLMs)生成的合成数据虽具成本低、可扩展性强的优势,但其在语言表达多样性、情感分布及用户视角覆盖等方面存在显著局限,同时存在重识别风险和风格异常点等隐私隐患。解决方案的关键在于提出一套系统性的量化评估指标体系,用于全面衡量合成数据的多样性(如语言表达、情感倾向和用户视角)与隐私性(如重识别可能性与风格异常),并基于评估结果设计了一种基于提示(prompt-based)的方法,在提升合成评论多样性的同时有效保护用户隐私。

链接: https://arxiv.org/abs/2507.18055
作者: Tevin Atwal,Chan Nam Tieu,Yefeng Yuan,Zhan Shi,Yuhong Liu,Liang Cheng
机构: Santa Clara University (圣克拉拉大学); eBay (eBay)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs’ capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.
zh
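
下面示意摘要中提到的两类多样性度量的常见实现:distinct-n(语言表达多样性)与情感分布熵;具体公式以论文为准,此处使用 NLTK 的 VADER 作情感判定,仅为演示。

```python
import math
from collections import Counter
from nltk.sentiment import SentimentIntensityAnalyzer  # 需先 nltk.download("vader_lexicon")

def distinct_n(texts, n=2):
    # 去重 n-gram 占比:越高说明语言表达越多样
    grams = Counter()
    for t in texts:
        toks = t.split()
        grams.update(zip(*(toks[i:] for i in range(n))))
    total = sum(grams.values())
    return len(grams) / total if total else 0.0

def sentiment_entropy(texts):
    # 情感标签分布的熵:越高说明情感倾向越分散
    sia = SentimentIntensityAnalyzer()
    labels = Counter("pos" if sia.polarity_scores(t)["compound"] >= 0 else "neg"
                     for t in texts)
    n = sum(labels.values())
    return -sum(c / n * math.log2(c / n) for c in labels.values())
```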

[NLP-46] RECALLED: An Unbounded Resource Consumption Attack on Large Vision-Language Models

【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在部署过程中面临的资源消耗攻击(Resource Consumption Attacks, RCAs)问题,尤其是现有红队测试研究普遍忽视视觉输入作为攻击面所导致的防御不足。其解决方案的关键在于提出RECALLED框架,通过视觉引导优化(Vision Guided Optimization)生成像素级的输出回忆对抗扰动(Output Recall adversarial perturbations),诱导模型产生无限循环输出以触发无界资源消耗;同时引入多目标并行损失函数(Multi-Objective Parallel Losses)生成通用攻击模板并解决多目标优化冲突,从而实现高效且稳定的视觉模态驱动的RCAs红队测试。

链接: https://arxiv.org/abs/2507.18053
作者: Haoran Gao,Yuanhe Zhang,Zhenhong Zhou,Lei Jiang,Fanyu Meng,Yujia Xiao,Kun Wang,Yang Liu,Junlan Feng
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (REsource Consumption Attack on Large Vision-LanguagE moDels), the first approach for exploiting visual modalities to trigger unbounded RCAs in red-teaming. First, we present Vision Guided Optimization, a fine-grained pixel-level optimization, to obtain Output Recall adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Additionally, we introduce Multi-Objective Parallel Losses to generate universal attack templates and resolve optimization conflicts when intending to implement parallel attacks. Empirical results demonstrate that RECALLED increases service response latency by over 26×, resulting in an additional 20% increase in GPU utilization and memory consumption. Our study exposes security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate future defense development against RCAs.
zh

[NLP-47] Synthetic Data Generation for Phrase Break Prediction with Large Language Model INTERSPEECH2025

【速读】: 该论文旨在解决语音合成系统中短语断点预测(phrase break prediction)任务面临的标注数据稀缺与质量不一致问题,这些问题通常依赖大量人工标注的音频或文本数据,导致成本高昂且难以标准化。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成合成的短语断点标注数据,从而减少对人工标注的依赖,并提升跨语言场景下的泛化能力与标注一致性。实验表明,LLM生成的合成数据在多个语言上均能有效缓解数据挑战,展现出作为语音领域可行解决方案的巨大潜力。

链接: https://arxiv.org/abs/2507.18044
作者: Hoyeon Lee,Sejung Son,Ye-Eun Kang,Jong-Hwan Kim
机构: Naver Corporation (纳维亚公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.
zh

[NLP-48] GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

【速读】: 该论文旨在解决现有推理时控制(inference-time steering)方法在大型语言模型(LLMs)和视觉-语言模型(VLMs)中因依赖固定全局干预向量、忽视单个输入token的因果影响,以及未能利用模型logits中的信息梯度而导致的控制精度不足问题。其解决方案的关键在于提出GrAInS(Gradient-based Attribution for Inference-time Steering),该方法通过集成梯度(Integrated Gradients)进行对比性梯度归因,识别出对偏好输出贡献最大(正向)与最差输出贡献最大(负向)的top-k token,并据此构建具有语义指向性的引导向量;在推理阶段,基于token级归因信号调整Transformer层的隐藏激活,并通过归一化保持表征尺度不变,从而实现无需微调或辅助监督的细粒度、可解释且模块化的模型行为调控。

链接: https://arxiv.org/abs/2507.18043
作者: Duy Nguyen,Archiki Prasad,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages. Code: this https URL

点击查看摘要

Abstract:Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model’s logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model’s fluency and general capabilities.
zh
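
下面示意 GrAInS 推理阶段的两步:由带符号的 token 归因分数构造方向性引导向量,并在注入后重新归一化以保持表征尺度;假设积分梯度归因已事先算好,层位置与强度 alpha 为演示设定。

```python
import torch

def build_steering_vector(hidden, attributions, k=5):
    # hidden: [T, D] 某层激活;attributions: [T] 带符号的积分梯度归因分数
    pos = hidden[attributions.topk(k).indices].mean(0)      # 正贡献 token
    neg = hidden[(-attributions).topk(k).indices].mean(0)   # 负贡献 token
    return pos - neg          # 从“不期望行为”指向“期望行为”的方向

def steer(hidden, v, alpha=4.0):
    norm = hidden.norm(dim=-1, keepdim=True)
    out = hidden + alpha * v  # 沿引导向量平移激活
    # 归一化回原有范数,保持表征尺度不变
    return out * (norm / out.norm(dim=-1, keepdim=True))
```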

[NLP-49] NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

【速读】: 该论文旨在解决大规模知识编辑对大语言模型(Large Language Models, LLMs)通用能力造成损害及编辑事实遗忘的问题,尤其是在进行数千乃至数万条知识修改时的稳定性与有效性挑战。其解决方案的关键在于将现有线性编辑方法建模为对键值(Key-Value, KV)数据库的查询,并提出 NeuralDB 框架——该框架显式地将编辑后的事实表示为一个神经 KV 数据库,并引入非线性门控检索模块,在推理过程中仅在涉及已编辑事实时激活该模块,从而有效保留模型原有的通用能力。实验表明,NeuralDB 在 ZsRE 和 CounterFacts 数据集上成功实现了 10,000 条事实的高效编辑,且性能稳定;进一步扩展至 100,000 条事实(相较之前工作提升 50 倍)仍保持优异效果。

链接: https://arxiv.org/abs/2507.18028
作者: Weizhi Fei,Hao Shi,Jing Xu,Jingchen Peng,Jiazheng Li,Jingzhao Zhang,Bo Bai,Wei Han,Zhenyuan Chen,Xueyan Niu
机构: Tsinghua University (清华大学); Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).
zh
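
下面给出“神经 KV 数据库 + 非线性门控检索”的最小示意:仅当查询与某条已编辑事实足够匹配时门控才开启,否则保留基础模型输出;维度、阈值与键值初始化均为演示假设,非论文设置。

```python
import torch
import torch.nn as nn

class NeuralKVDB(nn.Module):
    def __init__(self, n_facts, d_key, d_val, tau=0.8):
        super().__init__()
        self.K = nn.Parameter(torch.randn(n_facts, d_key))  # 每条编辑事实一个键
        self.V = nn.Parameter(torch.randn(n_facts, d_val))  # 对应的值(新知识)
        self.tau = tau

    def forward(self, h, base_out):
        # h: [B, d_key] 查询隐藏状态;base_out: [B, d_val] 基础模型原输出
        scores = torch.softmax(h @ self.K.t(), dim=-1)      # [B, n_facts]
        conf, idx = scores.max(-1)
        retrieved = self.V[idx]                             # [B, d_val]
        # 非线性门控:只有与某条编辑事实足够匹配时才覆盖,
        # 无关输入一律走基础模型,从而保住通用能力
        gate = (conf > self.tau).float().unsqueeze(-1)
        return gate * retrieved + (1 - gate) * base_out
```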

[NLP-50] Technical Report of TeleChat2, TeleChat2.5 and T1

【速读】: 该论文旨在解决大语言模型在推理能力、代码生成与数学推理任务中的性能瓶颈问题,同时兼顾推理速度与通用任务表现的平衡。解决方案的关键在于通过优化训练策略而非改变模型架构,实现性能显著提升:首先,TeleChat2基于10万亿高质量token进行预训练,并结合监督微调(Supervised Fine-Tuning, SFT)和直接偏好优化(Direct Preference Optimization, DPO)增强基础能力;随后,TeleChat2.5与T1引入持续预训练(continual pretraining)阶段,融合领域特定数据集与强化学习(Reinforcement Learning, RL),其中T1聚焦于复杂推理任务(如长链式思维Chain-of-Thought, CoT),而TeleChat2.5则侧重推理速度优化。二者均采用115B参数的密集Transformer架构,在多项基准测试中超越了部分闭源模型(如GPT-4o),体现了训练策略创新对模型性能跃升的核心作用。

链接: https://arxiv.org/abs/2507.18013
作者: Zihan Wang,Xinzhang Liu,Yitong Yao,Chao Wang,Yu Zhao,Zhihao Yang,Wenmin Deng,Kaipeng Jia,Jiaxin Peng,Yuyao Huang,Sishi Xiong,Zhuo Jiang,Kaidong Yu,Xiaohui Hu,Fubei Yao,Ruiyu Fang,Zhuoru Jiang,Ruiting Song,Qiyi Xie,Rui Xue,Xuewei He,Yanlei Xue,Zhu Yuan,Zhaoxi Zhang,Zilu Huang,Shiquan Wang,Xin Wang,Hanming Wu,Mingyuan Wang,Xufeng Zhan,Yuhan Sun,Zhaohu Xing,Yuhao Jiang,Bingkai Yang,Shuangyong Song,Yongxiang Li,Zhongjiang He,Xuelong Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 32 pages, 5 figures

点击查看摘要

Abstract:We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models of T1 and TeleChat2.5 are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI's o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.
zh

[NLP-51] GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

【速读】: 该论文旨在解决当前先进视觉-语言生成模型(如CoCa)在架构复杂度上落后于大型语言模型(LLMs)的问题,从而限制了其在多模态任务中的性能和泛化能力。解决方案的关键在于对CoCa模型进行结构优化,具体包括:在文本解码器和视觉Transformer(ViT)编码器中引入高斯误差门控线性单元(Gaussian error gated linear units)、均方根归一化(root mean squared normalization)以及旋转位置嵌入(rotary positional embedding),这些改进均已在LLMs中被证明能提升性能,但此前未应用于CoCa。实验表明,改进后的GRR-CoCa模型在预训练和多种下游微调任务中均显著优于基线模型,验证了架构升级对跨视觉-语言领域性能与泛化能力的提升作用。

链接: https://arxiv.org/abs/2507.18009
作者: Jake R. Patock,Nicole Catherine Lewis,Kevin McCoy,Christina Gomez,Canling Chen,Lorenzo Luzi
机构: Rice University (莱斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa’s original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa’s modified architecture improves performance and generalization across vision-language domains.
zh

[NLP-52] Natural Language Processing for Tigrinya: Current State and Future Directions

【速读】: 该论文旨在解决提格里尼亚语(Tigrinya)在自然语言处理(Natural Language Processing, NLP)研究中严重资源匮乏与研究进展滞后的问题。其核心挑战在于该语言的高复杂形态学特征及可用计算资源稀缺,导致传统方法难以有效建模。解决方案的关键在于系统性梳理2011至2025年间超过40项相关研究,揭示从基于规则的早期系统向现代神经网络架构演进的技术轨迹,并强调通过关键资源建设里程碑(如标注数据集、预训练模型)推动性能提升。同时,论文提出 morphology-aware modeling(形态感知建模)、cross-lingual transfer(跨语言迁移)和 community-centered resource development(以社区为中心的资源开发)作为未来突破方向,为提格里尼亚语NLP提供可操作的研究路线图。

链接: https://arxiv.org/abs/2507.17974
作者: Fitsum Gaim,Jong C. Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 40 studies spanning more than a decade of work from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across ten distinct downstream tasks, including morphological processing, machine translation, speech recognition, and question-answering. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently unlocked by resource creation milestones. We identify key challenges rooted in Tigrinya's morphological complexity and resource scarcity, while highlighting promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves as both a comprehensive reference for researchers and a roadmap for advancing Tigrinya NLP. A curated metadata of the surveyed studies and resources is made publicly available (Tigrinya NLP Anthology: this https URL).
zh

[NLP-53] Are LLM Belief Updates Consistent with Bayes' Theorem? ICML2025

【速读】: 该论文试图解决的问题是:更大、更强大的预训练语言模型在面对上下文中的证据时,是否能更一致地按照贝叶斯定理(Bayes’ theorem)更新其对命题的“信念”(即概率估计)。解决方案的关键在于提出了一种新的量化指标——贝叶斯一致性系数(Bayesian Coherence Coefficient, BCC),并构建了一个专门用于测量该系数的数据集。通过在五个模型家族的多个仅预训练语言模型上测量BCC,并将其与模型参数量、训练数据量及常见基准测试得分进行对比,研究发现模型规模和能力越强,其信念更新越符合贝叶斯一致性,从而验证了假设。

链接: https://arxiv.org/abs/2507.17951
作者: Sohaib Imran,Ihor Kendiukhov,Matthew Broerman,Aditya Thomas,Riccardo Campanella,Rob Lamb,Peter M. Atkinson
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the ICML 2025 Workshop on Assessing World Models

点击查看摘要

Abstract:Do larger and more capable language models learn to update their “beliefs” about propositions more consistently with Bayes’ theorem when presented with evidence in-context? To test this, we formulate a Bayesian Coherence Coefficient (BCC) metric and generate a dataset with which to measure the BCC. We measure BCC for multiple pre-trained-only language models across five model families, comparing against the number of model parameters, the amount of training data, and model scores on common benchmarks. Our results provide evidence for our hypothesis that larger and more capable pre-trained language models assign credences that are more coherent with Bayes’ theorem. These results have important implications for our understanding and governance of LLMs.
zh
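
摘要未给出 BCC 的具体公式,下面用“模型后验与贝叶斯定理推出的后验之差”作为演示性的替代量,说明这类一致性检验的计算方式;`credence(命题, context)` 为假设的模型置信度查询接口。

```python
def bayes_posterior(p_h, p_e_given_h, p_e_given_not_h):
    # 贝叶斯定理:P(h|e) = P(e|h)P(h) / P(e)
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

def coherence_gap(credence, hypothesis, evidence):
    p_h = credence(hypothesis, context="")
    p_e_h = credence(evidence, context=hypothesis)
    p_e_nh = credence(evidence, context=f"not ({hypothesis})")
    target = bayes_posterior(p_h, p_e_h, p_e_nh)       # 理性更新应得到的后验
    observed = credence(hypothesis, context=evidence)  # 模型实际给出的后验
    return abs(observed - target)  # 0 表示与贝叶斯定理完全一致
```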

[NLP-54] Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text

【速读】: 该论文旨在解决当前AI文本检测工具在面对新型大语言模型(Large Language Models, LLMs)如DeepSeek生成的文本时,其检测准确性不足且鲁棒性有限的问题,尤其关注对抗攻击(如改写和人类化处理)对检测性能的影响。解决方案的关键在于:首先,系统评估了六种主流AI检测工具对原始及经过对抗攻击的DeepSeek文本的识别能力;其次,创新性地将DeepSeek本身作为检测器,通过少样本提示(few-shot prompting)和思维链推理(chain-of-thought reasoning, CoT)进行文本分类任务,结果表明该方法在准确率上显著优于传统检测工具,最佳五样本提示下仅误判1个样本(AI召回率96%,人类文本召回率100%),验证了利用LLM自身推理能力进行检测的有效性与潜力。

链接: https://arxiv.org/abs/2507.17944
作者: Hulayyil Alshammari,Praveen Rao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have rapidly transformed the creation of written materials. LLMs have led to questions about writing integrity, thereby driving the creation of artificial intelligence (AI) detection technologies. Adversarial attacks, such as standard and humanized paraphrasing, inhibit detectors’ ability to detect machine-generated text. Previous studies have mainly focused on ChatGPT and other well-known LLMs and have shown varying accuracy across detectors. However, there is a clear gap in the literature about DeepSeek, a recently published LLM. Therefore, in this work, we investigate whether six generally accessible AI detection tools – AI Text Classifier, Content Detector AI, Copyleaks, QuillBot, GPT-2, and GPTZero – can consistently recognize text generated by DeepSeek. The detectors were exposed to the aforementioned adversarial attacks. We also considered DeepSeek as a detector by performing few-shot prompting and chain-of-thought reasoning (CoT) for classifying AI and human-written text. We collected 49 human-authored question-answer pairs from before the LLM era and generated matching responses using DeepSeek-v3, producing 49 AI-generated samples. Then, we applied adversarial techniques such as paraphrasing and humanizing to add 196 more samples. These were used to challenge detector robustness and assess accuracy impact. While QuillBot and Copyleaks showed near-perfect performance on original and paraphrased DeepSeek text, others – particularly AI Text Classifier and GPT-2 – showed inconsistent results. The most effective attack was humanization, reducing accuracy to 71% for Copyleaks, 58% for QuillBot, and 52% for GPTZero. Few-shot and CoT prompting showed high accuracy, with the best five-shot result misclassifying only one of 49 samples (AI recall 96%, human recall 100%).
zh

[NLP-55] Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation

【速读】: 该论文旨在解决生成式 AI(Generative AI)在歌词到歌曲(Lyrics-to-Song, LS2)生成模型中因训练数据记忆(memorization)所引发的版权与内容安全问题。现有研究未充分探讨此类模型对语义扰动下的音频输出一致性,即是否仍会再现训练集中已知内容。解决方案的关键在于提出一种名为对抗性音素提示(Adversarial PhoneTic Prompting, APT)的新攻击范式:通过同音替换(homophonic substitution)对歌词进行语义改写,同时保持其音素结构不变,从而触发模型对子音素级别记忆的激活。实验表明,即使歌词语义发生显著变化,SUNO 和 YuE 等模型仍能生成高度相似于原始训练音频的结果,且该现象在多语言和多流派下持续存在;更进一步,仅靠音素层面的修改即可诱导文本到视频(text-to-video)模型如Veo 3再现原视频中的视觉元素,形成“音素到视觉再生”(phonetic-to-visual regurgitation)现象。这一发现揭示了基于文本条件的多模态生成系统中潜藏的深层记忆漏洞,亟需在版权保护、内容溯源和安全性方面重新评估现代生成模型的设计边界。

链接: https://arxiv.org/abs/2507.17937
作者: Jaechul Roh,Zachary Novack,Yuefeng Peng,Niloofar Mireshghallah,Taylor Berg-Kirkpatrick,Amir Houmansadr
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of California San Diego (加州大学圣地亚哥分校); Carnegie Mellon University (卡内基梅隆大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Lyrics-to-Song (LS2) generation models promise end-to-end music synthesis from text, yet their vulnerability to training data memorization remains underexplored. We introduce Adversarial PhoneTic Prompting (APT), a novel attack where lyrics are semantically altered while preserving their acoustic structure through homophonic substitutions (e.g., Eminem's famous "mom's spaghetti" → "Bob's confetti"). Despite these distortions, we uncover a powerful form of sub-lexical memorization: models like SUNO and YuE regenerate outputs strikingly similar to known training content, achieving high similarity across audio-domain metrics, including CLAP, AudioJudge, and CoverID. This vulnerability persists across multiple languages and genres. More surprisingly, we discover that phoneme-altered lyrics alone can trigger visual memorization in text-to-video models. When prompted with phonetically modified lyrics from Lose Yourself, Veo 3 reconstructs visual elements from the original music video – including character appearance and scene composition – despite no visual cues in the prompt. We term this phenomenon phonetic-to-visual regurgitation. Together, these findings expose a critical vulnerability in transcript-conditioned multimodal generation: phonetic prompting alone can unlock memorized audiovisual content, raising urgent questions about copyright, safety, and content provenance in modern generative systems. Example generations are available on our demo page (this http URL).
zh
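
下面用 CMU 发音词典(pronouncing 库)示意同音替换的基本做法:检索发音完全相同、但拼写不同的词;歌词片段与不区分重音等细节均为演示性简化,并非论文的完整攻击流程。

```python
import pronouncing  # CMU 发音词典的封装,需 pip install pronouncing

def homophone_swap(word):
    phones = pronouncing.phones_for_word(word.lower())
    if not phones:
        return word
    # 在词典中检索发音串完全相同、但拼写不同的词
    same = [w for w in pronouncing.search("^" + phones[0] + "$")
            if w != word.lower()]
    return same[0] if same else word

line = "moms spaghetti"
print(" ".join(homophone_swap(w) for w in line.split()))
```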

[NLP-56] One Whisper to Grade Them All

【Quick Read】: This paper targets fully automatic speaking assessment (ASA) of multi-part second-language speaking tests, with a focus on efficiency and accuracy for large-scale computer-assisted language learning systems. The key to the solution is an efficient end-to-end architecture: a single Whisper-small encoder processes all four spoken responses, a lightweight aggregation module combines the information, and the final score is predicted directly, removing the transcription step and per-part models of conventional pipelines, cutting inference time, and improving practicality. With at most 168M parameters (about 70% of Whisper-small), the system reaches an RMSE of 0.384, beating the text-based baseline (0.44). A data sampling strategy is also proposed that keeps RMSE at 0.383 while training on only 44.8% of the speakers, demonstrating improved performance on imbalanced classes and strong data efficiency.

Link: https://arxiv.org/abs/2507.17918
Authors: Nhan Phan, Anusha Porwal, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo
Institutions: Aalto University
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to SLaTE 2025 workshop

Click to view abstract

Abstract:We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak Improve Challenge. Our system’s main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
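A minimal PyTorch sketch of the single-encoder-plus-aggregator idea, assuming the Hugging Face Whisper implementation; the head sizes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class HolisticScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        d = self.encoder.config.d_model  # 768 for whisper-small
        self.aggregator = nn.Sequential(nn.Linear(4 * d, d), nn.GELU(), nn.Linear(d, 1))

    def forward(self, parts):  # parts: list of 4 log-mel tensors (B, 80, 3000)
        pooled = [self.encoder(input_features=x).last_hidden_state.mean(dim=1)
                  for x in parts]                                   # 4 x (B, d)
        return self.aggregator(torch.cat(pooled, dim=-1)).squeeze(-1)  # (B,)

model = HolisticScorer()
dummy = [torch.randn(2, 80, 3000) for _ in range(4)]
print(model(dummy).shape)  # torch.Size([2])
```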
zh

[NLP-57] VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL

【Quick Read】: This paper tackles the "wrong question" vulnerability in natural-language interfaces to databases (NLIDBs): users without a background in statistical analysis may pose cognitively biased analytical questions, degrading the quality of data-driven decisions. The key to the solution is VeriMinder, an interactive system with three core innovations: (1) a contextual semantic mapping framework that identifies and localizes biases relevant to a specific analysis context; (2) an analytical framework that operationalizes the Hard-to-Vary principle to guide systematic data exploration; and (3) an optimized LLM-driven mechanism that generates high-quality, task-specific prompts via a structured process of multi-candidate generation, critic feedback, and self-reflection. Empirical testing shows clear gains in the concreteness, comprehensiveness, and accuracy of the resulting analyses, with 82.5% of participants reporting a positive impact on analysis quality.

Link: https://arxiv.org/abs/2507.17896
Authors: Shubham Mohole, Sainyam Galhotra
Institutions: Cornell University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Application systems using natural language interfaces to databases (NLIDBs) have democratized data analysis. This positive development has also brought forth an urgent challenge to help users who might use these systems without a background in statistical analysis to formulate bias-free analytical questions. Although significant research has focused on text-to-SQL generation accuracy, addressing cognitive biases in analytical questions remains underexplored. We present VeriMinder, this https URL, an interactive system for detecting and mitigating such analytical vulnerabilities. Our approach introduces three key innovations: (1) a contextual semantic mapping framework for biases relevant to specific analysis contexts (2) an analytical framework that operationalizes the Hard-to-Vary principle and guides users in systematic data analysis (3) an optimized LLM-powered system that generates high-quality, task-specific prompts using a structured process involving multiple candidates, critic feedback, and self-reflection. User testing confirms the merits of our approach. In direct user experience evaluation, 82.5% participants reported positively impacting the quality of the analysis. In comparative evaluation, VeriMinder scored significantly higher than alternative approaches, at least 20% better when considered for metrics of the analysis’s concreteness, comprehensiveness, and accuracy. Our system, implemented as a web application, is set to help users avoid “wrong question” vulnerability during data analysis. VeriMinder code base with prompts, this https URL, is available as an MIT-licensed open-source software to facilitate further research and adoption within the community.
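The multi-candidate / critic / self-reflection loop can be pictured as follows; `llm` is a placeholder for any chat-completion callable and the instructions are illustrative, not VeriMinder's actual prompts.

```python
# Schematic candidate -> critique -> rewrite loop for prompt refinement.
def refine_analysis_prompt(task: str, llm, n_candidates: int = 3, rounds: int = 2) -> str:
    candidates = [llm(f"Draft a bias-aware analytical question for: {task}")
                  for _ in range(n_candidates)]
    for _ in range(rounds):
        joined = "\n---\n".join(candidates)
        critique = llm("Critique these candidates for cognitive biases, "
                       "vagueness, and hard-to-vary reasoning:\n" + joined)
        best = llm("Rewrite the strongest candidate so it fixes the critique.\n"
                   f"Candidates:\n{joined}\nCritique:\n{critique}")
        candidates = [best]  # self-reflection narrows to one refined prompt
    return candidates[0]
```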
zh

[NLP-58] Dynamic and Generalizable Process Reward Modeling ACL2025

【Quick Read】: This paper addresses two weaknesses of existing Process Reward Models (PRMs): their reliance on heuristics in complex scenarios leads to poor cross-domain generalization, and current LLM-as-judge approaches overlook the meaningful guidance embedded in text while using static, coarse-grained evaluation criteria that cannot adapt to complex process supervision. The key to the proposed Dynamic and Generalizable Process Reward Modeling (DG-PRM) is twofold: a reward tree captures and stores fine-grained, multi-dimensional reward criteria and enables dynamic selection of step-level reward signals; and Pareto dominance estimation is adopted, for the first time, to identify discriminative positive and negative pairs, effectively handling multifaceted reward signals. Experiments show that DG-PRM delivers strong gains on mainstream benchmarks and generalizes remarkably well to out-of-distribution scenarios.

Link: https://arxiv.org/abs/2507.17849
Authors: Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang
Institutions: Fudan University; The University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 Main

Click to view abstract

Abstract:Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.
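Pareto dominance over multi-dimensional step rewards can be made concrete in a few lines of NumPy; this is our reading of the pair-selection idea, not the DG-PRM implementation.

```python
import numpy as np

def dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """a Pareto-dominates b: no worse on every criterion, better on at least one."""
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_split(rewards: np.ndarray):
    """rewards: (n_steps, n_criteria). Returns (non-dominated, dominated) ids."""
    n = len(rewards)
    dominated = {j for i in range(n) for j in range(n)
                 if i != j and dominates(rewards[i], rewards[j])}
    positives = [i for i in range(n) if i not in dominated]
    return positives, sorted(dominated)

R = np.array([[0.9, 0.8], [0.4, 0.3], [0.7, 0.9], [0.2, 0.9]])
print(pareto_split(R))  # ([0, 2], [1, 3])
```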
zh

[NLP-59] Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

【Quick Read】: This paper addresses the limited believability of LLM-simulated human behavior in online shopping environments, which stems from constrained reasoning ability. Existing approaches augment training data with LLM-synthesized rationales and apply supervised fine-tuning (SFT), so their ceiling is set by the model that generated the rationales. The key to the proposed Shop-R1 framework is decomposing human behavior simulation into two stages, rationale generation and action prediction, each with its own reward design: rationale generation is guided in a self-supervised manner by internal model signals (e.g., logit distributions), while action prediction uses a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment, jointly evaluating high-level action types and fine-grained sub-action details (attributes and values) and rewarding outputs in proportion to difficulty. Experiments show a relative improvement of over 65% compared with the baseline.

Link: https://arxiv.org/abs/2507.17842
Authors: Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang
Institutions: Michigan State University; Store Foundation AI, Amazon; Northeastern University; University of Illinois Urbana-Champaign; Northwestern University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have recently demonstrated strong potential in generating ‘believable human-like’ behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
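A toy version of a hierarchical, difficulty-aware action reward in the spirit described above; the specific weights and scaling are our own illustrative assumptions.

```python
def action_reward(pred: dict, gold: dict, difficulty: float) -> float:
    """pred/gold: {'type': str, 'attrs': dict}; difficulty in [0, 1]."""
    r = 0.0
    if pred["type"] == gold["type"]:
        r += 0.4                                 # coarse level: action type
        attrs = gold["attrs"]
        if attrs:
            hit = sum(pred["attrs"].get(k) == v for k, v in attrs.items())
            r += 0.6 * hit / len(attrs)          # fine level: sub-action details
    return r * (1.0 + difficulty)                # harder cases pay more

gold = {"type": "add_to_cart", "attrs": {"color": "red", "size": "M"}}
pred = {"type": "add_to_cart", "attrs": {"color": "red", "size": "L"}}
print(action_reward(pred, gold, difficulty=0.5))  # 0.7 * 1.5 = 1.05
```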
zh

[NLP-60] GenSelect: A Generative Approach to Best-of-N ICML

【Quick Read】: This paper addresses the trade-off between efficiency and effectiveness when scaling generative reward models at test time for reasoning tasks. Existing methods either use pointwise scoring, which under-utilizes LLMs' comparative abilities, or pairwise comparison, which scales inefficiently at larger sampling budgets. The key to the proposed GenSelect framework is letting the LLM use long reasoning to select the best of N candidate solutions, exploiting the LLM's comparative strengths while scaling efficiently with parallel sampling budgets. Experiments on math reasoning show that reasoning models such as QwQ and DeepSeek-R1-0528 excel at GenSelect, outperforming conventional scoring approaches with simple prompting.

Link: https://arxiv.org/abs/2507.17797
Authors: Shubham Toshniwal, Ivan Sorokin, Aleksander Ficek, Ivan Moshkov, Igor Gitman
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Presented at the 2nd AI for MATH Workshop @ ICML

Click to view abstract

Abstract:Generative reward models with parallel sampling have enabled effective test-time scaling for reasoning tasks. Current approaches employ pointwise scoring of individual solutions or pairwise comparisons. However, pointwise methods underutilize LLMs’ comparative abilities, while pairwise methods scale inefficiently with larger sampling budgets. We introduce GenSelect, where the LLM uses long reasoning to select the best solution among N candidates. This leverages LLMs’ comparative strengths while scaling efficiently across parallel sampling budgets. For math reasoning, we demonstrate that reasoning models, such as QwQ and DeepSeek-R1-0528, excel at GenSelect, outperforming existing scoring approaches with simple prompting.
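The selection step reduces to a single long-reasoning prompt over the N candidates. A schematic version follows, with `llm` standing in for a reasoning-model call and wording that is illustrative rather than the paper's exact template.

```python
def gen_select(problem: str, candidates: list[str], llm) -> str:
    """Ask the model to compare N solutions and return the chosen one."""
    numbered = "\n\n".join(f"[Solution {i}]\n{c}" for i, c in enumerate(candidates))
    prompt = (f"Problem:\n{problem}\n\n{numbered}\n\n"
              "Reason step by step about the correctness of each solution, "
              "then finish with a single line 'Best: <index>'.")
    reply = llm(prompt)
    idx = int(reply.rsplit("Best:", 1)[-1].strip().split()[0].strip("."))
    return candidates[idx]
```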
zh

[NLP-61] Exploring Communication Strategies for Collaborative LLM Agents in Mathematical Problem-Solving

【Quick Read】: This paper addresses the lack of systematic evaluation of how different communication strategies among LLM agents affect collaborative problem-solving efficiency in AI-aided education. The key to the solution is to design and empirically compare four communication modes, teacher-student interaction, peer-to-peer collaboration, reciprocal peer teaching, and critical debate, in a chat-based dual-agent mathematical problem-solving environment. Peer-to-peer collaboration performs best on the MATH dataset, and dialogue acts such as statements, acknowledgments, and hints play a central role in collaborative efficiency, indicating that effective communication mechanisms are key to multi-agent performance on complex educational tasks.

Link: https://arxiv.org/abs/2507.17753
Authors: Liang Zhang, Xiaoming Zhai, Jionghao Lin, Jennifer Kleiman, Diego Zapata-Rivera, Carol Forsyth, Yang Jiang, Xiangen Hu, Arthur C. Graesser
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Large Language Model (LLM) agents are increasingly utilized in AI-aided education to support tutoring and learning. Effective communication strategies among LLM agents improve collaborative problem-solving efficiency and facilitate cost-effective adoption in education. However, little research has systematically evaluated the impact of different communication strategies on agents’ problem-solving. Our study examines four communication modes, teacher-student interaction, peer-to-peer collaboration, reciprocal peer teaching, and critical debate, in a dual-agent, chat-based mathematical problem-solving environment using the OpenAI GPT-4o model. Evaluated on the MATH dataset, our results show that dual-agent setups outperform single agents, with peer-to-peer collaboration achieving the highest accuracy. Dialogue acts like statements, acknowledgment, and hints play a key role in collaborative problem-solving. While multi-agent frameworks enhance computational tasks, effective communication strategies are essential for tackling complex problems in AI education.
zh

[NLP-62] Mapping Technological Futures: Anticipatory Discourse Through Text Mining

【Quick Read】: This paper studies the societal uncertainty generated by the volatility and unpredictability of emerging technologies such as artificial intelligence, and how that uncertainty is constructed and propagated through the anticipatory discourse of key opinion leaders (KOLs) on social media. The key to the solution is applying advanced text mining techniques, including BERTopic modelling and sentiment, emotion, and attitude analyses, to 1.5 million posts published by 400 KOLs on the X platform (2021-2023), identifying 100 topics that reflect anticipated tech-driven futures and revealing how KOLs shape both "present futures" and "future presents", steering public perception and discussion of technologies' societal significance. The findings show that KOLs act as mediators of societal narratives, framing technologies as solutions to societal challenges, directing public attention, and reshaping contemporary social and geopolitical debates.

Link: https://arxiv.org/abs/2504.02853
Authors: Maciej Skorski, Alina Landowska, Krzysztof Rajda
Institutions: Czech Technical University Prague; SWPS University; Brand24
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted to Humanities and Social Sciences Communications. arXiv admin note: text overlap with arXiv:2407.17522

Click to view abstract

Abstract:The volatility and unpredictability of emerging technologies, such as artificial intelligence (AI), generate significant uncertainty, which is widely discussed on social media. This study examines anticipatory discourse surrounding technological futures by analysing 1.5 million posts from 400 key opinion leaders (KOLs) published on the X platform (from 2021 to 2023). Using advanced text mining techniques, including BERTopic modelling, sentiment, emotion, and attitude analyses, the research identifies 100 distinct topics reflecting anticipated tech-driven futures. Our findings emphasize the dual role of KOLs in framing "present futures" – optimistic visions of transformative technologies like AI and IoT – and influencing "future presents", where these projections shape contemporary societal and geopolitical debates. Positive emotions such as Hope dominate, outweighing Anxiety, particularly in topics like "Machine Learning, Data Science, and Deep Learning", while discussions around "Climate Change" and "War, Ukraine, and Trump People" elicit Anxiety. By framing technologies as solutions to societal challenges, KOLs act as mediators of societal narratives, bridging imagined futures and current realities. These insights underscore their pivotal role in directing public attention with emerging technologies during periods of heightened uncertainty, advancing our understanding of anticipatory discourse in technology-mediated contexts.
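For readers who want to reproduce the topic-modelling step, a minimal BERTopic pipeline looks like the following (toy corpus for illustration; the study analysed 1.5M posts).

```python
from bertopic import BERTopic  # pip install bertopic

base = [
    "AI and machine learning will transform data science jobs",
    "Deep learning breakthroughs accelerate drug discovery",
    "Climate change policy dominates the geopolitical agenda",
]
# Inflate the toy corpus so the clustering step has enough documents.
docs = [f"{text} (post {i})" for i, text in enumerate(base * 30)]

topic_model = BERTopic(min_topic_size=5)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())  # topic ids, sizes, keyword labels
```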
zh

[NLP-63] Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

【Quick Read】: This paper reviews the CHiME-7 and 8 distant speech recognition (DASR) challenges, which target multi-channel, generalizable, joint automatic speech recognition (ASR) and speaker diarization of conversational speech in difficult acoustic environments. The key findings are: 1) end-to-end (e2e) ASR architectures now dominate, since robust large-scale pre-trained models lower the data burden and improve robustness; 2) all teams still rely on guided source separation, suggesting that current neural speech separation and enhancement (SSE) techniques remain unreliable in complex scenarios and recording setups; 3) the best systems refine a first diarization pass with target-speaker diarization, making accurate initial speaker counting crucial to avoid compounding errors; and 4) large language models (LLMs) are remarkably tolerant of transcription errors in downstream tasks such as meeting summarization, weakening the correlation between transcription quality and final task performance.

Link: https://arxiv.org/abs/2507.18161
Authors: Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe
Institutions: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Click to view abstract

Abstract:The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
zh

Computer Vision

[CV-0] Captain Cinema: Towards Short Movie Generation

【Quick Read】: This paper addresses the difficulty of guaranteeing long-range narrative coherence and visual consistency in short movie generation, especially for multi-scene, long-sequence cinematic content. The key to the solution is Captain Cinema, a hierarchical generation framework with two core steps: top-down keyframe planning first produces a keyframe sequence that outlines the entire storyline and maintains long-range narrative and visual consistency; bottom-up video synthesis then uses these keyframes as conditioning signals for a video generation model with long-context learning, which fills in the spatio-temporal dynamics between them. To make multi-scene long-narrative generation stable and efficient, the paper further introduces an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), trained on a specially curated dataset of interleaved data pairs, enabling high-quality and efficient automated short movie generation.

Link: https://arxiv.org/abs/2507.18634
Authors: Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, Lu Jiang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. Project page: this https URL

Click to view abstract

Abstract:We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach firstly generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narrative consistent short movies in high quality and efficiency. Project page: this https URL
zh

[CV-1] Identifying Prompted Artist Names from Generated Images

【Quick Read】: This paper addresses prompted-artist recognition in text-to-image generation, i.e., inferring from a generated image alone which artist names were invoked in the prompt, in order to assess how strongly models depend on, and can be traced back to, artist styles. The key contribution is a large-scale benchmark (1.95M images, 110 artists) covering four generalization settings: held-out artists, increasing prompt complexity, multi-artist prompts, and different text-to-image models. A systematic comparison of feature-similarity baselines, contrastive style descriptors, data attribution methods, supervised classifiers, and few-shot prototypical networks reveals setting-dependent trade-offs: supervised and few-shot models excel on seen artists and complex prompts, style descriptors transfer better when the artist's style is pronounced, and multi-artist prompts remain the hardest case. The benchmark exposes substantial headroom and provides a public testbed for responsibly moderating text-to-image models.

Link: https://arxiv.org/abs/2507.18633
Authors: Grace Su, Sheng-Yu Wang, Aaron Hertzmann, Eli Shechtman, Jun-Yan Zhu, Richard Zhang
Institutions: Carnegie Mellon University; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:A common and controversial use of text-to-image models is to generate pictures by explicitly naming artists, such as “in the style of Greg Rutkowski”. We introduce a benchmark for prompted-artist recognition: predicting which artist names were invoked in the prompt from the image alone. The dataset contains 1.95M images covering 110 artists and spans four generalization settings: held-out artists, increasing prompt complexity, multiple-artist prompts, and different text-to-image models. We evaluate feature similarity baselines, contrastive style descriptors, data attribution methods, supervised classifiers, and few-shot prototypical networks. Generalization patterns vary: supervised and few-shot models excel on seen artists and complex prompts, whereas style descriptors transfer better when the artist’s style is pronounced; multi-artist prompts remain the most challenging. Our benchmark reveals substantial headroom and provides a public testbed to advance the responsible moderation of text-to-image models. We release the dataset and benchmark to foster further research: this https URL
zh

[CV-2] SIDA: Synthetic Image Driven Zero-shot Domain Adaptation ACM-MM2025

【Quick Read】: This paper addresses the limitations of existing text-driven zero-shot domain adaptation (ZSDA) methods, which struggle to capture complex real-world variations and require long adaptation times. The key to the solution is abandoning text descriptions in favor of the rich, fine-grained style cues available in image data: the proposed SIDA generates realistic synthetic images, first creating detailed source-like images and then applying image translation to reflect the target-domain style, and uses their style features as a proxy for the target domain. On top of these features, a Domain Mix module blends multiple styles to expand intra-domain representations, and a Patch Style Transfer module assigns different styles to individual patches, enabling effective modeling of real-world variations without any target-domain images and with significantly reduced adaptation time.

Link: https://arxiv.org/abs/2507.18632
Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim
Institutions: Hanyang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted to ACM MM 2025

Click to view abstract

Abstract:Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP’s embedding space and text description to simulate target-like style features. Despite the previous achievements in zero-shot domain adaptation, we observe that these text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.
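An AdaIN-style reading of Domain Mix and Patch Style Transfer, blending channel statistics from several style features and assigning a style per spatial patch; this is an interpretation for illustration, not the SIDA code.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """content/style: (C, H, W); transfer channel-wise mean/std."""
    c_mu, c_sd = content.mean((1, 2), keepdim=True), content.std((1, 2), keepdim=True)
    s_mu, s_sd = style.mean((1, 2), keepdim=True), style.std((1, 2), keepdim=True)
    return s_sd * (content - c_mu) / (c_sd + 1e-5) + s_mu

def domain_mix(styles: list[torch.Tensor], w: torch.Tensor) -> torch.Tensor:
    """Convex blend of several synthetic-style feature maps."""
    return sum(wi * s for wi, s in zip(w, styles))

def patch_style_transfer(content: torch.Tensor, styles: list, patch: int = 8):
    """Assign a randomly chosen style to each spatial patch."""
    out = content.clone()
    for i in range(0, content.shape[1], patch):
        for j in range(0, content.shape[2], patch):
            s = styles[torch.randint(len(styles), (1,)).item()]
            out[:, i:i+patch, j:j+patch] = adain(
                content[:, i:i+patch, j:j+patch], s[:, i:i+patch, j:j+patch])
    return out

feat = torch.randn(64, 32, 32)
styles = [torch.randn(64, 32, 32) for _ in range(3)]
stylized = patch_style_transfer(feat, styles + [domain_mix(styles, torch.tensor([0.5, 0.3, 0.2]))])
```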
zh

[CV-3] 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation

【Quick Read】: This paper addresses the inability of current 3D software generation methods to precisely control and modify individual elements, and their failure to handle the complex spatial and semantic constraints of the real world: existing approaches generate 3D environments as a whole, which limits controllability in practice. The key to the solution is Scenethesis, a requirement-sensitive 3D software synthesis framework built on ScenethesisLang, a domain-specific language that serves as a fine-grained, constraint-aware intermediate representation (IR) bridging natural-language requirements and executable 3D software. By decomposing synthesis into stages operating on ScenethesisLang, the approach enables independent verification, targeted modification, and systematic constraint satisfaction, significantly improving generation quality and controllability.

Link: https://arxiv.org/abs/2507.18625
Authors: Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu
Institutions: The Chinese University of Hong Kong; Renmin University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has made remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software still remains under-explored. Current methods for 3D software generation usually generate the 3D environments as a whole and cannot modify or control specific elements in the software. Furthermore, these methods struggle to handle the complex spatial and semantic constraints inherent in the real world. To address the challenges, we present Scenethesis, a novel requirement-sensitive 3D software synthesis approach that maintains formal traceability between user specifications and generated 3D software. Scenethesis is built upon ScenethesisLang, a domain-specific language that serves as a granular constraint-aware intermediate representation (IR) to bridge natural language requirements and executable 3D software. It serves both as a comprehensive scene description language enabling fine-grained modification of 3D software elements and as a formal constraint-expressive specification language capable of expressing complex spatial constraints. By decomposing 3D software synthesis into stages operating on ScenethesisLang, Scenethesis enables independent verification, targeted modification, and systematic constraint satisfaction. Our evaluation demonstrates that Scenethesis accurately captures over 80% of user requirements and satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously. Furthermore, Scenethesis achieves a 42.8% improvement in BLIP-2 visual evaluation scores compared to the state-of-the-art method.
zh

[CV-4] DRWKV: Focusing on Object Edges for Low-Light Image Enhancement

【Quick Read】: This paper addresses the difficulty of preserving object-edge continuity and fine structural detail in low-light image enhancement (LLIE), especially under extreme illumination degradation. The key to the solution is threefold: a Global Edge Retinex (GER) theory effectively decouples illumination from edge structure, improving edge fidelity; an Evolving WKV Attention with a spiral-scanning mechanism captures spatial edge continuity and models irregular structures more effectively; and a Bilateral Spectrum Aligner (Bi-SAB) with a tailored MS2-Loss jointly aligns luminance and chrominance features, improving visual naturalness and reducing artifacts. The method leads on PSNR, SSIM, and NIQE across five LLIE benchmarks with low computational complexity, and its generalization is validated on low-light multi-object tracking.

Link: https://arxiv.org/abs/2507.18594
Authors: Xuecheng Bai, Yuxiang Wang, Boyu Hu, Qinyuan Jie, Chuanzhi Xu, Hongru Xiao, Kechen Li, Vera Chung
Institutions: Shenyang Ligong University; The University of Sydney; University of International Business and Economics; Tongji University; Nanjing University of Aeronautics and Astronautics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Low-light image enhancement remains a challenging task, particularly in preserving object edge continuity and fine structural details under extreme illumination degradation. In this paper, we propose a novel model, DRWKV (Detailed Receptance Weighted Key Value), which integrates our proposed Global Edge Retinex (GER) theory, enabling effective decoupling of illumination and edge structures for enhanced edge fidelity. Secondly, we introduce Evolving WKV Attention, a spiral-scanning mechanism that captures spatial edge continuity and models irregular structures more effectively. Thirdly, we design the Bilateral Spectrum Aligner (Bi-SAB) and a tailored MS2-Loss to jointly align luminance and chrominance features, improving visual naturalness and mitigating artifacts. Extensive experiments on five LLIE benchmarks demonstrate that DRWKV achieves leading performance in PSNR, SSIM, and NIQE while maintaining low computational complexity. Furthermore, DRWKV enhances downstream performance in low-light multi-object tracking tasks, validating its generalization capabilities.
zh

[CV-5] HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation

【Quick Read】: This paper addresses the fact that, in 3D semantic segmentation, Transformers struggle to model long-range dependencies over large-scale point clouds because of quadratic complexity, while Mamba models offer linear complexity but weaker feature representation. The key to the solution is HybridTM, the first hybrid architecture integrating Transformer and Mamba for 3D semantic segmentation, together with an Inner Layer Hybrid Strategy that combines attention (for long-range dependencies) and Mamba (for fine-grained local features) at a finer granularity within a single layer, enabling efficient yet powerful 3D feature learning.

Link: https://arxiv.org/abs/2507.18575
Authors: Xinyu Wang, Jinghua Hou, Zhe Liu, Yingying Zhu
Institutions: Huazhong University of Science and Technology; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures

Click to view abstract

Abstract:Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point clouds. While recent Mamba-based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. However, effectively combining these complementary strengths remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long-range dependencies and fine-grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at this https URL.
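The inner-layer hybrid idea can be sketched as a block that runs attention and a state-space branch inside one layer. Below, a gated depthwise convolution stands in for the Mamba SSM, so the sketch shows only the layer layout, not the actual HybridTM module.

```python
import torch
import torch.nn as nn

class SSMStandIn(nn.Module):          # placeholder for a real Mamba block
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=4, padding=3, groups=d)
        self.gate = nn.Linear(d, d)
    def forward(self, x):             # x: (B, N, d)
        h = self.conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return h * torch.sigmoid(self.gate(x))

class InnerLayerHybrid(nn.Module):
    def __init__(self, d, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ssm = SSMStandIn(d)
    def forward(self, x):             # x: (B, N_points, d)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a                           # long-range dependencies
        return x + self.ssm(self.norm2(x))  # fine-grained local features

x = torch.randn(2, 1024, 96)
print(InnerLayerHybrid(96)(x).shape)  # torch.Size([2, 1024, 96])
```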
zh

[CV-6] Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis ICCV2025

【Quick Read】: This paper addresses mode collapse in Distribution Matching Distillation (DMD) when compressing pre-trained diffusion models, which stems from its reliance on reverse Kullback-Leibler (KL) divergence minimization. The key to the solution is Adversarial Distribution Matching (ADM), a framework that uses diffusion-based discriminators to adversarially align the predicted latent distributions of the real and fake score estimators. For the extremely challenging one-step distillation setting, the pre-trained generator is further improved by adversarial distillation with hybrid discriminators in both latent and pixel spaces; unlike DMD2's mean squared error pre-training, a distributional loss is applied to ODE pairs collected from the teacher, providing a better initialization for the subsequent score-distillation fine-tuning. Combining adversarial pre-training with ADM fine-tuning into the unified DMDX pipeline yields superior one-step performance on SDXL with less GPU time, and multi-step ADM distillation sets new benchmarks on SD3-Medium, SD3.5-Large, and CogVideoX.

Link: https://arxiv.org/abs/2507.18569
Authors: Yanzuo Lu, Yuxi Ren, Xin Xia, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Andy J. Ma, Xiaohua Xie, Jian-Huang Lai
Institutions: Sun Yat-Sen University; ByteDance Seed Vision; Guangdong Provincial Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education; Pazhou Lab (HuangPu)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025 (Highlight)

Click to view abstract

Abstract:Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Different from the mean squared error used in DMD2 pre-training, our method incorporates the distributional loss on ODE pairs collected from the teacher model, and thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed DMDX, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
zh

[CV-7] Facial Demorphing from a Single Morph Using a Latent Conditional GAN

【Quick Read】: This paper addresses two weaknesses of existing demorphing methods when facing unknown morph techniques and real face images: their outputs tend to look very similar to the morph itself (the morph replication problem), and they assume train and test morphs are generated with the same technique, limiting generalization. The key to the solution is decomposing the morph in a latent space, which allows demorphing images created with unseen morph techniques and face styles; the method is trained on morphs built from synthetic faces and tested on morphs of real faces generated with arbitrary techniques, outperforming existing methods by a considerable margin and producing high-fidelity demorphed face images.

Link: https://arxiv.org/abs/2507.18566
Authors: Nitish Shukla, Arun Ross
Institutions: Michigan State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:A morph is created by combining two (or more) face images from two (or more) identities to create a composite image that is highly similar to both constituent identities, allowing the forged morph to be biometrically associated with more than one individual. Morph Attack Detection (MAD) can be used to detect a morph, but does not reveal the constituent images. Demorphing - the process of deducing the constituent images - is thus vital to provide additional evidence about a morph. Existing demorphing methods suffer from the morph replication problem, where the outputs tend to look very similar to the morph itself, or assume that train and test morphs are generated using the same morph technique. The proposed method overcomes these issues. The method decomposes a morph in latent space allowing it to demorph images created from unseen morph techniques and face styles. We train our method on morphs created from synthetic faces and test on morphs created from real faces using arbitrary morph techniques. Our method outperforms existing methods by a considerable margin and produces high fidelity demorphed face images.
zh

[CV-8] Deep Learning-Based Age Estimation and Gender Classification for Targeted Advertisement

【Quick Read】: This paper addresses simultaneous age and gender classification from facial images to improve targeted advertising. The key to the solution is a custom Convolutional Neural Network (CNN) architecture that, instead of treating the two tasks independently, exploits the inherent correlation between age and gender cues in facial features through shared representations. Trained on a large, diverse dataset pre-processed for robustness to lighting, pose, and image-quality variations, the model reaches 95% gender classification accuracy and a competitive mean absolute error of 5.77 years for age estimation; per-age-group analysis reveals the difficulty of estimating younger individuals' ages, underscoring the need for targeted data augmentation and model refinement.

Link: https://arxiv.org/abs/2507.18565
Authors: Muhammad Imran Zaman, Nisar Ahmed
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6

Click to view abstract

Abstract:This paper presents a novel deep learning-based approach for simultaneous age and gender classification from facial images, designed to enhance the effectiveness of targeted advertising campaigns. We propose a custom Convolutional Neural Network (CNN) architecture, optimized for both tasks, which leverages the inherent correlation between age and gender information present in facial features. Unlike existing methods that often treat these tasks independently, our model learns shared representations, leading to improved performance. The network is trained on a large, diverse dataset of facial images, carefully pre-processed to ensure robustness against variations in lighting, pose, and image quality. Our experimental results demonstrate a significant improvement in gender classification accuracy, achieving 95%, and a competitive mean absolute error of 5.77 years for age estimation. Critically, we analyze the performance across different age groups, identifying specific challenges in accurately estimating the age of younger individuals. This analysis reveals the need for targeted data augmentation and model refinement to address these biases. Furthermore, we explore the impact of different CNN architectures and hyperparameter settings on the overall performance, providing valuable insights for future research.
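A compact multi-task network of the kind described, with a shared trunk feeding a gender-classification head and an age-regression head; layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AgeGenderNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(              # shared representation
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gender = nn.Linear(64, 2)           # classification head
        self.age = nn.Linear(64, 1)              # regression head

    def forward(self, x):
        h = self.trunk(x)
        return self.gender(h), self.age(h).squeeze(-1)

net = AgeGenderNet()
x = torch.randn(4, 3, 128, 128)
logits, age = net(x)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0])) \
     + nn.L1Loss()(age, torch.tensor([23.0, 41.0, 35.0, 19.0]))  # MAE-style age term
loss.backward()
```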
zh

[CV-9] Synthetic Data Augmentation for Enhanced Chicken Carcass Instance Segmentation

【Quick Read】: This paper addresses the scarcity of annotated real-world data that hampers training and deploying deep instance segmentation models for chicken carcasses in poultry processing. The key to the solution is a new benchmark dataset of 300 annotated real images together with the first pipeline for generating photo-realistic, automatically labeled synthetic images of chicken carcasses; experiments mixing varying proportions of synthetic images into a small real dataset show that synthetic data augmentation significantly boosts instance segmentation performance across prominent models when real annotated data from the processing line is scarce, mitigating data scarcity and reducing manual annotation effort.

Link: https://arxiv.org/abs/2507.18558
Authors: Yihong Feng, Chaitanya Pallerla, Xiaomin Lin, Pouya Sohrabipour Sr, Philip Crandall, Wan Shou, Yu She, Dongyi Wang
Institutions: University of Arkansas; University of South Florida; Purdue University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted for journal reviewing

Click to view abstract

Abstract:The poultry industry has been driven by broiler chicken production and has grown into the world’s largest animal protein sector. Automated detection of chicken carcasses on processing lines is vital for quality control, food safety, and operational efficiency in slaughterhouses and poultry processing plants. However, developing robust deep learning models for tasks like instance segmentation in these fast-paced industrial environments is often hampered by the need for laborious acquisition and annotation of large-scale real-world image datasets. We present the first pipeline generating photo-realistic, automatically labeled synthetic images of chicken carcasses. We also introduce a new benchmark dataset containing 300 annotated real-world images, curated specifically for poultry segmentation research. Using these datasets, this study investigates the efficacy of synthetic data and automatic data annotation to enhance the instance segmentation of chicken carcasses, particularly when real annotated data from the processing line is scarce. A small real dataset with varying proportions of synthetic images was evaluated in prominent instance segmentation models. Results show that synthetic data significantly boosts segmentation performance for chicken carcasses across all models. This research underscores the value of synthetic data augmentation as a viable and effective strategy to mitigate data scarcity, reduce manual annotation efforts, and advance the development of robust AI-driven automated detection systems for chicken carcasses in the poultry processing industry.
zh

[CV-10] VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding

【Quick Read】: This paper addresses the lack of deep-cognition annotations, such as intent, in existing video understanding datasets, in particular fine-grained labels that require integrating context across modalities. The key to the solution is VideoMind, a video-centric omni-modal dataset whose novelty is intent expressions generated with a Chain-of-Thought (COT) approach; these expressions must be inferred from the whole video rather than directly observed. Each sample provides audio, textual descriptions at three hierarchical levels (factual, abstract, and intent), and structured semantic labels (subject, place, time, event, action, intent), plus a gold-standard benchmark of 3,000 manually validated samples supporting multi-level retrieval evaluation, advancing understanding and model training for deep video semantics such as emotion and intent.

Link: https://arxiv.org/abs/2507.18552
Authors: Baoyao Yang, Wanyun Li, Dixin Chen, Junxiang Chen, Wenbin Yao, Haifeng Lin
Institutions: Guangdong University of Technology; Wechat, Tencent
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages; 14 figures

Click to view abstract

Abstract:This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio is described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. It contains over 22 million words, averaging ~225 words per sample. VideoMind’s key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (COT) approach, prompting the mLLM through step-by-step reasoning. Each description includes annotations for subject, place, time, event, action, and intent, supporting downstream recognition tasks. Crucially, we establish a gold-standard benchmark with 3,000 manually validated samples for evaluating deep-cognitive video understanding. We design hybrid-cognitive retrieval experiments, scored by multi-level retrieval metrics, to appropriately assess deep video comprehension. Evaluation results for models (e.g., InternVideo, VAST, UMT-L) are released. VideoMind serves as a powerful benchmark for fine-grained cross-modal alignment and advances fields requiring in-depth video understanding, such as emotion and intent recognition. The data is publicly available on GitHub, HuggingFace, and OpenDataLab, this https URL.
zh

[CV-11] A 3D Cross-modal Keypoint Descriptor for MR-US Matching and Registration

【Quick Read】: This paper addresses the unsolved problem of registering intraoperative real-time ultrasound (iUS) to preoperative MRI, which is hard because the two modalities differ markedly in appearance, resolution, and field-of-view. The key to the solution is a novel 3D cross-modal keypoint descriptor: a patient-specific matching-by-synthesis strategy generates synthetic iUS volumes from the preoperative MRI, enabling supervised contrastive training of a shared descriptor space; a probabilistic keypoint detection scheme identifies anatomically salient, modality-consistent locations; and a curriculum-based triplet loss with dynamic hard negative mining makes the descriptors rotation-invariant and robust to iUS artifacts such as speckle noise and limited field-of-view. At inference, keypoints detected in MR and real iUS images are sparsely matched to drive rigid registration.

Link: https://arxiv.org/abs/2507.18551
Authors: Daniil Morozov, Reuben Dorent, Nazim Haouchine
Institutions: Harvard Medical School; Brigham and Women's Hospital; MIND Team, Inria Saclay, Université Paris-Saclay; Sorbonne Université, Institut du Cerveau - Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié Salpêtrière; Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Click to view abstract

Abstract:Intraoperative registration of real-time ultrasound (iUS) to preoperative Magnetic Resonance Imaging (MRI) remains an unsolved problem due to severe modality-specific differences in appearance, resolution, and field-of-view. To address this, we propose a novel 3D cross-modal keypoint descriptor for MRI-iUS matching and registration. Our approach employs a patient-specific matching-by-synthesis approach, generating synthetic iUS volumes from preoperative MRI. This enables supervised contrastive training to learn a shared descriptor space. A probabilistic keypoint detection strategy is then employed to identify anatomically salient and modality-consistent locations. During training, a curriculum-based triplet loss with dynamic hard negative mining is used to learn descriptors that are i) robust to iUS artifacts such as speckle noise and limited coverage, and ii) rotation-invariant. At inference, the method detects keypoints in MR and real iUS images and identifies sparse matches, which are then used to perform rigid registration. Our approach is evaluated using 3D MRI-iUS pairs from the ReMIND dataset. Experiments show that our approach outperforms state-of-the-art keypoint matching methods across 11 patients, with an average precision of 69.8% . For image registration, our method achieves a competitive mean Target Registration Error of 2.39 mm on the ReMIND2Reg benchmark. Compared to existing iUS-MR registration approach, our framework is interpretable, requires no manual initialization, and shows robustness to iUS field-of-view variation. Code is available at this https URL.
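A sketch of triplet training with in-batch hard negative mining, the core of the descriptor loss described above; the margin and distance choices are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_hard_negative(anchor, positive, margin=0.2):
    """anchor, positive: (B, D) L2-normalized descriptors; row i of `positive`
    matches row i of `anchor`; every other row serves as a negative."""
    dist = torch.cdist(anchor, positive)           # (B, B) pairwise distances
    pos = dist.diag()                              # distances of matching pairs
    neg = dist + torch.eye(len(dist)) * 1e6        # mask out the positives
    hardest = neg.min(dim=1).values                # closest non-match per anchor
    return F.relu(pos - hardest + margin).mean()

a = F.normalize(torch.randn(16, 128), dim=1)
p = F.normalize(a + 0.1 * torch.randn(16, 128), dim=1)
print(triplet_hard_negative(a, p))
```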
zh

[CV-12] On the Performance of Concept Probing: The Influence of the Data (Extended Version) ECAI2025

【Quick Read】: This paper addresses the limited attention paid, in concept probing research, to the data used to train probing models: prior work focuses on the probed model or the probing model itself, even though the quality and characteristics of the training data also shape probing performance. The key to the solution is a systematic investigation of how the data used to train probing models affects their performance, in the context of image classification tasks, together with released concept labels for two widely used datasets, providing standardized, reproducible material for follow-up research.

Link: https://arxiv.org/abs/2507.18550
Authors: Manuel de Sousa Ribeiro, Afonso Leote, João Leite
Institutions: NOVA LINCS, NOVA School of Science and Technology, NOVA University Lisbon, Portugal
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Extended version of the paper published in Proceedings of the European Conference on Artificial Intelligence (ECAI 2025)

Click to view abstract

Abstract:Concept probing has recently garnered increasing interest as a way to help interpret artificial neural networks, dealing both with their typically large size and their subsymbolic nature, which ultimately renders them unfeasible for direct human interpretation. Concept probing works by training additional classifiers to map the internal representations of a model into human-defined concepts of interest, thus allowing humans to peek inside artificial neural networks. Research on concept probing has mainly focused on the model being probed or the probing model itself, paying limited attention to the data required to train such probing models. In this paper, we address this gap. Focusing on concept probing in the context of image classification tasks, we investigate the effect of the data used to train probing models on their performance. We also make available concept labels for two widely used datasets.
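A bare-bones concept probe is simply a linear classifier mapping frozen hidden activations to a human-defined concept label; the sketch below uses synthetic activations and a stand-in concept for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))                  # frozen hidden activations
concept = (acts[:, :8].sum(axis=1) > 0).astype(int)  # stand-in concept label

Xtr, Xte, ytr, yte = train_test_split(acts, concept, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(f"probe accuracy: {probe.score(Xte, yte):.3f}")  # how decodable the concept is
```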
zh

[CV-13] Unposed 3DGS Reconstruction with Probabilistic Procrustes Mapping

【Quick Read】: This paper addresses unposed 3D Gaussian Splatting (3DGS) reconstruction from large outdoor image sequences, where existing Multi-View Stereo (MVS) models hit memory limits and lose accuracy as the number of images grows into the hundreds. The key to the solution is a framework that combines pretrained MVS priors with a probabilistic Procrustes mapping strategy: the input images are partitioned into subsets, the mapping of tens of millions of points is formulated as a probabilistic Procrustes problem with a closed-form alignment, and probabilistic coupling with a soft dustbin mechanism rejects uncertain correspondences, globally aligning point clouds and poses across hundreds of images within minutes. A joint optimization framework then builds Gaussians from confidence-aware anchor points and couples 3DGS differentiable rendering with an analytical Jacobian to refine scene and poses together, achieving accurate reconstruction and pose estimation from unposed image sequences.

Link: https://arxiv.org/abs/2507.18541
Authors: Chong Cheng, Zijian Wang, Sicheng Yu, Yu Hu, Nanjie Yao, Hao Wang
Institutions: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has emerged as a core technique for 3D representation. Its effectiveness largely depends on precise camera poses and accurate point cloud initialization, which are often derived from pretrained Multi-View Stereo (MVS) models. However, in unposed reconstruction task from hundreds of outdoor images, existing MVS models may struggle with memory limits and lose accuracy as the number of input images grows. To address this limitation, we propose a novel unposed 3DGS reconstruction framework that integrates pretrained MVS priors with the probabilistic Procrustes mapping strategy. The method partitions input images into subsets, maps submaps into a global space, and jointly optimizes geometry and poses with 3DGS. Technically, we formulate the mapping of tens of millions of point clouds as a probabilistic Procrustes problem and solve a closed-form alignment. By employing probabilistic coupling along with a soft dustbin mechanism to reject uncertain correspondences, our method globally aligns point clouds and poses within minutes across hundreds of images. Moreover, we propose a joint optimization framework for 3DGS and camera poses. It constructs Gaussians from confidence-aware anchor points and integrates 3DGS differentiable rendering with an analytical Jacobian to jointly refine scene and poses, enabling accurate reconstruction and pose estimation. Experiments on Waymo and KITTI datasets show that our method achieves accurate reconstruction from unposed image sequences, setting a new state of the art for unposed 3DGS reconstruction.
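The deterministic core underlying the probabilistic formulation is closed-form Procrustes alignment via SVD. Here is a NumPy sketch of similarity (scale + rotation + translation) alignment of two corresponded point sets, omitting the correspondence weighting.

```python
import numpy as np

def procrustes(P: np.ndarray, Q: np.ndarray):
    """Find s, R, t minimizing ||s * P @ R.T + t - Q||^2 for P, Q: (N, 3)."""
    muP, muQ = P.mean(0), Q.mean(0)
    X, Y = P - muP, Q - muQ
    U, S, Vt = np.linalg.svd(Y.T @ X)               # cross-covariance SVD
    d = np.sign(np.linalg.det(U @ Vt))              # avoid reflections
    R = U @ np.diag([1, 1, d]) @ Vt
    s = (S * [1, 1, d]).sum() / (X ** 2).sum()      # optimal scale
    t = muQ - s * muP @ R.T
    return s, R, t

P = np.random.randn(100, 3)
Q = 2.0 * P + np.array([1.0, -2.0, 0.5])            # known scale and shift
s, R, t = procrustes(P, Q)
print(np.isclose(s, 2.0), np.allclose(s * P @ R.T + t, Q))
```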
zh

[CV-14] S-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

【Quick Read】: This paper addresses the heavy training and compute cost of scaling visual auto-regressive (VAR) generation by proposing TTS-VAR, the first general test-time scaling (TTS) framework for VAR models, which models generation as a path-search problem. The key mechanisms are: at coarse scales, clustering-based diversity search preserves structural variety by clustering semantic features, keeping candidates with higher potential for later selection; at fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, defined as reward functions over the multi-scale generation history. On the strong Infinity model the method improves the GenEval score by a notable 8.7% (from 0.69 to 0.75), showing that early-stage structural features strongly influence final quality and that resampling efficacy varies across generation scales.

Link: https://arxiv.org/abs/2507.18537
Authors: Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
Institutions: HKU MMLab; Tongyi Lab, Alibaba Group; CUHK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 Tables, 9 Figures

Click to view abstract

Abstract:Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. Besides, inspired by VAR’s hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard for evaluation, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection on samples with higher potential. (ii) In fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at this https URL.
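Clustering-based diversity search reduces to clustering coarse-scale features and keeping one candidate per cluster; a toy version with placeholder features follows.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_select(features: np.ndarray, keep: int) -> list[int]:
    """features: (n_candidates, d) coarse-scale semantic features.
    Returns the index of the candidate nearest each cluster centroid."""
    km = KMeans(n_clusters=keep, n_init=10, random_state=0).fit(features)
    picked = []
    for c in range(keep):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        picked.append(int(idx[d.argmin()]))
    return picked

feats = np.random.randn(32, 256)   # 32 partially generated candidates
print(diverse_select(feats, keep=8))
```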
zh

[CV-15] Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models

【Quick Read】: This paper addresses the performance bottleneck caused by diffusion models' fixed Gaussian noise patterns in image restoration: forcibly injecting pure Gaussian noise corrupts the degraded image, overextends the image-transformation distance, and increases restoration complexity, limiting real-world applicability. The key to the solution, EDA (Elucidating the Design space of Arbitrary-noise-based diffusion models), is to expand the freedom of noise patterns while preserving the original module flexibility of the diffusion framework, with rigorous proof that more complex noise incurs no additional computational overhead during restoration. EDA is validated on three representative tasks, MRI bias field correction (global smooth noise), CT metal artifact reduction (global sharp noise), and natural-image shadow removal (local boundary-aware noise), outperforming most task-specific methods and reaching state-of-the-art performance in bias field correction and shadow removal with only 5 sampling steps.

Link: https://arxiv.org/abs/2507.18534
Authors: Xingyu Qiu, Mengying Yang, Xinghua Ma, Dong Liang, Yuzhen Li, Fanding Li, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Institutions: Harbin Institute of Technology; Harbin Institute of Technology, Shenzhen; Case Western Reserve University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 4 figures

Click to view abstract

Abstract:EDM elucidates the unified design space of diffusion models, yet its fixed noise patterns restricted to pure Gaussian noise, limit advancements in image restoration. Our study indicates that forcibly injecting Gaussian noise corrupts the degraded images, overextends the image transformation distance, and increases restoration complexity. To address this problem, our proposed EDA Elucidates the Design space of Arbitrary-noise-based diffusion models. Theoretically, EDA expands the freedom of noise pattern while preserving the original module flexibility of EDM, with rigorous proof that increased noise complexity incurs no additional computational overhead during restoration. EDA is validated on three typical tasks: MRI bias field correction (global smooth noise), CT metal artifact reduction (global sharp noise), and natural image shadow removal (local boundary-aware noise). With only 5 sampling steps, EDA outperforms most task-specific methods and achieves state-of-the-art performance in bias field correction and shadow removal.
zh

[CV-16] COT-AD: Cotton Analysis Dataset ICIP

【Quick Read】: This paper addresses the lack of high-quality, richly annotated computer vision datasets for cotton crop analysis, which limits the development of deep-learning-based diagnosis and management in agriculture. The key to the solution is the COT-AD dataset: over 25,000 images spanning the cotton growth cycle (5,000 annotated), combining aerial imagery for field-scale detection and segmentation with high-resolution DSLR images documenting key diseases; annotations cover pest and disease recognition and vegetation and weed analysis, filling the gap in cotton-specific agricultural data and supporting classification, segmentation, image restoration and enhancement, generative-model-based cotton crop synthesis, and early disease management.

Link: https://arxiv.org/abs/2507.18532
Authors: Akbar Ali, Mahek Vyas, Soumyaratna Debnath, Chanda Grover Kamra, Jaidev Sanjay Khalane, Reuben Shibu Devanesan, Indra Deep Mastan, Subramanian Sankaranarayanan, Pankaj Khanna, Shanmuganathan Raman
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Dataset publicly available at: this https URL . Accepted to IEEE International Conference on Image Processing (ICIP) 2025

Click to view abstract

Abstract:This paper presents COT-AD, a comprehensive Dataset designed to enhance cotton crop analysis through computer vision. Comprising over 25,000 images captured throughout the cotton growth cycle, with 5,000 annotated images, COT-AD includes aerial imagery for field-scale detection and segmentation and high-resolution DSLR images documenting key diseases. The annotations cover pest and disease recognition, vegetation, and weed analysis, addressing a critical gap in cotton-specific agricultural datasets. COT-AD supports tasks such as classification, segmentation, image restoration, enhancement, deep generative model-based cotton crop synthesis, and early disease management, advancing data-driven crop management.
zh

[CV-17] IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning

【Quick Read】: This paper addresses the difficulty of fine-grained, intent-oriented control in video captioning, which arises because large visual language models (LVLMs) handle spatial and temporal understanding separately and therefore cannot generate captions precisely targeted to a user-specified object and intent. The key to IntentVCNet is twofold: a prompt combination strategy lets the LLM model the implicit relation between user-intent prompts and the video sequence, and a parameter-efficient box adapter augments object semantics in the global visual context so that visual tokens carry prior knowledge of the user intent. Together, these strategies bridge the spatio-temporal gap from both the prompting and the model side, markedly improving fine-grained spatial control over specific targets and enabling accurate intent-oriented captions, with state-of-the-art results on several open-source LVLMs and a runner-up finish in the IntentVC challenge.

Link: https://arxiv.org/abs/2507.18531
Authors: Tianheng Qiu, Jingchun Gao, Jingyu Li, Huiyi Leong, Xuan Huang, Xi Wang, Xiaocheng Zhang, Kele Xu, Lan Zhang
Institutions: University of Science and Technology of China; Hefei Institutes of Physical Science, Chinese Academy of Sciences; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; State Key Lab. for Novel Software Technology, Nanjing University; University of Chicago; National University of Defense Technology; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction following and visual comprehension capabilities. Although the LVLMs demonstrated proficiency in spatial and temporal understanding respectively, they were not able to perform fine-grained spatial control in time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. Towards this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter efficient box adapter that augments the object semantic information in the global visual context so that the visual token has a priori information about the user intent. The final experiment proves that the combination of the two strategies can further enhance the LVLM’s ability to model spatial details in video sequences, and facilitate the LVLMs to accurately generate controlled intent-oriented captions. Our proposed method achieved state-of-the-art results in several open source LVLMs and was the runner-up in the IntentVC challenge. Our code is available on this https URL.
zh

[CV-18] GaussianFusionOcc: A Seamless Sensor Fusion Approach for 3D Occupancy Prediction Using 3D Gaussians

【Quick Read】: This paper addresses accuracy and efficiency in 3D semantic occupancy prediction for autonomous driving, where conventional dense grid representations are computationally heavy and make multi-modal sensor fusion difficult. The key to GaussianFusionOcc is replacing voxelized grids with semantic 3D Gaussians and fusing camera, LiDAR, and radar data through a modality-agnostic deformable attention mechanism that extracts essential features from each sensor type and uses them to refine the Gaussian properties, yielding more precise and scalable occupancy prediction with significantly better memory efficiency and inference speed.

Link: https://arxiv.org/abs/2507.18522
Authors: Tomislav Pavković, Mohammad-Ali Nikouei Mahani, Johannes Niedermayer, Johannes Betz
Institutions: Technical University of Munich; BMW Group; Munich Institute of Robotics and Machine Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:3D semantic occupancy prediction is one of the crucial tasks of autonomous driving. It enables precise and safe interpretation and navigation in complex environments. Reliable predictions rely on effective sensor fusion, as different modalities can contain complementary information. Unlike conventional methods that depend on dense grid representations, our approach, GaussianFusionOcc, uses semantic 3D Gaussians alongside an innovative sensor fusion mechanism. Seamless integration of data from camera, LiDAR, and radar sensors enables more precise and scalable occupancy prediction, while 3D Gaussian representation significantly improves memory efficiency and inference speed. GaussianFusionOcc employs modality-agnostic deformable attention to extract essential features from each sensor type, which are then used to refine Gaussian properties, resulting in a more accurate representation of the environment. Extensive testing with various sensor combinations demonstrates the versatility of our approach. By leveraging the robustness of multi-modal fusion and the efficiency of Gaussian representation, GaussianFusionOcc outperforms current state-of-the-art models.
zh

[CV-19] Object segmentation in the wild with foundation models: application to vision assisted neuro-prostheses for upper limbs

【速读】:该论文旨在解决在复杂视觉场景中对日常物体进行语义对象分割的问题,特别是在无需针对特定图像进行微调的情况下,利用基础模型(foundation models)实现高精度分割。其核心挑战在于如何在“野外”(in the wild)环境中有效应用预训练模型,以支持上肢神经假体的视觉引导任务。解决方案的关键在于提出一种基于注视点(gaze fixations)生成提示(prompts)的方法,用于指导Segment Anything Model (SAM) 在自我中心(egocentric)视觉数据上的分割行为,并在此类数据上对其进行微调,以提升模型在真实世界复杂场景中的性能;实验表明该方法在Grasping-in-the-Wild数据集上的IoU指标最高提升0.51点。
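作为参考,下面给出“用注视点构造 SAM 点提示”的一个最小示意,基于开源 segment_anything 库的公开接口(权重文件名与 fixations 的来源为笔者假设,且未包含论文中在自我中心数据上的微调步骤):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # 权重路径为示意
predictor = SamPredictor(sam)

def segment_from_gaze(image, fixations):
    """image: HxWx3 uint8 RGB图像;fixations: [(x, y), ...] 注视点像素坐标。"""
    predictor.set_image(image)
    pts = np.array(fixations, dtype=np.float32)
    labels = np.ones(len(pts), dtype=np.int64)      # 全部注视点作为前景点提示
    masks, scores, _ = predictor.predict(point_coords=pts,
                                         point_labels=labels,
                                         multimask_output=True)
    return masks[scores.argmax()]                   # 取得分最高的候选掩码
```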

链接: https://arxiv.org/abs/2507.18517
作者: Bolutife Atoki,Jenny Benois-Pineau,Renaud Péteri,Fabien Baldacci,Aymar de Rugy
机构: University of Bordeaux (波尔多大学); University of La Rochelle (拉罗谢尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we address the problem of semantic object segmentation using foundation models. We investigate whether foundation models, trained on a large number and variety of objects, can perform object segmentation without fine-tuning on specific images containing everyday objects, but in highly cluttered visual scenes. The “in the wild” context is driven by the target application of vision guided upper limb neuroprostheses. We propose a method for generating prompts based on gaze fixations to guide the Segment Anything Model (SAM) in our segmentation scenario, and fine-tune it on egocentric visual data. Evaluation results of our approach show an improvement of the IoU segmentation quality metric by up to 0.51 points on real-world challenging data of Grasping-in-the-Wild corpus which is made available on the RoboFlow Platform (this https URL)
zh

[CV-20] Towards Large Scale Geostatistical Methane Monitoring with Part-based Object Detection

【速读】:该论文旨在解决遥感图像中稀有目标检测的难题,尤其是在大范围地理区域内识别罕见设施(如法国的生物消化器)时面临的挑战。其关键解决方案在于提出一种基于部件的方法,通过考虑生物消化器的关键子组件来提升初始检测精度,并结合小规模训练与验证集及大规模不平衡测试集进行模型优化,从而实现对新区域中生物消化器的高效识别与量化分析,最终支持甲烷排放量的地理统计估算。
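下面的纯 Python 片段示意“利用关键子组件提升初始检测”的一种朴素做法(加分规则、阈值与子组件类型均为笔者假设,仅说明思路):

```python
def boost_with_parts(digesters, parts, gain=0.15):
    """若候选框内部命中子组件(如储气罩、消化罐)的检测,则提升其置信度(示意)。
    digesters / parts: [(x1, y1, x2, y2, score), ...]"""
    def contains(box, part, thr=0.5):
        # 子组件框与主框交集占子组件面积的比例
        x1 = max(box[0], part[0]); y1 = max(box[1], part[1])
        x2 = min(box[2], part[2]); y2 = min(box[3], part[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = (part[2] - part[0]) * (part[3] - part[1])
        return area > 0 and inter / area >= thr

    boosted = []
    for box in digesters:
        n_hit = sum(contains(box, p) for p in parts)
        score = min(1.0, box[4] + gain * n_hit)   # 每命中一个子组件加一次分
        boosted.append((*box[:4], score))
    return boosted
```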

链接: https://arxiv.org/abs/2507.18513
作者: Adhemar de Senneville,Xavier Bou,Thibaud Ehret,Rafael Grompone,Jean Louis Bonne,Nicolas Dumelie,Thomas Lauvaux,Gabriele Facciolo
机构: Université Paris-Saclay, CNRS, ENS Paris-Saclay, Centre Borelli (法国巴黎-萨克雷大学, 国家科学研究中心, 巴黎-萨克雷高等师范学院, 博雷利中心); AMIAD, Pole Recherche (AMIAD, 研究中心); Université de Reims Champagne-Ardenne (兰斯香槟-阿登大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection is one of the main applications of computer vision in remote sensing imagery. Despite its increasing availability, the sheer volume of remote sensing data poses a challenge when detecting rare objects across large geographic areas. Paradoxically, this common challenge is crucial to many applications, such as estimating environmental impact of certain human activities at scale. In this paper, we propose to address the problem by investigating the methane production and emissions of bio-digesters in France. We first introduce a novel dataset containing bio-digesters, with small training and validation sets, and a large test set with a high imbalance towards observations without objects since such sites are rare. We develop a part-based method that considers essential bio-digester sub-elements to boost initial detections. To this end, we apply our method to new, unseen regions to build an inventory of bio-digesters. We then compute geostatistical estimates of the quantity of methane produced that can be attributed to these infrastructures in a given area at a given time.
zh

[CV-21] Explaining How Visual, Textual and Multimodal Encoders Share Concepts

【速读】:该论文旨在解决跨模态模型(视觉、文本及多模态编码器)在稀疏自编码器(Sparse Autoencoders, SAE)提取的可解释特征基础上进行定量比较的问题,此前的研究仅限于同一模态内的模型对比。其解决方案的关键在于提出两个新指标:一是用于跨模型量化比较的新型指标,二是用于衡量不同类别模型间个体特征共享程度的“比较共享度”(Comparative Sharedness)。通过这两个工具,作者对21个不同规模和训练数据类型的编码器进行了系统性分析,揭示了多模态预训练对特征共享的影响,并发现视觉语言模型(VLMs)特有的视觉特征会与文本编码器共享,凸显了文本预训练在特征形成中的关键作用。
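论文未在摘要中给出指标的具体公式,下面是笔者假设的一个极简替代度量,仅用于说明“以 SAE 激活衡量两模型特征共享程度”这类计算的形态:

```python
import numpy as np

def feature_sharedness(acts_a, acts_b):
    """跨模型SAE特征共享度的极简度量(示意,非论文原始指标):
    对A中每个特征,取其与B中所有特征激活的最大皮尔逊相关绝对值,再取均值。
    acts_a: (n_samples, d_a), acts_b: (n_samples, d_b),两模型在同一批输入上的SAE激活。"""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = np.abs(a.T @ b) / len(a)      # (d_a, d_b) 相关系数矩阵
    return corr.max(axis=1).mean()       # 每个A特征最佳匹配程度的均值

print(feature_sharedness(np.random.rand(512, 64), np.random.rand(512, 96)))
```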

链接: https://arxiv.org/abs/2507.18512
作者: Clément Cornet,Romaric Besançon,Hervé Le Borgne
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural network activations. Previous works compared different models based on SAE-derived features but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain specific datasets. The results allow us to revisit previous studies in the light of encoders trained in a multimodal context and to quantify to what extent all these models share some representations or features. They also suggest that visual features that are specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at this https URL
zh

[CV-22] Human Scanpath Prediction in Target-Present Visual Search with Semantic-Foveal Bayesian Attention

【速读】:该论文旨在解决目标导向的视觉搜索任务中人类视觉注意预测的问题,即如何更准确地建模人类在目标存在场景下的注视序列(fixation sequence)。其解决方案的关键在于提出SemBA-FAST框架,该框架结合了深度目标检测与概率语义融合机制,通过预训练检测器和人工视网膜(artificial foveation)动态更新自上而下的注意力知识,实现对固定点序列的逐步优化预测。该方法显著提升了注视位置预测的准确性,在COCO-Search18基准数据集上优于传统自上而下模型,并在某些情况下可与依赖扫描路径信息的模型相媲美,为类人注意力建模提供了新的概率语义-中央凹协同机制。
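其中“概率语义融合”可以理解为对目标位置概率图的逐次贝叶斯更新,下面给出一个玩具示例(似然以随机数代替真实检测器在中央凹区域的输出,仅为示意):

```python
import numpy as np

def bayes_update(prior, likelihood):
    """贝叶斯更新目标位置的空间概率图(示意)。
    prior / likelihood: HxW 非负数组,likelihood 可由目标检测器给出。"""
    post = prior * likelihood
    return post / (post.sum() + 1e-12)    # 归一化为概率图

H, W = 32, 32
prior = np.full((H, W), 1.0 / (H * W))    # 无信息先验
for _ in range(3):                         # 逐次注视,后验逐步收敛
    lik = np.random.rand(H, W)             # 示意:检测器给出的似然
    prior = bayes_update(prior, lik)
    fixation = np.unravel_index(prior.argmax(), prior.shape)  # 下一注视点取后验峰值
```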

链接: https://arxiv.org/abs/2507.18503
作者: João Luzio,Alexandre Bernardino,Plinio Moreno
机构: Institute for Systems and Robotics (机器人系统研究所); Instituto Superior Técnico (理工学院); University of Lisbon (里斯本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in the 2025 IEEE International Conference on Development and Learning (ICDL)

点击查看摘要

Abstract:In goal-directed visual tasks, human perception is guided by both top-down and bottom-up cues. At the same time, foveal vision plays a crucial role in directing attention efficiently. Modern research on bio-inspired computational attention models has taken advantage of advancements in deep learning by utilizing human scanpath data to achieve new state-of-the-art performance. In this work, we assess the performance of SemBA-FAST, i.e. Semantic-based Bayesian Attention for Foveal Active visual Search Tasks, a top-down framework designed for predicting human visual attention in target-present visual search. SemBA-FAST integrates deep object detection with a probabilistic semantic fusion mechanism to generate attention maps dynamically, leveraging pre-trained detectors and artificial foveation to update top-down knowledge and improve fixation prediction sequentially. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset, comparing its performance against other scanpath prediction models. Our methodology achieves fixation sequences that closely match human ground-truth scanpaths. Notably, it surpasses baseline and other top-down approaches and competes, in some cases, with scanpath-informed models. These findings provide valuable insights into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling, with implications for real-time cognitive computing and robotics.
zh

[CV-23] Delving into Mapping Uncertainty for Mapless Trajectory Prediction IROS2025

【速读】:该论文旨在解决地图无关(mapless)自动驾驶中在线生成高精地图(High-Definition, HD maps)的不确定性如何有效融入轨迹预测任务的问题。当前方法虽尝试将地图不确定性纳入下游预测模型,但缺乏对特定驾驶场景下该不确定性有益性的深入理解与精准利用。其解决方案的关键在于提出一种基于本体感知(Proprioceptive)的情景门控机制(Scenario Gating),该机制根据自车未来运动状态(kinematic state)的预测结果,动态决定是否引入地图不确定性信息,从而实现轻量级、自监督式的不确定性融合;同时引入基于协方差的地图不确定性建模方法(Covariance-based Map Uncertainty),使其更贴合地图几何结构,显著提升轨迹预测性能,在nuScenes真实数据集上相较最先进方法最高提升23.6%。
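情景门控的一个最小示意如下(输入输出维度与网络结构均为笔者假设):门控系数由自车未来运动学的预测决定,以此调制是否把地图不确定性并入下游轨迹预测特征:

```python
import torch
import torch.nn as nn

class ProprioceptiveGate(nn.Module):
    """根据自车运动学预测,输出引入地图不确定性的门控权重(示意)。"""
    def __init__(self, kin_dim=6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(kin_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, ego_kinematics, map_mean, map_var):
        g = self.mlp(ego_kinematics)              # (B, 1),0~1 的门控系数
        # g 接近 1 时把不确定性并入地图特征;接近 0 时退化为仅用均值
        return torch.cat([map_mean, g.unsqueeze(1) * map_var], dim=-1)

gate = ProprioceptiveGate()
out = gate(torch.randn(4, 6), torch.randn(4, 100, 32), torch.rand(4, 100, 32))
print(out.shape)   # torch.Size([4, 100, 64])
```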

链接: https://arxiv.org/abs/2507.18498
作者: Zongzheng Zhang,Xuchong Qiu,Boran Zhang,Guantian Zheng,Xunjiang Gu,Guoxuan Chi,Huan-ang Gao,Leichen Wang,Ziming Liu,Xinrun Li,Igor Gilitschenski,Hongyang Li,Hang Zhao,Hao Zhao
机构: Institute for AI Industry Research (AIR), Tsinghua University (清华大学); Bosch Corporate Research (博世公司研究部门); Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University (清华大学); University of Toronto (多伦多大学); University of HongKong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IROS 2025, Project Page: this https URL

点击查看摘要

Abstract:Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent’s kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle’s future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at this https URL.
zh

[CV-24] Reinforced Embodied Active Defense: Exploiting Adaptive Interaction for Robust Visual Perception in Adversarial 3D Environments

【速读】:该论文旨在解决3D环境中对抗攻击对视觉感知系统可靠性构成的威胁,尤其是在身份验证和自动驾驶等安全敏感场景中,现有防御机制(如对抗训练和净化)多采用被动策略且依赖预设假设,难以适应动态变化的3D环境。解决方案的关键在于提出一种主动防御框架——强化嵌入式主动防御(Reinforced Embodied Active Defense, Rein-EAD),其核心创新包括:通过多步目标平衡即时预测准确率与预测熵最小化,优化跨多步的防御策略;引入基于不确定性的奖励塑造机制,提升策略更新效率并降低计算开销,同时无需可微分环境即可实现高效部署。实验表明,Rein-EAD显著降低了攻击成功率,同时保持标准任务性能,并展现出对未见及自适应攻击的良好泛化能力。
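下面以单步为例,示意“即时预测准确率 + 预测熵下降”这类不确定性导向奖励的构造方式(非论文原始公式,权重与形式均为笔者假设):

```python
import torch

def step_reward(logits, label, prev_entropy):
    """单步奖励(示意):当前真实类别置信度加上预测熵的下降量。"""
    prob = logits.softmax(-1)
    entropy = -(prob * prob.clamp_min(1e-8).log()).sum()
    reward = prob[label] + (prev_entropy - entropy)   # 熵降得越多,奖励越高
    return reward, entropy

logits = torch.randn(10)                  # 智能体在当前视角下的分类logits(示意)
r, h = step_reward(logits, label=3, prev_entropy=torch.tensor(2.3))
```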

链接: https://arxiv.org/abs/2507.18484
作者: Xiao Yang,Lingxuan Wu,Lizhong Wang,Chengyang Ying,Hang Su,Jun Zhu
机构: Tsinghua University (清华大学); Dept. of Comp. Sci. & Tech. (计算机科学与技术系); Institute for AI (人工智能研究院); BNRist Center (脑与智能研究中心); THBI Lab (清华-伯克利深圳学院实验室); Tsinghua-Bosch Joint Center for ML (清华-博世联合机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2404.00540

点击查看摘要

Abstract:Adversarial attacks in 3D environments have emerged as a critical threat to the reliability of visual perception systems, particularly in safety-sensitive applications such as identity verification and autonomous driving. These attacks employ adversarial patches and 3D objects to manipulate deep neural network (DNN) predictions by exploiting vulnerabilities within complex scenes. Existing defense mechanisms, such as adversarial training and purification, primarily employ passive strategies to enhance robustness. However, these approaches often rely on pre-defined assumptions about adversarial tactics, limiting their adaptability in dynamic 3D settings. To address these challenges, we introduce Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that leverages adaptive exploration and interaction with the environment to improve perception robustness in 3D adversarial contexts. By implementing a multi-step objective that balances immediate prediction accuracy with predictive entropy minimization, Rein-EAD optimizes defense strategies over a multi-step horizon. Additionally, Rein-EAD involves an uncertainty-oriented reward-shaping mechanism that facilitates efficient policy updates, thereby reducing computational overhead and supporting real-world applicability without the need for differentiable environments. Comprehensive experiments validate the effectiveness of Rein-EAD, demonstrating a substantial reduction in attack success rates while preserving standard accuracy across diverse tasks. Notably, Rein-EAD exhibits robust generalization to unseen and adaptive attacks, making it suitable for real-world complex tasks, including 3D object classification, face recognition and autonomous driving.
zh

[CV-25] A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears MICCAI2025

【速读】:该论文旨在解决疟原虫(Plasmodium falciparum)在吉姆萨染色血涂片中自动检测的准确性问题,其核心挑战在于高质量实例级标注数据的稀缺性。为应对这一限制,作者对公开的NIH疟疾数据集进行了增强,提供了符合COCO格式的详细边界框标注,从而支持基于深度学习的目标检测模型训练。解决方案的关键在于通过自动化标注优化结合针对性的人工校正,显著提升了标注数据的质量与一致性,使得训练出的Faster R-CNN模型在交叉验证中实现了高达0.88的F1分数,证明了该方法能够有效支撑高精度的疟原虫检测任务。
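拿到 COCO 格式的边界框标注后,用 torchvision 训练 Faster R-CNN 的骨架大致如下(类别划分与输入尺寸为示意,数据加载与COCO解析部分省略):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# 4 类:背景、感染红细胞、未感染红细胞、白细胞(类别编号为示意)
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes=4)

# 单步训练:targets 由COCO标注转换出的 boxes / labels 构成
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[30., 40., 90., 100.]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)    # 训练模式下返回各项损失
loss = sum(loss_dict.values())
loss.backward()
```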

链接: https://arxiv.org/abs/2507.18483
作者: Frauke Wilm,Luis Carlos Rivera Monroy,Mathias Öttl,Lukas Mürdter,Leonid Mill,Andreas Maier
机构: University of Tübingen (图宾根大学); German Cancer Research Center (德国癌症研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures, 2 tables, accepted at MICCAI 2025 Open Data

点击查看摘要

Abstract:Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated Malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotations set is publicly available via GitHub: this https URL.
zh

[CV-26] Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection

【速读】:该论文旨在解决医学图像中异常检测(anomaly detection in medical images)这一重要但极具挑战性的问题,其核心难点在于异常类型的多样性以及难以获取全面标注的数据集。为应对这一问题,作者提出了一种现代化的基于自编码器(autoencoder)的无监督框架——Q-Former Autoencoder,其关键创新在于:首先直接利用冻结的视觉基础模型(vision foundation models,如DINO、DINOv2和Masked Autoencoder)作为特征提取器,无需领域特定微调即可获得丰富且多层次的高阶表征;其次引入Q-Former架构作为瓶颈层,可灵活控制重建序列长度并高效聚合多尺度特征;此外,结合预训练Masked Autoencoder提取的感知损失(perceptual loss),引导重建过程聚焦于语义有意义的结构。实验表明,该方法在四个医学异常检测基准上均取得先进性能,验证了自然图像预训练视觉模型在医学图像分析中的强大泛化能力。
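Q-Former 瓶颈的核心是“可学习查询 + 交叉注意力”,重建序列长度由查询个数直接控制。下面是一个示意性实现(层数、维度均为笔者假设,省略了重建头与感知损失):

```python
import torch
import torch.nn as nn

class QFormerBottleneck(nn.Module):
    """用一组可学习查询对冻结编码器输出的patch特征做交叉注意力,
    得到长度可控的紧凑token序列(示意,非论文官方实现)。"""
    def __init__(self, n_query=16, dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_query, dim))
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=2)

    def forward(self, frozen_feats):
        q = self.queries.expand(frozen_feats.size(0), -1, -1)
        z, _ = self.cross(q, frozen_feats, frozen_feats)   # 聚合冻结特征
        return self.decoder(z)                             # 供重建使用的token序列

feats = torch.randn(4, 196, 256)   # 冻结DINO等基础模型输出的patch特征(示意)
recon_tokens = QFormerBottleneck()(feats)
# 异常分数可取重建结果与原特征的匹配误差;重建头与感知损失此处省略
```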

链接: https://arxiv.org/abs/2507.18481
作者: Francesco Dalmonte,Emirhan Bayar,Emre Akbas,Mariana-Iuliana Georgescu
机构: University of Bologna (博洛尼亚大学); Middle East Technical University (中东技术大学); Helmholtz Munich (赫尔姆霍兹慕尼黑)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated data sets. In this work, we tackle unsupervised medical anomaly detection proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose the usage of the Q-Former architecture as the bottleneck, which enables the control of the length of the reconstruction sequence, while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at this https URL.
zh

[CV-27] CRUISE: Cooperative Reconstruction and Editing in V2X Scenarios using Gaussian Splatting IROS2025

【速读】:该论文旨在解决车联网(Vehicle-to-everything, V2X)场景中数据生成与增强的瓶颈问题,尤其是在自动驾驶系统训练和评估中缺乏高质量、多样化的多视角数据。其核心挑战在于如何高效重建真实世界V2X环境,并支持灵活编辑以生成具有挑战性的边缘案例。解决方案的关键是提出CRUISE框架,该框架采用分解高斯点绘(decomposed Gaussian Splatting)技术实现高保真度的实景重建,并将动态交通参与者(如车辆、行人)分解为可编辑的高斯表示,从而实现驾驶场景的无缝修改与扩展;同时,该框架能够从自车和基础设施双视角渲染图像,显著提升大规模V2X数据集的生成能力,有效改善3D目标检测与协同跟踪性能。

链接: https://arxiv.org/abs/2507.18473
作者: Haoran Xu,Saining Zhang,Peishuo Li,Baijun Ye,Xiaoxue Chen,Huan-ang Gao,Jv Zheng,Xiaowei Song,Ziqiao Peng,Run Miao,Jinrang Jia,Yifeng Shi,Guangqi Yi,Hang Zhao,Hao Tang,Hongyang Li,Kaicheng Yu,Hao Zhao
机构: Institute for AI Industry Research (AIR), Tsinghua University; Beijing Institute of Technology; Nanyang Technological University; Tsinghua University; Renmin University of China; Beijing University of Technology; Baidu Inc; Peking University; Shanghai AI Lab; Westlake University; Beijing Academy of Artificial Intelligence
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2025, Code: this https URL

点击查看摘要

Abstract:Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction-and-synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while supporting flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure views, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: 1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; 2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking on the V2X-Seq benchmark; and 3) CRUISE effectively generates challenging corner cases.
zh

[CV-28] Revisiting Physically Realizable Adversarial Object Attack against LiDAR-based Detection: Clarifying Problem Formulation and Experimental Protocols

【速读】:该论文旨在解决LiDAR-based 3D目标检测中物理对抗样本攻击(physical adversarial object attacks)的可复现性差与缺乏标准化评估框架的问题。当前多数数字攻击方法虽能扰动点云或网格,但难以在现实世界中物理实现;而物理攻击因设备差异和实验设置不一致导致结果不可重现。解决方案的关键在于提出一种设备无关(device-agnostic)、标准化的攻击框架,抽象出物理对抗攻击的核心要素,支持多种攻击方法,并提供仿真与真实场景下的基准测试协议及开源代码,从而实现公平比较、加速研究进程,并验证了从仿真到物理LiDAR系统的攻击迁移有效性。

链接: https://arxiv.org/abs/2507.18457
作者: Luo Cheng,Hanwei Zhang,Lijun Zhang,Holger Hermanns
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial robustness in LiDAR-based 3D object detection is a critical research area due to its widespread application in real-world scenarios. While many digital attacks manipulate point clouds or meshes, they often lack physical realizability, limiting their practical impact. Physical adversarial object attacks remain underexplored and suffer from poor reproducibility due to inconsistent setups and hardware differences. To address this, we propose a device-agnostic, standardized framework that abstracts key elements of physical adversarial object attacks, supports diverse methods, and provides open-source code with benchmarking protocols in simulation and real-world settings. Our framework enables fair comparison, accelerates research, and is validated by successfully transferring simulated attacks to a physical LiDAR system. Beyond the framework, we offer insights into factors influencing attack success and advance understanding of adversarial robustness in real-world LiDAR perception.
zh

[CV-29] PDB-Eval: An Evaluation of Large Multimodal Models for Description and Explanation of Personalized Driving Behavior

【速读】:该论文旨在解决现有数据集在基于外部视觉证据描述和解释车辆行为方面的局限性,从而提升大型多模态模型(MLLMs)对个性化驾驶行为的理解与推理能力。其核心问题在于如何通过细粒度的视觉解释来增强MLLMs在驾驶场景中的泛化能力和任务表现。解决方案的关键在于构建一个名为PDB-Eval的基准测试体系,包含两个核心组件:PDB-X用于评估MLLMs对时序驾驶场景的理解能力,PDB-QA则作为视觉解释问答任务,用于指导MLLMs进行指令微调。该方法通过引入细粒度描述与解释训练,有效缩小了MLLMs与驾驶领域之间的差距,显著提升了零样本问答任务性能(最高达73.2%),并在Brain4Cars意图预测和AIDE识别任务中分别实现最高12.5%和11.0%的性能改进。

链接: https://arxiv.org/abs/2507.18447
作者: Junda Wu,Jessica Echterhoff,Kyungtae Han,Amr Abdelraouf,Rohit Gupta,Julian McAuley
机构: University of California San Diego (加州大学圣地亚哥分校); Toyota Motor North America (丰田汽车北美公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding a driver’s behavior and intentions is important for potential risk assessment and early accident prevention. Safety and driver assistance systems can be tailored to individual drivers’ behavior, significantly enhancing their effectiveness. However, existing datasets are limited in describing and explaining general vehicle movements based on external visual evidence. This paper introduces a benchmark, PDB-Eval, for a detailed understanding of Personalized Driver Behavior, and aligning Large Multimodal Models (MLLMs) with driving comprehension and reasoning. Our benchmark consists of two main components, PDB-X and PDB-QA. PDB-X can evaluate MLLMs’ understanding of temporal driving scenes. Our dataset is designed to find valid visual evidence from the external view to explain the driver’s behavior from the internal view. To align MLLMs’ reasoning abilities with driving tasks, we propose PDB-QA as a visual explanation question-answering task for MLLM instruction fine-tuning. As a generic learning task for generative models like MLLMs, PDB-QA can bridge the domain gap without harming MLLMs’ generalizability. Our evaluation indicates that fine-tuning MLLMs on fine-grained descriptions and explanations can effectively bridge the gap between MLLMs and the driving domain, which improves zero-shot performance on question-answering tasks by up to 73.2%. We further evaluate the MLLMs fine-tuned on PDB-X in Brain4Cars’ intention prediction and AIDE’s recognition tasks. We observe up to 12.5% performance improvements on the turn intention prediction task in Brain4Cars, and consistent performance improvements up to 11.0% on all tasks in AIDE.
zh

[CV-30] DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition

【速读】:该论文旨在解决视觉位置识别(Visual Place Recognition, VPR)在不同环境条件和视角变化下性能不稳定的问题。其核心解决方案是提出一个融合双尺度Transformer(Dual-Scale-Former, DSFormer)与创新块聚类策略的新框架:DSFormer通过双向信息传递机制,在双尺度特征间实现自注意力和跨尺度注意力,从而同时捕获语义丰富性和空间细节;块聚类策略则对San Francisco eXtra Large(SF-XL)数据集进行多视角重分区,优化训练数据组织以增强对视角变化的鲁棒性。两项创新共同提升了全局嵌入的适应能力,并将训练数据量减少约30%,同时在多个基准测试中实现了当前最优的全局检索性能及更高的计算效率。
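双尺度交叉学习模块的骨架可以写成如下形式(示意:自注意力在各尺度内独立、交叉注意力跨尺度共享,具体细节以论文为准):

```python
import torch
import torch.nn as nn

class DualScaleCross(nn.Module):
    """双尺度特征间的双向信息传递(示意)。"""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)  # 跨尺度共享

    def forward(self, fa, fb):   # fa / fb: 末两层CNN特征展平后的token序列
        fa = fa + self.self_a(fa, fa, fa)[0]   # 尺度内长程依赖
        fb = fb + self.self_b(fb, fb, fb)[0]
        fa2 = fa + self.cross(fa, fb, fb)[0]   # b -> a
        fb2 = fb + self.cross(fb, fa, fa)[0]   # a -> b(共享交叉注意力权重)
        return fa2, fb2

fa, fb = DualScaleCross()(torch.randn(2, 196, 256), torch.randn(2, 49, 256))
```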

链接: https://arxiv.org/abs/2507.18444
作者: Haiyang Jiang,Songhao Piao,Chao Gao,Lei Yu,Liguo Chen
机构: Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学); Wuhan University (武汉大学); Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.
zh

[CV-31] NLML-HPE: Head Pose Estimation with Limited Data via Manifold Learning

【速读】:该论文旨在解决在训练数据有限条件下,如何实现高精度且实时的头姿态估计(Head Pose Estimation, HPE)问题。其关键解决方案在于提出一种基于非线性流形学习(Non-Linear Manifold Learning, NLML)的深度学习方法——NLML-HPE,该方法将HPE建模为回归问题而非传统分类方法,并结合张量分解(Tucker decomposition)与前馈神经网络,通过将每个欧拉角(yaw, pitch, roll)映射到独立子空间并用余弦曲线拟合潜在流形结构,从而有效捕捉面部关键点到姿态角之间的连续映射关系。此外,作者通过旋转3D头模型生成精确标注的2D图像数据集以缓解真实数据中姿态标注不准确的问题,最终实现了在小样本场景下的高精度与实时预测性能。
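“把每个欧拉角子空间的流形建模为余弦曲线”可用如下玩具示例说明(数据为合成的带噪角度,用 SciPy 拟合;实际潜坐标由 Tucker 分解与前馈网络给出):

```python
import numpy as np
from scipy.optimize import curve_fit

# 将某一欧拉角子空间的1维潜变量 t 与角度 theta 的关系拟合为余弦曲线(示意)
def cosine_curve(t, a, w, phi, c):
    return a * np.cos(w * t + phi) + c

t = np.linspace(-1, 1, 50)                                  # 潜坐标(示意)
theta = 60 * np.cos(1.5 * t + 0.2) + np.random.randn(50)    # 带噪的角度标注
params, _ = curve_fit(cosine_curve, t, theta, p0=[50, 1, 0, 0])

# 推理时:landmarks -> 潜坐标 t(前馈网络)-> 余弦曲线解码出角度
print(cosine_curve(0.3, *params))
```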

链接: https://arxiv.org/abs/2507.18429
作者: Mahdi Ghafourian,Federico M. Sukno
机构: Universitat Pompeu Fabra (庞培法布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Head pose estimation (HPE) plays a critical role in various computer vision applications such as human-computer interaction and facial recognition. In this paper, we propose a novel deep learning approach for head pose estimation with limited training data via non-linear manifold learning called NLML-HPE. This method is based on the combination of tensor decomposition (i.e., Tucker decomposition) and feed forward neural networks. Unlike traditional classification-based approaches, our method formulates head pose estimation as a regression problem, mapping input landmarks into a continuous representation of pose angles. To this end, our method uses tensor decomposition to split each Euler angle (yaw, pitch, roll) to separate subspaces and models each dimension of the underlying manifold as a cosine curve. We address two key challenges: 1. Almost all HPE datasets suffer from incorrect and inaccurate pose annotations. Hence, we generated a precise and consistent 2D head pose dataset for our training set by rotating 3D head models for a fixed set of poses and rendering the corresponding 2D images. 2. We achieved real-time performance with limited training data as our method accurately captures the nature of rotation of an object from facial landmarks. Once the underlying manifold for rotation around each axis is learned, the model is very fast in predicting unseen data. Our training and testing code is available online along with our trained models: https://github.com/MahdiGhafoorian/NLML_HPE.
zh

[CV-32] Self-Supervised Ultrasound-Video Segmentation with Feature Prediction and 3D Localised Loss

【速读】:该论文旨在解决超声成像中大规模标注数据获取困难的问题,其核心挑战在于图像低对比度、高噪声及伪影干扰,导致数据标注耗时且依赖临床专家经验。为应对这一问题,作者提出基于视频联合嵌入预测架构(V-JEPA)的自监督学习方法,该方法通过特征预测而非像素级重建或负样本对比,有效缓解了对噪声敏感的问题,并充分利用视频序列中的时序信息。其关键创新在于引入一种新颖的3D定位辅助任务,在V-JEPA预训练阶段增强视觉Transformer(ViT)模型的空间局部性理解能力,从而提升小样本医疗数据下的分割性能,实验表明该方案在仅使用10%训练数据时可实现高达8.35%的性能增益。
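3D 定位辅助任务可实现为“对每个 token 预测其 (t, y, x) 网格下标”的分类头,下面是笔者假设的一个最小版本(网格尺寸与特征维度为示意):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Localisation3DHead(nn.Module):
    """辅助任务头(示意):迫使ViT表征保留token的时空位置信息。"""
    def __init__(self, dim=384, grid=(8, 14, 14)):
        super().__init__()
        self.grid = grid
        self.head = nn.Linear(dim, sum(grid))   # 三个轴各一个分类器,拼接输出

    def forward(self, tokens):                  # tokens: (B, N, dim),N = t*y*x
        B, N, _ = tokens.shape
        logits = self.head(tokens).split(self.grid, dim=-1)
        idx = torch.arange(N, device=tokens.device)
        t = idx // (self.grid[1] * self.grid[2])
        yx = idx % (self.grid[1] * self.grid[2])
        targets = [t, yx // self.grid[2], yx % self.grid[2]]
        return sum(F.cross_entropy(l.reshape(B * N, -1), tgt.repeat(B))
                   for l, tgt in zip(logits, targets))

loss = Localisation3DHead()(torch.randn(2, 8 * 14 * 14, 384))
```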

链接: https://arxiv.org/abs/2507.18424
作者: Edward Ellis,Robert Mendel,Andrew Bulpitt,Nasim Parsa,Michael F Byrne,Sharib Ali
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Acquiring and annotating large datasets in ultrasound imaging is challenging due to low contrast, high noise, and susceptibility to artefacts. This process requires significant time and clinical expertise. Self-supervised learning (SSL) offers a promising solution by leveraging unlabelled data to learn useful representations, enabling improved segmentation performance when annotated data is limited. Recent state-of-the-art developments in SSL for video data include V-JEPA, a framework solely based on feature prediction, avoiding pixel level reconstruction or negative samples. We hypothesise that V-JEPA is well-suited to ultrasound imaging, as it is less sensitive to noisy pixel-level detail while effectively leveraging temporal information. To the best of our knowledge, this is the first study to adopt V-JEPA for ultrasound video data. Similar to other patch-based masking SSL techniques such as VideoMAE, V-JEPA is well-suited to ViT-based models. However, ViTs can underperform on small medical datasets due to lack of inductive biases, limited spatial locality and absence of hierarchical feature learning. To improve locality understanding, we propose a novel 3D localisation auxiliary task to improve locality in ViT representations during V-JEPA pre-training. Our results show V-JEPA with our auxiliary task improves segmentation performance significantly across various frozen encoder configurations, with gains up to 3.4% using 100% and up to 8.35% using only 10% of the training data.
zh

[CV-33] DCFFSNet: Deep Connectivity Feature Fusion Separation Network for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中因拓扑连通性(topological connectivity)特征融合方式不当导致的边缘精度不足与区域一致性差的问题。现有深度网络通常将连通性信息强制作为附加模块注入,造成特征空间耦合且缺乏量化不同特征强度的标准机制。其解决方案的关键在于提出DCFFSNet(Dual-Connectivity Feature Fusion-Separation Network),通过创新性的特征空间解耦策略,量化连通性特征与其他特征之间的相对强度,并构建深层的连通性特征融合-分离架构,从而动态平衡多尺度特征表达,有效缓解分割碎片化问题并实现平滑的边缘过渡,显著提升临床可用性。

链接: https://arxiv.org/abs/2507.18407
作者: Xun Ye,Ruixiang Tang,Mingda Zhang,Jianglong Qin
机构: Yunnan University (云南大学); School of Software (软件学院); Yunnan Provincial Key Laboratory of Software Engineering (云南省软件工程重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages , 11 figures

点击查看摘要

Abstract:Medical image segmentation leverages topological connectivity theory to enhance edge precision and regional consistency. However, existing deep networks integrating connectivity often forcibly inject it as an additional feature module, resulting in coupled feature spaces with no standardized mechanism to quantify different feature strengths. To address these issues, we propose DCFFSNet (Dual-Connectivity Feature Fusion-Separation Network). It introduces an innovative feature space decoupling strategy. This strategy quantifies the relative strength between connectivity features and other features. It then builds a deep connectivity feature fusion-separation architecture. This architecture dynamically balances multi-scale feature expression. Experiments were conducted on the ISIC2018, DSB2018, and MoNuSeg datasets. On ISIC2018, DCFFSNet outperformed the next best model (CMUNet) by 1.3% (Dice) and 1.2% (IoU). On DSB2018, it surpassed TransUNet by 0.7% (Dice) and 0.9% (IoU). On MoNuSeg, it exceeded CSCAUNet by 0.8% (Dice) and 0.9% (IoU). The results demonstrate that DCFFSNet exceeds existing mainstream methods across all metrics. It effectively resolves segmentation fragmentation and achieves smooth edge transitions. This significantly enhances clinical usability.
zh

[CV-34] Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

【速读】:该论文旨在解决传统视觉Transformer(如Swin Transformer)在实现全局信息交互时效率低下的问题,即需要两个连续的模块才能近似全局注意力机制,从而限制了模型在低分辨率到高分辨率迁移中的灵活性与性能。其解决方案的关键在于提出Iwin Transformer,一种无需位置嵌入(position-embedding-free)的分层视觉Transformer架构,通过创新性的交错窗口注意力(interleaved window attention)与深度可分离卷积(depthwise separable convolution)协同工作:前者利用注意力机制连接远距离token以实现全局信息交换,后者则通过局部卷积增强邻近token的关联性,使得单个模块即可完成全局信息融合,显著提升了模型在图像分类、语义分割和视频动作识别等任务上的表现,并具备良好的可迁移性,例如可直接替换生成式AI(Generative AI)中的自注意力模块。
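交错窗口注意力的关键在于划分方式:把相距 stride 的 token 分进同一窗口,使窗口内的注意力天然连接远距离位置,近邻关系则交给深度可分离卷积。下面是笔者对这一划分的示意实现(与论文实现未必一致):

```python
import torch

def interleave_windows(x, stride):
    """交错窗口划分(示意):每个窗口是跨越全图的稀疏网格。x: (B, H, W, C)。"""
    B, H, W, C = x.shape
    x = x.view(B, H // stride, stride, W // stride, stride, C)
    # 以(行内偏移, 列内偏移)作为窗口编号,窗口内token间距为 stride
    windows = x.permute(0, 2, 4, 1, 3, 5).reshape(B * stride * stride,
                                                  (H // stride) * (W // stride), C)
    return windows   # 每个窗口可直接送入标准多头自注意力

x = torch.randn(2, 16, 16, 32)
w = interleave_windows(x, stride=4)
print(w.shape)       # torch.Size([32, 16, 32])
```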

链接: https://arxiv.org/abs/2507.18405
作者: Simin Huo,Ning Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 10 figures, Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

点击查看摘要

Abstract:We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer’s limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at this https URL.
zh

[CV-35] HumanMaterial: Human Material Estimation from a Single Image via Progressive Training

【速读】:该论文旨在解决全身体逆渲染(full-body Human inverse rendering)中因材料映射(material maps)缺乏约束而导致的病态问题(ill-posed task),从而实现高保真材质估计以支持任意光照下的照片级真实感渲染。现有方法依赖简化材质数据和渲染方程,导致皮肤等复杂材质的渲染真实感不足。解决方案的关键在于:1)构建高质量数据集 OpenHumanBRDF,融合扫描真实数据与统计材质数据,新增位移(displacement)和次表面散射(subsurface scattering)等关键材质通道以提升皮肤等材质的渲染真实性;2)设计具有渐进训练策略的 HumanMaterial 模型,通过三个先验模型初步估计不同材质图,并引入受控物理基础渲染(Controlled PBR Rendering, CPR)损失函数,在训练过程中动态增强各材质图对渲染结果的重要性权重,从而优化监督信号分配并缓解模型欠拟合问题,最终在 OpenHumanBRDF 数据集和真实数据上实现当前最优性能。
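受控 PBR 渲染(CPR)损失的思想是:训练某个先验模型时放大其负责材质对渲染误差的影响。下面用“冻结其余材质 + 加权渲染误差”的方式给出一个玩具化示意(渲染器、材质集合与权重均为笔者假设,并非论文原始公式):

```python
import torch

def cpr_loss(render_fn, mats, gt_image, weights):
    """CPR损失(示意):对第 k 项,只让材质 k 接收梯度,其余材质detach。
    mats: {名字: 材质图张量};weights: {名字: 权重}。"""
    loss = 0.0
    for name, w in weights.items():
        frozen = {k: (v if k == name else v.detach()) for k, v in mats.items()}
        loss = loss + w * (render_fn(**frozen) - gt_image).abs().mean()
    return loss

mats = {"albedo": torch.rand(3, 8, 8, requires_grad=True),
        "roughness": torch.rand(1, 8, 8, requires_grad=True)}
render = lambda **m: m["albedo"] * (1 - m["roughness"])   # 玩具渲染器
loss = cpr_loss(render, mats, torch.rand(3, 8, 8),
                {"albedo": 1.0, "roughness": 0.3})
loss.backward()
```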

链接: https://arxiv.org/abs/2507.18385
作者: Yu Jiang,Jiahao Xia,Jiongming Qin,Yusen Wang,Tuo Cao,Chunxia Xiao
机构: Wuhan University (武汉大学); University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14

点击查看摘要

Abstract:Full-body Human inverse rendering based on physically-based rendering aims to acquire high-quality materials, which helps achieve photo-realistic rendering under arbitrary illuminations. This task requires estimating multiple material maps and usually relies on the constraint of rendering result. The absence of constraints on the material maps makes inverse rendering an ill-posed task. Previous works alleviated this problem by building material dataset for training, but their simplified material data and rendering equation lead to rendering results with limited realism, especially that of skin. To further alleviate this problem, we construct a higher-quality dataset (OpenHumanBRDF) based on scanned real data and statistical material data. In addition to the normal, diffuse albedo, roughness, specular albedo, we produce displacement and subsurface scattering to enhance the realism of rendering results, especially for the skin. With the increase in prediction tasks for more materials, using an end-to-end model as in the previous work struggles to balance the importance among various material maps, and leads to model underfitting. Therefore, we design a model (HumanMaterial) with progressive training strategy to make full use of the supervision information of the material maps and improve the performance of material estimation. HumanMaterial first obtain the initial material results via three prior models, and then refine the results by a finetuning model. Prior models estimate different material maps, and each map has different significance for rendering results. Thus, we design a Controlled PBR Rendering (CPR) loss, which enhances the importance of the materials to be optimized during the training of prior models. Extensive experiments on OpenHumanBRDF dataset and real data demonstrate that our method achieves state-of-the-art performance.
zh

[CV-36] Towards Consistent Long-Term Pose Generation

【速读】:该论文旨在解决当前姿态生成(pose generation)方法中因依赖中间表示而导致的性能下降问题,尤其是在长期姿态生成场景下难以保持时间连贯性(temporal coherence)的挑战。现有方法通常采用两阶段流水线或自回归模型,前者通过量化引入离散表示,后者在推理过程中累积误差,限制了生成质量。其解决方案的关键在于提出一种全新的单阶段架构,直接在连续坐标空间中从最小上下文(单张RGB图像与文本描述)生成姿态序列,无需中间表示或基于标记(token-based)的生成机制;核心创新包括:1)基于相对运动预测机制(relative movement prediction mechanism),以保持空间关系并实现连续坐标上的直接操作;2)统一占位符标记(unified placeholder token)策略,确保训练与推理阶段行为一致,从而支持单前向传播生成,显著提升长程生成的稳定性和准确性。
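相对运动预测机制在解码端对应一次简单的累加:模型单次前向输出整段序列的相对位移,再还原为绝对姿态坐标。示意如下(关键点数与帧数为假设):

```python
import torch

def decode_relative_motion(init_pose, deltas):
    """相对运动预测的解码(示意):对相对位移沿时间累加得到绝对姿态,
    在连续坐标空间中保持空间关系。init_pose: (J, 2);deltas: (T, J, 2)。"""
    return init_pose.unsqueeze(0) + torch.cumsum(deltas, dim=0)   # (T, J, 2)

init = torch.rand(17, 2)                # 17个关键点的初始位置(示意)
deltas = 0.01 * torch.randn(30, 17, 2)  # 模型一次前向输出的30帧相对位移
poses = decode_relative_motion(init, deltas)
```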

链接: https://arxiv.org/abs/2507.18382
作者: Yayuan Li,Filippos Bellos,Jason Corso
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Current approaches to pose generation rely heavily on intermediate representations, either through two-stage pipelines with quantization or autoregressive models that accumulate errors during inference. This fundamental limitation leads to degraded performance, particularly in long-term pose generation where maintaining temporal coherence is crucial. We propose a novel one-stage architecture that directly generates poses in continuous coordinate space from minimal context - a single RGB image and text description - while maintaining consistent distributions between training and inference. Our key innovation is eliminating the need for intermediate representations or token-based generation by operating directly on pose coordinates through a relative movement prediction mechanism that preserves spatial relationships, and a unified placeholder token approach that enables single-forward generation with identical behavior during training and inference. Through extensive experiments on Penn Action and First-Person Hand Action Benchmark (F-PHAB) datasets, we demonstrate that our approach significantly outperforms existing quantization-based and autoregressive methods, especially in long-term generation scenarios.
zh

[CV-37] Towards Effective Human-in-the-Loop Assistive AI Agents

【速读】:该论文旨在解决人-AI协作在物理任务完成中的评估难题,尤其是如何量化AI引导对人类执行程序性任务的性能提升、错误减少及学习效果的影响。其解决方案的关键在于构建了一个多模态的人-AI交互评估框架与数据集,并开发了一种配备增强现实(AR)技术的AI代理,能够在真实场景中提供交互式指导(如烹饪到战场医疗),通过实证研究验证了AI辅助协作可显著提升任务完成效率与准确性。

链接: https://arxiv.org/abs/2507.18374
作者: Filippos Bellos,Yayuan Li,Cary Shu,Ruey Day,Jeffrey M. Siskind,Jason J. Corso
机构: University of Michigan (密歇根大学); Purdue University (普渡大学); Voxel51
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Effective human-AI collaboration for physical task completion has significant potential in both everyday activities and professional domains. AI agents equipped with informative guidance can enhance human performance, but evaluating such collaboration remains challenging due to the complexity of human-in-the-loop interactions. In this work, we introduce an evaluation framework and a multimodal dataset of human-AI interactions designed to assess how AI guidance affects procedural task performance, error reduction and learning outcomes. Besides, we develop an augmented reality (AR)-equipped AI agent that provides interactive guidance in real-world tasks, from cooking to battlefield medicine. Through human studies, we share empirical insights into AI-assisted human performance and demonstrate that AI-assisted collaboration improves task completion.
zh

[CV-38] MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image

【速读】:该论文旨在解决从单张静态图像生成高保真、时序一致的动态4D内容(即包含空间和时间维度的三维场景)这一挑战,尤其针对现有基于4D高斯溅射(4D Gaussian Splatting, 4D GS)方法中存在的运动不连续性和背景退化问题。解决方案的关键在于提出MVG4D框架,其核心创新是引入一个图像矩阵模块(image matrix module),通过多视角合成生成时空一致且空间多样化的多视图图像,为下游3D与4D重建提供丰富的监督信号;随后利用这些多视图图像优化一个3D高斯点云,并通过轻量级形变网络将其扩展至时间维度,从而显著提升时序一致性、几何保真度与视觉真实感,同时减少闪烁伪影并增强结构细节,实现高效可控的4D生成。

链接: https://arxiv.org/abs/2507.18371
作者: Xiaotian Chen,DongFu Yin,Fei Richard Yu,Xuanchen Li,Xinhao Zhang
机构: Shenzhen University (深圳大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (广东省人工智能与数字经济实验室(深圳)); Tsinghua University (清华大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advances in generative modeling have significantly enhanced digital content creation, extending from 2D images to complex 3D and 4D scenes. Despite substantial progress, producing high-fidelity and temporally consistent dynamic 4D content remains a challenge. In this paper, we propose MVG4D, a novel framework that generates dynamic 4D content from a single still image by combining multi-view synthesis with 4D Gaussian Splatting (4D GS). At its core, MVG4D employs an image matrix module that synthesizes temporally coherent and spatially diverse multi-view images, providing rich supervisory signals for downstream 3D and 4D reconstruction. These multi-view images are used to optimize a 3D Gaussian point cloud, which is further extended into the temporal domain via a lightweight deformation network. Our method effectively enhances temporal consistency, geometric fidelity, and visual realism, addressing key challenges in motion discontinuity and background degradation that affect prior 4D GS-based methods. Extensive experiments on the Objaverse dataset demonstrate that MVG4D outperforms state-of-the-art baselines in CLIP-I, PSNR, FVD, and time efficiency. Notably, it reduces flickering artifacts and sharpens structural details across views and time, enabling more immersive AR/VR experiences. MVG4D sets a new direction for efficient and controllable 4D generation from minimal inputs.
zh

[CV-39] Deformable Convolution Module with Globally Learned Relative Offsets for Fundus Vessel Segmentation

【速读】:该论文旨在解决传统卷积神经网络在处理具有全局自相似复杂边缘结构(如眼底血管)时,因固定局部感受野导致的特征捕捉能力不足的问题。解决方案的关键在于提出一种新型可变形卷积模块,该模块通过注意力机制与前馈网络学习亚像素级位移场,对所有通道的特征图进行自适应形变,而非直接改变卷积核形状;这相当于对采样网格进行相对变形,从而实现全局特征形变并解耦卷积核尺寸与网络学习能力。该模块作为即插即用组件集成至GDCUnet模型中,在眼底血管分割任务上取得当前最优性能,验证了其在复杂全局结构建模中的有效性。
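“学习亚像素位移场并对特征图整体重采样”可以用 grid_sample 实现,如下示意(位移场此处用随机数代替注意力与前馈网络的输出):

```python
import torch
import torch.nn.functional as F

def warp_features(feat, offsets):
    """按位移场对所有通道的特征图整体重采样,等价于采样网格的相对形变(示意)。
    feat: (B, C, H, W);offsets: (B, 2, H, W),单位为像素。"""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(feat)     # (H, W, 2),顺序为(x, y)
    grid = base + offsets.permute(0, 2, 3, 1)         # 叠加学到的位移场
    gx = 2 * grid[..., 0] / (W - 1) - 1               # 归一化到 [-1, 1]
    gy = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(feat, torch.stack((gx, gy), -1), align_corners=True)

feat = torch.randn(1, 8, 32, 32)
offsets = 0.5 * torch.randn(1, 2, 32, 32)   # 实际由注意力+前馈网络预测(示意)
warped = warp_features(feat, offsets)        # 形变后的特征再接普通卷积即可
```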

链接: https://arxiv.org/abs/2507.18354
作者: Lexuan Zhu,Yuxuan Li,Yuning Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deformable convolution can adaptively change the shape of convolution kernel by learning offsets to deal with complex shape features. We propose a novel plug and play deformable convolutional module that uses attention and feedforward networks to learn offsets, so that the deformable patterns can capture long-distance global features. Compared with previously existing deformable convolutions, the proposed module learns the sub pixel displacement field and adaptively warps the feature maps across all channels rather than directly deforms the convolution kernel, which is equivalent to a relative deformation of the kernel sampling grids, achieving global feature deformation and the decoupling of kernel size and learning network. Considering that the fundus blood vessels have globally self similar complex edges, we design a deep learning model for fundus blood vessel segmentation, GDCUnet, based on the proposed convolutional module. Empirical evaluations under the same configuration and unified framework show that GDCUnet has achieved state of the art performance on public datasets. Further ablation experiments demonstrated that the proposed deformable convolutional module could more significantly learn the complex features of fundus blood vessels, enhancing the model representation and generalization ability. Since the proposed module is similar to the interface of conventional convolution, we suggest applying it to more machine vision tasks with complex global self similar features.
zh

[CV-40] VB-Mitigator: An Open-source Framework for Evaluating and Advancing Visual Bias Mitigation

【速读】:该论文旨在解决计算机视觉模型中存在的偏见问题(bias),这一问题导致AI系统在公平性、可靠性及泛化能力方面表现不佳。现有研究受限于方法实现碎片化和评估标准不统一,使得不同技术的可复现性和公平比较困难。解决方案的关键在于提出一个名为Visual Bias Mitigator (VB-Mitigator) 的开源框架,该框架提供了一个统一的研究环境,集成12种成熟的偏见缓解方法和7个多样化的基准数据集,并具备良好的可扩展性,支持新增方法、数据集、指标与模型的无缝接入,从而推动公平感知的计算机视觉模型研究发展。

链接: https://arxiv.org/abs/2507.18348
作者: Ioannis Sarridis,Christos Koutlis,Symeon Papadopoulos,Christos Diou
机构: Information Technologies Institute, CERTH, Greece(希腊CERTH信息与技术研究所); Department of Informatics and Telematics, Harokopio University of Athens, Greece(希腊雅典大学信息与电信系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Bias in computer vision models remains a significant challenge, often resulting in unfair, unreliable, and non-generalizable AI systems. Although research into bias mitigation has intensified, progress continues to be hindered by fragmented implementations and inconsistent evaluation practices. Disparate datasets and metrics used across studies complicate reproducibility, making it difficult to fairly assess and compare the effectiveness of various approaches. To overcome these limitations, we introduce the Visual Bias Mitigator (VB-Mitigator), an open-source framework designed to streamline the development, evaluation, and comparative analysis of visual bias mitigation techniques. VB-Mitigator offers a unified research environment encompassing 12 established mitigation methods, 7 diverse benchmark datasets. A key strength of VB-Mitigator is its extensibility, allowing for seamless integration of additional methods, datasets, metrics, and models. VB-Mitigator aims to accelerate research toward fairness-aware computer vision models by serving as a foundational codebase for the research community to develop and assess their approaches. To this end, we also recommend best evaluation practices and provide a comprehensive performance comparison among state-of-the-art methodologies.
zh

[CV-41] EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在第一人称(egocentric)与第三人称(exocentric)视角之间进行知识迁移与推理能力不足的问题。当前MLLMs虽在单视角任务中表现优异,但在跨视角语义对齐、视点关联及时间动态推理等关键能力上存在明显短板。解决方案的关键在于提出EgoExoBench——首个针对第一人称-第三人称视频理解与推理的基准测试集,涵盖7,300余个问答对,覆盖11个子任务和三大核心挑战:语义对齐(semantic alignment)、视点关联(viewpoint association)与时间推理(temporal reasoning),从而系统性评估并推动模型实现类人跨视角智能。

链接: https://arxiv.org/abs/2507.18342
作者: Yuping He,Yifei Huang,Guo Chen,Baoqi Pei,Jilan Xu,Tong Lu,Jiangmiao Pang
机构: Nanjing University (南京大学); Shanghai AI Laboratory (上海人工智能实验室); The University of Tokyo (东京大学); Zhejiang University (浙江大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.
zh

[CV-42] Improving Bird Classification with Primary Color Additives INTERSPEECH2025

【速读】:该论文旨在解决基于鸟类鸣叫声进行物种分类的问题,这一任务因环境噪声、鸣声重叠以及标签缺失等因素而极具挑战性,尤其在信噪比(SNR)低或存在多物种共同发声的场景下,现有模型性能显著下降。其解决方案的关键在于引入“动机”(motif)概念,即通过可视化音高模式、速度和重复特性来增强特征区分度,并进一步采用主色调叠加技术将频率信息嵌入到频谱图(spectrogram)中,从而提升不同物种间的可分性。实验表明,该方法相比无颜色化处理的模型在F1分数、ROC-AUC和CMAP指标上分别提升了7.3%、6.2%和6.6%,验证了频率信息颜色化对分类性能的显著增益作用。
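“三原色叠加嵌入频率信息”的一个直观实现:按频带高低给频谱图的行分配红/绿/蓝权重,再与幅度相乘得到彩色图(具体颜色映射方案为笔者假设):

```python
import numpy as np

def colorize_spectrogram(spec):
    """把频率信息以三原色叠加进频谱图(示意):低频偏红、中频偏绿、高频偏蓝。
    spec: (F, T) 归一化幅度谱。"""
    F_, T = spec.shape
    pos = np.linspace(0.0, 1.0, F_)[:, None]             # 每个频带的归一化频率
    r = np.clip(1 - 2 * pos, 0, 1)                       # 低频权重
    g = np.clip(1 - np.abs(2 * pos - 1) * 2, 0, 1)       # 中频权重
    b = np.clip(2 * pos - 1, 0, 1)                       # 高频权重
    rgb = np.stack([r * spec, g * spec, b * spec], axis=-1)   # (F, T, 3)
    return rgb / (rgb.max() + 1e-8)

spec = np.abs(np.random.randn(128, 256))    # 示意用的随机“频谱”
img = colorize_spectrogram(spec / spec.max())
```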

链接: https://arxiv.org/abs/2507.18334
作者: Ezhini Rasendiran R,Chandresh Kumar Maurya
机构: Indian Institute of Technology Indore (印度理工学院英迪拉普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages (Accepted to Interspeech 2025)

点击查看摘要

Abstract:We address the problem of classifying bird species using their song recordings, a challenging task due to environmental noise, overlapping vocalizations, and missing labels. Existing models struggle with low-SNR or multi-species recordings. We hypothesize that birds can be classified by visualizing their pitch pattern, speed, and repetition, collectively called motifs. Deep learning models applied to spectrogram images help, but similar motifs across species cause confusion. To mitigate this, we embed frequency information into spectrograms using primary color additives. This enhances species distinction and improves classification accuracy. Our experiments show that the proposed approach achieves statistically significant gains over models without colorization and surpasses the BirdCLEF 2024 winner, improving F1 by 7.3%, ROC-AUC by 6.2%, and CMAP by 6.6%. These results demonstrate the effectiveness of incorporating frequency information via colorization.
zh

[CV-43] Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction ICCV2025

【速读】:该论文旨在解决多视角室内3D目标检测中因固定体素感受野导致特征表示能力受限的问题,以及传统方法对真实场景几何信息(如点云或深度图)的强依赖性。解决方案的关键在于提出一种基于自适应3D体素构建的框架SGCDet:首先设计了一个几何与上下文感知的聚合模块,用于在每张图像的自适应区域内融合几何与语义信息,并动态调整不同视角的贡献权重;其次引入稀疏体素构建策略,通过识别高占据概率的体素进行特征精炼,从而减少自由空间中的冗余计算。这一系列设计使得体素特征构建更具适应性和效率,且网络仅需3D边界框监督即可训练,无需依赖真实场景几何标注。
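稀疏体素构建的核心是按占据概率筛选体素,下面用 top-k 选择给出示意(保留比例为假设值):

```python
import torch

def select_sparse_voxels(occ_prob, feat, keep_ratio=0.2):
    """稀疏体素构建(示意):按占据概率挑选 top-k 体素做后续特征精炼,
    跳过大量空闲空间。occ_prob: (B, N);feat: (B, N, C)。"""
    k = max(1, int(occ_prob.size(1) * keep_ratio))
    idx = occ_prob.topk(k, dim=1).indices                # (B, k)
    sel = torch.gather(feat, 1,
                       idx.unsqueeze(-1).expand(-1, -1, feat.size(-1)))
    return sel, idx    # 精炼后可按 idx 散射回完整体素栅格

occ = torch.rand(2, 4096)
feat = torch.randn(2, 4096, 64)
sel, idx = select_sparse_voxels(occ, feat)
print(sel.shape)       # torch.Size([2, 819, 64])
```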

链接: https://arxiv.org/abs/2507.18331
作者: Runmin Zhang,Zhu Yu,Si-Yuan Cao,Lingyu Zhu,Guangyi Zhang,Xiaokai Bai,Hui-Liang Shen
机构: Zhejiang University (浙江大学); Ningbo Global Innovation Center (宁波全球创新中心); NingboTech University (宁波工程学院); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025

点击查看摘要

Abstract:This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at this https URL.
zh

[CV-44] GVCCS: A Dataset for Contrail Identification and Tracking on Visible Whole Sky Camera Sequences

【速读】:该论文旨在解决当前航空非二氧化碳(non-CO2)气候影响评估中,尤其是航迹云(contrail)形成机制与生命周期动态建模的不足问题。现有物理模型虽能估算航迹云的气候效应,但其准确性受限于大气输入数据质量及对冰晶形成和湿度驱动持久性等复杂过程的假设。为提升模型精度,研究提出Ground Visible Camera Contrail Sequences (GVCCS) 数据集,该数据集基于地面全天空可见光相机采集的122个视频序列(共24,228帧),对每个航迹云进行逐帧标注并实现时间维度上的连续追踪,同时关联其来源航班信息,从而提供高时空分辨率的观测依据。解决方案的关键在于构建首个兼具时间追踪能力与飞行源标识的开放航迹云数据集,并配套开发统一的深度学习框架(基于全景分割模型),实现语义分割、实例分割与时序追踪一体化分析,为航迹云物理模型校准和气候影响精准评估奠定基础。

链接: https://arxiv.org/abs/2507.18330
作者: Gabriel Jarry,Ramon Dalmau,Philippe Very,Franck Ballerini,Stephania-Denisa Bocu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aviation’s climate impact includes not only CO2 emissions but also significant non-CO2 effects, especially from contrails. These ice clouds can alter Earth’s radiative balance, potentially rivaling the warming effect of aviation CO2. Physics-based models provide useful estimates of contrail formation and climate impact, but their accuracy depends heavily on the quality of atmospheric input data and on assumptions used to represent complex processes like ice particle formation and humidity-driven persistence. Observational data from remote sensors, such as satellites and ground cameras, could be used to validate and calibrate these models. However, existing datasets don’t explore all aspect of contrail dynamics and formation: they typically lack temporal tracking, and do not attribute contrails to their source flights. To address these limitations, we present the Ground Visible Camera Contrail Sequences (GVCCS), a new open data set of contrails recorded with a ground-based all-sky camera in the visible range. Each contrail is individually labeled and tracked over time, allowing a detailed analysis of its lifecycle. The dataset contains 122 video sequences (24,228 frames) and includes flight identifiers for contrails that form above the camera. As reference, we also propose a unified deep learning framework for contrail analysis using a panoptic segmentation model that performs semantic segmentation (contrail pixel identification), instance segmentation (individual contrail separation), and temporal tracking in a single architecture. By providing high-quality, temporally resolved annotations and a benchmark for model evaluation, our work supports improved contrail monitoring and will facilitate better calibration of physical models. This sets the groundwork for more accurate climate impact understanding and assessments.
zh

[CV-45] Beyond Low-rankness: Guaranteed Matrix Recovery via Modified Nuclear Norm

【速读】:该论文旨在解决矩阵恢复问题中如何同时有效捕捉局部信息与全局低秩结构的难题,传统方法如Robust PCA和矩阵补全(Matrix Completion, MC)通常仅依赖核范数(Nuclear Norm, NN)来建模全局低秩性,难以充分表达数据的局部特性。其解决方案的关键在于提出一种新的改进核范数(Modified Nuclear Norm, MNN)框架,该框架通过引入合适的矩阵变换,在变换后的空间上应用核范数,从而在不需调整权衡参数的情况下,统一建模局部特征与全局低秩结构。理论分析表明,在对变换函数施加温和假设的前提下,MNN能够为Robust PCA和MC任务提供精确的恢复保证,这是现有结合局部与全局信息的方法所不具备的优势。
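按文中定义,MNN 即“先变换、再取核范数”。下面的 NumPy 片段示意其计算,并附上核范数近端映射所用的奇异值阈值(SVT)算子;示例中的变换仅作演示,实际变换需满足文中的假设:

```python
import numpy as np

def mnn(X, transform):
    """修改核范数(示意):MNN(X) = || transform(X) ||_*,即变换后奇异值之和。"""
    return np.linalg.svd(transform(X), compute_uv=False).sum()

def svt(Y, tau):
    """奇异值阈值算子:核范数近端映射,是求解MNN模型的基本构件。"""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# 示例:用一个可逆的差分型线性变换刻画局部结构(仅作演示)
T = np.eye(64) - 0.5 * np.roll(np.eye(64), 1, axis=1)
X = np.random.randn(64, 64)
print(mnn(X, lambda M: T @ M), mnn(X, lambda M: M))   # 与普通核范数对比
```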

链接: https://arxiv.org/abs/2507.18327
作者: Jiangjun Peng,Yisi Luo,Xiangyong Cao,Shuang Xu,Deyu Meng
机构: Northwestern Polytechnical University (西北工业大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 14 figures

点击查看摘要

Abstract:The nuclear norm (NN) has been widely explored in matrix recovery problems, such as Robust PCA and matrix completion, leveraging the inherent global low-rank structure of the data. In this study, we introduce a new modified nuclear norm (MNN) framework, where the MNN family norms are defined by adopting suitable transformations and performing the NN on the transformed matrix. The MNN framework offers two main advantages: (1) it jointly captures both local information and global low-rankness without requiring trade-off parameter tuning; (2) Under mild assumptions on the transformation, we provided exact theoretical recovery guarantees for both Robust PCA and MC tasks-an achievement not shared by existing methods that combine local and global information. Thanks to its general and flexible design, MNN can accommodate various proven transformations, enabling a unified and effective approach to structured low-rank recovery. Extensive experiments demonstrate the effectiveness of our method. Code and supplementary material are available at this https URL.
zh

[CV-46] A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation

【速读】:该论文旨在解决心电图(Electrocardiogram, ECG)波形分割(delineation)中因公开标注数据集稀缺而导致深度学习方法进展受限的问题。其关键解决方案是首次系统性地构建了一个面向半监督语义分割(Semi-supervised semantic segmentation, SemiSeg)的基准平台,整合多个公共数据集并引入ECG特定的训练配置与增强策略,同时在卷积网络和Transformer两种架构上评估五种代表性半监督算法,并在域内(in-domain)和跨域(cross-domain)两种场景下进行标准化测试,结果表明Transformer架构在半监督ECG分割任务中优于卷积网络,为后续研究提供了可复现、可扩展的基准框架。
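以半监督分割中常见的伪标签一致性正则为例(未必与文中评测的五种算法一一对应;阈值与维度为笔者假设),无标注 ECG 片段上的损失可写作:

```python
import torch
import torch.nn.functional as F

def consistency_loss(student_logits, teacher_logits, conf_thr=0.9):
    """Mean-Teacher风格的一致性正则(示意):用教师网络的高置信伪标签
    约束学生网络。logits: (B, C, L),C 为波形类别数(如P/QRS/T/背景)。"""
    with torch.no_grad():
        prob = teacher_logits.softmax(dim=1)
        conf, pseudo = prob.max(dim=1)          # (B, L)
        mask = conf > conf_thr                   # 只保留高置信采样点
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")  # (B, L)
    return (loss * mask).sum() / mask.sum().clamp(min=1)

s = torch.randn(8, 4, 1000)    # 学生网络对强增强输入的预测(示意)
t = torch.randn(8, 4, 1000)    # 教师网络(EMA权重)对弱增强输入的预测
print(consistency_loss(s, t))
```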

链接: https://arxiv.org/abs/2507.18323
作者: Minje Park,Jeonghwa Lim,Taehyung Yu,Sunghoon Joo
机构: VUNO Inc.(VUNO公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi-supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in-domain and cross-domain. Additionally, we propose ECG-specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi-supervised ECG delineation. We anticipate that our benchmark will serve as a foundation for advancing semi-supervised ECG delineation methods and will facilitate further research in this domain.

[CV-47] Improving Large Vision-Language Models Understanding for Field Data

【Quick Read】: This paper targets the limited ability of Large Vision-Language Models (LVLMs) to interpret complex field data in scientific domains, especially the natural sciences. Although LVLMs excel at tasks such as image captioning and visual question answering, their understanding of scientific field data (e.g., flow fields and vortex patterns) remains weak. The key to the proposed FieldLVLM framework is twofold: a field-aware language generation strategy that uses a special-purpose machine learning pipeline to extract key physical features (flow classification, Reynolds number, vortex patterns) and convert them into structured textual descriptions, and a data-compressed multimodal model tuning scheme that compresses field inputs to retain only the most informative values, improving compatibility with the model's language decoder and guiding learning more effectively. Experiments on newly proposed benchmark datasets show that the method significantly outperforms existing approaches, offering a practical path for applying LVLMs to scientific discovery.

Link: https://arxiv.org/abs/2507.18311
Authors: Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale image and video datasets paired with text, enabling them to bridge visual perception and natural language processing. However, their application to scientific domains, especially in interpreting complex field data commonly used in the natural sciences, remains underexplored. In this work, we introduce FieldLVLM, a novel framework designed to improve large vision-language models’ understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning. The field-aware language generation strategy leverages a special-purpose machine learning pipeline to extract key physical features from field data, such as flow classification, Reynolds number, and vortex patterns. This information is then converted into structured textual descriptions that serve as a dataset. The data-compressed multimodal model tuning focuses on LVLMs with these generated datasets, using a data compression strategy to reduce the complexity of field inputs and retain only the most informative values. This ensures compatibility with the model’s language decoder and guides its learning more effectively. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data. Our findings suggest that this approach opens up new possibilities for applying large vision-language models to scientific research, helping bridge the gap between large models and domain-specific discovery.

[CV-48] LMM-Det: Make Large Multimodal Models Excel in Object Detection ICCV2025

【Quick Read】: This paper addresses the significant performance gap between Large Multimodal Models (LMMs) and specialist detectors on object detection. Existing approaches typically integrate heavy detection modules with LMMs, which is complex and inefficient. The proposed LMM-Det is a simple yet effective alternative that performs vanilla object detection with the LMM itself, without any specialized detection modules. Its key innovations are data distribution adjustment and inference optimization tailored for detection, which raise the recall rate, together with re-organized instruction conversations that strengthen the LMM's detection ability. Experiments show that LMM-Det effectively unlocks the inherent detection capability of LMMs, enabling end-to-end general object detection.

Link: https://arxiv.org/abs/2507.18300
Authors: Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
Affiliations: 360 AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025

Abstract:Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at this https URL.

[CV-49] Dissecting the Dental Lung Cancer Axis via Mendelian Randomization and Mediation Analysis

【Quick Read】: This paper investigates the unclear causal relationship between oral diseases (periodontitis and dental caries) and lung cancer, asking in particular whether dental caries causally affects lung cancer and its subtypes and whether pulmonary function mediates the effect. The key to the solution is two-sample Mendelian randomization (MR): genetic instruments are built from large-scale genome-wide association studies (GWAS), causal inference uses lung cancer data from the Transdisciplinary Research of Cancer in Lung (TRICL) consortium, and the mediating effects of lung function measures (forced vital capacity, FVC, and forced expiratory volume in one second, FEV1) are quantified with the delta method. The results show a significant positive causal effect of dental caries on squamous cell lung carcinoma, partially mediated by declines in lung function, while no causal association is found for periodontitis.

Link: https://arxiv.org/abs/2507.18287
Authors: Wenran Zhang, Huihuan Luo, Linda Wei, Ping Nie, Yiqun Wu, Dedong Yu
Affiliations: Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine; Fudan University; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Periodontitis and dental caries are common oral diseases affecting billions globally. While observational studies suggest links between these conditions and lung cancer, causality remains uncertain. This study used two-sample Mendelian randomization (MR) to explore causal relationships between dental traits (periodontitis, dental caries) and lung cancer subtypes, and to assess mediation by pulmonary function. Genetic instruments were derived from the largest available genome-wide association studies, including data from 487,823 dental caries and 506,594 periodontitis cases, as well as lung cancer data from the Transdisciplinary Research of Cancer in Lung consortium. Inverse variance weighting was the main analytical method; lung function mediation was assessed using the delta method. The results showed a significant positive causal effect of dental caries on overall lung cancer and its subtypes. Specifically, a one standard deviation increase in dental caries incidence was associated with a 188.0% higher risk of squamous cell lung carcinoma (OR = 2.880, 95% CI = 1.236–6.713, p = 0.014), partially mediated by declines in forced vital capacity (FVC) and forced expiratory volume in one second (FEV1), accounting for 5.124% and 5.890% of the total effect. No causal effect was found for periodontitis. These findings highlight a causal role of dental caries in lung cancer risk and support integrating dental care and pulmonary function monitoring into cancer prevention strategies.
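For readers unfamiliar with the inverse-variance weighted (IVW) estimator used here, below is a small NumPy sketch of the standard fixed-effect IVW formula; the per-SNP effect sizes are fabricated for illustration only:

```python
# Standard fixed-effect IVW Mendelian randomization estimate. beta_exp/beta_out are
# per-SNP effects on exposure (e.g., dental caries) and outcome (lung cancer);
# se_out are outcome standard errors. All numbers below are made up.
import numpy as np

def ivw_estimate(beta_exp, beta_out, se_out):
    """IVW causal effect: inverse-variance weighted regression through the origin."""
    w = 1.0 / np.asarray(se_out) ** 2                 # inverse-variance weights
    bx, by = np.asarray(beta_exp), np.asarray(beta_out)
    beta = np.sum(w * bx * by) / np.sum(w * bx**2)    # weighted slope
    se = np.sqrt(1.0 / np.sum(w * bx**2))             # fixed-effect standard error
    return beta, se

beta, se = ivw_estimate([0.10, 0.08, 0.12], [0.11, 0.07, 0.15], [0.03, 0.04, 0.05])
print(f"OR per SD = {np.exp(beta):.3f} (log-odds {beta:.3f} +/- {se:.3f})")
```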

[CV-50] Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding ICCV2025

【Quick Read】: This paper tackles two major challenges robots face when manipulating articulated objects whose internal structures are not observable: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder a unified adaptive manipulation strategy. The key to the proposed AdaRPG framework is to use foundation models to extract object parts, which exhibit greater local geometric similarity than whole objects, thereby improving visual affordance generalization for functional primitive skills; a part-level affordance annotation dataset is constructed to train the affordance model. AdaRPG further exploits the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control code that invokes primitive skill functions based on part affordance inference, achieving strong generalization across novel articulated object categories.

Link: https://arxiv.org/abs/2507.18276
Authors: Xiaojie Zhang, Yuanfei Wang, Ruihai Wu, Kunqi Xu, Yu Li, Liuyu Xiang, Hao Dong, Zhaofeng He
Affiliations: Beijing University of Posts and Telecommunications; School of Computer Science, Peking University; School of EECS, Peking University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG’s strong generalization ability across novel articulated object categories.

[CV-51] ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation

【Quick Read】: This paper addresses three key limitations of semantics-driven 3D spatial constraints in current robotic manipulation: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, and (3) compromised robustness in semantically diverse environments. The key to the proposed ReSem3D framework is the synergy between Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs), which enables fine-grained visual grounding and dynamically constructs hierarchical 3D spatial constraints. Driven by hierarchical recursive reasoning in MLLMs, the framework automatically builds constraints from natural language instructions and RGB-D observations in two stages (part-level extraction and region-level refinement) and encodes them as real-time optimization objectives in joint space, enabling reactive behavior under dynamic disturbances and markedly improving zero-shot adaptability and generalization in semantically rich scenes.

Link: https://arxiv.org/abs/2507.18262
Authors: Chenyu Su, Weiwei Shang, Chen Qian, Fei Zhang, Shuang Cong
Affiliations: University of Science and Technology of China
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 12 pages, 9 figures

Abstract:Semantics-driven 3D spatial constraints align highlevel semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically constructs hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos at this https URL.

[CV-52] Exploiting Gaussian Agnostic Representation Learning with Diffusion Priors for Enhanced Infrared Small Target Detection

【Quick Read】: This paper addresses the performance degradation of infrared small target detection (ISTD) models in real-world settings caused by the scarcity of high-quality annotated data. Mainstream methods rely on large-scale manual labeling for representation learning, which makes them fragile in practice. The key to the solution is Gaussian agnostic representation learning: a Gaussian Group Squeezer performs non-uniform quantization via Gaussian sampling and compression and diversifies training samples to improve robustness, while two-stage diffusion models reconstruct real-world distributions so that the synthesized samples align closely with actual data, markedly improving the quality and fidelity of the generated samples.

Link: https://arxiv.org/abs/2507.18260
Authors: Junyao Li, Yahao Lu, Xingyuan Guo, Xiaoyu Xian, Tiantian Wang, Yukai Shi
Affiliations: Guangdong University of Technology; CRRC Institution; Guangzhou National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to Neural Networks. We propose the Gaussian Group Squeezer, leveraging Gaussian sampling and compression with diffusion models for channel-based data augmentation

Abstract:Infrared small target detection (ISTD) plays a vital role in numerous practical applications. In pursuit of determining the performance boundaries, researchers employ large and expensive manual-labeling data for representation learning. Nevertheless, this approach renders the state-of-the-art ISTD methods highly fragile in real-world challenges. In this paper, we first study the variation in detection performance across several mainstream methods under various scarcity – namely, the absence of high-quality infrared data – that challenge the prevailing theories about practical ISTD. To address this concern, we introduce the Gaussian Agnostic Representation Learning. Specifically, we propose the Gaussian Group Squeezer, leveraging Gaussian sampling and compression for non-uniform quantization. By exploiting a diverse array of training samples, we enhance the resilience of ISTD models against various challenges. Then, we introduce two-stage diffusion models for real-world reconstruction. By aligning quantized signals closely with real-world distributions, we significantly elevate the quality and fidelity of the synthetic samples. Comparative evaluations against state-of-the-art detection methods in various scarcity scenarios demonstrate the efficacy of the proposed approach.

[CV-53] LONG3R: Long Sequence Streaming 3D Reconstruction ICCV2025

【Quick Read】: This paper addresses two problems of existing methods for long-sequence streaming multi-view 3D reconstruction: reliance on time-consuming offline optimization, and inability to handle longer sequences, which limits real-time applicability. The key innovations of the proposed LONG3R (LOng sequence streaming 3D Reconstruction) model are: (1) a recurrent design with a memory gating mechanism that dynamically filters relevant memory, combined with a dual-source refined decoder for coarse-to-fine interaction; (2) a 3D spatio-temporal memory that maintains long-term memory while dynamically pruning redundant spatial information and adaptively adjusting scene resolution; and (3) a two-stage curriculum training strategy that improves long-sequence performance while preserving training efficiency. Experiments show that LONG3R significantly outperforms state-of-the-art streaming methods on long sequences while retaining real-time inference speed.

Link: https://arxiv.org/abs/2507.18255
Authors: Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, Hang Zhao
Affiliations: Shanghai Artificial Intelligence Laboratory; IIIS, Tsinghua University; Shanghai Qi Zhi Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025. Project page: this https URL

Abstract:Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model’s performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: this https URL.
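As a loose illustration of the memory-gating step, the sketch below scores memory tokens against the current observation and keeps only the most relevant ones; the dot-product relevance score is an assumption, since LONG3R's actual gate is learned:

```python
# Hypothetical memory gating: filter memory tokens by relevance to the new frame.
import numpy as np

def gate_memory(memory: np.ndarray, obs: np.ndarray, keep: float = 0.5) -> np.ndarray:
    """memory: (M, d) tokens, obs: (d,) current-frame feature -> filtered memory."""
    scores = memory @ obs / np.sqrt(obs.size)   # relevance of each memory token
    k = max(1, int(keep * len(memory)))
    idx = np.argsort(scores)[-k:]               # keep the top-scoring tokens
    return memory[idx]

mem = np.random.randn(64, 128)                  # 64 memory tokens of width 128
obs = np.random.randn(128)
print(gate_memory(mem, obs).shape)              # (32, 128)
```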

[CV-54] Evaluation of facial landmark localization performance in a surgical setting

【Quick Read】: This paper addresses the reduced localization accuracy of computer-vision facial landmark detection under surgical lighting, where variable illumination and limited flexibility of detection positions make it hard to identify and precisely localize patients. The key to the solution is to test the MediaPipe algorithm in a controlled setting using a robotic arm that automatically adjusts the viewpoint while the surgical light and a phantom remain fixed; the experiments show that the improved landmark detection under surgical lighting significantly enhances performance at larger yaw and pitch angles.

Link: https://arxiv.org/abs/2507.18248
Authors: Ines Frajtag, Marko Švaco, Filip Šuligoj
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The use of robotics, computer vision, and their applications is becoming increasingly widespread in various fields, including medicine. Many face detection algorithms have found applications in neurosurgery, ophthalmology, and plastic surgery. A common challenge in using these algorithms is variable lighting conditions and the flexibility of detection positions to identify and precisely localize patients. The proposed experiment tests the MediaPipe algorithm for detecting facial landmarks in a controlled setting, using a robotic arm that automatically adjusts positions while the surgical light and the phantom remain in a fixed position. The results of this study demonstrate that the improved accuracy of facial landmark detection under surgical lighting significantly enhances the detection performance at larger yaw and pitch angles. The increase in standard deviation/dispersion occurs due to imprecise detection of selected facial landmarks. This analysis allows for a discussion on the potential integration of the MediaPipe algorithm into medical procedures.

[CV-55] DepthDark: Robust Monocular Depth Estimation for Low-Light Environments ACM-MM2025

【Quick Read】: This paper addresses the significant performance drop of existing foundation models for monocular depth estimation in low-light conditions, whose core challenges are the lack of large-scale, high-quality paired low-light depth datasets and of an effective parameter-efficient fine-tuning (PEFT) strategy. The key to the solution is twofold: an image synthesis pipeline with a flare-simulation module and a noise-simulation module that faithfully models the nighttime imaging process to produce high-quality paired low-light data, and a low-light PEFT strategy based on illumination guidance and multiscale feature fusion that substantially improves generalization and depth accuracy in low-light environments.

Link: https://arxiv.org/abs/2507.18243
Authors: Longjian Zeng, Zunjie Zhu, Rongfeng Lu, Ming Lu, Bolun Zheng, Chenggang Yan, Anke Xue
Affiliations: Hangzhou Dianzi University; Intel Labs China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM MM 2025 conference

Abstract:In recent years, foundation models for monocular depth estimation have received increasing attention. Current methods mainly address typical daylight conditions, but their effectiveness notably decreases in low-light environments. There is a lack of robust foundational models for monocular depth estimation specifically designed for low-light scenarios. This largely stems from the absence of large-scale, high-quality paired depth datasets for low-light conditions and the effective parameter-efficient fine-tuning (PEFT) strategy. To address these challenges, we propose DepthDark, a robust foundation model for low-light monocular depth estimation. We first introduce a flare-simulation module and a noise-simulation module to accurately simulate the imaging process under nighttime conditions, producing high-quality paired depth datasets for low-light conditions. Additionally, we present an effective low-light PEFT strategy that utilizes illumination guidance and multiscale feature fusion to enhance the model’s capability in low-light environments. Our method achieves state-of-the-art depth estimation performance on the challenging nuScenes-Night and RobotCar-Night datasets, validating its effectiveness using limited training data and computing resources.
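A toy sketch of the kind of flare- and noise-simulation pipeline described, with made-up parameters; DepthDark's actual modules are considerably more sophisticated:

```python
# Hypothetical low-light pair synthesis: darken a daytime image, add a Gaussian
# flare blob, then apply shot and read noise. All parameters are illustrative.
import numpy as np

def simulate_low_light(img: np.ndarray, gain: float = 0.15,
                       flare_xy=(40, 40), flare_sigma: float = 8.0,
                       read_noise: float = 0.01) -> np.ndarray:
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    flare = np.exp(-(((xx - flare_xy[0]) ** 2 + (yy - flare_xy[1]) ** 2)
                     / (2 * flare_sigma ** 2)))        # synthetic flare spot
    dark = np.clip(gain * img + 0.6 * flare, 0, 1)      # darken + flare
    shot = np.random.poisson(dark * 255.0) / 255.0      # photon (shot) noise
    return np.clip(shot + np.random.normal(0, read_noise, img.shape), 0, 1)

day = np.random.rand(96, 96)                            # stand-in daytime image
night = simulate_low_light(day)
print(night.min(), night.max())
```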

[CV-56] DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception ICCV2025

【Quick Read】: This paper addresses the degradation of feature quality in feature-level fusion for collaborative perception (CP), caused by domain gaps and temporal misalignment, which in turn harms fusion performance. The key to the proposed Domain-And-Time Alignment (DATA) network lies in three modules: (1) a Consistency-preserving Domain Alignment Module (CDAM) that reduces hardware-induced domain gaps via proximal-region hierarchical downsampling and an observability-constrained discriminator; (2) a Progressive Temporal Alignment Module (PTAM) that mitigates delay-induced temporal misalignment through multi-scale motion modeling and two-stage compensation; and (3) an Instance-focused Feature Aggregation Module (IFAM) that strengthens the semantic representation of the aligned features. The method achieves state-of-the-art performance on three typical datasets and remains robust under severe communication delays and pose errors.

Link: https://arxiv.org/abs/2507.18237
Authors: Chengchang Tian, Jianwei Ma, Yan Huang, Zhanye Chen, Honghao Wei, Hui Zhang, Wei Hong
Affiliations: Southeast University; Washington State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, accepted as poster. 22 pages including supplementary materials

Abstract:Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors. The code will be released at this https URL.

[CV-57] PS-GS: Gaussian Splatting for Multi-View Photometric Stereo

【Quick Read】: This paper addresses the inefficiency and instability of joint geometry-material-lighting estimation when multi-view photometric stereo (MVPS) is combined with inverse rendering. Inverse rendering under fixed environment illumination struggles to deliver highly accurate 3D reconstruction, while naively combining MVPS with inverse rendering is ill-posed and prone to unstable optimization. The key to the proposed Gaussian Splatting for Multi-view Photometric Stereo (PS-GS) is to first reconstruct a standard 2D Gaussian splatting model as the initial geometry and then perform deferred inverse rendering with the full rendering equation using a lighting-computing multi-layer perceptron; rendered normal maps are regularized by normals estimated from uncalibrated photometric stereo, and 2D Gaussian ray tracing for single directional lights refines the incident lighting. These strategies effectively mitigate the ill-posedness and significantly improve reconstruction accuracy and computational efficiency on both real and synthetic datasets.

Link: https://arxiv.org/abs/2507.18231
Authors: Yixiao Chen, Bin Liang, Hanzhi Guo, Yongqing Cheng, Jiayi Zhao, Dongdong Weng
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Integrating inverse rendering with multi-view photometric stereo (MVPS) yields more accurate 3D reconstructions than the inverse rendering approaches that rely on fixed environment illumination. However, efficient inverse rendering with MVPS remains challenging. To fill this gap, we introduce the Gaussian Splatting for Multi-view Photometric Stereo (PS-GS), which efficiently and jointly estimates the geometry, materials, and lighting of the object that is illuminated by diverse directional lights (multi-light). Our method first reconstructs a standard 2D Gaussian splatting model as the initial geometry. Based on the initialization model, it then proceeds with the deferred inverse rendering by the full rendering equation containing a lighting-computing multi-layer perceptron. During the whole optimization, we regularize the rendered normal maps by the uncalibrated photometric stereo estimated normals. We also propose the 2D Gaussian ray-tracing for single directional light to refine the incident lighting. The regularizations and the use of multi-view and multi-light images mitigate the ill-posed problem of inverse rendering. After optimization, the reconstructed object can be used for novel-view synthesis, relighting, and material and shape editing. Experiments on both synthetic and real datasets demonstrate that our method outperforms prior works in terms of reconstruction accuracy and computational efficiency.

[CV-58] 3D Test-time Adaptation via Graph Spectral Driven Point Shift

【Quick Read】: This paper addresses performance degradation of 3D point cloud classification under domain shift, in particular the inefficiency of existing test-time adaptation (TTA) methods on irregular, unordered point clouds, which rely on expensive spatial-domain optimization or extra training data. The key to the proposed Graph Spectral Domain Test-Time Adaptation (GSDTTA) is to represent target-domain point clouds as outlier-aware graphs, map them to the graph spectral domain via the Graph Fourier Transform (GFT), and adapt by optimizing only the lowest 10% of frequency components, which capture most of the point cloud's energy. Combined with an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters, this yields efficient adaptation and markedly better cross-domain generalization with few parameters.

Link: https://arxiv.org/abs/2507.18225
Authors: Xin Wei, Qin Yang, Yijie Fang, Mingrui Zhu, Nannan Wang
Affiliations: Xidian University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference, their application to 3D point clouds is hindered by their irregular and unordered structure. Current 3D TTA methods often rely on computationally expensive spatial-domain optimizations and may require additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in target domain are represented as outlier-aware graphs and transformed into graph spectral domain by Graph Fourier Transform (GFT). For efficiency, adaptation is performed by optimizing only the lowest 10% of frequency components, which capture the majority of the point cloud’s energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. This process is enhanced by an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, outperforming existing TTA methods for 3D point cloud classification.
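The graph-spectral mechanics can be sketched as follows; the plain k-NN graph and combinatorial Laplacian below are simplifying assumptions, not the paper's exact outlier-aware recipe:

```python
# Sketch: build a k-NN graph over points, use the Laplacian eigenbasis as the GFT,
# and expose only the lowest 10% of frequency coefficients for adaptation.
import numpy as np

def knn_graph_laplacian(pts: np.ndarray, k: int = 8) -> np.ndarray:
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    W = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    for i, js in enumerate(nn):
        W[i, js] = W[js, i] = 1.0        # symmetric unweighted k-NN adjacency
    return np.diag(W.sum(1)) - W         # combinatorial Laplacian L = D - W

pts = np.random.rand(200, 3)
L = knn_graph_laplacian(pts)
evals, U = np.linalg.eigh(L)             # columns of U form the GFT basis
coeffs = U.T @ pts                       # GFT of the point coordinates
m = int(0.1 * len(pts))                  # lowest 10% of frequencies
low = coeffs[:m]                         # <- the only part a TTA step would update
pts_rec = U[:, :m] @ low                 # inverse GFT from low frequencies only
print(np.linalg.norm(pts - pts_rec))     # smooth-approximation error
```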

[CV-59] LEAF: Latent Diffusion with Efficient Encoder Distillation for Aligned Features in Medical Image Segmentation MICCAI2025

【Quick Read】: This paper addresses two key problems of existing diffusion-based medical image segmentation methods: the original training pipeline is transferred directly without adjustments for segmentation, limiting performance, and commonly used pre-trained diffusion models are deficient in feature extraction, hurting accuracy. The key to the proposed LEAF model is twofold: during fine-tuning, the original noise-prediction pattern is replaced with direct prediction of the segmentation map, reducing the variance of segmentation results; and a feature distillation method aligns the hidden states of the convolutional layers with features from a transformer-based vision encoder, improving feature representation. The method changes neither the architecture nor the parameter count or computation at inference, making it both efficient and effective across multiple segmentation datasets.

Link: https://arxiv.org/abs/2507.18214
Authors: Qilin Huang, Tianyu Lin, Zhiguang Chen, Fudan Zheng
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025

Abstract:Leveraging the powerful capabilities of diffusion models has yielded quite effective results in medical image segmentation tasks. However, existing methods typically transfer the original training process directly without specific adjustments for segmentation tasks. Furthermore, the commonly used pre-trained diffusion models still have deficiencies in feature extraction. Based on these considerations, we propose LEAF, a medical image segmentation model grounded in latent diffusion models. During the fine-tuning process, we replace the original noise prediction pattern with a direct prediction of the segmentation map, thereby reducing the variance of segmentation results. We also employ a feature distillation method to align the hidden states of the convolutional layers with the features from a transformer-based vision encoder. Experimental results demonstrate that our method enhances the performance of the original diffusion model across multiple segmentation datasets for different disease types. Notably, our approach does not alter the model architecture, nor does it increase the number of parameters or computation during the inference phase, making it highly efficient.

[CV-60] TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance ICCV2025

【Quick Read】: This paper addresses the prohibitive inference cost of classifier-free guidance (CFG) in current text-to-image models: CFG requires two forward passes, and the cost grows further when combined with sophisticated sampling algorithms. The key to the proposed TeEFusion (Text Embeddings Fusion) is an efficient distillation method that incorporates the guidance magnitude directly into the text embeddings, fusing conditional and unconditional embeddings with simple linear operations to reconstruct the desired guidance without extra parameters, while letting the student model learn the outputs the teacher produces with its complex sampling strategy. The student then approaches the teacher's quality with a far simpler sampling strategy, achieving inference up to 6x faster with comparable image quality.

Link: https://arxiv.org/abs/2507.18192
Authors: Minghao Fu, Guo-Hua Wang, Xiaohao Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Affiliations: Nanjing University; Alibaba International Digital Commerce Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025. The code is publicly available at this https URL

Abstract:Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG’s reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (Text Embeddings Fusion), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model’s complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher’s output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher’s performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6x faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher’s complex sampling approach. The code is publicly available at this https URL.
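The linear fusion of embeddings is simple enough to sketch directly; the form below assumes the standard CFG combination carried over to the embedding space, which matches the abstract's description but not necessarily the paper's exact operator:

```python
# Folding the CFG guidance scale w into a single text embedding, so the student
# model needs only one forward pass instead of two.
import numpy as np

def cfg_two_pass(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance on model outputs (two passes)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def teefusion_embedding(e_cond, e_uncond, w):
    """Fuse guidance into the *text embeddings* instead (one student pass)."""
    return e_uncond + w * (e_cond - e_uncond)

e_cond, e_uncond = np.random.randn(77, 768), np.random.randn(77, 768)
e_fused = teefusion_embedding(e_cond, e_uncond, w=5.0)
print(e_fused.shape)  # (77, 768): same shape, no extra parameters introduced
```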

[CV-61] MatSSL: Robust Self-Supervised Representation Learning for Metallographic Image Segmentation

【Quick Read】: This paper addresses the drawbacks of supervised methods in metallographic micrograph analysis: heavy reliance on labeled data, weak generalization, and inconsistent transfer across datasets. Existing self-supervised learning (SSL) methods can exploit unlabeled data but usually need large-scale datasets to be effective, making them ill-suited to small-sample settings. The key to the proposed MatSSL architecture is Gated Feature Fusion at each stage of the backbone to integrate multi-level representations effectively, combined with a two-stage pipeline: self-supervised pretraining on a small, unlabeled set of metallographic images followed by fine-tuning on multiple benchmarks. This enables effective adaptation to the metallographic domain with only a small amount of unlabeled data while preserving the rich, transferable features learned from large-scale pretraining on natural images.

Link: https://arxiv.org/abs/2507.18184
Authors: Hoang Hai Nam Nguyen, Phan Nguyen Duc Hieu, Ho Won Lee
Affiliations: Korea Institute of Materials Science; University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:MatSSL is a streamlined self-supervised learning (SSL) architecture that employs Gated Feature Fusion at each stage of the backbone to integrate multi-level representations effectively. Current micrograph analysis of metallic materials relies on supervised methods, which require retraining for each new dataset and often perform inconsistently with only a few labeled samples. While SSL offers a promising alternative by leveraging unlabeled data, most existing methods still depend on large-scale datasets to be effective. MatSSL is designed to overcome this limitation. We first perform self-supervised pretraining on a small-scale, unlabeled dataset and then fine-tune the model on multiple benchmark datasets. The resulting segmentation models achieve 69.13% mIoU on MetalDAM, outperforming the 66.73% achieved by an ImageNet-pretrained encoder, and deliver consistent improvements of up to nearly 40% in average mIoU on the Environmental Barrier Coating benchmark dataset (EBC) compared to models pretrained with MicroNet. This suggests that MatSSL enables effective adaptation to the metallographic domain using only a small amount of unlabeled data, while preserving the rich and transferable features learned from large-scale pretraining on natural images.

[CV-62] ChronoSelect: Robust Learning with Noisy Labels via Dynamics Temporal Memory

【Quick Read】: This paper addresses the degradation of generalization caused by noisy labels when training deep neural networks on real-world datasets. Existing learning-with-noisy-labels (LNL) methods rely on static snapshot evaluations and fail to exploit the temporal dynamics of the learning process. The key innovation of the proposed ChronoSelect framework is a four-stage memory architecture that compresses prediction history into compact temporal distributions; a sliding update mechanism with controlled decay maintains only four dynamic memory units per sample, progressively emphasizing recent patterns while retaining essential historical knowledge. Through temporal trajectory analysis and dual-branch consistency, samples are precisely partitioned into clean, boundary, and noisy subsets. Theoretical guarantees establish convergence and stability under noisy conditions, and experiments demonstrate state-of-the-art performance on synthetic and real-world benchmarks.

Link: https://arxiv.org/abs/2507.18183
Authors: Jianchao Wang, Qingfeng Li, Pengcheng Zheng, Xiaorong Pu, Yazhou Ren
Affiliations: Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; School of Computer Science and Engineering, University of Electronic Science and Technology of China
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Training deep neural networks on real-world datasets is often hampered by the presence of noisy labels, which can be memorized by over-parameterized models, leading to significant degradation in generalization performance. While existing methods for learning with noisy labels (LNL) have made considerable progress, they fundamentally suffer from static snapshot evaluations and fail to leverage the rich temporal dynamics of learning evolution. In this paper, we propose ChronoSelect (chrono denoting its temporal nature), a novel framework featuring an innovative four-stage memory architecture that compresses prediction history into compact temporal distributions. Our unique sliding update mechanism with controlled decay maintains only four dynamic memory units per sample, progressively emphasizing recent patterns while retaining essential historical knowledge. This enables precise three-way sample partitioning into clean, boundary, and noisy subsets through temporal trajectory analysis and dual-branch consistency. Theoretical guarantees prove the mechanism’s convergence and stability under noisy conditions. Extensive experiments demonstrate ChronoSelect’s state-of-the-art performance across synthetic and real-world benchmarks.
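A speculative sketch of a four-unit sliding temporal memory with decay follows; the update rule and decay constant are guesses at the described mechanism, not the paper's definition:

```python
# Hypothetical per-sample temporal memory: four slots compress the prediction
# history; a sliding update with controlled decay emphasizes recent epochs.
import numpy as np

class TemporalMemory:
    def __init__(self, n_classes: int, n_units: int = 4, decay: float = 0.7):
        self.mem = np.full((n_units, n_classes), 1.0 / n_classes)  # uniform init
        self.decay = decay

    def update(self, prob: np.ndarray) -> None:
        """Blend older stages toward newer ones, then absorb the newest prediction."""
        self.mem[:-1] = self.decay * self.mem[:-1] + (1 - self.decay) * self.mem[1:]
        self.mem[-1] = (1 - self.decay) * self.mem[-1] + self.decay * prob

    def trajectory_variance(self) -> float:
        """Disagreement across stages; high values suggest boundary/noisy samples."""
        return float(self.mem.var(axis=0).sum())

m = TemporalMemory(n_classes=10)
for _ in range(20):                        # one softmax vector per training epoch
    m.update(np.random.dirichlet(np.ones(10)))
print(m.trajectory_variance())
```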

[CV-63] Differential-UMamba: Rethinking Tumor Segmentation Under Limited Data Scenarios

【Quick Read】: This paper addresses the tendency of deep learning models to overfit noise and irrelevant patterns in data-scarce settings, which limits generalization to unseen samples, particularly in medical image segmentation. The key innovation of the proposed Diff-UMamba architecture is a Noise Reduction Module (NRM) that uses a signal-differencing strategy to suppress redundant or irrelevant activations in the encoder, encouraging the model to focus on clinically relevant feature representations; combined with a UNet framework and the Mamba mechanism for modeling long-range dependencies, it significantly improves segmentation accuracy and robustness under low-data conditions.

Link: https://arxiv.org/abs/2507.18177
Authors: Dhruv Jain, Romain Modzelewski, Romain Hérault, Clement Chatelain, Eva Torfeh, Sebastien Thureau
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In data-scarce scenarios, deep learning models often overfit to noise and irrelevant patterns, which limits their ability to generalize to unseen samples. To address these challenges in medical image segmentation, we introduce Diff-UMamba, a novel architecture that combines the UNet framework with the mamba mechanism for modeling long-range dependencies. At the heart of Diff-UMamba is a Noise Reduction Module (NRM), which employs a signal differencing strategy to suppress noisy or irrelevant activations within the encoder. This encourages the model to filter out spurious features and enhance task-relevant representations, thereby improving its focus on clinically meaningful regions. As a result, the architecture achieves improved segmentation accuracy and robustness, particularly in low-data settings. Diff-UMamba is evaluated on multiple public datasets, including MSD (lung and pancreas) and AIIB23, demonstrating consistent performance gains of 1-3% over baseline methods across diverse segmentation tasks. To further assess performance under limited-data conditions, additional experiments are conducted on the BraTS-21 dataset by varying the proportion of available training samples. The approach is also validated on a small internal non-small cell lung cancer (NSCLC) dataset for gross tumor volume (GTV) segmentation in cone beam CT (CBCT), where it achieves a 4-5% improvement over the baseline.

[CV-64] Unsupervised Domain Adaptation for 3D LiDAR Semantic Segmentation Using Contrastive Learning and Multi-Model Pseudo Labeling

【Quick Read】: This paper addresses performance degradation in 3D LiDAR semantic segmentation caused by domain shifts (e.g., sensor type or geographic region), where conventional methods transfer poorly when the target domain lacks annotations. The key to the proposed two-stage unsupervised domain adaptation (UDA) framework is: first, segment-level unsupervised contrastive learning pre-trains a backbone to learn robust, domain-invariant features without labels; then a multi-model pseudo-labeling strategy aggregates, via hard voting, the predictions of an ensemble of diverse state-of-the-art architectures (projection, voxel, hybrid, and cylinder-based) to generate high-quality pseudo-labels that mitigate single-model bias, and these are used to fine-tune the pre-trained network. Experiments adapting from SemanticKITTI to unlabeled target domains (SemanticPOSS, SemanticSlamantic) show significant accuracy gains over direct transfer and single-model UDA, validating the combination of contrastive pre-training with refined ensemble pseudo-labeling.

Link: https://arxiv.org/abs/2507.18176
Authors: Abhishek Kaushik, Norbert Haala, Uwe Soergel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Addressing performance degradation in 3D LiDAR semantic segmentation due to domain shifts (e.g., sensor type, geographical location) is crucial for autonomous systems, yet manual annotation of target data is prohibitive. This study addresses the challenge using Unsupervised Domain Adaptation (UDA) and introduces a novel two-stage framework to tackle it. Initially, unsupervised contrastive learning at the segment level is used to pre-train a backbone network, enabling it to learn robust, domain-invariant features without labels. Subsequently, a multi-model pseudo-labeling strategy is introduced, utilizing an ensemble of diverse state-of-the-art architectures (including projection, voxel, hybrid, and cylinder-based methods). Predictions from these models are aggregated via hard voting to generate high-quality, refined pseudo-labels for the unlabeled target domain, mitigating single-model biases. The contrastively pre-trained network is then fine-tuned using these robust pseudo-labels. Experiments adapting from SemanticKITTI to unlabeled target datasets (SemanticPOSS, SemanticSlamantic) demonstrate significant improvements in segmentation accuracy compared to direct transfer and single-model UDA approaches. These results highlight the effectiveness of combining contrastive pre-training with refined ensemble pseudo-labeling for bridging complex domain gaps without requiring target domain annotations.
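The hard-voting aggregation step can be sketched directly; the agreement threshold below is an assumed consensus filter rather than a detail taken from the paper:

```python
# Per-point hard voting over an ensemble of segmentation models; points without
# sufficient agreement are left unlabeled (ignored during fine-tuning).
import numpy as np

def hard_vote(preds: np.ndarray, min_agree: int = 3, ignore: int = -1) -> np.ndarray:
    """preds: (n_models, n_points) int labels -> (n_points,) pseudo-labels."""
    n_models, n_points = preds.shape
    out = np.full(n_points, ignore)
    for p in range(n_points):
        votes = np.bincount(preds[:, p])
        if votes.max() >= min_agree:      # consensus filter vs single-model bias
            out[p] = votes.argmax()
    return out

preds = np.random.randint(0, 5, size=(4, 10))  # 4 models (projection/voxel/...), 10 points
print(hard_vote(preds))                        # -1 marks points without consensus
```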

[CV-65] Real-Time Object Detection and Classification using YOLO for Edge FPGAs

【Quick Read】: This paper addresses the insufficient resource efficiency of YOLO-based object detection and classification systems on edge FPGA platforms, which struggle to deliver efficient real-time processing under tight power and compute budgets. The key to the solution is optimizing a YOLOv5 model and tailoring its deployment to the Xilinx Kria KV260 FPGA, achieving 99% classification accuracy at 3.5 W power consumption and 9 frames per second, i.e., resource-efficient real-time detection and classification for edge computing applications.

Link: https://arxiv.org/abs/2507.18174
Authors: Rashed Al Amin, Roman Obermaisser
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Comments: This paper has been accepted for the 67th International Symposium on ELMAR 2025

Abstract:Object detection and classification are crucial tasks across various application domains, particularly in the development of safe and reliable Advanced Driver Assistance Systems (ADAS). Existing deep learning-based methods such as Convolutional Neural Networks (CNNs), Single Shot Detectors (SSDs), and You Only Look Once (YOLO) have demonstrated high performance in terms of accuracy and computational speed when deployed on Field-Programmable Gate Arrays (FPGAs). However, despite these advances, state-of-the-art YOLO-based object detection and classification systems continue to face challenges in achieving resource efficiency suitable for edge FPGA platforms. To address this limitation, this paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 optimized for FPGA deployment. The proposed system is trained on the COCO and GTSRD datasets and implemented on the Xilinx Kria KV260 FPGA board. Experimental results demonstrate a classification accuracy of 99%, with a power consumption of 3.5W and a processing speed of 9 frames per second (FPS). These findings highlight the effectiveness of the proposed approach in enabling real-time, resource-efficient object detection and classification for edge computing applications.

[CV-66] WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection

【Quick Read】: This paper addresses low fusion efficiency, feature loss, and under-exploited cross-modal complementarity in RGB-infrared (IR) object detection. The core innovation of the proposed WaveMamba is the WaveMamba Fusion Block (WMFB): the Discrete Wavelet Transform (DWT) decomposes RGB and IR inputs into low- and high-frequency sub-bands, which are fused with band-specific strategies. Low-frequency features pass through a Mamba-based Low-frequency Mamba Fusion Block (LMFB) that combines channel swapping with a gated attention mechanism for deep fusion, while high-frequency features use an "absolute maximum" fusion strategy to strengthen details. An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) reduces information loss, yielding an average mAP improvement of 4.5% across four benchmarks.

Link: https://arxiv.org/abs/2507.18173
Authors: Haodong Zhu, Wenhao Dong, Linlin Yang, Hong Li, Yuguang Yang, Yangyang Ren, Qingcheng Zhu, Zichao Feng, Changbai Li, Shaohui Lin, Runqi Wang, Xiaoyan Luo, Baochang Zhang
Affiliations: Beihang University; Communication University of China; Beijing Jiaotong University; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Leveraging the complementary characteristics of visible (RGB) and infrared (IR) imagery offers significant potential for improving object detection. In this paper, we propose WaveMamba, a cross-modality fusion method that efficiently integrates the unique and complementary frequency features of RGB and IR decomposed by Discrete Wavelet Transform (DWT). An improved detection head incorporating the Inverse Discrete Wavelet Transform (IDWT) is also proposed to reduce information loss and produce the final detection results. The core of our approach is the introduction of WaveMamba Fusion Block (WMFB), which facilitates comprehensive fusion across low-/high-frequency sub-bands. Within WMFB, the Low-frequency Mamba Fusion Block (LMFB), built upon the Mamba framework, first performs initial low-frequency feature fusion with channel swapping, followed by deep fusion with an advanced gated attention mechanism for enhanced integration. High-frequency features are enhanced using a strategy that applies an ``absolute maximum" fusion approach. These advancements lead to significant performance gains, with our method surpassing state-of-the-art approaches and achieving average mAP improvements of 4.5% on four benchmarks.
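A minimal sketch of the band-wise fusion using PyWavelets follows; a plain average stands in for the Mamba-based low-frequency fusion (LMFB), which is far more elaborate in the paper:

```python
# DWT splits each modality into sub-bands; low frequencies are merged (average as
# a placeholder for the LMFB), high frequencies use the "absolute maximum" rule,
# and the inverse DWT reassembles the fused result.
import numpy as np
import pywt  # PyWavelets

def fuse_rgb_ir(rgb: np.ndarray, ir: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    cA_r, (cH_r, cV_r, cD_r) = pywt.dwt2(rgb, wavelet)
    cA_i, (cH_i, cV_i, cD_i) = pywt.dwt2(ir, wavelet)
    cA = 0.5 * (cA_r + cA_i)                                     # placeholder for LMFB
    amax = lambda a, b: np.where(np.abs(a) >= np.abs(b), a, b)   # abs-max rule
    return pywt.idwt2((cA, (amax(cH_r, cH_i), amax(cV_r, cV_i), amax(cD_r, cD_i))),
                      wavelet)

rgb = np.random.rand(64, 64)   # single-channel stand-ins for feature maps
ir = np.random.rand(64, 64)
print(fuse_rgb_ir(rgb, ir).shape)  # (64, 64)
```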

[CV-67] GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar ICCV2025

【Quick Read】: This paper addresses the difficulty of balancing identity preservation (i.e., reconstruction quality) with animation under novel poses and expressions in 3D head avatar generation, in particular the inability of existing methods to adapt Gaussians to varying geometric deviations across facial regions, which degrades quality. The key to the proposed GeoAvatar framework for adaptive geometrical Gaussian splatting is threefold: (1) an Adaptive Pre-allocation Stage (APS), an unsupervised method that segments Gaussians into rigid and flexible sets for adaptive offset regularization; (2) a novel mouth structure and part-wise deformation strategy grounded in mouth anatomy and dynamics that markedly improve the animation fidelity of the mouth; and (3) a regularization loss for precise rigging between Gaussians and the 3DMM face, strengthening overall geometric consistency.

Link: https://arxiv.org/abs/2507.18155
Authors: SeungJun Moon, Hah Min Lew, Seungeun Lee, Ji-Su Kang, Gyeong-Moon Park
Affiliations: Klleon AI Research; Korea University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICCV 2025, Project page: this https URL

Abstract:Despite recent progress in 3D head avatar generation, balancing identity preservation, i.e., reconstruction, with novel poses and expressions, i.e., animation, remains a challenge. Existing methods struggle to adapt Gaussians to varying geometrical deviations across facial regions, resulting in suboptimal quality. To address this, we propose GeoAvatar, a framework for adaptive geometrical Gaussian Splatting. GeoAvatar leverages Adaptive Pre-allocation Stage (APS), an unsupervised method that segments Gaussians into rigid and flexible sets for adaptive offset regularization. Then, based on mouth anatomy and dynamics, we introduce a novel mouth structure and the part-wise deformation strategy to enhance the animation fidelity of the mouth. Finally, we propose a regularization loss for precise rigging between Gaussians and 3DMM faces. Moreover, we release DynamicFace, a video dataset with highly expressive facial motions. Extensive experiments show the superiority of GeoAvatar compared to state-of-the-art methods in reconstruction and novel animation scenarios.

[CV-68] Degradation-Consistent Learning via Bidirectional Diffusion for Low-Light Image Enhancement

【Quick Read】: This paper addresses structural inconsistencies and pixel misalignment in low-light image enhancement caused by unidirectional degradation modeling, which struggles to capture the complexity of real-world degradation patterns. The key to the solution is a bidirectional diffusion optimization mechanism that jointly models the degradation processes of low-light and normal-light images: during training, diffusion is performed both from low to normal light and from normal to low light, with an adaptive feature interaction block (AFI) refining feature representations. By exploiting the complementarity of the two paths, the approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, enabling consistent degradation learning and better perception of illumination and detail degradation. A reflection-aware correction module (RACM) further guides color restoration after denoising and suppresses overexposed regions, ensuring content consistency and producing high-quality images aligned with human visual perception.

Link: https://arxiv.org/abs/2507.18144
Authors: Jinhong He, Minglong Xue, Zhipu Liu, Mingliang Zhou, Aoxiang Ning, Palaiahnakote Shivakumara
Affiliations: Chongqing University of Technology; Chongqing University; University of Salford
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 10 pages

Abstract:Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. While diffusion-based methods have shown promising performance due to their strong generative capabilities, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion (from low-to-normal light and from normal-to-low light) during training and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the model’s ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios. Code at this https URL

[CV-69] Information Entropy-Based Framework for Quantifying Tortuosity in Meibomian Gland Uneven Atrophy

【Quick Read】: This paper addresses the precise quantification of curve tortuosity in medical image analysis, particularly in scenarios where biologically interpretable reference curves are available; traditional measures such as curvature or the arc-chord ratio compare against an idealized straight line and thus struggle to provide stable, objective assessments. The key to the solution is an information entropy-based tortuosity quantification framework that integrates probability modeling with entropy theory and introduces a domain transformation of the curve data. By comparing the target curve against a designated reference curve rather than an ideal straight line, it yields a physiologically meaningful quantitative evaluation with improved robustness and interpretability for clinical diagnosis.

Link: https://arxiv.org/abs/2507.18135
Authors: Kesheng Wang, Xiaoyu Chen, Chunlei He, Fenfen Li, Xinxin Yu, Dexing Kong, Shoujun Huang, Qi Dai
Affiliations: Zhejiang Normal University; National Clinical Research Center for Ocular Diseases, Eye Hospital, Wenzhou Medical University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: This manuscript contains 7 figures. All comments are welcome

Abstract:In the medical image analysis field, precise quantification of curve tortuosity plays a critical role in the auxiliary diagnosis and pathological assessment of various diseases. In this study, we propose a novel framework for tortuosity quantification and demonstrate its effectiveness through the evaluation of meibomian gland atrophy uniformity, serving as a representative application scenario. We introduce an information entropy-based tortuosity quantification framework that integrates probability modeling with entropy theory and incorporates domain transformation of curve data. Unlike traditional methods such as curvature or arc-chord ratio, this approach evaluates the tortuosity of a target curve by comparing it to a designated reference curve. Consequently, it is more suitable for tortuosity assessment tasks in medical data where biologically plausible reference curves are available, providing a more robust and objective evaluation metric without relying on idealized straight-line comparisons. First, we conducted numerical simulation experiments to preliminarily assess the stability and validity of the method. Subsequently, the framework was applied to quantify the spatial uniformity of meibomian gland atrophy and to analyze the difference in this uniformity between Demodex-negative and Demodex-positive patient groups. The results demonstrated a significant difference in tortuosity-based uniformity between the two groups, with an area under the curve of 0.8768, sensitivity of 0.75, and specificity of 0.93. These findings highlight the clinical utility of the proposed framework in curve tortuosity analysis and its potential as a generalizable tool for quantitative morphological evaluation in medical diagnostics.
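A hedged sketch of entropy-based tortuosity against a reference curve follows; the pointwise deviation measure and the histogram binning are assumptions, and the paper's domain transformation is not reproduced:

```python
# Turn deviations of the target curve from a reference curve into a probability
# distribution, and score tortuosity by that distribution's Shannon entropy.
import numpy as np

def entropy_tortuosity(target: np.ndarray, reference: np.ndarray, bins: int = 16) -> float:
    dev = np.linalg.norm(target - reference, axis=1)   # pointwise deviation
    hist, _ = np.histogram(dev, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())              # Shannon entropy (bits)

t = np.linspace(0, 2 * np.pi, 200)
reference = np.stack([t, np.zeros_like(t)], axis=1)    # straight baseline curve
target = np.stack([t, 0.3 * np.sin(3 * t)], axis=1)    # wavier, gland-like curve
print(entropy_tortuosity(target, reference))
```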

[CV-70] T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

【Quick Read】: This paper addresses the lack of understanding and use of world knowledge in current text-to-video (T2V) models, which leads to insufficient semantic consistency and factual accuracy in generated videos. The key to the solution is T2VWorldBench, the first systematic framework for evaluating this capability, covering 6 major categories, 60 subcategories, and 1,200 prompts spanning physics, nature, activity, culture, causality, and objects, and combining human preference evaluation with automated evaluation based on vision-language models (VLMs) to comprehensively measure the world-knowledge generation ability of T2V models.

Link: https://arxiv.org/abs/2507.18107
Authors: Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, Jiahao Zhang
Affiliations: San Jose State University; Guilin University of Electronic Technology; University of Wisconsin-Madison; University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-video (T2V) models have shown remarkable performance in generating visually reasonable scenes, while their capability to leverage world knowledge for ensuring semantic consistency and factual accuracy remains largely understudied. In response to this challenge, we propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models, covering 6 major categories, 60 subcategories, and 1,200 prompts across a wide range of domains, including physics, nature, activity, culture, causality, and object. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos. These findings point out a critical gap in the capability of current text-to-video models to leverage world knowledge, providing valuable research opportunities and entry points for constructing models with robust capabilities for commonsense reasoning and factual generation.

[CV-71] Distributional Uncertainty for Out-of-Distribution Detection

【Quick Read】: This paper addresses inaccurate uncertainty estimates in deep networks for detecting out-of-distribution (OoD) samples; conventional methods such as Monte Carlo Dropout focus solely on model or data uncertainty and fail to achieve semantically meaningful OoD identification. The key to the proposed Free-Energy Posterior Network lies in two core innovations: (1) a free-energy-based density estimator parameterized by a Beta distribution, enabling fine-grained uncertainty modeling in ambiguous or unseen regions; and (2) a loss integrated within the posterior network that lets uncertainty be inferred directly from the learned parameters without stochastic sampling. Combined with the residual prediction branch (RPL) framework, the method goes beyond post-hoc energy thresholding and learns OoD regions from the variance of the Beta distribution, providing semantically consistent, uncertainty-aware segmentation while remaining computationally efficient.

Link: https://arxiv.org/abs/2507.18106
Authors: JinYoung Kim, DaeUng Jo, Kimin Yun, Jeonghyo Song, Youngjoon Yoo
Affiliations: Chung-Ang University; Kyungpook National University; ETRI; University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, IEEE International Conference on Advanced Visual and Signal-Based Systems

Abstract:Estimating uncertainty from deep neural networks is a widely used approach for detecting out-of-distribution (OoD) samples, which typically exhibit high predictive uncertainty. However, conventional methods such as Monte Carlo (MC) Dropout often focus solely on either model or data uncertainty, failing to align with the semantic objective of OoD detection. To address this, we propose the Free-Energy Posterior Network, a novel framework that jointly models distributional uncertainty and identifies OoD and misclassified regions using free energy. Our method introduces two key contributions: (1) a free-energy-based density estimator parameterized by a Beta distribution, which enables fine-grained uncertainty estimation near ambiguous or unseen regions; and (2) a loss integrated within a posterior network, allowing direct uncertainty estimation from learned parameters without requiring stochastic sampling. By integrating our approach with the residual prediction branch (RPL) framework, the proposed method goes beyond post-hoc energy thresholding and enables the network to learn OoD regions by leveraging the variance of the Beta distribution, resulting in a semantically meaningful and computationally efficient solution for uncertainty-aware segmentation. We validate the effectiveness of our method on challenging real-world benchmarks, including Fishyscapes, RoadAnomaly, and Segment-Me-If-You-Can.
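The free-energy score that the posterior network builds on is standard and easy to sketch; the Beta-distribution density estimator itself (the paper's contribution) is not reproduced here:

```python
# Free-energy OoD score from classifier logits: E(x) = -T * logsumexp(f(x)/T).
# Higher energy (less negative) suggests lower in-distribution density.
import numpy as np

def free_energy(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = logits / T
    m = z.max(axis=-1, keepdims=True)                      # stable logsumexp
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

in_dist = np.array([[8.0, 0.5, 0.2], [7.5, 1.0, 0.1]])     # confident logits
ood = np.array([[0.4, 0.5, 0.3]])                          # flat logits
print(free_energy(in_dist), free_energy(ood))              # OoD rows score higher
```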

[CV-72] A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli

【Quick Read】: This paper addresses how to accurately predict whole-brain fMRI activity from multimodal inputs (visual, auditory, and language), specifically modeling neural responses while participants watch naturalistic multimodal movies. The key to the solution is a sequence-to-sequence Transformer that autoregressively predicts fMRI signals from multimodal stimuli, with two core innovations: (1) sequences of multimodal context are used to predict sequences of brain activity, capturing long-range temporal structure in both stimuli and neural responses; and (2) a shared encoder combined with partially subject-specific decoders preserves structure common across subjects while accommodating individual variability, yielding strong performance on both in-distribution and out-of-distribution data.

Link: https://arxiv.org/abs/2507.18104
Authors: Qianyi He, Yuan Chang Leong
Affiliations: University of Chicago
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states, current stimuli, and episode-level summaries via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus as well as narrative information provided by high-level summaries of narrative content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with partial subject-specific decoder, which leverages common structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at this https URL.

[CV-73] Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning

【Quick Read】: This paper addresses the limited temporal awareness and poor generalization of existing models for video temporal grounding (VTG). The key to the solution is a two-stage training framework: supervised fine-tuning (SFT) on high-quality curated cold-start data initializes the model, after which difficulty-controlled reinforcement learning (RL) further strengthens temporal localization and reasoning, significantly improving accuracy and robustness, particularly in challenging and open-domain scenarios.

Link: https://arxiv.org/abs/2507.18100
Authors: Ruizhe Chen, Zhiting Fan, Tianze Luo, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang, Zhuochen Wang, Zuozhu Liu, Huaijian Zhang
Affiliations: Bytedance; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.
zh

[CV-74] Comparison of Segmentation Methods in Remote Sensing for Land Use Land Cover

【速读】:该论文旨在解决高精度土地利用/土地覆盖(Land Use Land Cover, LULC)制图在城市规划与资源管理中的应用难题,尤其针对遥感影像大气校正与分类模型性能优化问题。其解决方案的关键在于:首先采用基于查找表(Look-Up Table, LUT)的大气校正方法处理Cartosat多光谱(MX)传感器图像,以提升数据质量;随后结合监督学习(DeeplabV3+)与半监督学习(Cross-Pseudo Supervision, CPS)模型进行LULC预测,并通过动态加权机制优化伪标签的可靠性,从而增强模型在标注样本有限场景下的泛化能力。该方法在印度海得拉巴市的案例研究中验证了其对城市扩张、绿地减少和工业用地扩展等变化的有效监测能力,体现了其在智慧城市可持续发展决策中的实用价值。
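
其中"动态加权的交叉伪监督"可以用如下 PyTorch 草图示意(简化实现,置信度阈值为假设值,论文中的具体加权方式可能不同):两个分支互相以对方的硬伪标签作监督,并按伪标签置信度对逐像素损失加权。

```python
import torch
import torch.nn.functional as F

def cps_loss(logits_a, logits_b, conf_threshold=0.7):
    """简化的 Cross-Pseudo Supervision 损失;logits 形状为 (N, C, H, W)。"""
    with torch.no_grad():
        conf_a, pseudo_a = logits_a.softmax(1).max(1)  # 分支 A 的置信度与伪标签
        conf_b, pseudo_b = logits_b.softmax(1).max(1)
        w_a = conf_a * (conf_a >= conf_threshold)      # 低置信像素权重置 0
        w_b = conf_b * (conf_b >= conf_threshold)
    loss_b = (F.cross_entropy(logits_b, pseudo_a, reduction="none") * w_a).mean()
    loss_a = (F.cross_entropy(logits_a, pseudo_b, reduction="none") * w_b).mean()
    return loss_a + loss_b
```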

链接: https://arxiv.org/abs/2507.18099
作者: Naman Srivastava,Joel D Joy,Yash Dixit,Swarup E,Rakshit Ramesh
机构: Indian Institute of Science (印度科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Land Use Land Cover (LULC) mapping is essential for urban and resource planning, and is one of the key elements in developing smart and sustainable cities. This study evaluates advanced LULC mapping techniques, focusing on Look-Up Table (LUT)-based Atmospheric Correction applied to Cartosat Multispectral (MX) sensor images, followed by supervised and semi-supervised learning models for LULC prediction. We explore DeeplabV3+ and Cross-Pseudo Supervision (CPS). The CPS model is further refined with dynamic weighting, enhancing pseudo-label reliability during training. This comprehensive approach analyses the accuracy and utility of LULC mapping techniques for various urban planning applications. A case study of Hyderabad, India, illustrates significant land use changes due to rapid urbanization. By analyzing Cartosat MX images over time, we highlight shifts such as urban sprawl, shrinking green spaces, and expanding industrial areas. This demonstrates the practical utility of these techniques for urban planners and policymakers.
zh

[CV-75] TextSAM-EUS: Text Prompt Learning for SAM to Accurately Segment Pancreatic Tumor in Endoscopic Ultrasound ICCV2025

【速读】:该论文旨在解决胰腺肿瘤在内镜超声(Endoscopic Ultrasound, EUS)图像中因斑点噪声强、对比度低及视觉呈现不直观而导致的自动分割困难问题,尤其针对全监督深度学习模型对大量专家标注数据依赖性强且分割精度受限的挑战。解决方案的关键在于提出TextSAM-EUS——一种轻量级、文本驱动的Segment Anything Model (SAM) 改进方法,通过BiomedCLIP文本编码器进行上下文优化(context optimization)并结合LoRA(Low-Rank Adaptation)技术仅微调0.86%参数,实现无需人工几何提示即可自动分割EUS图像中的胰腺肿瘤,显著提升分割性能与实用性。
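
文中"仅微调 0.86% 参数"的 LoRA 思路可以用一个标准的低秩旁路层来示意(通用 LoRA 草图,秩 r 与缩放系数为示例值,并非该论文的具体配置):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """冻结原线性层,旁路注入低秩增量:y = Wx + (alpha/r)·B·A·x。"""
    def __init__(self, base: nn.Linear, r=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # 原始权重不参与训练
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable / total:.2%}")  # 本例约 1.03%,量级与论文的 0.86% 相当
```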

链接: https://arxiv.org/abs/2507.18082
作者: Pascal Spiegler,Taha Koleilat,Arash Harirpoush,Corey S. Miller,Hassan Rivaz,Marta Kersten-Oertel,Yiming Xiao
机构: Concordia University (康考迪亚大学); Jewish General Hospital (犹太通用医院); McGill University Faculty of Medicine (麦吉尔大学医学院); Lady Davis Institute for Medical Research (黛娜·莱德医学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICCV 2025 Workshop CVAMD

点击查看摘要

Abstract:Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM’s architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation. Our code will be publicly available upon acceptance.
zh

[CV-76] Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

【速读】:该论文旨在解决现有低光照图像增强(Low-Light Image Enhancement, LLIE)方法在复杂光照条件下效果受限的问题,其核心缺陷在于过度依赖预训练模型先验或低光输入,而忽视了正常光照图像中蕴含的语义信息。解决方案的关键在于提出一种基于大视觉语言模型(Vision-Language Model, VLM)的迭代与人工指令(Iterative and Manual Instructions, IMIs)框架——VLM-IMI,通过引入文本描述作为增强提示,实现语义引导的图像恢复;其中,指令先验融合模块(Instruction Prior Fusion Module)动态对齐并融合图像与文本特征,从而生成细节丰富且语义一致的结果,并借助迭代式人工指令优化策略逐步提升视觉质量,显著改善极端低光条件下的结构保真度与细节恢复能力。

链接: https://arxiv.org/abs/2507.18064
作者: Xiaoran Sun,Liyan Wang,Cong Wang,Yeying Jin,Kin-man Lam,Zhixun Su,Yang Yang,Jinshan Pan
机构: Dalian University of Technology (大连理工大学); University of California, San Francisco (加州大学旧金山分校); Tencent (腾讯); The Hong Kong Polytechnic University (香港理工大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a novel framework that leverages large vision-language models (VLMs) with iterative and manual instructions (IMIs) for LLIE. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. To effectively integrate cross-modal priors, we introduce an instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, we adopt an iterative and manual instruction strategy to refine textual instructions, progressively improving visual quality. This refinement enhances structural fidelity, semantic alignment, and the recovery of fine details under extremely low-light conditions. Extensive experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in both quantitative metrics and perceptual quality. The source code is available at this https URL.
zh

[CV-77] BokehDiff: Neural Lens Blur with One-Step Diffusion ICCV2025

【速读】:该论文旨在解决传统镜头模糊(lens blur)渲染方法在深度不连续区域易产生伪影的问题,这些问题通常源于深度估计精度不足。解决方案的关键在于提出了一种基于生成式扩散先验(generative diffusion prior)的新方法BokehDiff,其核心创新包括:1)设计了一个物理启发的自注意力模块,该模块与图像形成过程对齐,引入了依赖深度的弥散圆(circle of confusion)约束和自遮挡效应;2)将扩散模型适配为一步推理(one-step inference)方案,在不引入额外噪声的前提下实现高质量、高保真度的模糊效果;3)通过扩散模型合成具有透明度的逼真前景,缓解真实场景中成对标注数据稀缺的问题,从而提升结果的真实感与多样性。
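
其中"依赖深度的弥散圆(circle of confusion)约束"可由薄透镜模型直接写出。下面是一个自包含的小函数(示意,参数均为示例值;论文将该约束嵌入自注意力模块,而非这样显式逐点计算):

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """薄透镜模型下的弥散圆直径:
    c = (|d - d_f| / d) * (A * f) / (d_f - f),其中孔径 A = f / N。
    所有长度单位一致(此处为 mm)。"""
    aperture = focal_len / f_number
    return (abs(depth - focus_dist) / depth) * (aperture * focal_len) / (focus_dist - focal_len)

# 50mm f/1.8 镜头对焦 2m 时,4m 处物点的弥散圆直径
print(circle_of_confusion(4000.0, 2000.0, 50.0, 1.8))  # ≈ 0.356 mm
```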

链接: https://arxiv.org/abs/2507.18060
作者: Chengxuan Zhu,Qingnan Fan,Qi Zhang,Jinwei Chen,Huaqi Zhang,Chao Xu,Boxin Shi
机构: National Key Lab of General AI, School of Intelligence Science and Technology, Peking University (北京大学通用人工智能实验室,智能科学与技术学院); Vivo Mobile Communication Co., Ltd. (维沃移动通信有限公司); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (多媒体信息处理国家重点实验室,计算机科学学院); National Engineering Research Center of Visual Technology, School of Computer Science, Peking University (视觉技术国家工程研究中心,计算机科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:We introduce BokehDiff, a novel lens blur rendering method that achieves physically accurate and visually appealing outcomes, with the help of generative diffusion prior. Previous methods are bounded by the accuracy of depth estimation, generating artifacts in depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating depth-dependent circle of confusion constraint and self-occlusion effects. We adapt the diffusion model to the one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired data, we propose to synthesize photorealistic foregrounds with transparency with diffusion models, balancing authenticity and scene diversity.
zh

[CV-78] Enhancing Scene Transition Awareness in Video Generation via Post-Training

【速读】:该论文旨在解决当前生成式视频(Generative Video)模型在长视频生成中缺乏场景过渡意识的问题,即模型难以根据文本提示自动识别并生成多个连贯场景之间的过渡。现有开源模型多基于单场景视频片段训练,导致其无法有效响应需要多场景切换的复杂提示。解决方案的关键在于构建一个名为Transition-Aware Video (TAV) 的新数据集,该数据集包含预处理的多场景过渡视频片段,通过在此数据集上进行微调(post-training),显著提升了模型对提示中场景变化的理解能力,从而改善了跨场景的一致性与连贯性,同时保持图像质量。

链接: https://arxiv.org/abs/2507.18046
作者: Hanwen Shen,Jiajie Lu,Yupeng Cao,Xiaonan Yang
机构: Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in AI-generated video have shown strong performance on text-to-video tasks, particularly for short clips depicting a single scene. However, current models struggle to generate longer videos with coherent scene transitions, primarily because they cannot infer when a transition is needed from the prompt. Most open-source models are trained on datasets consisting of single-scene video clips, which limits their capacity to learn and respond to prompts requiring multiple scenes. Developing scene transition awareness is essential for multi-scene generation, as it allows models to identify and segment videos into distinct clips by accurately detecting transitions. To address this, we propose the Transition-Aware Video (TAV) dataset, which consists of preprocessed video clips with multiple scene transitions. Our experiment shows that post-training on the TAV dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.
zh

[CV-79] NWaaS: Nonintrusive Watermarking as a Service for X-to-Image DNN

【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)模型知识产权保护中现有水印技术固有的侵入性问题,即传统方法需修改模型参数或结构,导致模型行为偏移和微调成本增加,从而阻碍了实际部署与Watermarking as a Service(WaaS)系统的推广。其解决方案的关键在于提出一种无侵入式水印服务(Nonintrusive Watermarking as a Service, NWaaS),通过在受保护模型的黑盒API中建立稳健的旁通道机制,利用密钥编码器和水印解码器实现对未修改模型输出的水印提取,从而在不改变模型本身的前提下实现绝对保真度(absolute fidelity)并兼容多种DNN架构,同时具备对现有攻击的鲁棒性,彻底消除保真度与鲁棒性之间的权衡。

链接: https://arxiv.org/abs/2507.18036
作者: Haonan An,Guang Hua,Yu Guo,Hangcheng Cao,Susanto Rahardja,Yuguang Fang
机构: University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The intellectual property of deep neural network (DNN) models can be protected with DNN watermarking, which embeds copyright watermarks into model parameters (white-box), model behavior (black-box), or model outputs (box-free), and the watermarks can be subsequently extracted to verify model ownership or detect model theft. Despite recent advances, these existing methods are inherently intrusive, as they either modify the model parameters or alter the structure. This natural intrusiveness raises concerns about watermarking-induced shifts in model behavior and the additional cost of fine-tuning, further exacerbated by the rapidly growing model size. As a result, model owners are often reluctant to adopt DNN watermarking in practice, which limits the development of practical Watermarking as a Service (WaaS) systems. To address this issue, we introduce Nonintrusive Watermarking as a Service (NWaaS), a novel trustless paradigm designed for X-to-Image models, in which we hypothesize that with the model untouched, an owner-defined watermark can still be extracted from model outputs. Building on this concept, we propose ShadowMark, a concrete implementation of NWaaS which addresses critical deployment challenges by establishing a robust and nonintrusive side channel in the protected model's black-box API, leveraging a key encoder and a watermark decoder. It differs significantly from existing solutions by attaining so-called absolute fidelity and being applicable to different DNN architectures, while also remaining robust against existing attacks, eliminating the fidelity-robustness trade-off. Extensive experiments on image-to-image, noise-to-image, noise-and-text-to-image, and text-to-image models demonstrate the efficacy and practicality of ShadowMark for real-world deployment of nonintrusive DNN watermarking.
zh

[CV-80] ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks

【速读】:该论文旨在解决当前深度伪造(deepfake)检测方法在面对复杂定制化伪造内容时存在的泛化能力弱和鲁棒性差的问题。解决方案的关键在于提出ViGText框架,其创新性地将图像与视觉大语言模型(Vision Large Language Model, VLLM)生成的文本解释结合,并构建基于图神经网络(Graph Neural Networks, GNNs)的多模态分析结构,通过空间与频域多层次特征提取,实现对伪造细节的精准捕捉,从而显著提升检测性能与抗攻击能力。
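
文中"将图像划分为 patch 并构图"的步骤,可以用如下草图示意 patch 网格的 4-邻接边构建(示意实现,采用 PyG 风格的 edge_index;论文的实际建图与特征可能更复杂):

```python
import torch

def patch_grid_edges(h, w):
    """为 h×w 的 patch 网格构建 4-邻接有向边,返回形状 (2, E) 的 edge_index。"""
    edges = []
    for i in range(h):
        for j in range(w):
            u = i * w + j
            if j + 1 < w:
                edges += [(u, u + 1), (u + 1, u)]  # 左右邻接
            if i + 1 < h:
                edges += [(u, u + w), (u + w, u)]  # 上下邻接
    return torch.tensor(edges, dtype=torch.long).T

edge_index = patch_grid_edges(4, 4)
print(edge_index.shape)  # torch.Size([2, 48]):4×4 网格共 24 条无向边
```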

链接: https://arxiv.org/abs/2507.18031
作者: Ahmad ALBarqawi,Mahmoud Nazzal,Issa Khalil,Abdallah Khreishah,NhatHai Phan
机构: New Jersey Institute of Technology (新泽西理工学院); Old Dominion University (老多明尼大学); Qatar Computing Research Institute (卡塔尔计算研究研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) text explanations within a graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through the use of multi-level feature extraction across spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy to detect sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when it detects user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, reflecting the model's superior ability to generalize to unseen, fine-tuned variations of stable diffusion models. As for robustness, ViGText achieves an increase of 11.1% in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.
zh

[CV-81] Emotion Recognition from Skeleton Data: A Comprehensive Survey

【速读】:该论文旨在解决基于身体动作的**情绪识别(emotion recognition)**问题,以替代依赖面部表情或生理信号的传统方法,从而在保障隐私的同时提升情绪识别的可行性与适用性。其解决方案的关键在于系统梳理和整合当前基于3D骨架(3D skeleton)的情绪识别技术体系:首先从心理学角度厘清身体运动与情绪表达的关系,其次归纳公开数据集的采集方式与标注策略差异,进而提出一个统一的四类技术范式分类体系——传统方法(Traditional approaches)、Feat2Net、FeatFusionNet 和 End2EndNet,从数据驱动与技术实现两个维度对现有方法进行归类、比较与基准测试,并进一步探讨其在心理健康评估(如抑郁与自闭症检测)等扩展应用中的潜力,为该领域提供清晰的技术演进脉络与未来研究方向。

链接: https://arxiv.org/abs/2507.18026
作者: Haifeng Lu,Jiuyi Chen,Zhen Zhang,Ruida Liu,Runhao Zeng,Xiping Hu
机构: Shenzhen MSU-BIT University (深圳莫斯科大学-比特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 5 figures, 13 tables

点击查看摘要

Abstract:Emotion recognition through body movements has emerged as a compelling and privacy-preserving alternative to traditional methods that rely on facial expressions or physiological signals. Recent advancements in 3D skeleton acquisition technologies and pose estimation algorithms have significantly enhanced the feasibility of emotion recognition based on full-body motion. This survey provides a comprehensive and systematic review of skeleton-based emotion recognition techniques. First, we introduce psychological models of emotion and examine the relationship between bodily movements and emotional expression. Next, we summarize publicly available datasets, highlighting the differences in data acquisition methods and emotion labeling strategies. We then categorize existing methods into posture-based and gait-based approaches, analyzing them from both data-driven and technical perspectives. In particular, we propose a unified taxonomy that encompasses four primary technical paradigms: Traditional approaches, Feat2Net, FeatFusionNet, and End2EndNet. Representative works within each category are reviewed and compared, with benchmarking results across commonly used datasets. Finally, we explore the extended applications of emotion recognition in mental health assessment, such as detecting depression and autism, and discuss the open challenges and future research directions in this rapidly evolving field.
zh

[CV-82] High-fidelity 3D Gaussian Inpainting: preserving multi-view consistency and photorealistic details

【速读】:该论文旨在解决3D场景修复(inpainting)中因三维结构固有的不规则性及多视角一致性维持困难而导致的重建难题。其解决方案的关键在于提出了一种基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的新型3D高斯修复框架,通过引入自动掩码精修流程(Mask Refinement Process)与区域级不确定性引导优化策略(Uncertainty-guided Optimization)。其中,掩码精修利用高斯场景滤波和反投影操作实现遮挡区域的精准定位与边界真实还原;而不确定性引导的细粒度优化则在训练过程中估计各区域在多视角图像中的重要性,从而缓解多视角不一致问题并提升修复结果的细节保真度。

链接: https://arxiv.org/abs/2507.18023
作者: Jun Zhou,Dinghao Li,Nannan Li,Mingjie Wang
机构: Dalian Maritime University (大连海事大学); Zhejiang Sci-Tech University (浙江理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in multi-view 3D reconstruction and novel-view synthesis, particularly through Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have greatly enhanced the fidelity and efficiency of 3D content creation. However, inpainting 3D scenes remains a challenging task due to the inherent irregularity of 3D structures and the critical need for maintaining multi-view consistency. In this work, we propose a novel 3D Gaussian inpainting framework that reconstructs complete 3D scenes by leveraging sparse inpainted views. Our framework incorporates an automatic Mask Refinement Process and region-wise Uncertainty-guided Optimization. Specifically, we refine the inpainting mask using a series of operations, including Gaussian scene filtering and back-projection, enabling more accurate localization of occluded regions and realistic boundary restoration. Furthermore, our Uncertainty-guided Fine-grained Optimization strategy, which estimates the importance of each region across multi-view images during training, alleviates multi-view inconsistencies and enhances the fidelity of fine details in the inpainted results. Comprehensive experiments conducted on diverse datasets demonstrate that our approach outperforms existing state-of-the-art methods in both visual quality and view consistency.
zh

[CV-83] Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics

【速读】:该论文旨在解决深度伪造(DeepFake)视频检测中“泛化能力不足”的问题,即如何利用单一模型有效识别多种未见过的伪造类型。当前主流检测方法受限于训练数据的伪造类型多样性不足,难以应对真实场景中不断涌现的新式伪造技术。解决方案的关键在于构建一个大规模、高多样性的新基准数据集——Celeb-DF++,其覆盖三种典型伪造场景(人脸交换 Face-swap, 人脸重演 Face-reenactment, 对口型说话 Talking-face),并包含22种不同架构和生成流程的近期DeepFake方法所生成的高质量伪造视频,从而更真实地模拟现实世界的伪造复杂性。同时,论文还提出了一套系统化的评估协议,用于量化衡量现有检测方法在跨伪造类型上的泛化性能,揭示了当前技术的局限性和该数据集的挑战性。

链接: https://arxiv.org/abs/2507.18015
作者: Yuezun Li,Delong Zhu,Xinjie Cui,Siwei Lyu
机构: Ocean University of China (中国海洋大学); University at Buffalo, SUNY (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for generalizable forensics, i.e., detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce Celeb-DF++, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using a total of 22 various recent DeepFake methods. These methods differ in terms of architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detection methods and the difficulty of our new dataset.
zh

[CV-84] Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

【速读】:该论文旨在解决现有基于仿射格拉斯曼流形(Affine Grassmannian)的特征距离度量方法无法显式表达为刚体变换(旋转 $\mathbf{R}$ 与平移 $\mathbf{t}$)函数的问题,从而限制了其在配准(registration)任务中的优化应用。解决方案的关键在于首次严格推导出一个关于刚体变换可优化的成本函数:通过高维线性子空间基向量的显式表示,将格拉斯曼流形上两点间的测地距离(geodesic distance)转化为可直接最小化的函数形式,且该函数不依赖于参数表示的歧义性,从而能够实现全局最优解的求解。该方法已成功应用于任意仿射子空间的配准问题,并通过扩展至inlier-set最大化BnB(Branch-and-Bound)求解器,在多种计算机视觉任务中显著提升了收敛性能或优于现有方法。
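
作为背景,格拉斯曼流形上的测地距离可由两子空间的主角(principal angles)给出。下面是线性子空间版本的最小示例(论文处理的是仿射子空间,并进一步把代价写成刚体变换的显式函数,此处仅示意距离本身):

```python
import numpy as np
from scipy.linalg import subspace_angles

def grassmann_geodesic(A, B):
    """列张成子空间 span(A) 与 span(B) 的测地距离:d = ||theta||_2。"""
    theta = subspace_angles(A, B)  # 主角,单位为弧度
    return np.linalg.norm(theta)

# 三维空间中两条过原点的直线,测地距离即二者夹角
a = np.array([[1.0], [0.0], [0.0]])
b = np.array([[1.0], [1.0], [0.0]])
print(grassmann_geodesic(a, b))  # ≈ pi/4 ≈ 0.7854
```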

链接: https://arxiv.org/abs/2507.17998
作者: Jaeho Shin,Hyeonjae Gil,Junwoo Jang,Maani Ghaffari,Ayoung Kim
机构: Seoul National University (首尔国立大学); Inha University (仁荷大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing Branch-and-Bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code is available on this https URL.
zh

[CV-85] Exploring the interplay of label bias with subgroup size and separability: A case study in mammographic density classification MICCAI

【速读】:该论文旨在解决医学影像数据集中因标签偏差(label bias)导致特定亚群(subgroup)分类不公平的问题,这可能影响医疗人工智能系统在实际应用中的公平性和可靠性。其解决方案的关键在于系统性地分析标签偏差对深度学习模型特征表示和子群性能的影响,特别关注受影响亚群的相对大小(size)与可分离性(separability)这两个因素的作用机制。研究通过在EMory BrEast imaging Dataset (EMBED) 上训练二分类模型,模拟不同亚群的标签偏差,并发现模型在特征空间中的表示显著偏移,且这种偏移依赖于亚群的大小和可分离性;此外,使用带有清洁标签的验证集确定分类阈值可显著改善亚群性能,例如在多数可分离亚群存在标签偏差时,若验证集含偏标签,该亚群真阳性率从0.898降至0.518,凸显了验证数据质量对提升亚群公平性的关键作用。
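
论文中"对特定亚群模拟标签偏差"的操作本质上是有针对性的标签翻转,可用如下草图复现实验设置的思路(示意实现,翻转比例等为假设参数):

```python
import numpy as np

def inject_label_bias(labels, subgroup_mask, flip_rate, seed=0):
    """对 subgroup_mask 标记的亚群按 flip_rate 比例翻转二分类标签。"""
    rng = np.random.default_rng(seed)
    biased = labels.copy()
    idx = np.flatnonzero(subgroup_mask)
    flip_idx = rng.choice(idx, size=int(len(idx) * flip_rate), replace=False)
    biased[flip_idx] = 1 - biased[flip_idx]  # 0 <-> 1 翻转
    return biased

labels = np.zeros(1000, dtype=int)
subgroup = np.arange(1000) < 200          # 假设前 200 个样本属于目标亚群
noisy = inject_label_bias(labels, subgroup, flip_rate=0.3)
print(noisy.sum())  # 60:亚群内 30% 的标签被翻转
```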

链接: https://arxiv.org/abs/2507.17996
作者: Emma A.M. Stanley,Raghav Mehta,Mélanie Roschewitz,Nils D. Forkert,Ben Glocker
机构: Imperial College London (帝国理工学院); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI Workshop on Fairness of AI in Medical Imaging (FAIMI) 2025

点击查看摘要

Abstract:Systematic mislabelling affecting specific subgroups (i.e., label bias) in medical imaging datasets represents an understudied issue concerning the fairness of medical AI systems. In this work, we investigated how size and separability of subgroups affected by label bias influence the learned features and performance of a deep learning model. Therefore, we trained deep learning models for binary tissue density classification using the EMory BrEast imaging Dataset (EMBED), where label bias affected separable subgroups (based on imaging manufacturer) or non-separable “pseudo-subgroups”. We found that simulated subgroup label bias led to prominent shifts in the learned feature representations of the models. Importantly, these shifts within the feature space were dependent on both the relative size and the separability of the subgroup affected by label bias. We also observed notable differences in subgroup performance depending on whether a validation set with clean labels was used to define the classification threshold for the model. For instance, with label bias affecting the majority separable subgroup, the true positive rate for that subgroup fell from 0.898, when the validation set had clean labels, to 0.518, when the validation set had biased labels. Our work represents a key contribution toward understanding the consequences of label bias on subgroup fairness in medical imaging AI.
zh

[CV-86] AG-VPReID.VIR: Bridging Aerial and Ground Platforms for Video-based Visible-Infrared Person Re-ID

【速读】:该论文旨在解决可见光(RGB)与红外(Infrared, IR)模态下跨空地视角的行人重识别(Person Re-Identification, Re-ID)问题,尤其针对现有数据集多集中于地面视角、难以支持全天候监控系统的问题。其核心挑战在于跨视角差异(如空中与地面视角)、模态不一致(RGB与IR图像特征差异)以及视频序列中的时序动态变化。解决方案的关键是提出TCC-VPReID框架,一种三流架构,通过风格鲁棒特征学习(style-robust feature learning)、基于记忆的跨视角自适应(memory-based cross-view adaptation)和中介引导的时序建模(intermediary-guided temporal modeling),有效弥合了空地平台间与RGB-IR模态间的域差距,从而在复杂场景中实现更稳定的跨模态行人匹配性能。

链接: https://arxiv.org/abs/2507.17995
作者: Huy Nguyen,Kien Nguyen,Akila Pemasiri,Akmal Jahan,Clinton Fookes,Sridha Sridharan
机构: Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Joint Conference on Biometrics (IJCB) 2025

点击查看摘要

Abstract:Person re-identification (Re-ID) across visible and infrared modalities is crucial for 24-hour surveillance systems, but existing datasets primarily focus on ground-level perspectives. While ground-based IR systems offer nighttime capabilities, they suffer from occlusions, limited coverage, and vulnerability to obstructions–problems that aerial perspectives uniquely solve. To address these limitations, we introduce AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset. This dataset captures 1,837 identities across 4,861 tracklets (124,855 frames) using both UAV-mounted and fixed CCTV cameras in RGB and infrared modalities. AG-VPReID.VIR presents unique challenges including cross-viewpoint variations, modality discrepancies, and temporal dynamics. Additionally, we propose TCC-VPReID, a novel three-stream architecture designed to address the joint challenges of cross-platform and cross-modality person Re-ID. Our approach bridges the domain gaps between aerial-ground perspectives and RGB-IR modalities, through style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling. Experiments show that AG-VPReID.VIR presents distinctive challenges compared to existing datasets, with our TCC-VPReID framework achieving significant performance gains across multiple evaluation protocols. Dataset and code are available at this https URL.
zh

[CV-87] Bearded Dragon Activity Recognition Pipeline: An AI-Based Approach to Behavioural Monitoring

【速读】:该论文旨在解决传统蜥蜴行为监测方法效率低、易出错的问题,提出了一种基于YOLO(You Only Look Once)目标检测模型的自动化实时视频分析系统,用于识别鬃狮蜥(Pogona vitticeps)的两种关键行为:晒太阳(basking)和捕食(hunting)。解决方案的关键在于使用五种YOLO变体在自建公开数据集(含600张蜥蜴、500张加热灯和100张蟋蟀图像)上进行训练,并最终选择YOLOv8s作为最优模型,因其在精度(mAP@0.5:0.95 = 0.855)与速度之间取得了最佳平衡;系统通过逐帧提取目标坐标、应用时间插值保持连续性,并结合规则逻辑实现行为分类。尽管晒太阳行为检测可靠,但捕食行为准确性受限于蟋蟀检测性能不足(mAP@0.5 = 0.392),未来将聚焦于增强小目标(如蟋蟀)检测能力以提升整体效果。
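
其中"时间插值 + 规则逻辑"这一后处理环节可用如下 NumPy 草图示意(阈值 radius 为假设值,论文的具体规则未必相同):

```python
import numpy as np

def interpolate_track(centers):
    """对缺帧的目标中心做线性插值;centers 形状 (T, 2),缺失帧为 NaN。"""
    filled = centers.copy()
    t = np.arange(len(filled))
    for d in range(2):
        ok = ~np.isnan(filled[:, d])
        filled[:, d] = np.interp(t, t[ok], filled[ok, d])
    return filled

def is_basking(dragon_xy, lamp_xy, radius=80.0):
    """规则判定:蜥蜴中心落在加热灯 radius 像素范围内即记为晒太阳。"""
    return float(np.linalg.norm(dragon_xy - lamp_xy)) <= radius

track = np.array([[10.0, 10.0], [np.nan, np.nan], [30.0, 30.0]])
print(interpolate_track(track)[1])  # [20. 20.]
print(is_basking(np.array([100.0, 100.0]), np.array([150.0, 140.0])))  # True(距离≈64)
```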

链接: https://arxiv.org/abs/2507.17987
作者: Arsen Yermukan,Pedro Machado,Feliciano Domingos,Isibor Kennedy Ihianle,Jordan J. Bird,Stefano S. K. Kaburu,Samantha J. Ward
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional monitoring of bearded dragon (Pogona vitticeps) behaviour is time-consuming and prone to errors. This project introduces an automated system for real-time video analysis, using You Only Look Once (YOLO) object detection models to identify two key behaviours: basking and hunting. We trained five YOLO variants (v5, v7, v8, v11, v12) on a custom, publicly available dataset of 1200 images, encompassing bearded dragons (600), heating lamps (500), and crickets (100). YOLOv8s was selected as the optimal model due to its superior balance of accuracy (mAP@0.5:0.95 = 0.855) and speed. The system processes video footage by extracting per-frame object coordinates, applying temporal interpolation for continuity, and using rule-based logic to classify specific behaviours. Basking detection proved reliable. However, hunting detection was less accurate, primarily due to weak cricket detection (mAP@0.5 = 0.392). Future improvements will focus on enhancing cricket detection through expanded datasets or specialised small-object detectors. This automated system offers a scalable solution for monitoring reptile behaviour in controlled environments, significantly improving research efficiency and data quality.
zh

[CV-88] Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA

【速读】:该论文旨在解决文本到视频生成模型中动态概念个性化(dynamic concept personalization)的可扩展性问题,即现有方法通常需要为每个实例进行微调(per-instance fine-tuning),难以高效适配新主体。其解决方案的关键在于提出一种全零样本(fully zero-shot)框架,通过结构化2×2视频网格(structured 2x2 video grids)对输入与输出进行空间组织,训练轻量级Grid-LoRA适配器用于编辑与组合;在推理阶段,引入专用的Grid Fill模块完成部分观测布局,从而生成时序一致且身份保持的视频输出,整个系统仅需单次前向传播即可泛化至未见过的动态概念,无需测试时优化。

链接: https://arxiv.org/abs/2507.17963
作者: Rameen Abdal,Or Patashnik,Ekaterina Deyneka,Hao Chen,Aliaksandr Siarohin,Sergey Tulyakov,Daniel Cohen-Or,Kfir Aberman
机构: Snap Research( Snap 研究院); Snap Inc.( Snap 公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page and Video : this https URL

点击查看摘要

Abstract:Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.
zh

[CV-89] OPEN: A Benchmark Dataset and Baseline for Older Adult Patient Engagement Recognition in Virtual Rehabilitation Learning Environments

【速读】:该论文旨在解决在虚拟群体学习环境中对老年群体参与度(engagement)进行准确、自动化识别的难题,尤其关注在线教育与远程康复场景中缺乏针对老年人的高质量数据集及忽视情境相关性和纵向变化的问题。其解决方案的关键在于构建并发布OPEN数据集——这是目前最大规模的老年患者参与度数据集,涵盖11名参与者在六周内每周虚拟团体学习中的超过35小时多模态数据(如面部、手部和身体关节关键点及情感与行为特征),并提供多种时间粒度的样本版本以支持不同AI模型训练;同时,通过多种机器学习与深度学习方法验证了该数据集的有效性,实现了最高达81%的参与度识别准确率,为老龄化人群个性化参与建模提供了可扩展的基础。

链接: https://arxiv.org/abs/2507.17959
作者: Ali Abedi,Sadaf Safa,Tracey J.F. Colella,Shehroz S. Khan
机构: KITE Research Institute (KITE 研究所); Toronto Rehabilitation Institute (多伦多康复研究所); University Health Network (大学健康网络)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 3 figures, 7 tables

点击查看摘要

Abstract:Engagement in virtual learning is essential for participant satisfaction, performance, and adherence, particularly in online education and virtual rehabilitation, where interactive communication plays a key role. Yet, accurately measuring engagement in virtual group settings remains a challenge. There is increasing interest in using artificial intelligence (AI) for large-scale, real-world, automated engagement recognition. While engagement has been widely studied in younger academic populations, research and datasets focused on older adults in virtual and telehealth learning settings remain limited. Existing methods often neglect contextual relevance and the longitudinal nature of engagement across sessions. This paper introduces OPEN (Older adult Patient ENgagement), a novel dataset supporting AI-driven engagement recognition. It was collected from eleven older adults participating in weekly virtual group learning sessions over six weeks as part of cardiac rehabilitation, producing over 35 hours of data, making it the largest dataset of its kind. To protect privacy, raw video is withheld; instead, the released data include facial, hand, and body joint landmarks, along with affective and behavioral features extracted from video. Annotations include binary engagement states, affective and behavioral labels, and context-type indicators, such as whether the instructor addressed the group or an individual. The dataset offers versions with 5-, 10-, 30-second, and variable-length samples. To demonstrate utility, multiple machine learning and deep learning models were trained, achieving engagement recognition accuracy of up to 81 percent. OPEN provides a scalable foundation for personalized engagement modeling in aging populations and contributes to broader engagement recognition research.
zh

[CV-90] VIBE: Video-Input Brain Encoder for fMRI Response Modeling

【速读】:该论文旨在解决跨模态神经表征预测问题,即如何利用多模态视频、音频和文本信息来准确预测人脑功能磁共振成像(fMRI)活动。其解决方案的关键在于提出一个两阶段Transformer架构——VIBE,首先通过模态融合Transformer整合来自开源模型(如Qwen2.5、BEATs、Whisper、SlowFast、V-JEPA)的多模态特征,再由带有旋转位置编码(rotary embeddings)的预测Transformer进行时序解码,从而实现对fMRI信号的高精度建模。该方法在CNeuroMod数据集上训练并经20个随机种子集成后,在分布内(Friends S07)和分布外(六部电影)测试中分别达到32.25和21.25的平均parcel-wise Pearson相关系数,显著优于此前版本并在Algonauts 2025 Challenge中取得优异成绩。

链接: https://arxiv.org/abs/2507.17958
作者: Daniel Carlstrom Schad,Shrey Dixit,Janis Keck,Viktor Studenyak,Aleksandr Shpilevoi,Andrej Bicanski
机构: Max Planck School of Cognition (马克斯普朗克认知学校); CBS (认知与行为科学中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 32.25 on in-distribution Friends S07 and 21.25 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
zh

[CV-91] AFRDA: Attentive Feature Refinement for Domain Adaptive Semantic Segmentation

【速读】:该论文针对无监督域自适应语义分割(Unsupervised Domain Adaptive Semantic Segmentation, UDA-SS)中模型难以平衡细粒度局部细节与全局上下文信息的问题展开研究,该问题常导致在复杂场景区域出现分割误差。解决方案的关键在于提出自适应特征精炼(Adaptive Feature Refinement, AFR)模块:该模块通过低分辨率logits中的语义先验来精炼高分辨率特征,并引入高频成分以捕捉细粒度结构和边界信息,从而提升目标物体的轮廓 delineation(分割边界清晰度);同时,AFR采用不确定性驱动的注意力机制自适应地融合局部与全局信息,减少误分类现象。其轻量化设计可无缝集成至基于HRDA(High-Resolution Data Augmentation)的UDA方法中,在GTA5→Cityscapes和Synthia→Cityscapes两个基准上分别实现1.05%和1.04%的mIoU提升,达到当前最优性能。
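
文中"不确定性驱动的注意力"可以用预测熵来示意:熵越高的位置越不确定,可分配更大的全局上下文权重。下面是一个简化草图(假设性实现,仅说明思路,并非 AFR 模块的官方代码):

```python
import math
import torch

def uncertainty_weight(logits):
    """由逐像素预测熵得到归一化不确定性权重,形状 (N, 1, H, W)。"""
    p = logits.softmax(1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(1, keepdim=True)
    return entropy / math.log(logits.size(1))  # 除以 log(C) 归一化到 [0, 1]

def fuse(local_feat, global_feat, logits):
    """不确定区域更依赖全局分支,确定区域保留局部细节(示意)。"""
    w = uncertainty_weight(logits)
    return (1 - w) * local_feat + w * global_feat
```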

链接: https://arxiv.org/abs/2507.17957
作者: Md. Al-Masrur Khan,Durgakant Pushp,Lantao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high-resolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty-driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V → Cityscapes and 1.04% mIoU on Synthia → Cityscapes. The implementation of our framework is available at: this https URL
zh

[CV-92] DiNAT-IR: Exploring Dilated Neighborhood Attention for High-Quality Image Restoration

【速读】:该论文旨在解决Transformer在图像恢复任务中因自注意力机制计算成本高而导致的高分辨率图像处理效率与质量难以兼顾的问题。其核心解决方案是提出DiNAT-IR架构,关键创新在于引入通道感知模块(channel-aware module)以增强局部注意力对全局上下文的理解能力,同时结合膨胀邻域注意力(Dilated Neighborhood Attention, DiNA),通过滑动窗口与混合膨胀因子设计,在不显著增加计算开销的前提下有效扩展感受野,从而实现局部细节精度与全局语义信息的协同优化,显著提升图像恢复质量。

链接: https://arxiv.org/abs/2507.17892
作者: Hanzhou Liu,Binghan Li,Chengkai Liu,Mi Lu
机构: Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformers, with their self-attention mechanisms for modeling long-range dependencies, have become a dominant paradigm in image restoration tasks. However, the high computational cost of self-attention limits scalability to high-resolution images, making efficiency-quality trade-offs a key research focus. To address this, Restormer employs channel-wise self-attention, which computes attention across channels instead of spatial dimensions. While effective, this approach may overlook localized artifacts that are crucial for high-quality image restoration. To bridge this gap, we explore Dilated Neighborhood Attention (DiNA) as a promising alternative, inspired by its success in high-level vision tasks. DiNA balances global context and local precision by integrating sliding-window attention with mixed dilation factors, effectively expanding the receptive field without excessive overhead. However, our preliminary experiments indicate that directly applying this global-local design to the classic deblurring task hinders accurate visual restoration, primarily due to the constrained global context understanding within local attention. To address this, we introduce a channel-aware module that complements local attention, effectively integrating global context without sacrificing pixel-level precision. The proposed DiNAT-IR, a Transformer-based architecture specifically designed for image restoration, achieves competitive results across multiple benchmarks, offering a high-quality solution for diverse low-level computer vision problems.
zh

[CV-93] owards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis

【速读】:该论文旨在解决皮肤癌(如黑色素瘤)检测模型在实际应用中因数据偏差导致的公平性问题,特别是评估过程中因训练数据对不同个人可识别信息(PII,包括性别、年龄和种族)群体代表性不足而引发的潜在不公平风险。其解决方案的关键在于利用最先进的生成式 AI(Generative AI, GenAI)模型 LightningDiT 生成高度逼真的合成医学图像,从而构建更具代表性的评估数据集,以更可靠地衡量和提升医疗影像类生成式 AI 系统的公平性表现。

链接: https://arxiv.org/abs/2507.17860
作者: Ko Watanabe, Stanislav Frolov, Adriano Lucieri, Andreas Dengel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in Deep Learning and its application on the edge hold great potential for the revolution of routine screenings for skin cancers like Melanoma. Along with the anticipated benefits of this technology, potential dangers arise from unforseen and inherent biases. Thus, assessing and improving the fairness of such systems is of utmost importance. A key challenge in fairness assessment is to ensure that the evaluation dataset is sufficiently representative of different Personal Identifiable Information (PII) (sex, age, and race) and other minority groups. Against the backdrop of this challenge, this study leverages the state-of-the-art Generative AI (GenAI) LightningDiT model to assess the fairness of publicly available melanoma classifiers. The results suggest that fairness assessment using highly realistic synthetic data is a promising direction. Yet, our findings indicate that verifying fairness becomes difficult when the melanoma-detection model used for evaluation is trained on data that differ from the dataset underpinning the synthetic images. Nonetheless, we propose that our approach offers a valuable new avenue for employing synthetic data to gauge and enhance fairness in medical-imaging GenAI systems.
zh

[CV-94] FishDet-M: A Unified Large-Scale Benchmark for Robust Fish Detection and CLIP-Guided Model Selection in Diverse Aquatic Visual Domains

【速读】:该论文旨在解决水下鱼类检测(fish detection)在实际部署中面临的三大挑战:数据集碎片化、成像条件异构性以及评估协议不一致。为应对这些问题,作者提出了FishDet-M,这是目前最大的统一基准,整合了13个公开数据集,涵盖海洋、咸淡水、遮挡和水族馆等多种水下环境,并采用COCO风格的标注(包含边界框和分割掩码),实现跨域一致性评估。解决方案的关键在于:首先通过标准化数据与标注提升模型训练和评估的可重复性;其次系统性地对比28种主流目标检测模型(从YOLOv8到YOLOv12、R-CNN及DETR系列),量化不同架构在精度(mAP、mAP@50、mAP@75)与效率(延迟、参数量)之间的权衡;最后引入基于CLIP的零样本模型选择框架,利用视觉-语言对齐动态识别最适合当前输入图像的检测器,无需集成计算即可实现高效自适应部署,从而为复杂水下场景的目标检测提供标准化、可扩展的解决方案。
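
文末提到的"基于 CLIP 的零样本检测器选择"可以用 OpenAI CLIP 的图文相似度来示意。下面草图中,各检测器对应的场景描述文本纯属举例,模型名称亦为假设:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

detector_profiles = {  # 假设:用一句话刻画每个检测器擅长的水下场景
    "yolov8s-marine":   "a clear open-water marine scene with fish",
    "yolov8s-brackish": "a turbid brackish underwater scene with low visibility",
    "detr-aquarium":    "an aquarium scene with glass reflections and dense fish",
}

model, preprocess = clip.load("ViT-B/32", device="cpu")
text_tokens = clip.tokenize(list(detector_profiles.values()))

@torch.no_grad()
def select_detector(pil_image):
    image = preprocess(pil_image).unsqueeze(0)
    img_f = model.encode_image(image)              # (1, 512)
    txt_f = model.encode_text(text_tokens)         # (3, 512)
    sims = torch.cosine_similarity(img_f, txt_f)   # 图文对齐得分
    return list(detector_profiles)[sims.argmax().item()]
```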

链接: https://arxiv.org/abs/2507.17859
作者: Muayad Abujabal,Lyes Saad Saoud,Irfan Hussain
机构: Khalifa University (哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate fish detection in underwater imagery is essential for ecological monitoring, aquaculture automation, and robotic perception. However, practical deployment remains limited by fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols. To address these gaps, we present FishDet-M, the largest unified benchmark for fish detection, comprising 13 publicly available datasets spanning diverse aquatic environments including marine, brackish, occluded, and aquarium scenes. All data are harmonized using COCO-style annotations with both bounding boxes and segmentation masks, enabling consistent and scalable cross-domain evaluation. We systematically benchmark 28 contemporary object detection models, covering the YOLOv8 to YOLOv12 series, R-CNN based detectors, and DETR based models. Evaluations are conducted using standard metrics including mAP, mAP@50, and mAP@75, along with scale-specific analyses (AP$_S$, AP$_M$, AP$_L$) and inference profiling in terms of latency and parameter count. The results highlight the varying detection performance across models trained on FishDet-M, as well as the trade-off between accuracy and efficiency across models of different architectures. To support adaptive deployment, we introduce a CLIP-based model selection framework that leverages vision-language alignment to dynamically identify the most semantically appropriate detector for each input image. This zero-shot selection strategy achieves high performance without requiring ensemble computation, offering a scalable solution for real-time applications. FishDet-M establishes a standardized and reproducible platform for evaluating object detection in complex aquatic scenes. All datasets, pretrained models, and evaluation tools are publicly available to facilitate future research in underwater computer vision and intelligent marine systems.
zh

[CV-95] Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在处理复杂提示词时的局限性,尤其是涉及多个主体及其各自属性时的绑定错误与布局失真问题。解决方案的关键在于提出一种无需训练的渐进式细节注入(Progressive Detail Injection, PDI)策略:首先将复杂提示分解为一系列简化子提示,分阶段引导生成过程,利用自注意力机制先确保全局构图合理性,再逐步细化细节;同时,通过交叉注意力机制和测试阶段引入的质心对齐损失(Centroid Alignment Loss),有效降低属性与主体间的绑定噪声,提升属性一致性,从而显著改善多对象及复杂风格场景下的生成质量。
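
其中"将复杂提示分解为逐步细化的子提示"这一步,可用一个极简的分解函数示意(按逗号切分只是演示用的假设策略,论文的分解方式可能基于语义解析):

```python
def decompose_prompt(prompt: str):
    """把复杂提示拆成由粗到细的子提示序列(示意:按逗号累积拼接)。"""
    parts = [p.strip() for p in prompt.split(",") if p.strip()]
    return [", ".join(parts[:i + 1]) for i in range(len(parts))]

for stage in decompose_prompt("a cat on a sofa, wearing a red scarf, oil painting style"):
    print(stage)
# a cat on a sofa
# a cat on a sofa, wearing a red scarf
# a cat on a sofa, wearing a red scarf, oil painting style
```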

链接: https://arxiv.org/abs/2507.17853
作者: Lifeng Chen,Jiner Wang,Zihao Pan,Beier Zhu,Xiaofeng Yang,Chi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
zh

[CV-96] SV3.3B: A Sports Video Understanding Model for Action Recognition

【速读】:该论文旨在解决自动化体育视频分析中长期存在的两大问题:一是传统模型计算复杂度高,依赖服务器端处理,难以实现实时、轻量化的部署;二是现有方法缺乏对运动员动作细节的精细理解,无法有效捕捉如准备、执行和收尾等关键生物力学阶段。解决方案的关键在于提出一个轻量级的3.3B参数视频理解模型SV3.3B,其核心创新包括:采用基于离散小波变换(DWT)与VGG16结合的LDA(线性判别分析)机制进行关键帧提取,精准选出16个最具代表性的视频帧;引入基于掩码去噪目标预训练的V-DWT-JEPA2编码器与微调后的大型语言模型(LLM)解码器,实现高效且准确的动作描述生成。该方案在NSVA篮球数据集上显著优于GPT-4o等大模型,在信息密度、动作复杂度和测量精度等方面提升明显,同时具备更低的计算开销,适用于终端设备部署。
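
其中基于 DWT 的关键帧打分思路,可用 PyWavelets 写一个极简版本示意(仅保留"高频能量越高、信息量越大"这一直觉;论文实际还结合了 VGG16 特征与 LDA):

```python
import numpy as np
import pywt  # pip install PyWavelets

def frame_energy(gray_frame):
    """单帧灰度图的 2D Haar 小波高频能量,作为信息量的粗略打分。"""
    _, (cH, cV, cD) = pywt.dwt2(gray_frame, "haar")
    return float((cH ** 2).sum() + (cV ** 2).sum() + (cD ** 2).sum())

def pick_keyframes(frames, k=16):
    """选出高频能量最高的 k 帧索引,按时间顺序返回。"""
    scores = [frame_energy(f) for f in frames]
    top = np.argsort(scores)[-k:]
    return sorted(top.tolist())

frames = [np.random.rand(64, 64) for _ in range(100)]
print(pick_keyframes(frames))  # 16 个按时间排序的帧索引
```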

链接: https://arxiv.org/abs/2507.17844
作者: Sai Varun Kodathala,Yashwanth Reddy Vutukoori,Rakesh Vunnam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, 4 tables. Submitted to AIxSET 2025

点击查看摘要

Abstract:This paper addresses the challenge of automated sports video analysis, which has traditionally been limited by computationally intensive models requiring server-side processing and lacking fine-grained understanding of athletic movements. Current approaches struggle to capture the nuanced biomechanical transitions essential for meaningful sports analysis, often missing critical phases like preparation, execution, and follow-through that occur within seconds. To address these limitations, we introduce SV3.3B, a lightweight 3.3B parameter video understanding model that combines novel temporal motion difference sampling with self-supervised learning for efficient on-device deployment. Our approach employs a DWT-VGG16-LDA based keyframe extraction mechanism that intelligently identifies the 16 most representative frames from sports sequences, followed by a V-DWT-JEPA2 encoder pretrained through mask-denoising objectives and an LLM decoder fine-tuned for sports action description generation. Evaluated on a subset of the NSVA basketball dataset, SV3.3B achieves superior performance across both traditional text generation metrics and sports-specific evaluation criteria, outperforming larger closed-source models including GPT-4o variants while maintaining significantly lower computational requirements. Our model demonstrates exceptional capability in generating technically detailed and analytically rich sports descriptions, achieving 29.2% improvement over GPT-4o in ground truth validation metrics, with substantial improvements in information density, action complexity, and measurement precision metrics essential for comprehensive athletic analysis. Model Available at this https URL.
zh

[CV-97] Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling

【速读】:该论文旨在解决当前图像生成模型中依赖预训练组件或混合架构所带来的灵活性受限与许可限制问题,同时提升生成质量与多任务适应性。其解决方案的关键在于提出Lumina-mGPT 2.0——一个从零开始训练的独立解码器-only自回归模型,通过完全自主的架构设计实现无约束的生成能力与灵活的许可证自由度;结合统一的标记化方案,使模型能够无缝支持从主体驱动生成、图像编辑到可控合成及密集预测等多种任务,且在文本到图像基准测试(如GenEval、DPG)上达到甚至超越扩散模型(diffusion models)的性能表现,同时引入推理时缩放和推测雅可比采样等高效解码策略以优化生成速度与质量。

链接: https://arxiv.org/abs/2507.17801
作者: Yi Xin,Juncheng Yan,Qi Qin,Zhen Li,Dongyang Liu,Shicheng Li,Victor Shea-Jay Huang,Yupeng Zhou,Renrui Zhang,Le Zhuo,Tiancheng Han,Xiaoqing Sun,Siqi Luo,Mengmeng Wang,Bin Fu,Yuewen Cao,Hongsheng Li,Guangtao Zhai,Xiaohong Liu,Yu Qiao,Peng Gao
机构: Shanghai AI Laboratory(上海人工智能实验室); Shanghai Innovation Institute(上海创新研究院); Nanjing University(南京大学); The Chinese University of Hong Kong(香港中文大学); Shanghai Jiao Tong University(上海交通大学); Zhejiang University of Technology(浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech Report, 23 pages, 11 figures, 7 tables

点击查看摘要

Abstract:We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks-including subject-driven generation, image editing, controllable synthesis, and dense prediction-within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at this https URL.
zh

[CV-98] Caching Techniques for Reducing the Communication Cost of Federated Learning in IoT Environments

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因设备间频繁传输模型更新而导致的通信开销问题,尤其是在资源受限的边缘物联网(Edge IoT)环境中。解决方案的关键在于引入三种缓存策略(FIFO、LRU 和基于优先级的缓存机制),通过智能筛选并仅转发具有显著差异的模型更新,从而减少不必要的通信传输,在保持模型精度的前提下显著降低带宽消耗。实验表明,该方法在CIFAR-10和医疗数据集上均实现了通信量下降且精度损失最小化,有效提升了联邦学习在智慧城市、医疗等低延迟敏感场景中的可扩展性和实用性。
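
三种缓存策略中,LRU 与"仅转发显著更新"的组合可用如下草图示意(阈值、容量均为假设值,显著性此处简化为与上次缓存更新的 L2 距离):

```python
from collections import OrderedDict
import numpy as np

class LRUUpdateCache:
    """客户端侧缓存示意:只有与缓存中上一版本差异足够大的更新才上传。"""
    def __init__(self, capacity=8, threshold=0.05):
        self.cache = OrderedDict()
        self.capacity, self.threshold = capacity, threshold

    def should_upload(self, client_id, update):
        prev = self.cache.get(client_id)
        significant = prev is None or np.linalg.norm(update - prev) > self.threshold
        self.cache[client_id] = update
        self.cache.move_to_end(client_id)      # 标记为最近使用
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # 淘汰最久未用条目
        return significant

cache = LRUUpdateCache()
u = np.zeros(10)
print(cache.should_upload("client-1", u))          # True:首次必传
print(cache.should_upload("client-1", u + 0.001))  # False:变化不显著,省带宽
```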

链接: https://arxiv.org/abs/2507.17772
作者: Ahmad Alhonainy, Praveen Rao (University of Missouri, USA)
机构: The University of Missouri (密苏里大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Journal

点击查看摘要

Abstract:Federated Learning (FL) allows multiple distributed devices to jointly train a shared model without centralizing data, but communication cost remains a major bottleneck, especially in resource-constrained environments. This paper introduces caching strategies - FIFO, LRU, and Priority-Based - to reduce unnecessary model update transmissions. By selectively forwarding significant updates, our approach lowers bandwidth usage while maintaining model accuracy. Experiments on CIFAR-10 and medical datasets show reduced communication with minimal accuracy loss. Results confirm that intelligent caching improves scalability, memory efficiency, and supports reliable FL in edge IoT networks, making it practical for deployment in smart cities, healthcare, and other latency-sensitive applications.
zh

[CV-99] Enhancing Quantization-Aware Training on Edge Devices via Relative Entropy Coreset Selection and Cascaded Layer Correction

【速读】:该论文旨在解决边缘设备上低比特量化模型(low-bit quantized models)在数据隐私限制下,因使用小规模数据集进行量化感知训练(Quantization-Aware Training, QAT)时性能显著下降的问题。现有方法在仅使用10%甚至更少数据时难以有效消除量化误差,导致精度损失严重。解决方案的关键在于提出QuaRC框架,其核心创新包括:(1)在coreset选择阶段引入“相对熵评分”(Relative Entropy Score),筛选出最能反映模型量化误差的代表性子集;(2)在训练阶段采用级联层校正策略(Cascaded Layer Correction),逐层对齐量化模型与全精度模型的中间层输出,从而有效抑制量化误差累积。实验表明,在仅用1% ImageNet-1K数据的情况下,QuaRC将ResNet-18的2-bit量化模型Top-1准确率提升5.72%,优于当前最优方法。
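
其中"相对熵评分"的核心是度量全精度与量化模型输出分布的差异,可用 KL 散度按样本打分来示意(简化草图,假设两模型均为分类器;论文的具体打分定义可能不同):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def relative_entropy_score(fp_model, q_model, x):
    """逐样本计算 KL(P_fp || P_q),作为样本暴露量化误差的程度。"""
    log_p = F.log_softmax(fp_model(x), dim=1)  # 全精度模型分布
    log_q = F.log_softmax(q_model(x), dim=1)   # 量化模型分布
    # log_target=True 时,kl_div(input=log_q, target=log_p) 即 KL(P || Q)
    return F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(dim=1)
```

对整个候选集打分后按降序取前若干样本,即可构成用于 QAT 的 coreset。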

链接: https://arxiv.org/abs/2507.17768
作者: Yujia Tong,Jingling Yuan,Chuang Hu
机构: Hubei Key Laboratory of Trasportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Hubei 430072, China; School of Computer Science, Wuhan University, Hubei 430072, China
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the development of mobile and edge computing, the demand for low-bit quantized models on edge devices is increasing to achieve efficient deployment. To enhance the performance, it is often necessary to retrain the quantized models using edge data. However, due to privacy concerns, certain sensitive data can only be processed on edge devices. Therefore, employing Quantization-Aware Training (QAT) on edge devices has become an effective solution. Nevertheless, traditional QAT relies on the complete dataset for training, which incurs a huge computational cost. Coreset selection techniques can mitigate this issue by training on the most representative subsets. However, existing methods struggle to eliminate quantization errors in the model when using small-scale datasets (e.g., only 10% of the data), leading to significant performance degradation. To address these issues, we propose QuaRC, a QAT framework with coresets on edge devices, which consists of two main phases: In the coreset selection phase, QuaRC introduces the "Relative Entropy Score" to identify the subsets that most effectively capture the model's quantization errors. During the training phase, QuaRC employs the Cascaded Layer Correction strategy to align the intermediate layer outputs of the quantized model with those of the full-precision model, thereby effectively reducing the quantization errors in the intermediate layers. Experimental results demonstrate the effectiveness of our approach. For instance, when quantizing ResNet-18 to 2-bit using a 1% data subset, QuaRC achieves a 5.72% improvement in Top-1 accuracy on the ImageNet-1K dataset compared to state-of-the-art techniques.
zh

[CV-100] Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization ACM-MM’25

【速读】:该论文旨在解决动态面部表情识别(Dynamic Facial Expression Recognition, DFER)在多源数据和个体表达差异导致的样本异质性下性能下降的问题。解决方案的关键在于提出一种异质性感知的分布框架(Heterogeneity-aware Distributional Framework, HDF),其核心创新包括两个即插即用模块:一是时间-频率分布注意力模块(Time-Frequency Distributional Attention Module, DAM),通过双分支注意力机制同时建模时序一致性与频域鲁棒性,提升对序列不一致性和视觉风格变化的容忍度;二是分布感知缩放模块(Distribution-aware Scaling Module, DSM),基于梯度敏感性和信息瓶颈原理,动态平衡分类损失与对比损失,缓解难样本引起的优化失衡,从而实现更稳定且判别力更强的表示学习。

链接: https://arxiv.org/abs/2507.15765
作者: Feng-Qi Cui,Anyang Tong,Jinyang Huang,Jie Zhang,Dan Guo,Zhi Liu,Meng Wang
机构: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学); IHPC and CFAR, Agency for Science, Technology and Research (新加坡科技研究局); The University of Electro-Communications (电波通信大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ACM MM’25

点击查看摘要

Abstract:Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at this https URL.

[CV-101] DiagR1: A Vision-Language Model Trained via Reinforcement Learning for Digestive Pathology Diagnosis

【速读】:该论文旨在解决当前用于胃肠道病理学分析的多模态大模型在数据质量和推理透明度方面的局限性:公共数据集普遍存在噪声和标注不完整问题,导致视觉语言模型在生成诊断文本时易出现事实性幻觉;同时,缺乏明确的中间推理链使得输出难以审计,在临床实践中可信度不足。解决方案的关键在于两个方面:一是构建一个大规模胃肠道病理数据集,包含显微描述与诊断结论,并提出一种融合病变分类和解剖部位信息的提示增强策略(prompt augmentation strategy),以引导模型更好地捕捉图像特异性特征并保持生成内容的语义一致性;二是采用结合监督微调与组相对策略优化(Group Relative Policy Optimization, GRPO)的后训练流程,提升推理质量与输出结构完整性。实验证明该方法在生成质量、结构完整性和临床相关性上显著优于现有开源及专有基线模型。

链接: https://arxiv.org/abs/2507.18433
作者: Minxi Ouyang,Lianghui Zhu,Yaqing Bao,Qiang Huang,Jingli Ouyang,Tian Guan,Xitong Ling,Jiawen Li,Song Duan,Wenbin Dai,Li Zheng,Xuemei Zhang,Yonghong He
机构: Shenzhen International Graduate School, Tsinghua University, Beijing, China(清华大学深圳国际研究生院); Department of Pathology, Liuzhou People’s Hospital Affiliated to Guangxi Medical University, Liuzhou, Guangxi, China(广西医科大学附属柳州人民医院病理科); Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang, Liaoning Province, P.R. China(中国医科大学基础医学院免疫学系); Greater Bay Area Center for Medical Device Evaluation and Inspection.NMPA, Shenzhen, Guangdong Province,P.R. China(粤港澳大湾区医疗器械检验检测中心国家药品监督管理局); Shenzhen Shengqiang Technology Co., Ltd., Shenzhen, Guangdong, China(深圳市盛强科技有限公司); Department of Pathology, Chongqing University Affiliated Three Gorges Hospital, Chongqing, China(重庆大学附属三峡医院病理科)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large models have shown great potential in automating pathology image analysis. However, current multimodal models for gastrointestinal pathology are constrained by both data quality and reasoning transparency: pervasive noise and incomplete annotations in public datasets predispose vision language models to factual hallucinations when generating diagnostic text, while the absence of explicit intermediate reasoning chains renders the outputs difficult to audit and thus less trustworthy in clinical practice. To address these issues, we construct a large scale gastrointestinal pathology dataset containing both microscopic descriptions and diagnostic conclusions, and propose a prompt augmentation strategy that incorporates lesion classification and anatomical site information. This design guides the model to better capture image specific features and maintain semantic consistency in generation. Furthermore, we employ a post training pipeline that combines supervised fine tuning with Group Relative Policy Optimization (GRPO) to improve reasoning quality and output structure. Experimental results on real world pathology report generation tasks demonstrate that our approach significantly outperforms state of the art open source and proprietary baselines in terms of generation quality, structural completeness, and clinical relevance. Our solution outperforms state of the art models with 18.7% higher clinical relevance, 32.4% improved structural completeness, and 41.2% fewer diagnostic errors, demonstrating superior accuracy and clinical utility compared to existing solutions.
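GRPO 的核心是"组相对"优势估计:对同一提示采样一组回复,用组内奖励的均值与标准差对各回复的奖励做归一化。以下为一个最小示意:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards of G responses sampled for the same prompt.
    Group-relative advantage: normalize by the group's own mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. grpo_advantages(torch.tensor([0.2, 0.9, 0.4, 0.9]))
# above-average responses get positive advantages, below-average get negative
```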

[CV-102] UniSegDiff: Boosting Unified Lesion Segmentation via a Staged Diffusion Model MICCAI2025

【速读】:该论文旨在解决扩散模型(Diffusion Probabilistic Model, DPM)在病变分割任务中因训练与推理策略导致的时间步(timestep)注意力分布不均的问题,从而引发训练时间延长和分割性能不佳。其解决方案的关键在于提出一种名为UniSegDiff的新颖扩散模型框架,通过分阶段训练与推理机制,在不同训练阶段动态调整预测目标,强制模型在所有时间步保持高注意力,并借助预训练的特征提取网络实现多模态、多器官的统一病变分割。

链接: https://arxiv.org/abs/2507.18362
作者: Yilong Hu,Shijie Chang,Lihe Zhang,Feng Tian,Weibing Sun,Huchuan Lu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI2025

点击查看摘要

Abstract:The Diffusion Probabilistic Model (DPM) has demonstrated remarkable performance across a variety of generative tasks. The inherent randomness in diffusion models helps address issues such as blurring at the edges of medical images and labels, positioning Diffusion Probabilistic Models (DPMs) as a promising approach for lesion segmentation. However, we find that the current training and inference strategies of diffusion models result in an uneven distribution of attention across different timesteps, leading to longer training times and suboptimal solutions. To this end, we propose UniSegDiff, a novel diffusion model framework designed to address lesion segmentation in a unified manner across multiple modalities and organs. This framework introduces a staged training and inference approach, dynamically adjusting the prediction targets at different stages, forcing the model to maintain high attention across all timesteps, and achieves unified lesion segmentation through pre-training the feature extraction network for segmentation. We evaluate performance on six different organs across various imaging modalities. Comprehensive experimental results demonstrate that UniSegDiff significantly outperforms previous state-of-the-art (SOTA) approaches. The code is available at this https URL.
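摘要提到分阶段训练并在不同阶段动态切换预测目标。下面是一个高度简化的示意(阶段边界与目标切换方式均为本文假设,论文实际的分阶段方案可能不同):

```python
import torch

def staged_target(x0, noise, t, boundary=500):
    """Toy stage-dependent targets: noisy early timesteps (t >= boundary)
    regress the added noise, late refinement timesteps regress x0.
    x0, noise: (B, C, H, W); t: (B,) integer timesteps."""
    use_noise = (t >= boundary).view(-1, 1, 1, 1)
    return torch.where(use_noise, noise, x0)
```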

[CV-103] CM-Tongue: A Standardized Tongue Image Dataset with Pathological Annotations for AI-Assisted TCM Diagnosis

【速读】:该论文旨在解决传统中医舌诊(TCM tongue diagnosis)在临床应用中因主观判断和成像协议不一致导致的标准化难题,以及AI发展所面临的高质量、大规模标注数据集匮乏的问题。解决方案的关键在于构建并发布首个专用于AI驱动的中医舌诊研究的数据集,该数据集包含6,719张在标准化条件下采集的高清舌象图像,并由持证中医师对每张图像进行平均2.54个病理症状类别的标注(共20类),支持COCO、TXT、XML等多种格式,同时通过九种主流深度学习模型(YOLOv5/v7/v8变体、SSD、MobileNetV2)进行基准测试,验证其在AI开发中的实用性,从而为中医舌诊的智能化与标准化提供可靠的数据基础。

链接: https://arxiv.org/abs/2507.18288
作者: Xuebo Jin,Longfei Gao,Anshuo Tong,Zhengyang Chen,Jianlei Kong,Ning Sun,Huijun Ma,Qiang Wang,Yuting Bai,Tingli Su
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 11 figures, 2 Tables

点击查看摘要

Abstract:Traditional Chinese medicine (TCM) tongue diagnosis, while clinically valuable, faces standardization challenges due to subjective interpretation and inconsistent imaging protocols, compounded by the lack of large-scale, annotated datasets for AI development. To address this gap, we present the first specialized dataset for AI-driven TCM tongue diagnosis, comprising 6,719 high-quality images captured under standardized conditions and annotated with 20 pathological symptom categories (averaging 2.54 clinically validated labels per image, all verified by licensed TCM practitioners). The dataset supports multiple annotation formats (COCO, TXT, XML) for broad usability and has been benchmarked using nine deep learning models (YOLOv5/v7/v8 variants, SSD, and MobileNetV2) to demonstrate its utility for AI development. This resource provides a critical foundation for advancing reliable computational tools in TCM, bridging the data shortage that has hindered progress in the field, and facilitating the integration of AI into both research and clinical practice through standardized, high-quality diagnostic data.
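该数据集同时提供 COCO、TXT、XML 三种标注格式。下面给出一个把 COCO 边界框转换为 YOLO 风格 TXT 的常见做法示意(字段名遵循 COCO 规范;类别编号是否需要重映射取决于具体数据集,此处仅作假设):

```python
import json
import os

def coco_to_yolo_txt(coco_json, out_dir):
    """Convert COCO bboxes to YOLO-style TXT (class cx cy w h, normalized).
    Assumes category ids are already zero-based; remap if necessary."""
    with open(coco_json, encoding="utf-8") as f:
        data = json.load(f)
    imgs = {im["id"]: im for im in data["images"]}
    per_image = {}
    for ann in data["annotations"]:
        im = imgs[ann["image_id"]]
        x, y, w, h = ann["bbox"]                      # COCO: top-left + size
        cx, cy = (x + w / 2) / im["width"], (y + h / 2) / im["height"]
        line = (f'{ann["category_id"]} {cx:.6f} {cy:.6f} '
                f'{w / im["width"]:.6f} {h / im["height"]:.6f}')
        per_image.setdefault(im["file_name"], []).append(line)
    os.makedirs(out_dir, exist_ok=True)
    for name, lines in per_image.items():
        stem = os.path.splitext(os.path.basename(name))[0]
        with open(os.path.join(out_dir, stem + ".txt"), "w") as f:
            f.write("\n".join(lines))
```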

[CV-104] Deep Learning for Glioblastoma Morpho-pathological Features Identification: A BraTS-Pathology Challenge Solution MICCAI2024

【速读】:该论文旨在解决胶质母细胞瘤(Glioblastoma)因分子和病理特征高度异质性而导致的诊断难题,其核心挑战在于如何准确识别并评估这种异质性以指导个体化治疗并改善患者预后。解决方案的关键在于利用深度学习方法对 BraTS-Path 数据集进行建模,具体通过微调一个预训练模型来提升分类性能;尽管在验证集上整体指标(如准确率、召回率和 F1 分数均为 0.392)表现不佳,但模型展现出优异的特异性(0.898),表明其在正确识别阴性样本方面具有较强能力,且最终在测试阶段取得第二名成绩,验证了该策略在特定场景下的有效性。

链接: https://arxiv.org/abs/2507.18133
作者: Juexin Zhang,Ying Weng,Ke Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2024 conference

点击查看摘要

Abstract:Glioblastoma, a highly aggressive brain tumor with diverse molecular and pathological features, poses a diagnostic challenge due to its heterogeneity. Accurate diagnosis and assessment of this heterogeneity are essential for choosing the right treatment and improving patient outcomes. Traditional methods rely on identifying specific features in tissue samples, but deep learning offers a promising approach for improved glioblastoma diagnosis. In this paper, we present our approach to the BraTS-Path Challenge 2024. We leverage a pre-trained model and fine-tune it on the BraTS-Path training dataset. Our model demonstrates poor performance on the challenging BraTS-Path validation set, as rigorously assessed by the Synapse online platform. The model achieves an accuracy of 0.392229, a recall of 0.392229, and an F1-score of 0.392229, indicating a consistent but limited ability to correctly identify instances under the target condition. Notably, our model exhibits a high specificity of 0.898704, showing a strong capacity to correctly classify negative cases. Moreover, a Matthews Correlation Coefficient (MCC) of 0.255267 is calculated, signifying a limited positive correlation between predicted and actual values and a modest overall predictive power. Our solution also achieved second place during the testing phase.
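摘要中报告的准确率、召回率、F1、特异性与 MCC 可用 scikit-learn 按如下方式复算(数据为示意):

```python
from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             matthews_corrcoef, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # toy labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)    # use average="macro" for multi-class
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)          # true-negative rate (特异性)
print(acc, rec, f1, mcc, specificity)
```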

[CV-105] U-Net Based Healthy 3D Brain Tissue Inpainting MICCAI2024

【速读】:该论文旨在解决脑部磁共振成像(MRI)中健康脑组织的局部重建问题,具体任务为“ASNR-MICCAI BraTS 局部组织生成通过图像修复(Inpainting)”。其核心挑战在于从部分缺失或损坏的脑部 MRI 图像中准确恢复出完整的健康脑组织结构。解决方案的关键在于提出一种基于 U-Net 架构的生成式模型,并结合随机掩码数据增强策略,在 BraTS-Local-Inpainting 数据集上进行训练,从而显著提升模型对不同输入场景的泛化能力和鲁棒性。实验结果表明,该方法在 SSIM、PSNR 和 MSE 等指标上均表现优异且稳定性高,最终在相关竞赛中获得第一名。

链接: https://arxiv.org/abs/2507.18126
作者: Juexin Zhang,Ying Weng,Ke Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2024 conference. Included 7 pages, 2 figures

点击查看摘要

Abstract:This paper introduces a novel approach to synthesize healthy 3D brain tissue from masked input images, specifically focusing on the task of 'ASNR-MICCAI BraTS Local Synthesis of Tissue via Inpainting'. Our proposed method employs a U-Net-based architecture, which is designed to effectively reconstruct the missing or corrupted regions of brain MRI scans. To enhance our model's generalization capabilities and robustness, we implement a comprehensive data augmentation strategy that involves randomly masking healthy images during training. Our model is trained on the BraTS-Local-Inpainting dataset and demonstrates exceptional performance in recovering healthy brain tissue. The evaluation metrics employed, including Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), consistently yield impressive results. On the BraTS-Local-Inpainting validation set, our model achieved an SSIM score of 0.841, a PSNR score of 23.257, and an MSE score of 0.007. Notably, these evaluation metrics exhibit relatively low standard deviations, i.e., 0.103 for SSIM, 4.213 for PSNR, and 0.007 for MSE, which indicates our model's reliability and consistency across various input scenarios. Our method also secured first place in the challenge.
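该方法的关键增强手段是训练时对健康图像随机遮挡。下面是一个 3D 随机盒状掩码的示意实现(掩码形状与最大比例为假设,且假设各维度不小于 4):

```python
import random
import torch

def random_box_mask(vol, max_frac=0.3):
    """vol: (D, H, W) tensor; zero out a random box to mimic training-time
    masking of healthy volumes for inpainting. Assumes each dim >= 4."""
    D, H, W = vol.shape
    d, h, w = (random.randint(1, max(1, int(s * max_frac))) for s in (D, H, W))
    z, y, x = (random.randint(0, s - k) for s, k in ((D, d), (H, h), (W, w)))
    mask = torch.zeros_like(vol, dtype=torch.bool)
    mask[z:z+d, y:y+h, x:x+w] = True
    return vol.masked_fill(mask, 0.0), mask
```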

[CV-106] Parameter-Efficient Fine-Tuning of 3D DDPM for MRI Image Generation Using Tensor Networks

【速读】:该论文旨在解决三维(3D)U-Net结构的去噪扩散概率模型(DDPM)在磁共振成像(MRI)图像生成任务中参数效率微调(Parameter-Efficient Fine-Tuning, PEFT)的问题,尤其是针对3D卷积操作的低参数表示能力不足的挑战。其解决方案的关键在于提出了一种名为张量体积算子(Tensor Volumetric Operator, TenVOO)的新方法,该方法基于张量网络建模,将3D卷积核压缩为低维张量表示,在保持复杂空间依赖性的同时显著减少可训练参数量——实验表明仅需原模型0.3%的参数即可实现优于现有方法的多尺度结构相似性指数(MS-SSIM)性能。

链接: https://arxiv.org/abs/2507.18112
作者: Binghua Li,Ziqing Chang,Tong Liang,Chao Li,Toshihisa Tanaka,Shigeki Aoki,Qibin Zhao,Zhe Sun
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address the challenge of parameter-efficient fine-tuning (PEFT) for three-dimensional (3D) U-Net-based denoising diffusion probabilistic models (DDPMs) in magnetic resonance imaging (MRI) image generation. Despite its practical significance, research on parameter-efficient representations of 3D convolution operations remains limited. To bridge this gap, we propose Tensor Volumetric Operator (TenVOO), a novel PEFT method specifically designed for fine-tuning DDPMs with 3D convolutional backbones. Leveraging tensor network modeling, TenVOO represents 3D convolution kernels with lower-dimensional tensors, effectively capturing complex spatial dependencies during fine-tuning with few parameters. We evaluate TenVOO on three downstream brain MRI datasets-ADNI, PPMI, and BraTS2021-by fine-tuning a DDPM pretrained on 59,830 T1-weighted brain MRI scans from the UK Biobank. Our results demonstrate that TenVOO achieves state-of-the-art performance in multi-scale structural similarity index measure (MS-SSIM), outperforming existing approaches in capturing spatial dependencies while requiring only 0.3% of the trainable parameters of the original model. Our code is available at: this https URL
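为说明"用低维张量表示 3D 卷积核"的思想,下面给出一个 CP 分解风格的低秩 3D 卷积参数化草图(仅示意参数量如何从 cout*cin*k^3 降为 rank*(cout+cin+3k);TenVOO 实际采用的张量网络结构可能不同):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPConv3d(nn.Module):
    """CP-style low-rank parameterization of a 3D conv kernel:
    rank*(cout + cin + 3k) parameters instead of cout*cin*k**3."""
    def __init__(self, cin, cout, k=3, rank=8):
        super().__init__()
        init = lambda *s: nn.Parameter(torch.randn(*s) * 0.02)
        self.a, self.b = init(rank, cout), init(rank, cin)
        self.kd, self.kh, self.kw = init(rank, k), init(rank, k), init(rank, k)

    def forward(self, x):  # x: (B, cin, D, H, W)
        # reconstruct the full (cout, cin, k, k, k) kernel from rank-1 factors
        kernel = torch.einsum('ro,ri,rd,rh,rw->oidhw',
                              self.a, self.b, self.kd, self.kh, self.kw)
        return F.conv3d(x, kernel, padding=kernel.shape[-1] // 2)
```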

[CV-107] Direct Dual-Energy CT Material Decomposition using Model-based Denoising Diffusion Model

【速读】:该论文旨在解决双能X射线计算机断层成像(Dual-energy X-ray Computed Tomography, DECT)中材料分解的精度问题,特别是传统方法在图像域进行后处理时未考虑束硬化效应(beam-hardening effect),导致定量分解结果次优的问题。解决方案的关键在于提出一种基于模型的扩散神经网络方法——Dual-Energy Decomposition Model-based Diffusion (DEcomp-MoD),其核心创新是将DECT物理模型嵌入深度学习训练损失函数,并结合基于分数的去噪扩散先验(score-based denoising diffusion prior)来建模材料图像域的分布;此外,推理阶段直接以投影数据(sinogram)为输入,通过条件扩散模型生成一致性的材料图像,从而实现从原始数据到定量材料图像的端到端映射,显著提升分解准确性与临床适用性。

链接: https://arxiv.org/abs/2507.18012
作者: Hang Xu,Alexandre Bousse,Alessandro Perelli
机构: University of Dundee (邓迪大学); Longgang District Maternity & Child Healthcare Hospital of Shenzhen City (深圳市龙岗区妇幼保健院); Longgang Maternity and Child Institute of Shantou University Medical College (汕头大学医学院龙岗妇幼研究所); University Brest (布雷斯特大学); Inserm (法国国家健康与医学研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 13 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Dual-energy X-ray Computed Tomography (DECT) constitutes an advanced technology which enables automatic decomposition of materials in clinical images without manual segmentation using the dependency of the X-ray linear attenuation with energy. However, most methods perform material decomposition in the image domain as a post-processing step after reconstruction, but this procedure does not account for the beam-hardening effect and leads to sub-optimal results. In this work, we propose a deep learning procedure called Dual-Energy Decomposition Model-based Diffusion (DEcomp-MoD) for quantitative material decomposition which directly converts the DECT projection data into material images. The algorithm is based on incorporating the knowledge of the spectral DECT model into the deep learning training loss and combining a score-based denoising diffusion learned prior in the material image domain. Importantly, the inference optimization loss directly takes the sinogram as input and converts it into material images through a model-based conditional diffusion model, which guarantees consistency of the results. We evaluate the performance of the proposed DEcomp-MoD method with both quantitative and qualitative estimation on synthetic DECT sinograms from the low-dose AAPM dataset. Finally, we show that DEcomp-MoD outperforms state-of-the-art unsupervised score-based models and supervised deep learning networks, with the potential to be deployed for clinical diagnosis.
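DEcomp-MoD 以条件扩散的方式把投影域数据一致性注入采样过程,与扩散后验采样(DPS)思路相近。以下是单步引导更新的高度简化示意(省略祖先采样噪声与步长调度;假设 score_model 直接输出 x0 估计、A 为可微的前向投影算子,均为本文设定):

```python
import torch

def dps_guidance(x_t, t, score_model, A, y, step=1.0):
    """One simplified posterior-sampling guidance step: estimate x0, then
    move x_t against the gradient of the sinogram data-fidelity term
    ||A(x0_hat) - y||^2. Noise injection and schedules are omitted."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = score_model(x_t, t)           # assumed x0-prediction network
    loss = ((A(x0_hat) - y) ** 2).sum()    # A: differentiable projector
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - step * grad).detach(), x0_hat.detach()
```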

[CV-108] Benchmarking of Deep Learning Methods for Generic MRI Multi-OrganAbdominal Segmentation

【速读】:该论文旨在解决腹部磁共振成像(MRI)分割工具在准确性与泛化能力方面的局限性问题,尤其是由于MRI信号变异性大、标注数据稀缺导致现有方法训练集有限,进而影响其跨设备、多序列和多样本条件下的适用性。解决方案的关键在于:首先,系统性地基准测试三种先进的开源MRI腹部分割模型(MRSegmentator、MRISegmentator-Abdomen 和 TotalSegmentator MRI),并引入一种基于SynthSeg框架的新型模型ABDSynth——该模型仅使用广泛可用的CT分割数据进行训练(无需真实MRI图像),从而显著降低标注成本;其次,通过三个未参与训练的公共数据集全面评估各模型在不同厂商设备、五种MRI序列及多种扫描参数下的性能表现,结果表明MRSegmentator在精度和泛化能力上最优,而ABDSynth虽略逊一筹但具备更低的数据依赖性,为标注预算受限场景提供可行替代方案。

链接: https://arxiv.org/abs/2507.17971
作者: Deepa Krishnaswamy,Cosmin Ciausu,Steve Pieper,Ron Kikinis,Benjamin Billot,Andrey Fedorov
机构: Brigham and Women’s Hospital (布莱根妇女医院); Isomics (艾索米克斯); Inria (法国国家信息与自动化研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in deep learning have led to robust automated tools for segmentation of abdominal computed tomography (CT). Meanwhile, segmentation of magnetic resonance imaging (MRI) is substantially more challenging due to the inherent signal variability and the increased effort required for annotating training datasets. Hence, existing approaches are trained on limited sets of MRI sequences, which might limit their generalizability. To characterize the landscape of MRI abdominal segmentation tools, we present here a comprehensive benchmarking of the three state-of-the-art and open-source models: MRSegmentator, MRISegmentator-Abdomen, and TotalSegmentator MRI. Since these models are trained using labor-intensive manual annotation cycles, we also introduce and evaluate ABDSynth, a SynthSeg-based model purely trained on widely available CT segmentations (no real images). More generally, we assess accuracy and generalizability by leveraging three public datasets (not seen by any of the evaluated methods during their training), which span all major manufacturers, five MRI sequences, as well as a variety of subject conditions, voxel resolutions, and fields-of-view. Our results reveal that MRSegmentator achieves the best performance and is most generalizable. In contrast, ABDSynth yields slightly less accurate results, but its relaxed requirements in training data make it an alternative when the annotation budget is limited. The evaluation code and datasets are given for future benchmarking at this https URL, along with inference code and weights for ABDSynth.
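多器官分割基准通常以逐标签 Dice 系数为核心指标,计算方式示意如下:

```python
import numpy as np

def dice(pred, gt, label):
    """Per-label Dice on integer label maps of identical shape."""
    p, g = pred == label, gt == label
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else np.nan

# e.g. average dice(pred, gt, k) over all organ labels k for one scan,
# then over all scans, to compare the benchmarked models
```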

[CV-109] Hierarchical Diffusion Framework for Pseudo-Healthy Brain MRI Inpainting with Enhanced 3D Consistency

【速读】:该论文旨在解决病理脑部磁共振成像(MRI)分析前的伪健康图像修复(pseudo-healthy image inpainting)问题,特别是现有方法在体积一致性与数据效率之间的权衡难题。当前主流的二维(2D)切片级修复模型虽能保证平面内高保真度,但跨切片独立处理导致体素间不连续;而全三维(3D)模型虽可改善体积一致性,却因模型容量大、需大量训练数据,在医疗场景中难以实用。其解决方案的关键在于提出一种分层扩散框架(hierarchical diffusion framework),通过两个相互垂直的粗到精(coarse-to-fine)2D阶段实现:首先使用轴向(axial)扩散模型生成全局一致的粗粒度修复结果,再利用冠状面(coronal)扩散模型细化解剖细节;结合自适应重采样策略,有效平衡了数据效率与体积一致性,从而在真实性和结构连贯性上优于现有最优方法。

链接: https://arxiv.org/abs/2507.17911
作者: Dou Hoon Kwark,Shirui Luo,Xiyue Zhu,Yudu Li,Zhi-Pei Liang,Volodymyr Kindratenko
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Pseudo-healthy image inpainting is an essential preprocessing step for analyzing pathological brain MRI scans. Most current inpainting methods favor slice-wise 2D models for their high in-plane fidelity, but their independence across slices produces discontinuities in the volume. Fully 3D models alleviate this issue, but their high model capacity demands extensive training data for reliable, high-fidelity synthesis – often impractical in medical settings. We address these limitations with a hierarchical diffusion framework by replacing direct 3D modeling with two perpendicular coarse-to-fine 2D stages. An axial diffusion model first yields a coarse, globally consistent inpainting; a coronal diffusion model then refines anatomical details. By combining perpendicular spatial views with adaptive resampling, our method balances data efficiency and volumetric consistency. Our experiments show our approach outperforms state-of-the-art baselines in both realism and volumetric consistency, making it a promising solution for pseudo-healthy image inpainting. Code is available at this https URL.
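层级式流程可概括为两次互相垂直的逐切片推理:先用轴向模型得到全局一致的粗修复,再用冠状面模型精修。下面是一个编排层面的示意(二维模型的输入输出接口为本文假设):

```python
import torch

@torch.no_grad()
def hierarchical_inpaint(vol, mask, axial_model, coronal_model):
    """vol, mask: (D, H, W). Stage 1: axial slice-wise coarse inpainting;
    stage 2: coronal slice-wise refinement. Models are assumed to map a
    (1, h, w) image plus mask to a (1, h, w) output."""
    coarse = vol.clone()
    for z in range(vol.shape[0]):                       # axial pass
        coarse[z] = axial_model(coarse[z][None], mask[z][None])[0]
    fine = coarse.clone()
    for y in range(vol.shape[1]):                       # coronal pass
        fine[:, y] = coronal_model(fine[:, y][None], mask[:, y][None])[0]
    return torch.where(mask.bool(), fine, vol)          # keep healthy voxels
```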

[CV-110] Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)

【速读】:该论文旨在解决如何准确预测自然刺激下分布式皮层响应的问题,核心挑战在于整合视觉、听觉和语义信息的时间动态过程。解决方案的关键在于提出一种分层多模态循环集成模型(hierarchical multimodal recurrent ensemble),该模型利用预训练的视频、音频和语言嵌入,通过模态特异性的双向循环神经网络(bidirectional RNNs)编码时间动态,其隐藏状态经融合后输入第二层递归结构,并由轻量级个体特定头输出1000个皮层区域的fMRI时序响应。训练采用复合均方误差-相关性损失函数与渐进式课程学习策略,逐步从早期感觉区转向晚期联合区,最终在Algonauts 2025挑战中取得第三名成绩(总体Pearson相关系数r = 0.2094),并在单个皮层区域达到最高峰值相关性(平均r = 0.63),尤其在最具挑战性的被试(Subject 5)上表现显著提升。

链接: https://arxiv.org/abs/2507.17897
作者: Semih Eren,Deniz Kucukahmetler,Nico Scherf
机构: Max Planck Institute for Human Cognitive and Brain Sciences (马克斯·普朗克人类认知与脑科学研究所); TU Dresden (德累斯顿工业大学); School for Embedded and Composite AI (嵌入式与复合人工智能学院); Center for Scalable Data Analytics & AI (可扩展数据解析与人工智能中心)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 2 figures, 1 table. Invited report, CCN 2025 Algonauts Project session (3rd-place team). Code: this https URL

点击查看摘要

Abstract:Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
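训练使用"MSE + 相关性"复合损失。下面按 parcel 计算 Pearson 相关并与 MSE 加权组合(权重 alpha 为假设值):

```python
import torch

def mse_corr_loss(pred, target, alpha=0.5, eps=1e-8):
    """pred, target: (B, T, P) predicted / measured fMRI parcel time series.
    Composite loss: alpha * MSE + (1 - alpha) * (1 - mean Pearson r)."""
    mse = ((pred - target) ** 2).mean()
    p = pred - pred.mean(dim=1, keepdim=True)
    t = target - target.mean(dim=1, keepdim=True)
    r = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)  # (B, P)
    return alpha * mse + (1 - alpha) * (1 - r.mean())
```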

[CV-111] Integrating Feature Selection and Machine Learning for Nitrogen Assessment in Grapevine Leaves using In-Field Hyperspectral Imaging

【速读】:该论文旨在解决葡萄园中氮素(Nitrogen, N)营养管理的精准性问题,尤其是针对土壤氮素在空间和时间上的高度变异性,传统施肥方式难以满足单株植株的实时需求。为实现个体植株水平的氮浓度监测与优化施肥,研究提出基于田间高光谱成像(in-field hyperspectral imaging)结合特征选择与机器学习(Machine Learning, ML)建模的方法。其关键在于通过两种特征选择方法识别对叶片氮浓度响应敏感的光谱波段(主要集中在500–525 nm、650–690 nm、750–800 nm和900–950 nm),并利用梯度提升(Gradient Boosting)和XGBoost模型分别预测叶级和冠层级氮浓度,最终在不同分析层级均取得可接受的预测性能(R²分别为0.57和0.49),验证了该技术路径在葡萄园氮素状态监测中的可行性与鲁棒性。

链接: https://arxiv.org/abs/2507.17869
作者: Atif Bilal Asad,Achyut Paudel,Safal Kshetri,Chenchen Kang,Salik Ram Khanal,Nataliya Shcherbatyuk,Pierre Davadant,R. Paul Schreiner,Santosh Kalauni,Manoj Karkee,Markus Keller
机构: Cornell University (康奈尔大学); Stanford University (斯坦福大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Nitrogen (N) is one of the most crucial nutrients in vineyards, affecting plant growth and subsequent products such as wine and juice. Because soil N has high spatial and temporal variability, it is desirable to accurately estimate the N concentration of grapevine leaves and manage fertilization at the individual plant level to optimally meet plant needs. In this study, we used in-field hyperspectral images with wavelengths ranging from 400 to 1000nm of four different grapevine cultivars collected from distinct vineyards and over two growth stages during two growing seasons to develop models for predicting N concentration at the leaf-level and canopy-level. After image processing, two feature selection methods were employed to identify the optimal set of spectral bands that were responsive to leaf N concentrations. The selected spectral bands were used to train and test two different Machine Learning (ML) models, Gradient Boosting and XGBoost, for predicting nitrogen concentrations. The comparison of selected bands for both leaf-level and canopy-level datasets showed that most of the identified spectral regions were consistent across both feature selection methods and both dataset types (leaf- and canopy-level), particularly in the key regions of 500-525nm, 650-690nm, 750-800nm, and 900-950nm. These findings indicated the robustness of these spectral regions for predicting nitrogen content. The results for N prediction demonstrated that the ML models achieved an R-squared of 0.49 for canopy-level data and 0.57 for leaf-level data, despite using different sets of selected spectral bands for each analysis level. The study demonstrated the potential of in-field hyperspectral imaging, combined with integrated feature selection and ML techniques, for monitoring N status in vineyards.
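"特征选择 + 梯度提升回归"的流程可用 scikit-learn 快速搭建。以下以单变量 F 检验选带作示意(论文实际采用的两种特征选择方法未必与此相同):

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def band_selection_pipeline(X, y, k=30):
    """X: (n_samples, n_bands) reflectance spectra; y: leaf N concentration.
    Select k bands, fit a gradient-boosting regressor, report test R^2."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    sel = SelectKBest(f_regression, k=k).fit(Xtr, ytr)
    model = GradientBoostingRegressor().fit(sel.transform(Xtr), ytr)
    r2 = r2_score(yte, model.predict(sel.transform(Xte)))
    return r2, sel.get_support()   # R^2 and the boolean band mask
```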

[CV-112] owards Robust Foundation Models for Digital Pathology

【速读】:该论文旨在解决病理学基础模型(Foundation Models, FMs)在临床部署中因对非生物技术特征(如手术/内镜操作差异、实验室流程和扫描仪硬件变化)敏感而导致的鲁棒性不足问题,这可能引发严重的诊断错误和临床失误。解决方案的关键在于提出首个系统性的鲁棒性评估框架——PathoROB,包含三个新指标(如鲁棒性指数)和四个覆盖34家医疗中心、28个生物类别的数据集,并通过实验证明了当前所有20个评估的FMs均存在鲁棒性缺陷;进一步表明,采用更鲁棒的FMs或后处理鲁棒化策略可显著降低风险,但尚未完全消除。该研究确立了鲁棒性评估作为病理FMs临床验证的前提,并强调未来FM开发必须将鲁棒性作为核心设计原则。

链接: https://arxiv.org/abs/2507.17845
作者: Jonah Kömen,Edwin D. de Jong,Julius Hense,Hannah Marienwald,Jonas Dippel,Philip Naumann,Eric Marcus,Lukas Ruff,Maximilian Alber,Jonas Teuwen,Frederick Klauschen,Klaus-Robert Müller
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Biomedical Foundation Models (FMs) are rapidly transforming AI-enabled healthcare research and entering clinical validation. However, their susceptibility to learning non-biological technical features – including variations in surgical/endoscopic techniques, laboratory procedures, and scanner hardware – poses risks for clinical deployment. We present the first systematic investigation of pathology FM robustness to non-biological features. Our work (i) introduces measures to quantify FM robustness, (ii) demonstrates the consequences of limited robustness, and (iii) proposes a framework for FM robustification to mitigate these issues. Specifically, we developed PathoROB, a robustness benchmark with three novel metrics, including the robustness index, and four datasets covering 28 biological classes from 34 medical centers. Our experiments reveal robustness deficits across all 20 evaluated FMs, and substantial robustness differences between them. We found that non-robust FM representations can cause major diagnostic downstream errors and clinical blunders that prevent safe clinical adoption. Using more robust FMs and post-hoc robustification considerably reduced (but did not yet eliminate) the risk of such errors. This work establishes that robustness evaluation is essential for validating pathology FMs before clinical adoption and demonstrates that future FM development must integrate robustness as a core design principle. PathoROB provides a blueprint for assessing robustness across biomedical domains, guiding FM improvement efforts towards more robust, representative, and clinically deployable AI systems that prioritize biological information over technical artifacts.

[CV-113] Improving Multislice Electron Ptychography with a Generative Prior

【速读】:该论文旨在解决多切片电子 Ptychography (Multislice Electron Ptychography, MEP) 在重构原子晶体结构时面临的病态逆问题(ill-posed inverse problem),传统迭代算法因计算耗时且重构质量受限而难以满足高分辨率成像需求。其解决方案的关键在于提出 MEP-Diffusion,一种基于大规模晶体结构数据库训练的扩散模型(diffusion model),作为生成先验(generative prior)嵌入现有迭代求解器中,并通过 Diffusion Posterior Sampling (DPS) 实现无缝集成,从而显著提升三维重构体积的质量,在结构相似性(SSIM)指标上较现有方法提升 90.50%。

链接: https://arxiv.org/abs/2507.17800
作者: Christian K. Belardi,Chia-Hao Lee,Yingheng Wang,Justin Lovelace,Kilian Q. Weinberger,David A. Muller,Carla P. Gomes
机构: Cornell University (康奈尔大学)
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: 16 pages, 10 figures, 5 tables

点击查看摘要

Abstract:Multislice electron ptychography (MEP) is an inverse imaging technique that computationally reconstructs the highest-resolution images of atomic crystal structures from diffraction patterns. Available algorithms often solve this inverse problem iteratively but are both time consuming and produce suboptimal solutions due to their ill-posed nature. We develop MEP-Diffusion, a diffusion model trained on a large database of crystal structures specifically for MEP to augment existing iterative solvers. MEP-Diffusion is easily integrated as a generative prior into existing reconstruction methods via Diffusion Posterior Sampling (DPS). We find that this hybrid approach greatly enhances the quality of the reconstructed 3D volumes, achieving a 90.50% improvement in SSIM over existing methods.

[CV-114] Diffusion-Assisted Frequency Attention Model for Whole-body Low-field MRI Reconstruction

【速读】:该论文旨在解决低场强磁共振成像(low-field MRI)中因信噪比(SNR)低而导致的图像重建质量差的问题。其解决方案的关键在于将扩散模型(diffusion models)的生成能力与频域注意力机制(frequency-domain attention)的表征能力相结合,从而在低信噪比条件下显著提升重建性能。实验结果表明,该方法DFAM在多个指标上均优于传统重建算法和近期基于学习的方法,展现出在资源受限或医疗条件欠发达地区的临床应用潜力。

链接: https://arxiv.org/abs/2507.17764
作者: Xin Xie,Yu Guan,Zhuoxu Cui,Dong Liang,Qiegen Liu
机构: 未知
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages,7 figures

点击查看摘要

Abstract:By integrating the generative strengths of diffusion models with the representation capabilities of frequency-domain attention, DFAM effectively enhances reconstruction performance under low-SNR conditions. Experimental results demonstrate that DFAM consistently outperforms both conventional reconstruction algorithms and recent learning-based approaches. These findings highlight the potential of DFAM as a promising solution to advance low-field MRI reconstruction, particularly in resource-constrained or underdeveloped clinical settings.

人工智能

[AI-0] Moving Out: Physically-grounded Human-AI Collaboration

【速读】:该论文旨在解决具身智能体(embodied agents)在物理环境中与人类协作时面临的挑战,特别是如何适应由物理属性和约束(如搬运重物、绕过拐角等)带来的连续状态-动作空间复杂性和动态限制。其核心问题是现有模型在面对多样化人类行为和未见物理条件时的适应性不足。解决方案的关键在于提出一种名为BASS(Behavior Augmentation, Simulation, and Selection)的新方法,通过行为增强、仿真与选择机制来提升智能体对动作结果的理解和多样性,从而显著改善AI-AI及人-AI协作性能。

链接: https://arxiv.org/abs/2507.18623
作者: Xuhui Kang,Sung-Wook Lee,Haolin Liu,Yuyan Wang,Yen-Ling Kuo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 24 pages, 8 figures

点击查看摘要

Abstract:The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration. The project page is available at this https URL.

[AI-1] Approximate SMT Counting Beyond Discrete Domains

【速读】:该论文旨在解决混合SMT(Satisfiability Modulo Theories)公式中模型计数(model counting)的问题,即在包含离散与连续变量的混合逻辑公式中,高效估算投影到离散变量上的解的数量。现有方法如位爆破(bit-blasting)仅适用于纯离散变量,难以处理混合公式的计数需求。其关键解决方案是提出pact,一种基于哈希的近似模型计数器,通过理论保证的哈希技术实现对混合SMT公式的高效估计;pact仅需对投影变量数量的对数级SMT求解器调用次数,并利用优化的哈希函数显著提升性能,在大规模基准测试中相较基线方法实现了数量级的改进(603 vs. 13个实例成功完成)。

链接: https://arxiv.org/abs/2507.18612
作者: Arijit Shaw,Kuldeep S. Meel
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: To be published in the proceedings of Design Automation Conference (DAC) 2025

点击查看摘要

Abstract:Satisfiability Modulo Theory (SMT) solvers have advanced automated reasoning, solving complex formulas across discrete and continuous domains. Recent progress in propositional model counting motivates extending SMT capabilities toward model counting, especially for hybrid SMT formulas. Existing approaches, like bit-blasting, are limited to discrete variables, highlighting the challenge of counting solutions projected onto the discrete domain in hybrid formulas. We introduce pact, an SMT model counter for hybrid formulas that uses hashing-based approximate model counting to estimate solutions with theoretical guarantees. pact makes a logarithmic number of SMT solver calls relative to the projection variables, leveraging optimized hash functions. pact achieves significant performance improvements over baselines on a large suite of benchmarks. In particular, out of 14,202 instances, pact successfully finished on 603 instances, while Baseline could only finish on 13 instances.
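基于哈希的近似模型计数的核心操作,是向公式添加随机 XOR 奇偶约束,把解空间切成大小大致均匀的"单元",再对单元内的解枚举计数并按 2^m 放大。下面用 z3 的 Python 接口给出一个与 pact 实现无关的原理示意:

```python
import random
from z3 import Solver, Bool, Xor, BoolVal, sat

def add_random_xor(solver, proj_vars):
    """Add one random XOR parity constraint over the projection variables."""
    acc = BoolVal(random.random() < 0.5)   # random parity bit
    for v in proj_vars:
        if random.random() < 0.5:          # include each variable w.p. 1/2
            acc = Xor(acc, v)
    solver.add(acc)

# sketch: after adding m XOR constraints, if a cell contains c solutions,
# the total count over the projection variables is estimated as c * 2**m
s = Solver()
xs = [Bool(f"x{i}") for i in range(8)]
for _ in range(3):                         # m = 3 hash constraints
    add_random_xor(s, xs)
print(s.check() == sat)
```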

[AI-2] Proceedings 19th International Workshop on the ACL2 Theorem Prover and Its Applications

【速读】:该论文旨在解决如何有效组织和推广ACL2定理证明系统在学术界与工业界的应用研究,以促进其技术发展与实践落地。解决方案的关键在于构建一个专门面向ACL2用户的高水平技术论坛——ACL2 Workshop系列,为研究人员提供展示其在ACL2定理证明器及其应用领域研究成果的平台,从而推动自动化推理技术的持续创新与工程化应用。

链接: https://arxiv.org/abs/2507.18567
作者: Ruben Gamboa,Panagiotis Manolios
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ACL2 Workshop series is the major technical forum for users of the ACL2 theorem proving system to present research related to the ACL2 theorem prover and its applications. ACL2 is an industrial-strength automated reasoning system, the latest in the Boyer-Moore family of theorem provers. The 2005 ACM Software System Award was awarded to Boyer, Kaufmann, and Moore for their work on ACL2 and the other theorem provers in the Boyer-Moore family.

[AI-3] Beyond Internal Data: Constructing Complete Datasets for Fairness Testing

【速读】:该论文试图解决在缺乏完整包含人口统计学信息的数据集情况下,如何有效评估分类器公平性的问题。这一问题在工业场景中尤为突出,因法律和隐私限制难以获取敏感属性数据,且内部历史数据往往代表性不足。解决方案的关键在于利用多个存在重叠的独立数据集,通过构建包含人口统计学信息的合成数据,准确还原受保护属性与模型特征之间的潜在关系;并通过与真实数据对比验证合成数据的保真度,实证表明基于合成数据计算的公平性指标与真实数据结果一致,从而为公平性测试提供了一种可行替代方案。

链接: https://arxiv.org/abs/2507.18561
作者: Varsha Ramineni,Hossein A. Rahmani,Emine Yilmaz,David Barber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:As AI becomes prevalent in high-risk domains and decision-making, it is essential to test for potential harms and biases. This urgency is reflected by the global emergence of AI regulations that emphasise fairness and adequate testing, with some mandating independent bias audits. However, procuring the necessary data for fairness testing remains a significant challenge. Particularly in industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. Further, internal historical datasets are often insufficiently representative to identify real-world biases. This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible. We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information and accurately reflects the underlying relationships between protected attributes and model features. We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data. This work, therefore, offers a path to overcome real-world data scarcity for fairness testing, enabling independent, model-agnostic evaluation of fairness, and serving as a viable substitute where real data is limited.
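论文验证思路之一是:在合成数据与真实数据上分别计算公平性指标并比较其一致性。以下以人口统计均等差(demographic parity difference)为例:

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Largest gap in positive-prediction rates across demographic groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# compare the metric computed on synthetic vs. real test data;
# close values support the paper's consistency claim
```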

[AI-4] C2G-KD: PCA-Constrained Generator for Data-Free Knowledge Distillation

【速读】:该论文旨在解决数据隐私敏感场景下知识蒸馏(Knowledge Distillation, KD)中缺乏真实训练数据的问题,即如何在不使用任何真实样本的情况下实现高效且高质量的模型压缩与迁移。其解决方案的关键在于提出一种名为C2G-KD的数据无关知识蒸馏框架,该框架通过训练一个类别条件生成器(class-conditional generator),使其生成的合成样本能够激活冻结的教师模型输出。生成过程受几何约束引导——利用主成分分析(PCA)从每类仅需两例真实样本中估计出类特定子空间,并将生成样本限制在此子空间内,从而在不接触真实数据的前提下保留类间拓扑一致性与多样性,最终构建有效的合成训练流水线。

链接: https://arxiv.org/abs/2507.18533
作者: Magnus Bengtsson,Kenneth Östberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:We introduce C2G-KD, a data-free knowledge distillation framework where a class-conditional generator is trained to produce synthetic samples guided by a frozen teacher model and geometric constraints derived from PCA. The generator never observes real training data but instead learns to activate the teacher’s output through a combination of semantic and structural losses. By constraining generated samples to lie within class-specific PCA subspaces estimated from as few as two real examples per class, we preserve topological consistency and diversity. Experiments on MNIST show that even minimal class structure is sufficient to bootstrap useful synthetic training pipelines.
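"把生成样本约束在由极少量真实样本估计的类 PCA 子空间内"可示意如下(每类仅用两个真实样本;函数接口为本文假设):

```python
import numpy as np

def project_to_class_subspace(x, class_examples, n_components=1):
    """Constrain a generated sample to the PCA subspace of one class,
    estimated from as few as two real (flattened) examples."""
    E = np.stack([np.asarray(e, dtype=float).ravel() for e in class_examples])
    mu = E.mean(axis=0)
    _, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
    V = Vt[:n_components]                    # top principal directions
    x = np.asarray(x, dtype=float)
    proj = mu + (x.ravel() - mu) @ V.T @ V   # project onto the subspace
    return proj.reshape(x.shape)
```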

[AI-5] GLANCE: Graph Logic Attention Network with Cluster Enhancement for Heterophilous Graph Representation Learning

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在异质性图(heterophilous graphs)中表现不佳的问题,其核心挑战在于节点间连接的特征或类别差异导致传统GNN的邻域聚合机制失效,且缺乏对高阶结构模式的有效建模。解决方案的关键在于提出GLANCE框架,通过三个核心模块实现:1)逻辑引导层(logic layer)生成可解释的结构化嵌入;2)基于多头注意力的边剪枝机制实现图结构去噪;3)自适应聚类机制捕捉全局结构模式。该方法显著提升了模型在异质性图上的表示学习能力,兼具轻量化、可适配性和可解释性。

链接: https://arxiv.org/abs/2507.18521
作者: Zhongtian Sun,Anoushka Harit,Alexandra Cristea,Christl A. Donnelly,Pietro Liò
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated significant success in learning from graph-structured data but often struggle on heterophilous graphs, where connected nodes differ in features or class labels. This limitation arises from indiscriminate neighbor aggregation and insufficient incorporation of higher-order structural patterns. To address these challenges, we propose GLANCE (Graph Logic Attention Network with Cluster Enhancement), a novel framework that integrates logic-guided reasoning, dynamic graph refinement, and adaptive clustering to enhance graph representation learning. GLANCE combines a logic layer for interpretable and structured embeddings, multi-head attention-based edge pruning for denoising graph structures, and clustering mechanisms for capturing global patterns. Experimental results on benchmark datasets, including Cornell, Texas, and Wisconsin, demonstrate that GLANCE achieves competitive performance, offering robust and interpretable solutions for heterophilous graph scenarios. The proposed framework is lightweight, adaptable, and uniquely suited to the challenges of heterophilous graphs.

[AI-6] Automated Code Review Using Large Language Models with Symbolic Reasoning

【速读】:该论文旨在解决手动代码审查(code review)过程中存在的主观性强和耗时问题,同时应对当前大型语言模型(Large Language Models, LLMs)在代码理解与评估中逻辑推理能力不足的局限。其解决方案的关键在于提出一种混合方法,将符号推理(symbolic reasoning)技术与LLMs相结合,并辅以提示工程(prompting techniques),从而提升自动化代码审查的准确性与效率。实验基于CodexGlue数据集对CodeT5、CodeBERT及GraphCodeBERT等模型进行对比验证,结果表明该融合策略显著优于单一模型方法。

链接: https://arxiv.org/abs/2507.18476
作者: Busra Icoz,Goksel Biricik
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code review is one of the key processes in the software development lifecycle and is essential to maintain code quality. However, manual code review is subjective and time consuming. Given its rule-based nature, code review is well suited for automation. In recent years, significant efforts have been made to automate this process with the help of artificial intelligence. Recent developments in Large Language Models (LLMs) have also emerged as a promising tool in this area, but these models often lack the logical reasoning capabilities needed to fully understand and evaluate code. To overcome this limitation, this study proposes a hybrid approach that integrates symbolic reasoning techniques with LLMs to automate the code review process. We tested our approach using the CodexGlue dataset, comparing several models, including CodeT5, CodeBERT, and GraphCodeBERT, to assess the effectiveness of combining symbolic reasoning and prompting techniques with LLMs. Our results show that this approach improves the accuracy and efficiency of automated code review.

[AI-7] Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving

【速读】:该论文旨在解决当前基于CPU的大语言模型(Large Language Models, LLMs)推理服务中因忽略预填充(prefill)与解码(decode)阶段工作负载差异而导致的性能瓶颈问题。现有方案通常采用静态的NUMA(Non-Uniform Memory Access)节点模型划分,并依赖厂商提供的算子级执行库,未能针对不同阶段的特点进行优化。其解决方案的关键在于提出Sandwich——一个以硬件为中心的CPU推理引擎,通过为预填充和解码阶段分别设计并独立优化不同的执行策略,实现更高效的资源利用。此外,Sandwich生成的GEMM(General Matrix Multiply)内核在性能上优于主流厂商库及其他动态形状方案,在保持接近静态编译器性能的同时,将内核调优成本降低三个数量级。

链接: https://arxiv.org/abs/2507.18454
作者: Juntao Zhao,Jiuru Li,Chuan Wu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Utilizing CPUs to serve large language models (LLMs) is a resource-friendly alternative to GPU serving. Existing CPU-based solutions ignore workload differences between the prefill and the decode phases of LLM inference, applying a static per-NUMA (Non-Uniform Memory Access) node model partition and utilizing vendor libraries for operator-level execution, which is suboptimal. We propose Sandwich, a hardware-centric CPU-based LLM serving engine that uses different execution plans for the prefill and decode phases and optimizes them separately. We evaluate Sandwich across diverse baselines and datasets on five CPU platforms, including x86 with AVX-2 and AVX-512, as well as ARM with NEON. Sandwich achieves an average 2.01x throughput improvement and 90% satisfactory time-to-first-token (TTFT) and time-per-output-token (TPOT) latencies with up to 3.40x lower requirements in single sequence serving, and significant improvement in Goodput in continuous-batching serving. The GEMM kernels generated by Sandwich outperform representative vendor kernels and other dynamic shape solutions, achieving performance comparable to static compilers with three orders of magnitude less kernel tuning costs.

[AI-8] Digital Twin Technologies in Predictive Maintenance: Enabling Transferability via Sim-to-Real and Real-to-Sim Transfer

【速读】:该论文旨在解决数字孪生(Digital Twin, DT)从学术研究向工业应用转化过程中因缺乏标准化框架而导致的复杂性问题,尤其聚焦于“仿真到现实”(sim-to-real)与“现实到仿真”(real-to-sim)双向知识迁移的实现难题。其核心挑战在于弥合“现实差距”(Reality Gap),即仿真预测与实际运行结果之间的偏差。解决方案的关键在于将一个单一的现实差距分析(Reality Gap Analysis, RGA)模块集成到现有DT框架中,并通过数据管道连接历史数据存储和仿真模型,从而在不牺牲效率的前提下,实现仿真系统与物理实体之间高效、双向的知识传递。

链接: https://arxiv.org/abs/2507.18449
作者: Sizhe Ma,Katherine A. Flanigan,Mario Bergés
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted and presented at 2024 ASCE International Conference on Computing in Civil Engineering (i3CE 2024)

点击查看摘要

Abstract:The advancement of the Internet of Things (IoT) and Artificial Intelligence has catalyzed the evolution of Digital Twins (DTs) from conceptual ideas to more implementable realities. Yet, transitioning from academia to industry is complex due to the absence of standardized frameworks. This paper builds upon the authors’ previously established functional and informational requirements supporting standardized DT development, focusing on a crucial aspect: transferability. While existing DT research primarily centers on asset transfer, the significance of “sim-to-real transfer” and “real-to-sim transfer”–transferring knowledge between simulations and real-world operations–is vital for comprehensive lifecycle management in DTs. A key challenge in this process is calibrating the “reality gap,” the discrepancy between simulated predictions and actual outcomes. Our research investigates the impact of integrating a single Reality Gap Analysis (RGA) module into an existing DT framework to effectively manage both sim-to-real and real-to-sim transfers. This integration is facilitated by data pipelines that connect the RGA module with the existing components of the DT framework, including the historical repository and the simulation model. A case study on a pedestrian bridge at Carnegie Mellon University showcases the performance of different levels of integration of our approach with an existing framework. With full implementation of an RGA module and a complete data pipeline, our approach is capable of bidirectional knowledge transfer between simulations and real-world operations without compromising efficiency.

[AI-9] GPU Accelerated Compact-Table Propagation

【速读】:该论文旨在解决大规模表约束(Table constraint)在传统CPU架构下求解效率低下的问题,尤其是在现实世界应用中可能涉及数百至数千个有效案例时,标准CPU方法难以有效处理。其解决方案的关键在于利用现代GPU的并行计算能力对当前最先进的表约束传播算法——Compact-Table(CT)进行加速优化,通过设计和实现GPU加速版本的CT算法,并将其集成到现有约束求解器中,从而显著提升大规模表约束的处理效率。实验验证表明,该方案在大量实例上具有良好的性能表现。

链接: https://arxiv.org/abs/2507.18413
作者: Enrico Santi,Fabio Tardivo,Agostino Dovier,Andrea Formisano
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)

点击查看摘要

Abstract:Constraint Programming developed within Logic Programming in the Eighties; nowadays all Prolog systems encompass modules capable of handling constraint programming on finite domains, delegating their solution to a constraint solver. This work focuses on a specific form of constraint, the so-called table constraint, used to specify conditions on the values of variables as an enumeration of alternative options. Since every condition on a set of finite domain variables can be ultimately expressed as a finite set of cases, Table can, in principle, simulate any other constraint. These characteristics make Table one of the most studied constraints ever, leading to a series of increasingly efficient propagation algorithms. Despite this, it is not uncommon to encounter real-world problems with hundreds or thousands of valid cases that are simply too many to be handled effectively with standard CPU-based approaches. In this paper, we deal with the Compact-Table (CT) algorithm, the state-of-the-art propagation algorithm for Table. We describe how CT can be enhanced by exploiting the massive computational power offered by modern GPUs to handle large Table constraints. In particular, we report on the design and implementation of GPU-accelerated CT, on its integration into an existing constraint solver, and on an experimental validation performed on a significant set of instances.

[AI-10] Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation

【速读】:该论文旨在解决呼叫中心中话务路由优化问题,目标是同时最小化客户等待时间与员工空闲时间。解决方案的关键在于将问题建模为一个马尔可夫决策过程(Markov Decision Process, MDP),并在技能基础路由(Skills-Based Routing, SBR)框架下比较两种强化学习方法:基于模型的值迭代(Value Iteration, VI)与无模型的近端策略优化(Proximal Policy Optimization, PPO)。其中,PPO通过在离散事件仿真(Discrete Event Simulation, DES)与OpenAI Gym结合的环境中自主学习策略,在1000次测试回合后展现出最优性能,即最高累积奖励、最低客户等待时间和员工空闲时间,尽管其训练时间较长。

链接: https://arxiv.org/abs/2507.18398
作者: Kwong Ho Li,Wathsala Karunarathne
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:This paper investigates the application of Reinforcement Learning (RL) to optimise call routing in call centres to minimise client waiting time and staff idle time. Two methods are compared: a model-based approach using Value Iteration (VI) under known system dynamics, and a model-free approach using Proximal Policy Optimisation (PPO) that learns from experience. For the model-based approach, a theoretical model is used, while a simulation model combining Discrete Event Simulation (DES) with the OpenAI Gym environment is developed for model-free learning. Both models frame the problem as a Markov Decision Process (MDP) within a Skills-Based Routing (SBR) framework, with Poisson client arrivals and exponentially distributed service and abandonment times. For policy evaluation, random, VI, and PPO policies are evaluated using the simulation model. After 1,000 test episodes, PPO consistently achieves the highest rewards, along with the lowest client waiting time and staff idle time, despite requiring longer training time.
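其中基于模型的部分即经典值迭代:在已知转移概率与期望奖励的 MDP 上反复执行 Bellman 备份直至收敛。示意如下:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: (A, S, S) transition probabilities; R: (A, S) expected rewards.
    Returns the optimal value function and a greedy routing policy."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V)        # Bellman backup, shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```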

[AI-11] Revisiting LLM Reasoning via Information Bottleneck

【速读】:该论文旨在解决当前基于强化学习的大型语言模型(Large Language Models, LLMs)推理能力提升方法中缺乏理论指导的问题,现有方法多依赖启发式设计,难以形成系统化、可解释的优化框架。其解决方案的关键在于引入信息瓶颈(Information Bottleneck, IB)原理,构建了IB-aware推理优化(IBRO)框架,通过约束推理轨迹在保持对最终正确答案信息量的同时具备跨不同提示的泛化能力,从而实现更高效且鲁棒的推理过程。作者进一步推导出适用于token级别的代理目标函数,并提出一种轻量级正则化方法,可在不增加额外计算开销的前提下无缝集成至现有强化学习后训练流程中,仅需一行代码修改即可生效。

链接: https://arxiv.org/abs/2507.18391
作者: Shiye Lei,Zhihao Cheng,Kai Jia,Dacheng Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). By leveraging simple rule-based rewards, RL effectively incentivizes LLMs to produce extended chain-of-thought (CoT) reasoning trajectories, progressively guiding them toward correct answers. However, existing approaches remain largely heuristic and intuition-driven, limiting the development of principled methodologies. In this paper, we present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle, introducing IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable across diverse prompts. We derive a practical token-level surrogate objective and propose an efficient approximation, resulting in the lightweight IB regularization method. This technique integrates seamlessly into existing RL-based post-training frameworks without additional computational overhead, requiring only a one-line code modification. Empirically, we validate IB regularization across multiple mathematical reasoning benchmarks and RL algorithms, demonstrating consistent improvements in LLM reasoning performance.

[AI-12] Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLM s for Financial Scenarios KDD KDD2025

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在金融领域应用中缺乏对“发散性思维”与“收敛性思维”协同评估的问题。传统推理基准多关注事实准确性或逻辑步骤的正确性,但金融专业人员不仅需做出最优决策,还需在不确定性下生成创造性且合理的未来情景。为此,作者提出ConDiFi基准,其关键在于设计了一套双维度评测体系:包含607个宏观金融场景下的发散式推理任务(用于评估生成新颖、可行未来路径的能力)和990个多跳对抗性选择题(用于衡量收敛性推理的准确性)。通过该基准对14个主流模型的系统评估,揭示了不同模型在创新性(Novelty)与可操作性(Actionability)上的显著差异,为安全、战略地部署LLMs于金融场景提供了新的评估视角。

链接: https://arxiv.org/abs/2507.18368
作者: Zhuang Qiang Bok,Watson Wei Khong Chua
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Agentic GenAI Evaluation KDD2025: KDD workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models this https URL

点击查看摘要

Abstract:Most reasoning benchmarks for LLMs emphasize factual accuracy or step-by-step logic. In finance, however, professionals must not only converge on optimal decisions but also generate creative, plausible futures under uncertainty. We introduce ConDiFi, a benchmark that jointly evaluates divergent and convergent thinking in LLMs for financial tasks. ConDiFi features 607 macro-financial prompts for divergent reasoning and 990 multi-hop adversarial MCQs for convergent reasoning. Using this benchmark, we evaluated 14 leading models and uncovered striking differences. Despite high fluency, GPT-4o underperforms on Novelty and Actionability. In contrast, models like DeepSeek-R1 and Cohere Command R+ rank among the top for generating actionable insights suitable for investment decisions. ConDiFi provides a new perspective to assess reasoning capabilities essential to safe and strategic deployment of LLMs in finance.

[AI-13] he AlphaPhysics Term Rewriting System for Marking Algebraic Expressions in Physics Exams

【速读】:该论文旨在解决物理考试自动评分问题(automated marking of Physics exams),即如何基于标准答案评估学生手写或键入的答案是否正确。其核心挑战在于将自然语言表述的学生作答转化为机器可处理的形式,并在此基础上进行逻辑与数学正确性验证。解决方案的关键在于融合多种自动化推理技术:首先利用大型语言模型(Large Language Model, LLM)对原始学生答案进行语义理解、纠错并重写为结构化表达;随后结合计算机代数系统(Computer Algebra System, CAS)、SMT求解器(Satisfiability Modulo Theories solver)以及针对物理问题中三角函数表达式定制的项重写系统(Term Rewriting System),实现对答案的自动化形式验证。其中,项重写系统的构建及其终止性(termination)和合流性(confluence)性质的证明是关键技术难点与创新点。

链接: https://arxiv.org/abs/2507.18337
作者: Peter Baumgartner,Lachlan McGinness
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present our method for automatically marking Physics exams. The marking problem consists in assessing typed student answers for correctness with respect to a ground truth solution. This is a challenging problem that we seek to tackle using a combination of a computer algebra system, an SMT solver and a term rewriting system. A Large Language Model is used to interpret and remove errors from student responses and rewrite these in a machine readable format. Once formalized and language-aligned, the next step then consists in applying automated reasoning techniques for assessing student solution correctness. We consider two methods of automated theorem proving: off-the-shelf SMT solving and term rewriting systems tailored for physics problems involving trigonometric expressions. The development of the term rewrite system and establishing termination and confluence properties was not trivial, and we describe it in some detail in the paper. We evaluate our system on a rich pool of over 1500 real-world student exam responses from the 2023 Australian Physics Olympiad.

[AI-14] A Concept for Efficient Scalability of Automated Driving Allowing for Technical Legal Cultural and Ethical Differences ITSC

【速读】:该论文旨在解决自动驾驶(Automated Driving, AD)系统在不同车辆配置、环境条件及社会政治背景下的可扩展性问题,即如何实现通用能力向特定系统和环境的有效迁移与适配。其解决方案的关键在于提出一种两阶段微调(fine-tuning)框架:第一阶段通过国家特定的奖励模型(reward model)将技术适应与社会政治要求相连接,确保符合本地法规、文化与伦理规范;第二阶段采用车辆特定的迁移学习(transfer learning)完成系统适配并验证设计决策,从而在技术、法律、文化和伦理维度上实现高效、灵活且可验证的规模化部署。

链接: https://arxiv.org/abs/2507.18326
作者: Lars Ullrich,Michael Buchholz,Jonathan Petit,Klaus Dietmayer,Knut Graichen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to be published at 2025 28th IEEE International Conference on Intelligent Transportation Systems (ITSC), Gold Coast, Australia, November 18-21, 2025

点击查看摘要

Abstract:Efficient scalability of automated driving (AD) is key to reducing costs, enhancing safety, conserving resources, and maximizing impact. However, research focuses on specific vehicles and context, while broad deployment requires scalability across various configurations and environments. Differences in vehicle types, sensors, actuators, but also traffic regulations, legal requirements, cultural dynamics, or even ethical paradigms demand high flexibility of data-driven developed capabilities. In this paper, we address the challenge of scalable adaptation of generic capabilities to desired systems and environments. Our concept follows a two-stage fine-tuning process. In the first stage, fine-tuning to the specific environment takes place through a country-specific reward model that serves as an interface between technological adaptations and socio-political requirements. In the second stage, vehicle-specific transfer learning facilitates system adaptation and governs the validation of design decisions. In sum, our concept offers a data-driven process that integrates both technological and socio-political aspects, enabling effective scalability across technical, legal, cultural, and ethical differences.

[AI-15] Foundations for Risk Assessment of AI in Protecting Fundamental Rights

【Quick Read】: This paper addresses the complexity of qualitative risk assessment of AI under the EU AI Act, with particular attention to reconciling legal compliance with the protection of fundamental rights. The key to the solution is a conceptual framework that integrates definitional balancing and defeasible reasoning: the former resolves conflicts between competing fundamental rights via proportionality analysis, while the latter accommodates the dynamic nature of legal decision-making. The framework also stresses fine-grained analysis of AI deployment scenarios to identify potential legal violations and multi-layered impacts on fundamental rights, yielding a more operational assessment model for high-risk AI systems and General Purpose AI (GPAI), and laying philosophical foundations for responsible AI governance.

Link: https://arxiv.org/abs/2507.18290
Authors: Antonino Rotolo, Beatrice Ferrigno, Jose Miguel Angel Garcia Godinez, Claudio Novelli, Giovanni Sartor
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 1 figure. To be published in: The Philosophical Foundations of Information Technology Law. Oxford University Press, Oxford

Click to view abstract

Abstract:This chapter introduces a conceptual framework for qualitative risk assessment of AI, particularly in the context of the EU AI Act. The framework addresses the complexities of legal compliance and fundamental rights protection by integrating definitional balancing and defeasible reasoning. Definitional balancing employs proportionality analysis to resolve conflicts between competing rights, while defeasible reasoning accommodates the dynamic nature of legal decision-making. Our approach stresses the need for an analysis of AI deployment scenarios and for identifying potential legal violations and multi-layered impacts on fundamental rights. On the basis of this analysis, we provide philosophical foundations for a logical account of AI risk analysis. In particular, we consider the basic building blocks for conceptually grasping the interaction between AI deployment scenarios and fundamental rights, incorporating in defeasible reasoning definitional balancing and arguments about the contextual promotion or demotion of rights. This layered approach allows for more operative models of assessment of both high-risk AI systems and General Purpose AI (GPAI) systems, emphasizing the broader applicability of the latter. Future work aims to develop a formal model and effective algorithms to enhance AI risk assessment, bridging theoretical insights with practical applications to support responsible AI governance.

[AI-16] Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning

【Quick Read】: This paper addresses the difficulty of analyzing cognitive states from eye-tracking data, whose structured, non-linguistic nature resists conventional methods; large language models (LLMs) excel at textual reasoning but handle temporal and numerical data poorly. The key to the solution is a multimodal human-AI collaborative framework with three parts: a multi-stage pipeline that combines horizontal and vertical segmentation with LLM reasoning to uncover latent gaze patterns; an Expert-Model Co-Scoring Module that fuses domain-expert judgment with LLM output to produce trust scores for behavioral interpretations; and a hybrid anomaly-detection module that couples LSTM-based temporal modeling with LLM-driven semantic analysis. Together these markedly improve the accuracy, interpretability, and consistency of cognitive feature extraction, giving a reliable mapping from raw gaze signals to high-level cognitive states.

Link: https://arxiv.org/abs/2507.18252
Authors: Dongyang Guo, Yasmeen Abdrabou, Enkeleda Thaqi, Enkelejda Kasneci
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Eye-tracking data reveals valuable insights into users’ cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert-Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human-computer interaction, and educational analytics.
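As a concrete reading of the Expert-Model Co-Scoring Module, here is a minimal sketch assuming the trust score reduces to a weighted blend of an expert rating and an LLM rating; the weight and the 1-5 rating scales are our assumptions, not the paper's specification.

```python
# A minimal sketch, assuming co-scoring is a convex blend of two ratings.
def co_score(expert_rating: float, llm_rating: float, alpha: float = 0.6) -> float:
    """Trust score in [0, 1] from two ratings on a 1-5 scale."""
    blended = alpha * expert_rating + (1 - alpha) * llm_rating
    return (blended - 1.0) / 4.0  # normalize 1..5 -> 0..1

print(co_score(expert_rating=4.5, llm_rating=3.0))  # ~0.725
```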

[AI-17] GenAI for Automotive Software Development: From Requirements to Wheels

【Quick Read】: This paper tackles the long development cycles, complex compliance verification, and inefficient generation of test and implementation code in software development for autonomous driving and Advanced Driver Assistance Systems (ADAS). The key to the solution is a GenAI-driven automated development workflow: Large Language Models (LLMs) perform model-based summarization of requirements (Ecore metamodels, XMI model instances, and OCL constraints), test scenario generation, and the generation of simulation code (Python) and target-platform code (C++); Retrieval Augmented Generation (RAG) extracts information from autonomous-driving regulation documents to improve the compliance and coverage of generated test scenarios. The approach shortens development and testing cycles and improves the efficiency and consistency of ADAS feature implementation.

Link: https://arxiv.org/abs/2507.18223
Authors: Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Krzysztof Lebioda, Andre Schamschurko, Alois Knoll
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper introduces a GenAI-empowered approach to automated development of automotive software, with emphasis on autonomous and Advanced Driver Assistance Systems (ADAS) capabilities. The process starts with requirements as input, while the main generated outputs are test scenario code for simulation environment, together with implementation of desired ADAS capabilities targeting hardware platform of the vehicle connected to testbench. Moreover, we introduce additional steps for requirements consistency checking leveraging Model-Driven Engineering (MDE). In the proposed workflow, Large Language Models (LLMs) are used for model-based summarization of requirements (Ecore metamodel, XMI model instance and OCL constraint creation), test scenario generation, simulation code (Python) and target platform code generation (C++). Additionally, Retrieval Augmented Generation (RAG) is adopted to enhance test scenario generation from autonomous driving regulations-related documents. Our approach aims at shorter compliance and re-engineering cycles, as well as reduced development and testing time when it comes to ADAS-related capabilities.

[AI-18] FedSA-GCL: A Semi-Asynchronous Federated Graph Learning Framework with Personalized Aggregation and Cluster-Aware Broadcasting

【Quick Read】: This paper addresses the inefficiency of Federated Graph Learning (FGL) caused by its reliance on synchronous communication, and the semantic drift and representational inconsistency that arise when existing asynchronous federated learning (AFL) methods ignore the topological properties of graph data. The key to the solution is FedSA-GCL, a semi-asynchronous federated framework whose core innovation, a novel ClusterCast mechanism, exploits both inter-client label distribution divergence and graph topological characteristics for more efficient collaborative training.

Link: https://arxiv.org/abs/2507.18219
Authors: Zhongzheng Yuan, Lianshuai Guo, Xunkai Li, Yinlin Zhu, Wenyu Wang, Meixia Qu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Federated Graph Learning (FGL) is a distributed learning paradigm that enables collaborative training over large-scale subgraphs located on multiple local systems. However, most existing FGL approaches rely on synchronous communication, which leads to inefficiencies and is often impractical in real-world deployments. Meanwhile, current asynchronous federated learning (AFL) methods are primarily designed for conventional tasks such as image classification and natural language processing, without accounting for the unique topological properties of graph data. Directly applying these methods to graph learning can possibly result in semantic drift and representational inconsistency in the global model. To address these challenges, we propose FedSA-GCL, a semi-asynchronous federated framework that leverages both inter-client label distribution divergence and graph topological characteristics through a novel ClusterCast mechanism for efficient training. We evaluate FedSA-GCL on multiple real-world graph datasets using the Louvain and Metis split algorithms, and compare it against 9 baselines. Extensive experiments demonstrate that our method achieves strong robustness and outstanding efficiency, outperforming the baselines by an average of 2.92% with the Louvain and by 3.4% with the Metis.
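To illustrate the semi-asynchronous side of the framework, a minimal sketch of staleness-aware server aggregation follows. The exponential staleness decay is a common AFL heuristic used here as an assumption; the paper's ClusterCast mechanism is not reproduced.

```python
# A minimal sketch: merge a client update as soon as it arrives,
# down-weighting updates computed against an old global model.
import numpy as np

def server_merge(global_w, client_w, client_round, server_round, lr=0.5, decay=0.5):
    staleness = server_round - client_round
    weight = lr * (decay ** staleness)          # older updates count less
    return (1 - weight) * global_w + weight * client_w

w = np.zeros(4)
w = server_merge(w, np.ones(4), client_round=3, server_round=3)      # fresh update
w = server_merge(w, 2 * np.ones(4), client_round=1, server_round=4)  # stale update
print(w)
```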

[AI-19] Information Security Based on LLM Approaches: A Review

【Quick Read】: This paper addresses the difficulty traditional information-security defenses have in coping with complex, fast-changing threats, and proposes using large language models (LLMs) to make information-security deployments more intelligent. The key to the solution is LLM technology built on neural networks and the Transformer architecture: applied to malicious behavior prediction, network threat analysis, system vulnerability detection, malicious code identification, and cryptographic algorithm optimization, it can markedly raise the detection accuracy of security systems while lowering false-alarm rates, strengthening overall protection.

Link: https://arxiv.org/abs/2507.18215
Authors: Chang Gong, Zhongwen Li, Xiaoqi Li
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Information security is facing increasingly severe challenges, and traditional protection means are difficult to cope with complex and changing threats. In recent years, as an emerging intelligent technology, large language models (LLMs) have shown a broad application prospect in the field of information security. In this paper, we focus on the key role of LLM in information security, systematically review its application progress in malicious behavior prediction, network threat analysis, system vulnerability detection, malicious code identification, and cryptographic algorithm optimization, and explore its potential in enhancing security protection performance. Based on neural networks and Transformer architecture, this paper analyzes the technical basis of large language models and their advantages in natural language processing tasks. It is shown that the introduction of large language modeling helps to improve the detection accuracy and reduce the false alarm rate of security systems. Finally, this paper summarizes the current application results and points out that it still faces challenges in model transparency, interpretability, and scene adaptability, among other issues. It is necessary to explore further the optimization of the model structure and the improvement of the generalization ability to realize a more intelligent and accurate information security protection system.

[AI-20] MoRPI-PINN: A Physics-Informed Framework for Mobile Robot Pure Inertial Navigation

【Quick Read】: This paper addresses navigation drift in mobile robots that must rely solely on inertial sensors when satellite navigation or vision is unavailable, a drift caused by inherent sensor noise and error terms. The key to the solution is MoRPI-PINN, a physics-informed neural network (PINN) framework that embeds physical laws and constraints into the training process, markedly improving the accuracy and robustness of navigation based on inertial measurement units (IMUs). Real-world experiments show over 85% accuracy improvement compared with other approaches, and the model is lightweight enough to deploy on edge devices in typical mobile robot applications.

Link: https://arxiv.org/abs/2507.18206
Authors: Arup Kumar Sahoo, Itzik Klein
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Click to view abstract

Abstract:A fundamental requirement for full autonomy in mobile robots is accurate navigation even in situations where satellite navigation or cameras are unavailable. In such practical situations, relying only on inertial sensors will result in navigation solution drift due to the sensors’ inherent noise and error terms. One of the emerging solutions to mitigate drift is to maneuver the robot in a snake-like slithering motion to increase the inertial signal-to-noise ratio, allowing the regression of the mobile robot position. In this work, we propose MoRPI-PINN as a physics-informed neural network framework for accurate inertial-based mobile robot navigation. By embedding physical laws and constraints into the training process, MoRPI-PINN is capable of providing an accurate and robust navigation solution. Using real-world experiments, we show accuracy improvements of over 85% compared to other approaches. MoRPI-PINN is a lightweight approach that can be implemented even on edge devices and used in any typical mobile robot application.
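A minimal sketch of the physics-informed ingredient follows: the loss augments the usual data term with finite-difference kinematic residuals tying position, velocity, and measured acceleration together. The weighting and discretization are illustrative assumptions, not MoRPI-PINN's actual formulation.

```python
# A minimal sketch, assuming a network that outputs position and velocity
# sequences from IMU inputs; the physics term penalizes violations of
# basic kinematics (velocity is the derivative of position, acceleration
# the derivative of velocity).
import torch

def pinn_loss(pred_pos, true_pos, pred_vel, accel, dt, lam=0.1):
    data_loss = torch.mean((pred_pos - true_pos) ** 2)
    vel_fd = (pred_pos[1:] - pred_pos[:-1]) / dt      # d(pos)/dt
    acc_fd = (pred_vel[1:] - pred_vel[:-1]) / dt      # d(vel)/dt
    physics = (torch.mean((vel_fd - pred_vel[:-1]) ** 2)
               + torch.mean((acc_fd - accel[:-1]) ** 2))
    return data_loss + lam * physics
```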

[AI-21] Comparing Non-minimal Semantics for Disjunction in Answer Set Programming

【Quick Read】: This paper compares four semantics for disjunction in Answer Set Programming (ASP) that, unlike stable models, do not adhere to the minimality principle. The key result is a proof that three of them, Forks, Justified Models, and a reasonable relaxation of the DI (Determining Inference) semantics, actually coincide in a single non-minimal semantics. This common semantics always yields a superset of the stable models of the original program (in fact, modulo any context) and is strictly stronger than the fourth approach, Strongly Supported Models, which treats disjunction as in classical logic. The result clarifies the consistency of and hierarchy among disjunction semantics across frameworks and gives a cleaner formal basis for the expressive power of disjunction in ASP.

Link: https://arxiv.org/abs/2507.18198
Authors: Felicidad Aguado, Pedro Cabalar, Brais Muñiz, Gilberto Pérez, Concepción Vidal
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this paper, we compare four different semantics for disjunction in Answer Set Programming that, unlike stable models, do not adhere to the principle of model minimality. Two of these approaches, Cabalar and Muñiz's Justified Models and Doherty and Szalas's Strongly Supported Models, directly provide an alternative non-minimal semantics for disjunction. The other two, Aguado et al's Forks and Shen and Eiter's Determining Inference (DI) semantics, actually introduce a new disjunction connective, but are compared here as if they constituted new semantics for the standard disjunction operator. We are able to prove that three of these approaches (Forks, Justified Models and a reasonable relaxation of the DI semantics) actually coincide, constituting a common single approach under different definitions. Moreover, this common semantics always provides a superset of the stable models of a program (in fact, modulo any context) and is strictly stronger than the fourth approach (Strongly Supported Models), that actually treats disjunctions as in classical logic.
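A small worked example (ours, not taken from the paper) helps place the readings being compared:

```latex
% Program P:  a \lor b.
% Stable models (minimal):   \{a\}  and  \{b\}
% Classical reading of disjunction, as in Strongly Supported Models,
% additionally admits the non-minimal model  \{a, b\}.
% The paper shows that Forks, Justified Models, and the relaxed DI
% semantics coincide, always contain the stable models, and are
% strictly stronger than the classical (Strongly Supported) treatment.
```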

[AI-22] Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory

【Quick Read】: This paper addresses the difficulty of separating the contributions of knowledge and reasoning during inference in large language models (LLMs), a distinction that matters for model analysis, interpretability, and development. The key to the solution is a cognition attribution framework inspired by dual-system cognitive theory that decouples an LLM's internal cognition into two complementary phases: knowledge retrieval (Phase 1) and reasoning adjustment (Phase 2). By prompting the model to answer under two cognitive modes, fast thinking and slow thinking, and comparing performance, the contributions of knowledge and reasoning to the final output can be quantified. The analysis reveals how parameter scaling affects knowledge and reasoning differently and where each resides in the model's layer hierarchy, offering a new lens on LLM internals.

Link: https://arxiv.org/abs/2507.18178
Authors: Mutian Yang, Jiandong Gao, Ji Wu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While large language models (LLMs) leverage both knowledge and reasoning during inference, the capacity to distinguish between them plays a pivotal role in model analysis, interpretability, and development. Inspired by dual-system cognitive theory, we propose a cognition attribution framework to decouple the contribution of knowledge and reasoning. In particular, the cognition of LLMs is decomposed into two distinct yet complementary phases: knowledge retrieval (Phase 1) and reasoning adjustment (Phase 2). To separate these phases, LLMs are prompted to generate answers under two different cognitive modes, fast thinking and slow thinking, respectively. The performance under different cognitive modes is analyzed to quantify the contribution of knowledge and reasoning. This architecture is employed to 15 LLMs across 3 datasets. Results reveal: (1) reasoning adjustment is domain-specific, benefiting reasoning-intensive domains (e.g., mathematics, physics, and chemistry) and potentially impairing knowledge-intensive domains. (2) Parameter scaling improves both knowledge and reasoning, with knowledge improvements being more pronounced. Additionally, parameter scaling makes LLM reasoning significantly more prudent, while moderately more intelligent. (3) Knowledge primarily resides in lower network layers, while reasoning operates in higher layers. Our framework not only helps understand LLMs from a "decoupling" perspective, but also provides new insights into existing research, including scaling laws, hierarchical knowledge editing, and limitations of small-model reasoning.
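A minimal sketch of the two-mode probing idea described above: query the same model under a fast-thinking and a slow-thinking prompt and read the accuracy gap as the reasoning contribution. The prompt wording and the toy harness are our assumptions.

```python
# A minimal sketch; `ask` is any callable wrapping an LLM API.
FAST = "Answer with only the final answer:\n{q}"
SLOW = "Think step by step, then give the final answer:\n{q}"

def contribution(ask, questions, answers):
    fast = sum(ask(FAST.format(q=q)) == a for q, a in zip(questions, answers)) / len(questions)
    slow = sum(ask(SLOW.format(q=q)) == a for q, a in zip(questions, answers)) / len(questions)
    return {"fast_acc": fast, "slow_acc": slow, "reasoning_gain": slow - fast}

# toy stand-in for an LLM, just to exercise the harness
demo = contribution(lambda p: "4", ["2+2=?", "3+5=?"], ["4", "8"])
print(demo)  # {'fast_acc': 0.5, 'slow_acc': 0.5, 'reasoning_gain': 0.0}
```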

[AI-23] When Noisy Labels Meet Class Imbalance on Graphs: A Graph Augmentation Method with LLM and Pseudo Label

【Quick Read】: This paper addresses label noise in class-imbalanced graph node classification: in real-world graphs the class distribution is skewed and labels are often noisy, so conventional methods struggle to learn high-quality node representations. The key to the solution is the GraphALP framework, whose core innovations are: an oversampling mechanism based on large language models (LLMs) that generates label-accurate minority-class nodes to relieve class imbalance; a dynamically weighted pseudo-labeling method that extracts high-confidence pseudo labels from the class-balanced graph to reduce the label noise ratio; and a secondary LLM-guided oversampling mechanism that corrects the class-distribution skew that pseudo labels may introduce. Together these yield robust node classification for class-imbalanced graphs under noisy labels.

Link: https://arxiv.org/abs/2507.18153
Authors: Riting Xia, Rucong Wang, Yulin Liu, Anchen Li, Xueyan Liu, Yan Zhang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Class-imbalanced graph node classification is a practical yet underexplored research problem. Although recent studies have attempted to address this issue, they typically assume clean and reliable labels when processing class-imbalanced graphs. This assumption often violates the nature of real-world graphs, where labels frequently contain noise. Given this gap, this paper systematically investigates robust node classification for class-imbalanced graphs with noisy labels. We propose GraphALP, a novel Graph Augmentation framework based on Large language models (LLMs) and Pseudo-labeling techniques. Specifically, we design an LLM-based oversampling method to generate synthetic minority nodes, producing label-accurate minority nodes to alleviate class imbalance. Based on the class-balanced graphs, we develop a dynamically weighted pseudo-labeling method to obtain high-confidence pseudo labels to reduce label noise ratio. Additionally, we implement a secondary LLM-guided oversampling mechanism to mitigate potential class distribution skew caused by pseudo labels. Experimental results show that GraphALP achieves superior performance over state-of-the-art methods on class-imbalanced graphs with noisy labels.
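To illustrate the dynamically weighted pseudo-labeling step, here is a minimal sketch in which per-class confidence thresholds relax for under-represented classes; the threshold rule is an illustrative assumption, not GraphALP's exact scheme.

```python
# A minimal sketch: keep a pseudo label only if its confidence clears a
# class-dependent threshold, with rarer classes given a lower bar.
import numpy as np

def pseudo_labels(probs, class_counts, base_tau=0.9):
    freq = class_counts / class_counts.sum()
    tau = base_tau * (freq / freq.max()) ** 0.5   # relax for rare classes
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep = conf >= tau[preds]
    return preds[keep], np.where(keep)[0]

probs = np.array([[0.95, 0.05], [0.6, 0.4], [0.2, 0.8]])
counts = np.array([90, 10])                        # heavy class imbalance
labels, idx = pseudo_labels(probs, counts)
print(labels, idx)                                 # [0 1] [0 2]
```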

[AI-24] Logical Characterizations of GNNs with Mean Aggregation

【Quick Read】: This paper studies the theoretical limits of the expressive power of graph neural networks (GNNs) whose aggregation function is the mean, giving logical characterizations in both the non-uniform and the uniform setting. The key results: in the non-uniform setting, mean GNNs are exactly as expressive as ratio modal logic, which can state that at least a certain ratio of a vertex's successors satisfy a property, placing them above max aggregation but below sum aggregation (characterized by modal logic and graded modal logic, respectively); in the uniform setting, under the natural assumptions that combination functions are continuous and classification functions are thresholds, their expressive power relative to MSO is exactly alternation-free modal logic, strictly weaker than GNNs with sum or max aggregation. These results show how fundamentally the aggregation mechanism shapes GNN expressiveness and delimit both the weaknesses and the advantages of mean aggregation.

Link: https://arxiv.org/abs/2507.18145
Authors: Moritz Schönherr, Carsten Lutz
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:We study the expressive power of graph neural networks (GNNs) with mean as the aggregation function. In the non-uniform setting, we show that such GNNs have exactly the same expressive power as ratio modal logic, which has modal operators expressing that at least a certain ratio of the successors of a vertex satisfies a specified property. The non-uniform expressive power of mean GNNs is thus higher than that of GNNs with max aggregation, but lower than for sum aggregation–the latter are characterized by modal logic and graded modal logic, respectively. In the uniform setting, we show that the expressive power relative to MSO is exactly that of alternation-free modal logic, under the natural assumptions that combination functions are continuous and classification functions are thresholds. This implies that, relative to MSO and in the uniform setting, mean GNNs are strictly less expressive than sum GNNs and max GNNs. When any of the assumptions is dropped, the expressive power increases.
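The correspondence with ratio modal logic is easy to see computationally: with 0/1 node features, the mean over a vertex's neighbors is exactly the fraction of successors satisfying the property, so a threshold expresses the ratio modality. A minimal sketch (ours, for intuition only):

```python
# "At least ratio r of my neighbors satisfy p" via mean aggregation.
def ratio_modality(adj, feat, r):
    out = []
    for v, nbrs in enumerate(adj):
        mean = sum(feat[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
        out.append(mean >= r)
    return out

adj = [[1, 2, 3], [0], [0], [0]]     # node 0 has three neighbors
feat = [0, 1, 1, 0]                  # p holds at nodes 1 and 2
print(ratio_modality(adj, feat, r=0.5))  # [True, False, False, False]
```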

[AI-25] Actively evaluating and learning the distinctions that matter: Vaccine safety signal detection from emergency triage notes

【Quick Read】: This paper addresses the challenges of post-licensure safety surveillance for COVID-19 vaccines, notably the limited window for safety data collection in clinical trials and the lag in signal identification during early mass vaccination. The key to the solution is using Natural Language Processing (NLP) and Active Learning to identify potential vaccine adverse-event signals efficiently and accurately from emergency department (ED) triage notes. The core idea is to combine active learning, which optimizes the annotation process and improves annotated-data quality, with data augmentation, so that a high-accuracy classifier can be deployed quickly despite the scarcity of labeled medical data, substantially strengthening ED-text-based vaccine safety signal surveillance.

Link: https://arxiv.org/abs/2507.18123
Authors: Sedigh Khademi, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila, Jim Black
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages

Click to view abstract

Abstract:The rapid development of COVID-19 vaccines has showcased the global community's ability to combat infectious diseases. However, the need for post-licensure surveillance systems has grown due to the limited window for safety data collection in clinical trials and early widespread implementation. This study aims to employ Natural Language Processing techniques and Active Learning to rapidly develop a classifier that detects potential vaccine safety issues from emergency department notes. ED triage notes, containing expert, succinct vital patient information at the point of entry to health systems, can significantly contribute to timely vaccine safety signal surveillance. While keyword-based classification can be effective, it may yield false positives and demand extensive keyword modifications. This is exacerbated by the infrequency of vaccination-related ED presentations and their similarity to other reasons for ED visits. NLP offers a more accurate and efficient alternative, albeit requiring annotated data, which is often scarce in the medical field. Active learning optimizes the annotation process and the quality of annotated data, which can result in faster model implementation and improved model performance. This work combines active learning, data augmentation, and evaluation techniques to create a classifier that is used to enhance vaccine safety surveillance from ED triage notes.
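A minimal sketch of the generic pool-based active learning loop behind such a pipeline, with least-confidence sampling; the classifier choice and batch size are illustrative assumptions.

```python
# A minimal sketch: fit on the labeled set, then send the pool items the
# model is least sure about to human annotators.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_lab, y_lab, X_pool, batch=10):
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    probs = clf.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)          # least-confident sampling
    query_idx = np.argsort(-uncertainty)[:batch]   # indices to annotate next
    return clf, query_idx
```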

[AI-26] AlphaGo Moment for Model Architecture Discovery

【Quick Read】: This paper addresses the bottleneck in AI research created by human cognitive limits: AI system capabilities improve exponentially, yet progress in AI research itself remains bounded by linear human cognitive capacity. The key to the solution is ASI-Arch, presented as the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in neural architecture discovery, marking a shift from automated optimization (as in Neural Architecture Search, NAS) to automated innovation. The system conducts end-to-end autonomous research: it hypothesizes novel architectural concepts, implements them as executable code, trains and empirically validates them, and iteratively improves from accumulated experience. Across 20,000 GPU hours and 1,773 autonomous experiments it discovered 106 state-of-the-art (SOTA) linear attention architectures, revealing emergent design principles that surpass human-designed baselines and establishing the first empirical scaling law for scientific discovery itself, a step from human-limited toward computation-scalable, self-accelerating AI research.

Link: https://arxiv.org/abs/2507.18074
Authors: Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, Pengfei Liu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery–a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo’s Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself–demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.
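Stripped of scale, the autonomous loop the abstract describes has a simple shape; the sketch below is our schematic, with every function a hypothetical stub rather than ASI-Arch's actual interface.

```python
# A minimal sketch of the hypothesize -> implement -> evaluate -> remember loop.
def discovery_loop(propose, implement, evaluate, memory, budget):
    best = None
    for _ in range(budget):
        idea = propose(memory)          # hypothesize a new architecture
        model = implement(idea)         # emit executable training code
        score = evaluate(model)         # train and measure empirically
        memory.append((idea, score))    # past experience guides the next step
        if best is None or score > best[1]:
            best = (idea, score)
    return best
```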

[AI-27] Multi-Agent Guided Policy Optimization

【Quick Read】: This paper addresses the fact that Centralized Training with Decentralized Execution (CTDE) methods in cooperative multi-agent reinforcement learning (MARL) often underutilize centralized training information or lack theoretical guarantees. The key to the solution is Multi-Agent Guided Policy Optimization (MAGPO), a framework that couples centralized guidance with decentralized execution: an auto-regressive joint policy enables scalable, coordinated exploration and is explicitly aligned with the decentralized policies, ensuring deployability under partial observability while providing theoretical guarantees of monotonic policy improvement.

Link: https://arxiv.org/abs/2507.18059
Authors: Yueheng Li, Guangming Xie, Zongqing Lu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an auto-regressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning. Our code and experimental data can be found in this https URL.
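One plausible way to read the explicit alignment between the joint guide policy and the decentralized policies is a KL penalty, sketched below as an assumption rather than MAGPO's exact loss.

```python
# A minimal sketch: pull each agent's decentralized policy (conditioned on
# local observations only) toward the centralized auto-regressive guide.
import torch
import torch.nn.functional as F

def alignment_loss(dec_logits, guide_logits):
    # dec_logits, guide_logits: [n_agents, n_actions]
    dec_logp = F.log_softmax(dec_logits, dim=-1)
    guide_p = F.softmax(guide_logits, dim=-1).detach()  # guide is the target
    return F.kl_div(dec_logp, guide_p, reduction="batchmean")
```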

[AI-28] OpenNav: Open-World Navigation with Multimodal Large Language Models

【Quick Read】: This paper addresses how robots in the open world can interpret and decompose complex free-form natural-language instructions into executable sequences of trajectory points for diverse navigation tasks, going beyond a limited set of predefined motion primitives. The key to the solution is exploiting the strong cross-modal understanding and code-generation abilities of multimodal large language models (MLLMs): the MLLM parses the instruction and extracts its semantics, then interacts with vision-language perception models to generate compositional 2D bird-eye-view value maps, fusing the MLLM's semantic knowledge with spatial map information to strengthen the robot's spatial understanding. The zero-shot vision-language navigation framework is validated on large-scale autonomous vehicle datasets (AVDs) for outdoor tasks and on a real Husky robot indoors and outdoors, demonstrating robustness to object detection errors and linguistic ambiguity as well as real-world applicability.

Link: https://arxiv.org/abs/2507.18033
Authors: Mingfeng Yuan, Letian Wang, Steven L. Waslander
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Pre-trained large language models (LLMs) have demonstrated strong common-sense reasoning abilities, making them promising for robotic navigation and planning tasks. However, despite recent progress, bridging the gap between language descriptions and actual robot actions in the open-world, beyond merely invoking limited predefined motion primitives, remains an open challenge. In this work, we aim to enable robots to interpret and decompose complex language instructions, ultimately synthesizing a sequence of trajectory points to complete diverse navigation tasks given open-set instructions and open-set objects. We observe that multi-modal large language models (MLLMs) exhibit strong cross-modal understanding when processing free-form language instructions, demonstrating robust scene comprehension. More importantly, leveraging their code-generation capability, MLLMs can interact with vision-language perception models to generate compositional 2D bird-eye-view value maps, effectively integrating semantic knowledge from MLLMs with spatial information from maps to reinforce the robot’s spatial understanding. To further validate our approach, we effectively leverage large-scale autonomous vehicle datasets (AVDs) to validate our proposed zero-shot vision-language navigation framework in outdoor navigation tasks, demonstrating its capability to execute a diverse range of free-form natural language navigation instructions while maintaining robustness against object detection errors and linguistic ambiguities. Furthermore, we validate our system on a Husky robot in both indoor and outdoor scenes, demonstrating its real-world robustness and applicability. Supplementary videos are available at this https URL
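To illustrate the compositional value-map idea, here is a minimal sketch in which per-concept bird-eye-view maps are combined into one navigation surface; the weighted-sum composition is an illustrative assumption, not the paper's exact operator.

```python
# A minimal sketch: per-concept HxW maps in [0,1] are blended into a
# single value surface that downstream planning can follow.
import numpy as np

def compose_value_maps(maps, weights):
    total = np.zeros_like(next(iter(maps.values())))
    for name, m in maps.items():
        total += weights.get(name, 0.0) * m
    return total

goal = np.zeros((4, 4)); goal[3, 3] = 1.0          # e.g. "go to the chair"
obstacle = np.zeros((4, 4)); obstacle[1, 1] = 1.0  # detected obstacle
value = compose_value_maps({"goal": goal, "obstacle": obstacle},
                           {"goal": 1.0, "obstacle": -1.0})
print(value.argmax())  # flat index of the most desirable cell: 15
```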

[AI-29] Does visualization help AI understand data?

【Quick Read】: This paper asks whether visualizations can improve an AI system's ability to analyze data, i.e., whether AI can benefit from charts the way humans do. The key finding is experimental: adding a scatterplot to the raw data significantly improves the precision and accuracy with which two commercial vision-language models (GPT-4.1 and Claude 3.5) describe synthetic datasets across three representative analysis tasks, especially as dataset complexity increases; comparison against baselines with a blank chart and a chart of mismatched data further shows the gains come from the content of the charts, not their mere presence.

Link: https://arxiv.org/abs/2507.18022
Authors: Victoria R. Li, Johnathan Sun, Martin Wattenberg
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 5 pages, 6 figures

Click to view abstract

Abstract:Charts and graphs help people analyze data, but can they also be useful to AI systems? To investigate this question, we perform a series of experiments with two commercial vision-language models: GPT 4.1 and Claude 3.5. Across three representative analysis tasks, the two systems describe synthetic datasets more precisely and accurately when raw data is accompanied by a scatterplot, especially as datasets grow in complexity. Comparison with two baselines – providing a blank chart and a chart with mismatched data – shows that the improved performance is due to the content of the charts. Our results are initial evidence that AI systems, like humans, can benefit from visualization.

[AI-30] Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items

【Quick Read】: This paper addresses limitations of the user simulators used to evaluate Conversational Recommendation Systems (CRS): existing simulators critique retrieved items against a single target item and assume unlimited patience over many turns, so offline evaluation fails to reflect how systems respond to alternative options in realistic multi-turn interaction. The key to the solution is Fashion-AlterEval, a dataset that adds human judgments of selected alternative items to common fashion CRS datasets (Shoes and Fashion IQ), together with two novel meta-user simulators that use these judgments: simulated users can express preferences for alternatives to their original target and can change their mind and their level of patience. Experiments with three CRS models show that knowledge of alternatives considerably changes evaluation outcomes: the existing single-target evaluation underestimates system effectiveness, and when simulated users may accept alternative relevant items, systems satisfy them more quickly.

Link: https://arxiv.org/abs/2507.18017
Authors: Maria Vlachou
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2401.05783

Click to view abstract

Abstract:In Conversational Recommendation Systems (CRS), a user provides feedback on recommended items at each turn, leading the CRS towards improved recommendations. Due to the need for a large amount of data, a user simulator is employed for both training and evaluation. Such user simulators critique the current retrieved item based on knowledge of a single target item. However, system evaluation in offline settings with simulators is limited by the focus on a single target item and their unlimited patience over a large number of turns. To overcome these limitations of existing simulators, we propose Fashion-AlterEval, a new dataset that contains human judgments for a selection of alternative items by adding new annotations in common fashion CRS datasets. Consequently, we propose two novel meta-user simulators that use the collected judgments and allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and level of patience. In our experiments using the Shoes and Fashion IQ as the original datasets and three CRS models, we find that using the knowledge of alternatives by the simulator can have a considerable impact on the evaluation of existing CRS models, specifically that the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to consider alternative relevant items instead, the system can respond and satisfy the user more quickly.
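A minimal sketch of a meta-user simulator with finite patience and alternative targets, in the spirit of the simulators proposed here; the relevance function, acceptance threshold, and patience budget are our assumptions.

```python
# A minimal sketch: the simulated user accepts any sufficiently relevant
# alternative and abandons the session when patience runs out.
def simulate_session(recommend, relevance, alternatives, patience=5, accept_at=0.8):
    feedback = None
    for turn in range(patience):
        item = recommend(feedback)
        score = max(relevance(item, alt) for alt in alternatives)
        if score >= accept_at:
            return {"satisfied": True, "turns": turn + 1}
        feedback = item              # critique drives the next recommendation
    return {"satisfied": False, "turns": patience}
```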

[AI-31] E.A.R.T.H.: Structuring Creative Evolution through Model Error in Generative AI

【Quick Read】: This paper asks how AI can move beyond imitative generation toward genuine creativity. The key to the solution is the E.A.R.T.H. framework, a five-stage generative pipeline that systematically turns model-generated errors into creative assets: Error generation, Amplification, Refine selection, Transform, and Harness feedback. Its core mechanisms combine structured prompts, semantic scoring, and human-in-the-loop evaluation to raise the novelty, surprise, and relevance of outputs. Experiments show creativity scores rise 52.5% at the Refine stage, with final outputs reaching 2.010 (a 70.4% improvement over the initial stage), supporting error-centered, feedback-driven generation as a scalable route to human-aligned creative AI.

Link: https://arxiv.org/abs/2507.18004
Authors: Yusen Peng, Shuhua Mao
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 44 pages, 11 figures

Click to view abstract

Abstract:How can AI move beyond imitation toward genuine creativity? This paper proposes the E.A.R.T.H. framework, a five-stage generative pipeline that transforms model-generated errors into creative assets through Error generation, Amplification, Refine selection, Transform, and Harness feedback. Drawing on cognitive science and generative modeling, we posit that "creative potential hides in failure" and operationalize this via structured prompts, semantic scoring, and human-in-the-loop evaluation. Implemented using LLaMA-2-7B-Chat, SBERT, BERTScore, CLIP, BLIP-2, and Stable Diffusion, the pipeline employs a composite reward function based on novelty, surprise, and relevance. At the Refine stage, creativity scores increase by 52.5% (1.179 to 1.898, t = -5.56, p < 0.001), with final outputs reaching 2.010, a 70.4% improvement. Refined slogans are 48.4% shorter and 40.7% more novel, with only a 4.0% drop in relevance. Cross-modal tests show strong slogan-to-image alignment (CLIPScore: 0.249; BERTScore F1: 0.816). In human evaluations, 60% of outputs scored ≥ 4.0, with metaphorical slogans (avg. 4.09) outperforming literal ones (3.99). Feedback highlights stylistic precision and emotional resonance. These results demonstrate that error-centered, feedback-driven generation enhances creativity, offering a scalable path toward self-evolving, human-aligned creative AI.
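The composite reward is straightforward to write down; a minimal sketch follows, with the weights and the [0, 1] normalization of the three signals as illustrative assumptions.

```python
# A minimal sketch of a novelty/surprise/relevance composite reward.
def composite_reward(novelty, surprise, relevance, w=(0.4, 0.3, 0.3)):
    return w[0] * novelty + w[1] * surprise + w[2] * relevance

# e.g. novelty and surprise as embedding distances to reference text,
# relevance as similarity to the creative brief, all pre-scaled to [0, 1]
print(composite_reward(novelty=0.8, surprise=0.6, relevance=0.9))  # ~0.77
```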

[AI-32] Synthesis of timeline-based planning strategies avoiding determinization

【Quick Read】: This paper addresses strategy synthesis for timeline-based planning: how to synthesize executable planning strategies directly from the plan-existence problem without an expensive determinization step. Prior work proved PSPACE-membership by reduction to the nonemptiness problem for nondeterministic finite automata (NFA), but an NFA must first be determinized before strategies can be extracted, at significant computational cost. The key contribution is identifying a fragment whose plan-existence problem maps directly to the nonemptiness problem for deterministic finite automata (DFA), avoiding determinization and enabling efficient strategy synthesis, together with a maximal subset of Allen's relations that fits exactly into this deterministic fragment, making the approach both theoretically complete and practically usable.

Link: https://arxiv.org/abs/2507.17988
Authors: Dario Della Monica, Angelo Montanari, Pietro Sala
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2410.22757

Click to view abstract

Abstract:Qualitative timeline-based planning models domains as sets of independent, but interacting, components whose behaviors over time, the timelines, are governed by sets of qualitative temporal constraints (ordering relations), called synchronization rules. Its plan-existence problem has been shown to be PSPACE-complete; in particular, PSPACE-membership has been proved via reduction to the nonemptiness problem for nondeterministic finite automata. However, nondeterministic automata cannot be directly used to synthesize planning strategies as a costly determinization step is needed. In this paper, we identify a fragment of qualitative timeline-based planning whose plan-existence problem can be directly mapped into the nonemptiness problem of deterministic finite automata, which can then synthesize strategies. In addition, we identify a maximal subset of Allen's relations that fits into such a deterministic fragment.

[AI-33] Decoding Instructional Dialogue: Human-AI Collaborative Analysis of Teacher Use of AI Tool at Scale

【Quick Read】: This paper addresses the lack of systematic understanding of how educators actually use generative AI tools in teaching, and the challenge of studying human-AI interaction patterns meaningfully at scale. The key to the solution is a human-AI collaborative methodology: a four-phase coding pipeline (inductive theme discovery, codebook development, structured annotation, and model benchmarking) applied to over 140,000 educator-AI messages from K-12 teachers on a generative AI platform. The approach yields a hierarchical codebook aligned with established teacher evaluation frameworks, capturing instructional goals, contextual needs, and pedagogical strategies, and shows that LLMs, particularly Claude 3.5 Haiku, can reliably support theme identification, extend human recognition in complex scenarios, and outperform open-weight models in accuracy and structural reliability. This provides a scalable, transparent model for AI-augmented qualitative research and reveals educators' diverse uses of AI along with emerging competency needs.

Link: https://arxiv.org/abs/2507.17985
Authors: Alex Liu, Lief Esbenshade, Shawon Sarkar, Victor Tian, Zachary Zhang, Kevin He, Min Sun
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The integration of large language models (LLMs) into educational tools has the potential to substantially impact how teachers plan instruction, support diverse learners, and engage in professional reflection. Yet little is known about how educators actually use these tools in practice and how their interactions with AI can be meaningfully studied at scale. This paper presents a human-AI collaborative methodology for large-scale qualitative analysis of over 140,000 educator-AI messages drawn from a generative AI platform used by K-12 teachers. Through a four-phase coding pipeline, we combined inductive theme discovery, codebook development, structured annotation, and model benchmarking to examine patterns of educator engagement and evaluate the performance of LLMs in qualitative coding tasks. We developed a hierarchical codebook aligned with established teacher evaluation frameworks, capturing educators' instructional goals, contextual needs, and pedagogical strategies. Our findings demonstrate that LLMs, particularly Claude 3.5 Haiku, can reliably support theme identification, extend human recognition in complex scenarios, and outperform open-weight models in both accuracy and structural reliability. The analysis also reveals substantive patterns in how educators query AI to enhance instructional practices (79.7 percent of total conversations), create or adapt content (76.1 percent), support assessment and feedback loops (46.9 percent), attend to student needs for tailored instruction (43.3 percent), and assist with other professional responsibilities (34.2 percent), highlighting emerging AI-related competencies that have direct implications for teacher preparation and professional development. This study offers a scalable, transparent model for AI-augmented qualitative research and provides foundational insights into the evolving role of generative AI in educational practice.

[AI-34] Machine Unlearning of Traffic State Estimation and Prediction

【Quick Read】: This paper addresses privacy leakage, cybersecurity risks, and declining model trustworthiness in data-driven traffic state estimation and prediction (TSEP) caused by sensitive, maliciously poisoned, or outdated data. Merely deleting such data from back-end databases cannot satisfy regulations like the "right to be forgotten", because machine learning models may still retain a memory of those data. The key to the solution is a new learning paradigm, Machine Unlearning TSEP, which enables a trained model to selectively "forget" specific data, thereby improving the trustworthiness and reliability of data-driven transportation systems.

Link: https://arxiv.org/abs/2507.17984
Authors: Xin Wang, R. Tyrrell Rockafellar, Xuegang (Jeff) Ban
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Data-driven traffic state estimation and prediction (TSEP) relies heavily on data sources that contain sensitive information. While the abundance of data has fueled significant breakthroughs, particularly in machine learning-based methods, it also raises concerns regarding privacy, cybersecurity, and data freshness. These issues can erode public trust in intelligent transportation systems. Recently, regulations have introduced the "right to be forgotten", allowing users to request the removal of their private data from models. As machine learning models can remember old data, simply removing it from back-end databases is insufficient in such systems. To address these challenges, this study introduces a novel learning paradigm for TSEP, Machine Unlearning TSEP, which enables a trained TSEP model to selectively forget privacy-sensitive, poisoned, or outdated data. By empowering models to "unlearn," we aim to enhance the trustworthiness and reliability of data-driven TSEP.
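As one concrete instance of unlearning (a common baseline, not the paper's method), gradient ascent on the forget set combined with continued descent on retained data:

```python
# A minimal sketch of an unlearning baseline: push the model away from the
# forget batch while anchoring it to the retained batch.
import torch

def unlearn_step(model, loss_fn, forget_batch, retain_batch, opt, lam=0.5):
    opt.zero_grad()
    xf, yf = forget_batch
    xr, yr = retain_batch
    # ascend on forgotten data, descend on retained data
    loss = -lam * loss_fn(model(xf), yf) + loss_fn(model(xr), yr)
    loss.backward()
    opt.step()
```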

[AI-35] MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection

【Quick Read】: This paper addresses the limitations of existing phishing-email datasets in sample diversity, feature richness, and support for model generalization, which constrain the performance and reproducibility of machine learning (ML) detectors in practice. The key to the solution is MeAJOR (Merged email Assets from Joint Open-source Repositories), a new multi-source phishing-email corpus integrating 135,894 samples that cover a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. Systematic experiments with four classifiers (RF, XGB, MLP, and CNN) across multiple feature configurations validate the dataset, with XGB reaching 98.34% F1, and show that the corpus mitigates common challenges such as class imbalance, generalizability, and reproducibility.

Link: https://arxiv.org/abs/2507.17978
Authors: Paulo Mendes (1), Eva Maia (1), Isabel Praça (1) ((1) GECAD, ISEP, Polytechnic of Porto, Portugal)
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 2 tables, WI-IAT 2025 conference

Click to view abstract

Abstract:Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's effectiveness, achieving 98.34% F1 with XGB. By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource, while addressing common challenges like class imbalance, generalisability and reproducibility.

[AI-36] Improving the Computational Efficiency and Explainability of GeoAggregator

【Quick Read】: This paper addresses accurate modeling and explanation of geospatial tabular data (GTD) for understanding geographic phenomena and their underlying processes, where the core challenge is improving computational efficiency and explainability without sacrificing predictive performance. The key to the solution is twofold: first, an optimized pipeline that accelerates data loading and streamlines the forward pass of the GeoAggregator (GA) model, yielding better computational efficiency; second, a model-ensembling strategy plus a post-hoc explanation function based on the GeoShapley framework, which strengthens explainability and effectively captures the spatial effects built into designed synthetic datasets.

Link: https://arxiv.org/abs/2507.17977
Authors: Rui Deng, Ziqi Li, Mingshu Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures

Click to view abstract

Abstract:Accurate modeling and explaining geospatial tabular data (GTD) are critical for understanding geospatial phenomena and their underlying processes. Recent work has proposed a novel transformer-based deep learning model named GeoAggregator (GA) for this purpose, and has demonstrated that it outperforms other statistical and machine learning approaches. In this short paper, we further improve GA by 1) developing an optimized pipeline that accelerates the dataloading process and streamlines the forward pass of GA to achieve better computational efficiency; and 2) incorporating a model ensembling strategy and a post-hoc model explanation function based on the GeoShapley framework to enhance model explainability. We validate the functionality and efficiency of the proposed strategies by applying the improved GA model to synthetic datasets. Experimental results show that our implementation improves the prediction accuracy and inference speed of GA compared to the original implementation. Moreover, explanation experiments indicate that GA can effectively capture the inherent spatial effects in the designed synthetic dataset. The complete pipeline has been made publicly available for community use (this https URL).

[AI-37] VERIRAG : Healthcare Claim Verification via Statistical Audit in Retrieval-Augmented Generation

【Quick Read】: This paper addresses the "methodological blindness" of retrieval-augmented generation (RAG) systems in clinical decision support: they retrieve literature as evidence but cannot assess its scientific quality, so low-quality or even retracted studies are treated the same as rigorous ones. The key to the solution is the VERIRAG framework, whose core innovations are: (i) the Veritable, an 11-point structured checklist that quantifies each source's methodological rigor (e.g., data integrity and statistical validity); (ii) the Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold that calibrates the required evidence to how extraordinary a claim is. Across four heterogeneous datasets the framework consistently outperforms existing baselines, improving F1 by 10 to 14 points over the next-best method.

Link: https://arxiv.org/abs/2507.17948
Authors: Shubham Mohole, Hongjun Choi, Shusen Liu, Christine Klymko, Shashank Kushwaha, Derek Shi, Wesam Sakla, Sainyam Galhotra, Ruben Glatt
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) systems are increasingly adopted in clinical decision support, yet they remain methodologically blind: they retrieve evidence but cannot vet its scientific quality. A paper claiming "Antioxidant proteins decreased after alloferon treatment" and a rigorous multi-laboratory replication study will be treated as equally credible, even if the former lacked scientific rigor or was even retracted. To address this challenge, we introduce VERIRAG, a framework that makes three notable contributions: (i) the Veritable, an 11-point checklist that evaluates each source for methodological rigor, including data integrity and statistical validity; (ii) a Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold, which calibrates the required evidence based on how extraordinary a claim is. Across four datasets, comprising retracted, conflicting, comprehensive, and settled science corpora, the VERIRAG approach consistently outperforms all baselines, achieving absolute F1 scores ranging from 0.53 to 0.65, representing a 10 to 14 point improvement over the next-best method in each respective dataset. We will release all materials necessary for reproducing our results.
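A minimal sketch of a Hard-to-Vary style aggregator that rewards quality and penalizes redundancy; VERIRAG's exact weighting differs, so treat this only as an illustration of the quality-times-diversity idea.

```python
# A minimal sketch: each source contributes its checklist quality score,
# discounted by its similarity to evidence already counted.
import numpy as np

def hv_score(quality, sim):
    """quality: [n] checklist scores in [0,1]; sim: [n,n] pairwise similarity."""
    order = np.argsort(-quality)                 # strongest evidence first
    total, seen = 0.0, []
    for i in order:
        redundancy = max((sim[i, j] for j in seen), default=0.0)
        total += quality[i] * (1.0 - redundancy)
        seen.append(i)
    return total

q = np.array([0.9, 0.8, 0.3])
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(hv_score(q, S))  # ~1.25: near-duplicate second study adds little
```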

[AI-38] Minimax Data Sanitization with Distortion Constraint and Adversarial Inference

【Quick Read】: This paper studies optimal data sanitization for privacy-preserving data sharing: a privatizer must let an authorized reconstructor recover the data within a fixed distortion threshold while maximizing the minimum estimation loss across two unauthorized adversaries, each holding side information correlated with the private data. The setting models cases where neither adversary alone can reconstruct the data accurately, but their combined side information would meet the distortion threshold, echoing secret-sharing principles with lossy rather than perfect recovery. The key to the solution is framing this as a constrained data-driven minimax optimization problem and proposing a training procedure that alternately updates the privatizer, the reconstructor, and the adversaries; for the Gaussian and binary special cases, optimal solutions are derived analytically and serve as benchmarks for the proposed minimax training approach.

Link: https://arxiv.org/abs/2507.17942
Authors: Amirarsalan Moatazedian, Yauhen Yakimenka, Rémi A. Chou, Jörg Kliewer
Institution: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ITW 2025

Click to view abstract

Abstract:We study a privacy-preserving data-sharing setting where a privatizer transforms private data into a sanitized version observed by an authorized reconstructor and two unauthorized adversaries, each with access to side information correlated with the private data. The reconstructor is evaluated under a distortion function, while each adversary is evaluated using a separate loss function. The privatizer ensures the reconstructor distortion remains below a fixed threshold while maximizing the minimum loss across the two adversaries. This two-adversary setting models cases where individual users cannot reconstruct the data accurately, but their combined side information enables estimation within the distortion threshold. The privatizer maximizes individual loss while permitting accurate reconstruction only through collaboration. This echoes secret-sharing principles, but with lossy rather than perfect recovery. We frame this as a constrained data-driven minimax optimization problem and propose a data-driven training procedure that alternately updates the privatizer, reconstructor, and adversaries. We also analyze the Gaussian and binary cases as special scenarios where optimal solutions can be obtained. These theoretical optimal results are benchmarks for evaluating the proposed minimax training approach.
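For readers who want the objective at a glance, a schematic rendering of the constrained minimax problem follows; this is our paraphrase of the setup, with g the privatizer, \ell_k the adversaries' losses, Z_k their side information, d the reconstructor's distortion measure, and D the budget.

```latex
\max_{g}\ \min_{k\in\{1,2\}}\ \inf_{\hat{X}_k}\
  \mathbb{E}\!\left[\ell_k\!\left(X,\hat{X}_k(g(X),Z_k)\right)\right]
\quad\text{s.t.}\quad
\inf_{\hat{X}}\ \mathbb{E}\!\left[d\!\left(X,\hat{X}(g(X))\right)\right]\le D
```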

[AI-39] Multimodal Fine-grained Reasoning for Post Quality Evaluation

【Quick Read】: This paper addresses three limitations of existing work on post-quality assessment: (1) treating the task as unimodal classification, which fails to exploit multimodal cues and fine-grained quality differences; (2) noise introduced during deep multimodal fusion, producing misleading signals; and (3) the inability to capture complex semantic relationships such as relevance and comprehensiveness. The core of the solution is the Multimodal Fine-grained Topic-post Relational Reasoning (MFTRR) framework, which mimics human cognition and reframes quality assessment as a ranking task. It consists of two key modules: a Local-Global Semantic Correlation Reasoning Module that models fine-grained semantic interactions between posts and topics at both local and global levels, enhanced by a maximum-information fusion mechanism to suppress noise; and a Multi-Level Evidential Relational Reasoning Module that mines macro- and micro-level relational cues to strengthen evidence-based reasoning. Experiments show MFTRR significantly outperforms state-of-the-art baselines on multiple multimodal datasets.

Link: https://arxiv.org/abs/2507.17934
Authors: Xiaoxu Guo, Siyan Liang, Yachao Cui, Juxiang Zhou, Lei Wang, Han Cao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 48 pages

Click to view abstract

Abstract:Accurately assessing post quality requires complex relational reasoning to capture nuanced topic-post relationships. However, existing studies face three major limitations: (1) treating the task as unimodal categorization, which fails to leverage multimodal cues and fine-grained quality distinctions; (2) introducing noise during deep multimodal fusion, leading to misleading signals; and (3) lacking the ability to capture complex semantic relationships like relevance and comprehensiveness. To address these issues, we propose the Multimodal Fine-grained Topic-post Relational Reasoning (MFTRR) framework, which mimics human cognitive processes. MFTRR reframes post-quality assessment as a ranking task and incorporates multimodal data to better capture quality variations. It consists of two key modules: (1) the Local-Global Semantic Correlation Reasoning Module, which models fine-grained semantic interactions between posts and topics at both local and global levels, enhanced by a maximum information fusion mechanism to suppress noise; and (2) the Multi-Level Evidential Relational Reasoning Module, which explores macro- and micro-level relational cues to strengthen evidence-based reasoning. We evaluate MFTRR on three newly constructed multimodal topic-post datasets and the public Lazada-Home dataset. Experimental results demonstrate that MFTRR significantly outperforms state-of-the-art baselines, achieving up to 9.52% NDCG@3 improvement over the best unimodal method on the Art History dataset.

[AI-40] SMARTAPS: Tool-augmented LLMs for Operations Management AAAI-25 AAAI

【Quick Read】: This paper addresses the high cost of traditional advanced planning systems (APS), which depend on specialist consultants for customization and maintenance and are therefore out of reach for many supply chain planners. The key to the solution is SmartAPS, a conversational system built on a tool-augmented large language model (LLM): through an intuitive natural-language chat interface, planners can query information, perform counterfactual reasoning, receive recommendations, and run scenario analyses, markedly lowering the barrier to adoption and increasing operational flexibility.

Link: https://arxiv.org/abs/2507.17927
Authors: Timothy Tin Long Yu, Mahdi Mostajabdaveh, Jabo Serge Byusa, Rindra Ramamonjison, Giuseppe Carenini, Kun Mao, Zirui Zhou, Yong Zhang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: this https URL

Click to view abstract

Abstract:Large language models (LLMs) present intriguing opportunities to enhance user interaction with traditional algorithms and tools in real-world applications. An advanced planning system (APS) is sophisticated software that leverages optimization to help operations planners create, interpret, and modify an operational plan. While highly beneficial, many customers are priced out of using an APS due to the ongoing costs of consultants responsible for customization and maintenance. To address the need for a more accessible APS expressed by supply chain planners, we present SmartAPS, a conversational system built on a tool-augmented LLM. Our system provides operations planners with an intuitive natural language chat interface, allowing them to query information, perform counterfactual reasoning, receive recommendations, and execute scenario analysis to better manage their operations. A short video demonstrating the system has been released: this https URL

[AI-41] UrbanPulse: A Cross-City Deep Learning Framework for Ultra-Fine-Grained Population Transfer Prediction

【Quick Read】: This paper addresses shortcomings in population flow prediction: limited accuracy, weak cross-city generalization, and constrained spatial resolution. Traditional models rely on static spatial assumptions, deep models transfer poorly across cities, large language models (LLMs) are computationally expensive and ignore spatial structure, and existing methods often reduce resolution by clustering Points of Interest (POIs). The key to the solution is UrbanPulse, which treats every POI as an individual node in a city-wide spatiotemporal graph and couples a temporal graph convolutional encoder with a Transformer-based decoder to model multi-scale spatiotemporal dependencies. A three-stage transfer learning strategy (pretraining on large urban graphs, cold-start adaptation, and reinforcement learning fine-tuning) markedly improves generalization across urban contexts; on over 103 million cleaned GPS records from three California metropolitan areas, UrbanPulse achieves state-of-the-art accuracy and scalability, demonstrating deployment feasibility.

Link: https://arxiv.org/abs/2507.17924
Authors: Hongrong Yang, Markus Schlaepfer
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate population flow prediction is essential for urban planning, transportation management, and public health. Yet existing methods face key limitations: traditional models rely on static spatial assumptions, deep learning models struggle with cross-city generalization, and Large Language Models (LLMs) incur high computational costs while failing to capture spatial structure. Moreover, many approaches sacrifice resolution by clustering Points of Interest (POIs) or restricting coverage to subregions, limiting their utility for city-wide analytics. We introduce UrbanPulse, a scalable deep learning framework that delivers ultra-fine-grained, city-wide OD flow predictions by treating each POI as an individual node. It combines a temporal graph convolutional encoder with a transformer-based decoder to model multi-scale spatiotemporal dependencies. To ensure robust generalization across urban contexts, UrbanPulse employs a three-stage transfer learning strategy: pretraining on large-scale urban graphs, cold-start adaptation, and reinforcement learning fine-tuning. Evaluated on over 103 million cleaned GPS records from three metropolitan areas in California, UrbanPulse achieves state-of-the-art accuracy and scalability. Through efficient transfer learning, UrbanPulse takes a key step toward making high-resolution, AI-powered urban forecasting deployable in practice across diverse cities.

[AI-42] From Seed to Harvest: Augmenting Human Creativity with AI for Red-teaming Text-to-Image Models

【Quick Read】: This paper addresses the lack of comprehensive, continuously refreshed, and culturally diverse adversarial evaluation data for text-to-image (T2I) models. Existing approaches either rely on human-written adversarial prompts, which are small in scale and culturally unrepresentative, or on synthetic generation, which scales but lacks the realistic nuance and creative adversarial strategies of human-crafted prompts. The key to the solution is Seed2Harvest, a hybrid red-teaming method that performs guided expansion of culturally diverse, human-crafted adversarial prompt seeds, combining human creativity with machine computational capacity. The expanded prompts preserve the characteristics and attack patterns of the human seeds with comparable attack success rates (0.31 NudeNet, 0.36 Stable Diffusion NSFW, 0.12 Q16) while substantially increasing diversity (unique geographic locations from 58 to 535; Shannon entropy from 5.28 to 7.48), enabling sustained, scalable safety evaluation of T2I models.

Link: https://arxiv.org/abs/2507.17922
Authors: Jessica Quaye, Charvi Rastogi, Alicia Parrish, Oana Inel, Minsuk Kahng, Lora Aroyo, Vijay Janapa Reddi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Text-to-image (T2I) models have become prevalent across numerous applications, making their robust evaluation against adversarial attacks a critical priority. Continuous access to new and challenging adversarial prompts across diverse domains is essential for stress-testing these models for resilience against novel attacks from multiple vectors. Current techniques for generating such prompts are either entirely authored by humans or synthetically generated. On the one hand, datasets of human-crafted adversarial prompts are often too small in size and imbalanced in their cultural and contextual representation. On the other hand, datasets of synthetically-generated prompts achieve scale, but typically lack the realistic nuances and creative adversarial strategies found in human-crafted prompts. To combine the strengths of both human and machine approaches, we propose Seed2Harvest, a hybrid red-teaming method for guided expansion of culturally diverse, human-crafted adversarial prompt seeds. The resulting prompts preserve the characteristics and attack patterns of human prompts while maintaining comparable average attack success rates (0.31 NudeNet, 0.36 SD NSFW, 0.12 Q16). Our expanded dataset achieves substantially higher diversity with 535 unique geographic locations and a Shannon entropy of 7.48, compared to 58 locations and 5.28 entropy in the original dataset. Our work demonstrates the importance of human-machine collaboration in leveraging human creativity and machine computational capacity to achieve comprehensive, scalable red-teaming for continuous T2I model safety evaluation.
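The diversity figure quoted above is ordinary Shannon entropy over the distribution of geographic locations in a prompt set; for reference:

```python
# Shannon entropy (bits) of a categorical distribution of locations.
import math
from collections import Counter

def shannon_entropy(locations):
    counts = Counter(locations)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy(["Accra", "Lagos", "Accra", "Seoul"]))  # 1.5
```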
zh
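
The diversity figures quoted above (Shannon entropy rising from 5.28 to 7.48) come from a standard entropy computation over location labels. A minimal sketch with toy, invented label counts (not the paper's data):

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy example: expanding a seed set with more, evenly spread locations
# raises the entropy, mirroring the 5.28 -> 7.48 improvement reported above.
seed_locations = ["Paris"] * 50 + ["Tokyo"] * 30 + ["Cairo"] * 20
expanded = seed_locations + ["Lagos", "Lima", "Oslo", "Hanoi"] * 25
print(shannon_entropy(seed_locations))  # ~1.49 bits
print(shannon_entropy(expanded))        # higher: flatter, richer distribution
```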

[AI-43] Deep learning-aided inverse design of porous metamaterials

【速读】:该论文旨在解决多孔超材料(porous metamaterials)的逆向设计问题,即根据目标水力性能(如孔隙率和渗透率)生成具有特定微观结构的材料。其解决方案的关键在于提出了一种属性变分自编码器(property-variational autoencoder, pVAE)框架,该框架在传统变分自编码器(VAE)基础上引入回归器模块,以联合学习微结构特征与水力性能之间的映射关系。通过训练卷积神经网络(CNN)从格子玻尔兹曼方法(Lattice Boltzmann Method, LBM)生成的有限样本中预测有效水力性能,显著降低了直接数值模拟的计算成本;同时,pVAE的编码器-解码器架构将微结构映射至紧凑且可解释的潜在空间(latent space),从而实现结构-性能映射、插值及逆向设计,最终生成满足指定性能要求的新材料结构。

链接: https://arxiv.org/abs/2507.17907
作者: Phu Thien Nguyen,Yousef Heider,Dennis M. Kochmann,Fadi Aldakheel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 29 figures

点击查看摘要

Abstract:The ultimate aim of the study is to explore the inverse design of porous metamaterials using a deep learning-based generative framework. Specifically, we develop a property-variational autoencoder (pVAE), a variational autoencoder (VAE) augmented with a regressor, to generate structured metamaterials with tailored hydraulic properties, such as porosity and permeability. While this work uses the lattice Boltzmann method (LBM) to generate intrinsic permeability tensor data for limited porous microstructures, a convolutional neural network (CNN) is trained using a bottom-up approach to predict effective hydraulic properties. This significantly reduces the computational cost compared to direct LBM simulations. The pVAE framework is trained on two datasets: a synthetic dataset of artificial porous microstructures and CT-scan images of volume elements from real open-cell foams. The encoder-decoder architecture of the VAE captures key microstructural features, mapping them into a compact and interpretable latent space for efficient structure-property exploration. The study provides a detailed analysis and interpretation of the latent space, demonstrating its role in structure-property mapping, interpolation, and inverse design. This approach facilitates the generation of new metamaterials with desired properties. The datasets and codes used in this study will be made open-access to support further research.
zh
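
To make the pVAE idea concrete, here is a minimal PyTorch-style sketch of a VAE augmented with a regressor head on the latent code. The layer sizes, property count, and loss weights are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PropertyVAE(nn.Module):
    """VAE with a property-regressor head on the latent code (pVAE-style)."""
    def __init__(self, in_dim=1024, latent_dim=16, n_props=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())
        self.reg = nn.Linear(latent_dim, n_props)  # e.g. porosity, permeability

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), self.reg(z), mu, logvar

def pvae_loss(x, x_hat, y, y_hat, mu, logvar, beta=1.0, gamma=1.0):
    recon = F.binary_cross_entropy(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    prop = F.mse_loss(y_hat, y)  # ties the latent space to hydraulic properties
    return recon + beta * kl + gamma * prop

x = torch.rand(8, 1024)  # stand-in for flattened binary microstructure images
y = torch.rand(8, 2)     # stand-in property targets
model = PropertyVAE()
x_hat, y_hat, mu, logvar = model(x)
print(pvae_loss(x, x_hat, y, y_hat, mu, logvar))
```

The regressor term is what makes the latent space property-aware, enabling the structure-property interpolation and inverse design discussed in the abstract.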

[AI-44] Action-List Reinforcement Learning Syndrome Decoding for Binary Linear Block Codes

【速读】:该论文旨在解决线性分组码(linear block codes)在迭代译码过程中因比特翻转策略不优而导致的性能瓶颈问题。其核心解决方案是将译码过程建模为马尔可夫决策过程(Markov Decision Process, MDP),并通过引入截断MDP(truncated MDP)来降低状态空间复杂度,具体方法是在码字周围学习指定半径的汉明球(Hamming ball)以限制状态数量;进一步提出一种通用的动作列表译码(action-list decoding)框架,利用深度Q网络(Deep-Q network)值函数优化决策策略,并结合码的自同构群(automorphism group)提升译码性能;此外,还设计了一种基于反馈机制的增强方法,在已有高性能译码器后引入强化学习算法以进一步改进性能,从而有效降低强化学习模块的计算复杂度并显著提升译码效率。

链接: https://arxiv.org/abs/2507.17893
作者: Milad Taghipour,Bane Vasic
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper explores the application of reinforcement learning techniques to enhance the performance of decoding of linear block codes based on flipping bits and finding optimal decisions. We describe the methodology for mapping the iterative decoding process into Markov Decision Processes (MDPs) and propose different methods to reduce the number of states in the MDP. A truncated MDP is proposed to reduce the number of states in the MDP by learning a Hamming ball with a specified radius around codewords. We then propose a general scheme for reinforcement learning based decoders applicable to any class of codes to improve the performance of decoders. We call this scheme action-list decoding. We design an action-list decoder based on the Deep-Q network values that substantially enhances performance. We also take advantage of the automorphism group of the code to further improve its performance. Additionally, we propose a feedback-based method to exploit and enhance the performance of existing high-performing decoders by applying reinforcement learning algorithms after the existing decoders. These approaches effectively reduce the complexity of the reinforcement learning block. Finally, we present experimental results for the Low-Density Parity Check (LDPC) codes over the Binary Symmetric Channel (BSC) to demonstrate the efficiency of the proposed methods.
zh
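
A minimal sketch of the underlying MDP for bit-flipping syndrome decoding, using a toy (7,4) Hamming code. The reward shaping (-0.01 per flip, +1 on reaching the all-zero syndrome) is an assumption for illustration, not the paper's exact design:

```python
import numpy as np

def syndrome(H, y):
    """Syndrome s = H y over GF(2); the all-zero syndrome means y is a codeword."""
    return (H @ y) % 2

def step(H, y, action):
    """One MDP transition: flip bit `action`, observe the new syndrome and reward."""
    y = y.copy()
    y[action] ^= 1
    s = syndrome(H, y)
    reward = 1.0 if not s.any() else -0.01  # small cost per flip, bonus on success
    return y, s, reward

# Toy (7,4) Hamming code: column j of H is the binary representation of j+1.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
y = np.zeros(7, dtype=int)   # the all-zero codeword
y[2] ^= 1                    # the BSC flips bit 2
y, s, r = step(H, y, 2)      # the optimal action flips it back
print(s, r)                  # [0 0 0] 1.0
```

A Deep-Q network would take the syndrome (state) as input and score each candidate flip (action); the action-list decoder then keeps the top-scoring actions rather than only the greedy one.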

[AI-45] I2I-STRADA – Information to Insights via Structured Reasoning Agent for Data Analysis

【速读】:该论文旨在解决当前多智能体数据分析系统在自动化洞察生成过程中忽视结构化推理流程的问题。现有系统虽能有效处理查询转换、数据变换和可视化等任务,但其基于通用大语言模型(Large Language Models, LLMs)的推理步骤缺乏针对特定分析任务的固定认知路径,导致规划不连贯且洞察偏离目标。解决方案的关键在于提出I2I-STRADA(Information-to-Insight via Structured Reasoning Agent for Data Analysis),该架构通过模块化子任务显式建模分析过程中的认知步骤——包括模糊目标解读、上下文知识锚定、抽象计划构建及执行适应性调整,从而实现符合真实数据分析场景的结构化推理流程。实验表明,I2I-STRADA在DABstep和DABench基准上显著优于现有方法,在规划一致性和洞察对齐度方面表现更优。

链接: https://arxiv.org/abs/2507.17874
作者: SaiBarath Sundar,Pranav Satheesan,Udayaadithya Avadhanam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in agentic systems for data analysis have emphasized automation of insight generation through multi-agent frameworks, and orchestration layers. While these systems effectively manage tasks like query translation, data transformation, and visualization, they often overlook the structured reasoning process underlying analytical thinking. Reasoning large language models (LLMs) used for multi-step problem solving are trained as general-purpose problem solvers. As a result, their reasoning or thinking steps do not adhere to fixed processes for specific tasks. Real-world data analysis requires a consistent cognitive workflow: interpreting vague goals, grounding them in contextual knowledge, constructing abstract plans, and adapting execution based on intermediate outcomes. We introduce I2I-STRADA (Information-to-Insight via Structured Reasoning Agent for Data Analysis), an agentic architecture designed to formalize this reasoning process. I2I-STRADA focuses on modeling how analysis unfolds via modular sub-tasks that reflect the cognitive steps of analytical reasoning. Evaluations on the DABstep and DABench benchmarks show that I2I-STRADA outperforms prior systems in planning coherence and insight alignment, highlighting the importance of structured cognitive workflows in agent design for data analysis.
zh

[AI-46] Technical Implementation of Tippy: Multi-Agent Architecture and System Design for Drug Discovery Laboratory Automation

【速读】:该论文旨在解决药物发现实验室自动化中多智能体协同复杂任务的挑战,特别是如何实现高效、安全且可扩展的AI代理系统以整合现有实验设施与数据流程。其解决方案的关键在于构建一个基于微服务架构的多智能体系统(multi-agent system),包含五个专业化代理(Supervisor、Molecule、Lab、Analysis 和 Report),通过 OpenAI Agents SDK 进行编排,并借助 Model Context Protocol (MCP) 实现对实验室工具的标准化访问;同时采用 Git 驱动的配置管理、Kubernetes 容器化部署、向量数据库支持的检索增强生成(RAG)以及 Envoy 反向代理保障外部安全接入,从而在保证系统可靠性与安全性的同时,实现与传统实验室基础设施的无缝集成。

链接: https://arxiv.org/abs/2507.17852
作者: Yao Fehlis,Charles Crain,Aidan Jensen,Michael Watson,James Juhasz,Paul Mandel,Betty Liu,Shawn Mahon,Daren Wilson,Nick Lynch-Jonely,Ben Leedom,David Fuller
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building on the conceptual framework presented in our previous work on agentic AI for pharmaceutical research, this paper provides a comprehensive technical analysis of Tippy’s multi-agent system implementation for drug discovery laboratory automation. We present a distributed microservices architecture featuring five specialized agents (Supervisor, Molecule, Lab, Analysis, and Report) that coordinate through OpenAI Agents SDK orchestration and access laboratory tools via the Model Context Protocol (MCP). The system architecture encompasses agent-specific tool integration, asynchronous communication patterns, and comprehensive configuration management through Git-based tracking. Our production deployment strategy utilizes Kubernetes container orchestration with Helm charts, Docker containerization, and CI/CD pipelines for automated testing and deployment. The implementation integrates vector databases for RAG functionality and employs an Envoy reverse proxy for secure external access. This work demonstrates how specialized AI agents can effectively coordinate complex laboratory workflows while maintaining security, scalability, reliability, and integration with existing laboratory infrastructure through standardized protocols.
zh

[AI-47] Performance Evaluation and Threat Mitigation in Large-scale 5G Core Deployment

【速读】:该论文旨在解决大规模5G核心网功能部署中因资源分配不当导致的服务性能下降问题,尤其是在面对由分布式拒绝服务(Distributed Denial of Service, DDoS)引发的混沌工作负载时,对用户设备(User Equipment)注册性能的影响。其解决方案的关键在于提出多样化资源配置策略以保障服务等级协议(Service-Level Agreement, SLA)合规性,并通过内核级监控方法实现可扩展的安全威胁防御,从而提升5G网络功能(Network Functions, NFs)在复杂场景下的部署效率与稳定性。

链接: https://arxiv.org/abs/2507.17850
作者: Rodrigo Moreira,Larissa F. Rodrigues Moreira,Flávio de Oliveira Silva
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The deployment of large-scale software-based 5G core functions presents significant challenges due to their reliance on optimized and intelligent resource provisioning for their services. Many studies have focused on analyzing the impact of resource allocation for complex deployments using mathematical models, queue theories, or even Artificial Intelligence (AI). This paper elucidates the effects of chaotic workloads, generated by Distributed Denial of Service (DDoS) attacks on different Network Functions (NFs), on User Equipment registration performance. Our findings highlight the necessity of diverse resource profiles to ensure Service-Level Agreement (SLA) compliance in large-scale 5G core deployments. Additionally, our analysis of packet capture approaches demonstrates the potential of kernel-based monitoring for scalable security threat defense. Finally, our empirical evaluation provides insights into the effective deployment of 5G NFs in complex scenarios.
zh

[AI-48] Explainable Graph Neural Networks via Structural Externalities

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在实际应用中因“黑箱”特性导致的可解释性不足问题,尤其是现有方法难以有效捕捉节点间复杂交互模式的局限性。其解决方案的关键在于提出一种基于合作博弈论与社会外部性的新框架 GraphEXT:通过将图节点划分为联盟(coalitions),将原图分解为独立子图,并引入结构作为外部性因素,结合考虑外部性的 Shapley 值来量化节点重要性——即衡量节点在不同联盟间转移时对 GNN 预测结果的边际贡献。相较于传统基于 Shapley 值的方法仅关注节点属性,GraphEXT 更强调节点间的相互作用及其结构变化对预测的影响,从而显著提升 GNN 模型的解释能力。

链接: https://arxiv.org/abs/2507.17848
作者: Lijun Wu,Dong Hao,Zhiyi Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have achieved outstanding performance across a wide range of graph-related tasks. However, their “black-box” nature poses significant challenges to their explainability, and existing methods often fail to effectively capture the intricate interaction patterns among nodes within the network. In this work, we propose a novel explainability framework, GraphEXT, which leverages cooperative game theory and the concept of social externalities. GraphEXT partitions graph nodes into coalitions, decomposing the original graph into independent subgraphs. By integrating graph structure as an externality and incorporating the Shapley value under externalities, GraphEXT quantifies node importance through their marginal contributions to GNN predictions as the nodes transition between coalitions. Unlike traditional Shapley value-based methods that primarily focus on node attributes, our GraphEXT places greater emphasis on the interactions among nodes and the impact of structural changes on GNN predictions. Experimental studies on both synthetic and real-world datasets show that GraphEXT outperforms existing baseline methods in terms of fidelity across diverse GNN architectures, significantly enhancing the explainability of GNN models.
zh
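
For intuition, Shapley values are typically approximated by permutation sampling; a minimal sketch follows, where `value_fn` is a hypothetical stand-in for a GNN's prediction score on the subgraph induced by a coalition. GraphEXT's externality-aware variant additionally conditions each marginal contribution on how the remaining nodes are partitioned, which this plain sketch omits:

```python
import random

def shapley_values(nodes, value_fn, n_perm=500, seed=0):
    """Permutation-sampling estimate of Shapley values for node importance."""
    rng = random.Random(seed)
    phi = {v: 0.0 for v in nodes}
    for _ in range(n_perm):
        order = nodes[:]
        rng.shuffle(order)
        coalition = frozenset()
        prev = value_fn(coalition)
        for v in order:
            coalition = coalition | {v}
            cur = value_fn(coalition)
            phi[v] += (cur - prev) / n_perm  # running average of marginal gains
            prev = cur
    return phi

# Toy value function: nodes "a" and "b" only contribute jointly, so each
# receives half the credit; "c" and "d" get none.
nodes = ["a", "b", "c", "d"]
value = lambda S: 1.0 if {"a", "b"} <= S else 0.0
print(shapley_values(nodes, value))  # phi[a] ~ phi[b] ~ 0.5
```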

[AI-49] Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data

【速读】:该论文旨在解决机器学习在表格数据(tabular data)分析中缺乏可重现性和可解释性的问题,尤其是在实验数据溯源(provenance)方面,难以确保整个分析流程(包括数据预处理和方法选择等决策)被完整记录、访问、复现并为相关利益方理解。解决方案的关键在于构建一个开源、可扩展的 Python 软件框架 Helix,其核心能力包括标准化的数据预处理、可视化、模型训练与评估、结果检查以及对未见数据的预测,并集成了一种基于语言术语的新型可解释方法,使非数据科学背景的研究人员也能在统一环境中设计计算实验、解读模型决策并获得可行动的洞察。该框架遵循 FAIR 原则,支持社区驱动开发,提升了研究工作的透明度与协作效率。

链接: https://arxiv.org/abs/2507.17791
作者: Eduardo Aguilar-Bejarano,Daniel Lea,Karthikeyan Sivakumar,Jimiama M. Mase,Reza Omidvar,Ruizhe Li,Troy Kettle,James Mitchell-White,Morgan R Alexander,David A Winkler,Grazziela Figueredo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages

点击查看摘要

Abstract:Helix is an open-source, extensible, Python-based software framework to facilitate reproducible and interpretable machine learning workflows for tabular data. It addresses the growing need for transparent experimental data analytics provenance, ensuring that the entire analytical process – including decisions around data transformation and methodological choices – is documented, accessible, reproducible, and comprehensible to relevant stakeholders. The platform comprises modules for standardised data preprocessing, visualisation, machine learning model training, evaluation, interpretation, results inspection, and model prediction for unseen data. To further empower researchers without formal training in data science to derive meaningful and actionable insights, Helix features a user-friendly interface that enables the design of computational experiments and the inspection of outcomes, including a novel approach to interpreting machine learning decisions using linguistic terms, all within an integrated environment. Released under the MIT licence, Helix is accessible via GitHub and PyPI, supporting community-driven development and promoting adherence to the FAIR principles.
zh

[AI-50] Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在排序或评估任务中因候选项位置不同而导致的位置偏差(position bias)以及重复调用时产生的低重复一致性(low repetition consistency)问题。这些问题会导致模型输出不稳定,且现有基于静态多次重复调用并取多数投票的策略虽然有效但计算成本高昂。论文的关键解决方案是提出一种动态早停机制(dynamic early-stopping method),该机制能根据每个实例的响应稳定性自适应地确定所需的重复次数,从而显著减少LLM调用次数(平均降低81%),同时保持与静态重复方法相当的准确性;进一步引入基于置信度的改进版本,使调用次数再降15%(平均降低87%),仅带来轻微精度损失。

链接: https://arxiv.org/abs/2507.17788
作者: Ali Vardasbi,Gustavo Penha,Claudia Hauff,Hugues Bouchard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When using LLMs to rank items based on given criteria, or evaluate answers, the order of candidate items can influence the model’s final decision. This sensitivity to item positioning in a LLM’s prompt is known as position bias. Prior research shows that this bias exists even in large models, though its severity varies across models and tasks. In addition to position bias, LLMs also exhibit varying degrees of low repetition consistency, where repeating the LLM call with the same candidate ordering can lead to different rankings. To address both inconsistencies, a common approach is to prompt the model multiple times with different candidate orderings and aggregate the results via majority voting. However, this repetition strategy significantly increases computational costs. Extending prior findings, we observe that both the direction – favoring either the earlier or later candidate in the prompt – and magnitude of position bias across instances vary substantially, even within a single dataset. This observation highlights the need for a per-instance mitigation strategy. To this end, we introduce a dynamic early-stopping method that adaptively determines the number of repetitions required for each instance. Evaluating our approach across three LLMs of varying sizes and on two tasks, namely re-ranking and alignment, we demonstrate that transitioning to a dynamic repetition strategy reduces the number of LLM calls by an average of 81%, while preserving the accuracy. Furthermore, we propose a confidence-based adaptation to our early-stopping method, reducing LLM calls by an average of 87% compared to static repetition, with only a slight accuracy trade-off relative to our original early-stopping method.
zh
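
A minimal sketch of the general idea of stopping repetitions early once the majority outcome is already decided. The paper's per-instance and confidence-based criteria are more refined than this simple decided-majority rule, and `fake_llm` is a stand-in for a real ranking call:

```python
import random
from collections import Counter

def adaptive_majority(llm_call, orderings, max_reps=9):
    """Repeat the LLM call with different candidate orderings, stopping as soon
    as the leading answer can no longer be overtaken by the remaining calls."""
    votes = Counter()
    n_used = 0
    for order in orderings[:max_reps]:
        votes[llm_call(order)] += 1
        n_used += 1
        (top, top_n), *rest = votes.most_common()
        runner_up = rest[0][1] if rest else 0
        if top_n - runner_up > max_reps - n_used:  # outcome already decided
            break
    return votes.most_common(1)[0][0], n_used

# Stand-in for a real ranking call; returns "A" 80% of the time.
fake_llm = lambda order: "A" if random.random() < 0.8 else "B"
orderings = [f"permutation-{i}" for i in range(9)]
print(adaptive_majority(fake_llm, orderings))  # often stops well before 9 calls
```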

[AI-51] Hyperbolic Deep Learning for Foundation Models: A Survey KDD2025

【速读】:该论文旨在解决当前基础模型(foundation models)在表示能力、适应性和可扩展性方面的局限性问题,核心质疑在于:欧几里得几何(Euclidean geometry)是否仍是所有基础模型的最优归纳偏置。其解决方案的关键在于引入双曲空间(hyperbolic space)——一种非欧几里得流形,具有随距离呈指数增长的体积特性,能够以更低维度实现对层次结构(如树状结构、分类体系)和幂律分布数据的低失真嵌入。通过将双曲几何纳入神经网络设计,该研究显著提升了大语言模型(LLMs)、视觉-语言模型(VLMs)及多模态模型的复杂推理能力、零样本泛化性能与跨模态语义对齐效果,同时保持参数效率。

链接: https://arxiv.org/abs/2507.17787
作者: Neil He,Hiren Madhu,Ngoc Bui,Menglin Yang,Rex Ying
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 Pages, SIGKDD 2025

点击查看摘要

Abstract:Foundation models pre-trained on massive datasets, including large language models (LLMs), vision-language models (VLMs), and large multimodal models, have demonstrated remarkable success in diverse downstream tasks. However, recent studies have shown fundamental limitations of these models: (1) limited representational capacity, (2) lower adaptability, and (3) diminishing scalability. These shortcomings raise a critical question: is Euclidean geometry truly the optimal inductive bias for all foundation models, or could incorporating alternative geometric spaces enable models to better align with the intrinsic structure of real-world data and improve reasoning processes? Hyperbolic spaces, a class of non-Euclidean manifolds characterized by exponential volume growth with respect to distance, offer a mathematically grounded solution. These spaces enable low-distortion embeddings of hierarchical structures (e.g., trees, taxonomies) and power-law distributions with substantially fewer dimensions compared to Euclidean counterparts. Recent advances have leveraged these properties to enhance foundation models, including improving LLMs’ complex reasoning ability, VLMs’ zero-shot generalization, and cross-modal semantic alignment, while maintaining parameter efficiency. This paper provides a comprehensive review of hyperbolic neural networks and their recent development for foundation models. We further outline key challenges and research directions to advance the field.
zh
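
The geometric property driving these gains is easiest to see in the Poincaré ball model, where distances grow rapidly toward the boundary; a minimal sketch of the standard distance formula (not tied to any specific model in the survey):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between points u, v inside the unit Poincare ball."""
    sq = lambda x: float(np.dot(x, x))
    num = 2.0 * sq(u - v)
    den = max((1.0 - sq(u)) * (1.0 - sq(v)), eps)
    return float(np.arccosh(1.0 + num / den))

# Distances blow up exponentially near the boundary -- the extra "room" that
# lets trees and other hierarchies embed with low distortion in few dimensions.
origin = np.array([0.0, 0.0])
print(poincare_distance(origin, np.array([0.5, 0.0])))   # moderate
print(poincare_distance(origin, np.array([0.99, 0.0])))  # much larger
```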

[AI-52] In Reverie Together: Ten Years of Mathematical Discovery with a Machine Collaborator

【速读】:该论文试图解决的问题是:如何通过自动化系统生成具有数学深度和创造性价值的图论猜想,从而促进人机协作下的数学发现。其解决方案的关键在于开发并应用名为TxGraffiti的自动化猜想生成系统,该系统基于符号模式识别与人类定义的启发式规则,在大量图结构数据中提炼出简洁、自然且经实证验证的图论猜想;这些猜想虽尚未被证明或反驳,但体现了机器在数学创造过程中的参与潜力,旨在激发人类数学家与AI共同探索其背后的理论本质,并反思人工智能在科研创新中的角色与意义。

链接: https://arxiv.org/abs/2507.17780
作者: Randy Davila,Boris Brimkov,Ryan Pepper
机构: 未知
类目: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Combinatorics (math.CO)
备注:

点击查看摘要

Abstract:We present four open conjectures in graph theory generated by the automated conjecturing system TxGraffiti. Each conjecture is concise, grounded in natural graph invariants, and empirically validated across hundreds of graphs. Despite extensive effort, these statements remain unresolved–defying both proof and counterexample. They are not only mathematical challenges but creative expressions–born of symbolic pattern recognition and mathematician-defined heuristics, refined through years of human dialogue, and now offered back to the community as collaborative artifacts. These conjectures invite not only formal proof, but also reflection on how machines can evoke wonder, spark curiosity, and contribute to the raw material of discovery. By highlighting these problems, we aim to inspire both human mathematicians and AI systems to engage with them–not only to solve them, but to reflect on what it means when machines participate meaningfully in the creative process of mathematical thought.
zh

[AI-53] An advanced AI driven database system

【速读】:该论文旨在解决当前数据库系统在复杂性和可用性方面的严重问题,尤其是针对缺乏技术背景的用户难以使用SQL等查询语言进行数据管理的痛点。其解决方案的关键在于引入人工智能(AI)技术,特别是通过集成大语言模型(Large Language Models, LLMs)和先进机器学习算法,实现基于自然语言处理(Natural Language Processing, NLP)的直观交互界面,并自动完成结构化查询生成、半结构化数据格式(如YAML、JSON、API文档)转换以及数据建模与性能优化等任务。系统利用生成式模式推理(generative schema inference)和格式选择机制,显著降低对人工干预的需求,减少人为错误并提升数据库的易用性与效率。

链接: https://arxiv.org/abs/2507.17778
作者: M. Tedeschi,S. Rizwan,C. Shringi,V. Devram Chandgir,S. Belich
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages, 5 figures, appears in EDULEARN25 Conference Proceedings

点击查看摘要

Abstract:Contemporary database systems, while effective, suffer severe issues related to complexity and usability, especially among individuals who lack technical expertise but are unfamiliar with query languages like Structured Query Language (SQL). This paper presents a new database system supported by Artificial Intelligence (AI), which is intended to improve the management of data using natural language processing (NLP) - based intuitive interfaces, and automatic creation of structured queries and semi-structured data formats like Yet Another Markup Language (YAML), JavaScript Object Notation (JSON), and Application Program Interface (API) documentation. The system is intended to strengthen the potential of databases through the integration of Large Language Models (LLMs) and advanced machine learning algorithms. The integration is purposed to allow the automation of fundamental tasks such as data modeling, schema creation, query comprehension, and performance optimization. We present in this paper a system that aims to alleviate the main problems with current database technologies. It is meant to reduce the need for technical skills, manual tuning for better performance, and the potential for human error. The AI database employs generative schema inference and format selection to build its schema models and execution formats.
zh

[AI-54] ASP-Assisted Symbolic Regression: Uncovering Hidden Physics in Fluid Mechanics

【速读】:该论文旨在解决流体力学中数据驱动模型缺乏可解释性的问题,尤其是在三维不可压缩流动建模中,传统机器学习方法常被视为“黑箱”,难以与物理规律对齐。解决方案的关键在于引入符号回归(Symbolic Regression, SR)技术,从数值模拟数据中直接推导出简洁且可解释的数学表达式,从而准确复现如轴向速度分布和压力梯度等关键流场特征,并与文献中的解析解高度一致;同时,论文创新性地将SR与答案集编程(Answer Set Programming, ASP)相结合,构建了SR/ASP混合框架,通过引入领域知识约束,确保生成的符号表达式不仅统计上准确,而且符合物理守恒原理,显著提升了模型的可靠性与物理解释能力。

链接: https://arxiv.org/abs/2507.17777
作者: Theofanis Aravanis,Grigorios Chrimatopoulos,Mohammad Ferdows,Michalis Xenos,Efstratios Em Tzirtzilakis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This research was implemented in the framework of the Action "Flagship actions in interdisciplinary scientific fields with a special focus on the productive fabric", which is implemented through the National Recovery and Resilience Fund Greece 2.0 and funded by the European Union–NextGenerationEU (Project ID: TAEDR-0535983)

点击查看摘要

Abstract:Unlike conventional Machine-Learning (ML) approaches, often criticized as “black boxes”, Symbolic Regression (SR) stands out as a powerful tool for revealing interpretable mathematical relationships in complex physical systems, requiring no a priori assumptions about models’ structures. Motivated by the recognition that, in fluid mechanics, an understanding of the underlying flow physics is as crucial as accurate prediction, this study applies SR to model a fundamental three-dimensional (3D) incompressible flow in a rectangular channel, focusing on the (axial) velocity and pressure fields under laminar conditions. By employing the PySR library, compact symbolic equations were derived directly from numerical simulation data, revealing key characteristics of the flow dynamics. These equations not only approximate the parabolic velocity profile and pressure drop observed in the studied fluid flow, but also perfectly coincide with analytical solutions from the literature. Furthermore, we propose an innovative approach that integrates SR with the knowledge-representation framework of Answer Set Programming (ASP), combining the generative power of SR with the declarative reasoning strengths of ASP. The proposed hybrid SR/ASP framework ensures that the SR-generated symbolic expressions are not only statistically accurate, but also physically plausible, adhering to domain-specific principles. Overall, the study highlights two key contributions: SR’s ability to simplify complex flow behaviours into concise, interpretable equations, and the potential of knowledge-representation approaches to improve the reliability and alignment of data-driven SR models with domain principles. Insights from the examined 3D channel flow pave the way for integrating such hybrid approaches into efficient frameworks, […] where explainable predictions and real-time data analysis are crucial.
zh
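
As a rough illustration of the symbolic-regression step, here is a minimal PySR sketch that recovers a parabolic profile from synthetic samples. It assumes the `pysr` package (with its Julia backend) is installed, and the toy data merely stands in for the paper's simulation outputs:

```python
import numpy as np
from pysr import PySRRegressor  # assumes pysr and its Julia backend are installed

# Toy stand-in for the simulation data: samples of a parabolic velocity
# profile u(y) = U_max * (1 - (y/h)^2), the shape the paper recovers.
h, U_max = 1.0, 2.0
y = np.random.uniform(-h, h, (500, 1))
u = U_max * (1.0 - (y[:, 0] / h) ** 2)

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    maxsize=15,  # bias the search toward compact, interpretable expressions
)
model.fit(y, u)
print(model.sympy())  # best expression found, e.g. 2 - 2*y**2
```

In the hybrid SR/ASP framework, candidate expressions like the one printed above would additionally be checked against declarative domain constraints (e.g., symmetry or boundary conditions) before being accepted.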

[AI-55] Human-AI Co-Creation: A Framework for Collaborative Design in Intelligent Systems

【速读】:该论文试图解决的问题是如何在早期设计阶段有效整合生成式 AI(Generative AI),以重塑传统以人为中心的设计流程,使其从单纯的计算工具转变为具有主动参与能力的创意合作者。解决方案的关键在于提出“人-AI共同创造”(human-AI co-creation)的新范式,其中大型语言模型(Large Language Models, LLMs)如 GPT-4 和多模态扩散模型(Multimodal Diffusion Models)如 Stable Diffusion 作为创造性代理(creative agents),与设计师形成迭代式的提案—批判—修订循环,从而在概念生成、可视化构思和决策过程中实现深度协同。

链接: https://arxiv.org/abs/2507.17774
作者: Zhangqi Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As artificial intelligence (AI) continues to evolve from a back-end computational tool into an interactive, generative collaborator, its integration into early-stage design processes demands a rethinking of traditional workflows in human-centered design. This paper explores the emergent paradigm of human-AI co-creation, where AI is not merely used for automation or efficiency gains, but actively participates in ideation, visual conceptualization, and decision-making. Specifically, we investigate the use of large language models (LLMs) like GPT-4 and multimodal diffusion models such as Stable Diffusion as creative agents that engage designers in iterative cycles of proposal, critique, and revision.
zh

[AI-56] How Instructional Sequence and Personalized Support Impact Diagnostic Strategy Learning

【速读】:该论文旨在解决初学者在诊断推理能力发展过程中面临的认知偏差问题,如过早闭合(premature closure)和对启发式策略的过度依赖。其解决方案的关键在于优化情境化学习(Scenario-based Learning, SBL)中 instructional sequence 的设计:通过在在线SBL平台PharmaSim中对比“先讲授诊断策略再解决问题”(I-PS)与“先解决问题再提供策略指导”(PS-I)两种教学序列,发现后一种方式显著提升了学习迁移表现,表明将显性诊断策略教学置于问题解决之后更有利于促进深层理解和应用能力的发展。

链接: https://arxiv.org/abs/2507.17760
作者: Fatma Betül Güreş,Tanya Nazaretsky,Bahar Radmehr,Martina Rau,Tanja Käser
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to AIED 2025 main track

点击查看摘要

Abstract:Supporting students in developing effective diagnostic reasoning is a key challenge in various educational domains. Novices often struggle with cognitive biases such as premature closure and over-reliance on heuristics. Scenario-based learning (SBL) can address these challenges by offering realistic case experiences and iterative practice, but the optimal sequencing of instruction and problem-solving activities remains unclear. This study examines how personalized support can be incorporated into different instructional sequences and whether providing explicit diagnostic strategy instruction before (I-PS) or after problem-solving (PS-I) improves learning and its transfer. We employ a between-groups design in an online SBL environment called PharmaSim, which simulates real-world client interactions for pharmacy technician apprentices. Results indicate that while both instruction types are beneficial, PS-I leads to significantly higher performance in transfer tasks.
zh

[AI-57] Insights from Railway Professionals: Rethinking Railway assumptions regarding safety and autonomy

【速读】:该论文试图解决铁路行业中安全概念的复杂性与技术发展之间脱节的问题,特别是如何在自动化和生成式 AI (Generative AI) 等新兴技术引入铁路系统时,确保其符合专业人员对“安全”的实际认知。解决方案的关键在于通过访谈司机、路线规划人员及行政人员,识别出铁路从业人员对安全的多维理解——即融合人为因素、系统性和技术因素,并强调应优先采用辅助性技术而非完全自动化;同时提出需构建铁路专属的因果模型(railway-specific causation model),以替代从汽车领域迁移的自动化技术评估框架,从而更有效地提升铁路系统的安全性与适应性。

链接: https://arxiv.org/abs/2507.17756
作者: Josh Hunter,John McDermid,Simon Burton
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, published in European Dependable Computing Conference 2025

点击查看摘要

Abstract:This study investigates how railway professionals perceive safety as a concept within rail, with the intention to help inform future technological developments within the industry. Through a series of interviews with drivers, route planners, and administrative personnel, the research explores the current state of safety practices, the potential for automation, and the understanding of the railway as a system of systems. Key findings highlight a cautious attitude towards automation, a preference for assistive technologies, and a complex understanding of safety that integrates human, systematic and technological factors. The study also addresses the limitations of transferring automotive automation technologies to railways and the need for a railway-specific causation model to better evaluate and enhance safety in an evolving technological landscape. This study aims to bridge the gap between contemporary research and practical applications, contributing to the development of more effective safety metrics.
zh

[AI-58] A Custom-Built Ambient Scribe Reduces Cognitive Load and Documentation Burden for Telehealth Clinicians

【速读】:该论文旨在解决临床医生职业倦怠(clinician burnout)问题,其根源之一是繁重的电子健康记录(EHR)文书工作负担。解决方案的关键在于开发并部署一个定制化的环境式医疗助理(ambient medical scribe)应用,该应用无缝集成于EHR系统中,利用Whisper模型实现语音转录,并通过基于GPT-4o的模块化上下文学习流水线自动生成SOAP病历和患者指导说明。实验表明,生成的病历质量优于专家撰写版本,且在真实临床环境中获得广泛采用,显著降低医生的认知负荷和文档负担,同时通过微调BART模型进一步提升病历的简洁性,体现了AI系统在减轻行政负担、提升医疗效率与质量方面的潜力。

链接: https://arxiv.org/abs/2507.17754
作者: Justin Morse,Kurt Gilbert,Kyle Shin,Rick Cooke,Peyton Rose,Jack Sullivan,Angelo Sisante
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Clinician burnout has motivated the growing adoption of ambient medical scribes in the clinic. In this work, we introduce a custom-built ambient scribe application integrated into the EHR system at Included Health, a personalized all-in-one healthcare company offering telehealth services. The application uses Whisper for transcription and a modular in-context learning pipeline with GPT-4o to automatically generate SOAP notes and patient instructions. Testing on mock visit data shows that the notes generated by the application exceed the quality of expert-written notes as determined by an LLM-as-a-judge. The application has been widely adopted by the clinical practice, with over 540 clinicians at Included Health using the application at least once. 94% (n = 63) of surveyed clinicians report reduced cognitive load during visits and 97% (n = 66) report less documentation burden when using the application. Additionally, we show that post-processing notes with a fine-tuned BART model improves conciseness. These findings highlight the potential for AI systems to ease administrative burdens and support clinicians in delivering efficient, high-quality care.
zh

[AI-59] A Foundation Model for Massive MIMO Precoding with an Adaptive per-User Rate-Power Tradeoff

【速读】:该论文旨在解决大规模多输入多输出(massive multiple-input multiple-output, mMIMO)系统中基于深度学习(deep learning, DL)的预编码模型在实际部署时面临的两个核心挑战:一是高质量本地训练数据难以获取,二是模型训练与推理复杂度高。其解决方案的关键在于提出了一种基于Transformer架构的预编码基础模型(foundation model),该模型能够在不进行微调的情况下实现零样本(zero-shot)部署,显著优于传统的迫零(zero-forcing)算法,并接近加权最小均方误差(weighted minimum mean squared error, WMMSE)性能但计算复杂度降低8倍;同时,为应对数据稀缺场景,引入一种基于预训练特征提取器输出余弦相似度的数据增强方法,有效提升模型对目标分布的适应能力,从而推动DL方案在mMIMO系统中的实用化落地。

链接: https://arxiv.org/abs/2507.18587
作者: Jérôme Emery,Ali Hasanzadeh Karkan,Jean-François Frigon,François Leduc-Primeau
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures. Accepted to the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC) 2025

点击查看摘要

Abstract:Deep learning (DL) has emerged as a solution for precoding in massive multiple-input multiple-output (mMIMO) systems due to its capacity to learn the characteristics of the propagation environment. However, training such a model requires high-quality, local datasets at the deployment site, which are often difficult to collect. We propose a transformer-based foundation model for mMIMO precoding that seeks to minimize the energy consumption of the transmitter while dynamically adapting to per-user rate requirements. At equal energy consumption, zero-shot deployment of the proposed foundation model significantly outperforms zero forcing, and approaches weighted minimum mean squared error performance with 8x less complexity. To address model adaptation in data-scarce settings, we introduce a data augmentation method that finds training samples similar to the target distribution by computing the cosine similarity between the outputs of the pre-trained feature extractor. Our work enables the implementation of DL-based solutions in practice by addressing challenges of data availability and training complexity. Moreover, the ability to dynamically configure per-user rate requirements can be leveraged by higher level resource allocation and scheduling algorithms for greater control over energy efficiency, spectral efficiency and fairness.
zh
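
A minimal sketch of the cosine-similarity data-selection idea described above. The mean-direction matching rule and the top-k cutoff are simplifying assumptions, and the feature arrays are placeholders for the pre-trained extractor's outputs:

```python
import numpy as np

def select_similar(pool_feats, target_feats, k=1000):
    """Rank candidate training samples by cosine similarity between the
    pre-trained extractor's features and the mean target-site direction."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)
    pool = unit(pool_feats)                   # (N, d) candidate features
    target = unit(target_feats).mean(axis=0)  # mean direction of target data
    target /= np.linalg.norm(target) + 1e-12
    sims = pool @ target                      # cosine similarity per candidate
    return np.argsort(-sims)[:k]              # indices of the k best matches

pool = np.random.randn(5000, 64)    # placeholder pre-trained features
target = np.random.randn(50, 64)    # few-shot features from the new site
print(select_similar(pool, target, k=10))
```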

[AI-60] Advancing Financial Engineering with Foundation Models: Progress, Applications and Challenges

【速读】:该论文旨在解决金融领域中通用大模型(如GPT-4和Gemini)在实际应用中受限于特定行业需求的问题,例如多模态推理、监管合规性和数据隐私等挑战。其解决方案的关键在于提出并系统梳理“金融基础模型”(Financial Foundation Models, FFMs)这一新范式,通过构建三类专为金融场景设计的模型:金融语言基础模型(FinLFMs)、金融时间序列基础模型(FinTSFMs)和金融视觉-语言基础模型(FinVLFMs),从架构、训练方法、数据集到应用场景进行全面综述,并指出当前在数据可用性、算法可扩展性和基础设施方面的关键挑战,从而为未来研究提供清晰的方向与实践路径。

链接: https://arxiv.org/abs/2507.18577
作者: Liyuan Chen,Shuoling Liu,Jiangpeng Yan,Xiaoyu Wang,Henglin Liu,Chuang Li,Kecheng Jiao,Jixuan Ying,Yang Veronica Liu,Qiang Yang,Xiu Li
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:The advent of foundation models (FMs) - large-scale pre-trained models with strong generalization capabilities - has opened new frontiers for financial engineering. While general-purpose FMs such as GPT-4 and Gemini have demonstrated promising performance in tasks ranging from financial report summarization to sentiment-aware forecasting, many financial applications remain constrained by unique domain requirements such as multimodal reasoning, regulatory compliance, and data privacy. These challenges have spurred the emergence of Financial Foundation Models (FFMs) - a new class of models explicitly designed for finance. This survey presents a comprehensive overview of FFMs, with a taxonomy spanning three key modalities: Financial Language Foundation Models (FinLFMs), Financial Time-Series Foundation Models (FinTSFMs), and Financial Visual-Language Foundation Models (FinVLFMs). We review their architectures, training methodologies, datasets, and real-world applications. Furthermore, we identify critical challenges in data availability, algorithmic scalability, and infrastructure constraints, and offer insights into future research opportunities. We hope this survey serves as both a comprehensive reference for understanding FFMs and a practical roadmap for future innovation. An updated collection of FFM-related publications and resources will be maintained on our website this https URL.
zh

[AI-61] HARLF: Hierarchical Reinforcement Learning and Lightweight LLM-Driven Sentiment Integration for Financial Portfolio Optimization

【速读】:该论文旨在解决传统投资组合优化方法难以有效融合多源异构信息(如金融新闻情感信号与历史市场指标)的问题,从而提升资产配置的收益风险比。其解决方案的关键在于提出了一种分层式强化学习框架,通过轻量级大语言模型(LLMs)提取新闻情感特征,并与传统技术指标融合,由三层智能体(基础智能体、元智能体和超智能体)逐级聚合决策,实现跨模态数据的可扩展整合与策略稳定性增强,最终在2018–2024年测试期内实现了年化收益率26%、夏普比率1.2的优异表现。

链接: https://arxiv.org/abs/2507.18560
作者: Benjamin Coriat,Eric Benhamou
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
zh
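
For reference, the two headline metrics are standard; a minimal sketch of how an annualized return and Sharpe ratio are computed from daily returns (toy data, risk-free rate assumed zero):

```python
import numpy as np

def annualized_return(daily_returns, periods=252):
    """Geometric annualization of a series of simple daily returns."""
    growth = float(np.prod(1.0 + daily_returns))
    return growth ** (periods / len(daily_returns)) - 1.0

def sharpe_ratio(daily_returns, risk_free_annual=0.0, periods=252):
    """Annualized Sharpe ratio from daily returns."""
    excess = daily_returns - risk_free_annual / periods
    return float(np.sqrt(periods) * excess.mean() / excess.std(ddof=1))

# Toy return series; with real portfolio data, the 26% annualized return and
# Sharpe ~1.2 headline numbers would be computed exactly like this.
r = np.random.normal(0.001, 0.012, 1500)
print(annualized_return(r), sharpe_ratio(r))
```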

[AI-62] From Individual Learning to Market Equilibrium: Correcting Structural and Parametric Biases in RL Simulations of Economic Models

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在经济建模中面临的两个核心问题:一是RL代理(agent)的非均衡行为与传统经济理论中“原子化个体作为市场条件接受者”的假设相冲突;二是经济贴现与RL对跨期成本处理之间的参数偏差。解决方案的关键在于提出一种校准后的均值场强化学习(Mean-Field Reinforcement Learning, MFRL)框架,通过将代表性代理嵌入固定宏观经济场,并调整其成本函数以反映经济机会成本,从而实现代理策略与竞争性均衡的一致性。该方法通过迭代算法收敛至自洽固定点,为计算社会科学研究中学习代理的建模提供了可操作且理论严谨的路径。

链接: https://arxiv.org/abs/2507.18229
作者: Zeqiang Zhang,Ruxin Chen
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The application of Reinforcement Learning (RL) to economic modeling reveals a fundamental conflict between the assumptions of equilibrium theory and the emergent behavior of learning agents. While canonical economic models assume atomistic agents act as 'takers' of aggregate market conditions, a naive single-agent RL simulation incentivizes the agent to become a 'manipulator' of its environment. This paper first demonstrates this discrepancy within a search-and-matching model with concave production, showing that a standard RL agent learns a non-equilibrium, monopsonistic policy. Additionally, we identify a parametric bias arising from the mismatch between economic discounting and RL’s treatment of intertemporal costs. To address both issues, we propose a calibrated Mean-Field Reinforcement Learning framework that embeds a representative agent in a fixed macroeconomic field and adjusts the cost function to reflect economic opportunity costs. Our iterative algorithm converges to a self-consistent fixed point where the agent’s policy aligns with the competitive equilibrium. This approach provides a tractable and theoretically sound methodology for modeling learning agents in economic systems within the broader domain of computational social science.
zh

[AI-63] Axiomatizing Rumsfeld Ignorance

【速读】:该论文旨在解决Kit Fine在先前研究中提出的关于一阶无知(first-order ignorance)、二阶无知(second-order ignorance)与Rumsfeld无知(Rumsfeld ignorance)逻辑性质分析时所遇到的可定义性问题。具体而言,Rumsfeld无知可被普通无知定义,导致部分结论和公理化问题变得平凡。为克服这一局限,本文的关键解决方案在于假设嵌套在不同无知算子中的可达关系(accessibility relations)不再相同,而是允许其中一个为另一个的任意子集,从而避免了可定义性问题,并保留了原有大部分有效公式。在此设定下,论文实现了在各类适当双框架类(proper bi-frame classes)上的完备公理化体系构建,最终用于重新分析Fine的结果。

链接: https://arxiv.org/abs/2507.17776
作者: Jie Fan
机构: 未知
类目: Logic (math.LO); Artificial Intelligence (cs.AI)
备注: This is an almost-final version

点击查看摘要

Abstract:In a recent paper, Kit Fine presents some striking results concerning the logical properties of (first-order) ignorance, second-order ignorance and Rumsfeld ignorance. However, Rumsfeld ignorance is definable in terms of ignorance, which makes some existing results and the axiomatization problem trivial. A main reason is that the accessibility relations for the implicit knowledge operator contained in the packaged operators of ignorance and Rumsfeld ignorance are the same. In this work, we assume the two accessibility relations to be different so that one of them is an arbitrary subset of the other. This will avoid the definability issue and retain most of the previous validities. The main results are axiomatizations over various proper bi-frame classes. Finally we apply our framework to analyze Fine’s results.
zh

[AI-64] Comparison of Optimised Geometric Deep Learning Architectures over Varying Toxicological Assay Data Environments

【速读】:该论文旨在解决不同图神经网络(Graph Neural Network, GNN)架构在化学信息学中用于毒性预测任务时性能差异不明确的问题,特别是针对数据丰富与数据稀缺场景下各GNN模型的适应性问题。解决方案的关键在于系统比较了三种主流GNN架构——图卷积网络(Graph Convolutional Network, GCN)、图注意力网络(Graph Attention Network, GAT)和图同构网络(Graph Isomorphism Network, GIN)——在7个不同数据量和终点类型的毒理学检测数据集上的二分类表现,并通过贝叶斯优化确定每种GNN在每个数据集上的最优超参数配置。结果表明:GIN在数据丰富的场景下持续优于GCN和GAT,而GAT在数据稀缺场景下显著优于其他两种架构,揭示了GNN架构选择需依据数据规模进行适配,且GIN具有独特的高维超参数空间优化特性,体现出其作为GNN算法的独特性。

链接: https://arxiv.org/abs/2507.17775
作者: Alexander D. Kalian,Lennart Otte,Jaewook Lee,Emilio Benfenati,Jean-Lou C.M. Dorne,Claire Potter,Olivia J. Osborne,Miao Guo,Christer Hogstrand
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Geometric deep learning is an emerging technique in Artificial Intelligence (AI) driven cheminformatics; however, the unique implications of different Graph Neural Network (GNN) architectures are poorly explored for this space. This study compared performances of Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs) and Graph Isomorphism Networks (GINs), applied to 7 different toxicological assay datasets of varying data abundance and endpoint, to perform binary classification of assay activation. Following pre-processing of molecular graphs, enforcement of class-balance and stratification of all datasets across 5 folds, Bayesian optimisations were carried out, for each GNN applied to each assay dataset (resulting in 21 unique Bayesian optimisations). Optimised GNNs performed at Area Under the Curve (AUC) scores ranging from 0.728-0.849 (averaged across all folds), naturally varying between specific assays and GNNs. GINs were found to consistently outperform GCNs and GATs, for the top 5 of 7 most data-abundant toxicological assays. GATs, however, significantly outperformed on the remaining 2 most data-scarce assays. This indicates that GINs are a more optimal architecture for data-abundant environments, whereas GATs are a more optimal architecture for data-scarce environments. Subsequent analysis of the explored higher-dimensional hyperparameter spaces, as well as optimised hyperparameter states, found that GCNs and GATs reached measurably closer optimised states with each other, compared to GINs, further indicating the unique nature of GINs as a GNN algorithm.
zh

[AI-65] ASR-Guided Speaker-Role Diarization and Diarization-Guided ASR Decoding INTERSPEECH2025

【速读】:该论文旨在解决传统说话人聚类(Speaker Diarization, SD)仅标注通用身份标签(如speaker-1、speaker-2)而缺乏语义角色信息的问题,提出将SD扩展为说话人角色聚类(Speaker Role Diarization, RD),例如区分医生与患者、主持人与嘉宾等更具应用价值的角色标签。其解决方案的关键在于:(1)通过强制对齐和交叉熵损失简化训练流程,替代复杂的RNN-T(Recurrence Neural Network Transducer, RNNT)损失;(2)发现词预测与角色预测对上下文依赖程度不同,因而设计了任务特定的独立预测器,突破现有共享预测器模型的限制;(3)利用角色后验活性信息引导自动语音识别(ASR)解码过程,有效减少因短词被误删导致的错误。

链接: https://arxiv.org/abs/2507.17765
作者: Arindam Ghosh,Mark Fuhs,Bongjun Kim,Anurag Chowdhury,Monika Woszczyna
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Interspeech 2025 Submission

点击查看摘要

Abstract:From an application standpoint, speaker-role diarization (RD), such as doctor vs. patient, host vs. guest, etc. is often more useful than traditional speaker diarization (SD), which assigns generic labels like speaker-1, speaker-2 etc. In the context of joint automatic speech recognition (ASR) + SD (who spoke what?), recent end-to-end models employ an auxiliary SD transducer, synchronized with the ASR transducer, to predict speakers per word. In this paper, we extend this framework to RD with three key contributions: (1) we simplify the training via forced alignment and cross-entropy loss instead of RNNT loss, (2) we show that word prediction and role prediction require different amounts of predictor’s context, leading to separate task-specific predictors, unlike existing shared-predictor models, and (3) we propose a way to leverage RD posterior activity to influence ASR decoding and reduce small-word deletion errors.
zh

机器学习

[LG-0] Gait Recognition Based on Tiny ML and IMU Sensors

链接: https://arxiv.org/abs/2507.18627
作者: Jiahang Zhang,Mingtong Chen,Zhengbao Yang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This project presents the development of a gait recognition system using Tiny Machine Learning (Tiny ML) and Inertial Measurement Unit (IMU) sensors. The system leverages the XIAO-nRF52840 Sense microcontroller and the LSM6DS3 IMU sensor to capture motion data, including acceleration and angular velocity, from four distinct activities: walking, stationary, going upstairs, and going downstairs. The data collected is processed through Edge Impulse, an edge AI platform, which enables the training of machine learning models that can be deployed directly onto the microcontroller for real-time activity classification. The data preprocessing step involves extracting relevant features from the raw sensor data using techniques such as sliding windows and data normalization, followed by training a Deep Neural Network (DNN) classifier for activity recognition. The model achieves over 80% accuracy on a test dataset, demonstrating its ability to classify the four activities effectively. Additionally, the platform enables anomaly detection, further enhancing the robustness of the system. The integration of Tiny ML ensures low-power operation, making it suitable for battery-powered or energy-harvesting devices.
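
A minimal sketch of the sliding-window segmentation and normalization step described above; the window length (128 samples), stride, and six IMU channels are illustrative assumptions:

```python
import numpy as np

def sliding_windows(signal, win=128, step=64):
    """Split a (T, C) IMU stream into overlapping (win, C) windows."""
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])

def zscore(windows):
    """Per-channel z-score normalization across the whole batch."""
    mu = windows.mean(axis=(0, 1), keepdims=True)
    sd = windows.std(axis=(0, 1), keepdims=True) + 1e-8
    return (windows - mu) / sd

imu = np.random.randn(1000, 6)       # 6 channels: 3-axis accel + 3-axis gyro
x = zscore(sliding_windows(imu))     # input batch for the DNN classifier
print(x.shape)                       # (14, 128, 6)
```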

[LG-1] Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents

链接: https://arxiv.org/abs/2507.18607
作者: Xinyuan Yan,Rita Sevastjanova,Sinie van der Ben,Mennatallah El-Assady,Bei Wang
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) produce high-dimensional embeddings that capture rich semantic and syntactic relationships between words, sentences, and concepts. Investigating the topological structures of LLM embedding spaces via mapper graphs enables us to understand their underlying structures. Specifically, a mapper graph summarizes the topological structure of the embedding space, where each node represents a topological neighborhood (containing a cluster of embeddings), and an edge connects two nodes if their corresponding neighborhoods overlap. However, manually exploring these embedding spaces to uncover encoded linguistic properties requires considerable human effort. To address this challenge, we introduce a framework for semi-automatic annotation of these embedding properties. To organize the exploration process, we first define a taxonomy of explorable elements within a mapper graph such as nodes, edges, paths, components, and trajectories. The annotation of these elements is executed through two types of customizable LLM-based agents that employ perturbation techniques for scalable and automated analysis. These agents help to explore and explain the characteristics of mapper elements and verify the robustness of the generated explanations. We instantiate the framework within a visual analytics workspace and demonstrate its effectiveness through case studies. In particular, we replicate findings from prior research on BERT’s embedding properties across various layers of its architecture and provide further observations into the linguistic properties of topological neighborhoods.

[LG-2] Demystify Protein Generation with Hierarchical Conditional Diffusion Models

链接: https://arxiv.org/abs/2507.18603
作者: Zinan Ling,Yi Shi,Da Yan,Yang Zhou,Bo Hui
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating novel and functional protein sequences is critical to a wide range of applications in biology. Recent advancements in conditional diffusion models have shown impressive empirical performance in protein generation tasks. However, reliable generation of proteins remains an open research question in de novo protein design, especially when it comes to conditional diffusion models. Considering that the biological function of a protein is determined by multi-level structures, we propose a novel multi-level conditional diffusion model that integrates both sequence-based and structure-based information for efficient end-to-end protein design guided by specified functions. By generating representations at different levels simultaneously, our framework can effectively model the inherent hierarchical relations between different levels, resulting in an informative and discriminative representation of the generated protein. We also propose Protein-MMD, a new reliable evaluation metric, to evaluate the quality of generated proteins with conditional diffusion models. Our new metric is able to capture both distributional and functional similarities between real and generated protein sequences while ensuring conditional consistency. We experiment with the benchmark datasets, and the results on conditional protein generation tasks demonstrate the efficacy of the proposed generation framework and evaluation metric.

[LG-3] Linear Memory SE(2) Invariant Attention

链接: https://arxiv.org/abs/2507.18597
作者: Ethan Pronovost,Neha Boloor,Peter Schleede,Noureldin Hendy,Andres Morales,Nicholas Roy
类目: Machine Learning (cs.LG)
*备注: Best paper award, Equivariant Systems Workshop at RSS

点击查看摘要

Abstract:Processing spatial data is a key component in many learning tasks for autonomous driving such as motion forecasting, multi-agent simulation, and planning. Prior works have demonstrated the value in using SE(2) invariant network architectures that consider only the relative poses between objects (e.g. other agents, scene features such as traffic lanes). However, these methods compute the relative poses for all pairs of objects explicitly, requiring quadratic memory. In this work, we propose a mechanism for SE(2) invariant scaled dot-product attention that requires linear memory relative to the number of objects in the scene. Our SE(2) invariant transformer architecture enjoys the same scaling properties that have benefited large language models in recent years. We demonstrate experimentally that our approach is practical to implement and improves performance compared to comparable non-invariant architectures.

[LG-4] Neural Tangent Kernels and Fisher Information Matrices for Simple ReLU Networks with Random Hidden Weights

链接: https://arxiv.org/abs/2507.18555
作者: Jun’ichi Takeuchi,Yoshinari Takeishi,Noboru Murata,Kazushi Mimura,Ka Long Keith Ho,Hiroshi Nagaoka
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Fisher information matrices and neural tangent kernels (NTK) for 2-layer ReLU networks with random hidden weights are discussed. We examine the relation between the two notions as a linear transformation and present the spectral decomposition of the NTK, with concrete forms of the eigenfunctions associated with the major eigenvalues. We also obtain an approximation formula for the functions represented by the 2-layer neural networks.

[LG-5] The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

链接: https://arxiv.org/abs/2507.18553
作者: Jiale Chen,Torsten Hoefler,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. Yet, its inner workings are described as a sequence of ad-hoc algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai’s nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer’s inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: (i) the GPTQ error propagation step gains an intuitive geometric interpretation; (ii) GPTQ inherits the error upper bound of Babai’s algorithm under the no-clipping condition. Taken together, these results place GPTQ on firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models.
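
For readers unfamiliar with Babai's algorithm, here is a minimal sketch of the nearest-plane procedure on a toy 2-D lattice. In the paper's correspondence, the lattice basis comes from the Hessian of the layer inputs and the back-to-front sweep mirrors GPTQ's column order:

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B (classical Gram-Schmidt, no normalization)."""
    Bs = B.astype(float).copy()
    for i in range(B.shape[1]):
        for j in range(i):
            Bs[:, i] -= (Bs[:, i] @ Bs[:, j]) / (Bs[:, j] @ Bs[:, j]) * Bs[:, j]
    return Bs

def babai_nearest_plane(B, t):
    """Babai's nearest-plane algorithm: an approximate closest lattice vector
    to t in the lattice spanned by the columns of B, processed back-to-front."""
    Bs = gram_schmidt(B)
    r = t.astype(float).copy()
    coeffs = np.zeros(B.shape[1])
    for i in reversed(range(B.shape[1])):            # last dimension first
        c = float(np.round((r @ Bs[:, i]) / (Bs[:, i] @ Bs[:, i])))
        coeffs[i] = c
        r -= c * B[:, i]                             # error propagation step
    return B @ coeffs

B = np.array([[2.0, 1.0], [0.0, 2.0]])               # toy 2-D lattice basis
print(babai_nearest_plane(B, np.array([2.3, 3.4])))  # -> [2. 4.]
```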

[LG-6] The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection

链接: https://arxiv.org/abs/2507.18549
作者: Steven A. Frank
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注:

点击查看摘要

Abstract:Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: \Delta\mathbf{\theta} = \mathbf{M}\,\mathbf{f} + \mathbf{b} + \boldsymbol{\xi} . The force \mathbf{f} drives improvement in parameters, \Delta\mathbf{\theta} , through the covariance between the parameters and performance. The metric \mathbf{M} rescales movement by inverse curvature. The bias \mathbf{b} adds momentum or changes in the frame of reference. The noise \boldsymbol{\xi} enables exploration. This framework unifies natural selection, Bayesian updating, Newton’s method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d’Alembert’s principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.
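
A minimal numeric sketch of the FMB law, instantiating plain gradient descent as the special case M = lr·I, f = -∇L, b = 0, ξ = 0 (the quadratic loss here is a toy example, not from the paper):

```python
import numpy as np

def fmb_update(theta, f, M, b, xi):
    """One step of the force-metric-bias law: dtheta = M f + b + xi."""
    return theta + M @ f + b + xi

# Plain gradient descent on L(theta) = 0.5 * ||theta||^2 as the special case
# M = lr * I, f = -grad L, b = 0, xi = 0.
theta = np.array([1.0, -2.0])
lr, I = 0.1, np.eye(2)
for _ in range(50):
    grad = theta  # gradient of 0.5 * ||theta||^2
    theta = fmb_update(theta, -grad, lr * I, np.zeros(2), np.zeros(2))
print(theta)  # converges toward the optimum at the origin

# Other cases change the pieces: Newton's method sets M to the inverse Hessian,
# momentum methods fold the velocity term into b, and Langevin dynamics draws
# Gaussian noise for xi.
```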

[LG-7] AI/ML Life Cycle Management for Interoperable AI Native RAN

链接: https://arxiv.org/abs/2507.18538
作者: Chu-Hsiang Huang,Chao-Kai Wen,Geoffrey Ye Li
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 2 table. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Artificial intelligence (AI) and machine learning (ML) models are rapidly permeating the 5G Radio Access Network (RAN), powering beam management, channel state information (CSI) feedback, positioning, and mobility prediction. However, without a standardized life-cycle management (LCM) framework, challenges, such as model drift, vendor lock-in, and limited transparency, hinder large-scale adoption. 3GPP Releases 16-20 progressively evolve AI/ML from experimental features to managed, interoperable network functions. Beginning with the Network Data Analytics Function (NWDAF) in Rel-16, subsequent releases introduced standardized interfaces for model transfer, execution, performance monitoring, and closed-loop control, culminating in Rel-20’s two-sided CSI-compression Work Item and vendor-agnostic LCM profile. This article reviews the resulting five-block LCM architecture, KPI-driven monitoring mechanisms, and inter-vendor collaboration schemes, while identifying open challenges in resource-efficient monitoring, environment drift detection, intelligent decision-making, and flexible model training. These developments lay the foundation for AI-native transceivers as a key enabler for 6G.

[LG-8] Revisiting Bisimulation Metric for Robust Representations in Reinforcement Learning

链接: https://arxiv.org/abs/2507.18519
作者: Leiji Zhang,Zeyu Wang,Xin Li,Yao-Hui Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bisimulation metric has long been regarded as an effective control-related representation learning technique in various reinforcement learning tasks. However, in this paper, we identify two main issues with the conventional bisimulation metric: 1) an inability to represent certain distinctive scenarios, and 2) a reliance on predefined weights for differences in rewards and subsequent states during recursive updates. We find that the first issue arises from an imprecise definition of the reward gap, whereas the second issue stems from overlooking the varying importance of reward difference and next-state distinctions across different training stages and task settings. To address these issues, by introducing a measure for state-action pairs, we propose a revised bisimulation metric that features a more precise definition of reward gap and novel update operators with adaptive coefficient. We also offer theoretical guarantees of convergence for our proposed metric and its improved representation distinctiveness. In addition to our rigorous theoretical analysis, we conduct extensive experiments on two representative benchmarks, DeepMind Control and Meta-World, demonstrating the effectiveness of our approach.

[LG-9] High-Dimensional Data Classification in Concentric Coordinates

链接: https://arxiv.org/abs/2507.18450
作者: Alice Williams,Boris Kovalerchuk
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 21 figures

点击查看摘要

Abstract:Interpretable visualization of multi-dimensional data remains limited by the shortage of high-dimensional lossless visualizations that avoid occlusion while remaining computationally tractable through parameterization. This paper proposes a framework supporting low- to high-dimensional data using lossless Concentric Coordinates, a more compact generalization of Parallel Coordinates, together with the earlier Circular Coordinates. Both are forms of General Line Coordinate visualizations that can directly support machine learning algorithm visualization and facilitate human interaction.

[LG-10] Multi-Model Ensemble and Reservoir Computing for River Discharge Prediction in Ungauged Basins

链接: https://arxiv.org/abs/2507.18423
作者: Mizuki Funato,Yohei Sawada
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Despite the critical need for accurate flood prediction and water management, many regions lack sufficient river discharge observations, limiting the skill of rainfall-runoff analyses. Although numerous physically based and machine learning models exist, achieving high accuracy, interpretability, and computational efficiency under data-scarce conditions remains a major challenge. We address this challenge with a novel method, HYdrological Prediction with multi-model Ensemble and Reservoir computing (HYPER) that leverages multi-model ensemble and reservoir computing (RC). Our approach first applies Bayesian model averaging (BMA) to 43 “uncalibrated” catchment-based conceptual hydrological models. An RC model is then trained via linear regression to correct errors in the BMA output, a non-iterative process that ensures high computational efficiency. For ungauged basins, we infer the required BMA and RC weights by linking them to catchment attributes from gauged basins, creating a generalizable framework. We evaluated HYPER using data from 87 river basins in Japan. In a data-rich scenario, HYPER (median Kling-Gupta Efficiency, KGE, of 0.56) performed comparably to a benchmark LSTM (KGE 0.55) but required only 5% of its computational time. In a data-scarce scenario (23% of basins gauged), HYPER maintained robust performance (KGE 0.55) and lower uncertainty, whereas the LSTM’s performance degraded significantly (KGE -0.04). These results reveal that individual conceptual hydrological models do not necessarily need to be calibrated when an effectively large ensemble is assembled and combined with machine-learning-based bias correction. HYPER provides a robust, efficient, and generalizable solution for discharge prediction, particularly in ungauged basins, making it applicable to a wide range of regions.
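
A minimal sketch of the two-stage idea on synthetic data (uniform BMA weights, a generic echo-state reservoir; all names and sizes are ours, not the authors' implementation): the ensemble is averaged first, then a non-iterative linear readout on reservoir states corrects the residual error.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_models, n_res = 1000, 5, 100

# Synthetic stand-ins for ensemble discharge simulations and observations.
sims = rng.random((T, n_models))
obs = sims @ np.array([0.4, 0.3, 0.1, 0.1, 0.1]) + 0.1 * rng.standard_normal(T)

w = np.full(n_models, 1.0 / n_models)     # BMA weights (uniform placeholder)
bma = sims @ w

# Fixed random reservoir driven by the BMA output (echo-state style).
W_in = 0.5 * rng.standard_normal(n_res)
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1
x = np.zeros(n_res)
states = np.zeros((T, n_res))
for t in range(T):
    x = np.tanh(W_in * bma[t] + W @ x)
    states[t] = x

# Non-iterative readout: ridge regression from states to the BMA residual.
lam = 1e-2
resid = obs - bma
beta = np.linalg.solve(states.T @ states + lam * np.eye(n_res), states.T @ resid)
corrected = bma + states @ beta
print(np.mean((obs - bma) ** 2), np.mean((obs - corrected) ** 2))
```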

[LG-11] A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges

链接: https://arxiv.org/abs/2507.18376
作者: Xing Hua,Haodong Chen,Qianqian Duan,Danfeng Hong,Ruijiao Li,Huiliang Shang,Linghua Jiang,Haima Yang,Dawei Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the global population growing and arable land resources becoming increasingly scarce, smart agriculture and precision agriculture have emerged as key directions for the future of agricultural development. Artificial intelligence (AI) technologies, particularly deep learning models, have found widespread applications in areas such as crop monitoring and pest detection. As an emerging generative model, diffusion models have shown significant promise in tasks like agricultural image processing, data augmentation, and remote sensing. Compared to traditional generative adversarial networks (GANs), diffusion models offer superior training stability and generation quality, effectively addressing challenges such as limited agricultural data and imbalanced image samples. This paper reviews the latest advancements in the application of diffusion models in agriculture, focusing on their potential in crop pest and disease detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Experimental results demonstrate that diffusion models significantly improve model accuracy and robustness in data augmentation, image generation, and denoising, especially in complex environments. Despite challenges related to computational efficiency and generalization capabilities, diffusion models are expected to play an increasingly important role in smart and precision agriculture as technology advances, providing substantial support for the sustainable development of global agriculture.

[LG-12] Efficient Uncertainty in LLMs through Evidential Knowledge Distillation

链接: https://arxiv.org/abs/2507.18366
作者: Lakshmana Sri Harsha Nemani,P.K. Srijith,Tomasz Kuśmierczyk
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Accurate uncertainty quantification remains a key challenge for standard LLMs, prompting the adoption of Bayesian and ensemble-based methods. However, such methods typically necessitate computationally expensive sampling, involving multiple forward passes to effectively estimate predictive uncertainty. In this paper, we introduce a novel approach enabling efficient and effective uncertainty estimation in LLMs without sacrificing performance. Specifically, we distill uncertainty-aware teacher models - originally requiring multiple forward passes - into compact student models sharing the same architecture but fine-tuned using Low-Rank Adaptation (LoRA). We compare two distinct distillation strategies: one in which the student employs traditional softmax-based outputs, and another in which the student leverages Dirichlet-distributed outputs to explicitly model epistemic uncertainty via evidential learning. Empirical evaluations on classification datasets demonstrate that such students can achieve comparable or superior predictive and uncertainty quantification performance relative to their teacher models, while critically requiring only a single forward pass. To our knowledge, this is the first demonstration that immediate and robust uncertainty quantification can be achieved in LLMs through evidential distillation.
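
The single-forward-pass uncertainty of the evidential (Dirichlet) student can be sketched as follows; shapes and the softplus choice are illustrative, not the paper's exact head.

```python
import torch

K = 4                                  # number of classes (illustrative)
logits = torch.randn(2, K)             # stand-in for one student forward pass
evidence = torch.nn.functional.softplus(logits)
alpha = evidence + 1.0                 # Dirichlet concentration parameters
S = alpha.sum(dim=-1, keepdim=True)    # total evidence per sample

prob = alpha / S                       # expected class probabilities
epistemic = K / S.squeeze(-1)          # vacuity: high when evidence is scarce
print(prob, epistemic)
```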

[LG-13] Tiny is not small enough: High-quality low-resource facial animation models through hybrid knowledge distillation SIGGRAPH

链接: https://arxiv.org/abs/2507.18352
作者: Zhen Han,Mattias Teye,Derek Yadgaroff,Judith Bütepage
类目: Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to ACM Transactions on Graphics 2025 (SIGGRAPH journal track)

点击查看摘要

Abstract:The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models only consist of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint to as little as 3.4 MB and the required future audio context to as little as 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.

[LG-14] Low-rank adaptive physics-informed HyperDeepONets for solving differential equations

链接: https://arxiv.org/abs/2507.18346
作者: Etienne Zeudong,Elsa Cardoso-Bihlo,Alex Bihlo
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 14 pages, 6 figures, 5 tables

点击查看摘要

Abstract:HyperDeepONets were introduced in Lee, Cho and Hwang [ICLR, 2023] as an alternative architecture for operator learning, in which a hypernetwork generates the weights for the trunk net of a DeepONet. While this improves expressivity, it incurs high memory and computational costs due to the large number of output parameters required. In this work we introduce, in the physics-informed machine learning setting, a variation, PI-LoRA-HyperDeepONets, which leverage low-rank adaptation (LoRA) to reduce complexity by decomposing the hypernetwork’s output layer weight matrix into two smaller low-rank matrices. This reduces the number of trainable parameters while introducing an extra regularization of the trunk networks’ weights. Through extensive experiments on both ordinary and partial differential equations we show that PI-LoRA-HyperDeepONets achieve up to 70% reduction in parameters and consistently outperform regular HyperDeepONets in terms of predictive accuracy and generalization.
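
The parameter saving comes from factorizing the hypernetwork's output layer, which must emit all trunk-net weights. A toy sizing of that factorization (sizes are illustrative, not the paper's architecture):

```python
import numpy as np

hidden_dim, n_trunk_params, rank = 256, 20000, 16

dense_params = hidden_dim * n_trunk_params           # full output layer
A = 0.01 * np.random.randn(hidden_dim, rank)         # low-rank factor A
B = 0.01 * np.random.randn(rank, n_trunk_params)     # low-rank factor B
lora_params = A.size + B.size

h = np.random.randn(hidden_dim)      # hypernetwork's penultimate features
trunk_weights = (h @ A) @ B          # generated trunk-net weights

print(dense_params, lora_params)     # 5,120,000 vs. 324,096 (~94% fewer here)
```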

[LG-15] Remembering the Markov Property in Cooperative MARL

链接: https://arxiv.org/abs/2507.18333
作者: Kale-ab Abebe Tessera,Leonard Hinckeldey,Riccardo Zamboni,David Abel,Amos Storkey
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: RLC Finding the Frame Workshop Camera-Ready, 8 pages

点击查看摘要

Abstract:Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about the environment and other agents’ behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co-adapted agreements.

[LG-16] State of Health Estimation of Batteries Using a Time-Informed Dynamic Sequence-Inverted Transformer

链接: https://arxiv.org/abs/2507.18320
作者: Janak M. Patel,Milad Ramezankhani,Anirudh Deodhar,Dagnachew Birru
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:The rapid adoption of battery-powered vehicles and energy storage systems over the past decade has made battery health monitoring increasingly critical. Batteries play a central role in the efficiency and safety of these systems, yet they inevitably degrade over time due to repeated charge-discharge cycles. This degradation leads to reduced energy efficiency and potential overheating, posing significant safety concerns. Accurate estimation of a battery’s State of Health (SoH) is therefore essential for ensuring operational reliability and safety. Several machine learning architectures, such as LSTMs, transformers, and encoder-based models, have been proposed to estimate SoH from discharge cycle data. However, these models struggle with the irregularities inherent in real-world measurements: discharge readings are often recorded at non-uniform intervals, and the lengths of discharge cycles vary significantly. To address this, most existing approaches extract features from the sequences rather than processing them in full, which introduces information loss and compromises accuracy. To overcome these challenges, we propose a novel architecture: Time-Informed Dynamic Sequence Inverted Transformer (TIDSIT). TIDSIT incorporates continuous time embeddings to effectively represent irregularly sampled data and utilizes padded sequences with temporal attention mechanisms to manage variable-length inputs without discarding sequence information. Experimental results on the NASA battery degradation dataset show that TIDSIT significantly outperforms existing models, achieving over 50% reduction in prediction error and maintaining an SoH prediction error below 0.58%. Furthermore, the architecture is generalizable and holds promise for broader applications in health monitoring tasks involving irregular time-series data.
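
One generic way to realize a continuous time embedding for irregular samples is sketched below (sinusoidal features over raw timestamps; TIDSIT's exact construction may differ):

```python
import numpy as np

def time_embedding(t, dim=16, max_period=1e4):
    """Map raw timestamps to sinusoidal features (a generic construction)."""
    freqs = max_period ** (-np.arange(0, dim, 2) / dim)
    angles = np.outer(t, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

timestamps = np.array([0.0, 3.7, 4.1, 9.8, 25.0])  # non-uniform intervals
emb = time_embedding(timestamps)
print(emb.shape)  # (5, 16): one embedding per irregularly sampled reading
```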

[LG-17] Regression-aware Continual Learning for Android Malware Detection

链接: https://arxiv.org/abs/2507.18313
作者: Daniele Ghiani,Daniele Angioni,Giorgio Piras,Angelo Sotgiu,Luca Minnei,Srishti Gupta,Maura Pintor,Fabio Roli,Battista Biggio
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Submitted to IEEE Transactions on Information Forensics and Security

点击查看摘要

Abstract:Malware evolves rapidly, forcing machine learning (ML)-based detectors to adapt continuously. With antivirus vendors processing hundreds of thousands of new samples daily, datasets can grow to billions of examples, making full retraining impractical. Continual learning (CL) has emerged as a scalable alternative, enabling incremental updates without full data access while mitigating catastrophic forgetting. In this work, we analyze a critical yet overlooked issue in this context: security regression. Unlike forgetting, which manifests as a general performance drop on previously seen data, security regression captures harmful prediction changes at the sample level, such as a malware sample that was once correctly detected but evades detection after a model update. Although often overlooked, regressions pose serious risks in security-critical applications, as the silent reintroduction of previously detected threats in the system may undermine users’ trust in the whole updating process. To address this issue, we formalize and quantify security regression in CL-based malware detectors and propose a regression-aware penalty to mitigate it. Specifically, we adapt Positive Congruent Training (PCT) to the CL setting, preserving prior predictive behavior in a model-agnostic manner. Experiments on the ELSA, Tesseract, and AZ-Class datasets show that our method effectively reduces regression across different CL scenarios while maintaining strong detection performance over time.
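
A sketch in the spirit of adapting Positive Congruent Training to this setting: the updated model pays an extra penalty for diverging from the old model on samples the old model classified correctly, exactly where a flip would be a security regression. Shapes and the weighting are illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pct_loss(new_logits, old_logits, labels, lam=1.0):
    ce = F.cross_entropy(new_logits, labels)
    # Samples the old model got right: flips here are regressions.
    old_correct = old_logits.argmax(dim=1).eq(labels).float()
    per_sample_kl = F.kl_div(
        F.log_softmax(new_logits, dim=1),
        F.softmax(old_logits, dim=1),
        reduction="none",
    ).sum(dim=1)
    return ce + lam * (old_correct * per_sample_kl).mean()

new_logits = torch.randn(8, 2, requires_grad=True)  # goodware/malware batch
old_logits = torch.randn(8, 2)                      # frozen previous model
labels = torch.randint(0, 2, (8,))
print(pct_loss(new_logits, old_logits, labels))
```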

[LG-18] Self-Supervised Coarsening of Unstructured Grid with Automatic Differentiation

链接: https://arxiv.org/abs/2507.18297
作者: Sergei Shumilin,Alexander Ryabov,Nikolay Yavich,Evgeny Burnaev,Vladimir Vanovskiy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to the high computational load of modern numerical simulation, there is a demand for approaches that would reduce the size of discrete problems while keeping the accuracy reasonable. In this work, we present an original algorithm to coarsen an unstructured grid based on the concepts of differentiable physics. We achieve this by employing k-means clustering, autodifferentiation and stochastic minimization algorithms. We demonstrate performance of the designed algorithm on two PDEs: a linear parabolic equation which governs slightly compressible fluid flow in porous media and the wave equation. Our results show that in the considered scenarios, we reduced the number of grid points up to 10 times while preserving the modeled variable dynamics in the points of interest. The proposed approach can be applied to the simulation of an arbitrary system described by evolutionary partial differential equations.

[LG-19] Leveraging Data Augmentation and Siamese Learning for Predictive Process Monitoring

链接: https://arxiv.org/abs/2507.18293
作者: Sjoerd van Straten,Alessandro Padella,Marwan Hassani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive Process Monitoring (PPM) enables forecasting future events or outcomes of ongoing business process instances based on event logs. However, deep learning PPM approaches are often limited by the low variability and small size of real-world event logs. To address this, we introduce SiamSA-PPM, a novel self-supervised learning framework that combines Siamese learning with Statistical Augmentation for Predictive Process Monitoring. It employs three novel statistically grounded transformation methods that leverage control-flow semantics and frequent behavioral patterns to generate realistic, semantically valid new trace variants. These augmented views are used within a Siamese learning setup to learn generalizable representations of process prefixes without the need for labeled supervision. Extensive experiments on real-life event logs demonstrate that SiamSA-PPM achieves competitive or superior performance compared to the SOTA in both next activity and final outcome prediction tasks. Our results further show that statistical augmentation significantly outperforms random transformations and improves variability in the data, highlighting SiamSA-PPM as a promising direction for training data enrichment in process prediction.

[LG-20] Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods

链接: https://arxiv.org/abs/2507.18242
作者: Fabian Akkerman,Julien Ferry,Christian Artigues,Emmanuel Hebrard,Thibaut Vidal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite their theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention. In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. We show that totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. We further show that these methods can thin pre-trained ensembles without sacrificing performance, and we highlight both the strengths and limitations of using optimal decision trees in this context.
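
For concreteness, a hard-margin, LPBoost-style totally corrective LP over a fixed pool of base learners can be sketched as follows; the paper's six formulations, NM-Boost and QRLP-Boost included, and any column generation go beyond this toy.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 40, 6
H = np.sign(rng.standard_normal((n, m)))  # base-learner predictions in {-1,+1}
y = np.sign(rng.standard_normal(n))       # labels in {-1,+1}

# Variables z = [w_1..w_m, rho]; maximize rho <=> minimize -rho.
c = np.zeros(m + 1)
c[-1] = -1.0
A_ub = np.hstack([-(y[:, None] * H), np.ones((n, 1))])  # rho - y_i (H w)_i <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # weights sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]               # w >= 0, rho free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w, rho = res.x[:m], res.x[-1]
print(rho, w)  # minimum margin and the (typically sparse) ensemble weights
```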

[LG-21] Sparse identification of nonlinear dynamics with library optimization mechanism: Recursive long-term prediction perspective

链接: https://arxiv.org/abs/2507.18220
作者: Ansei Yonezawa,Heisei Yonezawa,Shuichi Yahagi,Itsuro Kajiwara,Shinya Kijimoto,Hikaru Taniuchi,Kentaro Murakami
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The sparse identification of nonlinear dynamics (SINDy) approach can discover the governing equations of dynamical systems based on measurement data, where the dynamical model is identified as the sparse linear combination of the given basis functions. A major challenge in SINDy is the design of a library, which is a set of candidate basis functions, as the appropriate library is not trivial for many dynamical systems. To overcome this difficulty, this study proposes SINDy with library optimization mechanism (SINDy-LOM), which is a combination of the sparse regression technique and the novel learning strategy of the library. In the proposed approach, the basis functions are parametrized. The SINDy-LOM approach involves a two-layer optimization architecture: the inner-layer, in which the data-driven model is extracted as the sparse linear combination of the candidate basis functions, and the outer-layer, in which the basis functions are optimized from the viewpoint of the recursive long-term (RLT) prediction accuracy; thus, the library design is reformulated as the optimization of the parametrized basis functions. The resulting SINDy-LOM model has good interpretability and usability, as the proposed approach yields the parsimonious model. The library optimization mechanism significantly reduces user burden. The RLT perspective improves the reliability of the resulting model compared with the traditional SINDy approach that can only ensure the one-step-ahead prediction accuracy. The validity of the proposed approach is demonstrated by applying it to a diesel engine airpath system, which is a well-known complex industrial system.
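
The sparse-regression core that SINDy builds on can be sketched with sequentially thresholded least squares (STLSQ) over a fixed monomial library; SINDy-LOM's outer layer, which optimizes parametrized basis functions for recursive long-term prediction, is not shown.

```python
import numpy as np

t = np.linspace(-3.0, 3.0, 400)
x = np.tanh(t)
dx = 1.0 - x**2                        # ground-truth dynamics: dx/dt = 1 - x^2

# Candidate library: fixed monomials here; parametrized in SINDy-LOM.
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
xi = np.linalg.lstsq(Theta, dx, rcond=None)[0]

for _ in range(10):                    # sequentially thresholded least squares
    small = np.abs(xi) < 0.1
    xi[small] = 0.0
    big = ~small
    xi[big] = np.linalg.lstsq(Theta[:, big], dx, rcond=None)[0]

print(np.round(xi, 3))                 # recovers approximately [1, 0, -1, 0]
```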

[LG-22] Goal-based Trajectory Prediction for improved Cross-Dataset Generalization ITSC2025

链接: https://arxiv.org/abs/2507.18196
作者: Daniel Grimm,Ahmed Abouelazm,J. Marius Zöllner
类目: Machine Learning (cs.LG)
*备注: Accepted on IEEE ITSC 2025

点击查看摘要

Abstract:To achieve full autonomous driving, a good understanding of the surrounding environment is necessary. Especially predicting the future states of other traffic participants imposes a non-trivial challenge. Current SotA models already show promising results when trained on real datasets (e.g. Argoverse2, NuScenes). Problems arise when these models are deployed to new/unseen areas. Typically, performance drops significantly, indicating that the models lack generalization. In this work, we introduce a new Graph Neural Network (GNN) that utilizes a heterogeneous graph consisting of traffic participants and a vectorized road network. The latter is used to classify goals, i.e. endpoints of the predicted trajectories, in a multi-staged approach, leading to a better generalization to unseen scenarios. We show the effectiveness of the goal selection process via cross-dataset evaluation, i.e. training on Argoverse2 and evaluating on NuScenes.

[LG-23] Neuromorphic Computing for Embodied Intelligence in Autonomous Systems: Current Trends, Challenges, and Future Directions

链接: https://arxiv.org/abs/2507.18139
作者: Alberto Marchisio,Muhammad Shafique
类目: Machine Learning (cs.LG)
*备注: To appear at the 31st IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS), Ischia, Italy, July 2025

点击查看摘要

Abstract:The growing need for intelligent, adaptive, and energy-efficient autonomous systems across fields such as robotics, mobile agents (e.g., UAVs), and self-driving vehicles is driving interest in neuromorphic computing. By drawing inspiration from biological neural systems, neuromorphic approaches offer promising pathways to enhance the perception, decision-making, and responsiveness of autonomous platforms. This paper surveys recent progress in neuromorphic algorithms, specialized hardware, and cross-layer optimization strategies, with a focus on their deployment in real-world autonomous scenarios. Special attention is given to event-based dynamic vision sensors and their role in enabling fast, efficient perception. The discussion highlights new methods that improve energy efficiency, robustness, adaptability, and reliability through the integration of spiking neural networks into autonomous system architectures. We integrate perspectives from machine learning, robotics, neuroscience, and neuromorphic engineering to offer a comprehensive view of the state of the field. Finally, emerging trends and open challenges are explored, particularly in the areas of real-time decision-making, continual learning, and the development of secure, resilient autonomous systems.

[LG-24] Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning

链接: https://arxiv.org/abs/2507.18122
作者: Matthias Otth,Jonas Hübotter,Ido Hakimi,Andreas Krause
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work has shown that language models can self-improve by maximizing their own confidence in their predictions, without relying on external verifiers or reward signals. In this work, we study the test-time scaling of language models for mathematical reasoning tasks, where the model’s own confidence is used to select the most promising attempts. Surprisingly, we find that we can achieve significant performance gains by continuing only the most promising attempt, selected by the model’s prefix-confidence. We systematically evaluate prefix-confidence scaling on five mathematical reasoning datasets: the school-level GSM8K and MATH500, and the competition-level AMC23, AIME24, and AIME25. We find that prefix-confidence scaling with prefixes of only 32 tokens achieves a better accuracy-compute trade-off than majority voting. Moreover, prefix-confidence scaling appears less susceptible than BoN to length biases. Finally, we also evaluate test-time training with prefix-confidence and find that, while outperforming the base model, it does not improve over prefix-confidence scaling.
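
A minimal sketch of prefix-confidence selection, assuming per-token log-probabilities are available and using the mean log-probability of the first 32 tokens as the confidence proxy (helper names are ours):

```python
def prefix_confidence(token_logprobs, k=32):
    """Mean log-probability over the first k tokens of an attempt."""
    prefix = token_logprobs[:k]
    return sum(prefix) / len(prefix)

# Placeholder per-token log-probs for three sampled attempts.
attempts = [
    [-0.9, -1.2, -0.4] * 16,
    [-0.2, -0.3, -0.5] * 16,
    [-1.5, -0.8, -1.1] * 16,
]
scores = [prefix_confidence(lp) for lp in attempts]
best = max(range(len(attempts)), key=lambda i: scores[i])
print(best, scores)  # continue decoding only the most promising attempt
```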

[LG-25] Policy Disruption in Reinforcement Learning: Adversarial Attack with Large Language Models and Critical State Identification

链接: https://arxiv.org/abs/2507.18113
作者: Junyong Jiang,Buwei Tian,Chenxing Xu,Songze Li,Lu Dong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has achieved remarkable success in fields like robotics and autonomous driving, but adversarial attacks designed to mislead RL systems remain challenging. Existing approaches often rely on modifying the environment or policy, limiting their practicality. This paper proposes an adversarial attack method in which existing agents in the environment guide the target policy to output suboptimal actions without altering the environment. We propose a reward iteration optimization framework that leverages large language models (LLMs) to generate adversarial rewards explicitly tailored to the vulnerabilities of the target agent, thereby enhancing the effectiveness of inducing the target agent toward suboptimal decision-making. Additionally, a critical state identification algorithm is designed to pinpoint the target agent’s most vulnerable states, where suboptimal behavior from the victim leads to significant degradation in overall performance. Experimental results in diverse environments demonstrate the superiority of our method over existing approaches.

[LG-26] Percentile-Based Deep Reinforcement Learning and Reward Based Personalization For Delay Aware RAN Slicing in O-RAN

链接: https://arxiv.org/abs/2507.18111
作者: Peyman Tehrani,Anas Alsoliman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we tackle the challenge of radio access network (RAN) slicing within an open RAN (O-RAN) architecture. Our focus centers on a network that includes multiple mobile virtual network operators (MVNOs) competing for physical resource blocks (PRBs) with the goal of meeting probabilistic delay upper bound constraints for their clients while minimizing PRB utilization. Initially, we derive a reward function based on the law of large numbers (LLN), then implement practical modifications to adapt it for real-world experimental scenarios. We then propose our solution, the Percentile-based Delay-Aware Deep Reinforcement Learning (PDA-DRL), which demonstrates its superiority over several baselines, including DRL models optimized for average delay constraints, by achieving a 38% reduction in resultant average delay. Furthermore, we delve into the issue of model weight sharing among multiple MVNOs to develop a robust personalized model. We introduce a reward-based personalization method where each agent prioritizes other agents’ model weights based on their performance. This technique surpasses traditional aggregation methods, such as federated averaging, and strategies reliant on traffic patterns and model weight distance similarities.
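
One way such an LLN-motivated reward can be sketched: the empirical violation rate over a recent window stands in for the true delay-violation probability, and PRB usage is penalized only while the estimated constraint holds. Constants below are illustrative, not the paper's exact reward.

```python
import numpy as np

def reward(delays_window, prbs_used, delay_bound=10.0, eps=0.05, penalty=-10.0):
    # By the law of large numbers, this empirical rate converges to
    # P(delay > bound) as the window grows.
    violation_rate = np.mean(np.asarray(delays_window) > delay_bound)
    if violation_rate <= eps:
        return -float(prbs_used)      # minimize PRB utilization
    return penalty                    # probabilistic delay bound violated

print(reward([4, 8, 12, 6, 7, 9, 5, 8], prbs_used=20))
```

With the sample window above, one of eight delays exceeds the bound (12.5% > 5%), so the penalty fires.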

[LG-27] Learning from Hard Labels with Additional Supervision on Non-Hard-Labeled Classes

链接: https://arxiv.org/abs/2507.18098
作者: Kosuke Sugiyama,Masato Uchida
类目: Machine Learning (cs.LG)
*备注: 32 pages, 11 figures

点击查看摘要

Abstract:In scenarios where training data is limited due to observation costs or data scarcity, enriching the label information associated with each instance becomes crucial for building high-accuracy classification models. In such contexts, it is often feasible to obtain not only hard labels but also *additional supervision*, such as the confidences for the hard labels. This setting naturally raises fundamental questions: *What kinds of additional supervision are intrinsically beneficial?* And *how do they contribute to improved generalization performance?* To address these questions, we propose a theoretical framework that treats both hard labels and additional supervision as probability distributions, and constructs soft labels through their affine combination. Our theoretical analysis reveals that the essential component of additional supervision is not the confidence score of the assigned hard label, but rather the information of the distribution over the non-hard-labeled classes. Moreover, we demonstrate that the additional supervision and the mixing coefficient contribute to the refinement of soft labels in complementary roles. Intuitively, in the probability simplex, the additional supervision determines the direction in which the deterministic distribution representing the hard label should be adjusted toward the true label distribution, while the mixing coefficient controls the step size along that direction. Through generalization error analysis, we theoretically characterize how the additional supervision and its mixing coefficient affect both the convergence rate and asymptotic value of the error bound. Finally, we experimentally demonstrate that, based on our theory, designing additional supervision can lead to improved classification accuracy, even when utilized in a simple manner.
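
The affine construction itself is one line; a tiny sketch with made-up numbers:

```python
import numpy as np

K = 4
hard = np.eye(K)[2]                      # hard label (class 2) as a distribution
extra = np.array([0.5, 0.3, 0.0, 0.2])   # additional supervision over classes
lam = 0.3                                # mixing coefficient (step size)

soft = (1 - lam) * hard + lam * extra    # affine combination in the simplex
print(soft, soft.sum())                  # still a valid distribution
```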

[LG-28] Squeeze10-LLM: Squeezing LLMs' Weights by 10 Times via a Staged Mixed-Precision Quantization Method

链接: https://arxiv.org/abs/2507.18073
作者: Qingcheng Zhu,Yangyang Ren,Linlin Yang,Mingbao Lin,Yanjing Li,Sheng Xu,Zichao Feng,Haodong Zhu,Yuguang Yang,Juan Zhang,Runqi Wang,Baochang Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) is challenging due to their massive parameters and high computational costs. Ultra low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., mean bit-width ≤ 2) often leads to severe performance degradation. To address this, we propose Squeeze10-LLM, effectively “squeezing” 16-bit LLMs’ weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework and achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. We introduce Squeeze10-LLM with two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks–a significant boost over existing PTQ methods. Our code will be released upon publication.
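
The stated average follows directly from the mix: quantizing 80% of the weights to 1 bit and 20% to 4 bits gives $0.8 \times 1 + 0.2 \times 4 = 1.6$ bits per weight, i.e. a $16 / 1.6 = 10\times$ reduction from 16-bit weights, consistent with the 10-times squeeze in the method's name.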

[LG-29] C-AAE: Compressively Anonymizing Autoencoders for Privacy-Preserving Activity Recognition in Healthcare Sensor Streams

链接: https://arxiv.org/abs/2507.18072
作者: Ryusei Fujimoto,Yugo Nakamura,Yutaka Arakawa
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wearable accelerometers and gyroscopes encode fine-grained behavioural signatures that can be exploited to re-identify users, making privacy protection essential for healthcare applications. We introduce C-AAE, a compressive anonymizing autoencoder that marries an Anonymizing AutoEncoder (AAE) with Adaptive Differential Pulse-Code Modulation (ADPCM). The AAE first projects raw sensor windows into a latent space that retains activity-relevant features while suppressing identity cues. ADPCM then differentially encodes this latent stream, further masking residual identity information and shrinking the bitrate. Experiments on the MotionSense and PAMAP2 datasets show that C-AAE cuts user re-identification F1 scores by 10-15 percentage points relative to AAE alone, while keeping activity-recognition F1 within 5 percentage points of the unprotected baseline. ADPCM also reduces data volume by roughly 75 %, easing transmission and storage overheads. These results demonstrate that C-AAE offers a practical route to balancing privacy and utility in continuous, sensor-based activity recognition for healthcare.

[LG-30] Multiscale Neural PDE Surrogates for Prediction and Downscaling: Application to Ocean Currents ICML2025

链接: https://arxiv.org/abs/2507.18067
作者: Abdessamad El-Kabid,Loubna Benabbou,Redouane Lguensat,Alex Hernández-García
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: Workshop @ ICML2025

点击查看摘要

Abstract:Accurate modeling of physical systems governed by partial differential equations is a central challenge in scientific computing. In oceanography, high-resolution current data are critical for coastal management, environmental monitoring, and maritime safety. However, available satellite products, such as Copernicus data for sea water velocity at ~0.08 degrees spatial resolution and global ocean models, often lack the spatial granularity required for detailed local analyses. In this work, we (a) introduce a supervised deep learning framework based on neural operators for solving PDEs and providing arbitrary resolution solutions, and (b) propose downscaling models with an application to Copernicus ocean current data. Additionally, our method can model surrogate PDEs and predict solutions at arbitrary resolution, regardless of the input resolution. We evaluated our model on real-world Copernicus ocean current data and synthetic Navier-Stokes simulation datasets.

[LG-31] Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

链接: https://arxiv.org/abs/2507.18014
作者: Datta Nimmaturi,Vaishnavi Bhargava,Rajat Ghosh,Johnu George,Debojyoti Dutta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) for reasoning tasks using reinforcement learning methods like Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B-8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond a certain number of epochs offers little gain, suggesting earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
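
As an illustration of such a three-phase trajectory, the sketch below fits a logistic curve (one simple form exhibiting slow start, rapid improvement, and plateau) and derives an early-stopping step from it; the paper's actual law also conditions on model size and initial performance.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, r_max, k, t0):
    return r_max / (1.0 + np.exp(-k * (t - t0)))

steps = np.linspace(0, 100, 50)
rng = np.random.default_rng(0)
reward = logistic(steps, 0.8, 0.12, 40) + 0.02 * rng.standard_normal(50)

params, _ = curve_fit(logistic, steps, reward, p0=[1.0, 0.1, 50.0])
r_max, k, t0 = params
# Stop once the predicted reward is within 1% of its plateau.
stop_step = t0 + np.log(99.0) / k
print(params, stop_step)
```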

[LG-32] SIFOTL: A Principled Statistically-Informed Fidelity-Optimization Method for Tabular Learning

链接: https://arxiv.org/abs/2507.17979
作者: Shubham Mohole,Sainyam Galhotra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying the factors driving data shifts in tabular datasets is a significant challenge for analysis and decision support systems, especially those focusing on healthcare. Privacy rules restrict data access, and noise from complex processes hinders analysis. To address this challenge, we propose SIFOTL (Statistically-Informed Fidelity-Optimization Method for Tabular Learning) that (i) extracts privacy-compliant data summary statistics, (ii) employs twin XGBoost models to disentangle intervention signals from noise with assistance from LLMs, and (iii) merges XGBoost outputs via a Pareto-weighted decision tree to identify interpretable segments responsible for the shift. Unlike existing analyses which may ignore noise or require full data access for LLM-based analysis, SIFOTL addresses both challenges using only privacy-safe summary statistics. Demonstrating its real-world efficacy, for a MEPS panel dataset mimicking a new Medicare drug subsidy, SIFOTL achieves an F1 score of 0.85, substantially outperforming BigQuery Contribution Analysis (F1=0.46) and statistical tests (F1=0.20) in identifying the segment receiving the subsidy. Furthermore, across 18 diverse EHR datasets generated based on Synthea ABM, SIFOTL sustains F1 scores of 0.86-0.96 without noise and ≥ 0.75 even with injected observational noise, whereas baseline average F1 scores range from 0.19-0.67 under the same tests. SIFOTL, therefore, provides an interpretable, privacy-conscious workflow that is empirically robust to observational noise.

[LG-33] Clo-HDnn: A 4.66 TFLOPS/W and 3.78 TOPS/W Continual On-Device Learning Accelerator with Energy-efficient Hyperdimensional Computing via Progressive Search

链接: https://arxiv.org/abs/2507.17953
作者: Chang Eun Song,Weihong Xu,Keming Fan,Soumil Jain,Gopabandhu Hota,Haichao Yang,Leo Liu,Kerem Akarvardar,Meng-Fan Chang,Carlos H. Diaz,Gert Cauwenberghs,Tajana Rosing,Mingu Kang
类目: Machine Learning (cs.LG)
*备注: Published in 2025 Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Kyoto, Japan, 2025

点击查看摘要

Abstract:Clo-HDnn is an on-device learning (ODL) accelerator designed for emerging continual learning (CL) tasks. Clo-HDnn integrates hyperdimensional computing (HDC) along with low-cost Kronecker HD Encoder and weight clustering feature extraction (WCFE) to optimize accuracy and efficiency. Clo-HDnn adopts gradient-free CL to efficiently update and store the learned knowledge in the form of class hypervectors. Its dual-mode operation enables bypassing costly feature extraction for simpler datasets, while progressive search reduces complexity by up to 61% by encoding and comparing only partial query hypervectors. Achieving 4.66 TFLOPS/W (FE) and 3.78 TOPS/W (classifier), Clo-HDnn delivers 7.77x and 4.85x higher energy efficiency compared to SOTA ODL accelerators.
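
The progressive-search idea can be sketched as comparing a slice of the query hypervector first and running the full comparison only when the partial margin is ambiguous; dimensions and the threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_classes = 4096, 10
classes = rng.choice([-1, 1], size=(n_classes, D))  # learned class hypervectors
query = classes[3] * rng.choice([1, -1], size=D, p=[0.9, 0.1])  # noisy class 3

partial = D // 4                                    # encode/compare 25% first
sims = classes[:, :partial] @ query[:partial] / partial
top2 = np.sort(sims)[-2:]
if top2[1] - top2[0] > 0.1:                         # confident: stop early
    pred = int(np.argmax(sims))
else:                                               # ambiguous: full compare
    pred = int(np.argmax(classes @ query / D))
print(pred)                                         # -> 3
```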

[LG-34] SETOL: A Semi-Empirical Theory of (Deep) Learning

链接: https://arxiv.org/abs/2507.17912
作者: Charles H Martin,Christopher Hinrichs
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
*备注: 139 pages, 28 figures. Code for experiments available at this https URL

点击查看摘要

Abstract:We present a SemiEmpirical Theory of Learning (SETOL) that explains the remarkable performance of State-Of-The-Art (SOTA) Neural Networks (NNs). We provide a formal explanation of the origin of the fundamental quantities in the phenomenological theory of Heavy-Tailed Self-Regularization (HTSR): the heavy-tailed power-law layer quality metrics, alpha and alpha-hat. In prior work, these metrics have been shown to predict trends in the test accuracies of pretrained SOTA NN models, importantly, without needing access to either testing or training data. Our SETOL uses techniques from statistical mechanics as well as advanced methods from random matrix theory and quantum chemistry. The derivation suggests new mathematical preconditions for ideal learning, including a new metric, ERG, which is equivalent to applying a single step of the Wilson Exact Renormalization Group. We test the assumptions and predictions of SETOL on a simple 3-layer multilayer perceptron (MLP), demonstrating excellent agreement with the key theoretical assumptions. For SOTA NN models, we show how to estimate the individual layer qualities of a trained NN by simply computing the empirical spectral density (ESD) of the layer weight matrices and plugging this ESD into our SETOL formulas. Notably, we examine the performance of the HTSR alpha and the SETOL ERG layer quality metrics, and find that they align remarkably well, both on our MLP and on SOTA NNs.

[LG-35] Federated Learning for Large-Scale Cloud Robotic Manipulation: Opportunities and Challenges ICML

链接: https://arxiv.org/abs/2507.17903
作者: Obaidullah Zaland,Chanh Nguyen,Florian T. Pokorny,Monowar Bhuyan
类目: Machine Learning (cs.LG)
*备注: Accepted for Presentation at IEEE International Conference on Machine Learning and Cybernetics (ICMLC) 2025

点击查看摘要

Abstract:Federated Learning (FL) is an emerging distributed machine learning paradigm, where the collaborative training of a model involves dynamic participation of devices to achieve broad objectives. In contrast, classical machine learning (ML) typically requires data to be located on-premises for training, whereas FL leverages numerous user devices to train a shared global model without the need to share private data. Current robotic manipulation tasks are constrained by the individual capabilities and speed of robots due to limited low-latency computing resources. Consequently, the concept of cloud robotics has emerged, allowing robotic applications to harness the flexibility and reliability of computing resources, effectively alleviating their computational demands across the cloud-edge continuum. Undoubtedly, within this distributed computing context, as exemplified in cloud robotic manipulation scenarios, FL offers manifold advantages while also presenting several challenges and opportunities. In this paper, we present fundamental concepts of FL and their connection to cloud robotic manipulation. Additionally, we envision the opportunities and challenges associated with realizing efficient and reliable cloud robotic manipulation at scale through FL, where researchers design and verify FL models in either centralized or decentralized settings.

[LG-36] Lower Bounds for Public-Private Learning under Distribution Shift

链接: https://arxiv.org/abs/2507.17895
作者: Amrith Setlur,Pratiksha Thaker,Jonathan Ullman
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Preprint

点击查看摘要

Abstract:The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, there is no complementary value to combining the two data sources. In this work we extend the known lower bounds for public-private learning to the setting where the two data sources exhibit significant distribution shift. Our results apply to both Gaussian mean estimation where the two distributions have different means, and to Gaussian linear regression where the two distributions exhibit parameter shift. We find that when the shift is small (relative to the desired accuracy), either public or private data must be sufficiently abundant to estimate the private parameter. Conversely, when the shift is large, public data provides no benefit.

[LG-37] Fourier Neural Operators for Non-Markovian Processes: Approximation Theorems and Experiments

链接: https://arxiv.org/abs/2507.17887
作者: Wonjae Lee,Taeyoung Kim,Hyungbin Park
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper introduces an operator-based neural network, the mirror-padded Fourier neural operator (MFNO), designed to learn the dynamics of stochastic systems. MFNO extends the standard Fourier neural operator (FNO) by incorporating mirror padding, enabling it to handle non-periodic inputs. We rigorously prove that MFNOs can approximate solutions of path-dependent stochastic differential equations and Lipschitz transformations of fractional Brownian motions to an arbitrary degree of accuracy. Our theoretical analysis builds on Wong–Zakai type theorems and various approximation techniques. Empirically, the MFNO exhibits strong resolution generalization–a property rarely seen in standard architectures such as LSTMs, TCNs, and DeepONet. Furthermore, our model achieves performance that is comparable or superior to these baselines while offering significantly faster sample path generation than classical numerical schemes.

[LG-38] Look the Other Way: Designing Positive Molecules with Negative Data via Task Arithmetic

链接: https://arxiv.org/abs/2507.17876
作者: Rıza Özçelik,Sarah de Ruiter,Francesca Grisoni
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:The scarcity of molecules with desirable properties (i.e., ‘positive’ molecules) is an inherent bottleneck for generative molecule design. To sidestep this obstacle, here we propose molecular task arithmetic: training a model on diverse and abundant negative examples to learn ‘property directions’ – without accessing any positively labeled data – and moving models in the opposite property directions to generate positive molecules. When analyzed on 20 zero-shot design experiments, molecular task arithmetic generated more diverse and successful designs than models trained on positive molecules. Moreover, we employed molecular task arithmetic in dual-objective and few-shot design tasks. We find that molecular task arithmetic can consistently increase the diversity of designs while maintaining desirable design properties. With its simplicity, data efficiency, and performance, molecular task arithmetic bears the potential to become the de facto transfer learning strategy for de novo molecule design.

[LG-39] Wasserstein GAN-Based Precipitation Downscaling with Optimal Transport for Enhancing Perceptual Realism

链接: https://arxiv.org/abs/2507.17798
作者: Kenta Shiraishi,Yuka Muto,Atsushi Okazaki,Shunji Kotsuki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-resolution (HR) precipitation prediction is essential for reducing damage from stationary and localized heavy rainfall; however, HR precipitation forecasting using process-driven numerical weather prediction models remains challenging. This study proposes using a Wasserstein Generative Adversarial Network (WGAN) to perform precipitation downscaling with an optimal transport cost. In contrast to a conventional neural network trained with mean squared error, the WGAN generated visually realistic precipitation fields with fine-scale structures even though the WGAN exhibited slightly lower performance on conventional evaluation metrics. The learned critic of WGAN correlated well with human perceptual realism. Case-based analysis revealed that large discrepancies in critic scores can help identify both unrealistic WGAN outputs and potential artifacts in the reference data. These findings suggest that the WGAN framework not only improves perceptual realism in precipitation downscaling but also offers a new perspective for evaluating and quality-controlling precipitation datasets.

[LG-40] CoCAI: Copula-based Conformal Anomaly Identification for Multivariate Time-Series

链接: https://arxiv.org/abs/2507.17796
作者: Nicholas A. Pearson,Francesca Zanello,Davide Russo,Luca Bortolussi,Francesca Cairoli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for Presentation at Runtime Verification 25

点击查看摘要

Abstract:We propose a novel framework that harnesses the power of generative artificial intelligence and copula-based modeling to address two critical challenges in multivariate time-series analysis: delivering accurate predictions and enabling robust anomaly detection. Our method, Copula-based Conformal Anomaly Identification for Multivariate Time-Series (CoCAI), leverages a diffusion-based model to capture complex dependencies within the data, enabling high quality forecasting. The model’s outputs are further calibrated using a conformal prediction technique, yielding predictive regions which are statistically valid, i.e., cover the true target values with a desired confidence level. Starting from these calibrated forecasts, robust outlier detection is performed by combining dimensionality reduction techniques with copula-based modeling, providing a statistically grounded anomaly score. CoCAI benefits from an offline calibration phase that allows for minimal overhead during deployment and delivers actionable results rooted in established theoretical foundations. Empirical tests conducted on real operational data derived from water distribution and sewerage systems confirm CoCAI’s effectiveness in accurately forecasting target sequences of data and in identifying anomalous segments within them.
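
The split-conformal calibration step CoCAI relies on can be sketched in a few lines (the diffusion forecaster and the copula-based anomaly score are out of scope here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal = 500
y_cal = rng.standard_normal(n_cal)                 # held-out calibration targets
y_hat = y_cal + 0.3 * rng.standard_normal(n_cal)   # stand-in model forecasts

alpha = 0.1                                        # target 90% coverage
scores = np.abs(y_cal - y_hat)                     # nonconformity scores
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

y_new_hat = 0.42                                   # a new point forecast
interval = (y_new_hat - q, y_new_hat + q)          # statistically valid region
print(q, interval)
```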

[LG-41] LSDM: LLM-Enhanced Spatio-temporal Diffusion Model for Service-Level Mobile Traffic Prediction

链接: https://arxiv.org/abs/2507.17795
作者: Shiyuan Zhang,Tong Li,Zhu Xiao,Hongyang Du,Kaibin Huang
类目: Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Service-level mobile traffic prediction for individual users is essential for network efficiency and quality of service enhancement. However, current prediction methods are limited in their adaptability across different urban environments and produce inaccurate results due to the high uncertainty in personal traffic patterns, the lack of detailed environmental context, and the complex dependencies among different network services. These challenges demand advanced modeling techniques that can capture dynamic traffic distributions and rich environmental features. Inspired by the recent success of diffusion models in distribution modeling and Large Language Models (LLMs) in contextual understanding, we propose an LLM-Enhanced Spatio-temporal Diffusion Model (LSDM). LSDM integrates the generative power of diffusion models with the adaptive learning capabilities of transformers, augmented by the ability to capture multimodal environmental information for modeling service-level patterns and dynamics. Extensive evaluations on real-world service-level datasets demonstrate that the model excels in traffic usage predictions, showing outstanding generalization and adaptability. After incorporating contextual information via LLM, the performance improves by at least 2.83% in terms of the coefficient of determination. Compared to models of a similar type, such as CSDI, the root mean squared error can be reduced by at least 8.29%. The code and dataset will be available at: this https URL.

[LG-42] Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains

链接: https://arxiv.org/abs/2507.17792
作者: Jingyi Yu,Tim Pychynski,Marco F. Huber
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired by a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.

[LG-43] Reinforcement Learning for Accelerated Aerodynamic Shape Optimisation

链接: https://arxiv.org/abs/2507.17786
作者: Florian Sobieczky,Alfredo Lopez,Erika Dudkin,Christopher Lackner,Matthias Hochsteger,Bernhard Scheichl,Helmut Sobieczky
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a reinforcement learning (RL) based adaptive optimization algorithm for aerodynamic shape optimization focused on dimensionality reduction. The form in which RL is applied here is that of a surrogate-based, actor-critic policy evaluation MCMC approach allowing for temporal ‘freezing’ of some of the parameters to be optimized. The goals are to minimize computational effort, and to use the observed optimization results for interpretation of the discovered extrema in terms of their role in achieving the desired flow-field. By a sequence of local optimized parameter changes around intermediate CFD simulations acting as ground truth, it is possible to speed up the global optimization if (a) the local neighbourhoods of the parameters in which the changed parameters must reside are sufficiently large to compete with the grid-sized steps and its large number of simulations, and (b) the estimates of the rewards and costs on these neighbourhoods necessary for a good step-wise parameter adaption are sufficiently accurate. We give an example of a simple fluid-dynamical problem on which the method allows interpretation in the sense of a feature importance scoring.

[LG-44] Self-similarity Analysis in Deep Neural Networks

链接: https://arxiv.org/abs/2507.17785
作者: Jingyi Ding,Chengwen Qi,Hongfei Wang,Jianshe Wu,Licheng Jiao,Yuwei Guo,Jian Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current research has found that some deep neural networks exhibit strong hierarchical self-similarity in feature representation or parameter distribution. However, aside from preliminary studies on how the power-law distribution of weights across different training stages affects model performance, there has been no quantitative analysis on how the self-similarity of hidden space geometry influences model weight optimization, nor is there a clear understanding of the dynamic behavior of internal neurons. Therefore, this paper proposes a complex network modeling method based on the output features of hidden-layer neurons to investigate the self-similarity of feature networks constructed at different hidden layers, and analyzes how adjusting the degree of self-similarity in feature networks can enhance the classification performance of deep neural networks. Validated on three types of networks (MLP architectures, convolutional networks, and attention architectures), this study reveals that the degree of self-similarity exhibited by feature networks varies across different model architectures. Furthermore, embedding constraints on the self-similarity of feature networks during the training process can improve the performance of self-similar deep neural networks (MLP architectures and attention architectures) by up to 6 percentage points.

[LG-45] Knowledge Abstraction for Knowledge-based Semantic Communication: A Generative Causality Invariant Approach

链接: https://arxiv.org/abs/2507.17784
作者: Minh-Duong Nguyen,Quoc-Viet Pham,Nguyen H. Tran,Hoang-Khoi Do,Duy T. Ngo,Won-Joo Hwang
类目: Machine Learning (cs.LG)
*备注: 13 pages, 12 figures, 4 tables

点击查看摘要

Abstract:In this study, we design a low-complexity and generalized AI model that can capture common knowledge to improve data reconstruction of the channel decoder for semantic communication. Specifically, we propose a generative adversarial network that leverages causality-invariant learning to extract causal and non-causal representations from the data. Causal representations are invariant and encompass crucial information to identify the data’s label. They can encapsulate semantic knowledge and facilitate effective data reconstruction at the receiver. Moreover, the causal mechanism ensures that learned representations remain consistent across different domains, making the system reliable even with users collecting data from diverse domains. As user-collected data evolves over time causing knowledge divergence among users, we design sparse update protocols to improve the invariant properties of the knowledge while minimizing communication overheads. Three key observations were drawn from our empirical evaluations. Firstly, causality-invariant knowledge ensures consistency across different devices despite the diverse training data. Secondly, invariant knowledge has promising performance in classification tasks, which is pivotal for goal-oriented semantic communications. Thirdly, our knowledge-based data reconstruction highlights the robustness of our decoder, which surpasses other state-of-the-art data reconstruction and semantic compression methods in terms of Peak Signal-to-Noise Ratio (PSNR).

[LG-46] MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

链接: https://arxiv.org/abs/2507.17773
作者: Zhongzhen Wen,Yinghui Zhang,Zhong Li,Zhongxin Liu,Linna Xie,Tian Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at this https URL.

[LG-47] PolyServe: Efficient Multi-SLO Serving at Scale

链接: https://arxiv.org/abs/2507.17769
作者: Kan Zhu,Haiyang Shi,Le Xu,Jiaxin Shan,Arvind Krishnamurthy,Baris Kasikci,Liguang Xie
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advances in Large Language Models (LLMs) have led to a surge of LLM-powered applications. These applications have diverse token-generation latency requirements. As a result, simply classifying workloads as latency-sensitive (LS) or best-effort (BE) overlooks the nuances within the latency-sensitive category and results in suboptimal user experiences and scheduling opportunities. However, efficiently serving requests with multiple SLO requirements poses significant challenges. First, all requests within a batch generate new tokens simultaneously, which can misalign them with their distinct SLO requirements. Moreover, while existing systems focus on auto-scaling for handling various overall request rates, the diversity of SLOs necessitates fine-grained auto-scaling among these SLO tiers. Finally, unlike LS/BE scenarios, where BE requests can be aborted at any time to ensure the SLO attainment of LS requests, those with different latency-sensitive SLOs cannot tolerate prolonged delays, and tail latency must be controlled. To tackle these challenges, we propose PolyServe, a novel multi-SLO scheduling policy at scale that maintains high SLO attainment while maximizing throughput. PolyServe first groups requests into multiple bins based on their per-token latency requirement, then schedules each bin to a subset of the server fleet. PolyServe routes requests to the highest-load but still SLO-attainable server to create a load gradient that facilitates auto-scaling. To increase utilization, PolyServe permits looser-SLO requests to share tighter-SLO instances when their own servers are saturated. PolyServe uses profiling data to guide scheduling decisions and manage tail latency through request-wait-time-aware scheduling, dynamic chunking, and continuous chunked prefill prediction. PolyServe achieves 1.23x goodput gain compared to existing policies, achieving up to 92.5% of optimal goodput.
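
The routing rule itself is easy to sketch. In the toy version below, the per-token latency model and server capacities are invented placeholders standing in for PolyServe's profiling data; only the "highest-load but still SLO-attainable" selection reflects the abstract.

```python
# Toy sketch of SLO-aware routing: send each request to the most-loaded server
# that can still meet its per-token latency SLO, creating a load gradient.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    capacity: int      # concurrent requests before SLO risk (assumed)
    load: int = 0

def predicted_tpot_ms(s: Server) -> float:
    """Assumed latency model: per-token latency grows linearly with load."""
    return 20.0 * (1 + s.load / s.capacity)

def route(servers, slo_ms):
    ok = [s for s in servers if s.load < s.capacity and predicted_tpot_ms(s) <= slo_ms]
    if not ok:
        return None        # reject, or trigger fine-grained auto-scaling
    best = max(ok, key=lambda s: s.load)   # highest load that still attains the SLO
    best.load += 1
    return best

fleet = [Server("a", 10), Server("b", 10)]
for slo in [60.0, 25.0, 25.0, 60.0]:
    s = route(fleet, slo)
    print(f"SLO {slo} ms -> {s.name if s else 'scale up'}")
```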

[LG-48] Incentivised Orchestrated Training Architecture (IOTA): A Technical Primer for Release

链接: https://arxiv.org/abs/2507.17766
作者: Felix Quinque,Alan Aboudib,Szymon Fonau,Rodrigo Lopez Portillo Alcocer,Brian McCrindle,Steffen Cruz
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In August 2024, Bittensor’s Subnet 9 (SN9) demonstrated that a distributed network of incentivized, permissionless actors could each pretrain large language models (LLMs) ranging from 700 million to 14 billion parameters, while surpassing established baselines. While that work validated blockchain-based decentralized pretraining as viable, it contained core issues: (i) every miner had to fit an entire model locally, and (ii) “winner-takes-all” rewards encouraged model hoarding. Here we introduce IOTA (Incentivized Orchestrated Training Architecture), an architecture that addresses these limitations by transforming SN9’s previously isolated competitors into a single cooperating unit that can scale arbitrarily while still rewarding each contributor fairly. Key preliminary results: (1) Data- and Pipeline-parallel SWARM architecture - An orchestrator distributes model layers across heterogeneous miners and streams activations between them, enabling model sizes to scale with the number of participants rather than being constrained by the VRAM of a single machine; (2) Granular, continuous incentives - Validators measure each miner’s contribution and allocate token emissions proportionally; (3) Activation compression - We used model-bottlenecks to cut communication bandwidths of activations by up to 128x, vastly improving training speed; (4) Butterfly All-Reduce - Miners average disjoint parameter slices in O(1) bandwidth, offering linear scalability, redundancy and built-in collusion detection; (5) CLASP (Contribution Loss Assessment via Sampling of Pathways) - A fair attribution scheme assigns credit to miners proportional to their marginal utility and detects exploits, even when contributions are interdependent across the pipeline.
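
Of the five components, the Butterfly All-Reduce is the most self-contained; a toy version is sketched below. The contiguous slicing and simple mean are assumptions for exposition, and the redundancy and collusion-detection mechanics are omitted.

```python
# Toy Butterfly All-Reduce: each miner owns a disjoint parameter slice,
# averages that slice across peers, and shares the result back.
import numpy as np

def butterfly_all_reduce(local_params):
    n, dim = len(local_params), local_params[0].size
    slices = np.array_split(np.arange(dim), n)      # one disjoint slice per miner
    out = [p.copy() for p in local_params]
    for owner, idx in enumerate(slices):
        # Miner `owner` gathers only slice `idx` from every peer (constant
        # per-miner bandwidth as the fleet grows), averages, and rebroadcasts.
        avg = np.mean([p[idx] for p in local_params], axis=0)
        for p in out:
            p[idx] = avg
    return out

params = [np.random.randn(8) for _ in range(4)]
reduced = butterfly_all_reduce(params)
assert all(np.allclose(reduced[0], r) for r in reduced)   # all miners agree
```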

[LG-49] BrisT1D Dataset: Young Adults with Type 1 Diabetes in the UK using Smartwatches

链接: https://arxiv.org/abs/2507.17757
作者: Sam Gordon James,Miranda Elaine Glynis Armstrong,Aisling Ann O’Kane,Harry Emerson,Zahraa S. Abdallah
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 13 pages, 14 figures

点击查看摘要

Abstract:Background: Type 1 diabetes (T1D) has seen a rapid evolution in management technology and forms a useful case study for the future management of other chronic conditions. Further development of this management technology requires an exploration of its real-world use and the potential of additional data streams. To facilitate this, we contribute the BrisT1D Dataset to the growing number of public T1D management datasets. The dataset was developed from a longitudinal study of 24 young adults in the UK who used a smartwatch alongside their usual T1D management. Findings: The BrisT1D dataset features both device data from the T1D management systems and smartwatches used by participants, as well as transcripts of monthly interviews and focus groups conducted during the study. The device data is provided in a processed state, for usability and more rapid analysis, and in a raw state, for in-depth exploration of novel insights captured in the study. Conclusions: This dataset has a range of potential applications. The quantitative elements can support blood glucose prediction, hypoglycaemia prediction, and closed-loop algorithm development. The qualitative elements enable the exploration of user experiences and opinions, as well as broader mixed-methods research into the role of smartwatches in T1D management.

[LG-50] ARBoids: Adaptive Residual Reinforcement Learning With Boids Model for Cooperative Multi-USV Target Defense

链接: https://arxiv.org/abs/2502.18549
作者: Jiyue Tao,Tongsheng Shen,Dexin Zhao,Feitian Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The target defense problem (TDP) for unmanned surface vehicles (USVs) concerns intercepting an adversarial USV before it breaches a designated target region, using one or more defending USVs. A particularly challenging scenario arises when the attacker exhibits superior maneuverability compared to the defenders, significantly complicating effective interception. To tackle this challenge, this letter introduces ARBoids, a novel adaptive residual reinforcement learning framework that integrates deep reinforcement learning (DRL) with the biologically inspired, force-based Boids model. Within this framework, the Boids model serves as a computationally efficient baseline policy for multi-agent coordination, while DRL learns a residual policy to adaptively refine and optimize the defenders’ actions. The proposed approach is validated in a high-fidelity Gazebo simulation environment, demonstrating superior performance over traditional interception strategies, including pure force-based approaches and vanilla DRL policies. Furthermore, the learned policy exhibits strong adaptability to attackers with diverse maneuverability profiles, highlighting its robustness and generalization capability. The code of ARBoids will be released upon acceptance of this letter.
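
The residual composition at the heart of ARBoids is simple to sketch. The Boids force terms, their gains, and the stand-in residual network below are illustrative assumptions, not the paper's exact model; the point is that the DRL policy refines rather than replaces the baseline.

```python
# Minimal sketch of residual RL on top of a Boids baseline.
import numpy as np

def boids_action(pos, teammates, attacker, w=(1.0, 0.5, 1.5)):
    """Force-based baseline: cohesion, separation, and pursuit terms."""
    cohesion   = np.mean(teammates, axis=0) - pos
    nearest    = min(teammates, key=lambda t: np.linalg.norm(t - pos))
    separation = pos - nearest
    pursuit    = attacker - pos
    return w[0] * cohesion + w[1] * separation + w[2] * pursuit

def policy(pos, vel, teammates, attacker, residual_net):
    state = np.concatenate([pos, vel, attacker, np.ravel(teammates)])
    return boids_action(pos, teammates, attacker) + residual_net(state)

# Stand-in for the trained DRL actor producing a small corrective action.
residual_net = lambda s: 0.1 * np.tanh(s[:2])
a = policy(np.zeros(2), np.zeros(2), [np.array([1.0, 0.0])], np.array([3.0, 4.0]), residual_net)
print(a)
```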

[LG-51] Hybrid quantum-classical algorithm for near-optimal planning in POMDPs

链接: https://arxiv.org/abs/2507.18606
作者: Gilberto Cunha,Alexandra Ramôa,André Sequeira,Michael de Oliveira,Luís Barbosa
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) provides a principled framework for decision-making in partially observable environments, which can be modeled as Markov decision processes and compactly represented through dynamic decision Bayesian networks. Recent advances demonstrate that inference on sparse Bayesian networks can be accelerated using quantum rejection sampling combined with amplitude amplification, leading to a computational speedup in estimating acceptance probabilities. Building on this result, we introduce Quantum Bayesian Reinforcement Learning (QBRL), a hybrid quantum-classical look-ahead algorithm for model-based RL in partially observable environments. We present a rigorous, oracle-free time complexity analysis under fault-tolerant assumptions for the quantum device. Unlike standard treatments that assume a black-box oracle, we explicitly specify the inference process, allowing our bounds to more accurately reflect the true computational cost. We show that, for environments whose dynamics form a sparse Bayesian network, horizon-based near-optimal planning can be achieved sub-quadratically faster through quantum-enhanced belief updates. Furthermore, we present numerical experiments benchmarking QBRL against its classical counterpart on simple yet illustrative decision-making tasks. Our results offer a detailed analysis of how the quantum computational advantage translates into decision-making performance, highlighting that the magnitude of the advantage can vary significantly across different deployment settings.

[LG-52] Deep Variational Free Energy Calculation of Hydrogen Hugoniot

链接: https://arxiv.org/abs/2507.18540
作者: Zihang Li,Hao Xie,Xinyang Dong,Lei Wang
类目: rongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 7+17 pages, 5+14 figures, for source code and raw data, see this https URL

点击查看摘要

Abstract:We develop a deep variational free energy framework to compute the equation of state of hydrogen in the warm dense matter region. This method parameterizes the variational density matrix of hydrogen nuclei and electrons at finite temperature using three deep generative models: a normalizing flow model that represents the Boltzmann distribution of the classical nuclei, an autoregressive transformer that models the distribution of electrons in excited states, and a permutational equivariant flow model that constructs backflow coordinates for electrons in Hartree-Fock orbitals. By jointly optimizing the three neural networks to minimize the variational free energy, we obtain the equation of state and related thermodynamic properties of dense hydrogen. We compare our results with other theoretical and experimental results on the deuterium Hugoniot curve, aiming to resolve existing discrepancies. The calculated results provide a valuable benchmark for deuterium in the warm dense matter region.

[LG-53] Euclidean Distance Deflation Under High-Dimensional Heteroskedastic Noise

链接: https://arxiv.org/abs/2507.18520
作者: Keyi Li,Yuval Kluger,Boris Landa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Pairwise Euclidean distance calculation is a fundamental step in many machine learning and data analysis algorithms. In real-world applications, however, these distances are frequently distorted by heteroskedastic noise, a prevalent form of inhomogeneous corruption characterized by variable noise magnitudes across data observations. Such noise inflates the computed distances in a nontrivial way, leading to misrepresentations of the underlying data geometry. In this work, we address the tasks of estimating the noise magnitudes per observation and correcting the pairwise Euclidean distances under heteroskedastic noise. Perhaps surprisingly, we show that in general high-dimensional settings and without assuming prior knowledge on the clean data structure or noise distribution, both tasks can be performed reliably, even when the noise levels vary considerably. Specifically, we develop a principled, hyperparameter-free approach that jointly estimates the noise magnitudes and corrects the distances. We provide theoretical guarantees for our approach, establishing probabilistic bounds on the estimation errors of both noise magnitudes and distances. These bounds, measured in the normalized \ell_1 norm, converge to zero at polynomial rates as both feature dimension and dataset size increase. Experiments on synthetic datasets demonstrate that our method accurately estimates distances in challenging regimes, significantly improving the robustness of subsequent distance-based computations. Notably, when applied to single-cell RNA sequencing data, our method yields noise magnitude estimates consistent with an established prototypical model, enabling accurate nearest neighbor identification that is fundamental to many downstream analyses.
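
The correction step itself reduces to subtracting each point's estimated noise energy from the squared distances; estimating those energies without hyperparameters is the paper's actual contribution and is not reproduced here. The sketch below uses oracle noise magnitudes just to show the deflation identity.

```python
# Deflation sketch: for independent noise, E||y_i - y_j||^2 = ||x_i - x_j||^2
# + eps_i + eps_j, so subtracting eps_i + eps_j restores the clean geometry.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def deflate_distances(Y, eps):
    """Y: (n, d) noisy observations; eps: (n,) estimates of E||noise_i||^2."""
    d2 = squareform(pdist(Y, "sqeuclidean")) - eps[:, None] - eps[None, :]
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(np.clip(d2, 0.0, None))   # clip guards finite-sample negatives

n, d = 100, 2000
X = np.random.randn(n, d)                           # clean data
sigma = np.random.uniform(0.5, 2.0, size=n)         # heteroskedastic noise levels
Y = X + sigma[:, None] * np.random.randn(n, d)
eps = d * sigma**2                                  # oracle magnitudes (Gaussian noise)
print("mean error:", np.abs(deflate_distances(Y, eps) - squareform(pdist(X))).mean())
```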

[LG-54] DriftMoE: A Mixture of Experts Approach to Handle Concept Drifts ECML KDD2025

链接: https://arxiv.org/abs/2507.18464
作者: Miguel Aspis,Sebastián A. Cajas Ordónez,Andrés L. Suárez-Cetrulo,Ricardo Simón Carbajo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at the SYNDAiTE@ECMLPKDD 2025 workshop

点击查看摘要

Abstract:Learning from non-stationary data streams subject to concept drift requires models that can adapt on-the-fly while remaining resource-efficient. Existing adaptive ensemble methods often rely on coarse-grained adaptation mechanisms or simple voting schemes that fail to optimally leverage specialized knowledge. This paper introduces DriftMoE, an online Mixture-of-Experts (MoE) architecture that addresses these limitations through a novel co-training framework. DriftMoE features a compact neural router that is co-trained alongside a pool of incremental Hoeffding tree experts. The key innovation lies in a symbiotic learning loop that enables expert specialization: the router selects the most suitable expert for prediction, the relevant experts update incrementally with the true label, and the router refines its parameters using a multi-hot correctness mask that reinforces every accurate expert. This feedback loop provides the router with a clear training signal while accelerating expert specialization. We evaluate DriftMoE’s performance across nine state-of-the-art data stream learning benchmarks spanning abrupt, gradual, and real-world drifts, testing two distinct configurations: one where experts specialize on data regimes (multi-class variant), and another where they focus on single-class specialization (task-based variant). Our results demonstrate that DriftMoE achieves competitive results with state-of-the-art stream learning adaptive ensembles, offering a principled and efficient approach to concept drift adaptation. All code, data pipelines, and reproducibility scripts are available in our public GitHub repository: this https URL.
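
The symbiotic loop is compact enough to sketch. Below, sklearn SGD classifiers stand in for the incremental Hoeffding trees, a linear multi-label head stands in for the neural router, and every expert is updated each step to keep the sketch short (the paper updates the relevant experts); all of these are simplifying assumptions.

```python
# Sketch of DriftMoE's co-training loop: route, predict, build a multi-hot
# correctness mask, update the router, and update the experts incrementally.
import numpy as np
from sklearn.linear_model import SGDClassifier

n_experts, n_feat, classes = 3, 10, np.array([0, 1])
experts = [SGDClassifier(loss="log_loss") for _ in range(n_experts)]
W = np.zeros((n_feat, n_experts))            # linear router, one head per expert
lr, sigmoid = 0.1, lambda z: 1.0 / (1.0 + np.exp(-z))

def step(x, y, warm):
    scores = sigmoid(x @ W)                  # router's P(expert e is correct | x)
    chosen = int(np.argmax(scores))
    preds = [int(e.predict(x[None])[0]) if warm else -1 for e in experts]
    mask = np.array([p == y for p in preds], dtype=float)   # multi-hot correctness
    W[:] += lr * np.outer(x, mask - scores)  # per-head logistic update on the mask
    for e in experts:
        e.partial_fit(x[None], [y], classes=classes)
    return preds[chosen]

rng = np.random.default_rng(0)
for t in range(1000):
    x = rng.normal(size=n_feat)
    y = int(x[0] + x[1] > 0)                 # toy stream (drift omitted for brevity)
    step(x, y, warm=t > 0)
```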

[LG-55] On Reconstructing Training Data From Bayesian Posteriors and Trained Models

链接: https://arxiv.org/abs/2507.18372
作者: George Wynne
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Publicly releasing the specification of a model with its trained parameters means an adversary can attempt to reconstruct information about the training data via training data reconstruction attacks, a major vulnerability of modern machine learning methods. This paper makes three primary contributions: establishing a mathematical framework to express the problem, characterising the features of the training data that are vulnerable via a maximum mean discrepancy equivalence, and outlining a score matching framework for reconstructing data in both Bayesian and non-Bayesian models; the former is a first in the literature.

[LG-56] Hierarchical Dimensionless Learning (Hi-π): A physics-data hybrid-driven approach for discovering dimensionless parameter combinations

链接: https://arxiv.org/abs/2507.18332
作者: Mingkun Xia,Haitao Lin,Weiwei Zhang
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Dimensional analysis provides a universal framework for reducing physical complexity and revealing inherent laws. However, its application to high-dimensional systems still generates redundant dimensionless parameters, making it challenging to establish physically meaningful descriptions. Here, we introduce Hierarchical Dimensionless Learning (Hi-π), a physics-data hybrid-driven method that combines dimensional analysis and symbolic regression to automatically discover key dimensionless parameter combination(s). We applied this method to classic examples in various research fields of fluid mechanics. For the Rayleigh-Bénard convection, this method accurately extracted two intrinsic dimensionless parameters: the Rayleigh number and the Prandtl number, validating its unified representation advantage across multiscale data. For the viscous flows in a circular pipe, the method automatically discovers two optimal dimensionless parameters: the Reynolds number and relative roughness, achieving a balance between accuracy and complexity. For the compressibility correction in subsonic flow, the method effectively extracts the classic compressibility correction formulation, while demonstrating its capability to discover hierarchical structural expressions through optimal parameter transformations.

[LG-57] A Two-armed Bandit Framework for A/B Testing

链接: https://arxiv.org/abs/2507.18118
作者: Jinjuan Wang,Qianglin Wen,Yu Zhang,Xiaodong Yan,Chengchun Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:A/B testing is widely used in modern technology companies for policy evaluation and product deployment, with the goal of comparing the outcomes under a newly-developed policy against a standard control. Various causal inference and reinforcement learning methods developed in the literature are applicable to A/B testing. This paper introduces a two-armed bandit framework designed to improve the power of existing approaches. The proposed procedure consists of three main steps: (i) employing doubly robust estimation to generate pseudo-outcomes, (ii) utilizing a two-armed bandit framework to construct the test statistic, and (iii) applying a permutation-based method to compute the p-value. We demonstrate the efficacy of the proposed method through asymptotic theories, numerical experiments and real-world data from a ridesharing company, showing its superior performance in comparison to existing methods.
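
The three steps compose naturally in code. In the sketch below, the doubly robust (AIPW) pseudo-outcomes and the permutation scheme follow standard constructions; the simple mean-difference statistic is a stand-in for the paper's bandit-based test statistic.

```python
# Sketch: (i) doubly robust pseudo-outcomes, (ii) a contrast statistic,
# (iii) a permutation p-value under the sharp null of no treatment effect.
import numpy as np

def aipw_contrast(y, a, mu1, mu0, e):
    phi1 = mu1 + a * (y - mu1) / e             # arm-1 pseudo-outcomes
    phi0 = mu0 + (1 - a) * (y - mu0) / (1 - e) # arm-0 pseudo-outcomes
    return phi1.mean() - phi0.mean()

def permutation_pvalue(y, a, mu1, mu0, e, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    obs = aipw_contrast(y, a, mu1, mu0, e)
    null = [aipw_contrast(y, rng.permutation(a), mu1, mu0, e) for _ in range(n_perm)]
    return float(np.mean(np.abs(null) >= abs(obs)))

rng = np.random.default_rng(1)
n = 500
a = rng.integers(0, 2, n)                      # randomized A/B split, e = 0.5
y = 0.3 * a + rng.normal(size=n)               # true effect 0.3
mu1, mu0 = np.full(n, y[a == 1].mean()), np.full(n, y[a == 0].mean())
print("p-value:", permutation_pvalue(y, a, mu1, mu0, e=0.5))
```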

[LG-58] Nonconvex Optimization Framework for Group-Sparse Feedback Linear-Quadratic Optimal Control I: Penalty Approach

链接: https://arxiv.org/abs/2507.18114
作者: Lechen Feng,Xun Li,Yuan-Hua Ni
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper develops a unified nonconvex optimization framework for the design of group-sparse feedback controllers in infinite-horizon linear-quadratic (LQ) problems. We address two prominent extensions of the classical LQ problem: the distributed LQ problem with fixed communication topology (DFT-LQ) and the sparse feedback LQ problem (SF-LQ), both of which are motivated by the need for scalable and structure-aware control in large-scale systems. Unlike existing approaches that rely on convex relaxations or are limited to block-diagonal structures, we directly formulate the controller synthesis as a finite-dimensional nonconvex optimization problem with group \ell_0-norm regularization, capturing general sparsity patterns. We establish a connection between DFT-LQ and SF-LQ problems, showing that both can be addressed within our unified framework. Furthermore, we propose a penalty-based proximal alternating linearized minimization (PALM) algorithm and provide a rigorous convergence analysis under mild assumptions, overcoming the lack of coercivity in the objective function. The proposed method admits efficient solvers for all subproblems and guarantees global convergence to critical points. Our results fill a key gap in the literature by enabling the direct design of group-sparse feedback gains with theoretical guarantees, without resorting to convex surrogates or restrictive structural assumptions.

[LG-59] Zeroth-order log-concave sampling

链接: https://arxiv.org/abs/2507.18021
作者: Yunbum Kook
类目: atistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Functional Analysis (math.FA); Probability (math.PR)
*备注: 30 pages

点击查看摘要

Abstract:We study the zeroth-order query complexity of log-concave sampling, specifically uniform sampling from convex bodies using membership oracles. We propose a simple variant of the proximal sampler that achieves the query complexity with matched Rényi orders between the initial warmness and output guarantee. Specifically, for any \varepsilon > 0 and q \geq 2, the sampler, initialized at \pi_0, outputs a sample whose law is \varepsilon-close in q-Rényi divergence to \pi, the uniform distribution over a convex body in \mathbb{R}^d, using \widetilde{O}(q M_q^{q/(q-1)} d^2 \lVert \operatorname{cov} \pi \rVert \log\frac{1}{\varepsilon}) membership queries, where M_q = \lVert \mathrm{d}\pi_0/\mathrm{d}\pi \rVert_{L^q(\pi)}. We further introduce a simple annealing scheme that produces a warm start in q-Rényi divergence (i.e., M_q = O(1)) using \widetilde{O}(q d^2 R^{3/2} \lVert \operatorname{cov} \pi \rVert^{1/4}) queries, where R^2 = \mathbb{E}_\pi[\lVert \cdot \rVert^2]. This interpolates between known complexities for warm-start generation in total variation and Rényi-infinity divergence. To relay a Rényi warmness across the annealing scheme, we establish hypercontractivity under simultaneous heat flow and translate it into an improved mixing guarantee for the proximal sampler under a logarithmic Sobolev inequality. These results extend naturally to general log-concave distributions accessible via evaluation oracles, incurring additional quadratic queries.

[LG-60] Machine Learning Workflow for Analysis of High-Dimensional Order Parameter Space: A Case Study of Polymer Crystallization from Molecular Dynamics Simulations

链接: https://arxiv.org/abs/2507.17980
作者: Elyar Tourani,Brian J. Edwards,Bamin Khomami
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 30 pages, 8 figures, 1 table

点击查看摘要

Abstract:Currently, identification of crystallization pathways in polymers is being carried out using molecular simulation-based data on a preset cut-off point on a single order parameter (OP) to define nucleated or crystallized regions. Aside from sensitivity to cut-off, each of these OPs introduces its own systematic biases. In this study, an integrated machine learning workflow is presented to accurately quantify crystallinity in polymeric systems using atomistic molecular dynamics data. Each atom is represented by a high-dimensional feature vector that combines geometric, thermodynamic-like, and symmetry-based descriptors. Low dimensional embeddings are employed to expose latent structural fingerprints within atomic environments. Subsequently, unsupervised clustering on the embeddings identified crystalline and amorphous atoms with high fidelity. After generating high quality labels with multidimensional data, we use supervised learning techniques to identify a minimal set of order parameters that can fully capture this label. Various tests were conducted to reduce the feature set, demonstrating that using only three order parameters is sufficient to recreate the crystallization labels. Based on these observed OPs, the crystallinity index (C-index) is defined as the logistic regression model’s probability of crystallinity, remaining bimodal throughout the process and achieving over 0.98 classification performance (AUC). Notably, a model trained on one or a few snapshots enables efficient on-the-fly computation of crystallinity. Lastly, we demonstrate how the optimal C-index fit evolves during various stages of crystallization, supporting the hypothesis that entropy dominates early nucleation, while symmetry gains relevance later. This workflow provides a data-driven strategy for OP selection and a metric to monitor structural transformations in large-scale polymer simulations.
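
The back half of the workflow (cluster-derived labels, then a small supervised model whose probability is the C-index) can be sketched directly. The descriptors, the choice of PCA and KMeans, and the three retained columns below are placeholders for the paper's feature set, embedding, and selected order parameters.

```python
# Sketch: embed per-atom descriptors, cluster into crystalline/amorphous
# labels, then read the C-index off a logistic regression on a few OPs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = np.random.randn(5000, 12)                      # stand-in per-atom descriptors
emb = PCA(n_components=3).fit_transform(X)         # low-dimensional embedding
labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb)

clf = LogisticRegression().fit(X[:, :3], labels)   # 3 OPs as placeholder columns
c_index = clf.predict_proba(X[:, :3])[:, 1]        # per-atom crystallinity in [0, 1]
print("AUC:", roc_auc_score(labels, c_index))
```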

[LG-61] Quantum Machine Learning Playground

链接: https://arxiv.org/abs/2507.17931
作者: Pascal Debus,Sebastian Issel,Kilian Tscharke
类目: Quantum Physics (quant-ph); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted to IEEE Computer Graphics and Applications. Final version: this https URL

点击查看摘要

Abstract:This article introduces an innovative interactive visualization tool designed to demystify quantum machine learning (QML) algorithms. Our work is inspired by the success of classical machine learning visualization tools, such as TensorFlow Playground, and aims to bridge the gap in visualization resources specifically for the field of QML. The article includes a comprehensive overview of relevant visualization metaphors from both quantum computing and classical machine learning, the development of an algorithm visualization concept, and the design of a concrete implementation as an interactive web application. By combining common visualization metaphors for the so-called data re-uploading universal quantum classifier as a representative QML model, this article aims to lower the entry barrier to quantum computing and encourage further innovation in the field. The accompanying interactive application is a proposal for the first version of a quantum machine learning playground for learning and exploring QML models.

[LG-62] Sliding Window Informative Canonical Correlation Analysis

链接: https://arxiv.org/abs/2507.17921
作者: Arvind Prasadan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
*备注: 22 pages, submitted

点击查看摘要

Abstract:Canonical correlation analysis (CCA) is a technique for finding correlated sets of features between two datasets. In this paper, we propose a novel extension of CCA to the online, streaming data setting: Sliding Window Informative Canonical Correlation Analysis (SWICCA). Our method uses a streaming principal component analysis (PCA) algorithm as a backend and uses these outputs combined with a small sliding window of samples to estimate the CCA components in real time. We motivate and describe our algorithm, provide numerical simulations to characterize its performance, and provide a theoretical performance guarantee. The SWICCA method is applicable and scalable to extremely high dimensions, and we provide a real-data example that demonstrates this capability.
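
A compressed version of the idea: run a streaming PCA as the backend and re-estimate canonical correlations from a short window of projected samples. The windowed CCA below is the textbook SVD-of-whitened-cross-covariance formulation; using sklearn's IncrementalPCA as the streaming backend is a simplifying assumption, not SWICCA's exact algorithm.

```python
# SWICCA-style sketch: streaming PCA backend + CCA on a small sliding window.
import numpy as np
from sklearn.decomposition import IncrementalPCA

def windowed_cca(Xw, Yw, reg=1e-6):
    Xc, Yc = Xw - Xw.mean(0), Yw - Yw.mean(0)
    Cxx = Xc.T @ Xc / len(Xw) + reg * np.eye(Xw.shape[1])
    Cyy = Yc.T @ Yc / len(Yw) + reg * np.eye(Yw.shape[1])
    Cxy = Xc.T @ Yc / len(Xw)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening transforms
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)[0]  # top canonical corr.

rng = np.random.default_rng(1)
ipca_x, ipca_y = IncrementalPCA(n_components=4), IncrementalPCA(n_components=4)
win_x, win_y = [], []
for t in range(50):                                # stream arrives in mini-batches
    Z = 3.0 * rng.normal(size=(32, 1))             # shared latent signal
    Xb = np.hstack([Z, rng.normal(size=(32, 19))])
    Yb = np.hstack([Z, rng.normal(size=(32, 29))])
    ipca_x.partial_fit(Xb); ipca_y.partial_fit(Yb)
    win_x = (win_x + list(Xb))[-128:]              # small sliding window
    win_y = (win_y + list(Yb))[-128:]
rho = windowed_cca(ipca_x.transform(np.array(win_x)), ipca_y.transform(np.array(win_y)))
print("top canonical correlation:", round(float(rho), 3))
```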

[LG-63] A Supervised Machine Learning Framework for Multipactor Breakdown Prediction in High-Power Radio Frequency Devices and Accelerator Components: A Case Study in Planar Geometry

链接: https://arxiv.org/abs/2507.17881
作者: Asif Iqbal,John Verboncoeur,Peng Zhang
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Plasma Physics (physics.plasm-ph)
*备注:

点击查看摘要

Abstract:Multipactor is a nonlinear electron avalanche phenomenon that can severely impair the performance of high-power radio frequency (RF) devices and accelerator systems. Accurate prediction of multipactor susceptibility across different materials and operational regimes remains a critical yet computationally intensive challenge in accelerator component design and RF engineering. This study presents the first application of supervised machine learning (ML) for predicting multipactor susceptibility in two-surface planar geometries. A simulation-derived dataset spanning six distinct secondary electron yield (SEY) material profiles is used to train regression models - including Random Forest (RF), Extra Trees (ET), Extreme Gradient Boosting (XGBoost), and funnel-structured Multilayer Perceptrons (MLPs) - to predict the time-averaged electron growth rate, \delta_{\mathrm{avg}}. Performance is evaluated using Intersection over Union (IoU), Structural Similarity Index (SSIM), and Pearson correlation coefficient. Tree-based models consistently outperform MLPs in generalizing across disjoint material domains. MLPs trained using a scalarized objective function that combines IoU and SSIM during Bayesian hyperparameter optimization with 5-fold cross-validation outperform those trained with single-objective loss functions. Principal Component Analysis reveals that performance degradation for certain materials stems from disjoint feature-space distributions, underscoring the need for broader dataset coverage. This study demonstrates both the promise and limitations of ML-based multipactor prediction and lays the groundwork for accelerated, data-driven modeling in advanced RF and accelerator system design.

[LG-64] On the Energy Distribution of the Galactic Center Excess Sources

链接: https://arxiv.org/abs/2507.17804
作者: Florian List,Yujin Park,Nicholas L. Rodd,Eve Schoen,Florian Wolf
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 7+20 pages, 2+20 figures, comments welcome

点击查看摘要

Abstract:The Galactic Center Excess (GCE) remains one of the defining mysteries uncovered by the Fermi \gamma-ray Space Telescope. Although it may yet herald the discovery of annihilating dark matter, weighing against that conclusion are analyses showing the spatial structure of the emission appears more consistent with a population of dim point sources. Technical limitations have restricted prior analyses to studying the point-source hypothesis purely spatially. All spectral information that could help disentangle the GCE from the complex and uncertain astrophysical emission was discarded. We demonstrate that a neural network-aided simulation-based inference approach can overcome such limitations and thereby confront the point source explanation of the GCE with spatial and spectral data. The addition is profound: energy information drives the putative point sources to be significantly dimmer, indicating either the GCE is truly diffuse in nature or made of an exceptionally large number of sources. Quantitatively, for our best fit background model, the excess is essentially consistent with Poisson emission as predicted by dark matter. If the excess is instead due to point sources, our median prediction is \mathcal{O}(10^5) sources in the Galactic Center, or more than 35,000 sources at 90% confidence, both significantly larger than the hundreds of sources preferred by earlier point-source analyses of the GCE.

[LG-65] A Concept-based approach to Voice Disorder Detection

链接: https://arxiv.org/abs/2507.17799
作者: Davide Ghia,Gabriele Ciravegna,Alkis Koudounas,Marco Fantini,Erika Crosetti,Giovanni Succo,Tania Cerquitelli
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Voice disorders affect a significant portion of the population, and the ability to diagnose them using automated, non-invasive techniques would represent a substantial advancement in healthcare, improving the quality of life of patients. Recent studies have demonstrated that artificial intelligence models, particularly Deep Neural Networks (DNNs), can effectively address this task. However, due to their complexity, the decision-making process of such models often remains opaque, limiting their trustworthiness in clinical contexts. This paper investigates an alternative approach based on Explainable AI (XAI), a field that aims to improve the interpretability of DNNs by providing different forms of explanations. Specifically, this work focuses on concept-based models such as Concept Bottleneck Model (CBM) and Concept Embedding Model (CEM) and how they can achieve performance comparable to traditional deep learning methods, while offering a more transparent and interpretable decision framework.

[LG-66] CM-UNet: A Self-Supervised Learning-Based Model for Coronary Artery Segmentation in X-Ray Angiography

链接: https://arxiv.org/abs/2507.17779
作者: Camille Challier,Xiaowu Sun,Thabo Mahendiran,Ortal Senouf,Bernard De Bruyne,Denise Auberson,Olivier Müller,Stephane Fournier,Pascal Frossard,Emmanuel Abbé,Dorina Thanou
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: IEEE EMBC 2025, 7 pages, 6 figures

点击查看摘要

Abstract:Accurate segmentation of coronary arteries remains a significant challenge in clinical practice, hindering the ability to effectively diagnose and manage coronary artery disease. The lack of large, annotated datasets for model training exacerbates this issue, limiting the development of automated tools that could assist radiologists. To address this, we introduce CM-UNet, which leverages self-supervised pre-training on unannotated datasets and transfer learning on limited annotated data, enabling accurate disease detection while minimizing the need for extensive manual annotations. Fine-tuning CM-UNet with only 18 annotated images instead of 500 resulted in a 15.2% decrease in Dice score, compared to a 46.5% drop in baseline models without pre-training. This demonstrates that self-supervised learning can enhance segmentation performance and reduce dependence on large datasets. This is one of the first studies to highlight the importance of self-supervised learning in improving coronary artery segmentation from X-ray angiography, with potential implications for advancing diagnostic accuracy in clinical practice. By enhancing segmentation accuracy in X-ray angiography images, the proposed approach aims to improve clinical workflows, reduce radiologists’ workload, and accelerate disease detection, ultimately contributing to better patient outcomes. The source code is publicly available at this https URL.

[LG-67] Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis

链接: https://arxiv.org/abs/2507.16641
作者: Sara Giordano,Kornikar Sen,Miguel A. Martin-Delgado
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures, color figures

点击查看摘要

Abstract:A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the NISQ era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. By leveraging sparse matrix representations and state-space discretization, the method enables scalable navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set for arbitrary quantum states, it still produces minimal depth circuits, highlighting the algorithm’s robustness and adaptability. The results confirm that this RL-driven approach efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.
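
A stripped-down version of the core loop, on two qubits, is sketched below. The gate set, the state discretization by rounding amplitudes, and the penalty weights in the hybrid reward are illustrative assumptions; the paper's environments and sparse-matrix machinery are omitted.

```python
# Tabular Q-learning sketch for circuit synthesis with a hybrid reward:
# a static fidelity term plus dynamic penalties for revisits and depth.
import numpy as np
from collections import defaultdict

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
OPS = {"H0": np.kron(H, np.eye(2)), "H1": np.kron(np.eye(2), H),
       "CZ": np.diag([1.0, 1, 1, -1])}
ACTIONS = list(OPS)
TARGET = OPS["CZ"] @ OPS["H1"] @ OPS["H0"] @ np.array([1.0, 0, 0, 0])  # 2-qubit graph state

Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
alpha, gamma, eps = 0.3, 0.95, 0.2
key = lambda psi: tuple(np.round(psi, 3))          # discretized state space
fid = lambda psi: abs(np.vdot(TARGET, psi)) ** 2

rng = np.random.default_rng(0)
for ep in range(2000):
    psi, visited = np.array([1.0, 0, 0, 0]), set()
    for depth in range(5):
        s = key(psi)
        a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(np.argmax(Q[s]))
        psi2 = OPS[ACTIONS[a]] @ psi
        f = fid(psi2)
        # Hybrid reward: static fidelity term, revisit penalty, depth penalty.
        r = (10.0 if f > 0.99 else f) - 0.5 * (key(psi2) in visited) - 0.05 * depth
        Q[s][a] += alpha * (r + gamma * np.max(Q[key(psi2)]) - Q[s][a])
        visited.add(s); psi = psi2
        if f > 0.99:
            break

psi, circuit = np.array([1.0, 0, 0, 0]), []        # greedy rollout of the learned policy
for _ in range(5):
    g = ACTIONS[int(np.argmax(Q[key(psi)]))]
    circuit.append(g); psi = OPS[g] @ psi
    if fid(psi) > 0.99:
        break
print("greedy circuit:", circuit, "fidelity:", round(fid(psi), 3))
```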

信息检索

[IR-0] Transform Before You Query: A Privacy-Preserving Approach for Vector Retrieval with Embedding Space Alignment

链接: https://arxiv.org/abs/2507.18518
作者: Ruiqi He,Zekun Fei,Jiaqi Li,Xinyuan Zhu,Biao Yi,Siyi Lv,Weijie Liu,Zheli Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Vector Database (VDB) can efficiently index and search high-dimensional vector embeddings from unstructured data, crucially enabling fast semantic similarity search essential for modern AI applications like generative AI and recommendation systems. Since current VDB service providers predominantly use proprietary black-box models, users are forced to expose raw query text to them via API in exchange for the vector retrieval services. Consequently, if query text involves confidential records from finance or healthcare domains, this mechanism inevitably leads to critical leakage of user’s sensitive information. To address this issue, we introduce STEER (Secure Transformed Embedding vEctor Retrieval), a private vector retrieval framework that leverages the alignment relationship between the semantic spaces of different embedding models to derive approximate embeddings for the query text. STEER performs the retrieval using the approximate embeddings within the original VDB and requires no modifications to the server side. Our theoretical and experimental analyses demonstrate that STEER effectively safeguards query text privacy while maintaining the retrieval accuracy. Even though approximate embeddings are approximations of the embeddings from proprietary models, they still prevent the providers from recovering the query text through Embedding Inversion Attacks (EIAs). Extensive experimental results show that the Recall@100 of STEER typically decreases by less than 5%. Furthermore, even when searching within a text corpus of millions of entries, STEER achieves a Recall@20 accuracy 20% higher than current baselines.
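
The alignment step is essentially a regression between two embedding spaces fit on public anchor texts. The sketch below uses a closed-form least-squares map and random stand-in embeddings; the dimensions and the cosine normalization are assumptions, not STEER's exact procedure.

```python
# Sketch: fit a linear map from a local embedding space to the provider's
# space on public anchors, then query the VDB with the approximate embedding
# so the raw query text never leaves the client.
import numpy as np

def fit_alignment(local_E, remote_E):
    """Least-squares W minimizing ||local_E @ W - remote_E||_F, given paired
    embeddings of the same public anchor texts in both spaces."""
    W, *_ = np.linalg.lstsq(local_E, remote_E, rcond=None)
    return W

def private_query_embedding(local_vec, W):
    approx = local_vec @ W
    return approx / np.linalg.norm(approx)    # cosine-search convention

local_anchors = np.random.randn(1000, 384)    # e.g., a small open-source encoder
remote_anchors = np.random.randn(1000, 1536)  # provider embeddings of the same public texts
W = fit_alignment(local_anchors, remote_anchors)
q = private_query_embedding(np.random.randn(384), W)
print(q.shape)   # send this vector to the VDB instead of the query text
```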

[IR-1] The Best is Yet to Come: Graph Convolution in the Testing Phase for Multimodal Recommendation

链接: https://arxiv.org/abs/2507.18489
作者: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Edith C. H. Ngai
类目: Information Retrieval (cs.IR)
*备注: Accepted by MM 2025

点击查看摘要

Abstract:The efficiency and scalability of graph convolution networks (GCNs) in training recommender systems remain critical challenges, hindering their practical deployment in real-world scenarios. In the multimodal recommendation (MMRec) field, training GCNs incurs higher time and space costs and exacerbates the gap between different modalities, resulting in sub-optimal recommendation accuracy. This paper critically points out the inherent challenges associated with adopting GCNs during the training phase in MMRec, revealing that GCNs inevitably create unhelpful and even harmful pairs during model optimization and isolate different modalities. To this end, we propose FastMMRec, a highly efficient multimodal recommendation framework that deploys graph convolutions exclusively during the testing phase, bypassing their use in training. We demonstrate that adopting GCNs solely in the testing phase significantly improves the model’s efficiency and scalability while alleviating the modality isolation problem often caused by using GCNs during the training phase. We conduct extensive experiments on three public datasets, consistently demonstrating the performance superiority of FastMMRec over competitive baselines while achieving efficiency and scalability.
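
The central move, deferring graph convolution to inference, looks roughly like the following. The LightGCN-style symmetric normalization and propagation depth K are conventional assumptions; the trained embeddings would come from FastMMRec's graph-free training.

```python
# Sketch: train embeddings without a GCN, then propagate them over the
# normalized interaction graph only at test time before scoring.
import numpy as np

def test_time_propagate(E, A, K=2):
    """E: (n_nodes, d) trained embeddings; A: (n_nodes, n_nodes) adjacency."""
    deg = np.maximum(A.sum(1), 1.0)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    out = E.copy()
    for _ in range(K):
        out = A_hat @ out          # no gradients needed: pure inference
    return out

n_users, n_items, d = 4, 6, 8
R = (np.random.rand(n_users, n_items) > 0.6).astype(float)  # user-item interactions
A = np.block([[np.zeros((n_users, n_users)), R],
              [R.T, np.zeros((n_items, n_items))]])
E = np.random.randn(n_users + n_items, d)     # embeddings from graph-free training
Z = test_time_propagate(E, A)
scores = Z[:n_users] @ Z[n_users:].T          # user-item scores for ranking
print(scores.shape)
```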

[IR-2] How Well Do LLMs Predict Prerequisite Skills? Zero-Shot Comparison to Expert-Defined Concepts

链接: https://arxiv.org/abs/2507.18479
作者: Ngoc Luyen Le,Marie-Hélène Abel
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Prerequisite skills - foundational competencies required before mastering more advanced concepts - are important for supporting effective learning, assessment, and skill-gap analysis. Traditionally curated by domain experts, these relationships are costly to maintain and difficult to scale. This paper investigates whether large language models (LLMs) can predict prerequisite skills in a zero-shot setting, using only natural language descriptions and without task-specific fine-tuning. We introduce ESCO-PrereqSkill, a benchmark dataset constructed from the ESCO taxonomy, comprising 3,196 skills and their expert-defined prerequisite links. Using a standardized prompting strategy, we evaluate 13 state-of-the-art LLMs, including GPT-4, Claude 3, Gemini, LLaMA 4, Qwen2, and DeepSeek, across semantic similarity, BERTScore, and inference latency. Our results show that models such as LLaMA4-Maverick, Claude-3-7-Sonnet, and Qwen2-72B generate predictions that closely align with expert ground truth, demonstrating strong semantic reasoning without supervision. These findings highlight the potential of LLMs to support scalable prerequisite skill modeling for applications in personalized learning, intelligent tutoring, and skill-based recommender systems.
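
The evaluation protocol is straightforward to sketch: prompt for prerequisites, then score the generation against the expert list with embedding similarity. The prompt wording, the `chat` stand-in for an LLM API call, and the scoring encoder are illustrative assumptions.

```python
# Sketch of zero-shot prerequisite prediction and semantic-similarity scoring.
from sentence_transformers import SentenceTransformer, util

PROMPT = ("Skill: {name}\nDescription: {desc}\n"
          "List the prerequisite skills a learner should master first, "
          "as a comma-separated list.")

def score_prediction(predicted, gold, encoder):
    """Mean over gold prerequisites of max cosine similarity to a prediction."""
    P = encoder.encode(predicted, convert_to_tensor=True)
    G = encoder.encode(gold, convert_to_tensor=True)
    return float(util.cos_sim(G, P).max(dim=1).values.mean())

encoder = SentenceTransformer("all-MiniLM-L6-v2")
# predicted = chat(PROMPT.format(name=skill, desc=desc)).split(",")  # any LLM API
predicted = ["linear algebra", "basic statistics"]    # example model output
gold = ["probability theory", "linear algebra"]       # expert-defined (ESCO-style)
print(score_prediction(predicted, gold, encoder))
```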

[IR-3] RecPS: Privacy Risk Scoring for Recommender Systems

链接: https://arxiv.org/abs/2507.18365
作者: Jiajie He,Yuechun Gu,Keke Chen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user-item interaction data. While privacy-enhancing techniques are actively studied in the research community, the real-world model development still depends on minimal privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, there is no method allowing the user to know which interactions are more sensitive than others. Thus, quantifying the privacy risk of RecSys training data is a critical step to enabling privacy-aware RecSys model development and deployment. We propose a membership-inference attack (MIA)-based privacy scoring method, RecPS, to measure privacy risks at both the interaction and user levels. The RecPS interaction-level score definition is motivated and derived from differential privacy, which is then extended to the user-level scoring method. A critical component is the interaction-level MIA method RecLiRA, which gives high-quality membership estimation. We have conducted extensive experiments on well-known benchmark datasets and RecSys models to show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning. Our code is available at this https URL.

[IR-4] Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work

链接: https://arxiv.org/abs/2507.17991
作者: Peter Eckmann,Adrian Barnett,Alexandra Bannach-Brown,Elisa Pilar Bascunan Atria,Guillaume Cabanac,Louise Delwen Owen Franzen,Małgorzata Anna Gazda,Kaitlyn Hair,James Howison,Halil Kilicoglu,Cyril Labbe,Sarah McCann,Vladislav Nachev,Martijn Roelandse,Maia Salholz-Hillel,Robert Schulz,Gerben ter Riet,Colby Vorland,Anita Bandrowski,Tracey Weissgerber
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The causes of the reproducibility crisis include lack of standardization and transparency in scientific reporting. Checklists such as ARRIVE and CONSORT seek to improve transparency, but they are not always followed by authors and peer review often fails to identify missing items. To address these issues, there are several automated tools that have been designed to check different rigor criteria. We have conducted a broad comparison of 11 automated tools across 9 different rigor criteria from the ScreenIT group. We found some criteria, including detecting open data, where the combination of tools showed a clear winner, a tool which performed much better than other tools. In other cases, including detection of inclusion and exclusion criteria, the combination of tools exceeded the performance of any one tool. We also identified key areas where tool developers should focus their effort to make their tool maximally useful. We conclude with a set of insights and recommendations for stakeholders in the development of rigor and transparency detection tools. The code and data for the study is available at this https URL.

[IR-5] Failure Prediction in Conversational Recommendation Systems

链接: https://arxiv.org/abs/2507.17976
作者: Maria Vlachou
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In a Conversational Image Recommendation task, users can provide natural language feedback on a recommended image item, which leads to an improved recommendation in the next turn. While typical instantiations of this task assume that the user’s target item will (eventually) be returned, this might often not be true; for example, the item the user seeks may not be within the item catalogue. Failing to return a user’s desired item can lead to user frustration, as the user needs to interact with the system for an increased number of turns. To mitigate this issue, in this paper, we introduce the task of Supervised Conversational Performance Prediction, inspired by Query Performance Prediction (QPP) for predicting effectiveness in response to a search engine query. In this regard, we propose predictors for conversational performance that detect conversation failures using multi-turn semantic information contained in the embedded representations of retrieved image items. Specifically, our AutoEncoder-based predictor learns a compressed representation of top-retrieved items of the train turns and uses the classification labels to predict the evaluation turn. Our evaluation addressed two recommendation scenarios, differentiating between system failure, where the system is unable to find the target, and catalogue failure, where the target does not exist in the item catalogue. In our experiments using the Shoes and FashionIQ Dresses datasets, we measure the accuracy of predictors for both system and catalogue failures. Our results demonstrate the promise of our proposed predictors for predicting system failures (existing evaluation scenario), while we detect a considerable decrease in predictive performance in the case of catalogue failure prediction (when inducing a missing item scenario) compared to system failures.

附件下载

点击下载今日全部论文列表